A simplex dispersion model for improving precision in the odds ratio confidence interval in mixture experiments

A new approach to data analysis in mixture experiments is proposed using simplex regression, which belongs to the class of dispersion models. The advantages of this approach are illustrated in an experiment studying the mixture effect of fat, carbohydrate, and fiber on the proportion of tumors in mammary glands of rats. Models were evaluated by goodness of fit criteria, simulated envelope charts for residuals of the fitted models, and plots of odds ratios with their respective confidence intervals. The simplex regression model showed better quality of fit and narrower odds ratio confidence intervals.


Introduction
A mixture experiment consists of optimizing a response variable (y) subject to the constraint in Equation 1:
$$\sum_{i=1}^{q} x_i = 1 \qquad (1)$$
where $x_i$ ($x_i \ge 0$) is the proportion of the i-th component ($i = 1, \dots, q$) and q is the number of components (Dal Bello & Vieira, 2011). Sometimes the $x_i$ are referred to as compositional data (Pawlowsky-Glahn, Egozcue, & Tolosana-Delgado, 2015), but often this term refers to the response y. We will restrict our discussion to design variables. In this case, E[y] is a function of the x's, which are explanatory variables in a regression approach. The space spanned by the design variables takes the form of a regular simplex of dimension (q-1). In the q = 3 case, the simplex is a triangular region. Additional restrictions (for economic, physical or practical reasons) are sometimes imposed on individual components, $0 \le L_i \le x_i \le U_i \le 1$, where $U_i$ and $L_i$ are, respectively, the upper and lower limits for $x_i$. In this sense, a restricted region within the simplex defined by Equation 1 arises.
Statistical modeling is usually done using polynomial models assuming normality for the response variable (Leão, Vieira, & Dal Bello, 2015). If the response variable follows another known distribution, one can use generalized linear (mixed) models. In particular, when the response variable is binary or binomial, the binomial (logistic) regression model has been widely used, but this model does not accommodate under- or over-dispersion, which often occurs in grouped data. For such situations, Zhang and Qiu (2014) proposed the use of a simplex regression model, which belongs to the family of dispersion models and can account for under- or over-dispersion relative to the binomial distribution.
When the response variable is binary or binomial, odds ratios are of great practical interest. Conventional methods of analysis and interpretation of the parameters of the mixture model are not suitable, since the restriction in Equation 1 implies complex interactions in the mixture (Akay, 2007). The analysis of the effects of mixture components can use Cox directions, a concept that allows one to obtain precision measures and confidence intervals for the odds ratios in mixture experiments affected by collinearity of main effects.
Designs for mixture experiments are highly affected by collinearity, which is caused by the constraint in Equation 1. Several alternatives have been proposed in the literature to overcome this problem, such as the use of pseudo-components, inverse terms or ratio variables (Akay & Tez, 2011).
In this paper we used a simplex regression model, instead of logistic regression, to evaluate a mixture experiment. The advantages of our approach are illustrated in an actual experiment, which studied the effect of different diets consisting of fat, carbohydrate and fiber on tumor expression in mammary glands of female rats. Odds ratios and their respective confidence intervals for the effect of diets on the promotion of tumors in rats were evaluated. Goodness of fit criteria were computed to compare results.

Experiment description
Data from Akay and Tez (2011) were used. The authors present a mixture experiment to study the effects of diets (levels of fat, carbohydrate and fiber) on the expression of mammary gland tumors induced by dimethylbenzanthracene (DMBA) in female rats. The experiment spanned 26 weeks. Figure 1 contains the number of tumor responses observed in nine diet groups (with 30 rats per group) with different caloric proportions of fat ($x_1$), carbohydrate ($x_2$) and fiber ($x_3$).

Regression models applied to mixture experiments
The purpose of the experiment is to model the response as a function of the mixture components $x_1, \dots, x_q$; in this case, to model tumor rate (y) as a function of diet (x's). The functional form of the response is not known, but first and second order polynomial approximation models are widely used (McCulloch, Searle, & Neuhaus, 2009).
Common mixture models are presented in Table 1. The models in Equations 2 and 3 are Scheffé's canonical polynomials of first and second degree, respectively (Cruz-Salgado, 2016). However, when the response variable shows extreme responses to one (or more) components in the formula, this limits the usefulness of the simplex. In these situations, Scheffé's models do not accommodate possible curvilinear effects arising from the extreme response behavior (Brown, Donev, & Bissett, 2015). To solve this problem, inverse terms can be included, producing a better fit; however, this brings a nonlinear impact in Equation 1. Other approaches in the literature have been successfully attempted, such as the inclusion of ratio variables, as in the models of Equations 4 and 5 of Table 1, where corresponds to the mixture component that causes the border effect (Akay & Tez, 2011).
Acta Scientiarum. Technology, v. 42, e44068, 2020
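For reference, Scheffé's canonical polynomials have the standard form sketched below for a linear predictor $\eta$. The symbols $\eta$, $\beta_i$ and $\beta_{ij}$ are notation assumed here for illustration and may differ from the notation used in Table 1; the ratio-variable models of Equations 4 and 5 are not reproduced.

```latex
\begin{align*}
\text{first degree:}\quad  \eta &= \sum_{i=1}^{q} \beta_i x_i \\
\text{second degree:}\quad \eta &= \sum_{i=1}^{q} \beta_i x_i
                                   + \sum_{i<j}^{q} \beta_{ij}\, x_i x_j
\end{align*}
```

Neither form carries an intercept or pure quadratic terms, because the constraint $\sum_i x_i = 1$ makes them redundant.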

Analysing mixture experiments using dispersion models
Let $y_1, \dots, y_n$ be n independent realizations from a binomial distribution with parameters $m_i$ and $\pi_i$. Assume that the transformed value of the probability of response for the i-th observation is related to a linear combination of the q mixture components, that is, $g(\pi_i) = \eta_i$, where $g(\cdot)$ is the link function and $\eta_i$ is any of the models in Table 1. We adopt the logistic transformation as $g$. In this case, we have $\operatorname{logit}(\pi_i) = \log\{\pi_i/(1-\pi_i)\} = \eta_i$, which results in the logistic regression model of Equation 6:
$$\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \qquad (6)$$
The usual method to estimate the coefficients $\beta$ of $\eta_i$ is maximum likelihood. Logistic regression can be used in situations where the response is a Bernoulli event or arises as the proportion of events Y = 1 in $m_i$ trials, and the binomial distribution belongs to the exponential family (Hosmer Jr., Lemeshow, & Sturdivant, 2013).
A distribution that can be used to study a continuous response variable restricted to the (0,1) interval is the simplex distribution (Zhang and Qiu, 2014). The simplex distribution is included among dispersion models, which extend the generalized linear models (Barndorff-Nielsen & Jørgensen, 1991; Jørgensen, 1997b; López, 2013; Quintero & Contreras-Reyes, 2018). A random variable y following a simplex distribution with mean $\mu \in (0,1)$ and dispersion parameter $\sigma^2 > 0$ has the density function given by Equation 7:
$$f(y; \mu, \sigma^2) = \left[2\pi\sigma^2\{y(1-y)\}^3\right]^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}\, d(y;\mu)\right\}, \quad 0 < y < 1 \qquad (7)$$
where $d(y;\mu) = \dfrac{(y-\mu)^2}{y(1-y)\mu^2(1-\mu)^2}$. The distribution of y is denoted by $S^{-}(\mu, \sigma^2)$. For a random sample $y_1, \dots, y_n$, with each $y_i \sim S^{-}(\mu_i, \sigma^2)$, $i = 1, \dots, n$, the simplex regression model is defined by the density in Equation 7 and the means modeled by $g(\mu_i) = \eta_i$, where $g$ and $\eta_i$ are as previously defined. One desirable feature of simplex regression is its ability to fit heteroscedastic variances. Unlike beta regression model, expected variances are not a function of . The extra dispersion parameter provides greater flexibility for modeling (Jørgensen, 1997a; Zhang and Qiu, 2014).
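As a numerical sanity check on the simplex density, the following minimal sketch evaluates the standard form of the $S^{-}(\mu, \sigma^2)$ density and integrates it over (0,1) with a midpoint rule. The function names and the values $\mu = 0.7$, $\sigma^2 = 1$ are arbitrary choices for illustration, not taken from the experiment.

```python
import math


def unit_deviance(y, mu):
    """Simplex unit deviance d(y; mu) for y and mu in (0, 1)."""
    return (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)


def simplex_pdf(y, mu, sigma2):
    """Density of the simplex distribution S^-(mu, sigma2) at y in (0, 1)."""
    norm = math.sqrt(2.0 * math.pi * sigma2 * (y * (1.0 - y)) ** 3)
    return math.exp(-unit_deviance(y, mu) / (2.0 * sigma2)) / norm


def midpoint_integral(f, n=20000):
    """Midpoint rule on (0, 1); the endpoints themselves are never evaluated."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))


# Arbitrary illustrative parameter values (assumptions, not from the paper)
mu, sigma2 = 0.7, 1.0
total = midpoint_integral(lambda y: simplex_pdf(y, mu, sigma2))
mean = midpoint_integral(lambda y: y * simplex_pdf(y, mu, sigma2))
print(total, mean)
```

The total mass should be close to 1 and the numerical mean close to the chosen $\mu$, which is what makes $\mu$ directly usable as the regression target.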
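The procedure for estimating the parameters of the simplex regression model is similar to that of logistic regression, with the difference that the additional dispersion parameter $\sigma^2$ must also be estimated. The log-likelihood function for n independent observations is given by Equation 8:
$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{3}{2}\sum_{i=1}^{n}\log\{y_i(1-y_i)\} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{(y_i-\mu_i)^2}{y_i(1-y_i)\mu_i^2(1-\mu_i)^2} \qquad (8)$$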
The maximum likelihood estimators of the parameters $\beta$ and $\sigma^2$ are obtained by solving the score equations set equal to zero. For $\sigma^2$, the solution has a closed form. The estimation of $\beta$ requires numerical maximization; standard methods such as Newton-Raphson or Fisher scoring and their variations can be used (McCulloch et al., 2009).
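The closed-form step can be sketched as follows, assuming the standard simplex log-likelihood: for fixed fitted means, the maximizing dispersion is the average unit deviance. The data and fitted means below are hypothetical, the function names are illustrative, and the numerical search over the regression coefficients is not shown.

```python
import math


def unit_deviance(y, mu):
    """Simplex unit deviance d(y; mu) for y and mu in (0, 1)."""
    return (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)


def simplex_loglik(y, mu, sigma2):
    """Log-likelihood of independent simplex observations with means mu."""
    n = len(y)
    deviance = sum(unit_deviance(yi, mi) for yi, mi in zip(y, mu))
    log_term = sum(math.log(yi * (1.0 - yi)) for yi in y)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - 1.5 * log_term
            - deviance / (2.0 * sigma2))


def sigma2_hat(y, mu):
    """Closed-form ML estimate of the dispersion for given fitted means."""
    return sum(unit_deviance(yi, mi) for yi, mi in zip(y, mu)) / len(y)


# Hypothetical observed proportions and fitted means (not the paper's data)
y = [0.60, 0.72, 0.81, 0.55, 0.90]
mu = [0.62, 0.70, 0.78, 0.58, 0.88]

s2 = sigma2_hat(y, mu)
best = simplex_loglik(y, mu, s2)
print(s2, best)
```

In a full fit, an optimizer would update the coefficients, recompute the means through the inverse link, and plug the resulting average deviance back in as the dispersion estimate.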

Selecting models and goodness of fit criterion
The Akaike information criterion (AIC) is a relative measure of goodness of fit, defined by $\mathrm{AIC} = -2\log L(\hat{\theta}) + 2p$, where $\log L(\hat{\theta})$ is the natural (Napierian) logarithm of the model likelihood function evaluated at the point estimates and p is the number of model parameters. Alternatively, a slight modification of AIC, known as the Bayesian information criterion (BIC), weights the number of parameters by $\log n$, where n is the sample size (Menezes, Liska, Cirillo, & Vivanco, 2017). Bozdogan (2010) proposed the information complexity index (ICOMP), which uses the Fisher information matrix to evaluate model complexity, as this accounts for the correlation of the parameter estimates (Silhavy, Senkerik, Oplatkova, Prokopova, & Silhavy, 2017). The ICOMP is defined as in Equation 9:
$$\mathrm{ICOMP} = -2\log L(\hat{\theta}) + 2\,C_1(\hat{F}^{-1}) \qquad (9)$$
where $C_1(\hat{F}^{-1}) = \dfrac{s}{2}\log\left[\dfrac{\operatorname{tr}(\hat{F}^{-1})}{s}\right] - \dfrac{1}{2}\log\left|\hat{F}^{-1}\right|$, $\hat{\theta}$ are the estimated parameters, $\hat{F}^{-1}$ is the inverse of the Fisher information matrix, and $s = \operatorname{rank}(\hat{F}^{-1})$. The diagonal elements of $\hat{F}^{-1}$ are the estimated variances of the model parameters and the off-diagonal elements are their covariances. This is a measure of collinearity between columns of and the degree of independence of parameters (or their estimates). According to this criterion, the best model within a set of models is the one that minimizes the ICOMP (Bozdogan, 2010).
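The three criteria can be sketched in a few lines, assuming the usual definitions of AIC and BIC and the $C_1$ complexity form of ICOMP applied to a full-rank estimated covariance matrix. The log-likelihood value and the covariance matrices below are hypothetical, chosen only to show how stronger covariance between estimates raises ICOMP.

```python
import math


def aic(loglik, p):
    """Akaike information criterion for p parameters."""
    return -2.0 * loglik + 2.0 * p


def bic(loglik, p, n):
    """Bayesian information criterion for p parameters and sample size n."""
    return -2.0 * loglik + p * math.log(n)


def log_det_spd(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(k))


def icomp(loglik, cov):
    """ICOMP with the C1 complexity of cov (inverse Fisher information).

    Assumes cov is full rank, so the rank s equals its dimension.
    """
    s = len(cov)
    trace = sum(cov[i][i] for i in range(s))
    c1 = 0.5 * s * math.log(trace / s) - 0.5 * log_det_spd(cov)
    return -2.0 * loglik + 2.0 * c1


# Hypothetical values for illustration only
loglik = -20.0
uncorrelated = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.9], [0.9, 1.0]]
print(aic(loglik, 3), bic(loglik, 3, 9))
print(icomp(loglik, uncorrelated), icomp(loglik, correlated))
```

With equal variances and no covariance the complexity term vanishes, so ICOMP reduces to $-2\log L$; introducing covariance between the estimates adds a positive penalty.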
Goodness of fit can also be checked by a normal probability plot of the residuals, which gives a graphical indication of whether distributional assumptions are violated. This plot, also called a simulated envelope chart, contains confidence bands obtained by resampling, and the fit is judged suitable if all model residuals (or at least most of them) are contained within these bands (Moral, Hinde, & Demétrio, 2017).

Interpreting parameters and measuring the effect of the components in mixture experiments
For mixture experiments with the logistic transformation, the model coefficient estimates cannot be directly interpreted as odds ratios, as the restrictions limit interpretation and unexpected interactions may be present. In other words, if the proportion of component $x_i$ increases, then the proportions of the other components must decrease, but their ratios to one another remain constant. To better understand this concept, we should use the Cox direction, as explained by Cornell (1998).
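The idea of moving one component while keeping the ratios among the others fixed can be sketched as below. The function name is hypothetical, the reference point is the centroid of the constrained region quoted later in the text, the component order (fat, carbohydrate, fiber) is assumed, and the sketch does not check the lower and upper bounds on individual components.

```python
def cox_direction_point(reference, i, new_value):
    """Move component i of `reference` to `new_value` along the Cox direction.

    The remaining components are rescaled by a common factor, so the
    mixture still sums to 1 and their ratios to one another are unchanged.
    """
    if not 0.0 <= new_value <= 1.0:
        raise ValueError("new_value must be a proportion in [0, 1]")
    if reference[i] >= 1.0:
        raise ValueError("reference has no other components to rescale")
    scale = (1.0 - new_value) / (1.0 - reference[i])
    return [new_value if j == i else scale * s
            for j, s in enumerate(reference)]


# Reference point: centroid of the constrained region quoted in the text;
# the order (fat, carbohydrate, fiber) is an assumption
reference = [0.332, 0.466, 0.202]

# Trace of mixtures obtained by varying the first component
trace = [cox_direction_point(reference, 0, v) for v in (0.2, 0.4, 0.6)]
for point in trace:
    print([round(p, 4) for p in point])
```

Evaluating a fitted model at each point of such a trace gives the response trace plot for that component.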

Cox direction for trace response plot
In the case of a restricted experimental region to be a regular simplex, an alternative representation of the Cox direction may be formulated, considering the fact that . In

Odds ratio plots for mixture components
Odds ratios are used for easy interpretation of the coefficient estimates in logistic regression (Chen, Cohen, & Chen, 2010). For mixture experiments, techniques based on response trace plots can be used for such comparisons. Considering any point in the experimental region taken as a control group, the odds ratio along the $x_i$ axis is given by Equation 13.

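$$\mathrm{OR} = \frac{\text{Odds}}{\text{Odds}_{\text{control}}} \qquad (13)$$
In the simplest way, under the logit link, $\mathrm{OR} = \exp(\eta - \eta_c)$, with $\eta$ the linear predictor at the point of interest and $\eta_c$ the linear predictor at the control point.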
The precision of the odds ratio can be determined by its confidence interval, whose width reflects the variability of the estimate (Hosmer Jr. et al., 2013). Using methods for calculating the variance of a sum, we can obtain the estimated variance of the logarithm of the odds ratio. The confidence interval is given by Equation 14:
$$\log(\widehat{\mathrm{OR}}) \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}\left[\log(\widehat{\mathrm{OR}})\right] \qquad (14)$$
where $\widehat{\mathrm{SE}}[\log(\widehat{\mathrm{OR}})]$ is the standard error of the natural logarithm of the odds ratio and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th standard normal quantile, with $\alpha$ the significance level. Lower and upper limits for the odds ratio are obtained by exponentiating the limits in Equation 14. Narrow confidence intervals are also a criterion for selecting better models or estimation methods. Thus, plotting confidence regions for odds ratios can also be used to compare models.
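A minimal sketch of this interval, assuming a logit link and a linear predictor that is linear in the model terms: the log odds ratio is the coefficient vector applied to the difference of the model-term vectors, and its variance is the corresponding quadratic form in the estimated covariance matrix. The coefficients and covariance matrix below are hypothetical, the function name is illustrative, and a first-degree model in the three proportions is assumed for simplicity.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta, cov, terms_point, terms_control, level=0.95):
    """Odds ratio of a point versus a control, with a Wald interval.

    beta: estimated coefficients; cov: their estimated covariance matrix;
    terms_point / terms_control: model-term vectors at the two mixtures.
    """
    diff = [a - c for a, c in zip(terms_point, terms_control)]
    log_or = sum(b * d for b, d in zip(beta, diff))
    k = len(diff)
    # Variance of a linear combination: diff' * cov * diff
    var = sum(diff[i] * cov[i][j] * diff[j]
              for i in range(k) for j in range(k))
    se = math.sqrt(var)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


# Hypothetical estimates for a first-degree model (not the paper's values)
beta = [2.0, 0.5, -3.0]
cov = [[0.5, -0.1, 0.0],
       [-0.1, 0.4, 0.05],
       [0.0, 0.05, 0.9]]
point = [0.7, 0.275, 0.025]      # a reference point quoted in the text
control = [0.332, 0.466, 0.202]  # illustrative choice of control point

odds_ratio, lower, upper = odds_ratio_ci(beta, cov, point, control)
print(odds_ratio, lower, upper)
```

For models with ratio or cross-product terms, the same function applies once the term vectors are built from the proportions; smaller covariances between estimates shrink the quadratic form and hence the interval.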

Model selection
The logistic regression results described in Table 2 are identical to those of Akay and Tez (2011), who presented the model with ratio variables as a better alternative than the logistic regression model with pseudo-component variables in the polynomial models of Scheffé and Becker. Compared to the results obtained by these authors, the results in Table 2 indicate that the simplex regression model performs better. For models with ratio variables, the lowest values of the ICOMP, AIC and BIC criteria were obtained. Therefore, the simplex regression model with ratio variables was the best of the models used, and it provided lower standard errors of the parameter estimates, indicating more precision. The parameter estimate for the component is larger than the other parameter estimates. Menard (2010) warns that such a discrepancy between the estimates of the model parameters is due to collinearity. This can be seen by studying the covariance matrices of models M1 and M3. The diagonal elements of the covariance matrices indicate more precise estimates for models M2 and M4.
The covariances between parameters in models M2 and M4 are smaller than those of models M1 and M3, as can be seen in the corresponding covariance matrices, which provides an explanation for the difference between the ICOMP values. As Bozdogan (2010) mentioned, systems with stronger covariance between their components tend to have higher values of ICOMP, whereas smaller covariances result in lower values of ICOMP. In addition, the M4 model carries more information about the parameters, since their variances are smaller than those of the other models. Some care is always needed when interpreting these models, as some coefficients refer to ratios of design variables (Equations 15 to 18).
Akay and Tez (2011) addressed the presence of under- or over-dispersion in pooled data, such as the data presented in Figure 1, and mentioned that this must be taken into account when selecting the model. The estimated dispersion parameters of the M1 and M2 models are and , respectively. In this case, it can be said that under-dispersion is present in model M1, which is therefore misspecified. When using the model with ratio variables (M2), the dispersion parameter estimate is close to one, which is the default value for the usual logistic regression model. Thus, it can be said that model M2 controlled the under-dispersion effect. In the case of the M3 and M4 models, the simplex regression model naturally models the dispersion, and the estimates are given by and . Given the above, it can be concluded that the simplex regression model showed better fit quality indicators for the proportion of mammary tumors in female rats (Table 2). For comparison purposes, the M2 and M4 models are discussed, since M2 was the best among those proposed by Akay and Tez (2011) and M4 was considered the best among the simplex regression models.
Thus, the normal probability plots of the deviance component residuals for the M2 and M4 models support the claim that the assumption of a binomial (Figure 2a) and/or simplex (Figure 2b) distribution for the analyzed response is adequate and that the fit of the models was satisfactory (Figure 2).

Models discussion about the mixture component effect plots
In what follows we describe the graphical interpretation of the M2 and M4 models. For the trace plots, the reference point in the Cox direction was taken to be the centroid , as can be seen in Figure 1.
In Figure 3a (M2 model), the $x_1$ (fat) and $x_2$ (carbohydrate) components have opposite effects on the response. As the proportion of fat increases, the expected tumor incidence increases. On the other hand, as the proportion of carbohydrate increases, the expected tumor incidence decreases. The $x_3$ (fiber) component has more effect on the response than the other components, since successive increments of dietary fiber lead to a larger expected decrease in tumors. The same conclusions can be made for the M4 model. Figures 4 and 5 present the odds ratios for different reference points in relation to the control group for each evaluated model. To this end, we considered three different reference points: (0.7, 0.275, 0.025), (0.275, 0.7, 0.025) and (0.332, 0.466, 0.202). The first two reference points are contained in the region where the sample points lie. The third reference point is the centroid of the constrained experimental region (Figure 1). The control group was given by the centroid of sampling points . This work does not address biological reasons for the choice of these points; the choice was made strictly by inspection of the experimental region and applicability in mixture experiments.
The odds of mammary tumor occurrence increase with larger values of the $x_1$ component. The respective 95% confidence interval for the M2 model contains the value 1, the reference value for comparing odds ratios, for amounts from approximately 0.4 to 0.6. Therefore, although the odds increase for amounts of $x_1$ from approximately 0.4 to 0.6, the increase is not significant, in the sense that the $x_1$ (fat) component does not significantly influence the occurrence of mammary tumors in rats in the population (Figure 4A).
For the point estimates, the same conclusions apply to the $x_1$ component in the M4 model. However, we note that the 95% confidence interval for the odds ratio does not contain the value 1 for some values of $x_1$ (Figure 5A), indicating that this component significantly influences the occurrence of mammary tumors over a different range of amounts than that indicated by model M2 (Figure 5). This is evidenced by the width of the confidence interval for the odds ratio of this component, which is narrower in the M4 model than in the M2 model. Therefore, the M4 model provides more precise estimates for the $x_1$ component than model M2 (Figure 5).
The M4 model provided more precise estimates of the odds ratio than the M2 model at all adopted reference points (Figure 4A-C). This can be explained by inspection of the covariances between the parameters of the models evaluated. Therefore, we conclude that, treating the proportion of mammary tumor incidence in rats as a random variable with a simplex distribution, the use of ratio variables to study the relationship between the fat, carbohydrate, and fiber mixture components is a viable alternative to the model proposed by Akay and Tez (2011).
Another piece of information that can be provided by the model is the particular mixture that provides the maximum (or minimum) tumor incidence, respecting the constraints on each component. The minimum expected tumor incidence in the model that showed the best fit quality indicators (M4) is 55.08%, and the mixture providing this value is formulated as 13.36% fat, 86.34% carbohydrate and 0.30% fiber (Table 3). The major difference between the mixtures that maximize or minimize the response was achieved by varying the proportions of fat and carbohydrate in the mixture.

Conclusion
The simplex regression model showed good fit in the analysis of a mixture experiment that evaluated the incidence of mammary tumors in female rats, being a viable option for the analysis of situations where the outcome is limited to the (0,1) interval. The use of this model also accounts for the under- or over-dispersion present in grouped data.
Confidence intervals for the odds ratio were severely affected by the choice of reference points. The simplex regression model provided more precise estimates for the odds ratio (narrower confidence limits). The model gave more stable estimates for odds ratios at different reference points in the experimental region, compensating for the border effect.