Modeling asymmetric compositional data

Compositional data belong to the simplex sample space, but they can be mapped to the real space with the additive log-ratio (ALR) transformation so that standard statistical techniques can be applied. This study aims to model skewed compositional data on three soil components after the additive log-ratio transformation. The compositional data of sand, silt and clay (simplex) were transformed into bivariate data (real space) and modeled using skew-normal theory, with and without soil porosity as a covariate. The analyses were run in the R statistical software with the package sn, and the goodness-of-fit improved when the covariate was included.


Introduction
Compositional data are vectors whose elements are proportions that sum to 1. Their natural sample space is the unit simplex, with dimension equal to the number of elements, which is a restricted part of the real space. Such data can be modeled either on the simplex or in the real space. Aitchison (1986) and Pawlowsky-Glahn and Olea (2004) described the additive log-ratio (ALR) and the centered log-ratio (CLR) transformations, which allow the analysis to be carried out in the real sample space but require the results to be transformed back to the original sample space. The current study uses the real space as the sample space, which makes it possible to apply the theory developed by Martins et al. (2009), who applied a multivariate normal distribution to compositional data. Sometimes, however, the normal distribution does not provide the best fit to the data. In these cases, the skew-normal distribution, formally introduced by Azzalini (1985) after previous studies, has proven useful and has been applied in several areas of research and development. Gupta et al. (2004) surveyed the work of different authors, such as Branco and Dey (2001), Loperfido (2001), Azzalini and Capitanio (1999), Gupta and Chen (2001) and Gupta and Brown (2001), and reported two characterizations of this distribution based on quadratic statistics. For compositional data, Aitchison et al. (2003) characterized the shape of the additive logistic normal distribution and its tests. Mateu-Figueras et al. (2005) proposed the additive logistic skew-normal distribution to analyze compositional data and highlighted the analytical difficulties in obtaining the estimates. Mateu-Figueras and Pawlowsky-Glahn (2007) used a measure, analogous to the Lebesgue measure, that is compatible with the algebraic and geometric structure of the simplex to define the skew-normal distribution in that sample space.
The objective of this study was to fit a bivariate skew-normal model to compositional data after the ALR transformation. The model was applied to a data set of 82 compositions of sand, silt and clay.

Material and methods
In the current analysis, we used a data set collected from an area irrigated by a central pivot at the Areão Farm, on the campus of the Escola Superior de Agricultura "Luiz de Queiroz" (ESALQ-USP). A quadrant was delimited in the highest part of the hill (at the top of the slope), from which 82 soil samples were collected at a depth of 0 to 0.20 m on a regular square grid with sampling points every 20 m. The sand, silt and clay contents were determined for every soil sample.

Compositional data
A composition is a vector $x = (x_1, \dots, x_B)$ with B components representing proportions. The sample space is the simplex given by
$$S^B = \left\{ x = (x_1, \dots, x_B) : x_i > 0,\ i = 1, \dots, B;\ \sum_{i=1}^{B} x_i = 1 \right\}.$$
However, any vector with positive components measured on the same scale can be turned into a composition by dividing each component by the sum of all of them.
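The rescaling described above can be illustrated with a minimal sketch. The function name `closure` and the sand, silt and clay values are hypothetical and are not taken from the data set of this study.

```python
def closure(parts):
    """Rescale strictly positive parts, measured on a common scale, to sum to 1."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical sand, silt and clay contents in g/kg (illustrative values only).
composition = closure([250.0, 150.0, 600.0])
print(composition)
```

After this step the vector lies in the simplex, so it can be plotted in a ternary diagram or passed to a log-ratio transformation.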
A sample with three components can be displayed in a ternary diagram, drawn on an equilateral triangle in which each vertex represents one component. Lemos and Santos (1996 apud REICHARDT; TIMM, 2004) describe soil classification using a diagram divided into textural classes, so that the position of a sample in the diagram determines its class, ranging from Sandy to Very Clay.
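To avoid applying standard statistical techniques that produce inconsistent results because of the intrinsic correlation among the components, Aitchison (1986) proposed, among others, the additive log-ratio transformation, which enables the analysis in the real space, and its inverse, the generalized additive logistic transformation:
$$y_i = \ln\left(\frac{x_i}{x_B}\right), \qquad x_i = \frac{\exp(y_i)}{1 + \sum_{j=1}^{B-1} \exp(y_j)}, \quad i = 1, \dots, B-1, \qquad x_B = \frac{1}{1 + \sum_{j=1}^{B-1} \exp(y_j)}. \quad (1)$$

Skew-normal distribution
Azzalini (1985) defined a random variable Z as having a standard skew-normal distribution if its density function is given by
$$f(z; \alpha) = 2\,\phi(z)\,\Phi(\alpha z), \quad z \in \mathbb{R}, \quad (2)$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the density function and the distribution function, respectively, of the standard normal distribution with zero mean and variance 1, and $\alpha$ is a parameter that controls the asymmetry of the distribution and ranges over $(-\infty, \infty)$. Based on Azzalini (2005), if Z has a skew-normal distribution, denoted by $Z \sim SN(\alpha)$, its expected value and variance are given by
$$E(Z) = \mu_Z = \sqrt{\frac{2}{\pi}}\,\delta, \qquad Var(Z) = 1 - \mu_Z^2, \quad (3)$$
with $\delta = \alpha / \sqrt{1 + \alpha^2}$. The asymmetry coefficient of the skew-normal distribution, based on Bayes and Branco (2007), is
$$\gamma_1 = \frac{\sqrt{2}\,(4 - \pi)\,\alpha^3}{\left[\pi + (\pi - 2)\,\alpha^2\right]^{3/2}},$$
which characterizes the direction and the extent of the departure of the distribution from symmetry.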
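As a numerical check of the standard skew-normal density and its moments, the following sketch evaluates $2\phi(z)\Phi(\alpha z)$ and compares the closed-form mean and variance with values obtained by numerical integration. The function names, the integration limits and the value $\alpha = 3$ are assumptions made only for illustration.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sn_pdf(z, alpha):
    """Standard skew-normal density: 2 * phi(z) * Phi(alpha * z)."""
    return 2.0 * norm_pdf(z) * norm_cdf(alpha * z)


def sn_moments(alpha):
    """Closed-form mean, variance and asymmetry coefficient of SN(alpha)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mean = math.sqrt(2.0 / math.pi) * delta
    var = 1.0 - mean * mean
    skew = 0.5 * (4.0 - math.pi) * mean ** 3 / var ** 1.5
    return mean, var, skew


def integrate(f, lo=-12.0, hi=12.0, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


alpha = 3.0  # illustrative value, not an estimate from the paper
mean, var, skew = sn_moments(alpha)
num_mean = integrate(lambda z: z * sn_pdf(z, alpha))
num_var = integrate(lambda z: (z - num_mean) ** 2 * sn_pdf(z, alpha))
print(mean, num_mean, var, num_var, skew)
```

Setting `alpha = 0.0` recovers the standard normal case, with zero mean, unit variance and zero asymmetry.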
After Azzalini and Dalla Valle (1996), Azzalini and Capitanio (1999) described the extension of Equation (2) to the multivariate case. The multivariate skew-normal density function of a standardized random vector Z of dimension k is
$$f(z) = 2\,\phi_k(z; \Omega_Z)\,\Phi(\alpha^{\top} z), \quad z \in \mathbb{R}^k,$$
where $\phi_k(z; \Omega_Z)$ is the density function of the k-variate normal distribution with mean vector equal to zero and correlation matrix $\Omega_Z$, $\Phi(\cdot)$ is the distribution function of the univariate N(0,1), and $\alpha$ is a vector of dimension k.
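Specifically for two dimensions, Azzalini and Dalla Valle (1996) described the bivariate skew-normal density function of $Z = (Z_1, Z_2)^{\top}$ as
$$f(z_1, z_2) = 2\,\phi_2(z_1, z_2; \Omega_Z)\,\Phi(\alpha_1 z_1 + \alpha_2 z_2),$$
where $\Omega_Z$ is the $2 \times 2$ correlation matrix and $\alpha = (\alpha_1, \alpha_2)^{\top}$. After introducing the location parameter $\xi$ and the scale parameter $\omega > 0$ in Equation (2), we have the density function of a random variable $Y = \xi + \omega Z$ as
$$f(y; \xi, \omega, \alpha) = \frac{2}{\omega}\,\phi\left(\frac{y - \xi}{\omega}\right)\Phi\left(\alpha\,\frac{y - \xi}{\omega}\right), \quad y \in \mathbb{R}.$$
In the multivariate case, from Azzalini (2005), the skew-normal density of a k-variate random vector Y, with n observations $(y_1, \dots, y_n)$, can be written as
$$f(y) = 2\,\phi_k(y - \xi; \Omega_Y)\,\Phi\left(\alpha^{\top} \omega^{-1} (y - \xi)\right),$$
in which $\Phi(\cdot)$ is defined as previously, and the k-variate vector $\alpha$ controls the shape of the distribution and determines the direction of maximum asymmetry. Thus, we denote $Y \sim SN_k(\xi, \Omega_Y, \alpha)$, where $\omega$ is the square root of $diag(\Omega_Y)$, which denotes the diagonal matrix formed by the diagonal of $\Omega_Y$. The vector $\delta$ can be written as
$$\delta = \frac{\Omega_Z\,\alpha}{\sqrt{1 + \alpha^{\top} \Omega_Z\,\alpha}},$$
where $\Omega_Z = \omega^{-1} \Omega_Y\,\omega^{-1}$ is the correlation matrix associated with $\Omega_Y$. Thus, the expected value and the variance of Y are
$$E(Y) = \xi + \omega\,\mu_Z, \qquad Var(Y) = \Omega_Y - \omega\,\mu_Z\,\mu_Z^{\top}\,\omega,$$
where $\mu_Z = \sqrt{2/\pi}\,\delta$ is given by Equation (3).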
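For the bivariate case used in this study, the density with location and scale can be sketched directly from the standard form $2\,\phi_2(y - \xi; \Omega_Y)\,\Phi(\alpha^{\top}\omega^{-1}(y - \xi))$ of Azzalini and Capitanio (1999). The function names and all parameter values below are hypothetical and chosen only for illustration; they are not the estimates reported in Table 1, and this is not the implementation in the package sn.

```python
import math


def norm_cdf(x):
    """Univariate standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bvn_pdf(d1, d2, s11, s12, s22):
    """Zero-mean bivariate normal density at (d1, d2) with covariance [[s11, s12], [s12, s22]]."""
    det = s11 * s22 - s12 * s12
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def bsn_pdf(y, xi, omega_mat, alpha):
    """Bivariate skew-normal density with location xi, scale matrix omega_mat and shape alpha."""
    (s11, s12), (_, s22) = omega_mat
    d1, d2 = y[0] - xi[0], y[1] - xi[1]
    # omega^{-1} (y - xi): each deviation divided by its own scale.
    lin = alpha[0] * d1 / math.sqrt(s11) + alpha[1] * d2 / math.sqrt(s22)
    return 2.0 * bvn_pdf(d1, d2, s11, s12, s22) * norm_cdf(lin)


# Hypothetical parameter values (illustrative only).
xi = (0.5, -0.2)
omega_mat = ((1.0, 0.3), (0.3, 0.5))
alpha = (2.0, -1.0)
value_at_xi = bsn_pdf(xi, xi, omega_mat, alpha)
print(value_at_xi)
```

With `alpha = (0.0, 0.0)` the skewing factor equals 1 everywhere and the function reduces to the bivariate normal density.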
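Based on Azzalini and Capitanio (1999), for n independent observations $(y_1, \dots, y_n)$ sampled from $SN_k(\xi, \Omega_Y, \alpha)$, the log-likelihood function is
$$\ell(\xi, \Omega_Y, \alpha) = n \ln 2 + \sum_{i=1}^{n} \ln \phi_k(y_i - \xi; \Omega_Y) + \sum_{i=1}^{n} \ln \Phi\left(\alpha^{\top} \omega^{-1} (y_i - \xi)\right),$$
where $\omega$ is the diagonal matrix with the square roots of the diagonal elements of $\Omega_Y$. In the case of covariates, the location of the i-th observation is $\xi_i = \beta^{\top} x_i$, where $\beta_{p \times k}$ is the parameter matrix and $X_{n \times p}$, with rows $x_i^{\top}$, is a matrix of covariates.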
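To test the significance of the asymmetry of the distribution, Azzalini and Capitanio (1999) suggested the log-likelihood ratio test to verify normality. The null hypothesis is $H_0: \alpha = 0$ and the test statistic is
$$LR = 2\left[\ell(\hat{\xi}, \hat{\Omega}, \hat{\alpha}) - \ell(\tilde{\xi}, \tilde{\Omega}, 0)\right],$$
where $(\hat{\xi}, \hat{\Omega}, \hat{\alpha})$ are the maximum likelihood estimates under the skew-normal model and $(\tilde{\xi}, \tilde{\Omega})$ are the maximum likelihood estimates of $(\xi, \Omega)$ under the assumption of normality. Under $H_0$, if the test statistic is higher than the critical value of the $\chi^2_p$ distribution, the hypothesis of data normality is rejected.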
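The test can be sketched in a few lines once the two maximized log-likelihoods are available. The sketch assumes 2 degrees of freedom, one for each component of $\alpha$ in the bivariate model, in which case the chi-square upper tail probability has the closed form $\exp(-x/2)$. The function name and the log-likelihood values are hypothetical and are not the results reported in this study.

```python
import math


def lrt_normality(loglik_skew_normal, loglik_normal):
    """Likelihood ratio test of H0: alpha = 0, assuming a chi-square reference with 2 df."""
    stat = 2.0 * (loglik_skew_normal - loglik_normal)
    # For 2 degrees of freedom, P(chi2 > x) = exp(-x / 2) for x >= 0.
    p_value = math.exp(-0.5 * stat) if stat > 0.0 else 1.0
    return stat, p_value


# Hypothetical maximized log-likelihoods (illustrative values only).
stat, p_value = lrt_normality(-100.0, -105.0)
critical_5pct = -2.0 * math.log(0.05)  # 5% critical value of the chi-square with 2 df
print(stat, p_value, stat > critical_5pct)
```

Normality is rejected at the 5% level when the statistic exceeds `critical_5pct`, which is equivalent to a p-value below 0.05.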
In the current analysis, we fit the bivariate skew-normal model to the response variables $Y_1$ and $Y_2$, and then we fit the model using the soil porosity as the covariate. The analyses were carried out using the R software (R DEVELOPMENT CORE TEAM, 2011) and adaptations of functions from the package sn (AZZALINI, 2011).
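The response variables are the ALR coordinates of each (sand, silt, clay) composition. A minimal sketch of the transformation and of its inverse is given below. It assumes that the last component is the divisor; the text does not state which component was used as the reference, and the composition shown is hypothetical.

```python
import math


def alr(x):
    """Additive log-ratio transform: log of each part over the last part."""
    *head, last = x
    return [math.log(part / last) for part in head]


def alr_inv(y):
    """Additive logistic transform, the inverse of alr."""
    expo = [math.exp(value) for value in y]
    denom = 1.0 + sum(expo)
    return [e / denom for e in expo] + [1.0 / denom]


# Hypothetical (sand, silt, clay) composition, with clay as the assumed divisor.
composition = [0.25, 0.15, 0.60]
y1, y2 = alr(composition)
back = alr_inv([y1, y2])
print(y1, y2, back)
```

The round trip returns the original composition, which is the property used to bring fitted values in the real space back to the simplex.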
We compared the models using the Akaike information criterion, AIC (AKAIKE, 1974), given by
$$AIC = -2 \ln L + 2\,n_{par}, \quad (5)$$
and its corrected version, AICc (BOZDOGAN, 1987), where $n_{par}$ is the number of parameters in the model and $L$ is the maximum value of the likelihood function for the estimated model. We verified the normality of the bivariate data using quantile-quantile (QQ) and probability-probability (PP) plots. The QQ plot compares each observed value of the variable with the value expected under the assumed distribution (normal or skew-normal). The PP plot, in turn, compares the empirical cumulative distribution of the variable with the theoretical cumulative distribution function. In both plots, points close to the reference line indicate that the data follow the distribution under study. We also showed the contours of the fitted bivariate density for both models. Thereafter, we applied Equation (1) to transform the bivariate data back into compositional data. Finally, we presented a texture classification diagram with the original compositions and the compositions estimated by the model.
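The model comparison can be sketched as follows. The AIC follows the usual definition; for the AICc, the exact expression used in the study is not recoverable from the text, so the sketch assumes the common small-sample correction $AIC + 2 n_{par}(n_{par} + 1)/(n - n_{par} - 1)$. The log-likelihood values are hypothetical, and the parameter counts (7 without and 9 with one covariate) are an assumption based on a bivariate model with two locations or a $2 \times 2$ coefficient matrix, three scale-matrix elements and two shape parameters.

```python
def aic(loglik, n_par):
    """Akaike information criterion: -2 log L + 2 n_par."""
    return -2.0 * loglik + 2.0 * n_par


def aicc(loglik, n_par, n):
    """Small-sample corrected AIC (assumed form, see the lead-in)."""
    if n - n_par - 1 <= 0:
        raise ValueError("n must exceed n_par + 1")
    return aic(loglik, n_par) + 2.0 * n_par * (n_par + 1) / (n - n_par - 1)


n = 82  # number of soil samples in the study
# Hypothetical log-likelihoods; parameter counts are assumptions.
without_cov = {"loglik": -100.0, "n_par": 7}
with_cov = {"loglik": -90.0, "n_par": 9}
for label, m in (("without covariate", without_cov), ("with covariate", with_cov)):
    print(label, aic(m["loglik"], m["n_par"]), aicc(m["loglik"], m["n_par"], n))
```

The model with the lower criterion value is preferred, so the extra parameters of the covariate model are only justified if the log-likelihood increases enough to offset the penalty.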

Results and discussion
Table 1 lists the parameter estimates of the models with and without the total soil porosity. Based on the 95% confidence intervals (CI), the mean of the variable $Y_2$ as well as the skewness components were significant in the first model, unlike in the second model. However, since the confidence limits were close to zero, the skewness parameter was maintained in the model (Tables 1 and 2). This result is consistent with Table 3, where the hypothesis of bivariate normality of the data is also rejected for the model without the covariate.
In Figures 1 and 2, the points in the QQ plots lay closest to the line for the skew-normal model with the covariate. Similarly, the PP plots confirmed the better fit of the skew-normal model when the total soil porosity is included in the model.
Figure 3 shows the contours of the fitted models indicating the suitability of the skew-normal distribution.

Conclusion
The results indicate that the bivariate skew-normal distribution is an alternative for modeling ALR-transformed compositional soil data. The skew-normal model with soil porosity as a covariate is more appropriate than the normal distribution. Studies involving particle size analysis can benefit from this methodology, since they produce compositional data with three components.
Similar to Martins et al. (2009), who applied the ALR transformation, the results are transformed back to the original scale.

Figure 1. QQ-plots for the normal (a) and skew-normal (b) distributions, and PP-plots for the normal (c) and skew-normal (d) distributions, for the model without the covariate (axis label: percentiles of the chi-square distribution).

Figure 2. QQ-plots for the normal (a) and skew-normal (b) distributions, and PP-plots for the normal (c) and skew-normal (d) distributions, for the model with the covariate.

Figure 4 illustrates the texture classification diagram with the observed soil compositions and the compositions estimated by the model with the covariate. The observed compositions classify this soil as Clay-loam to Very Clay, passing through Clay. The model is nevertheless satisfactory, and the estimated compositions classify the soil as Clay-loam and Clay.

Figure 4. Diagram of the textural classification for soil composition estimated by the skew-normal model using soil porosity as covariate.

Table 1. Estimates and confidence intervals for the parameters of the skew-normal model with and without the covariate.

Table 2. Estimates and confidence intervals for the asymmetry coefficient of the skew-normal model with and without the covariate.

The models were compared using Equation (5). Consistent with the previous results, the model with the covariate had the better goodness-of-fit, given its lower AIC and AICc values (Table 4).

Table 3. Test statistic and p-value for the likelihood ratio test of bivariate normality.

Table 4. Log-likelihood and comparison of the skew-normal models with and without the covariate.