Comparison of tests on covariance structures of normal populations

In some studies, there is interest in testing the covariance structure, as in the context of multivariate or modelling techniques. Therefore, the importance of using hypothesis tests on covariance structures is emphasized. The purpose of this study was to perform a detailed evaluation of the power and type I error rate of some existing identity and sphericity tests, considering scenarios with different numbers of variables (2 to 64) and sample sizes (5 to 100). The proposal of Ledoit and Wolf (2002) is the most appropriate to test the identity structure. For the sphericity test, the version of John (1972), modified by Ledoit and Wolf (2002), followed by the proposal of Box (1949), were the ones with the best performance.


Introduction
In data analysis, for the univariate or multivariate analysis to be applied more effectively, some assumptions must hold. Some of these assumptions refer to the covariance structure. In order to ensure the assumptions about covariance, it is necessary to perform some hypothesis test capable of evaluating the structure of the matrix under study, which can be presented in different formats. When several hypothesis tests are available, the most appropriate one must be chosen, which allows for greater confidence in the results. The selection of the test is made with knowledge of its performance, that is, of the control of the type I error rate and the power. Such a detailed and descriptive study of the behaviour of the tests of interest, considering its importance, was not found in the specific literature. This study aimed to carry out a performance evaluation in relation to the power and type I error rate of existing identity and sphericity tests, in addition to classifying the tests as conservative, exact or liberal.
The use of a test with inadequate power and type I error rate affects decision-making regarding, for example, dependence or independence between variables, the applicability of multivariate techniques, and adequate modelling of the covariance structure, among others. Knowledge of the type I error and power of any hypothesis test is directly related to success in decision-making. According to Cantelmo and Ferreira (2007), a perfect hypothesis test is one that never rejects a true null hypothesis and always rejects a false null hypothesis, a situation considered unreal. In practice, an ideal test is one that has a type I error rate close to the level of significance adopted and power approaching 100%. For this purpose, the particular cases of the likelihood ratio tests, the identity and sphericity tests for a normally distributed population, were used. These tests were compared considering different simulated scenarios.
Methods such as regression require an assumption on the covariance matrix when independence is assumed. The F-test requires a structure in the form of compound symmetry or HF to be valid (Huynh & Feldt, 1976; Littell, Henry, & Ammerman, 1998). Time series can be adjusted by regression models (Kedem & Fokianos, 2005), with the dependence controlled by the covariance matrix (uncorrelated residuals); several structures can occur in these series. Generalized and mixed linear models expand this application, allowing the use of other covariance structures. Gouveia, Silva, Ferreira, Gadelha, and Lima Filho (2015) intended to estimate the volume of Eucalyptus clones using mixed models, adopting the heterogeneous first-order autoregressive structure. Xavier and Dias (2001) examined cases in which the covariance matrix satisfies, or does not satisfy, the sphericity condition; if the matrix does not meet the sphericity condition, corrections should be applied to the within-subjects degrees of freedom. The test presented below is a particular case of the test proposed by Korin (1968), according to Equation 1:

−2 ln(Λ) = n[tr(S) − ln|S| − p],    (1)

where tr( ) denotes the trace of a matrix and S refers to the biased sample covariance matrix estimator, this statistic being asymptotically chi-square distributed (χ²(f)) with f = p(p + 1)/2 degrees of freedom. Ledoit and Wolf (2002) also presented an alternative test for this case, also suitable for the situation where the number of variables p exceeds the sample size n. This test is not limited by the singularity of the sample covariance matrix, and its statistic is given by Equation 2:

W = (1/p)tr[(S − I)²] − (p/n)[(1/p)tr(S)]² + p/n,    (2)

with npW/2 asymptotically chi-square distributed with f = p(p + 1)/2 degrees of freedom. According to Ferreira (2008), for the independence test it is assumed that the population covariances are null.
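As an illustration, the two identity statistics above (Equations 1 and 2) can be sketched in Python (a minimal sketch assuming the standard forms of the likelihood ratio and Ledoit–Wolf statistics; the function names are ours):

```python
import numpy as np

def korin_identity_stat(S, n):
    """Likelihood-ratio statistic for H0: Sigma = I (Equation 1).

    S is the biased (MLE) sample covariance matrix of n observations;
    asymptotically chi-square with p(p + 1)/2 degrees of freedom.
    """
    p = S.shape[0]
    _, logdet = np.linalg.slogdet(S)
    return n * (np.trace(S) - logdet - p)

def ledoit_wolf_identity_stat(S, n):
    """Ledoit-Wolf (2002) statistic for H0: Sigma = I (Equation 2).

    W = (1/p) tr[(S - I)^2] - (p/n) [(1/p) tr(S)]^2 + p/n;
    n*p*W/2 is asymptotically chi-square with p(p + 1)/2 df and
    remains usable when p exceeds n.
    """
    p = S.shape[0]
    D = S - np.eye(p)
    W = np.trace(D @ D) / p - (p / n) * (np.trace(S) / p) ** 2 + p / n
    return n * p * W / 2

# Sanity check: when S is exactly the identity, both statistics are zero.
S0 = np.eye(4)
print(korin_identity_stat(S0, n=20))        # 0.0
print(ledoit_wolf_identity_stat(S0, n=20))  # 0.0
```

Both functions grow as S deviates from the identity, which is what the power studies below exploit.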
Among its many uses, according to Li and Yao (2016), we can mention the particular interest of the biological area in testing genetic independence in genomic studies, which inspired a range of discussions about the importance of tests of covariance matrix structures. In this case, there are still two situations to consider: the variances may be different, or the variances may be equal (homogeneous), with H0: Σ = Σ0, where Σ0 denotes the covariance matrix specified in the null hypothesis. Under the null hypothesis, the likelihood ratio test statistic is given by Equation 3:

−2 ln(Λ) = n[ln|Σ0| − ln|S| + tr(Σ0⁻¹S) − p],    (3)

so that the statistic has asymptotic chi-square distribution with f = p(p + 1)/2 degrees of freedom, i.e., according to Equation 4:

−2 ln(Λ) ~ χ²(f).    (4)

Bartlett (1954) proposed a correction to improve the chi-square asymptotic approximation. Under the null hypothesis, the corrected statistic can be written as Equation 5:

−2[1 − (2p + 1 − 2/(p + 1))/(6n)] ln(Λ) ~ χ²(f).    (5)

The sphericity hypothesis refers to the structure of a covariance matrix whose variances are homogeneous. According to Ferreira (2008), the objective is to know whether the study variables are uncorrelated and share the same variability. Some multivariate methodologies may require that the variables under study be correlated.
According to Malhotra (2012), Bartlett's sphericity test is used when one wants to verify the hypothesis that the variables are not correlated in the population, that is, the correlation matrix has a spherical structure: each variable correlates perfectly with itself, but does not correlate with the other variables under study. The hypothesis thus assumes null covariances and equal (homogeneous) variances. Further cases of sphericity can be tested using diagonal covariance matrices whose variance values are constant and known.
Under the null hypothesis H0: Σ = σ²I, the likelihood ratio statistic for sphericity involves S, the estimator of the biased sample covariance matrix, and has asymptotic chi-square distribution, that is, Equation 6:

−2 ln(Λ) = n[p ln(tr(S)/p) − ln|S|] ~ χ²(f),    (6)

with f = p(p + 1)/2 − 1 degrees of freedom. Box (1949) presented a proposal for better performance, in which the corrected statistic is given by Equation 7:

−2[1 − (2p² + p + 2)/(6pn)] ln(Λ) ~ χ²(f).    (7)

For the case p > n, John (1972) and Ledoit and Wolf (2002) presented an alternative test developed with the aim of making a new correction for the existing one. The test statistic that makes it robust against high dimensionality is given by Equation 8:

U = (1/p)tr{[S/((1/p)tr(S)) − I]²},    (8)

with npU/2 asymptotically chi-square distributed with f degrees of freedom. The same test was performed in the work of Timm (2002); however, a change was made in the degrees of freedom: instead of n, n − 1 was used. This modification was proposed by Sugiura (1972). Sphericity tests are a particular case of compound symmetry tests.
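The three sphericity statistics (Equations 6 to 8) can likewise be sketched in Python (a sketch assuming the standard forms of these statistics; function names are ours):

```python
import numpy as np

def sphericity_lrt_stat(S, n):
    """Likelihood-ratio statistic for H0: Sigma = sigma^2 I (Equation 6).

    Asymptotically chi-square with p(p + 1)/2 - 1 degrees of freedom.
    """
    p = S.shape[0]
    _, logdet = np.linalg.slogdet(S)
    return n * (p * np.log(np.trace(S) / p) - logdet)

def box_corrected_stat(S, n):
    """Box (1949) correction of the sphericity LRT (Equation 7)."""
    p = S.shape[0]
    c = 1 - (2 * p**2 + p + 2) / (6 * p * n)
    return c * sphericity_lrt_stat(S, n)

def john_lw_stat(S, n):
    """John (1972) / Ledoit-Wolf (2002) U statistic (Equation 8).

    n*p*U/2 is asymptotically chi-square with p(p + 1)/2 - 1 df
    and remains valid under high dimensionality (p > n).
    """
    p = S.shape[0]
    D = S / (np.trace(S) / p) - np.eye(p)
    U = np.trace(D @ D) / p
    return n * p * U / 2

# Sanity check: a perfectly spherical matrix gives (near) zero everywhere.
S0 = 3.0 * np.eye(5)
print(sphericity_lrt_stat(S0, 30))  # ~0.0
print(john_lw_stat(S0, 30))         # ~0.0
```

The Sugiura (1972) variant is obtained by passing n − 1 instead of n to `john_lw_stat`.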

Material and methods
The data used in this work are of a fictitious nature, obtained through Monte Carlo simulations. Observations of multivariate normal distributions were generated with a null mean vector, without loss of generality. Different sample sizes (n) and numbers of variables (p) were considered. In each scenario, independent observations were generated on index k (sample unit), k = 1, ..., n, and with different correlation in each scenario (index l = 1, ..., p variables). The scenarios under study were created through several combinations of n and p, satisfying the restriction p < n.
The behaviour of the effective type-I error rate, for a nominal value α = 5%, was evaluated in scenarios with increasing n (5, 10, 20, 30, 50 and 100) and increasing p (2, 4, 8, 16, 32 and 64). After obtaining the n observations of the p variables, the statistics of the identity and sphericity tests presented were computed using the software R (R Core Team, 2018). For the simulation of the data, the mvtnorm package of R (Genz & Bretz, 2009) was used, which generates random samples for a pre-established number of observations (n) and variables (p). One thousand simulated samples were generated, depending on the matrix that originated the data under the null hypothesis and under the alternative hypothesis. After the simulation, the tests were analysed and interpreted in relation to the type-I error rate (with nominal probability α) and power (complementary to the type-II error rate) in the proposed scenarios.
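The Monte Carlo routine just described can be sketched as follows (a simplified sketch in Python rather than the R/mvtnorm setup of the paper, using the Ledoit–Wolf identity statistic as the example test; 7.815 is the 95% quantile of the chi-square with p(p + 1)/2 = 3 df for p = 2):

```python
import numpy as np

def lw_identity_stat(X):
    """Ledoit-Wolf identity statistic computed from an n x p data matrix."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n  # biased sample covariance estimator
    D = S - np.eye(p)
    W = np.trace(D @ D) / p - (p / n) * (np.trace(S) / p) ** 2 + p / n
    return n * p * W / 2

rng = np.random.default_rng(2018)
n, p, n_sim = 50, 2, 1000
crit = 7.815  # chi-square 95% quantile with f = p(p + 1)/2 = 3 df

# Simulate under H0: Sigma = I and count rejections at nominal alpha = 5%.
rejections = sum(
    lw_identity_stat(rng.standard_normal((n, p))) > crit
    for _ in range(n_sim)
)
print(rejections / n_sim)  # effective type I error rate, expected near 0.05
```

Repeating this under matrices that deviate from H0 yields the power curves discussed below.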

Study 1: power of the tests for identity
H0: Σ = I versus H1: Σ ≠ I. To obtain the type I error rate, the scenario started in H0, which consists of an identity matrix (p × p). For the construction of the power curve, an arbitrary measure (δ) was used that characterizes the ratio of determinants between the matrix generated under H1 and the matrix specified in H0, δ = |Σi|/|I|, with i = 2, 3, ..., k − 1, which characterizes a variability ratio δ (delta). The maximum value set for δ was 64, so that all tests reach the maximum power. The tests involved in this study were: Korin (1968) [Equation 1 - code: Korin I] and Ledoit and Wolf (2002) [Equation 2 - code: Ledoit I].
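One simple way to realize a given determinant ratio δ is to inflate the identity by a scalar, since |cI| = c^p (this construction is an assumption of ours for illustration; the paper does not state how the H1 matrices were built):

```python
import numpy as np

def sigma_for_delta(delta, p):
    """Covariance matrix whose determinant ratio to |I| = 1 equals delta.

    Assumes (hypothetically) that the deviation from H0: Sigma = I is a
    scalar inflation, Sigma = delta**(1/p) * I, so that |Sigma|/|I| = delta.
    """
    return delta ** (1.0 / p) * np.eye(p)

# The delta grid of Study 1 runs up to 64.
for delta in (1, 2, 4, 8, 16, 32, 64):
    Sigma = sigma_for_delta(delta, p=4)
    print(delta, round(np.linalg.det(Sigma), 6))
```

At δ = 1 the matrix coincides with H0, which is the point used for the type I error rate.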

Study 2
Power of the sphericity tests: H0: Σ = 64I versus H1: Σ ≠ σ²I. Starting from H0 (which consists of the spherical covariance structure obtained with the highest value of δ in Study 1), the deviation from H0 was caused by the gradual increase of the value of ρ from 0 to 0.9 in steps of 0.1. Negative ρ values were not considered. In this study, the following sphericity tests were evaluated: original likelihood ratio test [Equation 4 - code: LRT II]; Bartlett (1954) [Equation 5 - code: Bartlett]; likelihood ratio test for sphericity [Equation 6 - code: LRT I]; Box (1949) [Equation 7 - code: Box]; John (1972) modified by Ledoit and Wolf (2002) [Equation 8 - code: jlw]; and Sugiura (1972) [Equation 8 with n − 1 - code: sug].
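The H1 matrices of Study 2 can be illustrated with an equicorrelation (compound symmetry) structure, which is spherical at ρ = 0 and deviates from sphericity as ρ grows (the exact form is our assumption; the paper states only that ρ increases from 0 to 0.9):

```python
import numpy as np

def equicorrelation_sigma(p, rho, sigma2=64.0):
    """Sigma = sigma2 * [(1 - rho) I + rho J], with J the all-ones matrix.

    For rho = 0 this is the spherical H0 matrix 64I of Study 2; larger
    rho moves the matrix further from sphericity.
    """
    return sigma2 * ((1 - rho) * np.eye(p) + rho * np.ones((p, p)))

# The rho grid of Study 2: 0 to 0.9 in steps of 0.1.
for rho in np.arange(0.0, 1.0, 0.1):
    Sigma = equicorrelation_sigma(4, rho)
    # the off-diagonal correlation equals rho by construction
    print(round(rho, 1), Sigma[0, 1] / Sigma[0, 0])
```

Feeding these matrices to the data generator and the sphericity statistics traces the power curves as functions of ρ.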
In the process of classifying each test as conservative, exact or liberal, the exact confidence interval for a proportion, based on the binomial distribution, was obtained in the Sisvar software (Ferreira, 2014). The interval used for the level of significance was the 99% CI (α): [0.0339; 0.0705], which contains the true rejection rate of an exact test in a simulation process with 1,000 repetitions; values below the lower limit denote a conservative test and values above the upper limit, a liberal test. The criterion used to indicate the most powerful test was: given that a test was considered exact, the test of greatest power was recommended.
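This interval can be reproduced approximately with the Clopper–Pearson exact construction (a sketch assuming 1,000 repetitions, nominal α = 5%, and a 99% confidence level; the paper obtained [0.0339; 0.0705] with Sisvar, whose exact method may differ slightly):

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, conf=0.99):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    a = (1 - conf) / 2
    lo = beta.ppf(a, successes, trials - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - a, successes + 1, trials - successes) if successes < trials else 1.0
    return lo, hi

# Expected rejections of an exact test: 5% of 1,000 repetitions.
lo, hi = clopper_pearson(50, 1000)
print(round(lo, 4), round(hi, 4))  # close to the reported [0.0339; 0.0705]
```

A simulated rejection rate below `lo` flags a conservative test; above `hi`, a liberal one.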

Results and discussion
The order of the studies was followed for the presentation of the results. In Study 1, two specific tests for identity, Ledoit I and Korin I, were evaluated. For the p = 2 scenario, both tests preserved the type I error rate equal to the level of significance adopted; thus, they were considered exact for every n evaluated. In the scenarios in which the sample size is small (n = 5 and n = 10), the Ledoit I test was more powerful for every δ value and should be recommended in these situations. According to Nogueira and Pereira (2013), it is important to have tests with levels of significance close to the α adopted a priori and with high power, even in situations of small samples. In summary, in this scenario, the tests reach high power at smaller values of δ as the sample size increases. In the case p = 2, regardless of sample size, the Ledoit I test is recommended.
For the p = 4 case, all sample sizes indicated that the Ledoit I and Korin I tests showed exact behaviour. In these situations, the Ledoit I test was more powerful for all δ values in relation to Korin I. The power curves of the tests only approach for n = 100. After analysis, regardless of all possible combinations, the use of the Ledoit I test is indicated because it presents the type I error rate close to the level of significance adopted and higher levels of power.
For p = 8, Ledoit I is liberal and Korin I exact in the scenarios n = 20 and n = 30. Increasing the sample size to n ≥ 50 caused a decrease in the type I error rate of both tests; Ledoit I, which was liberal, became exact and was considered more powerful. For n = 100 and δ = 2 to 16, the Ledoit I test was more powerful; for the other values of δ, both tests reach the maximum probability of power. The results suggest the Ledoit I test as the most appropriate.
In the p = 16 case considering n = 30, a liberal behaviour of both tests was obtained. The Ledoit I test was more powerful for every δ, but did not approximate the maximum probability of rejection of H0 in the scenario evaluated. Remembering that, in this case, it is a false power. For n = 50, Ledoit I is liberal and Korin I exact, but both do not reach the maximum probability of power. In n = 100 for all points of the power curve, the Ledoit I test was considered the most powerful. The power curve showed a similar behaviour for all n used. The increase in n caused the decrease in the value of the type I error rate and improvement of power in both tests.
When considering p = 32 and n = 50, the Ledoit I and Korin I tests were evaluated as liberal. The Korin I test was more powerful than Ledoit I. The higher the value of the type I error rate, the more power it has, but this is not a reliable result. For the n = 100 scenario, the Ledoit I test was liberal and Korin I was rated as exact and most powerful. Setting p = 32, increasing the sample size from 50 to 100 provided a reduction in the type I error rate of both tests where they presented similar behaviour.
When assessing p = 64 and n = 100, it was observed that both tests are liberal. The Korin I test was the most powerful for all δ values assessed, but also the most liberal. Both presented low power, which contradicts what was expected; moreover, liberal tests are naturally more powerful. In this scenario, the discussion arises that committing a type I error may be more serious than a type II error. It was observed in Study 1 (Figure 1) that, as the values of n and p increase, both tests become less powerful. In the classification of the type I error rate (δ = 1), the Ledoit I test is the most indicated for Study 1, as presented in Table 1. A noteworthy fact is that the Ledoit I test, created for situations where p > n, obtained a satisfactory result in this study, where n > p, a situation that avoids null eigenvalues in the statistic Λ, according to Wang and Yao (2013).
In Study 2, the sphericity tests were evaluated. In the p = 2 case, considering n = 5, the LRT II, jlw and sug tests are conservative, Box and Bartlett exact, and LRT I is considered liberal. In this scenario, given the results obtained, the most indicated test is Bartlett, for better control of the type I error rate and greater power among the exact ones, followed by Box. Wang and Yao (2013) proposed corrections for the likelihood ratio test and John's (1972) test for the sphericity hypothesis in large dimensions.
The performance of these tests was evaluated in situations of normality and non-normality. It was concluded that, when the sample size n is fixed, the power of the tests decreases when the value of p is close to n; this can be explained by the fact that some of the eigenvalues of the maximum likelihood estimator S approach zero, making the test almost degenerate and lose power. In the n = 5 case, where the sample size is small and approaches the number of variables p, the jlw test also did not perform satisfactorily. In the n = 10 scenario, none of the assessed tests was liberal; the jlw, sug, Bartlett and LRT II tests were conservative, and Box and LRT I exact. From the results, it was concluded that the most indicated test is the one proposed by Box (1949), as it leads to a better approximation to the chi-square distribution in small samples.
In the n = 20 and n = 30 scenarios, the tests were classified in relation to the type I error rate as follows: LRT I, Bartlett and LRT II as conservative, and Box, jlw and sug as exact. For ρ = 0.9, all tests reached the maximum probability of power. As in the previous scenario, the most effective test was the Box test. For the n = 50 case, LRT I, Bartlett and LRT II were considered conservative; Box, jlw and sug were considered exact. None of the tests was considered liberal; for ρ = 0.7, only Bartlett and LRT II do not reach the maximum probability of power, which occurs for all tests only at ρ = 0.9. Keeping the same number of variables (p = 2) and increasing the sample size to n = 100, the classification in relation to the type I error rate was that the tests LRT I, Bartlett and LRT II are conservative and Box, jlw and sug are exact. For ρ = (0.6, ..., 0.9), all tests reached the maximum probability of power. For p = 2, it was observed that the increase in n brought the power of the tests towards the maximum probability under this simulation proposal.
According to Hair, Black, Babin, and Anderson (2010), when the sample size is small the hypothesis test is not very sensitive; the authors also state that increasing the sample size will increase the power of the test, which was noted in this scenario. The most suitable test for n > 5 is, unanimously, the Box test.
Increasing the number of variables to p = 4, in the n = 10 scenario the sug and LRT II tests were conservative; Box, jlw and Bartlett exact; and LRT I liberal. From the sphericity tests evaluated, considering the cases with n ≥ 20, LRT I, Box, jlw and sug were classified as exact, while Bartlett and LRT II as conservative. For n = 20, at ρ = 0.8, all tests reach the maximum probability, with the exception of LRT II. Increasing the value of n to 30 and keeping the number of variables p = 4, for ρ ≥ 0.7 all reached maximum power. In the n = 50 scenarios, from ρ = 0.1 to 0.4, the sequence of greatest power is jlw, sug, Box, LRT I, Bartlett and LRT II. It was found that, by increasing the sample size, the jlw test achieved higher levels of power when compared to the others in most scenarios.
In the p = 4 scenario, the increase in the number of variables provided a decrease in the type I error rate and an improvement in the power of the tests. The LRT I test, which was liberal, became exact; LRT II remained conservative; sug changed from conservative to exact; Bartlett changed from exact to conservative; Box and jlw remained exact. In these scenarios, the jlw test, a comprehensive test created even for singular matrices and p > n (Ferreira, 2008), is a satisfactory and recommended test. Considering the sample size n = 20 and the increase in the number of variables to p = 8, LRT II showed conservative behaviour; Box and sug, exact behaviour; and LRT I, jlw and Bartlett, liberal behaviour. The LRT I test showed liberal behaviour and, naturally, a higher level of power is expected from it. In these situations, jlw was considered more appropriate for maintaining control of the type I error rate. With the increase in the sample size and in the number of variables, the tests became more powerful and some of them more conservative, better controlling the type I error rate.
In the p = 16 case with n = 30, LRT II is the only conservative test, sug is the only exact one, and LRT I, Box, jlw and Bartlett are liberal. At n = 50, Box and jlw become exact and sug liberal; the decreasing sequence of power is LRT I, jlw, sug, Bartlett, Box and LRT II. For the p = 32 case, jlw remained the most appropriate test for controlling the type I error rate and having the best power curve.
In the work of Wang and Yao (2013), when studying the behaviour of sphericity tests in large dimensions under normality, with n equal to 64 or 128 and p varying below these values, the power of the tests increases with the sample size. For the large sample size n = 256, with p ranging from 16 to 240, all the evaluated tests of interest showed power around 1. A similar situation occurs in this study: when n = 100, the maximum sample size evaluated, the tests present power close to 1 as the correlation increases. Assuming p = 64 and n = 100, sug is an exact test and all the other tests are liberal; at heterogeneity ρ = 0.1, all tests reached the maximum power level except the LRT II test. In Study 2, regardless of the proposed combinations of n and p given in Table 2, the test that was most indicated, disregarding the situations of false power, is the jlw [see Figure 2].
In this study, as in the work of Lim, Li, and Lee (2010), it was possible to observe that, in general, the power performance of the likelihood ratio tests was low. According to the authors, it is well known that most likelihood ratio tests based on the limiting chi-square approximation have a high likelihood of rejection; the modifications applied to the test statistics improved the performance of the chi-square approximation.

Conclusion
It was found that the modifications contributed to increasing the power of the tests in Studies 1 and 2. In order to evaluate identity, Ledoit and Wolf's (2002) proposal was the most appropriate one; for sphericity, the version of John (1972) modified by Ledoit and Wolf (2002), followed by Box's (1949) proposal, were the ones with the best performance. Bartlett's proposal should be used only for small samples and a small number of variables.