Nonparametric hypothesis testing for equality of means on the simplex

In the context of data that lie on the simplex, we investigate the use of empirical and exponential empirical likelihood, and of the Hotelling and James statistics, to test the null hypothesis of equal population means based on two independent samples. We perform an extensive numerical study using data simulated from various distributions on the simplex. The results, taken together with practical considerations regarding implementation, support the use of the bootstrap-calibrated James statistic.


Introduction
Data that lie on the simplex $S^d=\{(x_1,\ldots,x_D)^{\mathrm{T}}:\ x_i\ge 0,\ \sum_{i=1}^{D}x_i=1\}$, where d = D−1, are sometimes called compositional data, and they arise in many disciplines, including geology, [1] economics, [2] archaeology [3] and political science. [4] There has been extensive and sometimes highly charged debate over how best to analyse compositional data; see Scealy and Welsh [5] for a recent review. Some authors strongly advocate the use of a log-ratio transformation, [6] while others advocate different approaches; e.g. Baxter et al. [3] consider the direct analysis of compositional data with no transformation. Here, our goal is purely to explore the performance of various nonparametric methods of testing for equality of means of two populations on the simplex, using linear statistics based on independent samples from each population.
We consider two forms of nonparametric likelihood. Empirical likelihood (EL) is a nonparametric likelihood which shares many of the properties of parametric likelihoods. [7][8][9][10] Exponential empirical likelihood (EEL), due to Efron, [11] is obtained by exponential tilting. EEL has similar first-order asymptotic properties to EL but different second-order properties; e.g. in contrast to EL, which is Bartlett correctable, [12] EEL is not Bartlett correctable. [13] Zhu et al. [14] consider a correction to EEL. However, in this paper we focus on higher-order corrections based on bootstrap calibration rather than Bartlett correction or other types of analytic correction (see [15] for discussion of these corrections).

Quadratic tests for two population mean vectors
The two quadratic-form test statistics we will use are the Hotelling statistic and the James statistic, defined as follows.

Two-sample equality of mean vector test when $\Sigma_1 = \Sigma_2$ (Hotelling test)
If the covariance matrices can be assumed equal, the Hotelling $T^2$ test statistic for two $d$-dimensional samples is given by [22]
$$T^2=\frac{n_1 n_2}{n_1+n_2}\left(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2\right)^{\mathrm{T}}\mathbf{S}_p^{-1}\left(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2\right),$$
where $\mathbf{S}_p=\left[(n_1-1)\mathbf{S}_1+(n_2-1)\mathbf{S}_2\right]/(n_1+n_2-2)$ is the pooled covariance matrix, $\mathbf{S}_1$ and $\mathbf{S}_2$ are the two unbiased sample covariance matrices, $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ are the two sample means, and $n_1$ and $n_2$ are the two sample sizes. Under $H_0$, and when the central limit theorem holds true for each population, we have that [22]
$$T^2\sim\frac{(n_1+n_2-2)\,d}{n_1+n_2-d-1}\,F_{d,\,n_1+n_2-d-1}.$$
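As an illustration, the statistic and its F calibration above can be sketched as follows; the function name and the simulated data are our own, not from the paper's code.

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(x1, x2):
    """Two-sample Hotelling T^2 with the F calibration under H0."""
    n1, d = x1.shape
    n2 = x2.shape[0]
    s1, s2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)  # unbiased covariances
    sp = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)         # pooled covariance
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(sp, diff)
    # T^2 ~ ((n1 + n2 - 2) d / (n1 + n2 - d - 1)) F_{d, n1 + n2 - d - 1}
    f_stat = t2 * (n1 + n2 - d - 1) / ((n1 + n2 - 2) * d)
    pval = stats.f.sf(f_stat, d, n1 + n2 - d - 1)
    return t2, pval

rng = np.random.default_rng(0)
t2, pval = hotelling_two_sample(rng.normal(size=(30, 3)), rng.normal(size=(40, 3)))
```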

Two-sample equality of mean vector test when $\Sigma_1 \neq \Sigma_2$ (James test)
James [23] proposed a test for linear hypotheses about the population means when the covariance matrices are not assumed equal. The test statistic for two $d$-dimensional samples is
$$T_u^2=\left(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2\right)^{\mathrm{T}}\tilde{\mathbf{S}}^{-1}\left(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2\right),$$
where $\tilde{\mathbf{S}}=\tilde{\mathbf{S}}_1+\tilde{\mathbf{S}}_2=\mathbf{S}_1/n_1+\mathbf{S}_2/n_2$. James [23] suggested that the test statistic be compared with $2h(\alpha)$, a corrected $\chi^2$ quantile of the form
$$2h(\alpha)=\chi^2_{d,1-\alpha}\left(A+B\,\chi^2_{d,1-\alpha}\right),$$
where
$$A=1+\frac{1}{2d}\sum_{j=1}^{2}\frac{\left[\operatorname{tr}\left(\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{S}}_j\right)\right]^2}{n_j-1},\qquad
B=\frac{1}{d(d+2)}\left\{\sum_{j=1}^{2}\frac{\operatorname{tr}\left[\left(\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{S}}_j\right)^2\right]}{n_j-1}+\frac{1}{2}\sum_{j=1}^{2}\frac{\left[\operatorname{tr}\left(\tilde{\mathbf{S}}^{-1}\tilde{\mathbf{S}}_j\right)\right]^2}{n_j-1}\right\},$$
and $\chi^2_{\nu,1-\alpha}$ is such that $\mathrm{P}\left[\chi^2_\nu\le\chi^2_{\nu,1-\alpha}\right]=1-\alpha$, with $\chi^2_\nu$ a chi-squared random variable with $\nu$ degrees of freedom. Krishnamoorthy and Yu [24] showed that, under the multivariate normality assumption for each sample,
$$T_u^2\sim\frac{\nu d}{\nu-d+1}\,F_{d,\,\nu-d+1}$$
approximately, with data-estimated degrees of freedom $\nu$ given in Equation (4). The advantage of the calibration proposed by James [23] is that it can be applied to more than two samples, whereas Krishnamoorthy and Yu [24] calculated the degrees of freedom of the F distribution for the two-sample case only.
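A sketch of the James statistic with a corrected $\chi^2$ cut-off follows; the $A$ and $B$ terms are written in the standard two-sample form of James (1954), and the function name and simulated data are illustrative, not the paper's code.

```python
import numpy as np
from scipy import stats

def james_test(x1, x2, alpha=0.05):
    """Two-sample James statistic and his corrected chi-squared cut-off 2h(alpha)."""
    n = [x1.shape[0], x2.shape[0]]
    d = x1.shape[1]
    st = [np.cov(x1, rowvar=False) / n[0], np.cov(x2, rowvar=False) / n[1]]
    s = st[0] + st[1]                        # S~ = S1/n1 + S2/n2
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    tj = diff @ np.linalg.solve(s, diff)     # quadratic-form statistic
    sinv = np.linalg.inv(s)
    tr1 = [np.trace(sinv @ st[j]) for j in range(2)]
    tr2 = [np.trace(sinv @ st[j] @ sinv @ st[j]) for j in range(2)]
    A = 1.0 + sum(tr1[j] ** 2 / (n[j] - 1) for j in range(2)) / (2.0 * d)
    B = (sum(tr2[j] / (n[j] - 1) for j in range(2))
         + 0.5 * sum(tr1[j] ** 2 / (n[j] - 1) for j in range(2))) / (d * (d + 2.0))
    q = stats.chi2.ppf(1.0 - alpha, d)
    return tj, q * (A + B * q)               # reject H0 when tj exceeds the cut-off

rng = np.random.default_rng(1)
x1 = rng.normal(size=(25, 3))
x2 = rng.normal(scale=2.0, size=(40, 3))    # deliberately different covariance
tj, cutoff = james_test(x1, x2)
```

Since $A>1$ and $B>0$, the corrected cut-off always exceeds the uncorrected $\chi^2_{d,1-\alpha}$ quantile.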

EL for the two sample case
Two-sample hypothesis testing using EL is described in Jing [13] and Liu et al. [25]
The two constraints imposed by EL are
$$\sum_{i=1}^{n_j}p_{ji}\left(\mathbf{x}_{ji}-\boldsymbol{\mu}\right)=\mathbf{0},\qquad j=1,2,$$
where the $\boldsymbol{\lambda}_j$s are Lagrangian parameters introduced to maximize (7); note that the maximization of Equation (7) is with respect to the $\boldsymbol{\lambda}_j$s. The probabilities of each of the $j$ samples have the following form:
$$p_{ji}=\frac{1}{n_j}\,\frac{1}{1+\boldsymbol{\lambda}_j^{\mathrm{T}}\left(\mathbf{x}_{ji}-\boldsymbol{\mu}\right)},$$
where $\boldsymbol{\lambda}_1+\boldsymbol{\lambda}_2=\mathbf{0}$ is a convenient constraint that can be used. The log-likelihood ratio test statistic can be written as
$$\Lambda=\sum_{j=1}^{2}n_j\left(\bar{\mathbf{x}}_j-\boldsymbol{\mu}\right)^{\mathrm{T}}\mathbf{S}_j(\boldsymbol{\mu})^{-1}\left(\bar{\mathbf{x}}_j-\boldsymbol{\mu}\right),$$
where $\mathbf{S}_j(\boldsymbol{\mu})=\left[(n_j-1)/n_j\right]\mathbf{S}_j+\left(\bar{\mathbf{x}}_j-\boldsymbol{\mu}\right)\left(\bar{\mathbf{x}}_j-\boldsymbol{\mu}\right)^{\mathrm{T}}$, with $\mathbf{S}_j$ denoting the unbiased sample covariance matrix.

[Figure note: the solid horizontal line indicates the nominal level (5%) and the two dashed lines are the lower and upper limits of the simulation error; the black and red lines refer to the Hotelling and James tests, and the green and blue lines to the EEL and EL tests, respectively.]
Asymptotically, under $H_0$, $\Lambda\sim\chi^2_d$, since $\mathbf{S}_j(\boldsymbol{\mu})\xrightarrow{p}\boldsymbol{\Sigma}_j$ (by Slutsky's theorem), where $\boldsymbol{\Sigma}_j$ is the population covariance matrix. A derivation of the asymptotic distribution of the test statistic when we have more than two samples, in both the univariate and multivariate case, can be found in [9].
However, asymptotically, this test statistic is the same as the test statistic suggested by James. [23] As mentioned in Section 2.2, James [23] used a corrected $\chi^2$ distribution for the two-sample case. Therefore, we have strong grounds for suggesting James' corrected $\chi^2$ distribution, instead of the classical $\chi^2$ distribution, for calibrating the EL test statistic.
Maximization of Equation (7) with respect to a scalar $\lambda$ is easy in the univariate case, since a simple search over an interval suffices. In the multivariate case, though, the difficulty increases with the dimensionality. Another important issue we highlight is that the EL test statistic cannot be computed if $\boldsymbol{\mu}$ lies outside the convex hull of the data. [19] This issue also becomes more acute as the dimension increases.
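The computation just described, an inner root search for each $\boldsymbol{\lambda}_j$ and an outer minimisation over the common mean, can be sketched as follows. This is a schematic illustration under our own naming, not the emplik-based implementation used later in the paper, and it assumes the candidate mean stays inside the convex hull of each sample.

```python
import numpy as np
from scipy import optimize

def el_lambda(x, mu):
    """Root of the one-sample EL score equation for lambda at a fixed mu."""
    z = x - mu
    def score(lam):
        w = 1.0 + z @ lam                   # 1 + lambda^T (x_i - mu)
        return (z / w[:, None]).sum(axis=0)
    return optimize.root(score, np.zeros(x.shape[1])).x

def el_two_sample(x1, x2):
    """Minimise the sum of the two one-sample EL ratios over the common mean."""
    def neg2logLR(mu):
        stat = 0.0
        for x in (x1, x2):
            lam = el_lambda(x, mu)
            stat += 2.0 * np.log1p((x - mu) @ lam).sum()
        return stat
    mu0 = np.vstack([x1, x2]).mean(axis=0)  # pooled mean as starting value
    res = optimize.minimize(neg2logLR, mu0, method='Nelder-Mead')
    return res.fun, res.x                   # statistic (approx. chi^2_d) and common mean

rng = np.random.default_rng(3)
stat, mu_hat = el_two_sample(rng.normal(size=(50, 3)), rng.normal(size=(60, 3)))
```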
As for the distribution of the test statistic under $H_1$, assume that each mean $\boldsymbol{\mu}_j$ deviates from the common mean $\boldsymbol{\mu}$ by a quantity which is a function of the covariance matrix and the sample size, plus a constant vector $\boldsymbol{\tau}_j$ for each sample:
$$\boldsymbol{\mu}_j=\boldsymbol{\mu}+\frac{\boldsymbol{\Sigma}_j^{1/2}\boldsymbol{\tau}_j}{\sqrt{n_j}},$$
where $\boldsymbol{\Sigma}_j$ is the true covariance matrix of the $j$th sample and $\mathbf{z}_j=\sqrt{n_j}\,\boldsymbol{\Sigma}_j^{-1/2}\left(\bar{\mathbf{x}}_j-\boldsymbol{\mu}\right)$. Asymptotically, the scalar factor $\left[1+\left(\left(\mathbf{z}_j-\boldsymbol{\tau}_j\right)^{\mathrm{T}}\left(\mathbf{z}_j-\boldsymbol{\tau}_j\right)-1\right)/n_j\right]^{-1}$ tends to 1, and under $H_1$ Equation (9) is equal to the sum of $k-1$ independent non-central $\chi^2$ variables, each with non-centrality parameter $\|\boldsymbol{\tau}_j\|^2$. Consequently, Equation (9) asymptotically follows a non-central $\chi^2$ distribution with non-centrality parameter $\sum_{j=1}^{k}\|\boldsymbol{\tau}_j\|^2$.

EEL for the two-sample case
EEL, or exponential tilting, was first introduced by Efron [11] as a way to perform a 'tilted' version of the bootstrap for one-sample mean hypothesis testing. Similarly to EL, positive weights $p_{ji}$, which sum to one, are allocated to the observations, such that the weighted sample mean is equal to the population mean $\boldsymbol{\mu}$ under the null hypothesis. Under the alternative hypothesis, the weights are equal to $1/n_j$, where $n_j$ is the sample size. Following Efron, [11] the $p_{ji}$s are chosen to minimize the Kullback-Leibler distance from $H_0$ to $H_1$ subject to the mean constraint, and take the form
$$p_{ji}=\frac{e^{\boldsymbol{\lambda}_j^{\mathrm{T}}\mathbf{x}_{ji}}}{\sum_{k=1}^{n_j}e^{\boldsymbol{\lambda}_j^{\mathrm{T}}\mathbf{x}_{jk}}}.$$
Similarly to EL, a numerical search over $\boldsymbol{\lambda}$ is required. We can derive the asymptotic form of the test statistic (10) in the two-sample case, but in a simpler form, by generalizing the approach of Jing and Robinson [26] to the multivariate case. The three constraints are
$$\sum_{i=1}^{n_1}p_{1i}\mathbf{x}_{1i}=\boldsymbol{\mu},\qquad \sum_{i=1}^{n_2}p_{2i}\mathbf{x}_{2i}=\boldsymbol{\mu},\qquad n_1\boldsymbol{\lambda}_1+n_2\boldsymbol{\lambda}_2=\mathbf{0};$$
similarly to EL, the sum of a linear combination of the $\boldsymbol{\lambda}$s is set to zero. Equating the first two constraints of Equation (12) gives
$$\sum_{i=1}^{n_1}p_{1i}\mathbf{x}_{1i}=\sum_{i=1}^{n_2}p_{2i}\mathbf{x}_{2i},$$
and the third constraint of Equation (12) can be written as $\boldsymbol{\lambda}_2=-\left(n_1/n_2\right)\boldsymbol{\lambda}_1$. This trick allows us to avoid the estimation of the common mean; it is not possible, though, to do this in the EL method. Instead of minimising the sum of the one-sample test statistics over the common mean, we can define the probabilities by searching for the $\boldsymbol{\lambda}$ which makes the last equation hold true. The third constraint of Equation (12) is convenient but, as Jing and Robinson [26] mention, although simple it does not lead to second-order accurate confidence intervals unless the two sample sizes are equal. Asymptotically, the test statistic follows a $\chi^2_d$ distribution under the null hypothesis.
The asymptotic form of the test statistic under $H_1$ is analogous to that of EL. When the sample sizes are large, the scalar factor $\left(1-1/n_j\right)^{-1}$ tends to 1, and thus the asymptotic distribution is the sum of $k-1$ independent non-central $\chi^2$ distributions, each with non-centrality parameter $\|\boldsymbol{\tau}_j\|^2$; hence the statistic in Equation (14) asymptotically follows a non-central $\chi^2$ distribution with non-centrality parameter $\sum_{j=1}^{k}\|\boldsymbol{\tau}_j\|^2$. Thus, under $H_1$, to the leading term, the distribution of the EEL test statistic is the same as that of the EL.
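The single root search over $\boldsymbol{\lambda}_1$ described above can be sketched as follows. The names are our own, and the final ratio is written as the EL-type functional evaluated at the tilted weights, one common convention and an assumption here rather than the paper's exact expression (10); it is non-negative by the AM-GM inequality and approximately $\chi^2_d$ under $H_0$.

```python
import numpy as np
from scipy import optimize

def eel_two_sample(x1, x2):
    """Two-sample EEL via exponential tilting, avoiding the common mean."""
    n1, n2 = x1.shape[0], x2.shape[0]
    d = x1.shape[1]
    def tilt(x, lam):
        w = np.exp(x @ lam)
        return w / w.sum()                   # tilted probabilities p_ji
    def gap(lam1):
        # equate the two tilted sample means; lambda_2 = -(n1/n2) lambda_1
        return tilt(x1, lam1) @ x1 - tilt(x2, -(n1 / n2) * lam1) @ x2
    lam1 = optimize.root(gap, np.zeros(d)).x
    p1 = tilt(x1, lam1)
    p2 = tilt(x2, -(n1 / n2) * lam1)
    # EL-type ratio at the tilted weights (an illustrative convention)
    return -2.0 * (np.log(n1 * p1).sum() + np.log(n2 * p2).sum())

rng = np.random.default_rng(4)
stat = eel_two_sample(rng.normal(size=(40, 3)), rng.normal(size=(55, 3)))
```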

Nonparametric bootstrap hypothesis testing
The use of the null asymptotic distributions described above is justifiable when the sample size is sufficiently large. When the sample size is not large, the bootstrap often offers better performance. [27] The nonparametric bootstrap procedure that we use in Section 3 is as follows: (1) Define the test statistic $T$ as one of Equations (2), (3), (7) or (14), and define $T_{\mathrm{obs}}$ to be $T$ calculated for the available data $\mathbf{x}_{11},\ldots,\mathbf{x}_{1n_1}$ and $\mathbf{x}_{21},\ldots,\mathbf{x}_{2n_2}$, with means $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ and covariance matrices $\mathbf{S}_1$ and $\mathbf{S}_2$. (2) Transform the data so that the null hypothesis ($\boldsymbol{\mu}_1=\boldsymbol{\mu}_2$) is true, by setting $\tilde{\mathbf{x}}_{ji}=\mathbf{x}_{ji}-\bar{\mathbf{x}}_j+\hat{\boldsymbol{\mu}}$, where $\hat{\boldsymbol{\mu}}$ is the estimated common mean under the null hypothesis. (3) Draw bootstrap samples with replacement within each transformed sample and compute the test statistic $T^*_b$ for each of the $B$ bootstrap pairs of samples. (4) Estimate the bootstrap p-value as $\left[\sum_{b=1}^{B}I\left(T^*_b\ge T_{\mathrm{obs}}\right)+1\right]/(B+1)$.
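The steps above can be sketched as follows. The pooled estimate of the common mean and the `james_stat` helper are simplifying illustrations of our own (the paper estimates the common mean per test statistic), not the paper's exact procedure.

```python
import numpy as np

def james_stat(x1, x2):
    """Illustrative quadratic-form statistic (James type, unequal covariances)."""
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    s = np.cov(x1, rowvar=False) / len(x1) + np.cov(x2, rowvar=False) / len(x2)
    return diff @ np.linalg.solve(s, diff)

def bootstrap_pvalue(x1, x2, statistic, B=299, seed=None):
    """Nonparametric bootstrap calibration of a two-sample mean test."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(x1, x2)
    # (2) centre each sample at a common-mean estimate so H0 holds when resampling
    mu = np.vstack([x1, x2]).mean(axis=0)
    y1 = x1 - x1.mean(axis=0) + mu
    y2 = x2 - x2.mean(axis=0) + mu
    exceed = 0
    for _ in range(B):
        b1 = y1[rng.integers(0, len(y1), len(y1))]  # (3) resample within each sample
        b2 = y2[rng.integers(0, len(y2), len(y2))]
        if statistic(b1, b2) >= t_obs:
            exceed += 1
    return (exceed + 1) / (B + 1)                   # (4) bootstrap p-value

rng = np.random.default_rng(5)
p = bootstrap_pvalue(rng.normal(size=(20, 3)), rng.normal(size=(30, 3)),
                     james_stat, B=199, seed=0)
```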

Simulations studies for the performance of the testing procedures applied to two compositional sample means
In this section we compare the aforementioned testing procedures via extensive simulation studies. We apply the transformation
$$\mathbf{y}=\mathbf{H}\mathbf{x}$$
to each compositional vector $\mathbf{x}\in S^d$, where $\mathbf{H}$ is the $d\times D$ Helmert sub-matrix (i.e. the Helmert matrix [28,29] with the first row omitted). The multiplication by the Helmert sub-matrix is essentially a linear transformation of the data (and of the simplex). The various testing procedures are invariant to this transformation, which speeds up computations, particularly for EL.
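SciPy ships the Helmert matrix directly, so this transformation can be sketched as:

```python
import numpy as np
from scipy.linalg import helmert

D = 4                        # number of parts; d = D - 1
H = helmert(D, full=False)   # d x D Helmert sub-matrix (first row omitted)
x = np.array([0.1, 0.2, 0.3, 0.4])   # a composition on the simplex
y = H @ x                    # its d-dimensional representation
```

The rows of the sub-matrix are orthonormal and sum to zero, which is what makes the map a rigid linear transformation of the simplex.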
The comparison of all the testing procedures was in terms of the probability of Type-I error and of the power. Bootstrap calibration was necessary for all tests in the small-sample case, even though it is quite computationally intensive for the EL. The number of bootstrap replications was equal to 299 and 1000 simulations were performed. In each case a 4-dimensional simplex was used. When the estimated probability of Type-I error falls within (0.0365, 0.0635) (the theoretical 95% confidence interval based on 1000 simulations), we have evidence that the test attains the correct probability of Type-I error.
For the implementation of EL, the R package emplik [30] was used. The procedure was to calculate the common mean which minimizes the sum of the two EL test statistics. [20] We used the corrected $\chi^2$ distribution suggested by James [23] and the F distribution, with degrees of freedom given in Equation (4), suggested by Krishnamoorthy and Yu. [24] The term inside the parentheses indicates the calibration: ($\chi^2$), (F) or (bootstrap), corresponding to the $\chi^2$ distribution, the F distribution and the bootstrap, respectively. R code for all testing procedures is available in the R package Compositional. [31]

Scenario 1. Simulated data from Dirichlet populations
Data were generated, under $H_0$, from two Dirichlet populations such that the two arithmetic means in $S^d$ are the same. The first population was Dir(0.148, 0.222, 0.296, 0.333) and the second came from a mixture of two Dirichlet distributions with the same mean. For power calculations, we perturbed the mean vector of the second sample by an amount $\delta$, for values of $\delta$ ranging from $-0.21$ up to $0.21$ in steps of 0.03 ($\delta = 0$ corresponds to $H_0$).
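A sketch of how such data could be generated follows. The mixture parameters below are illustrative placeholders of our own, chosen only so that the two population means agree under $H_0$; the paper's actual mixture parameters are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
a1 = np.array([0.148, 0.222, 0.296, 0.333])   # Dirichlet parameters of population 1
m = a1 / a1.sum()                              # its mean on the simplex
x1 = rng.dirichlet(a1, size=5000)

# Hypothetical equal-weight mixture for population 2, built so its mean is also m.
shift = np.array([0.05, -0.05, 0.0, 0.0])
a2, a3 = 10.0 * (m + shift), 10.0 * (m - shift)
labels = rng.integers(0, 2, size=20000)        # mixture component labels
x2 = np.where(labels[:, None] == 0,
              rng.dirichlet(a2, 20000),
              rng.dirichlet(a3, 20000))
```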

Scenario 2. Simulated data from different distributions
Data for the two samples were generated from two different distributions and mapped onto the simplex via the inverse of the additive log-ratio transformation. [6] For power calculations, we changed $\delta$ in the mean vector of the first component of the Dirichlet mixtures, for $\delta$ ranging from $-0.21$ up to $0.21$ in steps of 0.03, with $\delta = 0$ corresponding to $H_0$. Table 1 shows the 95% confidence intervals for different levels of probability, calculated using the Monte Carlo simulation error based on 1000 simulations. They will help us compare the powers of the different testing procedures, as a guide to how large the simulation error is at different levels of power.
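A minimal sketch of mapping arbitrary $\mathbb{R}^d$ samples onto the simplex via the inverse additive log-ratio transform; the multivariate normal generating distribution shown here is illustrative, not necessarily one of the paper's scenarios.

```python
import numpy as np

def alr_inv(y):
    """Inverse additive log-ratio transform: R^d -> the open simplex S^d."""
    y = np.atleast_2d(y)
    e = np.exp(y)
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    return np.hstack([e / denom, 1.0 / denom])   # last part is the reference component

rng = np.random.default_rng(2)
z = rng.multivariate_normal(np.zeros(3), np.eye(3), size=1000)  # samples in R^3
x = alr_inv(z)                                                   # 4-part compositions
```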

Type-I error with equal sample sizes.
The sample sizes were set equal for both samples, at 15, 30, 50 and 100. When the sample sizes were equal to 15, the nominal Type-I error was not attained by any procedure (Table 2). Bootstrap calibration, however, corrected the size of all testing procedures for all sample sizes. For sample sizes 30 and 50, EL and EEL calibrated with the F distribution performed better than the other testing procedures. Moreover, the James test, when calibrated using an F rather than a corrected $\chi^2$ distribution, shows no significant improvement. Finally, when the sample sizes are large, all procedures attain the nominal Type-I error of the test.

Type-I error with unequal sample sizes.
The second sample, which came from the mixture of two Dirichlet distributions, had observations which were less spread out (its covariance determinant was smaller), and for this reason it was given the larger sample size. We can see in Table 3 that Hotelling's test clearly fails, as expected. However, when bootstrap calibrated, it works reasonably well for large sample sizes. The F calibration of EL and EEL works better than the $\chi^2$ calibration; the bootstrap, however, is again necessary for the medium sample sizes. The conclusion is again that bootstrap calibration of the p-values does a very good job. Table 4 shows the power of these testing procedures under some alternatives for different sample sizes. We included four different sample sizes in the simulation studies when the null hypothesis is true but, when examining the power of the testing procedures, we considered only the bootstrap-calibrated procedures, because these were the ones that were size correct.

Estimated power of the tests with equal sample sizes.
In all cases there is little difference between the two quadratic tests when calibrated with the bootstrap. As expected, the power of every testing procedure increases with the sample size. When the sample size is equal to 30 (Table 4), the power of the quadratic tests is higher than the power of the EL-type tests. When $\delta$ in Equation (15) is negative, the power is higher than when $\delta$ is positive, but as we move towards the null hypothesis the differences between the two types of testing procedures decrease. When the sample sizes are equal to 100, there are almost no differences between the quadratic tests and the nonparametric likelihood methods (Table 4). As seen from Table 3, when the sample sizes are small, no procedure managed to attain the correct size. The F calibration of the EL and EEL tests and bootstrap calibration of all tests decreased the Type-I error, yet not enough. When the sample sizes increase, both nonparametric likelihood methods estimate the probability of Type-I error correctly only when the F or bootstrap calibration is applied; for the quadratic tests, bootstrap calibration proved very useful too. Finally, when the sample sizes are large, the Hotelling test is not size correct, as expected (it assumes equality of the covariance matrices), but all the other testing procedures estimate the probability of Type-I error within the acceptable limits regardless of bootstrap calibration. Table 5 presents the estimated powers of the bootstrap-calibrated testing procedures. The nonparametric likelihood methods with the bootstrap were computationally heavy, and for this reason we estimated the powers of these two methods only for two sample sizes, 30 and 100.
Similarly to the previous example, where both samples have the same size and come from Dirichlet populations, the powers of the James and Hotelling tests are very similar when the bootstrap is employed. When the sample sizes are equal to 30, the quadratic tests exhibit higher powers than the nonparametric likelihood methods in the case of a negative change in the first component (see Table 5); this is not true, though, when the change is positive. In addition, when the negative change gets closer to zero, the power of the nonparametric likelihood methods is better than the power of the quadratic tests, and when the positive change gets closer to zero the opposite is true.
When the sample sizes are large (equal to 100) the quadratic tests and the nonparametric likelihood methods seem to perform equally well as seen in Table 5. But, as the change approaches zero from the negative side, we can see that the tests based on the nonparametric likelihoods reject the null hypothesis more times than the quadratic tests and the converse is true when the change approaches zero from the positive side.

Estimated power of the tests with unequal sample sizes.
The sample sizes of the two groups were the same as before; the second sample, which came from a single Dirichlet population and was less spread out, had the larger size. The direction of the alternatives was the same as in the case of equal sample sizes. It is worth mentioning that when we had relatively small samples (n 1 = 15 and n 2 = 30) none of the tests was size correct, even after bootstrap calibration was applied (see Table 3); thus, we estimated the powers for the other two combinations of the sample sizes. The Hotelling test seemed to perform slightly better than the James test in the small samples, but overall there was almost no difference between them. The difference was more obvious in the unequal sample sizes case, where the James test showed evidence of being slightly more powerful than Hotelling, especially when the change in the fourth component is positive (see Table 6). When both sample sizes are large, though, the powers of the testing procedures are almost the same.
It is worth mentioning that when we had relatively small samples (n 1 = 15 and n 2 = 30) the James test, EL and EEL were size correct after bootstrap calibration (see Table 3). The alternatives in this case were chosen as in the case of equal sample sizes: we chose the second mean of the mixture of the two Dirichlet populations and changed it. The second sample (from the mixture of Dirichlet distributions) always had the smaller size, since its covariance determinant was larger.
When the sample sizes are relatively small, the power of the quadratic tests is better than the power of the empirical likelihood methods regardless of the sign of the change. In fact, the Hotelling test performs better than the James test, particularly when the change in the first component is positive and small. For the larger sample sizes, both the quadratic tests and the nonparametric likelihood methods perform very well. Even then, the Hotelling test still performs better than the James test when the change in the alternative hypothesis decreases.

Discussion and conclusions
In most cases, the nominal level (5%) of the Type-I error was not attained by the procedures unless the sample sizes were large. In the small-sample or unequal-sample cases, bootstrap calibration played an important role in correcting the test size. The cost of this re-sampling procedure was computation time, but only for EL was this substantial. The quadratic tests require no numerical optimisation, only matrix calculations, and with 299 bootstrap re-samples the calculation of the p-value requires less than a second. EEL also requires less than a second when calibrated using 299 bootstrap samples, whereas EL requires more than a minute. We proposed the use of the F distribution, with the degrees of freedom suggested by Krishnamoorthy and Yu, [24] for calibration of the EL and EEL test statistics. Our results showed that it works better than the $\chi^2$ calibration. Another alternative is to use a corrected $\chi^2$ distribution. [23] However, these alternative calibrations do not work when the sample sizes are small or very different. Bootstrap calibration, on the other hand, performed very well in almost all cases.
As for the power comparisons, the differences between the quadratic and the nonparametric likelihood tests were smaller than in the null case. However, since the bootstrap was used to calibrate the test statistics when the null hypothesis was true, the same calibration had to be employed when estimating the power under the various alternative hypotheses. The computational cost was high for EL because it required two optimisations, one to find the common mean and one to obtain the ratio test statistic value; EEL, on the other hand, requires only one root search.
The conclusion that can be drawn is that either the Hotelling or the James test statistic with bootstrap calibration is to be preferred when it comes to algorithmic simplicity and computational cost. Nonparametric likelihood methods perform equally well when bootstrap calibration is present, but they require significantly more time than the James or Hotelling test statistics; this is certainly a disadvantage of using EL. Furthermore, we can see that the modified (in terms of the calibration) James test performs the same as the classical James test (using a corrected $\chi^2$ distribution).
Based on our simulations, we saw that when bootstrap calibration is applied, all methods tend to work almost equally well. If the EEL testing procedure could be made as quick as the James (or the Hotelling) test, then there would be little to choose between the various testing procedures in terms of statistical performance; perhaps the only reason to prefer the James test over EL would be the convex hull limitation of the latter.
The picture we got from the unequal sample sizes is similar to the one in the equal sample size case. Again, the EL testing procedure is computationally more expensive. Bootstrap calibration of the James, Hotelling and EEL testing procedures requires less than a second when 299 bootstrap re-samples are used; the EL method, on the other hand, requires a significant amount of time when bootstrapped. Even if we increase the number of bootstrap re-samples, the James test will still require perhaps a couple of seconds, whereas EL will require several minutes.
However, the availability of parallel computing on a desktop computer, and a faster implementation of the nonparametric likelihood tests, could reduce the time required to bootstrap calibrate the EL. Even then, taking into account that bootstrap calibration here used only 299 re-samples, it is clear that EL is much more computationally expensive, and on the evidence of the simulation results shown in this paper there may be no particular gain in statistical performance from incurring that cost.