An Empirical Comparison of the ANOVA F-Test, Normal Scores Test and Kruskal-Wallis Test Under Violation of Assumptions

The present research compares the ANOVA F-test, the Kruskal-Wallis test, and the normal scores test in terms of empirical alpha and empirical power with samples from the normal distribution and two exponential distributions. Empirical evidence supports the use of the ANOVA F-test even under violation of assumptions when testing hypotheses about means. If the researcher is willing to test hypotheses about medians, the Kruskal-Wallis test was found to be competitive with the F-test. However, in the cases investigated, the normal scores test was not consistently better than the F-test or the Kruskal-Wallis test and could not be recommended on the basis of this research.

developed for the two-sample and the k-sample cases (Hoeffding, 1951; Terry, 1952; Van der Waerden, 1953; Hajek and Sidak, 1967; Puri, 1964). McSweeney and Penfield (1969) have presented a review of the literature, as well as the rationale for and derivation of the k-sample case. The Terry-Hoeffding form of the k-sample normal scores test requires the use of special tables (Harter, 1961) to transform ranked data into expected normal order statistics. The Van der Waerden form replaces ranks with inverse normal statistics, which can be computed from any standard normal table. Normal scores tests were derived to test the hypothesis of equal populations but are sensitive to location shifts; underlying continuous distributions are assumed, and observations are assumed to be drawn randomly and independently from their respective populations. The test statistic is calculated on the expected normal scores rather than on the ranks or the original data; in its notation, $n_i$ = the number of observations in the ith sample, $N = \sum_i n_i$ = the number of observations in all samples combined, and $W_{ij}$ = the jth expected normal order statistic in the ith sample. Under the null hypothesis, the test statistic is asymptotically distributed as chi-square with k - 1 degrees of freedom, where k is the number of treatment levels or samples. Large values of the test statistic lead to the rejection of the null hypothesis.
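As an illustration of the Van der Waerden form described above, the following is a minimal sketch, not the authors' program. It assumes continuous data with no ties, uses the inverse normal score of rank/(N + 1), and relies only on the Python standard library; the function name `van_der_waerden_statistic` is hypothetical.

```python
from statistics import NormalDist


def van_der_waerden_statistic(samples):
    """Van der Waerden normal scores statistic for k independent samples.

    Assumes continuous data with no ties. Under the null hypothesis the
    result is referred to chi-square with k - 1 degrees of freedom.
    """
    # Pool the observations, remembering which sample each one came from.
    pooled = sorted((x, i) for i, sample in enumerate(samples) for x in sample)
    n_total = len(pooled)
    inv_cdf = NormalDist().inv_cdf

    score_sums = [0.0] * len(samples)
    total_sq = 0.0
    for rank, (_, i) in enumerate(pooled, start=1):
        score = inv_cdf(rank / (n_total + 1))  # inverse normal score of the rank
        score_sums[i] += score
        total_sq += score * score

    # The scores are symmetric about zero, so their variance is sum(a^2)/(N - 1).
    s2 = total_sq / (n_total - 1)
    between = sum(s * s / len(sample) for s, sample in zip(score_sums, samples))
    return between / s2
```

With four treatment levels, the statistic would be compared with the chi-square critical value for 3 degrees of freedom (7.815 at the .05 level).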
The Kruskal-Wallis test is based on ranks and is suitable for the k-sample case. It is a direct generalization of the two-sample Mann-Whitney U test (Kruskal, 1952; Kruskal and Wallis, 1952). It tests hypotheses of equal populations and is sensitive to location shifts. Under the null hypothesis, the Kruskal-Wallis test is also asymptotically distributed as chi-square with k - 1 degrees of freedom. It is assumed that sampling is random, that samples are drawn from populations with continuous distributions, and that populations are infinite or sampling is with individual replacement. Large values of the statistic lead to the rejection of the null hypothesis.
The most common index for comparing nonparametric tests to parametric tests is asymptotic relative efficiency, or ARE. This index compares the power of one test to the power of the other by using mathematical computations based on extremely large sample sizes and extremely small central tendency or location differences. In fact, sample size is permitted to approach infinity while at the same time location differences approach zero. The ARE of the normal scores test as compared to the F-test has a value of unity for the normal distribution and a lower bound of unity for non-normal distributions. Therefore, asymptotically the normal scores test can be said to be at least as powerful as F, and when ANOVA violations are present it can be more efficient than F. The Kruskal-Wallis test as compared to F has an ARE of .95 for the normal distribution and a lower bound ARE of .864. Thus asymptotically, the Kruskal-Wallis H-test is 95% as powerful as the F-test for the normal distribution and can never asymptotically be less than 86% as powerful. Therefore, with no further information, the normal scores test would appear to be quite competitive with F.
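For readers who want to see the rank-based computation concretely, the following is a minimal sketch of the standard Kruskal-Wallis H statistic, H = 12/(N(N + 1)) Σ R_i²/n_i - 3(N + 1), where R_i is the rank sum of the ith sample. It assumes no ties (consistent with the continuity assumption) and is an illustration, not the program used in the study; the function name is hypothetical.

```python
def kruskal_wallis_h(samples):
    """Kruskal-Wallis H statistic for k independent samples, assuming no ties."""
    # Rank the pooled observations from 1 to N, tracking sample membership.
    pooled = sorted((x, i) for i, sample in enumerate(samples) for x in sample)
    n_total = len(pooled)

    rank_sums = [0.0] * len(samples)
    for rank, (_, i) in enumerate(pooled, start=1):
        rank_sums[i] += rank

    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)
```

As with the normal scores statistic, H is referred to chi-square with k - 1 degrees of freedom, so with four treatment levels a value above 7.815 would lead to rejection at the .05 level.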
In addition, McSweeney and Penfield (1969), comparing the Kruskal-Wallis test and the normal scores test with samples from both normal and uniform distributions, have shown that "the small sample power of the normal scores test is clearly superior to that of the Kruskal-Wallis test in those marginal cases in which a test at a moderate significance level is used to detect small differences in location among non-normal distributions." They state "that the comparison is dependent on the significance level of the test, the location parameter, and sample sizes as well as on the distributions sampled." Both the enticing ARE and the favorable comparison of the normal scores test to the Kruskal-Wallis test as cited by McSweeney and Penfield (1969) have shown the need for further research in this area. Keeping in mind that asymptotic relative efficiencies are computed for unrealistically large sample sizes with minuscule differences in measures of location, it would seem profitable for the researcher to be aware of the small, medium, and large sample size performance of the normal scores test. In addition, there has been no comparison of the k-sample normal scores test to its parametric analogue, F, and neither the normal scores test nor the Kruskal-Wallis test has been compared to the ANOVA F-test for skewed distributions.
Further, current literature (Bradley, 1968; Kendall and Stuart, 1961) refers to the nonparametric sensitivity to detect location differences without stating whether mean or median differences will be equally detected. Therefore, a Monte Carlo comparison of the three statistical tests was completed for realistic location differences and realistic sample sizes from a normal distribution and two exponential distributions. One of the exponential distributions was scaled to have equal means under the null hypothesis to investigate the sensitivity of the three tests in detecting mean differences. The other exponential distribution was scaled to have equal medians under the null hypothesis in order to investigate the sensitivity of the tests to median differences.

Procedure
Random numbers were selected using a pseudo-random number generator. Depending upon the assumption violation, the numbers were selected from either a normal distribution or from one of two exponential distributions. The random deviates were allocated to four treatment levels that comprised a one-way fixed effects analysis of variance situation.
The observations from the normal distribution were derived by a technique developed by Box and Muller (1958), which generates pseudo-random variables distributed N(0, 1). For the null situation, the means of the four treatment levels were zero. The non-null situation was established by defining values of $\alpha_j$, j = 1, 2, 3, 4, such that the power for the ANOVA F-test would be about .86 for the equal variance condition for the normal distribution. The defined $\alpha_j$'s were then used for all three statistical procedures, for all three distributions, and for both equal and unequal variance conditions. Specification of the $\alpha_j$'s for the normal distribution was made through the non-centrality parameter, $\phi$ (Pearson and Hartley, 1951), where $\phi = \sqrt{\sum_j n_j \alpha_j^2 / (J \sigma_e^2)}$. Setting $\sigma_e^2 = 1$ and J = 4 and using a probability of a Type I error equal to .05, the values of $\alpha_j$ were found such that the power was about .86. Since the equal sample size and unequal sample size cases would lead to different values of $\alpha_j$ for each of the three sample sizes, the values of $\alpha_j$ were calculated for both equal and unequal sample sizes. Values of $\alpha_j$ for the total sample sizes of 28, 68, and 200 are presented in Table 1. The appropriate $\alpha_j$'s were added to the samples in each of the four treatment levels for the non-null situation. Variance differences were established for particular cases by utilizing unequal variances in the ratio of 1:2:3:4, with the average variance equal to unity. The variances used were .4, .8, 1.2, and 1.6. When equal variance cases were desired, the variances were all given a value of unity.
The exponential distributions were derived by a method given by Lehman and Bailey (1968): $f(t) = p e^{-pt}$ (4) with p = 1, E(t) = 1/p = 1, and var(t) = $1/p^2$ = 1. Pseudo-random exponential variables were generated by multiplying the negative of the mean, -E(t) = -1, by the natural logarithm of uniform random variates distributed on the unit interval (IBM, 1968).
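The two generating methods named above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original program: the function names, the `kind` labels, and the use of Python's `random.Random` as the uniform source are all hypothetical, and the centering constants (1 for the mean, ln 2 for the median of a unit exponential) follow the scaling described in the surrounding text.

```python
import math
import random


def box_muller(rng):
    """Return two independent N(0, 1) deviates built from two uniform deviates."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def unit_exponential(rng):
    """Inverse-transform exponential deviate with mean 1: -ln(U), U in (0, 1]."""
    return -math.log(1.0 - rng.random())


def treatment_sample(rng, n, alpha, sigma, kind):
    """Draw n observations with treatment effect alpha and standard deviation sigma.

    kind is "normal", "exp_mean0" (exponential centered at mean zero) or
    "exp_median0" (exponential centered at median zero).
    """
    observations = []
    for _ in range(n):
        if kind == "normal":
            z = box_muller(rng)[0]  # the second deviate is discarded for simplicity
        elif kind == "exp_mean0":
            z = unit_exponential(rng) - 1.0
        elif kind == "exp_median0":
            z = unit_exponential(rng) - math.log(2.0)
        else:
            raise ValueError("unknown distribution kind: " + kind)
        observations.append(alpha + sigma * z)
    return observations
```

For example, `treatment_sample(random.Random(0), 7, 0.0, math.sqrt(0.4), "exp_mean0")` would draw one null-condition group with variance .4 from the exponential distribution centered at mean zero.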
The exponential variates were then scaled so that either the medians would be zero or the means would be zero, depending upon which of the two exponential distributions was desired. The resulting skewed populations had either mean or median equal to zero, a variance of $\sigma_j^2$, a skewness measure of $\gamma_1 = 2$, and a kurtosis measure of $\gamma_2 = 6$.
For the exponential distribution scaled to have equal means of zero value under the null hypothesis, the mean of unity was subtracted from every score. Thus the median of .69315 also had the value of unity subtracted from it, yielding a median of -.30685 when variances were equal. When variances were unequal and means were equal and of zero value, the median for group j was $-.30685\sigma_j$; thus the medians were -.19407, -.27445, -.33614, and -.38814.
For the exponential distribution scaled to have equal medians of zero value under the null hypothesis, the means will be nonzero. For equal variances, the value of the means was .30685. For unequal variances and equal medians of zero value, the mean for group j is $.30685\sigma_j$; thus the means were .19407, .27445, .33614, and .38814.
In order to simulate null and non-null conditions in the exponential distributions, the values of $\alpha_j$ were identical to those used in the normal distributions as shown in Table 1. The variance for equal variance cases for the two exponential distributions was, as in the normal distribution, equal to unity, and for unequal variance cases the variances were equal to .4, .8, 1.2, and 1.6.
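The group medians and means quoted above follow from the gap between the mean (1) and the median (ln 2 ≈ .69315) of a unit exponential, multiplied by each group's standard deviation. A short sketch of that arithmetic, with a hypothetical helper name, is:

```python
import math

# Mean minus median of a unit exponential: 1 - ln 2, about .30685.
MEAN_MEDIAN_GAP = 1.0 - math.log(2.0)


def location_offsets(variances):
    """Distance between mean and median for each group, given its variance."""
    return [MEAN_MEDIAN_GAP * math.sqrt(v) for v in variances]


offsets = location_offsets([0.4, 0.8, 1.2, 1.6])
# With means fixed at zero the group medians are the negatives of these
# offsets; with medians fixed at zero the group means are the offsets themselves.
print([round(x, 5) for x in offsets])
```

The computed offsets agree with the values in the text to within rounding in the fifth decimal place.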
Comparisons among the F-test, the normal scores test, and the Kruskal-Wallis test were made on five combinations of sample sizes and variances. These combinations were as follows: (1) equal sample sizes and equal variances, (2) equal sample sizes and unequal variances, (3) unequal sample sizes and equal variances, (4) unequal sample sizes and unequal variances which were positively related, and (5) unequal sample sizes and unequal variances which were negatively related. For each of the five cases, 1000 experiments were performed using observations from the normal distribution, the exponential distribution scaled to have equal means under the null hypothesis, and the exponential distribution scaled to have equal medians under the null hypothesis, where an experiment consisted of the computation of each statistical test. The proportion of rejections in 1000 experiments when there were no location differences was referred to as empirical alpha. When differences in location were specified, the proportion of rejections was referred to as empirical power. Theoretical alpha (level of significance) was set at .05. The three statistical tests were then compared in terms of empirical alpha and empirical power for total sample sizes of 28, 68, and 200. It should be noted that the equal sample size, equal variance case for the normal distribution was included in the present study to establish the validity of the Monte Carlo method, since the behavior of the statistics in this case has been established in prior studies.
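The definition of empirical alpha and empirical power as a proportion of rejections can be sketched as a short Monte Carlo loop for the F-test under normality. This is a hedged illustration only: it assumes n = 7 per treatment level for N = 28, uses Python's `random.gauss` in place of the Box-Muller routine, takes 3.01 as the tabled .05 critical value of F with 3 and 24 degrees of freedom, and uses made-up effects for the non-null run because the Table 1 values are not reproduced here.

```python
import random


def anova_f(samples):
    """One-way fixed effects ANOVA F statistic."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rejection_rate(sizes, effects, sigmas, critical, reps=1000, seed=1):
    """Proportion of experiments in which F exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        samples = [
            [a + s * rng.gauss(0.0, 1.0) for _ in range(n)]
            for n, a, s in zip(sizes, effects, sigmas)
        ]
        if anova_f(samples) > critical:
            rejections += 1
    return rejections / reps


F_CRITICAL = 3.01  # assumed tabled value of F(.05; 3, 24)
empirical_alpha = rejection_rate([7] * 4, [0.0] * 4, [1.0] * 4, F_CRITICAL)
# Hypothetical effects, NOT the Table 1 values.
empirical_power = rejection_rate([7] * 4, [-0.6, -0.2, 0.2, 0.6], [1.0] * 4, F_CRITICAL)
```

The same loop could be reused for the other conditions by changing `sizes` and `sigmas`, and for the other two tests by swapping in their statistics with a chi-square critical value.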

Normal Distribution
For a total sample size of 28, the ANOVA F-test surpasses the performance of both the Kruskal-Wallis test and the normal scores test in terms of approximating theoretical alpha and having greater power in all but one case of assumption violation (negatively related sample sizes and variances). The empirical alphas and power of the nonparametric methods were comparable to each other for N = 28, with the Kruskal-Wallis test being preferable to the normal scores test in the unequal variance situations. Current literature has suggested that the normal scores procedures are far less sensitive to heterogeneity of variance than are the parametric or rank procedures (McSweeney and Penfield, 1969), but this robustness was not substantiated for N = 28, as shown in Table 2. In general, for this sample size neither of the nonparametric methods competes favorably with F, except for negatively related sample sizes and variances.
The ANOVA F-test is generally the most powerful technique for the larger sample sizes (N = 68, N = 200), but at the expense of making a few more Type I errors than the nonparametric methods. For N = 68, the Kruskal-Wallis test provides the best overall approximation to theoretical alpha when variances are unequal. For N = 200, while the normal scores test generally provides the best approximation to alpha, the Kruskal-Wallis test also gives a good approximation to alpha with comparable or better power. All three of the tests are competitive for the larger sample sizes.
Note.-Entries are the proportion of rejections in 1,000 experiments for the ANOVA F-test (F), the Kruskal-Wallis Test (KW) and the Normal Scores Test (NS) in terms of probability of a Type I error (α) and power (1 - β). Nominal alpha was set at .05, N = total sample size, and n = sample size per treatment level.

Exponential Distribution: Scaled to Have Equal Means under the Null Hypothesis
For this distribution, the ANOVA F-test consistently outperforms the nonparametric methods. As sample size increases (N = 68, N = 200), the nonparametric tests begin to make far too many Type I errors when variances are unequal. For example, in Table 3, the negatively related sample sizes and variances case for N = 200 shows α = .440 for the normal scores test and α = .381 for the Kruskal-Wallis test, while for the F-test α = .092. This extreme increase in empirical alphas is thought to be caused by the nonparametric sensitivity to the unequal medians. Scaling to equal means of zero value under the null hypothesis for a skewed distribution leaves nonzero medians. If the variances are equal, then the medians (though nonzero) are equal; however, when variances are unequal, the medians are also unequal. Thus for cases where variances were equal, the nonparametric tests approximated theoretical alpha fairly well with good power. However, the F-test still provided the best approximation to theoretical alpha in most cases for both equal and unequal variance situations.

Exponential Distribution: Scaled to Have Equal Medians under the Null Hypothesis
The empirical alphas of the nonparametric tests show a marked decrease as compared to the exponential distribution scaled to have equal means. For the case cited in section 3.2, N = 200, negatively related sample sizes and variances, the empirical alpha for the Kruskal-Wallis test has dropped from .381 to .095 and for the normal scores test from .440 to .121. This decrease in empirical alphas with equality of medians under the null hypothesis substantiates the nonparametric sensitivity to median differences.
When there exist no median differences, the nonparametric procedures (especially the Kruskal-Wallis test; see Table 4) do well in approximating theoretical alpha and provide good power in detecting median differences when they are specified. The Kruskal-Wallis test provides the best approximation to theoretical alpha with high power for all sample sizes; however, the F-test is a good competitor, with the normal scores test falling in close proximity.

Conclusion and Summary
When normality and/or homogeneity of variance is doubtful, the ANOVA F-test is the recommended procedure for testing hypotheses about means. The researcher does have the option of testing hypotheses about medians with the assurance that if a significant F-value is obtained, both mean and median differences will be present. When using the Kruskal-Wallis test or the normal scores test in investigating mean differences, with non-normality and heterogeneity of variances, the researcher might very well reject the null hypothesis due to median differences when means are in fact equal. Thus, the researcher who finds a significant value for the Kruskal-Wallis test or the normal scores test has little assurance that mean differences actually are present. The F-test can further be recommended on the grounds that the F-distribution is more extensively tabled, and that the F-test is the most easily used of the three tests for large sample sizes. There are computer programs readily available for ANOVA, and the tedious task of transforming data is not necessary, as it is for the nonparametric tests.
With non-normality and inequality of variances, the Kruskal-Wallis test might be considered to be the recommended procedure. The researcher, however, must be aware that he is testing for median differences, and must state his null hypothesis in these terms. The Kruskal-Wallis test for large samples does require the tedious chore of ranking, but computer programs are becoming more and more accessible.
The normal scores test, despite its enticing asymptotic relative efficiency, cannot be recommended on the basis of this study. In none of the cases investigated does the normal scores test consistently outperform the Kruskal-Wallis test or the ANOVA F-test. Only in isolated cases could the normal scores test be recommended, and even then, only with reserve, because of the difficulty in transforming data from ranks to expected normal scores.