N = 1 Designs: The Failure of ANOVA-Based Tests

Several methods have been proposed for the analysis of data from single-subject research settings. This research focuses on the modifications of ANOVA-based tests proposed by Shine and Bower, a procedure that precedes the ANOVA F test by preliminary testing of within-phase lag-one serial correlation, and the one-way ANOVA as presented by Gentile, Roden, and Klein. Monte Carlo simulation is used to investigate these tests with respect to robustness and power. Each test was analyzed under various patterns of serial correlation, various patterns of phase and trial means, normal and exponential distributions, and equal and unequal phase variances. The findings indicate that the probability of a Type I error for these ANOVA-based tests is seriously inflated by nonzero serial correlation. These tests, therefore, cannot be recommended for use with data that have nonzero serial correlation.

Single-subject research settings exist for a wide range of research topics in a variety of disciplines: community research (Friesema, Caporaso, Lineberry, & Goldstein, 1978), political science (Deutsch & Alt, 1977), law (Glass, Tiao, & Maguire, 1971), medicine (Sterman, 1973), behavioral research (Juliano & Gentile, 1974; Leitenberg, Agras, & Thomson, 1968), school and/or educational psychology (Gottman & McFall, 1972; Kratochwill, 1977), clinical psychology and psychiatry (Barlow & Hersen, 1973; Chassan, 1967), and experimental psychology (Shine, Wiant, & DaPalito, 1972). Treatment of data from such designs has been controversial in that some advocate visual inspection while others call for inferential methods (Kratochwill, 1978). In addition, controversy exists over the choice of inferential methods, including advocacy of traditional statistical tests that assume independence of observations. Dangers in using traditional methods include excessive rejection rates for serially correlated data (Box, 1954; Hibbs, 1974), while dangers in visual inspection include inability to separate chance from nonchance variability and lack of clarity of decision rules (for discussion of visual inspection, see Kazdin, 1982, and Parsonson & Baer, 1978). While single-subject designs need special consideration with respect to the question of validity (see Levin, Marascuilo, & Hubert, 1978), researchers who are interested in testing hypotheses in such designs must choose from several available methods. testing procedure, see Hartmann, 1974.)

Shine and Bower (1971) presented what they called a "one-way" ANOVA, which was actually a modification of the simple repeated measures design (SRMD) or a two-way ANOVA with one observation per cell.
Instead of using a group of I random subjects measured once for each of J fixed levels of an experimental factor (as is the case for the SRMD), the Shine-Bower model assumes a series of I fixed trials (A) for each of J fixed levels of an experimental factor (B) for only one subject. Tests for B and AB are F ratios formed with a modified error term, MSE'. To compute MSE', successive differences of trial means are squared and summed over the odd values of i, from 1 to I - 1 (for even I). MSE' is supposed to have an expected value of $\sigma^2$ if the effect of trial i is equal in the population to the effect of trial i + 1 for odd i (this is the slow change assumption given by Shine and Bower). Because MSE' and the mean square for the A, or trial, effect (MSA) are not independent (Shine, 1975), MSE' being based on successive trial means, MSE' cannot be used to test for the main effect due to trials. The test proposed by Shine and Bower (1971) for the trial main effect is the mean square successive difference (MSSD) test due to Bennet and Franklin (1961), with test statistic $\eta$. Critical values of $\eta$ at the 5 percent and 1 percent levels for one-tailed tests are given in Bennet and Franklin (1961), and, for I > 25, $\eta$ will be approximately normally distributed. A two-tailed MSSD test is supposed to be a test for the main effect of trials, and an upper-tailed MSSD test is supposed to be a test of the slow change assumption. Modifications to these tests have been proposed by Shine (1974, 1976, 1977), and extension has been made to higher order ANOVA models (Shine, 1973). A note of caution, however, is that Bennet and Franklin (1961) originally present the MSSD test using successive differences of observations, not means, and they present it as a test to detect nonrandomness. Thus, the MSSD should be sensitive to nonzero serial correlation.
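The verbal description of MSE' can be made concrete with a short sketch. The scaling used here, MSE' = (J / I) times the sum of squared differences between the trial means of pairs (1, 2), (3, 4), and so on, is an assumption: it is the constant that gives MSE' an expected value of $\sigma^2$ when observations are independent and the slow change assumption holds, but the exact expression in Shine and Bower (1971) is not reproduced in this text. The function names are hypothetical.

```python
import random

def shine_bower_mse_prime(y):
    """Assumed form of MSE' for a table y[i][j] (trial i, phase j).

    Squared differences of successive trial means are summed over the
    pairs (1,2), (3,4), ... and scaled by J / I. The scaling constant is
    an assumption chosen so that E(MSE') = sigma^2 under independence.
    """
    n_trials, n_phases = len(y), len(y[0])
    if n_trials % 2:
        raise ValueError("this sketch assumes an even number of trials")
    trial_means = [sum(row) / n_phases for row in y]
    ss = sum((trial_means[i] - trial_means[i + 1]) ** 2
             for i in range(0, n_trials, 2))
    return n_phases * ss / n_trials

def average_mse_prime(sigma2, n_trials=10, n_phases=4, reps=4000, seed=1):
    """Average MSE' over independent normal data with variance sigma2."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        y = [[rng.gauss(0.0, sd) for _ in range(n_phases)]
             for _ in range(n_trials)]
        total += shine_bower_mse_prime(y)
    return total / reps
```

With independent data and equal trial effects, `average_mse_prime(15.0)` should land close to 15, which is the property the slow change assumption is meant to secure. Nothing in this construction protects MSE' when the observations are serially correlated.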
Several authors have criticized the independence assumption common to the models presented by Gentile et al. (1972) and Shine and Bower (1971). Kratochwill et al. (1974), Hartmann (1974), Thoreson and Elashoff (1974), and Keselman and Leventhal (1974) all focused attention on the independence assumption and problems associated with use of ANOVA-based models in the presence of nonindependence. Some presented modifications or alternatives, but none of these authors showed any empirical or comparative research on these ANOVA-based models. Hartmann (1974) states that (1) researchers should use only the one-way ANOVA model and F test if the assumption of independence is met, "until such time as the nature and extent of the violations of the F test are more fully examined," and (2) if a researcher still wants to use ANOVA models when independence assumptions are not met, "then he should use either the relatively unexplored but more sophisticated ANOVA model suggested by Shine and Bower (1971)" or a variation that uses preliminary testing for lack of independence. Hartmann's (1974) modification includes using only asymptotic responses with 12 or more stable data points per phase, preliminary tests for nonzero serial correlation for at least lag one within each phase and for nonzero cross correlations of at least lag zero, and one of two different ANOVA designs on the stable data. The simple preliminary testing procedure presented earlier uses only tests on within-phase, lag-one serial correlation and uses all the data in the ANOVA F test. Thus, one proceeds with the ANOVA F test only if the preliminary test(s) are nonsignificant (indicating compliance with the independence assumption). Of course, one difficulty with such modifications is the question of what to do if the preliminary test(s) are significant. Does one proceed with caution or switch to another procedure? Another question to be raised is the influence on such tests of nonsignificant serial correlation. If the correlation is so low that it is not detected by the preliminary tests, is it not also problematic for the main effect tests of interest?
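The decision rule of the simple preliminary testing procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the lag-one test uses the large-sample standard error 1/sqrt(n) associated with Bartlett (1946), the critical value 1.96 is an assumed two-tailed .05 normal cutoff, and the function names are hypothetical.

```python
import math

def lag1_autocorr(x):
    """Lag-one serial correlation of one phase (assumed estimator)."""
    n = len(x)
    m = sum(x) / n
    den = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    return num / den

def anova_f(phases):
    """One-way ANOVA F = MS_B / MS_W, with phases as groups."""
    k = len(phases)
    n_total = sum(len(p) for p in phases)
    grand = sum(sum(p) for p in phases) / n_total
    means = [sum(p) / len(p) for p in phases]
    ss_b = sum(len(p) * (m - grand) ** 2 for p, m in zip(phases, means))
    ss_w = sum((v - m) ** 2 for p, m in zip(phases, means) for v in p)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def preliminary_then_f(phases, z_crit=1.96):
    """Return F only if every within-phase lag-one test is nonsignificant.

    Returns None when any preliminary test is significant, that is, when
    no decision is made. The 1/sqrt(n) standard error is an assumption.
    """
    for p in phases:
        if abs(lag1_autocorr(p)) * math.sqrt(len(p)) > z_crit:
            return None
    return anova_f(phases)
```

A phase with a steady trend, such as the scores 1 through 12, has a lag-one correlation of .75 and stops the procedure. A phase whose lag-one correlation is small passes through to the F test, whether or not that small correlation is harmless.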
The preliminary testing procedure is offered as acceptable statistical analysis by Kazdin (1982) and is intimated as appropriate by Hersen and Barlow (1976). Hartmann's (1974) suggestion of the more complicated preliminary testing method was accompanied by the indication that the Shine-Bower model could be used as an alternative. Gottman and Glass (1978) indicate that the preliminary testing suggestions offered by Hartmann (1974) would be appropriate for certain time series, those that have lag one as the only large serial correlation. Shine and Bower (1971) indicate that "any such correlation can be carried, in a manner similar to that of the standard repeated measures design, by certain effects in the proposed design" (p. 107). Kazdin (1982, pp. 319-321) labels the Shine-Bower model as an alternative or option that uses F to deal with the problem of serial dependency and that is more complex than the preliminary testing procedure. Whether or not Kazdin (1982) suggests use of the Shine-Bower model is questionable. Other authors either uncritically list the Shine-Bower procedure along with other statistical methods for N = 1 data (Edgington, 1980; Kratochwill, 1978) or clearly deny the appropriateness of any ANOVA-based method for N = 1 data (Levin, Marascuilo, & Hubert, 1978), including the Shine-Bower model (also see Gottman & Glass, 1978). Negative comments on these procedures are usually based on the ANOVA assumption of independent errors and the likelihood of violating this assumption with most N = 1 research. None of these sources offers results from empirical or analytical research on the preliminary testing procedure or the Shine-Bower model, and the sources are not in accord on the worthiness (or worthlessness) of these models. Given the ambivalence of the literature on these methods, the purpose of the present research is to answer questions about the robustness of the Shine-Bower tests, the ANOVA F test, and the simple preliminary testing procedure. The ANOVA F test is included as a standard to compare to the known results (Box, 1954). These tests will be examined for various patterns/levels of serial correlation, various patterns of phase and trial means, normal and exponential distributions, and equal and unequal phase variances.

Procedure
Monte Carlo simulation was used to study the Shine-Bower $F_B$, $F_{AB}$, MSSD (two tailed), MSSD (one tailed), the ANOVA F, and a simple preliminary testing procedure. The simple preliminary testing procedure consisted of separate preliminary tests using the procedure due to Bartlett (1946) (see Kendall & Stuart, 1966, p. 432), on each of the within-phase lag-one serial correlations and proceeding to $F_B = MS_B / MS_W$ only if all preliminary tests were nonsignificant. Only within-phase correlation is examined due to the influence of intervention effects on the serial correlations. If any of the preliminary tests were significant, $F_B$ was not computed, and essentially no decision was made. For all statistics, two different design sizes were used. The smaller design had J = 4 levels of phases with I = 10 trials per phase, and the larger design had J = 4 levels of phases with I = 30 trials per phase. Generally, the sampling distributions of the statistics were simulated using computer generated data from a pseudo-random number generator. Random unit-interval uniform variates (Chen, 1971) were generated and then transformed into random variates with mean zero and variance one (z) such that z was distributed as a unit normal (Box & Muller, 1958) or the exponential (Lehmann & Bailey, 1968). These z were then given the desired variance-covariance matrix (C) transformed from a simplex-patterned correlation matrix (Guttman, 1955). Because data from a single subject are characterized by decreasing correlation the further apart the positions of the scores, it seemed that a matrix fitting Guttman's simplex would be appropriate as a model for a correlation matrix for N = 1 data. Using a gram-factor decomposition of C, then

$$C = QDQ'$$

where Q is a matrix of eigenvectors of C, and D is a diagonal matrix of the eigenvalues of C. If we let $F = QD^{1/2}$ and Z be the vector of variates described above, then $G = ZF'$ has variance-covariance matrix C, as given by

$$E(G'G) = E(FZ'ZF') = F\,E(Z'Z)\,F' = FF' = QD^{1/2}D^{1/2}Q' = QDQ' = C.$$

Here E is the expected-value operator. Thus, G has the desired property of being distributed as multivariate normal or multivariate exponential with variance-covariance matrix C. The matrix G has dimensions of number-of-subjects by number-of-variables, where number-of-subjects is set equal to one for single-subject data and number-of-variables is equal to the total number of trials, 40 or 120. By specifying C, both the desired simplex correlation pattern and the desired variances were specified. For the four phases, the equal variances were arbitrarily chosen to be $\sigma^2 = 15$. The unequal variances were 3, 14, 16, and 27 for $\sigma_j^2$, j = 1 to 4, respectively. Variances for each trial were constant within a phase.
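The generation step can be sketched in a few lines. Two substitutions are made here and should be read as assumptions. First, the simplex pattern is represented by correlations of the form rho^|i - j|, which is one special case of a simplex; the actual patterns used in the study are those of Table I and are not reproduced here. Second, a Cholesky factor is used in place of the eigenvector factor F = QD^{1/2}; any F with FF' = C gives G = ZF' the covariance matrix C. All function names are hypothetical.

```python
import math
import random

def simplex_covariance(variances, rho):
    """Covariance matrix with correlations rho**|i - j| (assumed pattern)."""
    n = len(variances)
    return [[math.sqrt(variances[i] * variances[j]) * rho ** abs(i - j)
             for j in range(n)] for i in range(n)]

def cholesky(c):
    """Lower-triangular F with F F' = C (swapped in for Q D^(1/2))."""
    n = len(c)
    f = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(f[i][k] * f[j][k] for k in range(j))
            if i == j:
                f[i][j] = math.sqrt(c[i][i] - s)
            else:
                f[i][j] = (c[i][j] - s) / f[j][j]
    return f

def generate_series(f, rng):
    """One row vector G = Z F', where Z holds independent unit normals."""
    n = len(f)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(f[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
```

For the smaller design, `variances` would hold 40 entries (for example, ten copies each of 3, 14, 16, and 27 for the unequal-variance case), and each call to `generate_series` yields the scores for one pseudoexperiment.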
The patterns of serial correlation were chosen to give a zero correlation pattern and three nonzero patterns which had the simplex form. Examples of these are given in Table I. The three nonzero patterns were selected to represent not only increasing lag-one serial correlations, but also differing number of large higher order lag serial correlations. The low pattern had no serial correlations larger than .3044, the .05 two-tailed critical value of r, df = 40. The other patterns had various numbers of large serial correlations (see Table I).
Values of the means for particular combinations of phases and trials were manipulated by adding values of $\mu_{ij}$ to G. For the null hypothesis case of equal phase means, all trials in all phases had constant $\mu_{ij} = 0$. Two non-null cases were examined: an ABAB pattern and a linear pattern. For the pattern of phase means given by ABAB, all trials in a given phase had the same mean. The phase means were 1.6, -1.6, 1.6, and -1.6, respectively. The linear pattern of phase means was -2.25, -.75, .75, and 2.25, with all trials in a given phase having the same mean. For the learning curve case, the trial means within a phase represented a gradual increase typical of a learning curve (see Table II). All phases had the same pattern of trial means, giving another example of the null hypothesis. The last pattern of means examined was computed from the learning curve means such that the assumption of slow change was met ($\mu_{ij} = \mu_{i+1,j}$ for odd i). These slow-change means also represented the null hypothesis (see Table II).
For each combination of design, serial correlation pattern, variances, and distribution, the statistics were simulated by running 1000 replications of a pseudoexperiment. For each replication, a vector G of IJ scores was generated as if it had come from a single subject, the statistics were computed and compared to critical values, and rejections were counted. The proportion of rejections in 1000 replications is an estimate of the probability of either a Type I error or a correct rejection (power). All mean patterns were investigated for the same set of generated data to reduce variability in the results for different mean patterns. That is, for G the data were formed by G + $\mu_{ij}$, and all mean patterns were based on the same raw data G. Each new combination of the other variables resulted in a new G.
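The replication loop just described can be sketched for the ANOVA F test alone. This is a minimal illustration, not a reproduction of the study: the serially correlated data come from a first-order autoregressive recursion (an assumed stand-in for the simplex patterns of Table I), the constant `F_CRIT_3_36` is an approximate tabled .05 critical value for F with 3 and 36 degrees of freedom, and all names are hypothetical.

```python
import random

# Approximate .05 critical value of F(3, 36) from standard tables (assumption).
F_CRIT_3_36 = 2.866

def ar1_series(n, rho, rng):
    """Unit-variance series with lag-k correlation rho**k (assumed model)."""
    x = [rng.gauss(0.0, 1.0)]
    scale = (1.0 - rho * rho) ** 0.5
    for _ in range(n - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

def anova_f(phases):
    """One-way ANOVA F = MS_B / MS_W, with phases as groups."""
    k = len(phases)
    n_total = sum(len(p) for p in phases)
    grand = sum(sum(p) for p in phases) / n_total
    means = [sum(p) / len(p) for p in phases]
    ss_b = sum(len(p) * (m - grand) ** 2 for p, m in zip(phases, means))
    ss_w = sum((v - m) ** 2 for p, m in zip(phases, means) for v in p)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def rejection_rate(rho, n_phases=4, n_trials=10, reps=1000, seed=0):
    """Proportion of null-hypothesis replications in which F is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        g = ar1_series(n_phases * n_trials, rho, rng)
        phases = [g[j * n_trials:(j + 1) * n_trials]
                  for j in range(n_phases)]
        if anova_f(phases) > F_CRIT_3_36:
            rejections += 1
    return rejections / reps
```

Because all phase means are equal here, every rejection is a Type I error. Comparing `rejection_rate(0.0)` with `rejection_rate(0.6)` shows the direction of the effect studied in this paper; the particular values depend on the assumed autoregressive model and will not match the tabled results.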
For each replication, the serial correlations for all the IJ scores for lags 1-20 were computed for the vector G. These serial correlations were then averaged over the 1000 replications to give estimates of the actual serial correlations for the generated data. Examples of these values for lags 1-10 are given in Table I.

Results
Results in the form of proportions as estimates of probabilities (empirical probabilities) are given in Tables III-X. For the mean patterns of equal, learning curve, and slow change, these empirical probabilities are estimates of the probability of a Type I error, or $\alpha$, for all tests except the MSSD upper-tailed test. The learning-curve data are a violation of the slow-change assumption, and the proportions given for the MSSD upper-tailed test are estimates of power. For the ABAB and linear patterns of means, the empirical probabilities are estimates of power for the Shine-Bower $F_B$, ANOVA F, and preliminary-testing F and estimates of $\alpha$ for the other tests. For zero serial correlations, all tests give reasonable control of $\alpha$. The $\alpha$ values for the unequal-variances, normal-distribution case were slightly inflated for the ANOVA and preliminary-testing tests, and the $\alpha$ values for the exponential distribution (equal variances) were slightly conservative for all tests. Combinations of unequal variances and the exponential distribution showed that the effect of unequal variances was more potent and yielded slightly liberal $\alpha$ values for ANOVA and preliminary-testing tests. For zero serial correlation, these tests were more powerful than the Shine-Bower $F_B$.
For nonzero serial correlation patterns, all tests showed distinct sensitivity to increasing degree of correlation. All tests except the MSSD upper-tailed test were excessively liberal for nonzero serial correlation. For example, the Shine-Bower $F_B$ gave $\alpha$ values of .497, .768, and .995 for low, medium, and high serial correlation, respectively, for equal means, normal distribution, and equal variances for J = 4, I = 10. While the proportion of rejections out of 1000 replications for the preliminary-testing F decreased from .591 for low correlation to .129 for high correlation, the effective rejection rate (the proportion of rejections out of the replications for which the F test was actually computed, those having all within-groups, lag-one serial correlations nonsignificant) increased from .633 for low correlation to .985 for high correlation. For medium or high correlation, the preliminary testing procedure usually would not progress to doing the F test, and if progress to the F test was allowed, a high percentage of false rejections would result. The learning-curve and slow-change mean patterns gave similar results. These same results were replicated for unequal variances, exponential distribution, and all cases for the larger design. Given the excessive $\alpha$ values in the face of nonzero serial correlation, the value of a discussion of power is questionable. The MSSD upper-tailed test was increasingly conservative as a function of increasing serial correlation and was not sensitive to violation of the slow-change assumption.
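The two rejection rates reported for the preliminary-testing F are linked by a simple identity: the overall rate equals the fraction of replications that pass the preliminary tests multiplied by the effective rate among those that pass. The short calculation below backs out the implied pass fractions from the reported figures. These fractions are derived here for illustration and are not values reported in the tables; the function name is hypothetical.

```python
def implied_pass_fraction(overall_rate, effective_rate):
    """Fraction of replications reaching the F test.

    Uses overall_rate = pass_fraction * effective_rate, where overall_rate
    is rejections out of all replications and effective_rate is rejections
    out of the replications for which F was actually computed.
    """
    return overall_rate / effective_rate

# Reported rates for low and high serial correlation (J = 4, I = 10).
low_pass = implied_pass_fraction(0.591, 0.633)
high_pass = implied_pass_fraction(0.129, 0.985)
```

The implied fractions are roughly .93 for low correlation and .13 for high correlation. Under low correlation nearly every replication reaches the F test and most of those are falsely rejected; under high correlation few replications reach the F test, but almost all that do are falsely rejected.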

[Table title fragment: MSSD two-tailed (N) and MSSD upper-tailed (U) tests, four phases by 30 trials, normal distribution, unequal phase variances]

Conclusions
Considerable attention has been given to the problem of data analysis for N = 1 designs. Several authors mentioned or alluded to the need for further research on traditional ANOVA-based tests, and Hartmann (1974) suggested using the Shine-Bower tests when independence assumptions are not met. Shine and Bower (1971) indicated that serial correlation in the data would be carried "by certain effects" without giving any methodology for separating the effects of nonzero serial correlation from the effects of the experimental factor or interaction. Subsequent articles by Shine (1980, 1981, 1982) argue for two single-subject behavior functions, one of which is an independent error model such as claimed for the Shine-Bower model. This paper attempts to clarify that use of the Shine-Bower analysis should be restricted to data that fit the independent error situation. The present research shows that the Shine-Bower, ANOVA, and preliminary testing tests for the experimental factor are seriously influenced by violation of the independence assumption. Positive nonzero serial correlation causes excessively liberal $\alpha$ values, and these tests are not robust to even nonsignificant serial correlation, which usually would not be detected by tests for serial correlation. Thus, the results of Box (1954) and Hibbs (1974) generalize to these ANOVA-based methods for N = 1 data: the statistics are inflated by the presence of nonzero serial correlation. Because none of these procedures can be recommended for hypothesis testing in single-subject research with positive lag-one serial correlation, the researcher with such data may turn to the other methods mentioned earlier. For the researcher who wants to use statistical methods for serially correlated N = 1 data, time series analysis should suffice because it is designed to model dependency such as that used in the present research. For the researcher who has data that are not serially correlated, the ANOVA F test or the preliminary testing procedure provides more power than the Shine-Bower tests.