Ecology, 76(6), 1995, pp. 2001-2009
© 1995 by the Ecological Society of America

RULES AND JUDGMENTS IN STATISTICS: THREE EXAMPLES

ALLAN STEWART-OATEN
Department of Biological Sciences, University of California, Santa Barbara, California 93106 USA
Abstract. Statistical analyses are based on a mixture of mathematical theorems and
judgments based on subject matter knowledge, intuition, and the goals of the investigator.
Review articles and textbooks, aiming for brevity and simplicity, sometimes blur the difference between mathematics and judgment. A folklore can develop, where judgments based
on opinions become laws of what "should" be done. This can intimidate authors and readers,
waste their time, and sometimes lead to analyses that obscure the information in the data
rather than clarify it. Three familiar examples are discussed: the choice between Normal-
based and non-parametric methods, the use of multiple-comparison procedures, and the
choice of sums of squares for main effects in unbalanced ANOVA. In each case, commonly
obeyed rules are shown to be judgments with which it is reasonable to disagree. A greater
stress on model selection, aided by informal methods, such as plots, and by informal use
of formal methods, such as tests, is advocated.
Key words: analysis of variance; efficiency; multiple comparisons; nonparametric methods; sums
of squares; validity.
INTRODUCTION
A common problem in statistical analyses in ecology
is that judgments expressed in statistical papers, text-
books, and reviews are sometimes interpreted as math-
ematically rigorous, mandated rules, rather than as
opinions more relevant to some cases than others. As
a result, authors, reviewers, or editors sometimes select
or require statistical analyses that, after much effort,
obscure more than they clarify.
Reviews of statistical methods in biological journals
have a valuable but difficult role. They may be the main
route from theory to widespread application, so essen-
tial to the progress of both biology and statistics. The
difficulties arise from the need to present the methods
concisely and, as far as possible, painlessly. Mathe-
matical derivations rarely appear, so the models and
approximations on which the methods are based, and
their limitations, may be downplayed. This can lead to
a backlash later, when a method is seen to be exact
only in special circumstances, such as an underlying
Normal distribution, and is incorrectly tarred as "in-
valid" in all others. In addition, assertions tend to rely
on proof by authority and to take the form of exhor-
tations and instructions rather than statements of re-
sults. The distinction between fact and judgment is
blurred. In time, some judgments become accepted as
laws. These can be hard to challenge, since refutations
may be viewed by statistical journals as "obvious" and
by biological journals as "wrong."
Manuscript received 9 May 1994; revised 7 December 1994; accepted 27 January 1995.
I discuss three such judgments in areas recently re-
viewed in Ecology and Ecological Monographs: the
choice of nonparametric vs. Normal-based procedures
(Potvin and Roff 1993), the use of multiple testing
methods (Day and Quinn 1989), and the choice of nu-
merator sums of squares in unbalanced ANOVA (Shaw
and Mitchell-Olds 1993). In each case, I offer and de-
fend a judgment that disagrees in part with common
beliefs described in these reviews.
NONPARAMETRIC vs. NORMAL-BASED PROCEDURES
"Nonparametric" or "distribution-free" methods
arose mainly in the 1930s and 1940s from concerns
about the validity of methods based on the Normal
distribution, although Arbuthnot (1710) used the sign
test to prove the existence of God. Two judgments com-
monly used to choose between nonparametric and Nor-
mal-based methods are: (1) Normal-based methods are
valid only for Normal distributions, while nonpara-
metric methods are valid for all distributions. (By "val-
id," I mean that the true probabilities of a test rejecting
a true null hypothesis, or of a confidence interval cov-
ering an unknown parameter, equal the nominal alpha
level or confidence.) One should test whether the data
are Normal and, if they fail this test, use nonparametric
methods. (2) Those nonparametric methods that are
based on ranks (to make computations manageable)
discard information, so they are less efficient than
methods using the raw data. (By "efficient," I mean
that the tests are less powerful, the estimates have
greater standard errors, and the confidence intervals are
wider, for a given sample size.)
Both beliefs are (in my "judgment") more wrong
than right.
Potvin and Roff (1993) appear to support belief (1):
"The main advantage of nonparametric methods is the
absence of assumptions regarding the distribution un-
derlying the observations." One way to check belief
(1) is to compare results from the t table to those from
the "randomization" distribution: the collection of all
possible t values that can be obtained by keeping the
values in the data but rearranging the labels ("treat-
ment" or "control") attached to them. Under the null
hypothesis of no effect, the value obtained from each
unit was "pre-ordained," regardless of whether it was
"treatment" or "control;" the chance calculated in a
P-value (the significance level of the data) derives from
the experimenter's deliberate randomized assignment
of units to "treatment" or "control." Possibly the first
such test was used by Fisher in 1935 (Fisher 1960) to
criticize belief (1): for 15 data pairs from an experiment
by Darwin (1876), the P value from the t table was
almost identical to that obtained from the randomiza-
tion distribution (i.e., all 2^15 possible t values obtainable
by labelling one member of each data pair "treatment"
and the other "control"). This example does not con-
stitute a proof, but Hoeffding (1952) proved that the
standard and randomization two-sample t tests have the
same validity and power for large samples.
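A minimal sketch of this check, assuming Python with numpy and scipy, and using the paired differences (in eighths of an inch) commonly quoted for Darwin's experiment:

from itertools import product

import numpy as np
from scipy import stats

# Paired differences (treatment - control), in eighths of an inch.
diffs = np.array([49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48])

t_obs, p_table = stats.ttest_1samp(diffs, 0.0)  # P value from the t table

# Randomization distribution: under "no effect" each pair's sign could have
# gone either way, so enumerate all 2^15 relabellings of the pairs.
t_rand = []
for signs in product([1, -1], repeat=len(diffs)):
    d = diffs * np.array(signs)
    t_rand.append(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))
t_rand = np.array(t_rand)

p_rand = np.mean(np.abs(t_rand) >= abs(t_obs))  # two-sided randomization P value
print(f"t-table P = {p_table:.4f}; randomization P = {p_rand:.4f}")

For these data the t-table and randomization P values agree closely, which was Fisher's point.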
A second check is to study the behavior of Normal-
based methods when samples are drawn from non-Nor-
mal distributions. Numerous studies based on geometry
(Efron 1969), asymptotics (e.g., Cressie and Whitford
1986), and simulations (e.g., Posten 1978, 1979), have
indicated the broad validity of Normal-based methods
for moderate non-Normal samples, even as low as n =
5. The exceptions are one-tailed tests on a single
skewed distribution or on two skewed distributions
with different skewnesses, variances, or sample sizes.
The advice to test for Normality before using Nor-
mal-based methods is almost paradoxical. If: "Nor-
mality is a myth: there never has been, and never will
be, a Normal distribution" (Geary 1947), then the null
hypothesis of Normality is always false. Whether it will
be rejected in a given case will depend on the power
of the Normality test, if a fixed cutoff (e.g., 0.05) is
used. Obviously this power will be greater for large
than for small samples. Thus pre-testers will tend to
accept Normality and use Normal-based methods when
samples are small, but reject Normality and use other
methods when samples are large. This is almost the
opposite of what they "should" do since, when the
underlying distribution is non-Normal, many Normal-
based methods have high validity with large samples
(courtesy of the Central Limit Theorem) but lower va-
lidity with small samples.
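The paradox is easy to see in a small simulation; the Shapiro-Wilk pre-test and the mildly skewed lognormal parent used below are illustrative choices, not prescriptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (10, 30, 200):
    rejections = 0
    for _ in range(2000):
        x = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # mildly skewed, non-Normal
        _, p = stats.shapiro(x)
        if p < 0.05:
            rejections += 1
    print(f"n = {n:3d}: Normality rejected in {rejections / 2000:.0%} of samples")

The pre-test rarely rejects when n is small and almost always rejects when n is large, steering users away from Normal-based methods exactly where the Central Limit Theorem makes them most trustworthy.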
Belief (1) is wrong not only about the broad validity
of Normal-based methods, but also about the universal
validity of nonparametric methods. Unless there is de-
liberate randomized assignment of experimental units
to treatments (as in the Darwin example), the latter are
almost always based on the assumption that, under the
null hypothesis, all observations are independent draws
from the same distribution. When used for confidence
intervals, e.g., for the difference between the means of
two populations, they usually assume that the popu-
lations are identical except for a location or shift pa-
rameter. These assumptions of identical distributions
require equal variances under different treatments.
Hoeffding's (1952) results can be extended to show
that the randomization and standard (equal variances)
t tests have the same validity and power when variances
are unequal (Romano 1990): both are asymptotically
invalid if the sample sizes are unequal (i.e., if their
ratio does not converge to unity). Fligner and Policello
(1981) show that the Wilcoxon-Mann-Whitney test can
be invalid when comparing distributions having dif-
ferent variances, and suggest an adjustment similar to
the Welch-Satterthwaite modification of the t test.
Without this adjustment, the Wilcoxon is likely to be
less valid than the modified t: the "rule" that nonpara-
metric methods "should" be used when variances are
unequal is especially unfortunate.
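A simulation sketch of this point, with illustrative (assumed) sample sizes and variances; the means are equal, so both null hypotheses are true and any departure of the rejection rates from the nominal level is a failure of validity in the sense defined above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, alpha = 5000, 0.05
welch_rej = wilcoxon_rej = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 3.0, size=10)   # small sample, large variance
    y = rng.normal(0.0, 1.0, size=40)   # large sample, small variance
    if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:             # Welch t
        welch_rej += 1
    if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:  # Wilcoxon
        wilcoxon_rej += 1
print(f"Welch t rejection rate:  {welch_rej / n_sims:.3f} (nominal {alpha})")
print(f"Wilcoxon rejection rate: {wilcoxon_rej / n_sims:.3f} (nominal {alpha})")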
The t and the Wilcoxon two-sample tests are actually
testing different things: E(X) = E(Y) for the t, but P(X
> Y) = 0.5 for the Wilcoxon. Each implies the other
if the distributions are symmetric, or have the same
shape and spread. But if the two distributions are dif-
ferently skewed (as in the example given in Table 2
and text of Potvin and Roff [1993]), or have the same
skew but different variances, the tests are hard to com-
pare: one hypothesis could be true while the other is
false (or both could be false, but in opposite direc-
tions!), and neither test is strictly valid.
Overall, "(1) unless randomization has been per-
formed, the 'distribution free' tests do not possess the
properties claimed for them, and (2) if randomization
is performed, standard parametric tests such as the t
test usually supply adequate approximations" (Box et
al. 1978: 104).
Potvin and Roff's (1993) review provides a good
survey of evidence contradicting belief (2). It was first
attacked by Pitman (1948), who formalized ideas of
the efficiency of tests, and later in classic papers by
Hodges and Lehmann (1956) and Chernoff and Savage
(1958). In brief, nonparametric methods are often only
slightly less efficient than Normal-based methods when
the true underlying distribution is Normal, but can be
much more efficient when it is not (Lehmann 1975).
Ironically, this is a by-product of the reduction of in-
formation to ranks, a device aimed originally not at
increasing efficiency but at making computations man-
ageable in pre-computer environments. The use of
ranks reduces the effect of extreme observations, which
carry less information than moderate observations for
non-Normal distributions, and can be seriously mis-
leading. Thus the greater efficiency derives not from
being distribution-free (i.e., using randomization dis-
tributions rather than theoretical sampling distribu-
tions), but from the use of statistics chosen for com-
putational convenience.
But extreme observations can also be downweighted
in parametric approaches. Trimmed means, modified
maximum likelihood (Tiku et al. 1986), and other "L,"
"R," or "M" estimators (Andrews et al. 1972, Huber
1981) all have only slightly larger variances than the
sample mean when the underlying distribution is Nor-
mal, but often much smaller variances when it is not.
The smaller variances lead to smaller confidence in-
tervals and more powerful tests (Gross 1976, Kafadar
1982). These "robust" methods often produce esti-
mates of means, regression slopes, etc., more easily
than rank-based methods. Messy computations may be
needed to derive confidence intervals and tests from
them (e.g., to estimate variances), but these can be
automated and the results appear to be valid for non-
Normal distributions. Their main drawbacks may be
(1) an author may be suspected (perhaps rightly) of
selecting the method whose results best fit his pet the-
ory, and (2) like rank methods, they estimate a variety
of different things when distributions are skewed.
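As one concrete sketch (the data and the particular recipe, a 20% trimmed mean with a Winsorized-variance standard error in the Tukey-McLaughlin style, are illustrative assumptions, not a unique prescription):

import numpy as np
from scipy import stats

x = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 14.2])  # one wild value
prop = 0.20                               # trim 20% from each tail
g = int(prop * len(x))                    # observations trimmed per tail
h = len(x) - 2 * g                        # observations retained

tmean = stats.trim_mean(x, prop)                                  # 20% trimmed mean
xw = np.asarray(stats.mstats.winsorize(x, limits=(prop, prop)))   # Winsorized sample
se = np.sqrt(np.var(xw, ddof=1)) / ((1 - 2 * prop) * np.sqrt(len(x)))
t_crit = stats.t.ppf(0.975, h - 1)
print(f"mean = {x.mean():.2f}, trimmed mean = {tmean:.2f}, "
      f"approx 95% CI = ({tmean - t_crit * se:.2f}, {tmean + t_crit * se:.2f})")

The wild observation pulls the ordinary mean well away from the bulk of the data but barely moves the trimmed mean.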
In summary, the main advantage of nonparametric
methods is not greater validity, which is usually slight,
but greater efficiency, which is often considerable, al-
though it depends on the underlying distribution. Rote
resort to nonparametric methods because of suspected
non-Normality, or the failure of a pre-test, is not a good
strategy, especially if it leads to testing a plainly im-
plausible null hypothesis rather than estimating a mean-
ingful parameter. Rather than pre-test for Normality, it
makes sense to plot the data. Severe kurtosis might
suggest a nonparametric or a robust parametric pro-
cedure, while severe skewness might suggest a trans-
formation, a topic too large to discuss here. If the
"identical distributions" assumption can be trusted,
then nonparametric methods seem especially useful
when samples are small (so computations are simple
and Normal-based approximations may be seriously in-
accurate), their target parameters are of particular in-
terest (e.g., the median), a complicated statistic is need-
ed (e.g., to estimate a mode), or there is a need for both
efficiency and simplicity (e.g., high kurtosis and an
audience likely to be suspicious of exotic methods).
MULTIPLE COMPARISONS
Most studies involve several statistical tests or con-
fidence intervals. The chance that at least one of these
makes an error (a false rejection or an interval that fails
to cover the true value), will be larger than the chance
of an error on any particular one. Also, some "un-
planned" tests or intervals may have been constructed
only because the investigator noticed something odd
about the data: in effect, many tests were carried out
mentally, but only the most "significant" was reported.
Multiple-comparison methods allow simultaneous in-
ference with a prespecified overall error rate (proba-
bility of at least one false rejection or incorrect interval)
in cases like these.
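The growth of the overall rate is easy to quantify: with k independent tests each at level α, the probability of at least one false rejection when all null hypotheses are true is 1 - (1 - α)^k, and kα is the Bonferroni bound that holds even for dependent tests. A short check:

alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"k = {k:2d}: P(at least one false rejection) = {1 - (1 - alpha) ** k:.3f}"
          f"  (Bonferroni bound {min(k * alpha, 1.0):.2f})")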
Several different judgments have been made about
the choice between overall rates controlled by multiple-
comparison methods and the usual comparison-wise
rates that consider each test or interval in isolation.
Scheffé (1959: 66, 80) recommends using overall rates
for all inferences on a single data set, following a sig-
nificant ANOVA F test. Snedecor and Cochran (1989:
234) recommend overall rates for unplanned tests or
intervals, but comparison-wise rates for those that were
planned before the data became available. Sokal and
Rohlf (1981: 233, 241) and Day and Quinn (1989: 449)
recommend comparison-wise rates for a set of planned
orthogonal comparisons, but overall rates for other
planned and all unplanned comparisons. (In one-way ANOVA, a contrast of the means is a linear function, Σ_j c_j μ_j, where the c_j's are constants with Σ_j c_j = 0; two contrasts, using {c_1j} and {c_2j}, are orthogonal if Σ_j c_1j c_2j / r_j = 0, where r_j is the number of observations on treatment j.) Finally there is a viewpoint rather rare
in biology but common among statisticians:
"Multiple comparison methods have no place at all
in the interpretation of data" (Nelder 1971).
"I recommend therefore that multiple comparison
methods be avoided; that the idea of experiment-wise
error rates be retained, but only as a general principle"
(Mead 1988).
My judgment is closest to these last. A list of reasons
for avoiding multiple comparisons may be useful, pro-
vided it is kept in mind that these reasons are
judgments
rather than theorems, not all of them apply in all cases,
not all opponents of multiple comparisons have the
same list, and there may be cases where none of these
reasons apply and multiple comparisons are useful.
Miller (1981) and the review by Day and Quinn (1989)
are good guides in such cases and exceptions to Finney
(1990), who suggests rejecting all multiple-compari-
sons papers as "rarely more useful than a horoscope."
The role of significance tests
It is an illusion to see testing as a system of objective,
automated decision-making: "an effect is real if and
only if the null hypothesis is rejected at the 0.05 level."
Multiple-comparison methods are in part an attempt to
maintain this illusion by protecting the integrity of the
test level. The task is impossible. Dozens of scientists
are working every day on effects that, unknown to
them, are small, unimportant, or non-existent. By
chance, a few will get "significant" results. These are
far more likely to be submitted and published than the
others. Unpublished dissertations can correct the bal-
ance a little, but the integrity of 0.05 is hopelessly lost.
In any case, few of us would make up our minds on
the basis of a single test on a single data set. We will
want to consider evidence on both presence and size
from several sources, and also our biological intuition
and knowledge of mechanisms, similar systems and
species, etc. In selecting a best treatment (e.g., in med-
icine), we will want to assess costs, availability, ease
of use, side effects, etc. (Anscombe 1985).
Probably no single number can combine all this in-
formation (though Bayesians and some meta-analysts
may disagree). An accept/reject result at the 0.05 level
is only one of several summaries to consider, perhaps
one of the least useful. It contains less information than
a level of significance (P) or a confidence interval.
Plots, averages and other estimates, mean squares, stan-
dard deviations and F values, and well-organized tables
may also tell us more about a particular data set. In-
formation from other data sets, or about mechanisms
from quite different studies, may tell us more still.
Thus, even with an uncontaminated test level, multiple-
comparison tests devote great effort to delivering one
of the least important data summaries, and inflating its
importance.
Interpretability
The comparison-wise error rate is the probability of
rejecting a true null hypothesis. The experiment-wise
rate is the probability of rejecting any of a set of true
null hypotheses tested in an experiment. We could also
define error rates for papers, studies, or (for journal
editors) issues or volumes. Recent issues of the Royal
Statistical Society's RSS News have (as a joke) sug-
gested controlling lifetime error rates: the first test of
your career is done at level 0.05/2, the next at 0.05/4,
and so on. None of these rates provides perfect pro-
tection; e.g., none protects a reader against the selection
bias mentioned above. Once test results lose their sa-
cred aura, and are recognized as only one, perhaps
minor, summary of a single data set, it makes sense to
choose on the basis of simplicity and interpretability.
This points to the comparison-wise rate. An additional
consistency argument (Saville 1990) is that, for all oth-
er rates, the test result depends not only on the data
relevant to the question, but also on irrelevant infor-
mation such as the number of other questions studied
with it, and perhaps their results.
Which experiment-wise rate?
Many multiple-comparison procedures are used only
if an ANOVA F test first rejects the hypothesis that all
contrasts in the class are zero (e.g., that all means are
equal). Thus the long-run fraction of errors (tests that
falsely reject, intervals that fail to cover the true pa-
rameter) will be the conditional probability of a wrong
answer given the significant F test, not the usual un-
conditional value. This conditional probability depends
on unknown parameters, but is always ≥ α, sometimes much greater, for Scheffé (1959) tests and confidence intervals (Olshen 1973).
Difficulty
Multiple-comparison tests are relatively difficult to
understand and carry out. This has three bad effects:
they may distract investigators from more effective
ways of assessing their data, they may be used inap-
propriately, and unsophisticated readers attribute more
importance to them than they deserve. Mead (1988)
gives startling examples of published work in which a
clutter of incomprehensible multiple comparisons
served to obscure biologically significant patterns that
plotting made immediately apparent. The most fre-
quently cited statistical paper for 1945-1988, ranked
24th in all of science (Garfield 1990), was Duncan
(1955), whose multiple-range test is almost always un-
suitable and inappropriately applied because it does not
control experiment-wise error rates, or any other error
rate that can be succinctly described (Day and Quinn
1989).
Rigidity
The choice of test level is usually arbitrary: it is not
derived mathematically from generally agreed criteria.
It makes sense to use a small α if Type 1 errors are very serious and Type 2 errors only mildly so, and to use a larger α in the reverse case. Since seriousness
often depends on the user, levels of significance (P) are
preferable to accept/reject decisions. They are also
more informative as partial data summaries. Multiple-
comparison tests usually ignore this: all tests are treated
equally.
Planned comparison-wise P values can be approxi-
mately converted to experiment-wise P values, as an
informal caution against overinterpretation. The prob-
ability of getting a comparison-wise P value of 0.03
in 10 tests of true null hypotheses is <10(0.03) = 0.3,
using the Bonferroni inequality (and often close to this
value). But if a given hypothesis is not rejected at α = 0.05 by a multiple-comparison test, we can do little
more than say that its comparison-wise P value is
≥ 0.005, possibly very much greater. This point has less
force for unplanned comparisons that, in theory, are
selected from infinitely many possible comparisons.
But this theory may be unrealistic. Usually, only a few
pairwise differences, and a few differences between
averages of one type and averages of another type, have
reasonable biological interpretations. Thus, even here,
it may be possible to allow approximately for the num-
ber of tests that could have been made.
Power
For a given α, a given null hypothesis is less likely
to be rejected by a multiple-comparison test than by a
single test. This reduces Type I error but increases Type
II error. There is no obvious reason for this to be a
good trade-off. It is clearly bad if, as a result, "sig-
nificance" requires a virtually (or literally) impossible
effect size: the test result will contain no information.
It is even less justified if a general F test has already
discredited the presumption that all null hypotheses are
true. A related problem is that the extreme test statistics
required for "significance" will be in the distant tails
of the null distribution, where approximations like the
Central Limit Theorem may not apply for moderate
samples.
The experimenter's intentions as a datum
A contrast is to be tested by one method if it was
planned but by another if it was not. Methods for
planned tests depend on the number of tests planned:
if k tests were planned, the Bonferroni procedure tests
at level α/k. How is a reader to know whether an author
is being honest about his intentions? Also, why should
the reader care? Perhaps the author did not think of a
particular contrast until he saw the data, so felt com-
pelled to use the Scheffé method. But the reader may
have thought of it immediately because of other ob-
servations she had made. As a result, she may have
been interested in this comparison and no others. Why
should his prior beliefs take precedence over hers?
Orthogonality
Some authors propose special treatment for orthog-
onal contrasts because their estimates, Σ_j c_1j Ȳ_j· and Σ_j c_2j Ȳ_j· (the dots indicate averaging over the missing subscript), are independent. However, independence requires equal variances unless c_1j c_2j = 0 for all j. Also,
the orthogonality condition only ensures zero corre-
lation; this implies independence if the errors are Nor-
mal, but not otherwise. Even then, inferences are not
usually independent: they use the same variance esti-
mate, the residual mean square. And even if the tests
were independent, the chance of rejecting at least one
true null hypothesis increases with the number of tests
just as inexorably as for correlated tests.
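A small numerical check of the covariance point, with illustrative contrasts and variances: for treatment means based on r_j observations with variances σ_j², the covariance of the two contrast estimates is Σ_j c_1j c_2j σ_j² / r_j, which the orthogonality condition forces to zero only when the variances are equal.

import numpy as np

r = np.array([4, 4, 4, 4])        # observations per treatment
c1 = np.array([1, 1, -1, -1])     # contrast 1
c2 = np.array([1, -1, 1, -1])     # contrast 2 (orthogonal to c1 for these r)

for sigma2 in (np.array([1.0, 1.0, 1.0, 1.0]),    # equal variances
               np.array([1.0, 4.0, 2.0, 1.0])):   # unequal variances
    cov = np.sum(c1 * c2 * sigma2 / r)
    print(f"variances {sigma2}: covariance of contrast estimates = {cov:+.3f}")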
An emphasis on orthogonal contrasts risks allowing
the needs of the statistical analysis to determine the
scientific questions to be studied: the tail wags the dog.
Snedecor and Cochran (1989) and Day and Quinn
(1989) contend that orthogonal contrasts provide sep-
arate answers to separate biological questions, but this
implies that the questions "Is μ_1 + μ_2 - μ_3 - μ_4 = 0?" and "Is μ_1 - μ_2 + μ_3 - μ_4 = 0?" are separate biological questions if variances and sample sizes are equal, but not otherwise.
Orthogonality is important, but the time to consider
it is not when the data are analyzed but when the ex-
periment is being designed, so that estimates of the
contrasts of main interest will (as far as possible) be
uncorrelated and have small variances.
In summary, it would be risky to claim that multiple-
comparison methods should never be used: one can't
think of everything. But they seem best suited for the
privacy of one's own lab, where (along with some bi-
ological thought) a small set of methods (e.g., Bon-
ferroni, Scheffé, and Tukey-Kramer; see Day and
Quinn 1989) might reduce overinterpretation without
great effort. For publication, comparison-wise methods
seem preferable, with a warning that the use of many
inferences raises the overall probability of misleading
results. In some cases, e.g., fishing expeditions where
the tests are used mainly as an exploratory tool, the
warning could be reinforced by a few multiple-com-
parison results, to indicate roughly the amount of al-
lowance needed.
SUMS OF SQUARES FOR ANOVA TESTS
With what F ratios should effects be tested in an
unbalanced k-way ANOVA? This is a perennial and
potentially frustrating problem, as those who have
struggled with Types I, II, and III sums of squares in
SAS will testify. Several authors, and the SAS manuals,
appear to favor Type III sums of squares. In explaining
my disagreement, I focus mainly on the two-way setup.
Its Full model, without restrictions, can be written in
either of two standard forms. One is the "cell means"
model
Y_ijk = μ_ij + ε_ijk,   (1)

where Y_ijk is the kth observation (k = 1, 2, ..., n_ij) using row treatment i (e.g., fertilizer i: i = 1, 2, ..., f) and column treatment j (e.g., variety j: j = 1, 2, ..., v), μ_ij is the mean for this combination of treatments, and ε_ijk is "error" due to other sources of variation.
The other form is the "main effects and interactions"
("effects") model:
Y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk,   (2)

where μ is the "grand mean," α_i and β_j are the main effects of the ith fertilizer and of the jth variety, and (αβ)_ij is their interaction.
TABLE 1. Standard tests of H0: all α_i = 0 (e.g., no fertilizer effects).*

  SAS Type    Complete model    Cell means hypothesis tested
  Type I      (μ, A)            H10: μ_i# is the same for all i
  Type II     (μ, A, B)         H20: μ_i# = Σ_j n_ij μ_#j / n_i+ for all i
  Type III    (μ, A, B, AB)     H30: μ_i· is the same for all i

  * μ_i· given by Eq. 3; μ_i# given by Eq. 5. Type I ss assumes the model is entered in SAS as "A B A*B."
This model is over-parameterized, but one can reasonably define μ = μ_··, α_i = μ_i· - μ_··, β_j = μ_·j - μ_·· and (αβ)_ij = μ_ij - (μ + α_i + β_j), where the dots indicate unweighted averages over the missing subscripts. E.g.,

μ_i· = Σ_j μ_ij / v,   (3)

so α_i is (average of the means using fertilizer i) - (average of all means). These definitions introduce the "usual" side conditions, Σ_i α_i = Σ_j β_j = Σ_i (αβ)_ij = Σ_j (αβ)_ij = 0, so there are really only f - 1 α's, v - 1 β's and (f - 1)(v - 1) (αβ)'s. (These effects parameters can be defined in other ways, leading to different side conditions.)
The usual null hypotheses posit sets of linear relations among the means, μ_ij, of Eq. 1. In principle, any relation can be tested, but the only ones tested in practice are more easily expressed in terms of the effects model, Eq. 2, as "all α's (or β's or (αβ)'s) are zero." We can describe the corresponding models by listing the components, using μ, A, B, and AB to indicate the inclusion of μ, the α's, the β's, and the (αβ)'s. Thus the model Y_ijk = μ + α_i + (αβ)_ij + ε_ijk, which assumes that all β_j's are zero, is designated (μ, A, AB).
Assuming (in decreasing order of importance) in-
dependent observations, equal variances, and Normal-
ity, the hypotheses that there are no interactions, no
row treatment effects, or no column treatment effects,
are tested by comparing the fit of any "Complete"
model known (or believed) to be true to that of the
same model with the parameters under test omitted.
There are eight possible Complete models for testing "H0: all α_i's are zero:" (μ, A, B, AB), (μ, A, B), (μ, A), (μ, A, AB), (A, B, AB), (A, B), (A, AB) and (A); in
each case the Reduced model is obtained by omitting
"A." The Complete model is sure to fit better, but the
improvement may be due only to chance. To test this,
the difference in fits is compared to an independent
estimate of σ², the variance of the ε_ijk's, obtained by
comparing the sum of squares of the observations with
the sum of squares of the fitted values for any "Trust-
ed" model believed to be true. The F test statistic is:
F = {(Complete ss - Reduced ss)/dfn} / {(Observed ss - Trusted ss)/dfd},   (4)
where Complete ss = Sum of squares of fitted values
using the Complete model, etc., and dfn (= the number
of parameters under test) and dfd (= n - the number
of parameters in the Trusted model) are the degrees of
freedom of the numerator and denominator.
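Eq. 4 can be computed directly from least-squares fits. A sketch, assuming the user supplies full-rank design matrices (dummy or sum-to-zero coded columns) for the Complete, Reduced, and Trusted models, along with the response vector y:

import numpy as np
from scipy import stats

def fitted_ss(X, y):
    """Sum of squares of the least-squares fitted values for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ beta) ** 2))

def anova_f_test(X_complete, X_reduced, X_trusted, y):
    """F statistic and P value of Eq. 4 (full-rank design matrices assumed)."""
    df_num = X_complete.shape[1] - X_reduced.shape[1]   # parameters under test
    df_den = len(y) - X_trusted.shape[1]                # error degrees of freedom
    num = (fitted_ss(X_complete, y) - fitted_ss(X_reduced, y)) / df_num
    den = (np.sum(y ** 2) - fitted_ss(X_trusted, y)) / df_den
    f_stat = num / den
    return f_stat, stats.f.sf(f_stat, df_num, df_den)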
The problem is, what should be the Complete and
Trusted models? We focus on the Complete model,
which is usually the main concern. The Trusted model
is often the Complete model, but can be larger (contain
more parameters). E.g., it could be the Full model,
containing all effects and interactions ([μ, A, B, AB]
in the two-way case). We also focus on tests of main
effects in the two-way model: there is little dispute that
(μ, A, B, AB) is the appropriate Complete model for
tests of interactions. To avoid extraneous complica-
tions, we assume all n_ij > 0.
Shaw and Mitchell-Olds (1993) advocate use of (μ, A, B, AB) for testing main effects too. This follows the
advice of Speed et al. (1978), who argue that other
choices test the wrong hypotheses. They describe the
hypotheses not in terms of the α's or β's of the effects model, Eq. 2, but in terms of the μ_ij's of the cell means model, Eq. 1. For a given choice of Complete model, the hypothesis tested is the relation among the μ_ij's for which, if there were no errors, Complete ss = Reduced ss. These relations can be expressed in terms of row and column averages of the μ_ij's (μ_i· and μ_·j), and of row and column weighted averages, the weights being the cell counts, n_ij:

μ_i# = Σ_j n_ij μ_ij / n_i+ and μ_#j = Σ_i n_ij μ_ij / n_+j,   (5)

where n_i+ = Σ_j n_ij and n_+j = Σ_i n_ij.
Of the eight possible Complete models for testing "H0: all α's are zero," given in the paragraph preceding Eq. 4, only the first three are used in practice: it is almost never realistic to suppose that μ = 0, and the model (μ, A, AB) is rarely plausible. Thus, Table 1
gives the standard choices.
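The three Types can be obtained from any package that reports them; below is a sketch using Python's statsmodels (an assumed tool for this illustration, not one discussed in the text) on a small, deliberately unbalanced layout, with sum-to-zero coding so that the Type III hypotheses match the unweighted-mean form in Table 1:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for a in ("a1", "a2"):
    for b in ("b1", "b2", "b3"):
        for _ in range(int(rng.integers(2, 6))):   # unequal cell counts, all > 0
            rows.append({"A": a, "B": b, "y": rng.normal()})
df = pd.DataFrame(rows)

fit = smf.ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=1))   # Type I  (sequential, A entered first)
print(sm.stats.anova_lm(fit, typ=2))   # Type II
print(sm.stats.anova_lm(fit, typ=3))   # Type III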
Speed et al. (1978) argue that H30 (and its counterpart
for the β's) "seem to be reasonable," while H10 and especially H20 are "not easy to understand" and "very
difficult to justify." This is clearly a set of judgments
about what "seems reasonable," rather than a set of
mathematical results. But there are reasonable counter-
arguments. When Hocking (1982) made similar argu-
ments in connection with the Analysis of Covariance,
Cox and McCullagh (1982) replied that he "appears to
favor estimating and testing main effects in the pres-
ence of interaction, a thing we consider rarely physi-
cally meaningful."
If there are no AB interactions, H20 is equivalent to H30, and both to "all α_i's are equal." For example, μ_i# becomes μ + α_i + Σ_j n_ij β_j / n_i+ while Σ_j n_ij μ_#j / n_i+ becomes μ + Σ_j Σ_k n_kj n_ij α_k / (n_+j n_i+) + Σ_j n_ij β_j / n_i+, so α_i = Σ_j Σ_k n_kj n_ij α_k / (n_+j n_i+), a weighted average with no zero weights; this holds for all i so all α_i's must be equal: e.g., it would be impossible for the "largest" α to be a weighted average of the others. Thus H20 is "not easy to understand" if there are interactions, but seems easy to justify otherwise, since it becomes "all α's are 0" or (in the means model) "μ_ij = μ_#j for all i and j." Similarly, if there are neither AB nor B effects, then H10 is equivalent to the other two, and "seems" quite reasonable.
Thus, H30 is clearly preferable only when there are
interactions. But all three H0's "seem" likely to be
uninteresting then. Shaw and Mitchell-Olds (1993)
give an example in which Y = tree height at the end
of a study, A refers to removal or non-removal of con-
specifics within a certain distance, and B is initial
height (Small or Large). With interactions, H30
would
mean that removal increases height of one B class (say,
the initially small trees) but decreases that of the other,
the two effects being exactly equal. This seems an im-
practical result, more an unlikely coincidence than any-
thing else: the averages are unlikely to remain equal if
we choose a different dividing line between Small and
Large initial heights, or divide initial heights into three
classes rather than two. In experiments with less ar-
bitrary categories, adding a new category (e.g., a new
seed type) to the Bs would also usually make the means
of the As unequal. Once we know the removal effect
varies, the main effects are usually of little interest
compared to the mechanisms or consequences. This is
the point made by Cox and McCullagh (1982), but
obscured by the cell means model: the Type III ss is
"obviously" best for a test of main effects only when
it makes little sense to test main effects at all.
If we test main effects only after deciding that there
are no interactions, then valid F tests of "all α's are equal" (or "all μ_i·'s are equal") are obtained with either (μ, A, B) or (μ, A, B, AB) as the Complete model. It then seems reasonable to choose on the basis of power. This leads to choosing (μ, A, B), as Shaw and Mitchell-Olds (1993) remark. If there are neither interactions nor B effects, then the Type I ss, using (μ, A), is more powerful than either. In both cases, the
power argument also leads to choosing the Complete
model as the Trusted model in Eq. 4, in order to max-
imize dfd.
This suggests that the best procedure for determining
the true model in a two-way unbalanced ANOVA is
usually (1) test for interactions, using the Type III ss;
(2) if there are no interactions, check A or B effects,
whichever seems a priori more likely to be absent, us-
ing the Type II ss; and (3) if these main effects seem
to be absent, check the remaining main effects using
the Type I ss, otherwise use the Type II ss. If inter-
actions are present, main effects would usually not be
tested. The emphasis is on models: we begin with a
large one and simplify it in steps, each time testing
whether the data justify removing, from the model cur-
rently accepted, the complicating factor judged least
likely to be present. A more general scheme, involving
a "baseline" model that can be updated iteratively, is
suggested by Cox (1984) for higher-way layouts. The
sss in this scheme would often be different from all of
the "Types" routinely offered by packages like SAS.
There may be exceptional cases where Type III sss
make sense. A set of treatments might sometimes be
expected to have exactly opposite effects on males and
females, or on the left and right sides. I know of no
example, but the claim that some marine structures
"increase" density merely by attracting fish from else-
where might lead to one. If interactions, though "sig-
nificant," seem small, it may make sense to see whether
some treatments (e.g., fertilizers) do consistently better
than others, though Type III tests for fertilizer effects
are not necessarily the best way to do this. If treatment
1 of both the A and B groups is a control, it might
make sense to define effects in terms of differences
from it: as α_i = μ_i1 - μ_11, β_j = μ_1j - μ_11, and (αβ)_ij = μ_ij - μ_i1 - μ_1j + μ_11; if it is plausible that A treatments have no effects except when combined with B treatments, then models like (μ, AB) may be reasonable
and sss different from any of the "Types" called for.
These examples suggest that generalizations about ap-
propriate sss may be less useful than the injunction to
"think about the model."
It should be stressed that "using" a ss does not mean
decisions "should" be based only on formal tests.
Plots, conformity with other data and information,
plausibility of mechanisms, and apparent sizes of ef-
fects are also relevant, and "should" perhaps play a
larger role than tests. Cox (1984) notes that his pro-
cedure is "close to a severely constrained form of step-
wise regression;" using these other factors would bring
it closer to a constrained form of variable selection in
regression, where plots, measures like Cp (Mallows
1973), and subject matter information are all brought
to bear. Recent work on exploratory approaches (Hoag-
lin et al. 1991) is relevant here.
DISCUSSION
The aim of this paper is not to develop new ortho-
doxies of the universal validity of Normal-based pro-
cedures, universal avoidance of multiple-comparison
methods or the appropriate sums of squares in unbal-
anced ANOVA, nor to suggest there are no rules of
inference at all. It is to suggest that the practice of
statistics in ecology is sometimes too rigid. "Sensible"
opinions are treated as mandatory rules, frustrating au-
thors, who may be required to use methods they think
inappropriate, and confusing readers.
One possible reason for this is an aura of exactitude
and universality inherited from mathematics, although
almost all inference is approximate not only because
of the use of limiting distributions but also because of
model uncertainty, treatment-subject non-additivity,
and sampling that is random in at best a limited slice
of time and space. Another is the narrow options and
focus of most canned packages: like many students they
"have no idea that the summary of an experiment is
not the anova table but tables of means and standard
errors" (Nelder 1994). (Another example: neither SAS
nor SYSTAT gives a confidence interval for the dif-
ference of two means when the variances are unequal.)
A third is oversimplification, avoidance of models, nar-
rowness of focus, and proof by authority in many re-
views and textbooks.
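The missing interval is in any case simple to compute; a sketch using the Welch-Satterthwaite approximation on two illustrative samples:

import numpy as np
from scipy import stats

x = np.array([3.1, 2.8, 3.6, 4.0, 2.5, 3.3])
y = np.array([4.9, 5.8, 4.2, 6.1, 5.5, 4.8, 6.3, 5.1])

d = x.mean() - y.mean()
vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
se = np.sqrt(vx + vy)
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))  # Welch-Satterthwaite df
t_crit = stats.t.ppf(0.975, df)
print(f"95% CI for mean(x) - mean(y): ({d - t_crit * se:.2f}, {d + t_crit * se:.2f})")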
Confusion of Popperian tests of a theory with "Fish-
erian" tests of a null hypothesis is a fourth reason,
though this unfairly oversimplifies Fisher's role. The
former subjects theories to grave danger of refutation,
while the latter too often merely supports them by re-
jecting a null hypothesis that no one believed in the
first place. "Since the null hypothesis is quasi-always
false, tables summarizing research in terms of patterns
of 'significant differences' are little more than com-
plex, causally uninterpretable outcomes of statistical
power functions" (Meehl 1978). Meehl's comment that
"the almost universal reliance on merely refuting the
null hypothesis as the standard method for corrobo-
rating substantive theories in the soft areas is a terrible
mistake, is basically unsound, poor scientific strategy,
and one of the worst things that ever happened in the
history of psychology" applies to other disciplines too.
Statistical techniques are intended to clarify: to sep-
arate signal from noise, reveal patterns, and tease out
systematic or consistent relationships and tendencies
that might otherwise be hidden in individual variation,
measurement error, and random accidents. Some hy-
pothesis testing can help, since some null hypotheses
could be true: some birds may make no distinction at
all between their own eggs and those of a parasite
(Rothstein 1986), and interactions may sometimes all
be exactly zero. Goodness-of-fit tests often do subject
well-defined models to danger of refutation, rather than
confirming vague ones by rejecting something implau-
sible. But much of statistics is not formal inference but
an often-iterative pattern of design, summary, descrip-
tion, and display, and linking results of the present
study to past studies, known mechanisms, and biolog-
ical intuition. Most of formal inference is not hypoth-
esis testing but model construction, selection, and
checking (formal and informal), estimation of param-
eters and standard errors, or calculation of confidence
regions or of Bayesian posterior distributions of pa-
rameters (Box 1980).
When "statistics" gets reduced to "statistical infer-
ence" and then to hypothesis testing (Freedman et al.
1991 is an outstanding exception), it can become its
opposite: not a way to reveal and clarify but to obscure
and terrify. Statistical consultants are then asked not
to help discover truth, but to produce mumbo jumbo,
of no interest to the client and distracting to the au-
dience, to pacify a referee.
A program to promote better statistical practices
would not only be a large undertaking, but might be-
come just as fossilized as present practice. Some sug-
gestions might be useful: more use of informal, non-
inferential techniques (plots, summary statistics, and
tables, as described in several books on Exploratory
Data Analysis); a much greater emphasis on model se-
lection, justification, and checking when formal tech-
niques are used; a healthy skepticism of the word
"should," especially in non-mathematical reviews and
texts; a reduction in significance testing of implausible
null hypotheses; and a general aim to use statistical
methods to simplify and reveal, as well as to measure
uncertainty. It may be helpful to recognize that, al-
though there are "wrong" analyses (incorrectly derived
from explicit models, or based-often implicitly-on
assumptions or models known not to be approximately
true), there are usually several "right" ones, depending
on the aims of the analysis and the knowledge, intui-
tion, and uncertainties (i.e., the models) of the inves-
tigator. Except for some extreme Bayesians, we are
rarely likely to have a complete and unique set of data
analysis rules-or, one hopes, of regulations-but rath-
er a set of sensible guidelines that, though supported
by mathematical results, must be flexible enough to
accommodate a variety of data-gathering setups and
individual interests. "If inference is what we think it
is, the only precept or theory which seems relevant is
the following: 'Do the best you can.' This may be
taxing for the old noodle, but even the authority of
Aristotle is not an acceptable substitute" (LeCam
1977).
ACKNOWLEDGMENTS
I thank Ed McCauley, Bill Murdoch, and Sue Swarbrick
for helpful comments on two drafts, Steve Rothstein and
Steve Gaines for useful further suggestions, and two anon-
ymous referees for many others. This work was supported in
part by the Minerals Management Service, U. S. Department
of the Interior, under Minerals Management Service Agree-
ment Number 14-35-0001-30471 (The Southern
California
Educational Initiative). The views and interpretation con-
tained in this document are those of the author and should
not be interpreted as necessarily representing the official pol-
icies, either express or implied, of the U. S. Government.
LITERATURE CITED
Andrews, D. F., P. J. Bickel, F. R. Hampel, P. J. Huber, W. H.
Rogers, and J. W. Tukey. 1972. Robust estimates of lo-
cation. Princeton University Press, Princeton, New Jersey,
USA.
Anscombe, F. J. 1985. Review of "Simultaneous Statistical
Inference" (Miller 1981). Journal of the American Statis-
tical Association 80:250.
Arbuthnot, J. 1710. An argument for divine providence,
taken from the constant regularity observed in the births
of both sexes. Philosophical Transactions 27:186-190.
Box, G. E. P. 1980. Sampling and Bayes inference in sci-
entific modeling and robustness (with discussion). Journal
of the Royal Statistical Society A143:383-430.
Box, G. E. P., W. G. Hunter, and J. S. Hunter. 1978. Statistics
for experimenters. John Wiley & Sons, New York, New
York, USA.
Chernoff, H., and I. R. Savage. 1958. Asymptotic Normality
and efficiency of certain nonparametric test statistics. An-
nals of Mathematical Statistics 29:972-994.
Cox, D. R. 1984. Interaction. International Statistical Review
52:1-31.
Cox, D. R., and P. McCullagh. 1982. Some aspects of anal-
ysis of covariance. (With discussion.) Biometrics 38:541-
561.
Cressie, N. A. C., and H. J. Whitford. 1986. How to use the
two-sample t test. Biometrical Journal 28:131-148.
Darwin, C. 1876. The effects of cross- and self-fertilization
in the vegetable kingdom. John Murray, London, England.
Day, R. W., and G. P. Quinn. 1989. Comparisons of treat-
ments after an analysis of variance in ecology. Ecological
Monographs 59:433-463.
Duncan, D. B. 1955. Multiple range and multiple F tests.
Biometrics 11: 1-42.
Efron, B. 1969. Student's t test under symmetry conditions.
Journal of the American Statistical Association 64:1278-
1302.
Finney, D. 1990. Letter to the editor. Biometrics Bulletin
7:2.
Fisher, R. A. 1960. The design of experiments. Seventh edi-
tion. Oliver and Boyd, Edinburgh, Scotland.
Fligner, M. A., and G. E. Policello II. 1981. Robust rank
procedures for the Behrens-Fisher problem. Journal of the
American Statistical Association 76:162-168.
Freedman, D., R. Pisani, R. Purves, and A. Adhikari. 1991.
Statistics. Norton, New York, New York, USA.
Garfield, E. 1990. The most-cited papers of all time, Science
Citation Index 1945-1988. Part 1B. Superstars new to the
SCI top 100. Current Contents 8:3-13.
Geary, R. C. 1947. Testing for Normality. Biometrika 34:
209-242.
Gross, A. M. 1976. Confidence interval robustness with
long-tailed distributions. Journal of the American Statis-
tical Association 71:409-416.
Hoaglin, D. C., F. Mosteller, and J. W. Tukey. 1991. Fun-
damentals of exploratory analysis of variance. John Wiley
& Sons, New York, New York, USA.
Hocking, R. R. 1982. Discussion of Cox and McCullagh
(1982). Biometrics 38:559-561.
Hodges, J. L., Jr., and E. L. Lehmann. 1956. The efficiency
of some nonparametric competitors of the t-test. Annals of
Mathematical Statistics 27:324-335.
Hoeffding, W. 1952. The large sample power of tests based
on the permutation of observations. Annals of Mathemat-
ical Statistics 23:169-192.
Huber, P. 1981. Robust statistics. John Wiley & Sons, New
York, New York, USA.
Kafadar, K. 1982. Using biweight m-estimates in the two-
sample problem. Part 1: symmetric populations. Commu-
nications in Statistics, Theoretical Methods 11: 1883-1901.
LeCam, L. 1977. A note on metastatistics or 'an essay toward
stating a problem in the doctrine of chances.' Synthese 36:
133-160.
Lehmann, E. L. 1975. Nonparametrics: statistical methods
based on ranks. Holden-Day, San Francisco, California,
USA.
Mallows, C. L. 1973. Some comments on Cp. Technometrics
15:661-675.
Mead, R. 1988. The design of experiments. Cambridge Uni-
versity Press, Cambridge, England.
Meehl, P. E. 1978. Theoretical risks and tabular asterisks:
Sir Karl, Sir Ronald, and the slow progress of soft psy-
chology. Journal of Consulting and Clinical Psychology
48:806-834.
Miller, R. G., Jr. 1981. Simultaneous statistical inference.
Springer-Verlag, New York, New York, USA.
Nelder, J. A. 1971. Contribution to the Discussion of O'Neill,
R. T., and B. G. Wetherill. 1971. The present state of mul-
tiple comparison methods. Journal of the Royal Statistical
Society, 'B', 33:218-241.
Nelder, J. A. 1994. Science-a teaching framework. Royal
Statistical Society News 21:1-2.
Olshen, R. A. 1973. The conditional level of the F test.
Journal of the American Statistical Association 68:692-
698.
Pitman, E. J. G. 1948. Lecture notes on nonparametric sta-
tistics. Columbia University, New York, New York, USA.
Posten, H. 1978. The robustness of the two-sample t-test
over the Pearson system. Journal of Statistical Computation
and Simulation 6:295-311.
. 1979. The robustness of the one-sample t-test over
the Pearson system. Journal of Statistical Computation and
Simulation 9:133-149.
Potvin, C., and D. A. Roff. 1993. Distribution-free and robust
statistical methods: viable alternatives to parametric sta-
tistics? Ecology 74:1617-1628.
Romano, J. P. 1990. On the behavior of randomization tests
without a group invariance assumption. Journal of the
American Statistical Association 85:686-692.
Rothstein, S. I. 1986. A test of optimality: egg recognition
in the eastern phoebe. Animal Behavior 34:1109-1119.
Saville, D. J. 1990. Multiple comparison procedures: the
practical solution. The American Statistician 44:174-180.
Scheffé, H. 1959. The analysis of variance. John Wiley &
Sons, New York, New York, USA.
Shaw, R. G., and T. Mitchell-Olds. 1993. ANOVA for un-
balanced data: an overview. Ecology 74:1638-1645.
Snedecor, G. W., and W. G. Cochran. 1989. Statistical meth-
ods. Eighth edition. Iowa State University, Ames, Iowa,
USA.
Sokal, R. R., and F. J. Rohlf. 1981. Biometry. Second edition.
Freeman, New York, New York, USA.
Speed, F. M., R. R. Hocking, and O. P. Hackney. 1978. Meth-
ods of analysis of linear models with unbalanced data. Jour-
nal of the American Statistical Association 73:105-112.
Tiku, M. L., W. Y. Tan, and N. Balakrishnan. 1986. Robust
inference. Marcel Dekker, New York, New York, USA.