Robustness to Failure of Assumptions of Tests
for a Common Slope Amongst Several Allometric Lines –
A Simulation Study
David I. Warton*
School of Mathematics and Statistics, University of New South Wales, NSW 2052, Australia
Received 14 October 2005, revised 3 February 2006, accepted 12 April 2006
Summary
In allometry, researchers are commonly interested in estimating the slope of the major axis or standar-
dized major axis (methods of bivariate line fitting related to principal components analysis). This study
considers the robustness of two tests for a common slope amongst several axes. It is of particular
interest to measure the robustness of these tests to slight violations of assumptions that may not be
readily detected in sample datasets. Type I error is estimated in simulations of data generated with
varying levels of nonnormality, heteroscedasticity and nonlinearity. The assumption failures introduced
in simulations were difficult to detect in a moderately sized dataset, with an expert panel only able to
correctly detect assumption violations 34–45% of the time.
While the common slope tests were robust to nonnormal and heteroscedastic errors from the line,
Type I error was inflated if the two variables were related in a slightly nonlinear fashion. Similar
results were also observed for the linear regression case. The common slope tests were more liberal
when the simulated data had greater nonlinearity, and this effect was more evident when the underlying
distribution had longer tails than the normal. This result raises concerns for common slopes testing, as
slight nonlinearities such as those in simulations are often undetectable in moderately sized datasets.
Consequently, practitioners should take care in checking for nonlinearity and interpreting the results of
a test for common slope. This work has implications for the robustness of inference in linear models in
general.
Key words: Analysis of covariance; Errors-in-variables model; Heteroscedasticity; Model II
regression; Nonlinearity; Standardized major axis.
1 Introduction
Allometry is a field of biology involving the study of size and its biological consequences (Reiss,
1989). For example, there has been much allometric research exploring the relationship between meta-
bolism rate and body mass across animal species (as reviewed by Darveau et al., 2002), and how seed
mass varies with seed number across plant species (Henery and Westoby, 2001). As a field, allometry
is sufficiently advanced that textbooks have been written on the subject (Reiss, 1989; Niklas, 1994).
It is commonly the case that the relationship between two size variables approximates a power
relation:
$$Y = aX^b$$
and it is appropriate to log transform variables and estimate a line of best fit (Niklas, 1994). There is
often particular interest in estimating and interpreting the value of the exponent (b), i.e. the slope of
the line of best fit on log-transformed axes.
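On log-log axes the power relation becomes the straight line $\log Y = \log a + b \log X$, so the exponent $b$ can be read off as a fitted slope. A minimal sketch in Python of the standardized major axis slope discussed below (the data, values $a = 2$, $b = 0.75$, and the function name are our own illustration, not from the paper):

```python
import numpy as np

def sma_slope(x, y):
    # SMA slope on log scales: sign(r) * sd(log y) / sd(log x),
    # i.e. the slope of the standardized major axis through the log data
    lx, ly = np.log(x), np.log(y)
    r = np.corrcoef(lx, ly)[0, 1]
    return np.sign(r) * np.std(ly, ddof=1) / np.std(lx, ddof=1)

# illustrative data following Y = a * X^b with a = 2, b = 0.75 (made-up values)
rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.0, size=500)
y = 2.0 * x**0.75 * rng.lognormal(sigma=0.1, size=500)
print(sma_slope(x, y))  # approximately 0.75 for this low-noise example
```

Note that with error in the data, the SMA slope estimates $\sqrt{\mathrm{Var}(\log Y)/\mathrm{Var}(\log X)}$, which is close to $b$ only when the scatter about the line is small relative to the spread along it.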
*Corresponding author: e-mail: David.Warton@unsw.edu.au, Phone: +61293857031, Fax: +61293857123
286 Biometrical Journal 49 (2007) 2, 286–299 DOI: 10.1002/bimj.200510263
#2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
The motivating example for this paper is presented in Figure 1. Leaf longevity and leaf mass per
area (roughly speaking, a measure of leaf thickness) are plotted on a logarithmic scale for plant spe-
cies from sites of contrasting rainfall. These two variables are known to be closely related, with thick-
er leaved species having leaves with longer lifespans, within communities sampled all over the globe
(Wright et al., 2004). Thicker leaves are more costly to the plant – a greater mass investment is made
per unit area of leaf available for light capture – but they tend to live longer, hence recouping the
higher costs of investment through a longer period of returns. The slope of the line of best fit for a
given community in Figure 1 suggests how great the additional return will be (longer lifespan) for a
given additional investment in the leaf (leaf mass per area). There is particular interest in whether
b = 1 for the data on log scales: if this were the case, then a doubling of mass investment in the leaf
would typically correspond to a doubling of lifespan.
The lines fitted to data in allometric studies are not for prediction of one variable from the other,
rather they are lines fitted to summarize the relationship between the two variables. As such, techniques related to principal components analysis are commonly recommended in preference to linear
regression (Rayner, 1985; Harvey and Pagel, 1991; Warton et al., 2006). The two main types of lines
that have been generally recommended in allometry are referred to as the major axis (MA) and stan-
dardized (or reduced) major axis (SMA). These are respectively the first principal component vector
of the variance matrix and of the correlation matrix (rescaled to the original axes) fitted through the
centroid of the data.
Often a study involves the fitting of several allometric lines, and in such cases it is of interest to
compare the slopes of these lines (recent examples include Westoby and Wright, 2003; King et al.,
2005; Tjoelker et al., 2005; Wright et al., 2005). In the case of Figure 1, the central question of the
study was whether the leaf longevity-leaf mass per area relationship was the same in communities of
Figure 1: Leaf longevity (years) vs leaf mass per area (kg m^-2) for woody plant species at sites of high rainfall (closed blue circles) and low rainfall (open red circles); both axes are on log scales. Data from Wright, Westoby, and Reich (2002), who were primarily interested in comparing the slopes of the lines of best fit.
contrasting rainfall, i.e. is there evidence of a difference in the slope of the relationship, across sites
with different environmental conditions?
The common principal component methodology of Flury (1984) can be used to test for common
slope of several major axes, best done using a Bartlett correction (Warton and Weber, 2002). A varia-
tion of this test for standardized major axes has also been developed (Warton and Weber, 2002). These
tests perform well when errors from the line have a normal distribution, but their performance in other
situations is unknown.
For example, consider the analysis of the data of Figure 1 using standardised major axes. The stan-
dardised major axis (SMA) slope for the high-rainfall data is 2.12 (95% confidence interval 1.45–
3.10) and the SMA slope for the low rainfall data is much shallower (1.18, 95% confidence interval
0.94–1.49). A likelihood ratio test (described in Section 2) testing for equality of the SMA slopes
returns the test statistic 6.61, with a P-value of 0.01 when compared to the $\chi^2_1$ distribution. Hence one might conclude that there is good evidence of a difference in slopes for high vs low rainfall sites, and that in higher rainfall environments, there is a greater benefit (in terms of leaf longevity) to having thicker leaves. However, diagnostic plots (Figure 2) suggest a possible departure from normality – the data for each site contain a few moderate outliers. Do these apparent departures from normality invalidate the above analysis, or are the inferential procedures robust to such violations of assumptions?

Figure 2: Diagnostic plots for the leaf allometry data of Figure 1. The first column contains residual vs fits plots, the second column contains qq plots of residuals against normal scores, for (a) the high rainfall site, (b) the low rainfall site. Note that there are a few moderate outliers – five of the 40 residuals have standardised scores of approximately 2 or −2.

Some preliminary bootstrap simulations suggested that the tests may indeed be sensitive to subtle violations of assumptions, which prompted this more systematic investigation.

The purpose of this paper is to use Monte Carlo simulation to assess the robustness of the common slope tests for major axes and standardized major axes to subtle failures of assumptions. Simulations will be conducted to measure the accuracy of the chi-square approximation of test statistics under nonnormal errors from the line, heteroscedasticity, and a nonlinear relationship between the two variables of interest. In the cases of heteroscedasticity and nonlinearity, only subtle violations will be considered, of a kind that might not be detected in a moderately sized dataset. These cases are of particular interest because if undetectable violations of assumptions have substantial effects on Type I error rates, then this presents a significant practical problem.

It might be expected that the common slope tests will behave similarly to the common slope test for linear regression. Consequently, we also consider linear regression in our simulations, for comparison. There is an extensive literature on the robustness of linear regression, which we review in Section 3.

In this paper, the test statistics of interest will be introduced (Section 2), then the relevant literature on robustness of linear regression will be reviewed (Section 3). Simulation conditions will be described (Section 4), then the results presented (Section 5) and discussed (Section 6).
2 Test Statistics
This section reviews the test statistics under consideration. Further details can be found in Warton and
Weber (2002).
Consider $g$ bivariate normal random samples, the $i$-th random sample consisting of $n_i$ pairs of observations $(x_i, y_i)$. The $g$ samples may have different means and covariance matrices. Define $N = \sum_{i=1}^{g} n_i$, the total number of observations across all $g$ samples.

2.1 Test for common major axis slope

Let $b_i$ be the slope of the major axis through data from the $i$-th group. The hypothesis test of interest is:

$$H_0: b_i = b_{MA} \quad \forall i \in \{1, 2, \ldots, g\} \qquad H_a: \text{otherwise.}$$

The maximum likelihood estimator of the common major axis slope $\hat{b}_{MA}$ satisfies

$$0 = \frac{1}{1 + \hat{b}_{MA}^2} \sum_{i=1}^{g} n_i \left( \frac{1}{\hat{F}_i(1,1)} - \frac{1}{\hat{F}_i(2,2)} \right) \hat{F}_i(1,2)$$

where $\hat{F}_i(j,k)$ is the $(j,k)$-th element of $\hat{F}_i$, the sample variance matrix of

$$(u_i \;\; v_i) = (1 + \hat{b}_{MA}^2)^{-\frac{1}{2}} \, (x_i \;\; y_i) \begin{pmatrix} 1 & -\hat{b}_{MA} \\ \hat{b}_{MA} & 1 \end{pmatrix}.$$

The variables $u_i$ and $v_i$ can be interpreted as measuring location of a point along the fitted line ("fitted axis scores") and distance from the fitted line ("residual scores"), respectively.

The Bartlett corrected likelihood ratio test statistic can be written as

$$-2 \log(L_{MA}) = -\sum_{i=1}^{g} (n_i - 2.5) \log(1 - r_{i,uv}^2)$$

where $r_{i,uv}$ is the sample correlation coefficient between $u_i$ and $v_i$. If major axis slopes of all $g$ groups are equal, $-2 \log(L_{MA})$ has an asymptotic chi-square distribution with $g - 1$ degrees of freedom, as the $n_i$ all become large.

Because $-2 \log(L_{MA})$ is a function of correlation coefficients, the assumptions of the test can be relaxed – the null distribution of the test statistic is essentially unchanged if only one of $u$ and $v$ is normally distributed, not both.
The Bartlett correction used in $-2 \log(L_{MA})$ was derived for the case when $b_{MA}$ is known, although this correction is useful when $b_{MA}$ is unknown also. Simulations have demonstrated that with this correction, Type I error of the test statistic is close to nominal levels when errors are normally distributed (Warton and Weber, 2002), even for small samples (each $n_i = 10$).
2.2 Test for common standardised major axis slope

In the case of standardized major axes, the $b_i$ are the slopes of the standardized major axes, and the hypothesis test of interest is:

$$H_0: b_i = b_{SMA} \quad \forall i \in \{1, 2, \ldots, g\} \qquad H_a: \text{otherwise.}$$

The common slope estimator $\hat{b}_{SMA}$ satisfies

$$0 = \frac{1}{2 |\hat{b}_{SMA}|} \sum_{i=1}^{g} n_i \left( \frac{1}{\hat{Y}_i(1,1)} + \frac{1}{\hat{Y}_i(2,2)} \right) \hat{Y}_i(1,2)$$

where $\hat{Y}_i(j,k)$ is the $(j,k)$-th element of $\hat{Y}_i$, the sample variance matrix of

$$(s_i \;\; t_i) = (2 |\hat{b}_{SMA}|)^{-\frac{1}{2}} \, (x_i \;\; y_i) \begin{pmatrix} \hat{b}_{SMA} & -\hat{b}_{SMA} \\ 1 & 1 \end{pmatrix}.$$

The variables $s_i$ and $t_i$ have the same interpretation as $u_i$ and $v_i$ do for the major axis case, i.e. they represent fitted axis and residual scores, respectively.

The Bartlett corrected likelihood ratio test statistic can be written as

$$-2 \log(L_{SMA}) = -\sum_{i=1}^{g} (n_i - 2.5) \log(1 - r_{i,st}^2).$$

As previously, $-2 \log(L_{SMA})$ has an asymptotic $\chi^2_{g-1}$ distribution when all standardized major axes are equal, only one of $s$ and $t$ needs to be normally distributed, and the Bartlett correction was derived assuming $b_{SMA}$ is known.
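The common SMA slope estimator can be sketched in the same style (again a rough illustration with a crude grid search and our own function names; the test statistic is then built from the correlations between the $s_i$ and $t_i$ scores exactly as in the major axis case):

```python
import numpy as np

def sma_scores(x, y, b):
    # fitted axis scores s and residual scores t for a (nonzero) slope b
    sc = 1.0 / np.sqrt(2.0 * abs(b))
    return sc * (b * x + y), sc * (y - b * x)

def sma_estimating_fn(b, groups):
    total = 0.0
    for x, y in groups:
        s, t = sma_scores(x, y, b)
        Y = np.cov(s, t)  # sample variance matrix of (s, t)
        total += len(x) * (1.0 / Y[0, 0] + 1.0 / Y[1, 1]) * Y[0, 1]
    return total / (2.0 * abs(b))

def common_sma_slope(groups):
    # crude grid search for the root (assumes a positive common slope)
    grid = np.linspace(0.05, 10.0, 4000)
    return grid[np.argmin([abs(sma_estimating_fn(b, groups)) for b in grid])]

rng = np.random.default_rng(2)
def make_group(n):
    u, v = rng.normal(size=n), 0.3 * rng.normal(size=n)
    return (u - v) / np.sqrt(2), (u + v) / np.sqrt(2)

b_hat = common_sma_slope([make_group(20), make_group(20)])
```

Since the covariance of $s$ and $t$ within a group is proportional to $\mathrm{Var}(y) - b^2 \mathrm{Var}(x)$, the root of the estimating function is a weighted compromise between the group SMA slopes.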
2.3 Relationship to the test for common linear regression slope

The standard F test for a common linear regression slope is equivalent to a maximum likelihood statistic, derived either assuming bivariate normality or (more usually) conditioning on the $x_i$ and assuming that $y_i | x_i$ is a realisation of normally distributed variables. It is assumed that the residual variance is common across all groups.

The common slope estimator $\hat{b}_{reg}$ can be shown to satisfy

$$0 = \sum_{i=1}^{g} (n_i - 1) \hat{W}_i(1,2)$$

where $\hat{W}_i(j,k)$ is the $(j,k)$-th element of $\hat{W}_i$, the sample variance matrix of

$$(x_i \;\; w_i) = (x_i \;\; y_i) \begin{pmatrix} 1 & -\hat{b}_{reg} \\ 0 & 1 \end{pmatrix}.$$

The variables $x_i$ and $w_i$ are analogous to $u_i$ and $v_i$ for the major axis case.

If $\hat{W}(j,k)$ is the pooled variance estimate from all groups:

$$\hat{W}(j,k) = \frac{1}{N - g} \sum_{i=1}^{g} (n_i - 1) \hat{W}_i(j,k)$$

and if $\hat{S}(j,k)$ is defined similarly as the pooled estimate of the variance of $(x_i \;\; y_i)$, then the F statistic can be written as

$$F_{reg} = \frac{\{\hat{S}(2,2) - \hat{W}(2,2)\} / (g - 1)}{\hat{W}(2,2) / (N - g)}.$$

This statistic is closely related to $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$. In particular, $F_{reg}$ is also equivalent to a likelihood ratio statistic, and the assumptions required in specifying the likelihood for $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ are also required for $F_{reg}$. However, $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ differ from the linear regression statistic $F_{reg}$ in several ways:

- The null distribution is asymptotic (with a Bartlett-type correction) rather than being exact, although it is known to maintain close to nominal levels in small sample sizes (Warton and Weber, 2002).
- There is no assumption of common residual variance for $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$.
- $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ are only likelihood ratio test statistics when data are bivariate normal. The tests are still valid when conditioning on the fitted axis scores, but they are no longer maximum likelihood. In contrast, when conditioning on $X$, the linear regression F statistic is still a maximum likelihood statistic.
- The fitted axis scores ($u_i$ and $s_i$) are a function of the estimated slope, which is not the case for linear regression.

The only obvious implication these differences have for robustness is that the null distributions for $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ should be unaffected by unequal residual variances, whereas this is not the case for linear regression.
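For comparison, the standard heterogeneity-of-slopes F test can be sketched in its familiar residual-sum-of-squares form (an illustration with our own function names, not the authors' code; it compares the fit under a common slope with the fit under separate slopes, with group-specific intercepts throughout):

```python
import numpy as np

def common_slope_F(groups):
    # compare residual sums of squares under a common slope vs separate slopes
    g = len(groups)
    centered = [(x - x.mean(), y - y.mean()) for x, y in groups]
    sxx = sum(x @ x for x, _ in centered)
    sxy = sum(x @ y for x, y in centered)
    b_common = sxy / sxx  # pooled (common) slope estimate
    rss_common = sum(((y - b_common * x) ** 2).sum() for x, y in centered)
    rss_separate = sum(((y - ((x @ y) / (x @ x)) * x) ** 2).sum()
                       for x, y in centered)
    N = sum(len(x) for x, _ in centered)
    F = ((rss_common - rss_separate) / (g - 1)) / (rss_separate / (N - 2 * g))
    return b_common, F  # F has an F(g - 1, N - 2g) distribution under equal slopes

rng = np.random.default_rng(3)
def make_group(n):
    x = rng.normal(size=n)
    return x, x + 0.3 * rng.normal(size=n)  # true slope one in both groups

b_common, F = common_slope_F([make_group(100), make_group(100)])
```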
3 Literature for Linear Regression
Whereas there has been little investigation of the robustness of common slope tests for the major axis
(MA) and standardised major axis (SMA), an extensive literature is available for linear regression.
There has been considerable attention to the question of robustness of analysis of covariance to
failure of assumptions, dating back to the 1950s. An often-cited review is Glass, Peckham, and Sanders (1972); a recent review that includes a meta-analysis of Monte Carlo simulations is Harwell (2003). Most of this work has considered the situation of testing for equal elevation, assuming common slope, rather than specifically investigating tests for common slope. However, research does
appear to confirm that tests for common slope share similar properties with ANCOVA (Wilcox, 1999;
Luh and Guo, 2000). Key results relevant to our situation, which have achieved a wide consensus in
the literature, can be summarised:
1. ANCOVA is robust to non-normality, unless the X-variable itself is highly non-normal. Power, on
the other hand, can be strongly affected by long-tailed distributions.
2. If sampling is balanced, the ANCOVA F test is robust to heteroscedasticity and indeed to most
failures of assumptions.
3. The ANCOVA F tests are sensitive to unequal error variances if sampling is unbalanced. This is
the well-known Behrens-Fisher problem, in which the sensitivity to unequal error variances in-
creases with greater imbalance in sampling.
4. The F statistic can be substantially biased if EðYjXÞ is non-linear (Atiqullah, 1964).
Surprisingly little work has considered the robustness of ANCOVA procedures to non-linearity (Har-
well, 2003), and we have found no work on this subject specifically for common slopes tests.
Warton and Weber (2002) argued that tests using $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ will be robust
to non-normality in one or the other of residuals and fitted axis scores. This claim was supported by
simulation results. No other work has considered the robustness of these tests. However, one might
expect similar levels of robustness to the linear regression case, given that the methods are closely
related (Warton and Weber, 2002).
4 Simulation Design
This study measured the effect of failure of assumptions on the null distribution of the common slope test statistics described above. The measure of interest was the observed rate at which critical values from the $\chi^2_{g-1}$ distribution were exceeded, at the 0.05 and 0.01 levels, estimated from 10000 sample datasets. It was desirable that observed Type I error be within a factor of two of nominal levels, and any Type I error rates outside of this range are of concern.

Three different types of failure of assumptions were considered:

- nonnormal errors from the line
- heteroscedasticity
- nonlinear relationship between variables

At first glance, it may seem inappropriate to consider the effects of nonlinearity and heteroscedasticity on a test statistic that assumes a linear relationship and homoscedastic errors. If the assumptions of the procedure aren't satisfied, why not use a more appropriate procedure? However, slight heteroscedasticity and slight nonlinearity can be difficult to detect in moderately sized samples. A test statistic needs to be robust to such minor violations to be of practical use. The detectability of the assumption violations considered in these simulations is estimated in Section 4.1 ("Detectability of assumption violations").

In all simulations, a total of 40 observations were generated in two groups, such that the true slope in each group is one. Data were generated according to the following steps:

1. Sampling design – Both balanced and unbalanced sampling designs were considered in simulations, where the sizes of the two samples $(n_1, n_2)$ were:
   (a) 20, 20
   (b) 10, 30

2. Distribution – Two pairs of random samples of sizes $n_1$ and $n_2$ were generated from iid variables $(U, V)$ according to a distribution with mean zero and variance one. The variables $U$ and $V$ have interpretations as the location along (or projection onto) the true line and the error from the line, respectively, analogous to $u_i$ and $v_i$ in Section 2.1. The distributions used in simulations were:
   (a) normal
   (b) normal mixture A – a 10:90 mixture of normal distributions centered at zero with variances four and one respectively. This variable was then rescaled to have variance one.
   (c) normal mixture B – a 10:90 mixture of normal distributions centered at zero with variances nine and one respectively. This variable was then rescaled to have variance one.
   These normal mixtures are quite long-tailed distributions; in fact the last distribution in the list above has a kurtosis coefficient of 8.3 (larger than for the double exponential distribution, or $t_\nu$ distributions with $\nu \ge 6$).

3. Heteroscedasticity – Heteroscedasticity was introduced by replacing $V$ with the variable $V' = s(k)^{-1} V (1 + kU)$, where $k$ is a constant controlling the amount of heteroscedasticity, and $s(k)^2 = \mathrm{Var}\{V(1 + kU)\} = 1 + k^2$, so that $\mathrm{Var}(V') = 1$. For $0 < k \le 0.2$, $\mathrm{Var}(V' \mid U = u)$ increases quadratically with $u$, since for this range of $k$, $P(1 + kU < 0)$ is negligible for the distributions considered here. The closer $k$ is to 0, the smaller the heteroscedasticity, and $k = 0$ is the homoscedastic case, $V' = V$. The levels of heteroscedasticity considered in simulations were:
   (a) none ($k = 0$)
   (b) $k = 0.1$
   (c) $k = 0.2$
   The range of values of $U$ typically observed in simulations was $-2 < U < 2$, which means that $\mathrm{Var}(V' \mid U)$ usually varied over a factor of $\{(1 + 2k)/(1 - 2k)\}^2$. Consequently, $\mathrm{Var}(V' \mid U)$ varied over a factor of approximately 2.3 when $k = 0.1$, or a factor of 5.4 when $k = 0.2$.

4. Nonlinearity – Nonlinearity was introduced using the function $V'' = s(c)^{-1}\{V' + c(U^2 - 1)\}$, where $c$ is a constant controlling the amount of nonlinearity, and $s(c)^2 = \mathrm{Var}\{V' + c(U^2 - 1)\} = 1 + c^2\{E(U^4) - 1\}$. $E(U^4) = 3$ if $U$ is normal, $E(U^4) \approx 4.4$ if $U$ is the (standardised) mixture with variances one and four, and $E(U^4) \approx 8.3$ for the normal mixture with variances one and nine. Notice that $E(V'') = 0$ and $E(V'' U) = 0$, so $V''$ and $U$ are uncorrelated, and both have mean zero and variance one. The following levels of nonlinearity were introduced:
   (a) none ($c = 0$)
   (b) $c = 0.1$
   (c) $c = 0.25$
   $E(V'' \mid U = 2) = E(V'' \mid U = -2) = 3c$ and $E(V'' \mid U = 0) = -c$ (up to the rescaling by $s(c)$), so given that $U$ was usually in the range $-2$ to 2 in simulations, $E(V'' \mid U)$ usually varied over a range of 0.4 when $c = 0.1$ and a range of 1 when $c = 0.25$.

5. Correlation structure – Correlation between the variables $X$ and $Y$ was varied in simulations, where the correlation was introduced by linear transformation. For MA and SMA simulations, the transformation used was:

$$(X \;\; Y) = (1 - \rho^2)^{-\frac{1}{2}} \, (U \;\; V'') \begin{pmatrix} \sqrt{1 + \rho} & \sqrt{1 + \rho} \\ -\sqrt{1 - \rho} & \sqrt{1 - \rho} \end{pmatrix}.$$

For linear regression simulations, the linear transformation was:

$$(X \;\; Y) = (U \;\; V'') \begin{pmatrix} 1 & 1 \\ 0 & \sqrt{1 - \rho^2}/\rho \end{pmatrix}.$$

In each case, this approach generated data that had a "true" slope of one and correlation $\rho$, because $U$ and $V''$ were uncorrelated and had equal variance irrespective of the level of heteroscedasticity or nonlinearity. The levels of correlation in the two groups were set to:
   (a) 0.4, 0.4
   (b) 0.8, 0.8
   (c) 0.4, 0.8

A total of 162 simulations were conducted for MA and SMA, considering all possible combinations of the two sampling designs, three sampling distributions, three levels of heteroscedasticity, three levels of nonlinearity and three levels of correlation. These simulations were repeated for linear regression for the purposes of comparison.
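The data-generation steps above can be sketched as a single generator function (an illustration under the stated design, with our own names; the overall scaling constant in the MA/SMA rotation is omitted since it affects neither the slope nor the correlation, and the mixture is parameterised by the larger component variance `mix_var`):

```python
import numpy as np

def simulate_group(n, rho, k=0.0, c=0.0, mix_var=1.0, rng=None):
    # one sample of (X, Y) with true MA/SMA slope one and correlation rho;
    # k controls heteroscedasticity, c nonlinearity, and mix_var the larger
    # variance of a 10:90 normal mixture (mix_var = 1 gives normal data)
    rng = rng if rng is not None else np.random.default_rng()

    def draw():
        z = rng.normal(size=n)
        if mix_var > 1.0:
            big = rng.random(n) < 0.1
            z = np.where(big, z * np.sqrt(mix_var), z)
            z = z / np.sqrt(0.1 * mix_var + 0.9)  # rescale to variance one
        return z

    U, V = draw(), draw()
    Vp = V * (1.0 + k * U) / np.sqrt(1.0 + k**2)  # heteroscedastic errors
    eu4 = 3.0 * (0.1 * mix_var**2 + 0.9) / (0.1 * mix_var + 0.9) ** 2  # E(U^4)
    Vpp = (Vp + c * (U**2 - 1.0)) / np.sqrt(1.0 + c**2 * (eu4 - 1.0))  # nonlinearity
    # rotate so that X and Y have equal variances and correlation rho
    X = np.sqrt(1.0 + rho) * U - np.sqrt(1.0 - rho) * Vpp
    Y = np.sqrt(1.0 + rho) * U + np.sqrt(1.0 - rho) * Vpp
    return X, Y

X, Y = simulate_group(20, rho=0.4, k=0.1, c=0.1, mix_var=9.0,
                      rng=np.random.default_rng(4))
```

Because $U$ and $V''$ remain uncorrelated with equal variance under every setting of $k$, $c$ and `mix_var`, the generated data always have correlation close to $\rho$ and MA/SMA slope close to one in large samples.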
4.1 Detectability of assumption violations
The detectability of the above violations was estimated using an expert panel of six statisticians. If
these assumption violations are difficult to detect, then any sensitivity to these assumption violations is
problematic for the practical application of the tests.
Table 1: Detectability of assumption violations used in simulations. Detectability was estimated as the proportion of correctly detected assumption violations, from an expert panel of six statisticians. Each statistician was presented with 12 datasets generated with two or fewer assumption violations. Results are presented pooling across all datasets with (a) any assumption violation, (b) only major assumption violations (normal mixture B, k = 0.2 or c = 0.25).

    Assumption                # violations   # detected   % detected
    (a) Non-normality              33            15          45%
        Heteroscedasticity         37            13          35%
        Non-linearity              32            11          34%
    (b) Non-normality              15             7          47%
        Heteroscedasticity         20             8          40%
        Non-linearity              19             8          42%
The six statisticians on the expert panel were statistics lecturers at the University of New South
Wales. Each had at least five years' experience in applied statistics and regression modelling.
Seventy-two datasets were generated using the simulation methods described earlier in Section 4,
such that each dataset violated no more than two of the assumptions of linearity, homoscedasticity and
normality. The six statisticians on the expert panel were shown twelve of these datasets each, with
appropriate diagnostic plots. The statisticians were asked which (if any) assumptions were violated,
for each dataset. An example of the information presented to the statisticians is given in Figure 3.
On the whole, it can be concluded that these assumption violations were indeed difficult to detect
(Table 1). Non-linearity was only detected in 34% of the datasets in which it was present, heterosce-
dasticity was only detected in 35%, and non-normality in 45%. Even in the more extreme cases of
assumption violation (i.e. normal mixture B, heteroscedasticity with k = 0.2, non-linearity with c = 0.25), violations were detected less than half the time. This clearly demonstrates that the levels of
Figure 3: Minor violations of assumptions often cannot be detected. Each row contains diagnostic plots for a dataset – the first column is a plot of Y vs X, the second column is a standardised residuals vs fits plot, the last column is a normal quantile plot of standardised residuals. Each dataset contains 40 observations (with correlation 0.4), generated as in simulations. One plot displays bivariate normal data, one displays bivariate normal data with heterogeneity introduced (with k = 0.2, see description of simulation for more details), and one plot displays bivariate normal data with nonlinearity introduced (using c = 0.25). Which plot is which?
assumption violation considered in these simulations are not easy to detect in practice for moderately sized samples, and so any lack of robustness to these assumption violations will be problematic.
5 Results
The approximations of $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ to a chi-square distribution had impressive robustness to nonnormality and heteroscedasticity. This is demonstrated in Table 2 and Table 3 at the 0.05 and 0.01 levels, respectively, for data with a correlation of 0.4. In the absence of nonlinearity
Table 2: Observed Type I error rates at the 0.05 level for common slope tests of major axes ($b_{MA}$), standardized major axes ($b_{SMA}$) and linear regressions ($b_{reg}$). Simulated data were distributed as (a) bivariate normal, (b) bivariate mixture of normals with variances one and four, (c) bivariate mixture of normals with variances one and nine. Heteroscedasticity increased with k, and nonlinearity with c. In all cases, Type I error was estimated from 10000 datasets, and each dataset consisted of two samples of 20 observations each, with correlation 0.4. See methods for further details.

                         b_MA                      b_SMA                     b_reg
    k:           0      0.1    0.2         0      0.1    0.2         0      0.1    0.2
    (a) c=0      0.040  0.047  0.050      0.046  0.053  0.058      0.054  0.051  0.058
        c=0.1    0.047  0.052  0.054      0.054  0.062  0.064      0.055  0.054  0.066
        c=0.25   0.077  0.078  0.082      0.092  0.092  0.096      0.085  0.091  0.091
    (b) c=0      0.040  0.044  0.050      0.046  0.051  0.058      0.050  0.054  0.061
        c=0.1    0.054  0.059  0.067      0.067  0.074  0.078      0.066  0.069  0.079
        c=0.25   0.114  0.118  0.124      0.140  0.143  0.149      0.133  0.138  0.145
    (c) c=0      0.039  0.043  0.052      0.049  0.052  0.066      0.051  0.059  0.062
        c=0.1    0.079  0.082  0.092      0.105  0.107  0.118      0.100  0.100  0.101
        c=0.25   0.198  0.200  0.203      0.246  0.248  0.249      0.221  0.224  0.235
Table 3: Observed Type I error rates at the 0.01 level for common slope tests of major axes ($b_{MA}$), standardized major axes ($b_{SMA}$) and linear regressions ($b_{reg}$). Simulated data were distributed as (a) bivariate normal, (b) bivariate mixture of normals with variances one and four, (c) bivariate mixture of normals with variances one and nine. Heteroscedasticity increased with k, and nonlinearity with c. In all cases, Type I error was estimated from 10000 datasets, and each dataset consisted of two samples of 20 observations each, with correlation 0.4. See methods for further details.

                         b_MA                      b_SMA                     b_reg
    k:           0      0.1    0.2         0      0.1    0.2         0      0.1    0.2
    (a) c=0      0.008  0.008  0.009      0.011  0.012  0.011      0.011  0.010  0.010
        c=0.1    0.009  0.009  0.009      0.012  0.012  0.013      0.010  0.011  0.016
        c=0.25   0.018  0.020  0.021      0.024  0.027  0.030      0.026  0.027  0.028
    (b) c=0      0.006  0.008  0.010      0.010  0.011  0.013      0.010  0.011  0.012
        c=0.1    0.009  0.013  0.015      0.015  0.018  0.021      0.016  0.017  0.020
        c=0.25   0.039  0.040  0.041      0.055  0.056  0.058      0.053  0.054  0.058
    (c) c=0      0.008  0.007  0.012      0.010  0.012  0.019      0.012  0.015  0.019
        c=0.1    0.022  0.023  0.030      0.036  0.036  0.046      0.033  0.034  0.036
        c=0.25   0.100  0.098  0.099      0.137  0.134  0.136      0.114  0.114  0.127
(c = 0), the common slope tests became slightly more liberal with increasing heteroscedasticity, but remained close to nominal levels. In these cases observed Type I error at the 0.05 level was always in the range 0.039–0.066, and at the 0.01 level was in the range 0.006–0.013, except for one value at 0.019. These results are impressive in the sense that in some simulations, there was a considerable departure from assumptions of normally distributed errors (the normal mixtures were quite long tailed) and homoscedastic errors (when k = 0.2, the variance of errors from the line changed by more than a factor of 5).

The approximations to a chi-square distribution were not, however, robust to nonlinearity. The common slope tests were more liberal when the simulated data had greater nonlinearity, and this effect was more evident when the underlying distribution had longer tails. For normally distributed data, Type I error usually remained within an acceptable range, although for the standardized major axis test it exceeded 0.02 when α = 0.01 for the most nonlinear case considered (c = 0.25). In contrast, in simulations of the normal mixture with variances 1 and 9, Type I error always exceeded 0.098 when α = 0.01 and c = 0.25. The common slope tests were unacceptably liberal in simulations using either normal mixture when c = 0.25, and also in the longer tailed of the normal mixtures when c = 0.1.

Results were similar for all three tests considered, and were similar for balanced and unbalanced sampling designs, with one exception – the linear regression statistic when correlation differed for the two groups (Table 4). This corresponds to the well-known case in which error variances differ, and so linear regression statistics perform poorly when sample sizes are unbalanced (Glass et al., 1972). For balanced designs, the linear regression statistic maintained close to nominal levels when correlation differed; however for unbalanced designs, this statistic was highly inflated (Table 4). In contrast, $-2 \log(L_{MA})$ and $-2 \log(L_{SMA})$ were unaffected by sampling design, and maintained close to nominal levels for different combinations of sampling design and correlation, provided that there was no nonlinearity. It can also be seen in Table 4 that the effect of heteroscedasticity was unchanged by the level of correlation.
6 Discussion
The simulations conducted here are not encouraging with respect to nonlinearity – they demonstrate that slight nonlinearity can lead to unacceptable Type I error inflation, even when the nonlinearity is sufficiently subtle that it would often go undetected in a sample dataset of 40 observations.
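To illustrate how subtle such a violation can be, the data-generating process can be sketched in a few lines. The 90/10 mixing weight, and the use of the same mixture for both X and the errors, are assumptions made here for illustration and are not taken from this section:

```python
import numpy as np

def simulate_sample(n=40, c=0.25, seed=1):
    """Sketch of the simulation design: a long-tailed normal mixture
    (component variances 1 and 9; the 90/10 mixing weight is an
    assumption) with the nonlinear signal Y = c*(X**2 - 1) + X."""
    rng = np.random.default_rng(seed)

    def mixture(size, p=0.9):
        # with probability p draw from N(0, 1), otherwise from N(0, 9)
        heavy = rng.random(size) >= p
        return np.where(heavy, rng.normal(0.0, 3.0, size),
                        rng.normal(0.0, 1.0, size))

    x = mixture(n)
    y = c * (x**2 - 1.0) + x + mixture(n)  # nonlinear mean plus error
    return x, y
```

With c = 0.25 and n = 40, a scatterplot of such a sample is typically hard to distinguish from a linear relationship, which is the point made above.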
296 D. I. Warton: Robustness of Common Slope Tests
Table 4  Observed Type I error rates of common slope tests as a function of heteroscedasticity and level of correlation. Tabulated results are the Type I error rates at the 0.05 level for common slope tests of major axes (b_MA), standardized major axes (b_SMA) and linear regression slopes (b_reg), when there is no nonlinearity (c = 0), for a bivariate mixture of normals with variances one and nine. Sample sizes are (a) balanced, (20, 20); (b) unbalanced, (10, 30). Heteroscedasticity increased with k, and three different levels of correlation were considered. In all cases, Type I error was estimated from 10000 datasets. See methods for further details.

                            b_MA                   b_SMA                  b_reg
    k:                 0     0.1    0.2       0     0.1    0.2       0     0.1    0.2
(a) ρ = (0.4, 0.4)   0.039  0.043  0.052   0.049  0.052  0.066   0.051  0.059  0.062
    ρ = (0.4, 0.8)   0.050  0.048  0.056   0.051  0.049  0.058   0.055  0.057  0.069
    ρ = (0.8, 0.8)   0.044  0.051  0.060   0.045  0.052  0.063   0.055  0.054  0.060
(b) ρ = (0.4, 0.4)   0.042  0.040  0.050   0.052  0.048  0.062   0.052  0.058  0.061
    ρ = (0.4, 0.8)   0.050  0.051  0.054   0.051  0.054  0.057   0.180  0.178  0.185
    ρ = (0.8, 0.8)   0.045  0.052  0.050   0.047  0.054  0.054   0.052  0.061  0.057
© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.biometrical-journal.com
A further cause for concern is that resampling methods will not help in robustifying inference. In discussing a similar problem, Freedman (1986) said “if [the estimator of the stochastic model] is silly, the bootstrap cannot work: like any statistical procedure, the bootstrap is model dependent. (This is the statistics version of the no free lunch principle.)” Using permutation tests or bootstrapping cannot be expected to improve the properties of a test assuming linearity and homoscedasticity in a nonlinear or heteroscedastic context. Standard resampling methods for this test implicitly assume linearity – whether freely permuting residuals under the reduced model (Anderson and Robinson, 2001), or bootstrapping after rotating data to reflect H0 (Hall and Wilson, 1991). In fact, the above simulations have been repeated using resampling-based tests, and Type I error rates remained similar to those in Tables 2–4.
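The residual-permutation scheme of Anderson and Robinson (2001) can be sketched as follows. This is a simplified illustration using ordinary regression slopes for two groups (the function names and the choice of test statistic are assumptions, not taken from the paper); note that the reconstruction of Y from fitted values plus permuted residuals is exactly where linearity is implicitly assumed:

```python
import numpy as np

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    xd = x - x.mean()
    return xd @ (y - y.mean()) / (xd @ xd)

def common_slope_perm_test(x1, y1, x2, y2, n_perm=999, rng=None):
    """Permutation test for equality of two regression slopes,
    permuting residuals under the reduced (common slope) model."""
    rng = np.random.default_rng(rng)
    # observed statistic: absolute difference in group slopes
    t_obs = abs(slope(x1, y1) - slope(x2, y2))
    # reduced model: one pooled slope, separate intercepts
    xc = np.concatenate([x1 - x1.mean(), x2 - x2.mean()])
    yc = np.concatenate([y1 - y1.mean(), y2 - y2.mean()])
    b = (xc @ yc) / (xc @ xc)
    fitted = np.concatenate([y1.mean() + b * (x1 - x1.mean()),
                             y2.mean() + b * (x2 - x2.mean())])
    resid = np.concatenate([y1, y2]) - fitted
    n1 = len(x1)
    count = 1  # include the observed statistic itself
    for _ in range(n_perm):
        ystar = fitted + rng.permutation(resid)  # linearity assumed here
        t = abs(slope(x1, ystar[:n1]) - slope(x2, ystar[n1:]))
        count += t >= t_obs
    return count / (n_perm + 1)
```

If the true relationship is nonlinear, both `fitted` and `resid` are computed from a misspecified model, so the permutation distribution inherits the misspecification – which is why resampling offers no protection here.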
It makes intuitive sense that a common slopes test might be sensitive to nonlinearity. The argument is demonstrated schematically in Figure 4. By definition, if the relationship between two variables is nonlinear, then the gradient of the relationship is not the same at all points along the function relating the variables. Due to sampling error, the locations of different samples will always differ, leading to different slopes of the best-fitting lines under nonlinearity. More specifically, the underlying nonlinear function used in these simulations had the form Y = c(X² − 1) + X. The gradient of this function is dY/dX = 2cX + 1, so if the means of two samples of X differ by d, then the difference in gradients at these points (and hence the approximate difference in slopes of the best-fitting lines) is 2cd.
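The 2cd approximation is easy to verify numerically. The check below fits least-squares slopes to two large noise-free samples of X centered d apart; ordinary regression slopes are used for simplicity, though the gradient effect is the same in spirit for MA and SMA slopes:

```python
import numpy as np

def fitted_slope(mu, c, n=200000, seed=0):
    """Large-sample least-squares slope for noise-free data from
    Y = c*(X**2 - 1) + X, with X drawn around mean mu."""
    x = np.random.default_rng(seed).normal(mu, 1.0, n)
    y = c * (x**2 - 1.0) + x
    xd = x - x.mean()
    return xd @ (y - y.mean()) / (xd @ xd)

c, d = 0.25, 1.0
diff = fitted_slope(d, c) - fitted_slope(0.0, c)
# difference in best-fitting slopes is approximately 2*c*d = 0.5
```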
Results suggest that it is worthwhile checking data for nonnormal errors as well as checking carefully for nonlinearity. While the tests were robust to nonnormality, the effect of nonlinearity was considerably greater in the presence of nonnormality (Tables 2–3). When errors were normally distributed, the tests had acceptable levels of robustness to nonlinearity, and so a practitioner who does not observe long-tailed data can feel reasonably confident in the results of a common slopes test. However,
Biometrical Journal 49 (2007) 2, 297
Figure 4  The generation of iid bivariate samples of apparently different slope from a non-linear function. Any two samples of data will be centered on different locations along the function relating the two variables, due to sampling error, leading to different slopes for lines fitted at different locations. (The data on this figure have been generated with different locations to emphasize the point.)
caution needs to be exercised in interpreting common slopes tests of data from long-tailed distribu-
tions, because of the possibility of undetected non-linearity (Table 1).
For example, consider the dataset of Figure 1. In Figure 2 there was a suggestion that the data were
long-tailed compared to the normal distribution, and there was no evidence of non-linearity. Given the
low P-value of 0.01, we can be reasonably confident that there is indeed a difference in slopes
between high and low rainfall sites. However, due to the disconcerting possibility of undetected non-
linearity, this result should be interpreted cautiously.
Finally, this study has implications for fitting linear models in general – while the central limit
theorem ensures robustness to nonnormality, and balanced sampling can ensure a certain level of
robustness to heteroscedasticity, nonlinearity can throw a cat amongst the pigeons.
Acknowledgements
Thanks to Yanan Fan, Inge Koch, Sue Middleton, David Nott, Scott Sisson and Matt Wand for their assumption checking and for comments on the manuscript. Thanks also to anonymous reviewers and editors for comments on the manuscript, and to Ian Wright for the use of his leaf longevity data.
References
Anderson, M. J. and Robinson, J. (2001). Permutation tests for linear models. Australian and New Zealand Jour-
nal of Statistics 43, 75–88.
Atiqullah, M. (1964). The robustness of the covariance analysis of a one-way classification. Biometrika 51, 365–
372.
Darveau, C., Suarez, R. K., Andrews, R. D., and Hochachka, P. W. (2002). Allometric cascade as a unifying
principle of body mass effects on metabolism. Nature 417, 166–170.
Flury, B. N. (1984). Common principal components in k groups. Journal of the American Statistical Association
79, 892–898.
Freedman, D. A. (1986). Discussion: Jackknife, bootstrap and other resampling methods in regression analysis.
The Annals of Statistics 14, 1305–1308.
Glass, G. V., Peckham, P. D., and Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying
the fixed effects analyses of variance and covariance. Review of Educational Research 42, 237–288.
Hall, P. and Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757–762.
Harvey, P. H. and Pagel, M. D. (1991). The comparative method in evolutionary biology. Oxford University Press,
Oxford.
Harwell, M. (2003). Summarizing Monte Carlo results in methodological research: the single-factor, fixed-effects ANCOVA case. Journal of Educational and Behavioral Statistics 28, 45–70.
Henery, M. L. and Westoby, M. (2001). Seed mass and seed nutrient content as predictors of seed output variation
between species. Oikos 92, 479–490.
King, D. A., Davies, S. J., Supardi, M. N. N., and Tan, S. (2005). Tree growth is related to light interception and
wood density in two mixed dipterocarp forests of Malaysia. Functional Ecology 19, 445–453.
Luh, W. M. and Guo, J. H. (2000). Approximate transformation trimmed mean methods to the test of simple
linear regression slope equality. Journal of Applied Statistics 27, 843–857.
Niklas, K. J. (1994). Plant Allometry: The Scaling of Form and Process. University of Chicago Press, Chicago.
Rayner, J. M. V. (1985). Linear relations in biomechanics: the statistics of scaling functions. Journal of Zoology,
Ser. A 206, 415–439.
Reiss, M. J. (1989). The allometry of growth and reproduction. Cambridge University Press, Cambridge.
Tjoelker, M. G., Craine, J. M., Wedin, D., Reich, P. B., and Tilman, D. (2005). Linking leaf and root trait syn-
dromes among 39 grassland and savannah species. New Phytologist 167, 493–508.
Warton, D. I. and Weber, N. C. (2002). Common slope tests for errors-in-variables models. Biometrical Journal
44, 161–174.
Warton, D. I., Wright, I. J., Falster, D. S., and Westoby, M. (2006). Bivariate line-fitting methods for allometry.
Biological Reviews 81, 259–291.
Westoby, M. and Wright, I. (2003). The leaf size-twig size spectrum and its relationship to other important spec-
tra of variation among species. Oecologia 135, 621–628.
Wilcox, R. R. (1999). Testing hypotheses about regression parameters when the error term is heteroscedastic.
Biometrical Journal 41, 411–426.
Wright, I. J., Reich, P. B., Cornelissen, J. H. C., Falster, D. S., Garnier, E., Hikosaka, K., Lamont, B. B., Lee, W.,
Oleksyn, J., Osada, N., Poorter, H., Villar, R., Warton, D. I., and Westoby, M. (2005). Assessing the general-
ity of global leaf trait relationships. New Phytologist 166, 485–496.
Wright, I. J., Reich, P. B., Westoby, M., Ackerly, D. D., Baruch, Z., Bongers, F., Cavender-Bares, J., Chapin, T.,
Cornelissen, J. H. C., Diemer, M., Flexas, J., Garnier, E., Groom, P. K., Gulias, J., Hikosaka, K., Lamont,
B. B., Lee, T., Lee, W., Lusk, C., Midgley, J. J., Navas, M. L., Niinemets, U., Oleksyn, J., Osada, N., Poor-
ter, H., Poot, P., Prior, L., Pyankov, V. I., Roumet, C., Thomas, S. C., Tjoelker, M. G., Veneklaas, E. J., and
Villar, R. (2004). The worldwide leaf economics spectrum. Nature 428, 821–827.
Wright, I. J., Westoby, M., and Reich, P. B. (2002). Convergence towards higher leaf mass per area in dry and
nutrient-poor habitats has different consequences for leaf lifespan. Journal of Ecology 90, 534–543.