# Robustness to failure of assumptions of tests for a common slope amongst several allometric lines--a simulation study.

**ABSTRACT** In allometry, researchers are commonly interested in estimating the slope of the major axis or standardized major axis (methods of bivariate line fitting related to principal components analysis). This study considers the robustness of two tests for a common slope amongst several axes. It is of particular interest to measure the robustness of these tests to slight violations of assumptions that may not be readily detected in sample datasets. Type I error is estimated in simulations of data generated with varying levels of nonnormality, heteroscedasticity and nonlinearity. The assumption failures introduced in simulations were difficult to detect in a moderately sized dataset, with an expert panel only able to correctly detect assumption violations 34–45% of the time. While the common slope tests were robust to nonnormal and heteroscedastic errors from the line, Type I error was inflated if the two variables were related in a slightly nonlinear fashion. Similar results were also observed for the linear regression case. The common slope tests were more liberal when the simulated data had greater nonlinearity, and this effect was more evident when the underlying distribution had longer tails than the normal. This result raises concerns for common slopes testing, as slight nonlinearities such as those in simulations are often undetectable in moderately sized datasets. Consequently, practitioners should take care in checking for nonlinearity and interpreting the results of a test for common slope. This work has implications for the robustness of inference in linear models in general.


David I. Warton*

School of Mathematics and Statistics, University of New South Wales, NSW 2052, Australia

Received 14 October 2005, revised 3 February 2006, accepted 12 April 2006


Key words: Analysis of covariance; Errors-in-variables model; Heteroscedasticity; Model II regression; Nonlinearity; Standardized major axis.

1 Introduction

Allometry is a field of biology involving the study of size and its biological consequences (Reiss, 1989). For example, there has been much allometric research exploring the relationship between metabolic rate and body mass across animal species (as reviewed by Darveau et al., 2002), and how seed mass varies with seed number across plant species (Henery and Westoby, 2001). As a field, allometry is sufficiently advanced that textbooks have been written on the subject (Reiss, 1989; Niklas, 1994).

It is commonly the case that the relationship between two size variables approximates a power relation:

$$Y = \alpha X^{\beta}$$

and it is appropriate to log transform variables and estimate a line of best fit (Niklas, 1994). There is often particular interest in estimating and interpreting the value of the exponent ($\beta$), i.e. the slope of the line of best fit on log-transformed axes.
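As a brief worked step (added here for illustration, writing the multiplier as $\alpha$ and the exponent as $\beta$; the base of the logarithm is arbitrary), taking logs of the power relation shows why the exponent becomes a slope:

```latex
\begin{align*}
Y &= \alpha X^{\beta} \\
\log Y &= \log \alpha + \beta \log X
\end{align*}
```

On log-transformed axes the relation is therefore a straight line with slope $\beta$ and elevation $\log \alpha$.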

*Corresponding author: e-mail: David.Warton@unsw.edu.au, Phone: +61293857031, Fax: +61293857123

Biometrical Journal 49 (2007) 2, 286–299 DOI: 10.1002/bimj.200510263

© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

The motivating example for this paper is presented in Figure 1. Leaf longevity and leaf mass per area (roughly speaking, a measure of leaf thickness) are plotted on a logarithmic scale for plant species from sites of contrasting rainfall. These two variables are known to be closely related, with thicker leaved species having leaves with longer lifespans, within communities sampled all over the globe (Wright et al., 2004). Thicker leaves are more costly to the plant (a greater mass investment is made per unit area of leaf available for light capture), but they tend to live longer, hence recouping the higher costs of investment through a longer period of returns. The slope of the line of best fit for a given community in Figure 1 suggests how great the additional return will be (longer lifespan) for a given additional investment in the leaf (leaf mass per area). There is particular interest in whether $\beta \approx 1$ for the data on log scales: if this were the case, then a doubling of mass investment in the leaf would typically correspond to a doubling of lifespan.

The lines fitted to data in allometric studies are not for prediction of one variable from the other; rather, they are lines fitted to summarize the relationship between the two variables. As such, techniques related to principal components analysis are commonly recommended in preference to linear regression (Rayner, 1985; Harvey and Pagel, 1991; Warton et al., 2006). The two main types of lines that have been generally recommended in allometry are referred to as the major axis (MA) and standardized (or reduced) major axis (SMA). These are respectively the first principal component vector of the variance matrix and of the correlation matrix (rescaled to the original axes), fitted through the centroid of the data.
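To make the two definitions concrete, here is a minimal Python sketch, assuming NumPy is available. The function names `ma_slope` and `sma_slope` are illustrative and do not come from the paper. The sketch relies on two standard facts: the MA direction is the leading eigenvector of the sample variance matrix, and the leading eigenvector of a 2×2 correlation matrix, rescaled to the original axes, has slope equal to the ratio of standard deviations with the sign of the covariance.

```python
import numpy as np


def ma_slope(x, y):
    """Major axis slope: direction of the leading eigenvector of the variance matrix."""
    variance_matrix = np.cov(x, y)
    # eigh returns eigenvalues in ascending order, so the last column leads.
    _, eigenvectors = np.linalg.eigh(variance_matrix)
    lead = eigenvectors[:, -1]
    return lead[1] / lead[0]


def sma_slope(x, y):
    """Standardized major axis slope: sign(cov) * sd(y) / sd(x)."""
    variance_matrix = np.cov(x, y)
    sign = np.sign(variance_matrix[0, 1])
    return sign * np.sqrt(variance_matrix[1, 1] / variance_matrix[0, 0])


# Points lying exactly on a line of slope 2: both methods recover the slope.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0
ma_demo = ma_slope(x_demo, y_demo)
sma_demo = sma_slope(x_demo, y_demo)
```

When the points scatter around the line, the two slopes generally differ from each other and from the least squares regression slope.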

Often a study involves the fitting of several allometric lines, and in such cases it is of interest to compare the slopes of these lines (recent examples include Westoby and Wright, 2003; King et al., 2005; Tjoelker et al., 2005; Wright et al., 2005). In the case of Figure 1, the central question of the study was whether the leaf longevity-leaf mass per area relationship was the same in communities of contrasting rainfall, i.e. is there evidence of a difference in the slope of the relationship, across sites with different environmental conditions?

Figure 1 Leaf longevity vs leaf mass per area for woody plant species at sites of high rainfall (closed blue circles) and low rainfall (open red circles). Data from Wright, Westoby, and Reich (2002), who were primarily interested in comparing the slopes of the lines of best fit. [Plot not reproduced; axes: Leaf Mass per Area (kg m⁻²) and Leaf Longevity (years), both on log scales.]

The common principal component methodology of Flury (1984) can be used to test for common slope of several major axes, best done using a Bartlett correction (Warton and Weber, 2002). A variation of this test for standardized major axes has also been developed (Warton and Weber, 2002). These tests perform well when errors from the line have a normal distribution, but their performance in other situations is unknown.

For example, consider the analysis of the data of Figure 1 using standardised major axes. The standardised major axis (SMA) slope for the high-rainfall data is 2.12 (95% confidence interval 1.45–3.10) and the SMA slope for the low rainfall data is much shallower (1.18, 95% confidence interval 0.94–1.49). A likelihood ratio test (described in Section 2) testing for equality of the SMA slopes returns the test statistic 6.61, with a P-value of 0.01 when compared to the $\chi^2_1$ distribution. Hence one might conclude that there is good evidence of a difference in slopes for high vs low rainfall sites, and that in higher rainfall environments, there is a greater benefit (in terms of leaf longevity) to having thicker leaves. However, diagnostic plots (Figure 2) suggest a possible departure from normality: data for each site contain a few moderate outliers. Do these apparent departures from normality invalidate the above analysis, or are the inferential procedures robust to such violations of assumptions?

Figure 2 Diagnostic plots for the leaf allometry data of Figure 1. The first column contains residual vs fits plots, the second column contains qq plots of residuals against normal scores, for (a) the high rainfall site, (b) the low rainfall site. Note that there are a few moderate outliers: five of the 40 residuals have standardised scores of approximately 2 or −2. [Plots not reproduced; panels are "std(res) vs fits" (standardised residuals against fitted axis scores) and "qq plot" (standardised residuals against Z scores).]

Some preliminary bootstrap simulations suggested that the tests may indeed be sensitive to subtle violations of assumptions, which prompted this more systematic investigation.

The purpose of this paper is to use Monte Carlo simulation to assess the robustness to subtle failures of assumptions of the common slope tests for major axes and standardized major axes. Simulations will be conducted to measure the accuracy of the chi square approximation of test statistics under nonnormal errors from the line, heteroscedasticity, and a nonlinear relationship between the two variables of interest. In the cases of heteroscedasticity and nonlinearity, only subtle violations will be considered that might not be detected based on a moderately sized dataset. These cases are of particular interest because if undetectable violations of assumptions have substantial effects on Type I error rates, then this presents a significant practical problem.
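The Monte Carlo idea described above can be sketched generically in Python, assuming NumPy and SciPy. This is not the paper's simulation code: `estimate_type1_error`, `toy_simulate` and `toy_statistic` are hypothetical names, and the toy statistic is a stand-in with a known chi square null distribution rather than one of the common slope tests.

```python
import numpy as np
from scipy.stats import chi2


def estimate_type1_error(simulate, statistic, df, alpha=0.05, n_sims=10000, seed=0):
    """Share of null datasets whose statistic exceeds the chi-square critical value."""
    rng = np.random.default_rng(seed)
    critical = chi2.ppf(1.0 - alpha, df)
    exceed = sum(statistic(simulate(rng)) > critical for _ in range(n_sims))
    return exceed / n_sims


# Toy null model (an assumption for illustration only): a normal sample with
# true mean zero and known unit variance, for which n * mean^2 is chi-square(1).
def toy_simulate(rng):
    return rng.normal(size=20)


def toy_statistic(sample):
    return len(sample) * np.mean(sample) ** 2


rate = estimate_type1_error(toy_simulate, toy_statistic, df=1)
```

With a statistic whose null distribution really is chi square, `rate` should sit close to `alpha`; a robustness study swaps in data generated under violated assumptions and looks for departures from the nominal level.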

It might be expected that the common slope tests for the major axis and standardized major axis will behave similarly to the common slopes test for linear regression. Consequently, we also consider linear regression in our simulations, for comparison. There is an extensive literature on the robustness of linear regression, which we review in Section 3.

In this paper, the test statistics of interest will be introduced (Section 2), then the relevant literature on robustness of linear regression will be reviewed (Section 3). Simulation conditions will be described (Section 4), then the results presented (Section 5) and discussed (Section 6).

2 Test Statistics

This section reviews the test statistics under consideration. Further details can be found in Warton and Weber (2002).

Consider $g$ bivariate normal random samples, the $i$-th random sample consisting of $n_i$ pairs of observations $(x_i, y_i)$. The $g$ samples may have different means and covariance matrices. Define $N = \sum_{i=1}^{g} n_i$, the total number of observations across all $g$ samples.

2.1 Test for common major axis slope

Let $\beta_i$ be the slope of the major axis through data from the $i$-th group. The hypothesis test of interest is:

$$H_0: \beta_i = \beta_{\mathrm{MA}} \quad \forall i \in \{1, 2, \ldots, g\}$$

$$H_a: \text{otherwise}.$$

The maximum likelihood estimator of the common major axis slope $\hat{\beta}_{\mathrm{MA}}$ satisfies

$$0 = \frac{1}{1 + \hat{\beta}_{\mathrm{MA}}^2} \sum_{i=1}^{g} n_i \left( \frac{1}{\hat{\Phi}_i(1,1)} - \frac{1}{\hat{\Phi}_i(2,2)} \right) \hat{\Phi}_i(1,2)$$

where $\hat{\Phi}_i(j,k)$ is the $(j,k)$-th element of $\hat{\Phi}_i$, the sample variance matrix of

$$\begin{pmatrix} u_i & v_i \end{pmatrix} = (1 + \hat{\beta}_{\mathrm{MA}}^2)^{-1/2} \begin{pmatrix} x_i & y_i \end{pmatrix} \begin{pmatrix} 1 & -\hat{\beta}_{\mathrm{MA}} \\ \hat{\beta}_{\mathrm{MA}} & 1 \end{pmatrix}.$$

The variables $u_i$ and $v_i$ can be interpreted as measuring location of a point along the fitted line ("fitted axis scores") and distance from the fitted line ("residual scores"), respectively.

The Bartlett corrected likelihood ratio test statistic can be written as

$$-2 \log(\Lambda_{\mathrm{MA}}) = -\sum_{i=1}^{g} (n_i - 2.5) \log(1 - r_{i,uv}^2)$$

where $r_{i,uv}$ is the sample correlation coefficient between $u_i$ and $v_i$. If major axis slopes of all $g$ groups are equal, $-2 \log(\Lambda_{\mathrm{MA}})$ has an asymptotic chi square distribution with $g - 1$ degrees of freedom, as the $n_i$ all become large.

Because $-2 \log(\Lambda_{\mathrm{MA}})$ is a function of correlation coefficients, the assumptions of the test can be relaxed: the null distribution of the test statistic is essentially unchanged if only one of $u$ and $v$ is normally distributed, not both.

The Bartlett correction used in $-2 \log(\Lambda_{\mathrm{MA}})$ was derived for the case when $\beta_{\mathrm{MA}}$ is known, although this correction is also useful when $\beta_{\mathrm{MA}}$ is unknown. Simulations have demonstrated that with this correction, Type I error of the test statistic is close to nominal levels when errors are normally distributed (Warton and Weber, 2002), even for small samples (each $n_i = 10$).
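The test of this subsection can be sketched in Python as follows, assuming NumPy and SciPy. This is an illustrative sketch, not the author's implementation, and `common_ma_slope_test` is a hypothetical name. Instead of solving the estimating equation directly, it minimises the sum over groups of the group size times the log of Var(u)·Var(v) over the rotation angle; the derivative of that sum is proportional to the estimating equation, so the two routes share their stationary points. The search is restricted to the quarter turn around the pooled major axis, which is an assumption about which root is wanted.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2


def _scores(theta, x, y):
    """Fitted axis scores u and residual scores v for slope tan(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return c * x + s * y, -s * x + c * y


def common_ma_slope_test(groups):
    """groups: list of (x, y) array pairs. Returns (slope, statistic, p_value)."""
    groups = [(np.asarray(x, float), np.asarray(y, float)) for x, y in groups]

    def objective(theta):
        # Sum of n_i * log(Var(u_i) * Var(v_i)); stationary where the
        # estimating equation for the common slope is zero.
        total = 0.0
        for x, y in groups:
            u, v = _scores(theta, x, y)
            total += len(x) * np.log(np.var(u, ddof=1) * np.var(v, ddof=1))
        return total

    # Start from the major axis of the pooled within-group variance matrix.
    pooled = sum(len(x) * np.cov(x, y) for x, y in groups)
    lead = np.linalg.eigh(pooled)[1][:, -1]
    theta0 = np.arctan2(lead[1], lead[0])
    fit = minimize_scalar(objective, method="bounded",
                          bounds=(theta0 - np.pi / 4, theta0 + np.pi / 4),
                          options={"xatol": 1e-10})

    # Bartlett corrected statistic: -sum (n_i - 2.5) log(1 - r_uv^2).
    statistic = 0.0
    for x, y in groups:
        u, v = _scores(fit.x, x, y)
        r = np.corrcoef(u, v)[0, 1]
        statistic -= (len(x) - 2.5) * np.log(1.0 - r ** 2)
    p_value = chi2.sf(statistic, df=len(groups) - 1)
    return np.tan(fit.x), statistic, p_value
```

A bounded scalar search finds a local minimum only, so in awkward datasets (for example, groups whose major axes are nearly perpendicular) a grid search over the angle would be a safer starting point.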

2.2 Test for common standardised major axis slope

In the case of standardized major axes, the $\beta_i$ are the slopes of the standardized major axes, and the hypothesis test of interest is:

$$H_0: \beta_i = \beta_{\mathrm{SMA}} \quad \forall i \in \{1, 2, \ldots, g\}$$

$$H_a: \text{otherwise}.$$

The common slope estimator $\hat{\beta}_{\mathrm{SMA}}$ satisfies

$$0 = \frac{1}{2|\hat{\beta}_{\mathrm{SMA}}|} \sum_{i=1}^{g} n_i \left( \frac{1}{\hat{\Psi}_i(1,1)} + \frac{1}{\hat{\Psi}_i(2,2)} \right) \hat{\Psi}_i(1,2)$$

where $\hat{\Psi}_i(j,k)$ is the $(j,k)$-th element of $\hat{\Psi}_i$, the sample variance matrix of

$$\begin{pmatrix} s_i & t_i \end{pmatrix} = (2|\hat{\beta}_{\mathrm{SMA}}|)^{-1/2} \begin{pmatrix} x_i & y_i \end{pmatrix} \begin{pmatrix} \hat{\beta}_{\mathrm{SMA}} & -\hat{\beta}_{\mathrm{SMA}} \\ 1 & 1 \end{pmatrix}.$$

The variables $s_i$ and $t_i$ have the same interpretation as $u_i$ and $v_i$ do for the major axis case, i.e. they represent fitted axis and residual scores, respectively.

The Bartlett corrected likelihood ratio test statistic can be written as

$$-2 \log(\Lambda_{\mathrm{SMA}}) = -\sum_{i=1}^{g} (n_i - 2.5) \log(1 - r_{i,st}^2).$$

As previously, $-2 \log(\Lambda_{\mathrm{SMA}})$ has an asymptotic $\chi^2_{g-1}$ distribution when all standardized major axes are equal, only one of $s$ and $t$ needs to be normally distributed, and the Bartlett correction was derived assuming $\beta_{\mathrm{SMA}}$ is known.
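A minimal Python sketch of the SMA version follows, assuming NumPy and SciPy; `common_sma_slope_test` is a hypothetical name and this is not the author's code. It solves the estimating equation for the magnitude of the common slope by root bracketing. The covariance of the fitted axis and residual scores is proportional to Var(y) minus slope² times Var(x), so the estimating function changes sign between the smallest and largest single-group SMA slope magnitudes. The sketch further assumes all groups share the sign of the pooled covariance.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2


def _psi(beta, x, y):
    """Sample variance matrix of SMA fitted axis scores s and residual scores t."""
    scale = np.sqrt(2.0 * abs(beta))
    s = (beta * x + y) / scale
    t = (-beta * x + y) / scale
    return np.cov(s, t)


def common_sma_slope_test(groups):
    """groups: list of (x, y) array pairs. Returns (slope, statistic, p_value)."""
    groups = [(np.asarray(x, float), np.asarray(y, float)) for x, y in groups]
    # Assumption: every group has the same sign of correlation.
    sign = np.sign(sum(len(x) * np.cov(x, y)[0, 1] for x, y in groups))
    magnitudes = [np.std(y, ddof=1) / np.std(x, ddof=1) for x, y in groups]

    def estimating_function(b):
        total = 0.0
        for x, y in groups:
            psi = _psi(sign * b, x, y)
            total += len(x) * (1.0 / psi[0, 0] + 1.0 / psi[1, 1]) * psi[0, 1]
        return total / (2.0 * b)

    low, high = min(magnitudes), max(magnitudes)
    if np.isclose(low, high):
        b_hat = low  # all groups already share one SMA slope
    else:
        b_hat = brentq(estimating_function, low, high)
    beta_hat = sign * b_hat

    # Bartlett corrected statistic: -sum (n_i - 2.5) log(1 - r_st^2).
    statistic = 0.0
    for x, y in groups:
        psi = _psi(beta_hat, x, y)
        r_squared = psi[0, 1] ** 2 / (psi[0, 0] * psi[1, 1])
        statistic -= (len(x) - 2.5) * np.log(1.0 - r_squared)
    p_value = chi2.sf(statistic, df=len(groups) - 1)
    return beta_hat, statistic, p_value
```

The function takes one `(x, y)` pair per group, already on the log scale if the analysis is allometric, and returns the common slope estimate with the statistic and its chi square P-value on $g - 1$ degrees of freedom.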

2.3 Relationship to the test for common linear regression slope

The standard F test for a common linear regression slope is equivalent to a maximum likelihood statistic, derived either assuming bivariate normality or (more usually) conditioning on the $x_i$ and assuming that $y_i | x_i$ is a realisation of normally distributed variables. It is assumed that the residual variance is common across all groups.

The common slope estimator ($\hat{\beta}_{\mathrm{reg}}$) can be shown to satisfy

$$0 = \sum_{i=1}^{g} (n_i - 1) \hat{\Omega}_i(1,2)$$

where $\hat{\Omega}_i(j,k)$ is the $(j,k)$-th element of $\hat{\Omega}_i$, the sample variance matrix of

$$\begin{pmatrix} x_i & w_i \end{pmatrix} = \begin{pmatrix} x_i & y_i \end{pmatrix} \begin{pmatrix} 1 & -\hat{\beta}_{\mathrm{reg}} \\ 0 & 1 \end{pmatrix}.$$

The variables $x_i$ and $w_i$ are analogous to $u_i$ and $v_i$ for the major axis case.

If $\hat{\Omega}(j,k)$ is the pooled variance estimate from all groups:

$$\hat{\Omega}(j,k) = \frac{1}{N - g} \sum_{i=1}^{g} (n_i - 1) \hat{\Omega}_i(j,k)$$

and if $\hat{\Sigma}(j,k)$ is defined similarly as the pooled estimate of the variance of $(x_i \;\; y_i)$, then the F statistic can be written as

$$F_{\mathrm{reg}} = \frac{\{\hat{\Sigma}(2,2) - \hat{\Omega}(2,2)\} (N - g)}{\hat{\Omega}(2,2) \, (g - 1)}.$$

This statistic is closely related to $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$. In particular, $F_{\mathrm{reg}}$ is also equivalent to a likelihood ratio statistic, and the assumptions required in specifying the likelihood for $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$ are also required for $F_{\mathrm{reg}}$. However, $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$ differ from the linear regression statistic $F_{\mathrm{reg}}$ in several ways:

- The null distribution is asymptotic (with a Bartlett-type correction) rather than being exact, although it is known to maintain close to nominal levels in small sample sizes (Warton and Weber, 2002).
- There is no assumption of common residual variance for $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$.
- $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$ are only likelihood ratio test statistics when data are bivariate normal. The tests are still valid when conditioning on the fitted axis scores, but they are no longer maximum likelihood. In contrast, when conditioning on $X$, the linear regression F statistic is still a maximum likelihood statistic.
- The fitted axis scores ($u_i$ and $s_i$) are a function of the estimated slope, which is not the case for linear regression.

The only obvious implication these differences have for robustness is that the null distributions for $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$ should be unaffected by unequal residual variances, whereas this is not the case for linear regression.

3 Literature for Linear Regression

Whereas there has been little investigation of the robustness of common slope tests for the major axis (MA) and standardised major axis (SMA), an extensive literature is available for linear regression.

There has been considerable attention to the question of robustness of analysis of covariance to failure of assumptions, dating back to the 1950s. An often-cited review is Glass, Peckham, and Sanders (1972); a recent review that includes a meta-analysis of Monte Carlo simulations is Harwell (2003). Most of this work has considered the situation of testing for equal elevation, assuming common slope, rather than specifically investigating tests for common slope. However, research does appear to confirm that tests for common slope share similar properties with ANCOVA (Wilcox, 1999; Luh and Guo, 2000). Key results relevant to our situation, which have achieved a wide consensus in the literature, can be summarised:

1. ANCOVA is robust to non-normality, unless the X-variable itself is highly non-normal. Power, on the other hand, can be strongly affected by long-tailed distributions.
2. If sampling is balanced, the ANCOVA F test is robust to heteroscedasticity and indeed to most failures of assumptions.
3. The ANCOVA F tests are sensitive to unequal error variances if sampling is unbalanced. This is the well-known Behrens-Fisher problem, in which the sensitivity to unequal error variances increases with greater imbalance in sampling.
4. The F statistic can be substantially biased if $E(Y|X)$ is non-linear (Atiqullah, 1964).

Surprisingly little work has considered the robustness of ANCOVA procedures to non-linearity (Harwell, 2003), and we have found no work on this subject specifically for common slopes tests.

Warton and Weber (2002) argued that tests using $-2 \log(\Lambda_{\mathrm{MA}})$ and $-2 \log(\Lambda_{\mathrm{SMA}})$ will be robust to non-normality in one or the other of residuals and fitted axis scores. This claim was supported by simulation results. No other work has considered the robustness of these tests. However, one might expect similar levels of robustness to the linear regression case, given that the methods are closely related (Warton and Weber, 2002).

4 Simulation Design

This study measured the effect of failure of assumptions on the null distribution of the common slope test statistics described above. The measure of interest was the observed rate at which critical values from the $\chi^2_{g-1}$ distribution were exceeded, at the 0.05 and 0.01 level, which was estimated from 10000 sample datasets. It was desirable that observed Type I error be within a factor of two of nominal levels, and any Type I error rates outside of this range are of concern.

Three different types of failure of assumptions were considered:

- nonnormal errors from the line
- heteroscedasticity
- nonlinear relationship between variables

At first glance, it may seem inappropriate to consider the effects of nonlinearity and heteroscedasticity on a test statistic that assumes a linear relationship and homoscedastic errors. If the assumptions of the procedure aren't satisfied, why not use a more appropriate procedure? However, slight heteroscedasticity and slight nonlinearity can be difficult to detect in moderately sized samples. A test statistic needs to be robust to such minor violations to be of practical use. The detectability of the assumption violations considered in these simulations has been estimated in the section "Detectability of assumption violations" (Section 4.1).

In all simulations, a total of 40 observations were generated in two groups, such that the true slope in each group is one. Data were generated according to the following steps:

1. Sampling design: both balanced and unbalanced sampling designs were considered in simulations, where sizes of the two samples $(n_1, n_2)$ were:
   (a) 20, 20
   (b) 10, 30

2. Distribution: two pairs of random samples of sizes $n_1$ and $n_2$ were generated from iid variables $(U, V)$ according to a distribution with mean zero and variance one. The variables $U$ and $V$ have interpretations as the location along (or projection to) the true line and the error from the line, respectively, analogous to $u_i$ and $v_i$ in Section 2.1. The distributions used in simulations were
   (a) normal
   (b) normal mixture A: a 10:90 mixture of normal distributions centered at zero with variances four and one respectively. This variable was then rescaled to have variance one.
   (c) normal mixture B: a 10:90 mixture of normal distributions centered at zero with variances nine and one respectively. This variable was then rescaled to have variance one.

   These normal mixtures are quite long-tailed distributions; in fact, the last distribution in the list above has a kurtosis coefficient of 8.3 (larger than for the double exponential distribution, or $t_\nu$ distributions for which $\nu \geq 6$).

3. Heteroscedasticity: heteroscedasticity was introduced by replacing $V$ with the variable $V' = \sigma(k)^{-1} V (1 + kU)$, where $k$ is a constant controlling the amount of heteroscedasticity, and $\sigma(k)^2 = \mathrm{Var}\{V(1 + kU)\} = 1 + k^2$, so that $\mathrm{Var}(V') = 1$. For $0 < k < 0.2$, $\mathrm{Var}(V' | U = u)$ increases quadratically with $u$, since for this range of $k$, $P(1 + kU < 0)$ is negligible for the distributions considered here. The closer $k$ is to 0, the smaller the heteroscedasticity, and $k = 0$ is the homoscedastic case, $V' = V$. The levels of heteroscedasticity considered in simulations were:
   (a) none ($k = 0$)
   (b) $k = 0.1$
   (c) $k = 0.2$

   The range of values of $U$ typically observed in simulations was $-2 < U < 2$, which means that $\mathrm{Var}(V' | U)$ usually varied over a factor of $\left( \frac{1 + 2k}{1 - 2k} \right)^2$. Consequently, $\mathrm{Var}(V' | U)$ varied over a factor of approximately 2.3 when $k = 0.1$, or a factor of 5.4 when $k = 0.2$.

4. Nonlinearity: nonlinearity was introduced using the function $V'' = \sigma(c)^{-1} \{V' + c(U^2 - 1)\}$, where $c$ is a constant controlling the amount of nonlinearity, and $\sigma(c)^2 = \mathrm{Var}\{V' + c(U^2 - 1)\} = 1 + c^2 \{E(U^4) - 1\}$. $E(U^4) = 3$ if $U$ is normal, $E(U^4) \approx 4.4$ if $U$ is the (standardised) mixture with variances one and four, and $E(U^4) \approx 8.3$ for the normal mixture with variances one and nine. Notice that $E(V'') = 0$ and $E(V'' U) = 0$, so $V''$ and $U$ are uncorrelated, and both have mean zero and variance one. The following levels of nonlinearity were introduced:
   (a) none ($c = 0$)
   (b) $c = 0.1$
   (c) $c = 0.25$

   $E(V'' | U = 2) = E(V'' | U = -2) = 3c$ and $E(V'' | U = 0) = -c$, so given that $U$ was usually in the range $-2$ to $2$ in simulations, $E(V'' | U)$ usually varied over a range of 0.4 when $c = 0.1$ and a range of 1 when $c = 0.25$.

5. Correlation structure: correlation between the variables $X$ and $Y$ was varied in simulations, where the correlation was introduced by linear transformation. For MA and SMA simulations, the transformation used was:

$$\begin{pmatrix} X & Y \end{pmatrix} = (1 - \rho^2)^{-1/2} \begin{pmatrix} U & V'' \end{pmatrix} \begin{pmatrix} \sqrt{1 + \rho} & \sqrt{1 + \rho} \\ -\sqrt{1 - \rho} & \sqrt{1 - \rho} \end{pmatrix}.$$

   For linear regression simulations, the linear transformation was:

$$\begin{pmatrix} X & Y \end{pmatrix} = \begin{pmatrix} U & V'' \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & \frac{\sqrt{1 - \rho^2}}{\rho} \end{pmatrix}.$$

   In each case, this approach generated data that had a "true" slope of one and correlation $\rho$, because $U$ and $V''$ were uncorrelated and had equal variance irrespective of the level of heteroscedasticity or nonlinearity. The levels of correlation in the two groups were set to
   (a) 0.4, 0.4
   (b) 0.8, 0.8
   (c) 0.4, 0.8

A total of 162 simulations were conducted for MA and SMA, considering all possible combinations of the two sampling designs, three sampling distributions, three levels of heteroscedasticity, three levels of nonlinearity and three levels of correlation. These simulations were repeated for linear regression for the purposes of comparison.
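The five data generation steps can be sketched for a single group as follows, assuming NumPy. This is an illustrative reading of the description above, not the author's code: the names `draw_unit_variance`, `fourth_moment` and `simulate_group` are hypothetical, `rho` stands for the target correlation, and the placement of the minus sign in the MA/SMA transformation is an assumption (it only fixes the direction of curvature).

```python
import numpy as np

WIDE_VARIANCE = {"mixture_A": 4.0, "mixture_B": 9.0}  # 10% component variances


def draw_unit_variance(rng, n, dist):
    """Step 2: draw n values with mean zero and variance one."""
    if dist == "normal":
        return rng.normal(size=n)
    wide = WIDE_VARIANCE[dist]
    is_wide = rng.random(n) < 0.1                      # 10:90 mixture
    sd = np.where(is_wide, np.sqrt(wide), 1.0)
    return rng.normal(size=n) * sd / np.sqrt(0.1 * wide + 0.9)  # rescale


def fourth_moment(dist):
    """E(U^4) for the standardised distribution (its kurtosis coefficient)."""
    if dist == "normal":
        return 3.0
    wide = WIDE_VARIANCE[dist]
    return 3.0 * (0.1 * wide ** 2 + 0.9) / (0.1 * wide + 0.9) ** 2


def simulate_group(rng, n, rho, dist="normal", k=0.0, c=0.0, line="ma"):
    """Generate (x, y) with true slope one and correlation rho."""
    u = draw_unit_variance(rng, n, dist)
    v = draw_unit_variance(rng, n, dist)
    # Step 3: heteroscedasticity, rescaled so that Var(V') = 1.
    v1 = v * (1.0 + k * u) / np.sqrt(1.0 + k ** 2)
    # Step 4: nonlinearity, rescaled so that Var(V'') = 1.
    v2 = (v1 + c * (u ** 2 - 1.0)) / np.sqrt(1.0 + c ** 2 * (fourth_moment(dist) - 1.0))
    # Step 5: linear transformation that sets the correlation.
    if line == "ma":  # used for both MA and SMA simulations
        scale = 1.0 / np.sqrt(1.0 - rho ** 2)
        a, b = np.sqrt(1.0 + rho), np.sqrt(1.0 - rho)
        return scale * (a * u - b * v2), scale * (a * u + b * v2)
    return u, u + np.sqrt(1.0 - rho ** 2) / rho * v2    # linear regression


# Step 1: one unbalanced dataset (n1, n2) = (10, 30) with correlations 0.4 and 0.8.
rng_demo = np.random.default_rng(2007)
dataset = [simulate_group(rng_demo, 10, 0.4, "mixture_B", k=0.2, c=0.25),
           simulate_group(rng_demo, 30, 0.8, "mixture_B", k=0.2, c=0.25)]
```

Looping `simulate_group` over the design grid and feeding each dataset to a common slope test would give the rejection rates that the study reports.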

4.1 Detectability of assumption violations

The detectability of the above violations was estimated using an expert panel of six statisticians. If these assumption violations are difficult to detect, then any sensitivity to these assumption violations is problematic for the practical application of the tests.

Table 1 Detectability of assumption violations used in simulations. Detectability was estimated as the proportion of correctly detected assumption violations, from an expert panel of six statisticians. Each statistician was presented with 12 datasets generated with two or fewer assumption violations. Results are presented, pooling across all datasets with (a) any assumption violation (b) only major assumption violations (normal mixture B, $k = 0.2$ or $c = 0.25$).

| Assumption | # violations | # detected | % detected |
|---|---|---|---|
| (a) Non-normality | 33 | 15 | 45% |
| (a) Heteroscedasticity | 37 | 13 | 35% |
| (a) Non-linearity | 32 | 11 | 34% |
| (b) Non-normality | 15 | 7 | 47% |
| (b) Heteroscedasticity | 20 | 8 | 40% |
| (b) Non-linearity | 19 | 8 | 42% |

The six statisticians on the expert panel were statistics lecturers at the University of New South Wales. Each had at least five years' experience in applied statistics and regression modelling.

Seventy-two datasets were generated using the simulation methods described earlier in Section 4, such that each dataset violated no more than two of the assumptions of linearity, homoscedasticity and normality. The six statisticians on the expert panel were shown twelve of these datasets each, with appropriate diagnostic plots. For each dataset, the statisticians were asked which (if any) assumptions were violated. An example of the information presented to the statisticians is given in Figure 3.

On the whole, it can be concluded that these assumption violations were indeed difficult to detect (Table 1). Non-linearity was only detected in 34% of the datasets in which it was present, heteroscedasticity was only detected in 35%, and non-normality in 45%. Even in the more extreme cases of assumption violation (i.e. normal mixture B, heteroscedasticity with k = 0.2, non-linearity with c = 0.25), violations were detected less than half the time. This clearly demonstrates that the levels of

294 D. I. Warton: Robustness of Common Slope Tests

Figure 3 Minor violations of assumptions often cannot be detected. Each row contains diagnostic plots for a dataset: the first column is a plot of Y vs X ("y vs x"), the second column is a standardised residuals vs fits plot ("res vs fits"), the last column is a normal quantile plot of standardised residuals ("qq plot, resids vs normal"). Each dataset contains 40 observations (with correlation 0.4), generated as in simulations. One plot displays bivariate normal data, one displays bivariate normal data with heterogeneity introduced (with k = 0.2, see description of simulation for more details), and one plot displays bivariate normal data with nonlinearity introduced (using c = 0.25). Which plot is which?


assumption violation considered in these simulations are not easy to detect in practice for moderately sized samples, and so any lack of robustness to these assumption violations will be problematic.
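The detection percentages quoted from Table 1 can be recomputed from the tabulated counts, as in the short sketch below. The binomial standard error is an added rough guide only: it assumes each presented violation is an independent trial, which ignores the fact that each panellist assessed several datasets.

```python
import math

# (number of violations presented, number detected), copied from Table 1.
TABLE_1 = {
    ("any violation", "non-normality"): (33, 15),
    ("any violation", "heteroscedasticity"): (37, 13),
    ("any violation", "non-linearity"): (32, 11),
    ("major violation", "non-normality"): (15, 7),
    ("major violation", "heteroscedasticity"): (20, 8),
    ("major violation", "non-linearity"): (19, 8),
}


def detection_rate(n_violations, n_detected):
    """Proportion of presented violations that were correctly detected."""
    return n_detected / n_violations


def binomial_se(n_violations, n_detected):
    """Binomial standard error of the detection rate (independence assumed)."""
    p = detection_rate(n_violations, n_detected)
    return math.sqrt(p * (1.0 - p) / n_violations)


if __name__ == "__main__":
    for (severity, assumption), counts in TABLE_1.items():
        rate = detection_rate(*counts)
        se = binomial_se(*counts)
        print(f"{severity:16s} {assumption:20s} {100 * rate:5.1f}%  (SE {100 * se:4.1f} points)")
```

Under that independence assumption the standard errors are roughly 8 to 13 percentage points, so the detection rates should be read as approximate.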

5 Results

The approximations of $-2\log(L_{\mathrm{MA}})$ and $-2\log(L_{\mathrm{SMA}})$ to a chi-squared distribution had impressive robustness to nonnormality and heteroscedasticity. This is demonstrated in Table 2 and Table 3 at the 0.05 and 0.01 levels, respectively, for data with a correlation of 0.4. In the absence of nonlinearity


Table 2 Observed Type I error rates at the 0.05 level for common slope tests of major axes ($b_{\mathrm{MA}}$), standardized major axes ($b_{\mathrm{SMA}}$) and linear regressions ($b_{\mathrm{reg}}$). Simulated data were distributed as (a) bivariate normal (b) bivariate mixture of normals with variances one and four (c) bivariate mixture of normals with variances one and nine. Heteroscedasticity increased with k, and nonlinearity with c. In all cases, Type I error was estimated from 10000 datasets, and each dataset consisted of two samples of 20 observations each, and correlation 0.4. See methods for further details.

| Distribution | c | $b_{\mathrm{MA}}$, k = 0 | $b_{\mathrm{MA}}$, k = 0.1 | $b_{\mathrm{MA}}$, k = 0.2 | $b_{\mathrm{SMA}}$, k = 0 | $b_{\mathrm{SMA}}$, k = 0.1 | $b_{\mathrm{SMA}}$, k = 0.2 | $b_{\mathrm{reg}}$, k = 0 | $b_{\mathrm{reg}}$, k = 0.1 | $b_{\mathrm{reg}}$, k = 0.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 0 | 0.040 | 0.047 | 0.050 | 0.046 | 0.053 | 0.058 | 0.054 | 0.051 | 0.058 |
| (a) | 0.1 | 0.047 | 0.052 | 0.054 | 0.054 | 0.062 | 0.064 | 0.055 | 0.054 | 0.066 |
| (a) | 0.25 | 0.077 | 0.078 | 0.082 | 0.092 | 0.092 | 0.096 | 0.085 | 0.091 | 0.091 |
| (b) | 0 | 0.040 | 0.044 | 0.050 | 0.046 | 0.051 | 0.058 | 0.050 | 0.054 | 0.061 |
| (b) | 0.1 | 0.054 | 0.059 | 0.067 | 0.067 | 0.074 | 0.078 | 0.066 | 0.069 | 0.079 |
| (b) | 0.25 | 0.114 | 0.118 | 0.124 | 0.140 | 0.143 | 0.149 | 0.133 | 0.138 | 0.145 |
| (c) | 0 | 0.039 | 0.043 | 0.052 | 0.049 | 0.052 | 0.066 | 0.051 | 0.059 | 0.062 |
| (c) | 0.1 | 0.079 | 0.082 | 0.092 | 0.105 | 0.107 | 0.118 | 0.100 | 0.100 | 0.101 |
| (c) | 0.25 | 0.198 | 0.200 | 0.203 | 0.246 | 0.248 | 0.249 | 0.221 | 0.224 | 0.235 |

Table 3 Observed Type I error rates at the 0.01 level for common slope tests of major axes ($b_{\mathrm{MA}}$), standardized major axes ($b_{\mathrm{SMA}}$) and linear regressions ($b_{\mathrm{reg}}$). Simulated data were distributed as (a) bivariate normal (b) bivariate mixture of normals with variances one and four (c) bivariate mixture of normals with variances one and nine. Heteroscedasticity increased with k, and nonlinearity with c. In all cases, Type I error was estimated from 10000 datasets, and each dataset consisted of two samples of 20 observations each, and correlation 0.4. See methods for further details.

| Distribution | c | $b_{\mathrm{MA}}$, k = 0 | $b_{\mathrm{MA}}$, k = 0.1 | $b_{\mathrm{MA}}$, k = 0.2 | $b_{\mathrm{SMA}}$, k = 0 | $b_{\mathrm{SMA}}$, k = 0.1 | $b_{\mathrm{SMA}}$, k = 0.2 | $b_{\mathrm{reg}}$, k = 0 | $b_{\mathrm{reg}}$, k = 0.1 | $b_{\mathrm{reg}}$, k = 0.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 0 | 0.008 | 0.008 | 0.009 | 0.011 | 0.012 | 0.011 | 0.011 | 0.010 | 0.010 |
| (a) | 0.1 | 0.009 | 0.009 | 0.009 | 0.012 | 0.012 | 0.013 | 0.010 | 0.011 | 0.016 |
| (a) | 0.25 | 0.018 | 0.020 | 0.021 | 0.024 | 0.027 | 0.030 | 0.026 | 0.027 | 0.028 |
| (b) | 0 | 0.006 | 0.008 | 0.010 | 0.010 | 0.011 | 0.013 | 0.010 | 0.011 | 0.012 |
| (b) | 0.1 | 0.009 | 0.013 | 0.015 | 0.015 | 0.018 | 0.021 | 0.016 | 0.017 | 0.020 |
| (b) | 0.25 | 0.039 | 0.040 | 0.041 | 0.055 | 0.056 | 0.058 | 0.053 | 0.054 | 0.058 |
| (c) | 0 | 0.008 | 0.007 | 0.012 | 0.010 | 0.012 | 0.019 | 0.012 | 0.015 | 0.019 |
| (c) | 0.1 | 0.022 | 0.023 | 0.030 | 0.036 | 0.036 | 0.046 | 0.033 | 0.034 | 0.036 |
| (c) | 0.25 | 0.100 | 0.098 | 0.099 | 0.137 | 0.134 | 0.136 | 0.114 | 0.114 | 0.127 |


(c = 0), the common slope tests became slightly more liberal with increasing heteroscedasticity, but remained close to nominal levels. In these cases observed Type I error at the 0.05 level was always in the range 0.039–0.066, and at the 0.01 level was in the range 0.006–0.013, except for one value at 0.019. These results are impressive in the sense that in some simulations, there was a considerable departure from assumptions of normally distributed errors (the normal mixtures were quite long tailed) and homoscedastic errors (when k = 0.2, the variance of errors from the line changed by more than a factor of 5).
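To judge how far an observed rate can drift from the nominal level through simulation noise alone, one can compute the Monte Carlo standard error of a rejection rate estimated from 10000 datasets. The sketch below assumes the datasets are independent, so that the rejection count is binomial; the function name is illustrative.

```python
import math


def monte_carlo_se(rate, n_datasets):
    """Binomial standard error of a rejection rate estimated from n_datasets runs."""
    return math.sqrt(rate * (1.0 - rate) / n_datasets)


if __name__ == "__main__":
    for nominal in (0.05, 0.01):
        se = monte_carlo_se(nominal, 10000)
        low, high = nominal - 2 * se, nominal + 2 * se
        print(f"nominal {nominal}: SE = {se:.4f}, nominal +/- 2 SE = ({low:.4f}, {high:.4f})")
```

On this reckoning the standard error is about 0.002 at the 0.05 level and about 0.001 at the 0.01 level, so differences of a few thousandths between table entries are within simulation noise.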

The approximations to a chi-squared distribution were not, however, robust to nonlinearity. The common slope tests were more liberal when the simulated data had greater nonlinearity, and this effect was more evident when the underlying distribution had longer tails. For normally distributed data, Type I error usually remained within an acceptable range, although for the standardized major axis test it exceeded 0.02 when $\alpha = 0.01$ for the most nonlinear case considered (c = 0.25). In contrast, in simulations of the normal mixture with variances 1 and 9, Type I error was always at least 0.098 when $\alpha = 0.01$ and c = 0.25. The common slope tests were unacceptably liberal in simulations using either normal mixture when c = 0.25, and also in the longer-tailed of the two normal mixtures when c = 0.1.
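The way such a Type I error rate is estimated can be sketched with a small simulation for the linear regression case. This is a simplified illustration, not the simulation design used in this study: it assumes an equal-weight mixture of N(0, 1) and N(0, 9) for X, standard normal errors in Y, a curve of the form Y = c(X² − 1) + X, and the classical pooled-variance t-test for equal slopes in place of the likelihood ratio statistics. All function names are hypothetical.

```python
import math
import random

N_PER_GROUP = 20
T_CRIT = 2.028  # approximate two-sided 5% critical value of Student's t with 36 df


def draw_x(rng, long_tailed):
    # Assumed equal-weight mixture of N(0, 1) and N(0, 9) in the long-tailed case.
    sd = 3.0 if (long_tailed and rng.random() < 0.5) else 1.0
    return rng.gauss(0.0, sd)


def make_sample(rng, c, long_tailed):
    """One sample of N_PER_GROUP points scattered about Y = c(X^2 - 1) + X."""
    xs = [draw_x(rng, long_tailed) for _ in range(N_PER_GROUP)]
    ys = [c * (x * x - 1.0) + x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Return the least-squares slope, residual sum of squares and Sxx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    rss = sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in zip(xs, ys))
    return slope, rss, sxx


def rejects_common_slope(sample_1, sample_2):
    """Pooled-variance t-test of equal regression slopes at the 0.05 level."""
    b1, rss1, sxx1 = fit_line(*sample_1)
    b2, rss2, sxx2 = fit_line(*sample_2)
    pooled_var = (rss1 + rss2) / (2 * N_PER_GROUP - 4)
    t_stat = (b1 - b2) / math.sqrt(pooled_var * (1.0 / sxx1 + 1.0 / sxx2))
    return abs(t_stat) > T_CRIT


def type1_error(c, long_tailed, n_datasets=2000, seed=1):
    """Share of datasets rejected when both samples share one generating model."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_datasets):
        first = make_sample(rng, c, long_tailed)
        second = make_sample(rng, c, long_tailed)
        rejections += rejects_common_slope(first, second)
    return rejections / n_datasets


if __name__ == "__main__":
    print("linear, normal X:        ", type1_error(0.0, False))
    print("c = 0.25, long-tailed X: ", type1_error(0.25, True))
```

Because both samples in each dataset come from the same model, every rejection is a false positive, and the rejection share estimates the Type I error under the stated assumptions.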

Results were similar for all three tests considered, and were similar for balanced and unbalanced sampling designs, with one exception: the linear regression statistic when correlation differed for the two groups (Table 4). This corresponds to the well-known case in which error variances differ, and so linear regression statistics perform poorly when sample sizes are unbalanced (Glass et al., 1972). For balanced designs, the linear regression statistic remained close to nominal levels when correlation differed; for unbalanced designs, however, its Type I error was highly inflated (0.178–0.185 at the 0.05 level, Table 4). In contrast, the $-2\log(L_{\mathrm{MA}})$ and $-2\log(L_{\mathrm{SMA}})$ statistics were unaffected by sample design, and remained close to nominal levels for different combinations of sample design and correlation, provided that there was no nonlinearity. It can also be seen in Table 4 that the effect of heteroscedasticity was unchanged by the level of correlation.

6 Discussion

The simulations conducted here are not encouraging, with respect to nonlinearity – they demonstrate

that slight nonlinearity can lead to unacceptable Type I error inflation, even when the nonlinearity is

sufficiently subtle that often it would not be detected in a sample dataset of 40 observations.


Table 4 Observed Type I error rates of common slope tests as a function of heteroscedasticity and level of correlation. Tabulated results are the Type I error rates at the 0.05 level for common slope tests of major axes ($b_{\mathrm{MA}}$) and standardized major axes ($b_{\mathrm{SMA}}$), when there is no nonlinearity (c = 0), for bivariate mixture of normals with variances one and nine. Sample sizes are (a) balanced, (20, 20) (b) unbalanced, (10, 30). Heteroscedasticity increased with k, and three different levels of correlation were considered. In all cases, Type I error was estimated from 10000 datasets. See methods for further details.

| Design | $\rho$ | $b_{\mathrm{MA}}$, k = 0 | $b_{\mathrm{MA}}$, k = 0.1 | $b_{\mathrm{MA}}$, k = 0.2 | $b_{\mathrm{SMA}}$, k = 0 | $b_{\mathrm{SMA}}$, k = 0.1 | $b_{\mathrm{SMA}}$, k = 0.2 | $b_{\mathrm{reg}}$, k = 0 | $b_{\mathrm{reg}}$, k = 0.1 | $b_{\mathrm{reg}}$, k = 0.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) | (0.4, 0.4) | 0.039 | 0.043 | 0.052 | 0.049 | 0.052 | 0.066 | 0.051 | 0.059 | 0.062 |
| (a) | (0.4, 0.8) | 0.050 | 0.048 | 0.056 | 0.051 | 0.049 | 0.058 | 0.055 | 0.057 | 0.069 |
| (a) | (0.8, 0.8) | 0.044 | 0.051 | 0.060 | 0.045 | 0.052 | 0.063 | 0.055 | 0.054 | 0.060 |
| (b) | (0.4, 0.4) | 0.042 | 0.040 | 0.050 | 0.052 | 0.048 | 0.062 | 0.052 | 0.058 | 0.061 |
| (b) | (0.4, 0.8) | 0.050 | 0.051 | 0.054 | 0.051 | 0.054 | 0.057 | 0.180 | 0.178 | 0.185 |
| (b) | (0.8, 0.8) | 0.045 | 0.052 | 0.050 | 0.047 | 0.054 | 0.054 | 0.052 | 0.061 | 0.057 |


A further cause for concern is that resampling methods will not help in robustifying inference. In discussing a similar problem, Freedman (1986) said “if [the estimator of the stochastic model] is silly, the bootstrap cannot work: like any statistical procedure, the bootstrap is model dependent. (This is the statistics version of the no free lunch principle.)” Using permutation tests or bootstrapping cannot be expected to improve the properties of a test assuming linearity and homoscedasticity in a nonlinear or heteroscedastic context. Standard resampling methods for this test implicitly assume linearity, whether freely permuting residuals under the reduced model (Anderson and Robinson, 2001) or bootstrapping after rotating data to reflect $H_0$ (Hall and Wilson, 1991). In fact, the above simulations have been repeated using resampling-based tests, and Type I error rates remained similar to those in Tables 2–4.

It makes intuitive sense that a common slopes test might be sensitive to nonlinearity. The argument is demonstrated schematically in Figure 4. By definition, if the relationship between two variables is nonlinear, then the gradient of the relationship is not the same at all points along the function relating the variables. Now due to sampling error, the location of different samples will always differ, leading to different slopes of the best-fitting lines, due to nonlinearity. More specifically, the underlying nonlinear function used in these simulations had the form $Y = c(X^2 - 1) + X$. If the means of two samples of X differ by d, then the difference in gradients at these points (hence the approximate difference in slopes of the best-fitting lines) is 2cd.
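The 2cd figure follows because the gradient of Y = c(X² − 1) + X at X is 1 + 2cX, so gradients at two points a distance d apart differ by 2cd. The sketch below checks this for fitted lines, using noise-free points on grids placed symmetrically about each sample mean; that idealised design is an assumption chosen so that the least-squares slope equals the gradient at the mean exactly, and the function names are illustrative.

```python
def curve(x, c):
    """The nonlinear function Y = c(X^2 - 1) + X."""
    return c * (x * x - 1.0) + x


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def sample_on_grid(center, c, half_width=1.0, n=21):
    """Noise-free points on the curve, evenly spaced and symmetric about center."""
    xs = [center - half_width + 2.0 * half_width * i / (n - 1) for i in range(n)]
    return xs, [curve(x, c) for x in xs]


c, d = 0.25, 0.5
slope_a = ols_slope(*sample_on_grid(0.0, c))  # sample centred at X = 0
slope_b = ols_slope(*sample_on_grid(d, c))    # sample centred at X = d

if __name__ == "__main__":
    print("slope difference:", slope_b - slope_a, " 2cd:", 2 * c * d)
```

With real samples the X values are not symmetric about their mean and the data are noisy, so 2cd is only an approximation to the difference in fitted slopes.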

Results suggest that it is worthwhile checking data for nonnormal errors as well as checking carefully for nonlinearity. While the tests were robust to nonnormality, the effect of nonlinearity was considerably greater in the presence of nonnormality (Tables 2–3). When errors were normally distributed, the tests had acceptable levels of robustness to nonlinearity, and so a practitioner who does not observe long tailed data can feel reasonably confident in results of a common slopes test. However,


Figure 4 The generation of iid bivariate samples of apparently different slope from a non-linear function (axes: Y against X). Any two samples of data will be centered on different locations along the function relating the two variables, due to sampling error, leading to different slopes for lines fitted at different locations. (Although the data on this figure have been generated with different locations to emphasize the point.)


caution needs to be exercised in interpreting common slopes tests of data from long-tailed distributions, because of the possibility of undetected non-linearity (Table 1).

For example, consider the dataset of Figure 1. In Figure 2 there was a suggestion that the data were long-tailed compared to the normal distribution, and there was no evidence of non-linearity. Given the low P-value of 0.01, we can be reasonably confident that there is indeed a difference in slopes between high and low rainfall sites. However, due to the disconcerting possibility of undetected non-linearity, this result should be interpreted cautiously.

Finally, this study has implications for fitting linear models in general – while the central limit

theorem ensures robustness to nonnormality, and balanced sampling can ensure a certain level of

robustness to heteroscedasticity, nonlinearity can throw a cat amongst the pigeons.

Acknowledgements

Thanks to Yanan Fan, Inge Koch, Sue Middleton, David Nott, Scott Sisson and Matt Wand for their assumption checking and for comments on the manuscript. Thanks also to anonymous reviewers and editors for comments on the manuscript, and to Ian Wright for the use of his leaf longevity data.

References

Anderson, M. J. and Robinson, J. (2001). Permutation tests for linear models. Australian and New Zealand Journal of Statistics 43, 75–88.

Atiqullah, M. (1964). The robustness of the covariance analysis of a one-way classification. Biometrika 51, 365–372.

Darveau, C., Suarez, R. K., Andrews, R. D., and Hochachka, P. W. (2002). Allometric cascade as a unifying principle of body mass effects on metabolism. Nature 417, 166–170.

Flury, B. N. (1984). Common principal components in k groups. Journal of the American Statistical Association 79, 892–898.

Freedman, D. A. (1986). Discussion: Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics 14, 1305–1308.

Glass, G. V., Peckham, P. D., and Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research 42, 237–288.

Hall, P. and Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757–762.

Harvey, P. H. and Pagel, M. D. (1991). The comparative method in evolutionary biology. Oxford University Press, Oxford.

Harwell, M. (2003). Summarizing Monte Carlo results in methodological research: The single-factor, fixed-effects ANCOVA case. Journal of Educational and Behavioral Statistics 28, 45–70.

Henery, M. L. and Westoby, M. (2001). Seed mass and seed nutrient content as predictors of seed output variation between species. Oikos 92, 479–490.

King, D. A., Davies, S. J., Supardi, M. N. N., and Tan, S. (2005). Tree growth is related to light interception and wood density in two mixed dipterocarp forests of Malaysia. Functional Ecology 19, 445–453.

Luh, W. M. and Guo, J. H. (2000). Approximate transformation trimmed mean methods to the test of simple linear regression slope equality. Journal of Applied Statistics 27, 843–857.

Niklas, K. J. (1994). Plant Allometry: The Scaling of Form and Process. University of Chicago Press, Chicago.

Rayner, J. M. V. (1985). Linear relations in biomechanics: the statistics of scaling functions. Journal of Zoology, Ser. A 206, 415–439.

Reiss, M. J. (1989). The allometry of growth and reproduction. Cambridge University Press, Cambridge.

Tjoelker, M. G., Craine, J. M., Wedin, D., Reich, P. B., and Tilman, D. (2005). Linking leaf and root trait syndromes among 39 grassland and savannah species. New Phytologist 167, 493–508.

Warton, D. I. and Weber, N. C. (2002). Common slope tests for errors-in-variables models. Biometrical Journal 44, 161–174.

Warton, D. I., Wright, I. J., Falster, D. S., and Westoby, M. (2006). Bivariate line-fitting methods for allometry. Biological Reviews 81, 259–291.

Westoby, M. and Wright, I. (2003). The leaf size-twig size spectrum and its relationship to other important spectra of variation among species. Oecologia 135, 621–628.

Wilcox, R. R. (1999). Testing hypotheses about regression parameters when the error term is heteroscedastic. Biometrical Journal 41, 411–426.


Wright, I. J., Reich, P. B., Cornelissen, J. H. C., Falster, D. S., Garnier, E., Hikosaka, K., Lamont, B. B., Lee, W., Oleksyn, J., Osada, N., Poorter, H., Villar, R., Warton, D. I., and Westoby, M. (2005). Assessing the generality of global leaf trait relationships. New Phytologist 166, 485–496.

Wright, I. J., Reich, P. B., Westoby, M., Ackerly, D. D., Baruch, Z., Bongers, F., Cavender-Bares, J., Chapin, T., Cornelissen, J. H. C., Diemer, M., Flexas, J., Garnier, E., Groom, P. K., Gulias, J., Hikosaka, K., Lamont, B. B., Lee, T., Lee, W., Lusk, C., Midgley, J. J., Navas, M. L., Niinemets, U., Oleksyn, J., Osada, N., Poorter, H., Poot, P., Prior, L., Pyankov, V. I., Roumet, C., Thomas, S. C., Tjoelker, M. G., Veneklaas, E. J., and Villar, R. (2004). The worldwide leaf economics spectrum. Nature 428, 821–827.

Wright, I. J., Westoby, M., and Reich, P. B. (2002). Convergence towards higher leaf mass per area in dry and nutrient-poor habitats has different consequences for leaf lifespan. Journal of Ecology 90, 534–543.
