Performance of five two-sample location tests for skewed distributions with unequal variances

Morten W. Fagerland⁎, Leiv Sandvik

Ullevål Department of Research Administration, Oslo University Hospital, N-0407 Oslo, Norway

Article info

Article history:
Received 16 March 2009
Accepted 18 June 2009

Abstract

Tests for comparing the locations of two independent populations are associated with different

null hypotheses, but results are often interpreted as evidence for or against equality of means or

medians. We examine the appropriateness of this practice by investigating the performance of

five frequently used tests: the two-sample T test, the Welch U test, the Yuen–Welch test, the

Wilcoxon–Mann–Whitney test, and the Brunner–Munzel test. Under combined violations of

normality and variance homogeneity, the true significance level and power of the tests depend

on a complex interplay of several factors. In a wide-ranging simulation study, we consider

scenarios differing in skewness, skewness heterogeneity, variance heterogeneity, sample size,

and sample size ratio. We find that small differences in distribution properties can alter test

performance markedly, thus confounding the effort to present simple test recommendations.

Instead, we provide detailed recommendations in Appendix A. The Welch U test is

recommended most frequently, but cannot be considered an omnibus test for this problem.

© 2009 Elsevier Inc. All rights reserved.

Keywords:

Two-sample location problem

T test

Welch test

Wilcoxon–Mann–Whitney test

Yuen–Welch test

Brunner–Munzel test

Robustness

Skewness

Heteroscedasticity

1. Introduction

Comparison of locations, or central tendency, of two

independent populations is common in medical research. A

plethora of tests exists, and their suitability depends on the

distribution of the data at hand. The choice of test decides

what can be inferred from the results. This is due to the

different null hypotheses these methods are designed to test.

The two-sample T test is the most common approach. This

is a test of equality of means, but it is derived under the

assumptions that the two distributions are normal with equal

variances. A modification of this test, the Welch U test [1], is

designed for unequal variances, but the assumption of

normality is maintained.

When distributions deviate from normality, several

approaches are available. The most common non-parametric

alternative is the Wilcoxon–Mann–Whitney (WMW) test.

This test is often regarded as a test of equal medians, but this

is not true in general. The correct null hypothesis for this test

is P(X<Y)=0.5, where X and Y are randomly drawn observations from the

two populations. The results from the WMW test can be

interpreted as a test of equality of medians only when the two

distributions are identical except for a possible shift in

location [2]. Many attempts have been made to improve the

WMW test. The most prominent of these is the Brunner–

Munzel test [3], which allows for tied values and unequal

variances.

For markedly skewed distributions, the mean can be a

poor measure of central tendency because outliers inflate its

value. This can be ameliorated by removing the smallest and

the largest values in the sample. If an equal amount of values

are removed from each tail, the mean of the resulting sample

is called the trimmed mean. Comparing trimmed means can

be done with the Yuen–Welch test [4], which is identical to

the Welch U test for zero amount of trimming.

When using these tests, one must be aware that the results pertain to the tests' specific
null hypotheses. A significant p-value from the WMW test or the Brunner–Munzel test, for
example, can be difficult to interpret beyond noting that the observations from one of the
populations tend to be smaller than the observations from the other population. According to
Cliff [5], this interpretation has merit in its own right, and he suggests making inference
about P(X>Y)−P(X<Y) as an alternative to means or other measures of location. In practice,
however, researchers often like to make inference about the two common measures of central
tendency, the mean and the median, which offer intuitive interpretations.

Contemporary Clinical Trials 30 (2009) 490–496
⁎ Corresponding author. Tel.: +47 41 50 46 14; fax: +47 22 11 84 79.
E-mail address: morten.fagerland@medisin.uio.no (M.W. Fagerland).
1551-7144/$ – see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.cct.2009.06.007

In medical research, the assumptions of normality and

variance homogeneity are often violated [6,7]. Skewed data

are common in medical research [8], and several well known

variables are known to be markedly skewed, for example

triglyceride level and sedimentation rate. If two skewed

distributions have unequal locations, the variances can be

expected to differ as well. Hence, medical data often exhibit a

combination of skewness and unequal variances.

The purpose of this paper is to investigate to what extent

the five mentioned tests can be appropriately used to

compare means and medians for a wide range of skewed

distributions with varying degrees of unequal variances. Even

though the body of literature on two-sample location tests is

considerable [9,10], a consistent and comprehensive exam-

ination of this issue has not been previously presented. For

example, situations where the two distributions have unequal

skewness have not been thoroughly studied, although it has

been shown that both type I errors and power can be affected

[7,11].

The tests will be subjected to quantified robustness

criteria. For each situation, the test or tests with highest

power that maintain true significance levels (p) sufficiently

close to the nominal level (α) will be identified. Bradley [12]

defines criteria for α-robustness as conservative with 0.9α≤

p≤1.1α and liberal with 0.5α≤p≤1.5α. That is,

closeness is considered sufficient if the true significance

levels are within plus or minus 10% or 50% of the nominal

significance levels. We consider 50% to be too liberal for most

situations, but 10%, 20%, and 40% limits will be studied. We

refer to this as the 10%-, 20%-, and 40%-robustness of the tests.

For a nominal significance level of 5%, this implies that we

accept true significance levels (in percent) that are in the intervals [4.5,

5.5], [4.0, 6.0], and [3.0, 7.0], respectively.
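The robustness limits can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (the study itself used Matlab); the function names `robustness_interval` and `is_robust` are illustrative and do not come from the paper.

```python
def robustness_interval(alpha, tolerance):
    """Bounds on the true significance level for a given relative tolerance.

    alpha is the nominal level (e.g. 0.05); tolerance is the allowed
    relative deviation (0.10, 0.20 or 0.40 in this paper).
    """
    return alpha * (1 - tolerance), alpha * (1 + tolerance)


def is_robust(p_true, alpha, tolerance):
    """True if the (simulated) true significance level lies within the bounds."""
    lower, upper = robustness_interval(alpha, tolerance)
    return lower <= p_true <= upper


for tol in (0.10, 0.20, 0.40):
    lower, upper = robustness_interval(0.05, tol)
    # Prints the limits in percent: [4.5, 5.5], [4.0, 6.0], [3.0, 7.0]
    print(f"{tol:.0%}-robustness: [{100 * lower:.1f}, {100 * upper:.1f}]")
```

For example, a test with a simulated true level of 5.8% at a nominal 5% would count as 20%-robust but not 10%-robust.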

2. Clinical example

Hormone therapy (HT) is associated with adverse effects

such as increased risk of arterial and venous thromboembo-

lism, and breast cancer. Eilertsen et al. [13] examined whether

different HT regimens have different effects on blood

coagulation by randomizing 202 healthy women to either

low-dose HT, conventional-dose (high-dose) HT, tibolone, or

raloxifene. The primary outcome measure was D-dimer—a

marker of fibrin production and degradation which can be

used to assess the effect of HT on coagulation.

After six weeks of therapy, the distribution of D-dimer was

considerably skewed in the low-dose HT group and moder-

ately skewed in the high-dose HT group (Fig. 1). Summary

statistics show that the difference in means is 87, the

difference in medians is 103, and the difference in 20%

trimmed means is 89:

               n    Mean   Median   20% trimmed mean   Std   Skewness
Low-dose HT    47   398    307      336                284   3.1
High-dose HT   48   485    410      425                260   1.8

How strong is the evidence for a difference in location

between the two groups? We calculated the two-sample T

test (p=0.13), the Welch U test (p=0.13), the Wilcoxon–

Mann–Whitney test (p=0.011), the Brunner–Munzel test

(p=0.010), and the Yuen–Welch test (p=0.027). The high-

est p-value is more than ten times the smallest p-value.

Which test should we trust? We return to this example in

section 5.4.

3. Notation and test statistics

Consider two populations A and B. Assume that we have

two independent samples: X with m observations from A, and

Y with n observations from B. The estimated means and

sample variances are:

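$$\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$

and

$$S_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$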

Fig. 1. Histogram showing the distribution of D-dimer in the low-dose HT (left) and high-dose HT (right) treatment arms after six weeks of the Eilertsen et al. trial [13]. One outlier in each group was removed.


The two-sample T test is based on the test statistic

$$T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}},$$

where $S_p$ is the pooled sample standard deviation:

$$S_p^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m + n - 2}.$$

Under the null hypothesis of equal means, the T statistic

has a t-distribution with m+n−2 degrees of freedom. It is

assumed that the distributions of A and B are normal with

equal variances.
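As an illustration of the formulas above, here is a minimal Python sketch of the T statistic and its degrees of freedom, using only the standard library. The function name `two_sample_t` and the toy data are assumptions made for this example; a p-value would additionally require the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Return (T, degrees of freedom) for the pooled-variance two-sample T test."""
    m, n = len(x), len(y)
    # Pooled variance S_p^2 from the two sample variances (denominators m-1, n-1).
    sp2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1 / m + 1 / n))
    return t, m + n - 2


# Toy data, not from the paper.
t_stat, df = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), df)
```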

Welch [1] proposed several modifications of the two-

sample T test suitable for situations with unequal variances.

One of these tests, the Welch U test, is available in most

software packages. The appropriate test statistic is

$$U = (\bar{X} - \bar{Y}) \Big/ \sqrt{\frac{S_X^2}{m} + \frac{S_Y^2}{n}}.$$

U is approximately t-distributed with $f_U$ degrees of freedom:

$$f_U = \left(\frac{S_X^2}{m} + \frac{S_Y^2}{n}\right)^2 \Big/ \left(\frac{S_X^4}{m^3 - m^2} + \frac{S_Y^4}{n^3 - n^2}\right).$$
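The Welch statistic and its approximate degrees of freedom can be sketched in the same way. This is an illustrative Python version, not the authors' Matlab code; `welch_u` is a hypothetical name. It uses the identity $S_X^4/(m^3-m^2) = (S_X^2/m)^2/(m-1)$.

```python
from math import sqrt
from statistics import mean, variance


def welch_u(x, y):
    """Return (U, f_U) for the Welch U test."""
    m, n = len(x), len(y)
    vx = variance(x) / m  # S_X^2 / m
    vy = variance(y) / n  # S_Y^2 / n
    u = (mean(x) - mean(y)) / sqrt(vx + vy)
    # S^4 / (k^3 - k^2) equals (S^2 / k)^2 / (k - 1).
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return u, df


# Toy data, not from the paper.
u_stat, f_u = welch_u([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(u_stat, 4), round(f_u, 2))
```

With equal sample sizes U coincides with T, but the degrees of freedom are smaller when the sample variances differ.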

To obtain the sample trimmed means, the amount of

trimming (γ) must be chosen. For general use, γ=0.2 is a

good choice [11,14]. This corresponds to removing the 20%

smallest and the 20% largest observations in each sample. Let

X̅γ and Y̅γ denote the trimmed means (the mean of the

samples after trimming). The Yuen–Welch test [4] statistic is

given by

$$Y = \frac{\bar{X}_\gamma - \bar{Y}_\gamma}{\sqrt{d_X + d_Y}},$$

where $d_X$ and $d_Y$ are estimates of the squared standard errors. Calculation of $d_X$ and
$d_Y$ is shown in Appendix B. Under the null hypothesis of equal trimmed means, Y follows a
t-distribution with $f_Y$ degrees of freedom,

$$f_Y = (d_X + d_Y)^2 \Big/ \left(\frac{d_X^2}{h_X - 1} + \frac{d_Y^2}{h_Y - 1}\right),$$

where $h_X$ and $h_Y$ are the number of observations left in samples X and Y after trimming.

The WMW test statistic is based on ranks and involves

calculating

$$W_X = mn + m(m+1)/2 - R_X,$$

where $R_X$ is the sum of the ranks in sample X. Under the null hypothesis that P(X<Y)=0.5,
$W_X$ is approximately normally distributed with mean mn/2 and variance mn(m+n+1)/12.
The statistic

$$W = (W_X - mn/2) \Big/ \sqrt{mn(m+n+1)/12}$$

can be approximated by the standard normal distribution. By

using the exact permutation distribution of ranks, an exact

version of the WMW test can be constructed. Since the exact

test is only practicable for small samples, we do not consider

it. Throughout this paper, references to the WMW test are to

the approximate version of the test.
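The approximate WMW test described above can be sketched with the Python standard library, since the reference distribution is the standard normal. The function name `wmw_normal_approx` is illustrative. As an assumption of this sketch, tied values receive average ranks, but the variance is the untied formula given in the text, so it is only appropriate when ties are few.

```python
from math import sqrt
from statistics import NormalDist


def wmw_normal_approx(x, y):
    """Return (W, two-sided p-value) for the approximate WMW test."""
    m, n = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Map each distinct value to its (average) rank in the pooled sample.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1, ..., j
        i = j
    r_x = sum(rank_of[v] for v in x)          # R_X: rank sum of sample X
    w_x = m * n + m * (m + 1) / 2 - r_x       # W_X
    w = (w_x - m * n / 2) / sqrt(m * n * (m + n + 1) / 12)
    p_value = 2 * (1 - NormalDist().cdf(abs(w)))
    return w, p_value


# Toy data, not from the paper: every X is smaller than every Y.
w_stat, p = wmw_normal_approx([1, 2, 3], [4, 5, 6])
print(round(w_stat, 3), round(p, 3))
```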

The Brunner–Munzel test [3] is a modification of the

WMW test designed to handle ties and unequal variances.

Instead of associating ranks with the sample observations,

midranks are computed. Midranks are equal to ranks when

there are no tied values. For tied values, the midranks are the

average of their ranks. The midranks of 2, 5, 5, 6, 9, 9, 9, 10, for

example, are 1, 2.5, 2.5, 4, 6, 6, 6, and 8. Let M̅X and M̅Y be the

means of the midranks associated with the samples X and Y

when the data are pooled. The Brunner–Munzel test statistic

is

$$B = \frac{\bar{M}_Y - \bar{M}_X}{(m+n)\sqrt{SB_X^2/(mn^2) + SB_Y^2/(m^2 n)}},$$

where the expressions for $SB_X^2$ and $SB_Y^2$ are given in Appendix B. The distribution of
B can be approximated by a t-distribution with $f_B$ degrees of freedom:

$$f_B = \left(\frac{SB_X^2}{n} + \frac{SB_Y^2}{m}\right)^2 \Big/ \left(\frac{SB_X^4}{n^2(m-1)} + \frac{SB_Y^4}{m^2(n-1)}\right).$$

4. Simulation setup

We examined the significance level and power of the tests

by using computer simulations. Table 1 defines the relevant

parameters of the simulation setup. The choices of these

parameters are discussed below.

Two criteria were used to select sample sizes: the total

sample size had to range from small to large, and the ratio of

the two sample sizes had to correspond to balanced designs

(m=n), and unbalanced designs (m/n>1 and m/n<1).

The impact of unequal variances was studied by specifying

the ratio of the standard deviations (θ). The largest standard

deviation was associated with the m size sample X, and the

smallest standard deviation was associated with the n size

sample Y. Values of θ=1.0, 1.25, 1.5, 2.0, 4.0 were used. When

m>n, the distribution of the largest sample had the largest

variance, and when m<n, the distribution of the largest

sample had the smallest variance.

Different degrees of skewness (β) were introduced by

using gamma and lognormal distributions. When the two

distributions were given different degrees of skewness, the

distribution with the largest variance had the largest skewness.

The normal distribution was used to generate symmetric

distributions (β=0).

In the power simulations, a difference in location (D)

between the two distributions was introduced and standar-

dized to make it comparable across distributions and sample

sizes:

$$D = \delta \cdot \sqrt{\sigma_A^2/m + \sigma_B^2/n}, \qquad \delta = 1, 2, 3,$$

where $\sigma_A^2$ is the variance of distribution A, and $\sigma_B^2$ is the variance of
distribution B.
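To make the setup concrete, the sketch below draws one pair of samples with a prescribed skewness, standard deviation ratio, and standardized location difference. It is an illustration under stated assumptions, not the authors' Matlab code: the gamma parametrization (shape 4/β², which gives skewness β, and scale chosen to match the standard deviation), the centring of each sample at its population mean, the choice to shift sample X, and all function names are hypothetical.

```python
import random
from math import sqrt


def gamma_params(skewness, sd):
    """Shape and scale of a gamma distribution with the given skewness and sd.

    Uses the standard facts: skewness = 2 / sqrt(shape), sd = sqrt(shape) * scale.
    """
    shape = 4.0 / skewness ** 2
    scale = sd / sqrt(shape)
    return shape, scale


def draw_sample(rng, size, skewness, sd, shift=0.0):
    """Draw a sample with mean `shift`; normal if skewness is 0, else gamma."""
    if skewness == 0:
        return [rng.gauss(shift, sd) for _ in range(size)]
    shape, scale = gamma_params(skewness, sd)
    centre = shape * scale  # population mean of the unshifted gamma
    return [rng.gammavariate(shape, scale) - centre + shift for _ in range(size)]


def location_difference(delta, var_a, m, var_b, n):
    """D = delta * sqrt(var_a / m + var_b / n)."""
    return delta * sqrt(var_a / m + var_b / n)


rng = random.Random(2009)            # fixed seed so the draw is reproducible
m, n, theta = 25, 100, 2.0           # sample X has the larger standard deviation
d = location_difference(1, theta ** 2, m, 1.0, n)
x = draw_sample(rng, m, skewness=2.0, sd=theta, shift=d)
y = draw_sample(rng, n, skewness=2.0, sd=1.0)
```

Repeating such draws many times and recording how often a test rejects at level α gives an estimate of its true significance level (δ=0) or power (δ>0).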

5. Results and recommendations

5.1. Gamma distribution versus lognormal distribution

We generated data from two types of distributions, the

gamma distribution and the lognormal distribution. Test

recommendations were based on each distribution individu-

ally. In general, the robustness criteria were satisfied slightly

more often when data were generated from the lognormal

distribution than when data were generated from

the gamma distribution. The general behavior of the tests was

very similar for the two distributions, both when significance

level and power were considered. We have restricted further

attention to the results and recommendations based on the

gamma distribution. This makes the recommendations

slightly more cautious than they would have been had they been

based on the lognormal distribution.

5.2. Nominal significance level of 5% versus 1%

The qualitative behavior of the tests was the same for a

nominal significance level of 5% and 1%. However, the

significance levels of the 1% tests were more sensitive to the

effects of skewness, unequal variances, and unequal sample

sizes than the significance levels of the 5% tests, thus making

the 1% tests a little less type I error robust. As for power, the

1% and 5% power curves had similar shapes.

5.3. Test recommendations

For each studied situation, two criteria were used to

decide if a test could be recommended. First, the true

significance level (p) of the test had to be close to the

nominal significance level (α). Closeness was defined in

three levels: p within 10% of α, p within 20% of α, and p

within 40% of α. As considerably less robustness was

observed for α=0.01, we felt that requiring both the

α=0.01 and the α=0.05 tests to satisfy the robustness

criteria was too strict, especially since α=0.05 is by

far the most used in medical publications. Therefore, the

robustness criteria were based on α=0.05 only. Second,

the power of the test had to be higher than the power of

the other tests. To allow for the inaccuracy of results from

simulation, a definition of power equivalence was devised.

For each test with each distribution and sample size

combination, the three power values corresponding to the

three introduced differences in distribution location

(δ=1,2,3) were summed. Two tests were considered

power equivalent if the smallest power sum deviated less

than 2.5% from the largest power sum.
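The power equivalence rule can be written as a short check. The sketch below is an interpretation, not the authors' code: it assumes that the 2.5% deviation is measured relative to the largest power sum, which the text does not state explicitly, and the function name `power_equivalent` is hypothetical.

```python
def power_equivalent(powers_a, powers_b, threshold=0.025):
    """Compare two tests by the sum of their power values over delta = 1, 2, 3.

    Assumption: the deviation is taken relative to the largest power sum.
    """
    sum_a, sum_b = sum(powers_a), sum(powers_b)
    largest, smallest = max(sum_a, sum_b), min(sum_a, sum_b)
    if largest == 0:
        return True  # both tests have zero power everywhere
    return (largest - smallest) / largest < threshold


# Made-up power values for illustration only.
print(power_equivalent([0.20, 0.50, 0.80], [0.19, 0.50, 0.79]))
print(power_equivalent([0.20, 0.50, 0.80], [0.15, 0.45, 0.75]))
```

In the first comparison the power sums are 1.50 and 1.48, a relative difference of about 1.3%, so the tests would be treated as equivalent; in the second the difference is 10%.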

Due to the large number of simulated situations, a

comprehensive display of the recommendations is given in

Appendix A. Two examples are given in Table 2: m=100,

n=25 with equal distribution skewness, and m=n=50

with unequal distribution skewness. In both cases, the

robustness level is 20%.

It is clear from the recommendations that simple rules

about which test should be used in which situation cannot be

accurately stated. Each of the factors under consideration in

this study—the total sample size, the sample size ratio, the

standard deviation ratio, skewness, and skewness hetero-

geneity—has an effect on type I errors or power or both of

some or all the tests. The net effect of these factors is often

difficult to predict. We strongly recommend that the relevant

tables in Appendix A be consulted before the choice of test is

made. Nonetheless, a superficial summary of the recommen-

dations is shown in Table 3.

There are situations where none of the tests can be

recommended. Transformation of the data by taking loga-

rithms or square roots may reduce skewness and variance

heterogeneity, but there are some problems with this

approach [16–18]. First, the exact effect of the transforma-

tion on skewness and variance is somewhat unpredictable.

Two samples of similar shape may have skewness and

variance altered differently, and differences that did not exist

between the original samples may be introduced between

the transformed samples. Second, the results from tests on

transformed data are valid only on the transformed scale,

and interpreting the results back onto the original scale can

be troublesome. As a general rule, when using transforma-

tions of any kind, the transformed samples should be

examined with the same scrutiny as the original samples.

Specifically, signs of unequal variances and skewness

distributed unevenly between the two samples should be

given particular attention.

Table 1
Summary of the simulation setup.

Tests                         T: the two-sample T test
                              U: the Welch U test
                              Y: the Yuen–Welch test
                              W: the Wilcoxon–Mann–Whitney test
                              B: the Brunner–Munzel test
Null hypotheses               Equal means; equal medians
Difference in location        δ=0, 1, 2, 3
Nominal significance levels   α=0.05; 0.01
Sampling distributions        Gamma(a); lognormal(a)
Sample sizes                  (m,n)=(10, 10), (10, 25), (25, 10), (25, 25), (50, 50), (25, 100), (100, 25), (100, 100)
Standard deviation ratios     θ=1.0, 1.25, 1.5, 2.0, 4.0
Equal skewness values         βA=βB=0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0
Unequal skewness values       (βA, βB)=(1.0, 0.5), (2.0, 0.5), (3.0, 2.5), (3.0, 2.0), (3.0, 1.0)
Replications                  10,000
Programming language          Matlab [15] with the Statistics Toolbox

(a) Normal distribution for β=0.


5.4. The clinical example revisited

In section 2, we compared the locations of D-dimer in the

low-dose HT and the high-dose HT treatment arms after six

weeks of the Eilertsen et al. trial [13]. We obtained widely

different p-values with our five tests. The sample sizes in the

two groups were 47 and 48, the standard deviation ratio was

284/260=1.1, and the sample skewness was 3.1 and 1.8. For

distributions with unequal skewness, Table 13 in Appendix A

details recommendations for a sample size of 50 in each

group. An excerpt is given in the lower part of Table 2. For

distributions similar to the ones in our example, the two-

sample T test and the Welch U test are the most powerful

tests of means, and the Yuen–Welch test is the most powerful

test of medians. All three tests are type I error robust at the

10% level. As the differences in means and trimmed means are

similar (87 and 89), the smaller p-value for the Yuen–Welch

test reflects the smaller variance estimate this test uses due to

trimming of the largest observations.

To conclude this example, there is some evidence (Yuen–

Welch test: p=0.027) that there is a difference in 20%

trimmed means, but no evidence of a difference in means (T

test/Welch: p=0.13). The Wilcoxon–Mann–Whitney and the

Brunner–Munzel tests are not recommended in this situation.

Because the Yuen–Welch test is robust for testing medians,

and because the trimmed means are close to the medians, any

inference drawn about the trimmed means can be applied to

the medians as well.
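As a rough plausibility check, the Welch statistic of section 3 can be recomputed from the rounded summary statistics reported in section 2 (means 398 and 485, standard deviations 284 and 260, sample sizes 47 and 48). This is only a sketch: it assumes that those summary statistics describe the same data as the reported tests, and rounding means the result need not match the published p-value exactly.

```latex
U = \frac{398 - 485}{\sqrt{284^2/47 + 260^2/48}}
  \approx \frac{-87}{\sqrt{1716.1 + 1408.3}}
  \approx \frac{-87}{55.9}
  \approx -1.56,
\qquad
f_U \approx \frac{(1716.1 + 1408.3)^2}{1716.1^2/46 + 1408.3^2/47} \approx 92 .
```

An absolute value of about 1.56 on roughly 92 degrees of freedom lies well below the two-sided 5% critical value of about 1.99, which is broadly in line with the reported p=0.13 for the Welch U test.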

6. Discussion

Comparing the locations of two skewed populations is

fraught with difficulties. Unless the degree of skewness is

small, different measures of central tendency—for example

the mean, the median, and the 20% trimmed mean—can differ

markedly in numeric value. If the variances are unequal as

well, making inferences about equality of two different

measures can lead to opposite conclusions. In such cases, it

is important to accurately define the population differences of

interest, and to interpret test results in strict adherence to the

tests' null hypotheses.

The aim of this paper was to assess the ability of some

much used tests to compare means and medians for a wide

range of skewed distributions with unequal variances. Our

Table 3
Brief summary of the recommendations.

        Comparing means               Comparing medians
        θ=1.0    θ>1.0
m=n     W,B      T,U                  T,U,W,B, sometimes Y(a)
m<n     B        U or no test(b)      U,B, sometimes Y(c)
m>n     W        U                    U,Y,W,B

θ is the standard deviation ratio and β is the skewness. m and n are the sample sizes.
When m<n, the smallest sample has the largest variance. When m>n, the largest sample has
the largest variance. T = the two-sample T test; U = the Welch U test; Y = the Yuen–Welch
test; W = the Wilcoxon–Mann–Whitney test; B = the Brunner–Munzel test.
(a) Y for combinations of large θs or large βs or both.
(b) U when β≤1.0, else no test.
(c) Y for large sample sizes.

Table 2
Tests with highest power that satisfy 4.0≤p≤6.0 (in percent) for α=0.05 (robustness level: 20%).

Upper panel: m=100, n=25, equal skewness.
  Rows: standard deviation ratio 4.00, 2.00, 1.50, 1.25, 1.00.
  Columns: βA=βB=0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, given separately for H0: equal means and H0: equal medians.

Lower panel: m=n=50, unequal skewness.
  Rows: standard deviation ratio 4.00, 2.00, 1.50, 1.25, 1.00.
  Columns: (skewness dist. A, skewness dist. B)=(1.0, 0.5), (2.0, 0.5), (3.0, 2.5), (3.0, 2.0), (3.0, 1.0), given separately for H0: equal means and H0: equal medians.

[Table body: each cell lists the recommended test or tests (T, U, Y, W, B, or “–”); the complete recommendation tables are given in Appendix A.]

p is the true significance level and α is the nominal significance level. βA is the skewness
of distribution A and βB is the skewness of distribution B. An entry of “–” means that no test
satisfies the robustness criterion. The data were generated from normal distributions
(skewness=0) and gamma distributions (skewness>0).
T = the two-sample T test.
U = the Welch U test.
Y = the Yuen–Welch test.
W = the Wilcoxon–Mann–Whitney test.
B = the Brunner–Munzel test.


recommendations are detailed in Appendix A. We briefly

review the most important results:

• The performance of the tests depends on many factors, most

notably variance heterogeneity, skewness and skewness

heterogeneity, the sample size ratio, and the total sample

size.

• Small distribution changes can lead to large changes in test

performance.

• Skewness heterogeneity had a slight negative effect on the

rank-based tests, but almost no effect on the parametric

tests.

• For the simulated settings, the Welch U test is recom-

mended most frequently.

• The rank-based methods are sensitive to departures from

the pure shift model.

• For variables with skewed distributions, the 20% trimmed

mean is closer to the median than to the mean.

• The five examined tests performed similarly on samples

drawn from gamma distributions as compared to samples

drawn from lognormal distributions.

The advantage of the Welch test demonstrated in our

study is in agreement with previous studies and several

authors recommend the Welch test for almost all situations

[19–22]. We agree that the Welch test is the best test in

general, but to select the most powerful robust test, a careful

consideration of the properties of the data is recommended.

As an aid in this endeavor, Appendix A should be helpful.

The five tests examined in this paper are but a small part of

the large set of tests available for the two-sample location

problem. However, because of their widespread use, these

five tests merit special attention. Several alternative methods

are presented in the two books by Wilcox [11,23], including

methods using robust measures of location, rank-based

methods, permutation tests, and bootstrap methods. One of

the main obstacles to contemporary methods is availability in

commercial software. This problem is easily overcome by

using the free software R [24] for which a large number of

functions exist to perform modern methods [11,23].

Our simulation study is limited in scope by two main

factors. First, we have employed two families of distributions,

the gamma and the lognormal. Although very similar results

were observed for the two distributions, we cannot rule out

the possibility that other types of distributions may produce

conspicuously different results. Also, extreme observations

have a large impact on the T test and the Welch test. A realistic

modeling of extreme observations is difficult, and other

distributions than the gamma and the lognormal are perhaps

better suited. Second, the effect of kurtosis has not been

assessed. There is some evidence that kurtosis has only a

minor effect on type I error rates [9,16,25], but that power

may be affected [26]. For gamma and lognormal distributions,

skewness and kurtosis are not independent parameters [27].

Thus, for the skewed distributions studied in this paper, the

effect of kurtosis cannot be separated from the effect of

skewness.

We have quantified robustness by defining 10%, 20%, and

40% limits to the deviation of the true significance level from

the nominal level. We consider a 10% deviation to be

acceptable in almost any practical application and a

20% deviation to be sufficiently precise for most situations.

However, if a test is robust at the 40% level only, obtained p-

values should be interpreted with due caution.

Appendix A. Supplementary data

Supplementary data associated with this article can be

found, in the online version, at doi:10.1016/j.cct.2009.06.007.

Appendix B

Appendix B.1. Estimates of the squared standard errors in the

Yuen–Welch test

Let $g_X = \lfloor \gamma m \rfloor$ and $g_Y = \lfloor \gamma n \rfloor$ be the number of
observations (rounded down) trimmed from each tail in X and Y. Denote the number of remaining
observations in the trimmed samples by $h_X = m - 2g_X$ and $h_Y = n - 2g_Y$. The squared
standard errors are based on the sample Winsorized variances. Denote the sorted observations
in X by $X_{(1)} \le X_{(2)} \le \dots \le X_{(m)}$. The Winsorized sample of X,

$$W_X = W_X^1, W_X^2, \ldots, W_X^m,$$

is found by setting $W_X = X$ and replacing each of the $g_X$ smallest observations,
$X_{(1)}, \ldots, X_{(g_X)}$, with $X_{(g_X+1)}$, and replacing each of the $g_X$ largest
observations, $X_{(m-g_X+1)}, \ldots, X_{(m)}$, with $X_{(m-g_X)}$. The Winsorized sample of
Y ($W_Y$) is found in the same way. Denote the Winsorized sample means by $\bar{W}_X$ and
$\bar{W}_Y$. The sample Winsorized variances are

$$sw_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} (W_X^i - \bar{W}_X)^2 \quad\text{and}\quad sw_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (W_Y^i - \bar{W}_Y)^2.$$

The squared standard errors in the Yuen–Welch test are

$$d_X = \frac{sw_X^2\,(m-1)}{h_X(h_X-1)} \quad\text{and}\quad d_Y = \frac{sw_Y^2\,(n-1)}{h_Y(h_Y-1)}.$$

Further details can be found in [4,11].
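The trimming and Winsorizing steps above, together with the test statistic from section 3, can be sketched as follows. This is a minimal standard-library Python illustration with hypothetical function names, not the authors' implementation. Note that `floor(gamma * size)` relies on floating-point multiplication, so a production version may want to guard against rounding just below an integer.

```python
from math import floor, sqrt
from statistics import mean, variance


def trimmed_parts(sample, gamma):
    """Return (trimmed mean, squared standard error d, remaining size h)."""
    s = sorted(sample)
    size = len(s)
    g = floor(gamma * size)          # observations trimmed from each tail
    h = size - 2 * g                 # observations left after trimming
    kept = s[g:size - g]
    # Winsorize: the g smallest become X_(g+1), the g largest become X_(m-g).
    winsorized = [s[g]] * g + kept + [s[size - g - 1]] * g
    d = variance(winsorized) * (size - 1) / (h * (h - 1))
    return mean(kept), d, h


def yuen_welch(x, y, gamma=0.2):
    """Return (Yuen-Welch statistic, f_Y)."""
    tx, dx, hx = trimmed_parts(x, gamma)
    ty, dy, hy = trimmed_parts(y, gamma)
    stat = (tx - ty) / sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return stat, df


# Toy data, not from the paper: one extreme value in x.
stat, df = yuen_welch([1, 2, 3, 4, 100], [2, 4, 6, 8, 10])
print(round(stat, 4), round(df, 3))
```

With `gamma=0` no observations are trimmed and the function returns the Welch U statistic and its degrees of freedom, which matches the remark in the Introduction that the two tests coincide for zero trimming.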

Appendix B.2. Variance estimates in the Brunner–Munzel test

Following the notation in section 3, $M_X = M_X^1, M_X^2, \ldots, M_X^m$ and
$M_Y = M_Y^1, M_Y^2, \ldots, M_Y^n$ are the midranks of X and Y based on pooling all
observations. $\bar{M}_X$ and $\bar{M}_Y$ are the means of the pooled midranks. Midranks can
also be computed within each sample. Denote these by $V_X = V_X^1, V_X^2, \ldots, V_X^m$ and
$V_Y = V_Y^1, V_Y^2, \ldots, V_Y^n$. The variance estimates in the Brunner–Munzel test are

$$SB_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(M_X^i - V_X^i - \bar{M}_X + \frac{m+1}{2}\right)^2$$

and

$$SB_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(M_Y^i - V_Y^i - \bar{M}_Y + \frac{n+1}{2}\right)^2.$$

For further details, see [3,11].
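The midrank computation and the Brunner–Munzel statistic of section 3 can be sketched in standard-library Python as follows. The function names are illustrative. The sketch assumes that the two samples overlap: if they are completely separated, both variance estimates are zero and the statistic is undefined.

```python
from math import sqrt
from statistics import mean


def midranks(values):
    """Ranks of `values`, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2  # mean of ranks i+1, ..., j
        i = j
    return ranks


def brunner_munzel(x, y):
    """Return (B, f_B) as defined in section 3 and Appendix B.2."""
    x, y = list(x), list(y)
    m, n = len(x), len(y)
    pooled = midranks(x + y)
    mx, my = pooled[:m], pooled[m:]      # pooled midranks M_X, M_Y
    vx, vy = midranks(x), midranks(y)    # within-sample midranks V_X, V_Y
    mean_mx, mean_my = mean(mx), mean(my)
    sbx2 = sum((a - b - mean_mx + (m + 1) / 2) ** 2 for a, b in zip(mx, vx)) / (m - 1)
    sby2 = sum((a - b - mean_my + (n + 1) / 2) ** 2 for a, b in zip(my, vy)) / (n - 1)
    b_stat = (mean_my - mean_mx) / (
        (m + n) * sqrt(sbx2 / (m * n ** 2) + sby2 / (m ** 2 * n))
    )
    df = (sbx2 / n + sby2 / m) ** 2 / (
        sbx2 ** 2 / (n ** 2 * (m - 1)) + sby2 ** 2 / (m ** 2 * (n - 1))
    )
    return b_stat, df


# The midrank example from section 3.
print(midranks([2, 5, 5, 6, 9, 9, 9, 10]))

# Toy data, not from the paper.
b_stat, f_b = brunner_munzel([1, 3, 5, 7], [2, 4, 6, 8, 10])
print(round(b_stat, 4), round(f_b, 3))
```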


References

[1] Welch BL. The significance of the difference between two means when

the population variances are unequal. Biometrika 1937;29:350–62.

[2] Lehmann EL. Nonparametrics—statistical methods based on ranks.

Upper Saddle River, NJ: Prentice-Hall, Inc.; 1975.

[3] Brunner E, Munzel U. The nonparametric Behrens–Fisher problem:

asymptotic theory and a small-sample approximation. Biom J 2000;42:

17–25.

[4] Yuen KK. The two-sample trimmed t for unequal population variances.

Biometrika 1974;61:165–70.

[5] Cliff N. Dominance statistics: ordinal analyses to answer ordinal

questions. Psychol Bull 1993;114:494–509.

[6] Wilcox RR. Comparing the means of two independent groups. Biom J

1990;32:771–80.

[7] Wilcox RR, Keselman HJ. Modern robust data analysis methods:

measures of central tendency. Psychol Methods 2003;8:254–74.

[8] Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the

impact of statistics on research outcomes: comparative power of the t-

test and Wilcoxon rank-sum test in small samples applied research.

J Clin Epidemiol 1999;52:229–35.

[9] Penfield DA. Choosing a two-sample location test. J Exp Educ

1994;62:343–60.

[10] Stonehouse JM, Forrester GJ. Robustness of the t and U tests under

combined assumption violations. J Appl Stat 1998;25:63–74.

[11] Wilcox RR. Introduction to robust estimation and hypothesis testing.

2nd ed. San Diego, CA: Academic Press; 2005.

[12] Bradley JV. Robustness? Br J Math Stat Psychol 1978;31:144–52.

[13] Eilertsen AL, Qvigstad E, Andersen TO, Sandvik L, Sandset PM.

Conventional-dose hormone therapy (HT) and tibolone, but not low-

dose HT and raloxifene, increase markers of activated coagulation.

Maturitas 2006;55:278–87.

[14] Wilcox RR. Some results on the Tukey–McLaughlin and Yuen methods

for trimmed means when distributions are skewed. Biom J 1994;3:

259–73.

[15] Matlab 7. Natick, MA: The MathWorks, Inc.; 2005.

[16] Pearson ES, Please NW. Relation between the shape of population

distribution and the robustness of four simple test statistics. Biometrika

1975;62:223–41.

[17] Sutton CD. Computer-intensive methods for tests about the mean of an

asymmetrical distribution. J Am Stat Assoc 1993;88:802–10.

[18] Grissom RJ. Heterogeneity of variance in clinical data. J Consult Clin

Psychol 2000;68:155–65.

[19] Best DJ, Rayner JCW. Welch's approximate solution for the Behrens–

Fisher problem. Technometrics 1987;29(2):205–10.

[20] Gans DJ. Use of a preliminary test in comparing two sample means.

Commun Stat Simul C 1981;B10(2):163–74.

[21] Zimmerman DW. A note on preliminary tests of equality of variances. Br

J Math Stat Psychol 2004;57:173–81.

[22] Ruxton GD. The unequal variance t-test is an underused alternative to

Student's t-test and the Mann–Whitney U test. Behav Ecol 2006;17(4):

688–90.

[23] Wilcox RR. Applying contemporary statistical techniques. San Diego,

CA: Academic Press; 2003.

[24] The R project for statistical computing. [http://www.r-project.org/].

[25] Cressie NAC, Whitford HJ. How to use the two sample t-test. Biom J

1986;28(2):131–48.

[26] Wilcox RR. ANOVA: the practical importance of heteroscedastic

methods, using trimmed means versus means, and designing simula-

tion studies. Br J Math Stat Psychol 1995;48:99–114.

[27] Evans M, Hastings N, Peacock B. Statistical distributions. 3rd ed. New

York, NY: John Wiley & Sons, Inc.; 2000.
