
Sample Size Planning for the Standardized Mean Difference:

Accuracy in Parameter Estimation Via Narrow Confidence Intervals

Ken Kelley

Indiana University

Joseph R. Rausch

University of Notre Dame

Methods for planning sample size (SS) for the standardized mean difference so that a narrow

confidence interval (CI) can be obtained via the accuracy in parameter estimation (AIPE)

approach are developed. One method plans SS so that the expected width of the CI is

sufficiently narrow. A modification adjusts the SS so that the obtained CI is no wider than

desired with some specified degree of certainty (e.g., 99% certain the 95% CI will be no wider

than ω, the desired width). The rationale of the AIPE approach to SS planning is given, as is a discussion of the

analytic approach to CI formation for the population standardized mean difference. Tables

with values of necessary SS are provided. The freely available Methods for the Behavioral,

Educational, and Social Sciences (K. Kelley, 2006a) R (R Development Core Team, 2006)

software package easily implements the methods discussed.

Keywords: sample size planning, standardized mean difference, accuracy in parameter

estimation, power analysis, precision analysis

One of the simplest measures of effect is the difference

between two independent group means. It is this differ-

ence that is evaluated with the two-group t test to infer

whether the population difference between two group

means differs from some specified null value, which is

generally set to zero. However, in the behavioral, educa-

tional, and social sciences, units of measurement are

often arbitrary, different researchers might measure the

same phenomenon with different scalings of the same

instrument, or different instruments altogether might be

used. Because of the lack of standard measurement scales

and procedures for most behavioral, educational, and

social phenomena, the ability to compare measures of

effect across different situations has led many researchers

to use standardized measures of effect. Measures of ef-

fect, or effect sizes, that are standardized yield scale-free

numbers that are not wedded to a specific instrument or

scaling metric. Given the measurement issues in behav-

ioral, educational, and social research, such standardized

effect sizes provide what is arguably the optimal way to

estimate the size of an effect, along with its correspond-

ing confidence interval, for a more communal knowledge

base to be developed and so that the results from different

studies can be compared more readily.

A commonly used and often intuitively appealing
effect size is the standardized mean difference.¹ In fact, the

standardized mean difference is the most widely used sta-

tistic in the context of meta-analysis for experimental and

intervention studies (Hunter & Schmidt, 2004, p. 246). The

population standardized mean difference is defined as

δ = (μ1 − μ2)/σ, (1)

where μ1 is the population mean of Group 1, μ2 is the
population mean of Group 2, and σ is the population
standard deviation, assumed to be equal across the two groups.

Because the unstandardized (raw) mean difference may not

be directly comparable across studies, the unstandardized

difference between group means can be divided by the

standard deviation to remove the particular measurement

scale, yielding a pure number (Cohen, 1988, p. 20). A

commonly used set of guidelines for the standardized mean

¹ In some cases, the unstandardized difference between means is

more intuitively appealing than is the standardized mean differ-

ence (e.g., Bond, Wiitala, & Richard, 2003). If the unstandardized

mean difference is of interest, Kelley et al. (2003) discussed the

methods analogous to those discussed in the present article.

Ken Kelley, Inquiry Methodology Program, Indiana University;

Joseph R. Rausch, Department of Psychology, University of Notre

Dame.

Joseph R. Rausch is now at the Department of Psychology,

University of Minnesota, Twin Cities Campus.

Correspondence concerning this article should be addressed to

Ken Kelley, Inquiry Methodology Program, Indiana University,

201 North Rose Avenue, Bloomington, IN 47405. E–mail:

kkiii@indiana.edu

Psychological Methods

2006, Vol. 11, No. 4, 363–385

Copyright 2006 by the American Psychological Association

1082-989X/06/$12.00 DOI: 10.1037/1082-989X.11.4.363


difference in the behavioral, educational, and social sci-

ences, although not without its critics (e.g., Lenth, 2001), is

that δs of 0.2, 0.5, and 0.8 are regarded as small, medium,
and large effects, respectively (Cohen, 1969, 1988).²

Suppose a researcher is interested in the effect of a particular

treatment on the mean of some variable and would like to

compare an experimental group with a control group. The

researcher’s review of the literature and a pilot study lead the

researcher to believe that the effect of the treatment is of a

“medium” magnitude, corresponding to a standardized mean

difference of approximately 0.50. As is widely recommended

in the literature, the researcher conducts a power analysis to

determine the necessary sample size so that there will be a high

probability of rejecting the presumed false null hypothesis.

Basing the sample size calculations on a desired degree of

power of 0.85, the researcher conducts the study with 73

participants per group.
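The per-group sample size quoted above can be approximated with a short calculation. The sketch below uses only the Python standard library and the large-sample normal approximation (the function name is ours, and the approximation runs slightly below the exact noncentral t solution, which gives 73 here):

```python
from math import ceil
from statistics import NormalDist

def n_per_group_normal_approx(delta, power=0.85, alpha=0.05):
    """Approximate per-group n for a two-sided, two-group test of a
    standardized mean difference delta, via the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_beta = z.inv_cdf(power)           # ~1.04 for power = .85
    return ceil(2 * ((z_alpha + z_beta) / delta) ** 2)

print(n_per_group_normal_approx(0.50))  # 72; the exact noncentral t value is 73
```

The exact calculation replaces the normal reference distribution with the noncentral t, which is why it returns a slightly larger sample size.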

The observed standardized mean difference was 0.53,

giving some support to the researcher’s assertion that the

effect is of medium magnitude, and was shown to be sta-

tistically significant, t(144) = 3.20, p(|t(144)| ≥ |3.20|) = .002.

In accord with recent recommendations in the literature, the

researcher forms a 95% confidence interval for δ, which

ranges from 0.199 to 0.859. Although the researcher be-

lieved the effect was medium in the population, to the

researcher’s dismay, the lower limit of the confidence in-

terval is smaller than “small” and the upper limit is larger

than “large.” The width of the researcher’s confidence in-

terval thus illustrates that even though the null hypothesis

was rejected, a great deal of uncertainty exists regarding the

value of δ, which is where the researcher's interest ulti-

mately lies. Indeed, as Rosenthal (1993) argued, the results

we are actually interested in from empirical studies are the

estimate of the magnitude of the effect and an indication of

its accuracy, “as in a confidence interval placed around the

estimate” (p. 521).
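The reported t statistic can be recovered from d and the per-group sample sizes through the identity t = d·sqrt(n1·n2/(n1 + n2)), which the article develops later as Equation 7; a minimal stdlib check:

```python
from math import sqrt

def t_from_d(d, n1, n2):
    # t = d * sqrt(n1*n2 / (n1 + n2)) for two independent groups
    return d * sqrt(n1 * n2 / (n1 + n2))

print(round(t_from_d(0.53, 73, 73), 2))  # 3.2, matching the reported t(144) = 3.20
```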

The purpose of the present work is to offer an alter-

native to the power analytic approach to sample size

planning for the standardized mean difference. This gen-

eral approach to sample size planning is termed accuracy

in parameter estimation (AIPE; Kelley, 2006b; Kelley &

Maxwell, 2003, in press; Kelley, Maxwell, & Rausch,

2003), where what is of interest is planning sample size

to achieve a sufficiently narrow confidence interval so

that the parameter estimate will have a high degree of

expected accuracy. A confidence interval consists of a set

of plausible values that will contain the parameter with
(1 − α)100% confidence. Appropriately constructed confidence
intervals will always contain the parameter estimate and will
contain the parameter (1 − α)100% of the time. The idea of the
AIPE approach is that when the width of a (1 − α)100%
confidence interval decreases, the range of plausible values for
the parameter decreases, with the estimate necessarily contained
within this set of plausible values. Provided that the confidence
interval procedure is exact (i.e., the nominal coverage is equal to
the empirical coverage) and holding constant the (1 − α)100%
confidence interval coverage, the expected difference between
the estimate and the parameter decreases as the confidence
interval width decreases.

In the context of parameter estimation, accuracy is de-

fined as the square root of the mean square error, which is

a function of both precision and bias. Precision is inversely

related to the variance of the estimator, and bias is the

systematic discrepancy between an estimate and the param-

eter it estimates. More formally, accuracy is quantified by

the (square) root of the mean square error (RMSE) as

RMSE = √(E[(θ̂ − θ)²])
     = √(E[(θ̂ − E[θ̂])²] + (E[θ̂ − θ])²)
     = √(σθ̂² + Bθ̂²), (2)

where E[·] represents expectation, θ is the parameter of
interest, θ̂ is an estimate of θ, σθ̂² is the population variance
of the estimator, and Bθ̂ is the bias of the estimator.

confidence interval width decreases, holding constant the

confidence interval coverage, the estimate is contained

within a narrower set of plausible parameter values and the

expected accuracy of the estimate improves (i.e., the RMSE

is reduced). Thus, provided that the confidence interval

procedure is exact, when the width of the (1 ? ?)100%

confidence interval decreases, the expected accuracy of the

estimate necessarily increases.
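The decomposition in Equation 2 can be checked by simulation. The sketch below (a deliberately biased estimator of a normal mean; all constants are illustrative) shows that the mean square error about the parameter equals the variance of the estimates plus the squared bias:

```python
import random
from math import sqrt

random.seed(1)
theta, n, reps = 0.5, 10, 20000  # true mean, sample size, replications

# Deliberately biased estimator: the sample mean plus a constant offset.
estimates = []
for _ in range(reps):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(sum(sample) / n + 0.1)  # bias of +0.1

mean_est = sum(estimates) / reps
variance = sum((e - mean_est) ** 2 for e in estimates) / reps
bias = mean_est - theta

rmse_direct = sqrt(sum((e - theta) ** 2 for e in estimates) / reps)
rmse_decomposed = sqrt(variance + bias ** 2)

# The two quantities agree to floating-point precision.
print(abs(rmse_direct - rmse_decomposed) < 1e-9)  # True
```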

Increasing sample size has two effects on
accuracy. First, the larger the sample size, generally the
more precise the estimate.³ Second, estimates that are

biased will generally become less biased as sample size

increases, which must be the case for consistent estima-

tors (regardless of whether the estimator is biased or

unbiased; Stuart & Ord, 1994). Notice that when an

estimate is unbiased (i.e., E[θ̂ − θ] = 0), precision and ac-

curacy are equivalent. However, a precise estimator need

not be an accurate estimator. Thus, precision is a neces-


² Of course, as with most rules of thumb, Cohen's (1988)

guidelines have their limitations and should not be applied without

first consulting the literature of the particular area. Overreliance on

Cohen’s guidelines can lead an investigator astray when planning

sample size for a particular research question when the size of δ is

misidentified, which is easy to do if the only possibilities consid-

ered are 0.2, 0.5, and 0.8.

³ A counterexample is the Cauchy distribution, in which the

precision of the location estimate is the same regardless of the

sample size used to estimate it (Stuart & Ord, 1994, pp. 2–3).


sary but not a sufficient condition for accuracy.⁴ Beyond

the effect of improving precision, decreasing bias im-

proves accuracy.⁵ This usage of the term accuracy is the

same as that used by Neyman (1937) in his seminal work

on the theory of confidence intervals: “The accuracy of

estimation corresponding to a fixed value of 1 − α may

be measured by the length of the confidence interval” (p.

358; we changed Neyman's original notation of α repre-
senting the confidence interval coverage to 1 − α to

reflect current usage).

One of the main reasons why researchers plan, conduct,

and then analyze the data of empirical studies is to learn

about some parameter of interest. One way in which re-

searchers have attempted to learn about the parameter of

interest historically has been by conducting null hypothesis

significance tests. Null hypothesis significance testing al-

lows researchers to reject the idea that the true value of the

parameter of interest is some precisely specified value (usu-

ally zero for the standardized mean difference). By conduct-

ing a significance test that achieves statistical significance,

researchers learn probabilistically what the parameter is not

(e.g., δ is not likely zero) and possibly the direction of the

effect. Another way in which researchers have attempted to

learn about the parameter of interest is by forming confi-

dence intervals for the population parameter of interest. By

forming a confidence interval, not only does a researcher

learn probabilistically what the parameter is not (i.e., those

values outside the bounds of the interval) but also a re-

searcher learns probabilistically a range of plausible values

for the parameter of interest.⁶

As has been echoed numerous times in the methodolog-

ical literature of the behavioral, educational, and social

sciences (e.g., Nickerson, 2000, which along with the ref-

erences contained therein provides a comprehensive histor-

ical review; see also Cohen, 1994; Meehl, 1997; Schmidt,

1996; among many others), there are serious limitations to

null hypothesis significance tests. As Hunter and Schmidt

(2004) and Cohen (1994) pointed out, the null hypothesis

may almost never be exactly true in nature.⁷ Regardless of

whether the null hypothesis is true or false, what is often

most informative is the value or size of the population

effect. As recommended by Wilkinson and the American

Psychological Association (APA) Task Force on Statistical

Inference (1999), researchers should “always present effect

sizes for primary outcomes” (p. 599), and they stressed that

“interpreting effect sizes in the context of previously re-

ported effects is essential to good research” (p. 599).

Wilkinson and the APA Task Force on Statistical Inference

also recommended that “interval estimates should be given

for any effect sizes involving principal outcomes” (p. 599).

It seems that there is general consensus in the methodolog-

ical community of the behavioral, educational, and social

sciences with regard to trying to understand various phe-

nomena of interest, and that consensus is to report confi-

dence intervals for effect sizes whenever possible; indeed,

this strategy may be the future of quantitative methods in

applied research (Thompson, 2002).

Even though the merits of significance testing have

come under fire in the methodological literature, null

hypothesis significance tests have played a major role in

the behavioral, educational, and social sciences. Al-

though reporting measures of effect is useful, reporting

point estimates without confidence intervals to illustrate

the uncertainty of the estimate can be misleading and

cannot be condoned. Reporting and interpreting point

estimates can be especially misleading when the corre-

⁴ As an extreme example, suppose a researcher always ig-

nores the data and estimates the parameter as a value that

corresponds to a particular theory. Such an estimate would have

a high degree of precision but potentially could be quite biased.

The estimate would only have a high degree of accuracy if the

theory was close to perfect.

⁵ Some parameters have exact confidence interval procedures

that are based on a biased point estimate of the parameter yet

where an unbiased point estimate of the parameter also exists.

A strategy in such cases is to report the unbiased estimate for

the point estimate of the parameter in addition to the (1 −
α)100% confidence interval for the parameter (calculated on the

basis of the biased estimate). Examples of parameters that have

exact confidence interval procedures that are calculated on the

basis of a biased estimate are the standardized mean difference

(e.g., Hedges & Olkin, 1985), the squared multiple correlation

coefficient (e.g., Algina & Olejnik, 2000), the standard devia-

tion (see, e.g., Hays, 1994, for the confidence interval method

and Holtzman, 1950, for the unbiased estimate), and the coef-

ficient of variation (see, e.g., Johnson & Welch, 1940, for the

confidence interval method and Sokal & Braumann, 1980, for

its nearly unbiased estimate).

⁶ Assuming that the assumptions of the model are met, the

correct model is fit, and observations are randomly sampled,

1 − α is the probability that any given confidence interval from

a collection of confidence intervals calculated under the same

circumstances will contain the population parameter of interest.

However, it is not true that a specific confidence interval is

correct with 1 − α probability, as a computed confidence

interval either does or does not contain the value of the param-

eter. The procedure refers to the infinite number of confidence

intervals that could theoretically be constructed and the (1 −
α)100% of those confidence intervals that correctly bracket the

population parameter of interest (see Hahn & Meeker, 1991, for

a technical review of confidence interval formation). Although

the meaning of confidence intervals given is from a frequentist

perspective, the methods discussed in the article are also appli-

cable in a Bayesian context.

⁷ This argument seems especially salient in the context of obser-

vational studies in which preexisting group differences likely exist.

However, in unconfounded experimental studies with randomization,

it seems plausible that a treatment might literally have no effect,

which would of course imply that the null hypothesis is true.


sponding confidence interval is wide and thus little is

known about the likely size of the population parameter

of interest. Because confidence intervals provide a range

of reasonable values that bracket the parameter of interest

with some desired degree of confidence, confidence in-

tervals provide a great deal of information above and

beyond the estimated value of the effect size and the

corresponding statistical significance test. Thus, effect

sizes accompanied by their corresponding confidence in-

tervals are perhaps the best way to illustrate how much

information was learned about the parameter of interest

from the study.

Suppose a very wide confidence interval is formed, and

yet zero is excluded from the confidence interval. Such a

confidence interval provides some but not much insight

into the phenomenon of interest. What is learned in such

a scenario is that the parameter is not likely zero and

possibly the direction of the effect. Even in situations in

which it is well established that the effect is not zero,

providing statistical evidence that the effect is not zero is

almost always a goal. The reason power analysis is so

beneficial is because it helps to ensure that an adequate

sample size is used to show that the effect is not zero in

the population. However, the result of a significance test

in and of itself does not provide information about the

size of the effect.

The accuracy of parameter estimates is also important

in another context: when one wishes to show support for

the null hypothesis (e.g., Greenwald, 1975) or in the

context of equivalence testing (e.g., Steiger, 2004; Tryon,

2001). The “good enough” principle can be used and a

corresponding “good enough belt” can be formed around

the null value, where the limits of the belt would define

what constituted a nontrivial effect (Serlin & Lapsley,

1985, 1993). Suppose that not only is the null value

contained within the good enough belt but also the limits

of the confidence interval are within the good enough

belt. This would be a situation in which all of the plau-

sible values would be smaller in magnitude than what has

been defined as a trivial effect (i.e., they are contained

within the good enough belt). In such a situation, the

limits of the (1 − α)100% confidence interval would

exclude all effects of meaningful size. If the parameter is

less in magnitude than what is regarded to be minimally

important, then learning this can be very valuable. This

information may or may not support the theory of inter-

est, but what is important is that valuable information

about the size of the effect and thus the phenomenon of

interest has been gained.

Perhaps the ideal scenario in many research contexts is

when the confidence interval for the parameter of interest

is narrow (and thus a good deal is learned about the

plausible value of the parameter) and does not contain

zero (and thus the null hypothesis can be rejected). Ac-

complishing the latter, namely, rejecting the null hypoth-

esis, has long been a central part of research design in the

form of power analysis. However, accomplishing the

former, namely, obtaining a narrow confidence interval,

has not received much attention in the methodological

literature of the behavioral, educational, and social sci-

ences (cf. Algina & Olejnik, 2000; Bonett & Wright,

2000; Kelley & Maxwell, 2003; Kelley et al., 2003;

Smithson, 2003).

Confidence intervals can be calculated for the standard-

ized mean difference in two main ways. One method uses

the bootstrap technique (e.g., Efron & Tibshirani, 1993) and

does not require the assumption of homogeneity of variance

or normality to obtain valid confidence intervals (Kelley,

2005), potentially using a robust estimator of standardized

population separation in place of d (e.g., Algina, Keselman,

& Penfield, 2005). The other method, which is optimal

when the assumptions of normality, homogeneity of vari-

ance, and independence of observations are satisfied, is the

analytic approach. The analytic approach requires special-

ized computer routines, specifically noncentral t distribu-

tions, to obtain the confidence limits for ? (e.g., Cumming &

Finch, 2001; Kelley, 2005; Smithson, 2001; Steiger, 2004;

Steiger & Fouladi, 1997). Throughout the remainder of the

article, the focus is on the analytic approach to confidence

interval formation.

The problem the present work addresses is that of

obtaining an accurate estimate of the population stan-

dardized mean difference by planning sample size so that

the observed (1 − α)100% confidence interval will be

sufficiently narrow with some specified probability. The

following section provides an overview of confidence

interval formation for the population standardized mean

difference. Methods for planning sample size so that the

expected width of the confidence interval is sufficiently

narrow are then developed. The first procedure deter-

mines the sample size necessary for the expected width of

the obtained confidence interval for the population stan-

dardized mean difference to be sufficiently narrow. Ob-

taining a large enough sample size so that the expected

width will be sufficiently narrow does not guarantee that

a computed interval will, in fact, be as narrow as speci-

fied. This method is extended into a follow-up procedure

in which there will be some desired degree of certainty

that the computed interval will be sufficiently narrow

(e.g., 99% certain that the 95% confidence interval will

be no wider than the specified width). Sample size tables

are provided for a variety of situations on the basis of the

premise that they will assist applied researchers in choos-

ing an appropriate sample size given a particular goal

within the AIPE framework for the standardized mean

difference. Because a main goal of research is to learn

about the parameter of interest, obtaining a narrow con-

fidence interval may be the best way to fulfill this goal. It


is this premise, coupled with the usefulness of the stan-

dardized mean difference, that has motivated this article

and the development of computer routines that can be

used to carry out the methods discussed.⁸

Estimation and Confidence Interval Formation for

the Standardized Mean Difference

Although δ is the ultimate quantity of interest, δ is un-
known and must be estimated from sample data. The most
common estimator of δ is defined as

d = (X̄1 − X̄2)/s, (3)

where X̄j is the mean for the jth group (j = 1, 2) and s is
the square root of the pooled variance (i.e., s is the square
root of the unbiased estimate of the within-group vari-
ance).
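A minimal sketch of Equation 3, with s computed as the square root of the pooled within-group variance (the data are invented for illustration):

```python
from math import sqrt

def pooled_sd(x1, x2):
    # Square root of the unbiased pooled within-group variance.
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((x - m1) ** 2 for x in x1)
    ss2 = sum((x - m2) ** 2 for x in x2)
    return sqrt((ss1 + ss2) / (n1 + n2 - 2))

def smd(x1, x2):
    # d = (mean of Group 1 - mean of Group 2) / pooled sd  (Equation 3)
    return (sum(x1) / len(x1) - sum(x2) / len(x2)) / pooled_sd(x1, x2)

x1 = [5.1, 6.0, 4.8, 5.5, 6.2]  # hypothetical Group 1 scores
x2 = [4.2, 5.0, 4.6, 3.9, 4.8]  # hypothetical Group 2 scores
print(round(smd(x1, x2), 2))  # 1.95
```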

As pointed out by Cumming and Finch (2001), there is

inconsistency in the terminology and notation used when

discussing the standardized and unstandardized effect sizes

for mean differences (see also Grissom & Kim, 2005). Our

proposal is to use Δ as μ1 − μ2 with D = X̄1 − X̄2 as its
sample estimate, δ (Equation 1) as the population standard-
ized mean difference and d (Equation 3) as its sample
estimate, and δC as the population standardized mean dif-
ference using the control group standard deviation as the
divisor and dC as its sample estimate. Not discussed in this
article but important nonetheless are the unbiased estimators
of δ and δC (d and dC are not unbiased), for which we
suggest dU and dCU as their notation (see Hedges, 1981, for
its theoretical developments and Kelley, 2005, for some
comparisons to the commonly used biased version). Dis-
cussed momentarily is the noncentral t distribution that has
a noncentral parameter. There is also inconsistency in no-
tation for this noncentral parameter, and we suggest λ as
opposed to δ or Δ (both commonly used symbols) because
of their use as population effect size measures.⁹

Part of this inconsistency in notation is a function of

trying to attribute one or more versions of the standardized

effect size to particular authors coupled with those same

authors using different notation in different works. The

estimated standardized mean difference, d, is often referred

to as Cohen’s d (even though Cohen used d as the popula-

tion parameter and ds as its sample analog; Cohen, 1988)

because of Cohen’s work on the general topic of effect size

and power analysis (Cohen, 1969, p. 18) and sometimes as

Hedges's g′ (or g) because of Hedges's work on how the

standardized effect size could be used in a meta-analysis

context and its theoretical developments (Hedges, 1981, p.

110). The analogous standardized effect size based on the

control group standard deviation is often called Δ or Glass's
g (Glass, 1976; Glass, McGaw, & Smith, 1981, p. 29;

Hedges, 1981, p. 109). Furthermore, the Mahalanobis dis-

tance, which is the multivariate version of d (and for one

variable is equal to d), was developed well before d was

used as a standardized effect size in the behavioral, educa-

tional, and social sciences (Mahalanobis, 1936). Given all of

the possible labelings of what is defined in Equation 3, we

call this quantity the standardized mean difference without

attempting to attribute this often used quantity to any one

individual (recognizing that many have worked on its the-

oretical developments and others have encouraged its use)

and use the notation d to represent the sample value (which

is currently the most widely used notation).

Recall that the two-group t test is defined as

t = (X̄1 − X̄2)/(s√(1/n1 + 1/n2)), (4)

where n1 and n2 are the sample sizes for Group 1 and Group
2, respectively. In the two-group situation, assuming homo-
geneity of variance, s is defined as

s = √[(s1²(n1 − 1) + s2²(n2 − 1))/(n1 + n2 − 2)], (5)

where s1² and s2² are the within-group variances for Group 1
and Group 2, respectively, and s has n1 + n2 − 2 degrees of
freedom. However, in an analysis of variance (ANOVA)
context where more than two groups exist and the assump-
tions of ANOVA are satisfied, the estimate of s (as well as
d) can be improved by pooling information across all J
groups, even if what is of interest is the difference between

⁸ Throughout the article, specialized software is used. Ken

Kelley has developed an R package that contains, among other

things, the necessary functions to form confidence intervals for

the population standardized mean difference and to estimate

sample size from the AIPE perspective for the standardized

mean difference. The R package is titled Methods for the

Behavioral, Educational, and Social Sciences (MBESS) and is

an Open Source, and thus freely available, package available

via the Comprehensive R Archival Network (CRAN; http://

www.r-project.org/). The direct link to the MBESS page on

CRAN, where the most up-to-date version of MBESS is avail-

able, is http://cran.r-project.org/src/contrib/Descriptions/

MBESS.html (note that this Internet address is case sensitive).

⁹ Much of the work contained in the present article can be
applied to δC and dC by modifying the degrees of freedom of the
denominator to have degrees of freedom equal to that of sC, the

standard deviation of the control group.


the means of only two specific groups. Thus, more gener-
ally, s can be defined as

s = √[Σ_{j=1}^{J} sj²(nj − 1)/(N − J)], (6)

where N is the total sample size (N = Σ_{j=1}^{J} nj) and J is the
number of groups (j = 1, . . . , J). In situations where J >
2 and the ANOVA assumptions are satisfied, basing s on all
groups leads to more degrees of freedom (N − J degrees of
freedom instead of n1 + n2 − 2). Holding everything else
constant, the larger the degrees of freedom, the more pow-
erful the significance test for the mean difference and the
more accurate the estimate of the standardized (and un-
standardized) mean difference. Thus, when information on
J ≥ 3 groups is available, making use of that information
should be considered even if what is of interest is estimating
δ for two specific groups. Of course, as J increases, the
potential for the assumption of homogeneity of variance to
be violated also increases, but if the assumption holds, more
power and accuracy will be gained by using a pooled
variance based on J ≥ 3 groups.

Notice that the difference between d from Equation 3 and
the two-group t statistic from Equation 4 is the quantity
√(1/n1 + 1/n2), which is multiplied by s to estimate the
standard error. Because √(1/n1 + 1/n2), contained in the
denominator of the t statistic, can be rewritten as
√[(n2 + n1)/(n1n2)], multiplying the inverse of this quantity
by d leads to an equivalent representation of the t statistic:

t = d√[n1n2/(n2 + n1)]. (7)

Given Equation 7, it can be seen that Equation 3 can be
written as

d = t√[(n2 + n1)/(n1n2)]. (8)
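Equations 7 and 8 are exact inverses of one another, as a short sketch confirms (the values d = 1.25 and n1 = n2 = 10 anticipate the worked example given later in the article):

```python
from math import sqrt

def d_to_t(d, n1, n2):
    return d * sqrt(n1 * n2 / (n2 + n1))    # Equation 7

def t_to_d(t, n1, n2):
    return t * sqrt((n2 + n1) / (n1 * n2))  # Equation 8

t = d_to_t(1.25, 10, 10)
print(round(t, 4))                  # 2.7951
print(round(t_to_d(t, 10, 10), 4))  # 1.25, recovering the original d
```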

The usefulness of Equations 7 and 8 will be realized mo-

mentarily when discussing the formation of confidence in-

tervals for δ.

The noncentral parameter in the two-group context
indexes the magnitude of the difference between the null
hypothesis of μ1 = μ2 and an alternative hypothesis of
μ1 ≠ μ2. The larger the difference between the null and
alternative hypotheses, the larger the noncentral param-
eter. In the population, the degree to which μ1 ≠ μ2 for
N − 2 degrees of freedom is known as a noncentral
parameter:

λ = (μ1 − μ2)/(σ√(1/n1 + 1/n2)) = δ√[n1n2/(n2 + n1)]. (9)

The noncentral parameter λ is of the same form as a t
statistic (for a technical discussion of the noncentral t
distribution, see, e.g., Hogben, Pinkham, & Wilk, 1961;
Johnson, Kotz, & Balakrishnan, 1995; Johnson & Welch,
1940). In fact, λ can be obtained by replacing the sample
values in Equation 4 with their population values for the
sample sizes of interest. Given the relationship between a
t value and the corresponding noncentral parameter, λ
can be estimated by the observed t statistic: λ̂ = t.
Construction of confidence intervals for δ is indirect and
proceeds by first finding a confidence interval for λ and
then transforming those bounds via Equation 8 to the
scale of δ using a combination of the confidence interval
transformation principle and the inversion confidence
interval principle (Cumming & Finch, 2001; Kelley,
2005; Steiger & Fouladi, 1997; Steiger, 2004).

Let t(q,ν,λ) be the critical value at the qth quantile
from a noncentral t distribution with ν degrees of free-
dom and noncentral parameter λ. The degrees of freedom
parameter is based on the sample size used to calculate s.
To find the confidence bounds for δ, first find the confi-
dence bounds for λ. Because of the confidence interval
transformation principle, the one-to-one monotonic rela-
tion between δ and λ given n1 and n2 (Equations 7 and 8)
implies that the (1 − α)100% confidence bounds for λ
provide, after transformation via Equation 8, the (1 −
α)100% confidence bounds for δ.

The confidence bounds for λ are determined by finding
the noncentral parameter whose 1 − α/2 quantile is t (for
the lower bound of the confidence interval) and by find-
ing the noncentral parameter whose α/2 quantile is t (for
the upper bound of the confidence interval). Thus, the
lower confidence bound for λ, λL, is the noncentral
parameter that leads to t(1−α/2,ν,λL) = t, and the upper
confidence bound for λ, λU, is the noncentral parameter
that leads to t(α/2,ν,λU) = t.¹⁰ For the lower and upper
confidence bounds for λ, given ν, α, and t, the only
unknown values are λL and λU. It is λL and λU that are of

10 It is assumed here that the confidence interval will use the same rejection region in both tails. Although convenient, this is not necessary. Rejection regions could be defined so that α = αL + αU for the lower and upper rejection regions, respectively. It is assumed in this article that α/2 = αL = αU (i.e., equal probability in each rejection region). The MBESS package does not make this assumption, and thus varying values of αL and αU are possible when determining the confidence interval for the standardized mean difference.

368

KELLEY AND RAUSCH

Page 7

interest when forming confidence intervals for δ and that have, until recently, been difficult to obtain. However, ΛL and ΛU from t(1 − α/2, ν, ΛL) and t(α/2, ν, ΛU), respectively, are now easily obtainable with several software titles, making the confidence limits for Λ, and ultimately for δ, easy to find:

p(ΛL ≤ Λ ≤ ΛU) = 1 − α, (10)

where p represents the probability of ΛL and ΛU bracketing Λ at the 1 − α level.

As an example, suppose two groups of 10 participants each have a standardized mean difference of 1.25, with corresponding t value 2.7951. The noncentral t distribution with noncentral parameter 0.6038 has 2.7951 at its .975 quantile, whereas the noncentral t distribution with noncentral parameter 4.9226 has 2.7951 at its .025 quantile, both with 18 degrees of freedom. Thus,

CI.95 = [0.6038 ≤ Λ ≤ 4.9226], (11)

where CI.95 represents a 95% confidence interval. The relation between the two noncentral distributions and the observed t value is illustrated in Figure 1, where the shaded regions represent the areas of the distributions that are beyond the confidence limits. As can be seen in Figure 1, the noncentral t distribution on the left has a noncentral parameter of 0.6038, and at its .975 quantile is the observed t value, which is denoted with the bold vertical line near the center of the abscissa. As can also be seen, the noncentral t distribution on the right has a noncentral parameter of 4.9226, and at its .025 quantile is the observed t value, which is denoted with the same bold vertical line.

The shaded lines to the left and right of ΛL and ΛU, respectively, illustrate the area of these distributions outside of the confidence bounds for Λ. Furthermore, because of the one-to-one relation between Λ and δ, the upper abscissa shows values of δ. Notice also that the shapes of the distributions are different, with the one on the right more variable and more positively skewed than the one on the left (because of the larger noncentral parameter, all other things being equal). Of special importance are the two outer vertical lines that represent the noncentral parameters of the two distributions. As can be seen, the noncentral parameters are not only the confidence limits for Λ, but after the noncentral parameters have been rescaled with Equation 8, they yield the confidence limits for δ,

CI.95 = [0.6038 √((10 + 10)/(10 × 10)) ≤ δ ≤ 4.9226 √((10 + 10)/(10 × 10))], (12)

which equals

CI.95 = [0.2700 ≤ δ ≤ 2.2015]. (13)

Notice that although 2.5% of the distribution of d is beyond

the lower and upper limits, the distance between d and the

limits is not the same. As Stuart and Ord (1994) discussed,

“in general, the confidence limits are equidistant from the

sample statistic only if its [i.e., the statistic’s] sampling

distribution is symmetrical” (p. 121). Furthermore, the bold

vertical line in the center identifies the estimated noncentral

parameter (on the lower abscissa) and the estimated stan-

dardized mean difference (on the upper abscissa).
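The inversion just described can be sketched computationally. The following is an illustration using Python's SciPy rather than the MBESS R package used in the article; the numerical values are those of the example above (t = 2.7951, ν = 18, n1 = n2 = 10).

```python
import math

from scipy.optimize import brentq
from scipy.stats import nct

t_obs, df, n1, n2 = 2.7951, 18, 10, 10
alpha = 0.05  # for a 95% confidence interval

# Lower limit for Lambda: the noncentral parameter whose .975 quantile is t_obs,
# i.e., the nc for which P(T <= t_obs | df, nc) = 1 - alpha/2.
lam_L = brentq(lambda nc: nct.cdf(t_obs, df, nc) - (1 - alpha / 2), t_obs - 15, t_obs)
# Upper limit for Lambda: the noncentral parameter whose .025 quantile is t_obs.
lam_U = brentq(lambda nc: nct.cdf(t_obs, df, nc) - alpha / 2, t_obs, t_obs + 15)

# Equation 8: rescale the Lambda limits to the metric of delta.
scale = math.sqrt((n1 + n2) / (n1 * n2))
print(round(lam_L, 4), round(lam_U, 4))                  # ~0.6038, ~4.9226
print(round(lam_L * scale, 4), round(lam_U * scale, 4))  # ~0.2700, ~2.2015
```

Because the cumulative distribution function of the noncentral t is monotone in the noncentral parameter, a bracketing root finder recovers each limit directly.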

As Vaske, Gliner, and Morgan (2002) stated, “large con-

fidence intervals make conclusions more tentative and

weaken the practical significance of the findings” (p. 294).

In an effort to obtain narrower confidence intervals for significant effects, Vaske et al. (2002) suggested researchers report two confidence intervals, one based on the α value used to conduct the significance test and one that has a much larger α value and thus a much narrower confidence interval width, such as α = .30 or α = .20 (p. 299). Although some researchers may be willing to pay the price for such a trade-off (a narrow confidence interval but a low level of confidence interval coverage), readers may not be so willing to accept it (Grissom & Kim, 2005, pp. 61–62). Although such an approach is not advocated here, the desire to obtain narrow confidence intervals because of the benefits they provide is understandable.11 Using the methods developed here will help researchers avoid obtaining confidence intervals whose widths are "embarrassingly large" (Cohen, 1994, p. 1002).

In some situations, the method-implied sample size might be too large for a researcher to reasonably collect. As a reviewer pointed out, this could imply trading "embarrassingly large" confidence intervals for "distressingly large" sample sizes. The methods are still beneficial in such cases because it will be known a priori that the confidence interval will likely be wider than desired, alleviating any unrealistic expectations about its width. Furthermore, authors who

are only able to obtain smaller sample sizes could use the

methods to show that it would be difficult or impossible to

obtain the required sample size for the confidence interval

for ? to be as narrow as desired, even if the sample size

provides sufficient statistical power (e.g., for a large effect

size). In situations in which a single study cannot produce

(e.g., because of insufficient resources) a sufficiently narrow

confidence interval, the use of meta-analysis might be es-

pecially useful (Hedges & Olkin, 1985; Hunter & Schmidt,

2004).

Kelley et al. (2003) discussed methods for planning sam-

11 We agree with those who state that there is nothing magical about α = .05. However, regardless of what the a priori α value is, the methods discussed in the next section are applicable because the α value is specified by the researcher when planning sample size.

369

AIPE FOR THE STANDARDIZED MEAN DIFFERENCE

Page 8

ple size so that the expected width of the confidence interval

for the population unstandardized mean difference would be

equal to some specified value. A modified method was also

developed so that a desired degree of certainty (i.e., a

probability) could be incorporated into the sample size

procedure such that the obtained interval would be no wider

than desired. However, planning sample size so that the

obtained confidence interval is sufficiently narrow has not

been discussed in the context of the standardized mean

difference. The next section addresses this issue formally

for the standardized mean difference and provides solutions

so that the necessary sample size can be determined for the

expected width to be sufficiently narrow, optionally with a

desired degree of certainty that the obtained interval will be

no wider than desired.

Sample Size Planning From an AIPE Perspective for

the Standardized Mean Difference

There are presumably two reasons why sample size planning for the standardized mean difference from an AIPE perspective has not been formally considered. First, sample

size planning has been almost exclusively associated with

power analysis, and thus planning sample size in order to

obtain parameter estimates with a high degree of expected

accuracy (i.e., a narrow confidence interval) has only re-

cently been considered in much of the behavioral, educa-

tional, and social sciences. Second, working with noncentral

t distributions has proven quite difficult because of the

additional complexity of the probability function of the t

distribution when the noncentral parameter is not zero.

Figure 1. Density of the noncentral t distribution with 18 degrees of freedom and noncentral parameter 0.6038 (distribution on the left) and of the noncentral t distribution with 18 degrees of freedom and noncentral parameter 4.9226 (distribution on the right). Note that t(.975, 18, 0.6038) = t(.025, 18, 4.9226) = t = 2.7951. Thus, the 95% confidence interval for Λ (shown on the lower abscissa) given the observed t value (2.7951) has lower and upper confidence bounds of 0.6038 and 4.9226, respectively. Transforming the confidence limits to the scale of δ (shown on the upper abscissa) leads to lower and upper 95% confidence bounds for δ of 0.2700 and 2.2015, respectively.


Specialized computer algorithms are necessary to determine

quantiles at desired probability values and probability val-

ues at desired quantiles. With the focus of sample size

planning on power to the neglect of accuracy, and the

inability to readily work with noncentral t distributions, it is

no wonder that sample size planning from an accuracy

perspective has not yet been developed for the standardized

mean difference.

Nevertheless, the solution to this problem is of interest to

substantive researchers who want to estimate the sample

size necessary to obtain narrow confidence intervals and for

methodologists who study the properties of point estimates

and their corresponding confidence intervals. There are also

potential uses in the context of meta-analytic work.12

When attempting to plan sample size so that the expected width of the obtained confidence interval for the population standardized mean difference is sufficiently narrow, it is necessary to use an iterative process. Because the confidence interval for δ is not symmetric about d, the desired width can pertain to the full confidence interval width, the lower width, or the upper width. Let δU be defined as the upper limit and δL be defined as the lower limit of the observed confidence interval for δ. The full width of the obtained confidence interval is thus given as

w = δU − δL, (14)

the lower width of the obtained confidence interval is given as

wL = d − δL, (15)

and the upper width of the obtained confidence interval is given as

wU = δU − d. (16)
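As a quick arithmetic sketch of Equations 14–16, using the example values from Equations 11–13 (d = 1.25, with 95% limits 0.2700 and 2.2015 for δ):

```python
d, delta_L, delta_U = 1.25, 0.2700, 2.2015  # estimate and 95% limits from Equations 11-13

w = delta_U - delta_L   # full width (Equation 14)
w_L = d - delta_L       # lower width (Equation 15)
w_U = delta_U - d       # upper width (Equation 16)
print(round(w, 4), round(w_L, 4), round(w_U, 4))  # 1.9315 0.98 0.9515
```

Note that the lower and upper widths are unequal, consistent with the asymmetry of the confidence interval about d.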

The goals of the research study will dictate the confidence

interval width for which sample size should be planned. In

general, w is the width of interest. Although the methods

discussed are directly applicable to determining sample size

for the lower or the upper limit, we focus exclusively on the

full confidence interval width. Let ω be defined as the

desired confidence interval width, which is specified a priori

by the researcher, much like the desired degree of statistical

power is chosen a priori when determining necessary sam-

ple size in a power analytic context (e.g., Cohen, 1988;

Kraemer & Thiemann, 1987; Lipsey, 1990; Murphy &

Myors, 1998).

The idea of determining sample size so that E[w] ≤ ω is analogous to other methods of planning sample size when a narrow confidence interval is desired (e.g., Guenther, 1981; Hahn & Meeker, 1991; Kelley & Maxwell, 2003; Kupper & Hafner, 1989). The goal is to determine the sample size so that E[w] ≤ ω. However, because the theoretical sample size where E[w] = ω is almost always a fractional value, E[w] is almost always just less than ω for the necessary sample size to be some whole number. The population values are used in the confidence interval as if the population values were sample values, and then the necessary sample size is solved for so that E[w] ≤ ω. In general, sample size can be solved analytically or computationally. Solving sample size computationally, which is especially convenient when the confidence interval does not have a convenient closed-form expression, begins by finding a minimal sample size for which E[w] may still exceed ω; the sample size is then incremented by 1 until E[w] ≤ ω.

Because the noncentral t distribution is used for confidence intervals for δ, sample size is solved for computationally. The initial value of the sample size used in the algorithm is based on the standard normal distribution, which guarantees that the initial sample size will not be too large. If σ is known and is common for the two groups, a confidence interval for the standardized mean difference is given as

p( (X̄1 − X̄2)/σ − z(1−α/2) √(1/n1 + 1/n2) ≤ δ ≤ (X̄1 − X̄2)/σ + z(1−α/2) √(1/n1 + 1/n2) ) = 1 − α. (17)

Because z(1−α/2) √(1/n1 + 1/n2) is subtracted from and added to the observed standardized difference in means, the width of the confidence interval is given as

2 z(1−α/2) √(1/n1 + 1/n2).

When n1 = n2 = n, the confidence interval width can be simplified to

2 z(1−α/2) √(2/n).

Solving analytically for the necessary sample size so that the expected width of the confidence interval is equal to ω is given as

n(0) = ceiling[8 (z(1−α/2) / ω)²],

12 It should be noted that whenever planning sample size, re-

gardless of the perspective one is planning from, if the assumptions

the procedure is based on are not satisfied, then the sample size

estimate may not be appropriate. The degree of the inappropriate-

ness of the estimated sample size will depend strongly on the

specifics of the situation.


where n(0) is the initial value of sample size that will be used in the algorithm for determining the necessary sample size and ceiling[·] rounds the value in brackets to the next largest integer.
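As a concrete sketch of this starting value, with the illustrative choices ω = 0.50 and α = .05:

```python
import math

from scipy.stats import norm

omega = 0.50  # desired full confidence interval width (hypothetical choice)
alpha = 0.05  # for a 95% confidence interval

z = norm.ppf(1 - alpha / 2)            # ~1.96
n_0 = math.ceil(8 * (z / omega) ** 2)  # normal-theory starting value n(0)
print(n_0)  # 123 per group
```

The noncentral-t-based iteration then proceeds upward from this value.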

Of course, in practice, the use of the confidence interval given in Equation 17 is not appropriate because σ is almost never known, and its estimate is a random variable necessitating a noncentral confidence interval, as discussed in the previous section. However, to obtain an initial value of sample size that is guaranteed to be no larger than the necessary sample size, the standard normal distribution is used in place of the noncentral t distribution. The use of the critical value from the standard normal distribution ensures that the starting value for the sample size used in the remainder of the algorithm is not initially overestimated, as replacing the critical value with a noncentral t value at the same α level is guaranteed to increase the width of the confidence interval.

Given n(0), the expected confidence interval can be calculated using the noncentral method previously discussed by replacing d in the confidence interval procedure with δ. The value of δ is used in the sample size procedure because δ is (essentially) the expected value of d, and thus the procedure is based on the value that is expected to be obtained in the study.13 Next, increment sample size by one, yielding n(1) (n(1) = n(0) + 1), and then determine the expected width of the confidence interval, which is now based on n(1). If the expected width using n(1) is equal to or narrower than the desired width, the procedure can be stopped and the necessary sample size can be set to n(1). If the expected confidence interval width is wider than the desired width, sample size can be incremented by one and the expected width determined again. This process continues until the expected width is equal to or just narrower than the desired width. At the iteration where this happens, set n(i) to the necessary sample size. The idea of the algorithm is fairly straightforward: (a) Use δ as if it were d, (b) increase sample size until the expected width of the confidence interval is sufficiently narrow, and (c) set the value of sample size to the necessary value so that E[w] ≤ ω.
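Steps (a) through (c) can be sketched as follows. This is an illustration in Python with SciPy, not the authors' MBESS implementation, and the function names are ours; for δ = 0.80, ω = 0.50, and 95% confidence it should reproduce the 133 participants per group given in the tables example later in the article.

```python
import math

from scipy.optimize import brentq
from scipy.stats import norm, nct

def expected_width(delta, n, conf=0.95):
    """Expected full CI width for delta with n per group, using delta as if it were d."""
    df = 2 * n - 2
    t = delta * math.sqrt(n / 2)  # population analog of the observed t (Equation 9)
    alpha = 1 - conf
    lam_L = brentq(lambda nc: nct.cdf(t, df, nc) - (1 - alpha / 2), t - 50, t)
    lam_U = brentq(lambda nc: nct.cdf(t, df, nc) - alpha / 2, t, t + 50)
    return (lam_U - lam_L) * math.sqrt(2 / n)  # Lambda limits rescaled to delta

def ss_aipe_smd(delta, omega, conf=0.95):
    """Per-group sample size so that E[w] <= omega."""
    z = norm.ppf(1 - (1 - conf) / 2)
    n = max(4, math.ceil(8 * (z / omega) ** 2))  # normal-theory starting value n(0)
    while expected_width(delta, n, conf) > omega:
        n += 1
    return n

print(ss_aipe_smd(0.80, 0.50))  # the article's tabled value for this case is 133 per group
```

The loop mirrors the algorithm exactly: begin at n(0), which cannot overestimate the answer, and step up by one until the expected width first drops to or below ω.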

Although in some situations ensuring that the expected width of a confidence interval for δ is sufficiently narrow is satisfactory, in most situations the desire is for w to be no larger than ω. The procedure just discussed in no way implies that the observed confidence interval width in any particular study will be no larger than ω, as w is a random variable that will fluctuate from study to study or from replication to replication of the same study.14 Thus, it is important to remember that the algorithm just presented provides the sample size such that E[w] ≤ ω. A modified sample size procedure can be performed so that there is a desired degree of certainty that w will not be larger than ω.

Ensuring a Confidence Interval No Wider Than

Desired With a Specified Degree of Certainty

As a function of the properties of the noncentral t distribution, as the magnitude of δ gets larger, holding the confidence interval coverage and sample size constant, the expected width of the confidence interval becomes wider.15 However, the width of the observed confidence interval is a function of d and the per-group sample sizes. When determining the necessary sample size given δ, the variability of d is also important, because if the sample collected yields a d smaller in magnitude than the δ specified when determining the sample size, then w will be narrower than ω. However, when the sample collected yields a d larger in magnitude than δ, w will be wider than ω. Although the former situation might be desirable, the latter situation might be disappointing because the confidence interval width was larger than desired.16

To avoid obtaining, with some specified degree of certainty, a d larger in magnitude than the value the sample size procedure is based on, and thus a w wider than ω, a modified sample size procedure can be used. Let γ be this desired probability, such that γ represents the probability that d will not be larger in magnitude than δγ, where δγ is the point that d will exceed in magnitude only (1 − γ)100% of the time. Thus,

p(|d| ≤ |δγ|) = γ, (18)

implying that d will be contained within the limits −δγ and δγ, γ100% of the time. Notice that |d| > |δγ|, when holding everything else constant, will yield a confidence interval wider than ω because confidence intervals for δ become wider as the magnitude (absolute value) of d increases. Because δ can be transformed to Λ (using the population

13 Actually, d is a biased estimate of δ. However, for even moderate sample sizes (e.g., 30), the discrepancy between E[d] and δ is trivial (Hedges & Olkin, 1985, chapter 5). Although the expected value of d given δ and n could be substituted for δ in the method, doing so leads to no difference in sample size estimates for almost all realistic situations and will potentially lead to differences only in situations where the procedures yield a very small necessary sample size.

14 Although the expected value of w is ω, this does not imply that 50% of the distribution of w will be narrower than specified. In fact, the distribution of w can be quite skewed, and it is generally the case that more than 50% of the distribution is less than ω.

15 This is not necessarily true with all effect sizes. For example, the

confidence interval width for the squared multiple correlation coeffi-

cient is generally at its maximum for values of the sample squared

multiple correlation coefficient that are around .30–.40 (Algina &

Olejnik, 2000; Kelley, 2006b), depending on the particular condition.

16 Alternatively, one could use the largest value of δ that would seem plausible in the particular situation for the obtained confidence interval not to be larger than some specified value.


analog of Equation 7) and vice versa (using the population analog of Equation 8), if the Λγ can be found that satisfies

p(|t| ≤ |Λγ|) = γ, (19)

then Λγ can be transformed into δγ. The value of Λγ is thus the value that satisfies the expression

∫_{−Λγ}^{Λγ} f(t(ν; Λ)) dt = γ, (20)

where f(t(ν; Λ)) is the probability density function of the noncentral t distribution, t is the random t variate, and ν is the degrees of freedom (n1 + n2 − 2 in the present context). Thus,

Λγ is the noncentral value that, along with its opposite (i.e., its negative value), excludes (1 − γ)100% of the sampling distribution of t values. Excluding the (1 − γ)100% of the sampling distribution of t values that have the widest confidence limits, and then using δγ in place of δ in the procedure, will ensure that no more than (1 − γ)100% of the confidence intervals will be wider than desired, as confidence intervals will be wider than desired if and only if, holding everything else constant, |t| > |Λγ|, which will occur only (1 − γ)100% of the time because of the definition of Λγ. The noncentral nature of d, as explained below, makes the development of a sample size planning procedure more difficult than the development of analogous procedures for effects that follow central distributions (e.g., Guenther, 1981; Hahn & Meeker, 1991; Kelley & Maxwell, 2003; Kelley et al., 2003; Kupper & Hafner, 1989).

It is first helpful to compare Equation 20 with the integral form for confidence intervals. The two-sided (1 − α)100% confidence limits for a noncentral t distribution are defined as

∫_{−∞}^{ΛL2} f(t(ν; Λ)) dt = α/2 (21)

and

∫_{ΛU2}^{∞} f(t(ν; Λ)) dt = α/2, (22)

where ΛL2 and ΛU2 are the lower and upper two-sided (1 − α)100% confidence limits for Λ, respectively. Finding ΛL2 and ΛU2 from Equations 21 and 22 would lead to a (1 − α)100% confidence interval for Λ. (Notice that ΛL2 and ΛU2 are the values in Figure 1 in which the lower and upper vertical lines, respectively, define the confidence limits.) The one-sided confidence limits for a noncentral t distribution are defined as

∫_{−∞}^{ΛU1} f(t(ν; Λ)) dt = 1 − α (23)

for a lower (1 − α)100% confidence interval for Λ or as

∫_{ΛL1}^{∞} f(t(ν; Λ)) dt = 1 − α (24)

for an upper (1 − α)100% confidence interval for Λ, where ΛU1 and ΛL1 are the upper and lower one-sided confidence limits. Notice how the form of the confidence limits for Λ (Equations 21–24) differs from the form of Equation 20.

Equations 21–24 each have limits that stretch to positive or to negative infinity, where the regions beyond the lower and the upper limits contain (α/2)100% of the distribution on each side (for the two-sided confidence intervals) or α100% on either side (for the one-sided confidence intervals). Equation 20 is defined such that there is (1 − γ)100% of the distribution beyond the confidence interval limits, as in a typical two-sided confidence interval, with the nontypical requirement that the confidence limits are of the same magnitude. Because of the nonsymmetric properties of the noncentral t distribution, there is not an equal proportion beyond each of the limits.

For a given sample size and level of confidence interval coverage, the width of the confidence interval for Λ (or δ) is based only on Λ̂ (or d). The rationale for determining Λγ via Equation 19 is based on this fact, as a negative or a positive Λ̂ larger in magnitude than Λ, the value on which the sample size procedure is based, will lead to a confidence interval wider than desired. Equation 20 can be solved for the value that will ensure that only (1 − γ)100% of the distribution of the estimated noncentral parameter will be larger in magnitude than Λγ. The width of the confidence interval for the noncentral parameter is the same regardless of sign. Ultimately, Λγ will be converted to δγ so that δγ can be used in place of δ in the standard sample size procedure to ensure that w will be no larger than ω with γ100% certainty.

Although Equation 20 does not have a straightforward analytic solution, lower and upper bounds can be determined such that a range of values can be searched to find the necessary value of Λγ that satisfies Equation 20. The confidence limit from a one-sided confidence interval of the form

∫_{−∞}^{ΛU1} f(t(ν; Λ)) dt = γ, (25)


where ΛU1 is the limit of the γ100% confidence interval, is used as a lower bound for Λγ. The reason that ΛU1 is a lower bound for Λγ is that using ΛU1 in place of Λγ would lead to more confidence intervals that are wider than desired. The proportion of confidence intervals wider than desired is not only equal to the area beyond ΛU1 in Equation 25 but also equal to the proportion of the noncentral distribution beyond −ΛU1. Thus, the total proportion of confidence intervals wider than desired if ΛU1 were used in place of Λγ when determining the modified sample size would be

∫_{−∞}^{−ΛU1} f(t(ν; Λ)) dt + ∫_{ΛU1}^{∞} f(t(ν; Λ)) dt = p(|t| ≥ |ΛU1|), (26)

which is greater than 1 − γ. The first integral is equal to some positive value and the second integral is equal to 1 − γ, necessitating that p(|t| ≥ |ΛU1|) > 1 − γ.

The confidence limits from a γ100% two-sided confidence interval are of the form

∫_{−∞}^{ΛL2} f(t(ν; Λ)) dt = (1 − γ)/2 (27)

and

∫_{ΛU2}^{∞} f(t(ν; Λ)) dt = (1 − γ)/2. (28)

Notice here that both confidence limits have [(1 − γ)/2]100% of the distribution beyond them. The upper confidence limit can be used as an upper bound for Λγ, because (unless Λ = 0) there will be less than (1 − γ)100% of the distribution that is more extreme than −ΛU2 and ΛU2. This is the case because [(1 − γ)/2]100% of the distribution is greater than ΛU2, and because ΛL2 is smaller in magnitude than ΛU2 (so that −ΛU2 < ΛL2), there must be less than [(1 − γ)/2]100% of the distribution below −ΛU2. Thus,

∫_{−∞}^{−ΛU2} f(t(ν; Λ)) dt + ∫_{ΛU2}^{∞} f(t(ν; Λ)) dt = p(|t| ≥ |ΛU2|), (29)

which is less than 1 − γ. This is the case when Λ is positive because the first integral is necessarily smaller than the second integral, and the second integral is equal to (1 − γ)/2, necessitating that p(|t| ≥ |ΛU2|) < 1 − γ (the opposite is true when Λ is negative). Because −ΛU1 and ΛU1 bound more than (1 − γ)100% of the distribution beyond them, ΛU1 must be smaller in magnitude than Λγ. Because −ΛU2 and ΛU2 bound less than (1 − γ)100% of the distribution beyond them, ΛU2 must be larger than Λγ in magnitude. Thus, Λγ lies somewhere between ΛU1 and ΛU2. The closer Λ is to zero, the closer Λγ is to ΛU2 in magnitude (as the noncentral t distribution becomes more symmetric). The farther away Λ is from zero, the closer Λγ is to ΛU1 in magnitude (as the proportion of the distribution less than −ΛU1 approaches zero). An optimization routine that iterates over the interval ΛU1 to ΛU2 searching for the Λγ such that

p(t ≤ −Λγ) + p(t ≥ Λγ) = 1 − γ (30)

will yield the Λγ that can be substituted for Λ in the standard procedure. Because the standard procedure is based on δ, Λγ can be transformed into δγ so that δγ can replace δ from the standard procedure.

Given the detailed discussion above, a summary follows. First, recall (from Equation 19) that p(|t| ≤ |Λγ|) = γ implies (from Equation 18) that p(|d| ≤ |δγ|) = γ. A d larger in magnitude than δγ implies w > ω (due to the definition of δγ, as it is the value that will be exceeded in magnitude only [1 − γ]100% of the time). Basing the sample size procedure on δγ will thus ensure that no more than (1 − γ)100% of the confidence interval widths will be greater than ω, because at least γ100% of the sampling distribution of d is less than δγ in magnitude. Because w ≤ ω whenever |d| ≤ |δγ|, planning sample size on the basis of δγ will lead to no less than γ100% certainty that w ≤ ω.

A brief conceptual overview. The discussion up to this point has been rather technical, so a very general review that is largely conceptual is provided. On the basis of the necessary sample size from the original procedure, where sample size was based on the expected confidence interval width being sufficiently narrow, determine δγ. Recall that the value of δγ is the value on the scale of δ that is expected to be exceeded in magnitude only (1 − γ)100% of the time. The value of δγ is found by solving iteratively for Λγ (using the noncentral t distribution; see Equation 30) in the following equation:

p(d ≤ −δγ) + p(d ≥ δγ) = 1 − γ. (31)

Given δγ, substitute δγ for δ in the original procedure and solve for sample size as before, by incrementing sample size, where the starting value is now the original sample size, until E[w] ≤ ω. The effect of replacing δ with δγ is that γ100% of the sampling distribution of d falls below δγ in magnitude. When sample size is based on δγ, any d less than δγ in magnitude, which will occur γ100% of the time, will imply an observed confidence interval width less than ω.
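The overview above can be sketched as follows. This is again a SciPy illustration with our own function names, not the authors' MBESS implementation; the expected-width helpers are repeated so the block stands alone, and δγ is computed once at the expected-width sample size, mirroring the summary above. For δ = 0.80, ω = 0.50, 95% confidence, and γ = .99, it should reproduce the 142 participants per group given in the tables example later in the article.

```python
import math

from scipy.optimize import brentq
from scipy.stats import norm, nct

def expected_width(delta, n, conf=0.95):
    # Expected full CI width for delta with n per group (delta used as if it were d).
    df, t, alpha = 2 * n - 2, delta * math.sqrt(n / 2), 1 - conf
    lam_L = brentq(lambda nc: nct.cdf(t, df, nc) - (1 - alpha / 2), t - 50, t)
    lam_U = brentq(lambda nc: nct.cdf(t, df, nc) - alpha / 2, t, t + 50)
    return (lam_U - lam_L) * math.sqrt(2 / n)

def ss_expected(delta, omega, conf=0.95):
    # Standard procedure: per-group n so that E[w] <= omega.
    n = max(4, math.ceil(8 * (norm.ppf(1 - (1 - conf) / 2) / omega) ** 2))
    while expected_width(delta, n, conf) > omega:
        n += 1
    return n

def ss_certain(delta, omega, conf=0.95, gamma=0.99):
    # Modified procedure: gamma certainty that the obtained w is no wider than omega.
    n = ss_expected(delta, omega, conf)
    df, lam = 2 * n - 2, abs(delta) * math.sqrt(n / 2)
    # Equation 30: find Lambda_gamma with p(t <= -L) + p(t >= L) = 1 - gamma.
    lam_g = brentq(lambda L: nct.cdf(-L, df, lam) + nct.sf(L, df, lam) - (1 - gamma),
                   lam, lam + 50)
    delta_g = lam_g * math.sqrt(2 / n)  # delta_gamma, back on the scale of delta
    while expected_width(delta_g, n, conf) > omega:
        n += 1
    return n

print(ss_certain(0.80, 0.50, 0.95, 0.99))  # the article's example gives 142 per group
```

The design choice of resuming the search at the standard sample size, rather than at n(0), follows the conceptual overview: the modified sample size can never be smaller than the expected-width sample size.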


Tables of Necessary Sample Size

Although the AIPE approach to sample size planning for

the standardized mean difference can be readily carried out

using MBESS, selected tables of necessary sample size are

provided. The tables are not meant to supplant the computer

routines; rather, they are designed so that researchers can

quickly estimate the necessary sample size to obtain some

desired confidence interval width, possibly with some de-

gree of certainty. The necessary parameters manipulated in the tables are δ, ω, 1 − α, and γ.

The values of δ used in the tables are 0.05 and 0.10 through 1.00 in increments of 0.10. The values of the desired full width (ω) used in the tables are 0.10 through 0.50 in increments of 0.05, and 0.60 through 1.00 in increments of 0.10. The desired degree of certainty values used in the tables are no γ specification (i.e., E[w] ≤ ω) and γ values of .80 and .99. The confidence level (1 − α) was specified at .90, .95, and .99 for the values in Tables

1, 2, and 3, respectively. There are thus a total of 1,386 cells

in the tables representing a wide variety of situations. The

tables can easily be consulted when considering sample size

planning given the goals of AIPE for the standardized mean

difference. Of course, not all interesting combinations of ?,

?, ?, and ? are tabled. However, for situations not covered

in the tables, the Appendix provides computer code using

MBESS that show how sample size can be easily deter-

mined.

As can be seen from the tables, necessary sample size can become very large for very narrow desired confidence interval widths (e.g., ω = 0.10 and ω = 0.15). Few behavioral, educational, or social scientists will likely have the resources at their disposal to achieve a confidence interval for δ whose expected width is close to 0.10 units. Thus, the expectation is that almost all confidence intervals for δ will be wider than 0.10 in practice. Even when the value

will be wider than 0.10 in practice. Even when the value

shown on one of the tables for a particular condition may be

distressingly large, the tables will help to illustrate that

obtaining a confidence interval less than some desired width

may not be practical for a particular situation. Furthermore,

because the ultimate goal might be to obtain accurate esti-

mates of the parameters of interest, when this cannot be

done satisfactorily in a single study, the use of meta-analysis

should be considered. Of course, when an investigator is

entering into a new area of research or performs the study in

a fundamentally different way compared with previous

studies, the use of meta-analysis may be inapplicable. An-

other possibility is multiple-site studies, an idea that has

recently been reproposed (Maxwell, 2004, p. 161), in which

several collaborative research teams collect the same type of

data under the same (or realistically similar) conditions. The

idea of such multisite studies is to spread the burden but

reap the benefits of estimates that are accurate and/or sta-

tistically significant.

The way in which the tables are used is to first identify the

table that corresponds to the confidence level of interest (the

1 − α values for Tables 1, 2, and 3 are .90, .95, and .99, respectively). After identifying the correct confidence level, one of the three γ values must be specified (each table consists of three subtables, where the particular γ is specified at the top of each subtable). Next, base the sample size calculation on δ (δ is specified in the column headings). Finally, the desired ω must be specified (ω is given in the

first column of each subtable). The combination of each of

the required values leads to a particular cell in the table that

corresponds to the per-group sample size. The total sample

size is thus twice the value on the table because the proce-

dure assumes equal per-group sample sizes.

As an example of the use of the tables, suppose that a researcher wishes to obtain a confidence interval with an expected width of 0.50 units when δ = 0.80 at the 95% confidence level. Determining the necessary sample size requires the first subtable (where E[w] is the subtable heading) of Table 2 (where α = .05), with ω = 0.50 (the 9th row) and δ = 0.80 (the 10th column). The necessary sample size in this situation is shown to be 133 participants per group (266 total).
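The calculation behind such a tabled value can be sketched programmatically. The sketch below (Python with SciPy, not the MBESS code the article supplies; function names are illustrative) forms the noncentral-t confidence interval for δ at a trial per-group sample size and increases n until the width is no larger than ω. As a simplification it plugs in δ̂ = δ rather than averaging the width over the sampling distribution of δ̂, so it approximates, and can differ slightly from, the tabled E[w] values.

```python
import math

from scipy.optimize import brentq
from scipy.stats import nct


def ci_width_delta(delta_hat, n, conf=0.95):
    """Width of the noncentral-t confidence interval for the population
    standardized mean difference, given an observed standardized mean
    difference delta_hat and a per-group sample size n (two equal groups)."""
    df = 2 * n - 2
    t = delta_hat * math.sqrt(n / 2)  # the observed t statistic
    alpha = 1 - conf
    # Confidence limits for the noncentrality parameter: the ncp values
    # that place the observed t at the upper and lower alpha/2 tails.
    lam_lo = brentq(lambda lam: nct.cdf(t, df, lam) - (1 - alpha / 2), t - 20, t)
    lam_hi = brentq(lambda lam: nct.cdf(t, df, lam) - alpha / 2, t, t + 20)
    # Convert the ncp limits back to the delta metric.
    return (lam_hi - lam_lo) * math.sqrt(2 / n)


def n_for_width(delta, omega, conf=0.95):
    """Smallest per-group n whose plug-in interval width is at most omega."""
    n = 4
    while ci_width_delta(delta, n, conf) > omega:
        n += 1
    return n


print(n_for_width(0.80, 0.50))  # lands at or very near the tabled 133
```

Because the width shrinks roughly with the square root of n, halving ω requires roughly quadrupling the sample size, which is why the top rows of the tables are so demanding.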

Further suppose that the researcher wishes to be 99% certain that the 95% confidence interval will be no wider than 0.50 units. Using the third subtable of Table 2 (where γ = .99) and the same procedure just discussed, a sample size of 142 per group (284 total) is necessary. As this example demonstrates, moving from a sufficiently narrow expected width to a narrow confidence interval obtained with a high degree of certainty generally does not necessitate a large increase in sample size relative to the initial value. This phenomenon is discussed in the next section.

Why Such a Small Change in Sample Size?

In some cases, modifying the sample size so that there is a high probability of obtaining a confidence interval no wider than desired adds a surprisingly small increase in necessary sample size. From the previous example, recall that a 95% confidence interval when δ = 0.80 and ω = 0.50 requires a sample size of 133 per group. When the desired degree of certainty is specified at .99, the necessary sample size increases to 142 per group (an increase in total sample size of 18, or 6.767%). Thus, in this situation, a fairly small increase in sample size has a fairly large effect on the probability of obtaining a sufficiently narrow confidence interval.
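The certainty modification can be sketched in the same programmatic style. Because the interval for δ widens as |δ̂| grows, one can evaluate the width not at δ itself but at the upper γ100% quantile of δ̂'s sampling distribution (δ̂·sqrt(n/2) follows a noncentral t distribution), and take the smallest n for which even that pessimistic width is no larger than ω. The Python/SciPy sketch below uses illustrative function names and approximates, rather than reproduces, the article's exact procedure.

```python
import math

from scipy.optimize import brentq
from scipy.stats import nct


def ci_width_delta(delta_hat, n, conf=0.95):
    """Width of the noncentral-t confidence interval for delta at a given
    observed delta_hat and per-group sample size n (two equal groups)."""
    df = 2 * n - 2
    t = delta_hat * math.sqrt(n / 2)  # the observed t statistic
    alpha = 1 - conf
    # ncp values placing the observed t at the alpha/2 tails.
    lam_lo = brentq(lambda lam: nct.cdf(t, df, lam) - (1 - alpha / 2), t - 20, t)
    lam_hi = brentq(lambda lam: nct.cdf(t, df, lam) - alpha / 2, t, t + 20)
    return (lam_hi - lam_lo) * math.sqrt(2 / n)


def n_with_certainty(delta, omega, conf=0.95, gamma=0.99):
    """Smallest per-group n such that the interval is no wider than omega
    even if delta_hat comes out at the upper gamma quantile of its
    sampling distribution (delta_hat * sqrt(n/2) is noncentral t)."""
    n = 4
    while True:
        df = 2 * n - 2
        lam = delta * math.sqrt(n / 2)
        # Pessimistic observed effect: the upper gamma quantile of delta-hat.
        delta_hi = nct.ppf(gamma, df, lam) * math.sqrt(2 / n)
        if ci_width_delta(delta_hi, n, conf) <= omega:
            return n
        n += 1


print(n_with_certainty(0.80, 0.50, conf=0.95, gamma=0.99))  # near the tabled 142
```

The small jump from 133 to 142 falls out of this logic: at n near 133 the γ = .99 quantile of δ̂ is only modestly above δ, so only a few additional participants per group are needed to absorb it.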

Small increases in sample size when going from the expected width being sufficiently narrow to having a specified degree of certainty that the width will be sufficiently narrow arise for several reasons. First, with reasonably large sample sizes, δ̂ will not be much larger than δ. Recall that the upper γ100% limit from a one-sided confidence interval is the lower bound


Table 1
Necessary Sample Size per Group for 90% Confidence Intervals for the Population Standardized Mean Difference for Selected Situations

                                        δ
  ω     0.05  0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  1.00

ω = E[w]
 0.10   2166  2168  2176  2189  2208  2233  2262  2298  2338  2384  2436
 0.15    963   964   967   973   982   993  1006  1021  1039  1060  1083
 0.20    542   542   544   548   552   559   566   575   585   596   609
 0.25    347   347   349   351   354   358   362   368   375   382   390
 0.30    241   241   242   244   246   249   252   256   260   265   271
 0.35    177   177   178   179   181   183   185   188   191   195   199
 0.40    136   136   136   137   138   140   142   144   147   150   153
 0.45    107   108   108   109   110   111   112   114   116   118   121
 0.50     87    87    88    88    89    90    91    92    94    96    98
 0.60     61    61    61    61    62    63    63    64    65    67    68
 0.70     45    45    45    45    46    46    47    47    48    49    50
 0.80     34    34    34    35    35    35    36    36    37    38    39
 0.90     27    27    27    28    28    28    28    29    29    30    31
 1.00     22    22    22    22    23    23    23    24    24    24    25

γ = .80
 0.10   2166  2169  2179  2194  2214  2240  2271  2307  2349  2397  2450
 0.15    963   965   969   976   986   997  1012  1028  1047  1068  1092
 0.20    542   543   546   550   555   562   570   580   591   603   617
 0.25    347   348   350   353   356   361   366   372   379   387   396
 0.30    242   242   243   245   248   251   255   259   264   270   276
 0.35    178   178   179   181   183   185   188   191   195   199   204
 0.40    136   136   137   139   140   142   144   147   150   153   157
 0.45    108   108   109   110   111   113   114   116   119   121   124
 0.50     88    88    88    89    90    91    93    95    97    99   101
 0.60     61    61    62    62    63    64    65    66    68    69    71
 0.70     45    45    45    46    47    47    48    49    50    51    53
 0.80     35    35    35    35    36    37    37    38    39    40    41
 0.90     28    28    28    28    29    29    30    30    31    32    32
 1.00     23    23    23    23    23    24    24    25    25    26    27

γ = .99
 0.10   2169  2173  2185  2202  2225  2253  2287  2326  2370  2420  2475
 0.15    965   968   974   982   993  1007  1023  1041  1062  1085  1110
 0.20    544   546   550   555   562   570   579   590   602   615   630
 0.25    349   350   353   357   361   367   373   380   388   397   407
 0.30    243   244   246   249   252   256   261   266   272   279   286
 0.35    179   180   182   184   187   190   193   197   202   207   212
 0.40    138   138   140   142   144   146   149   152   156   160   164
 0.45    109   110   111   113   114   117   119   122   125   128   131
 0.50     89    89    90    92    93    95    97   100   102   105   108
 0.60     62    63    64    65    66    67    69    71    72    74    77
 0.70     47    47    47    48    49    51    52    53    55    56    58
 0.80     36    36    37    38    39    40    41    42    43    44    46
 0.90     29    29    30    30    31    32    33    34    35    36    37
 1.00     24    24    25    25    26    27    27    28    29    30    31

Note. ω is the desired confidence interval width, E[w] is the expected confidence interval width, δ is the population standardized mean difference, and γ is the desired degree of certainty of achieving a confidence interval for δ no wider than desired.


Table 2
Necessary Sample Size per Group for 95% Confidence Intervals for the Population Standardized Mean Difference for Selected Situations

                                        δ
  ω     0.05  0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  1.00

ω = E[w]
 0.10   3075  3078  3089  3108  3135  3170  3212  3262  3320  3385  3458
 0.15   1367  1368  1373  1382  1394  1409  1428  1450  1476  1505  1537
 0.20    769   770   773   777   784   793   803   816   830   847   865
 0.25    492   493   495   498   502   508   514   522   532   542   554
 0.30    342   342   344   346   349   353   357   363   369   377   385
 0.35    251   252   253   254   256   259   263   267   272   277   283
 0.40    193   193   194   195   196   199   201   204   208   212   217
 0.45    152   152   153   154   155   157   159   162   164   168   171
 0.50    123   124   124   125   126   127   129   131   133   136   139
 0.60     86    86    86    87    88    89    90    91    93    95    97
 0.70     63    63    64    64    64    65    66    67    68    70    71
 0.80     49    49    49    49    49    50    51    52    52    53    55
 0.90     38    38    39    39    39    40    40    41    42    42    43
 1.00     31    31    31    32    32    32    33    33    34    34    35

γ = .80
 0.10   3076  3079  3093  3113  3142  3178  3222  3274  3333  3400  3475
 0.15   1368  1369  1376  1385  1398  1415  1435  1458  1485  1515  1548
 0.20    770   771   774   780   788   797   809   822   837   854   873
 0.25    493   494   496   500   505   511   519   527   537   548   561
 0.30    343   343   345   348   351   356   361   367   374   382   391
 0.35    252   252   254   256   258   262   266   270   276   281   288
 0.40    193   193   195   196   198   201   204   208   212   216   221
 0.45    153   153   154   155   157   159   162   164   168   171   175
 0.50    124   124   125   126   127   129   131   134   136   139   142
 0.60     86    86    87    88    89    90    92    93    95    97   100
 0.70     64    64    64    65    66    67    68    69    70    72    74
 0.80     49    49    49    50    51    51    52    53    54    56    57
 0.90     39    39    39    40    40    41    42    42    43    44    45
 1.00     32    32    32    32    33    33    34    35    35    36    37

γ = .99
 0.10   3078  3083  3100  3123  3155  3194  3241  3295  3358  3428  3505
 0.15   1370  1372  1381  1392  1407  1426  1448  1473  1502  1534  1569
 0.20    772   773   779   786   795   806   819   833   850   869   889
 0.25    495   496   500   505   511   518   527   537   548   560   574
 0.30    344   345   348   352   356   362   368   375   383   392   402
 0.35    253   254   257   260   263   267   272   278   284   290   298
 0.40    195   195   197   200   202   206   210   214   219   224   230
 0.45    154   155   156   158   161   164   167   170   174   179   183
 0.50    125   126   127   129   131   133   136   139   142   146   150
 0.60     88    88    89    91    92    94    96    98   101   103   106
 0.70     65    65    66    67    69    70    72    73    75    77    80
 0.80     50    51    51    52    53    55    56    57    59    61    62
 0.90     40    41    41    42    43    44    45    46    47    49    50
 1.00     33    33    34    35    35    36    37    38    39    41    42

Note. ω is the desired confidence interval width, E[w] is the expected confidence interval width, δ is the population standardized mean difference, and γ is the desired degree of certainty of achieving a confidence interval for δ no wider than desired.
