
Tina Memo No. 2002-007

Internal Report

The Effects of an Arcsin Square Root Transform on a

Binomial Distributed Quantity.

P.A. Bromiley and N.A. Thacker

Last updated 13/6/2002

Imaging Science and Biomedical Engineering Division,

Medical School, University of Manchester,

Stopford Building, Oxford Road,

Manchester, M13 9PT.

paul.bromiley@man.ac.uk

Abstract

This document provides proofs of the following:

• The binomial distribution can be approximated with a Gaussian distribution at large values of N.

• The arcsin square-root transform is the variance stabilising transform for the binomial distribution.

• The Gaussian approximation for the binomial distribution is more accurate at smaller values of N after the variance stabilising transform has been applied.

The conclusion contains some comments concerning the relationship between the variance stabilising

transform and the improved accuracy of the Gaussian approximation, which holds for both the binomial

and Poisson distributions.

1 Gaussian Approximation to the Binomial Distribution

The binomial distribution

P(n|N) = \frac{N!}{n!(N-n)!}\, p^n q^{(N-n)}    (1)

gives the probability of obtaining n successes out of N Bernoulli trials, where p is the probability of success and

q = 1−p is the probability of failure. The shape of the distribution approaches that of a Gaussian distribution at

large N as a consequence of the central limit theorem. In order to demonstrate this, it is necessary to expand the

distribution as a Taylor series around the maximum.
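As an editorial aside, Eqn. 1 is straightforward to evaluate numerically; the following minimal Python sketch (the function name binomial_pmf is our own choice) checks that the probabilities of all possible outcomes sum to one:

```python
from math import comb

def binomial_pmf(n, N, p):
    """Eqn. 1: P(n|N) = N!/(n!(N-n)!) * p^n * q^(N-n), with q = 1 - p."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

# The probabilities of all possible outcomes n = 0..N sum to one.
total = sum(binomial_pmf(n, 100, 0.3) for n in range(101))
assert abs(total - 1.0) < 1e-12
```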

First, let n' be the position of the maximum of P(n), giving n = n' + η, and rather than expanding the distribution itself, expand the natural logarithm of the distribution. Expanding as a Taylor series

f(a+h) = f(a) + hf'(a) + \frac{h^2}{2!}f''(a) + \dots + \frac{h^{(n-1)}}{(n-1)!}f^{(n-1)}(a) + \dots    (2)

gives

\ln[P(n'+\eta)] = \ln[P(n')] + \eta B_1 + \frac{\eta^2}{2!}B_2 + \frac{\eta^3}{3!}B_3 + \dots    (3)

where

B_k = \left.\frac{d^k\ln[P(n)]}{dn^k}\right|_{n=n'}    (4)

Since this is an expansion around the position of the maximum, B_1 = 0 and B_2 is negative, so put B_2 = -|B_2|.

The next step is to find the position of the maximum. Taking the log of the distribution and applying Stirling’s

approximation

\ln(n!) \approx n\ln n - n

gives

\frac{d\ln[P(n)]}{dn} \approx \frac{d}{dn}\left[\ln N! - n\ln n + n - (N-n)\ln(N-n) + (N-n) + n\ln p + (N-n)\ln q\right]    (5)

= -\ln n + \ln(N-n) + \ln p - \ln q    (6)

Page 3

At the maximum this is equal to zero, giving

0 = \ln\left[\frac{N-n}{n}\,\frac{p}{q}\right]    (7)

or

1 = \frac{N-n}{n}\,\frac{p}{q}    (8)

Substituting n = n'

1 = \frac{N-n'}{n'}\,\frac{p}{q}    (9)

which reduces to

n' = Np    (10)
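The result n' = Np is easy to confirm numerically; a quick sketch added for this revision, with illustrative values N = 100, p = 0.3 (so Np = 30):

```python
from math import comb

def binomial_pmf(n, N, p):
    """Eqn. 1 evaluated directly."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

N, p = 100, 0.3
# Locate the maximum of P(n|N) by direct search over n.
n_max = max(range(N + 1), key=lambda n: binomial_pmf(n, N, p))
assert n_max == 30  # n' = Np
```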

Now the B_k of the Taylor expansion can be found. We know that B_1 = 0. B_2 is given by

B_2 = \left.\frac{d^2\ln[P(n)]}{dn^2}\right|_{n=n'} \approx \left[-\frac{1}{n} - \frac{1}{N-n}\right]_{n=n'} = -\frac{1}{Np(1-p)}    (11)

Similarly

B_3 = \left.\frac{d^3\ln[P(n)]}{dn^3}\right|_{n=n'} \approx \left[\frac{1}{n^2} - \frac{1}{(N-n)^2}\right]_{n=n'} = \frac{1-2p}{N^2p^2(1-p)^2}    (12)

and

B_4 = \left.\frac{d^4\ln[P(n)]}{dn^4}\right|_{n=n'} \approx -\frac{2(3p^2-3p+1)}{N^3p^3(1-p)^3}    (13)

The terms are getting smaller by O(1/N) each time, so truncate the Taylor series at the B_2 term:

P(n) \approx P(n')\, e^{-|B_2|\eta^2/2}    (14)

In order to find P(n'), impose the requirement of normalisation. Further, assume that the distribution can be treated as continuous, i.e.

\lim_{N\to\infty}\sum_{n=0}^{N} P(n) \approx \int P(n)\,dn = \int_{-\infty}^{\infty} P(n'+\eta)\,d\eta = 1    (15)

so, using the standard Gaussian integral,

\int_{-\infty}^{\infty} P(n')\, e^{-|B_2|\eta^2/2}\,d\eta = P(n')\sqrt{\frac{2\pi}{|B_2|}} = 1    (16)

so

P(n) = \frac{1}{\sqrt{2\pi Npq}}\, e^{-\frac{(n-Np)^2}{2Npq}}    (17)

Comparing this to the Gaussian distribution

P(n) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(n-\mu)^2}{2\sigma^2}}    (18)

it can be seen that, at values of the parameters where the assumptions made in the derivation hold (i.e. at large N), the binomial distribution can be approximated by a Gaussian distribution with mean

\mu = Np

and standard deviation

\sigma = \sqrt{Npq}
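As a numerical illustration (not part of the original derivation), the agreement between the binomial distribution and this Gaussian can be checked directly. Log-gamma is used in place of factorials to avoid overflow; the values N = 10000, p = 0.3 are arbitrary:

```python
from math import exp, lgamma, log, pi, sqrt

def log_binomial_pmf(n, N, p):
    """ln P(n|N), via log-gamma to avoid overflow at large N."""
    return (lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)
            + n * log(p) + (N - n) * log(1 - p))

N, p = 10000, 0.3
mu, sigma = N * p, sqrt(N * p * (1 - p))

# Near the mean, the binomial pmf and the Gaussian density agree closely.
for n in (2950, 3000, 3050):
    b = exp(log_binomial_pmf(n, N, p))
    g = exp(-(n - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
    assert abs(b - g) / b < 1e-2
```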


2 A Variance Stabilising Transform for the Binomial Distribution

The most commonly used variance stabilising transform for the binomial distribution is

y = \arcsin\sqrt{\frac{n}{N}}    (19)

Applying the change-of-variables formula,

g(y) = f[r^{-1}(y)]\,\frac{dr^{-1}(y)}{dy} \quad \text{if}\quad y = r(x)    (20)

gives the distribution in the new space

P(y) = \frac{N!}{(N\sin^2 y)!\,(N-N\sin^2 y)!}\, 2N\sin y\cos y\; p^{N\sin^2 y} q^{(N-N\sin^2 y)}    (21)

It is again possible to demonstrate that a Gaussian distribution provides a good approximation to this distribution at large N. As before, the aim is to expand the distribution as a Taylor series around the maximum

\ln[P(y'+z)] = \ln[P(y')] + zB_1 + \frac{z^2}{2!}B_2 + \frac{z^3}{3!}B_3 + \dots    (22)

B_k = \left.\frac{d^k\ln[P(y)]}{dy^k}\right|_{y=y'}    (23)

where y' is the position of the mean and z is the distance in y from the mean. Again, since this is at the maximum, B_1 must be zero and B_2 must be negative.

Taking the log of the distribution in y space and applying Stirling's approximation

\alpha! \approx \alpha^\alpha e^{-\alpha}\sqrt{2\pi\alpha}    (24)

gives, after simplification,

\ln[P(y)] \approx N\ln N + \frac{1}{2}\ln N + N\sin^2 y\,\ln p - N\sin^2 y\,\ln(N\sin^2 y) + N\ln q - N\sin^2 y\,\ln q - N\ln(N\cos^2 y) + N\sin^2 y\,\ln(N\cos^2 y) + \ln\sqrt{\frac{2}{\pi}}    (25)

Differentiating gives

\frac{d\ln[P(y)]}{dy} \approx 2N\sin y\cos y\,\ln p - 2N\sin y\cos y\,\ln(N\sin^2 y) - 2N\sin y\cos y - 2N\sin y\cos y\,\ln q + \frac{2N\sin y}{\cos y} - \frac{2N\sin^3 y}{\cos y} + 2N\sin y\cos y\,\ln(N\cos^2 y)    (26)

which reduces to

\frac{d\ln[P(y)]}{dy} \approx 2N\sin y\cos y\,\ln\frac{p}{q} + 2N\sin y\cos y\,\ln\frac{\cos^2 y}{\sin^2 y}    (27)

Setting this equal to zero gives the position of the maximum: dividing through by 2N\sin y\cos y gives

\ln\frac{p}{q} = \ln\frac{\sin^2 y}{\cos^2 y}    (28)

Since \cos^2 y = 1 - \sin^2 y and q = 1 - p, this gives the position of the maximum as

p = \sin^2 y = \frac{n}{N}, \quad\text{so}\quad y = \arcsin\sqrt{p}    (29)

Armed with the position of the maximum, we can go on to find the higher differentials and thus the B_k:

\frac{d^2\ln[P(y)]}{dy^2} \approx 2N\ln\frac{p}{q}\cos^2 y - 2N\ln\frac{p}{q}\sin^2 y + 2N\ln\frac{\cos^2 y}{\sin^2 y}\cos^2 y - 2N\ln\frac{\cos^2 y}{\sin^2 y}\sin^2 y - 4N    (30)


so, since at the maximum \ln(p/q) = -\ln(\cos^2 y/\sin^2 y), the logarithmic terms cancel and

B_2 = -4N    (31)

Similarly

\frac{d^3\ln[P(y)]}{dy^3} \approx -8N\ln\frac{p}{q}\cos y\sin y - 8N\cos y\sin y\,\ln\frac{\cos^2 y}{\sin^2 y} + \frac{4N\sin y}{\cos y} - \frac{4N\cos y}{\sin y}    (32)

giving

B_3 = 4N\left[\sqrt{\frac{p}{q}} - \sqrt{\frac{q}{p}}\right]    (33)
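These coefficients can be verified by finite differences on the exact \ln P(y) of Eqn. 21 (with the factorials evaluated through log-gamma); a sketch added for this revision, with arbitrary values N = 500, p = 0.3:

```python
from math import asin, cos, lgamma, log, sin, sqrt

def log_P_y(y, N, p):
    """ln P(y) from Eqn. 21, using log-gamma for the factorials."""
    s = sin(y) ** 2
    return (lgamma(N + 1) - lgamma(N * s + 1) - lgamma(N - N * s + 1)
            + log(2 * N * sin(y) * cos(y))
            + N * s * log(p) + (N - N * s) * log(1 - p))

N, p = 500, 0.3
y0, h = asin(sqrt(p)), 1e-3
f = lambda y: log_P_y(y, N, p)

# Central-difference estimates of the second and third derivatives at y'.
d2 = (f(y0 + h) - 2 * f(y0) + f(y0 - h)) / h ** 2
d3 = (f(y0 + 2 * h) - 2 * f(y0 + h) + 2 * f(y0 - h) - f(y0 - 2 * h)) / (2 * h ** 3)

assert abs(d2 - (-4 * N)) / (4 * N) < 0.02           # B2 = -4N (Eqn. 31)
B3 = 4 * N * (sqrt(p / (1 - p)) - sqrt((1 - p) / p))
assert abs(d3 - B3) / abs(B3) < 0.05                 # B3 (Eqn. 33)
```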

and

\frac{d^4\ln[P(y)]}{dy^4} \approx 8N\ln\frac{p}{q}(\sin^2 y - \cos^2 y) + 8N\ln\frac{\cos^2 y}{\sin^2 y}(\sin^2 y - \cos^2 y) + 4N\left(\frac{1}{\sin^2 y} + \frac{1}{\cos^2 y}\right) + 16N    (34)

giving

B_4 = 4N\left[4 + \frac{1}{p} + \frac{1}{q}\right]    (35)

Each term of the Taylor expansion is again O(1/N) smaller than its predecessor (although this time the factors of N are hidden in the variable itself), implying that the approximation by the lower-order terms becomes exact for large samples. Again, truncating the series at the B_2 term, plugging the B_k back into the expansion, and exponentiating gives

P(y) = P(y')\, e^{-|B_2|z^2/2}    (36)

In order for this to be normalised,

P(y') = \sqrt{\frac{|B_2|}{2\pi}}    (37)

So, using |B_2| = 4N, z = y - y' and y' = \arcsin\sqrt{p},

P(y) = \sqrt{\frac{2N}{\pi}}\, e^{-2N(y - \arcsin\sqrt{p})^2}    (38)

and so the distribution is a Gaussian with mean

\mu = \arcsin\sqrt{p}

and standard deviation

\sigma = \sqrt{\frac{1}{4N}}

The variance is now independent of the mean, and so the transform to \arcsin\sqrt{x} space has been shown to be the variance stabilising transform for the binomial distribution. Further, the approximation of the probability density distribution to a Gaussian becomes exact for large samples. The fact that the variance depends on N is a consequence of the space being scaled by N.
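The variance stabilisation can also be seen empirically. The following Monte-Carlo sketch (an editorial addition; N = 250 and the chosen p values are arbitrary) checks that the sample variance of \arcsin\sqrt{n/N} stays close to 1/4N across different p:

```python
import random
from math import asin, sqrt

random.seed(42)

N, trials = 250, 4000
for p in (0.2, 0.5, 0.8):
    ys = []
    for _ in range(trials):
        n = sum(random.random() < p for _ in range(N))  # one binomial draw
        ys.append(asin(sqrt(n / N)))
    m = sum(ys) / trials
    var = sum((y - m) ** 2 for y in ys) / (trials - 1)
    # Eqn. 38 predicts variance 1/(4N) regardless of p.
    assert abs(var - 1 / (4 * N)) / (1 / (4 * N)) < 0.1
```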

3 Accuracy of the Gaussian Approximation following the Transform

In order to demonstrate that the Gaussian approximation to the binomial distribution is more accurate after the variance stabilising transform than before it, first notice that in both cases the approximation is exact at the position of the maximum of the distribution. This can be seen from the Taylor expansions (Eqn. 3 and Eqn. 22): every term except the first contains some power of the distance away from the mean, and so falls to zero there, so at the position of the mean the Gaussian approximations and the original binomial distributions are exactly equal. Secondly, notice that each term in the expansion is O(1/N) smaller than its predecessor in both cases, and so the cubic term in the expansions dominates the error. The Gaussian approximations in both cases required normalisation of the expansions, and so it is not sufficient to look at the cubic terms in isolation. However, the linear (B_1) terms were zero in both cases, and the quadratic (B_2) terms were the only terms retained. Therefore, we can examine the ratios of the cubic to the quadratic terms to give measures of the proportional errors in each case. Furthermore, since the errors are zero at the positions of the maxima of the distributions, it is sufficient to find which of these ratios has the higher first differential at the position of the maxima, i.e. which proportional error grows faster as we move away from that position.


Figure 1: Ratios of the first derivatives of the proportional errors on the B_3 term to the B_2 term before to after the variance stabilising transform. The expression is plotted against p for x = 0.5 (a) and against x for p = 0.5 (b).

Before the variance stabilising transform, the quadratic term of the expansion is

\frac{1}{Np(1-p)}\,\frac{(n-Np)^2}{2!}    (39)

and the cubic term is

-\frac{(1-2p)}{N^2p^2(1-p)^2}\,\frac{(n-Np)^3}{3!}    (40)

so the ratio is

R_B = -\frac{1-2p}{Np(1-p)}\,\frac{(n-Np)}{3}    (41)

For ease of comparison with the ratio after the variance stabilising transform, write this in terms of the proportion x = n/N:

R_B = -\frac{1-2p}{p(1-p)}\,\frac{(x-p)}{3}    (42)

Similarly, after the variance stabilising transform, the quadratic and cubic terms of the Taylor expansion are

-4N\,\frac{(\arcsin\sqrt{x} - \arcsin\sqrt{p})^2}{2!}    (43)

and

4N\left[\sqrt{\frac{p}{q}} - \sqrt{\frac{q}{p}}\right]\frac{(\arcsin\sqrt{x} - \arcsin\sqrt{p})^3}{3!}    (44)

and so the ratio is

R_A = -\left[\sqrt{\frac{p}{q}} - \sqrt{\frac{q}{p}}\right]\frac{(\arcsin\sqrt{x} - \arcsin\sqrt{p})}{3}    (45)

Using

\frac{d}{dx}\arcsin\sqrt{x} = \frac{1}{2\sqrt{x(1-x)}}    (46)

the first differentials of the ratios are

\frac{dR_B}{dx} = -\frac{(1-2p)}{3p(1-p)}    (47)

and

\frac{dR_A}{dx} = -\left[\sqrt{\frac{p}{q}} - \sqrt{\frac{q}{p}}\right]\frac{1}{6\sqrt{x(1-x)}} = \frac{1}{6\sqrt{x(1-x)}}\,\frac{(1-2p)}{\sqrt{p(1-p)}}    (48)

One additional comment should be made here: notice that neither of these functions goes to zero anywhere in the range 0 ≤ x ≤ 1. Therefore, the proportional error functions have no extrema, i.e. they cannot change sign. The ratio of the magnitudes of the differentials is

\left|\frac{dR_B}{dx}\right| \bigg/ \left|\frac{dR_A}{dx}\right| = 2\,\frac{\sqrt{x(1-x)}}{\sqrt{p(1-p)}}    (49)


At the position of the maximum x = p this reduces to

\left|\frac{dR_B}{dx}\right| \bigg/ \left|\frac{dR_A}{dx}\right| = 2    (50)

Therefore, the errors on the Gaussian approximation grow twice as fast around the mean before the variance stabilising transform as they do after it, and so the transform makes the Gaussian approximation approximately twice as accurate. Fig. 1 shows plots of Eqn. 49 against p for x = 0.5 and against x for p = 0.5. It should be noted that the choice of fixed x or p in these plots only affects the scaling, not the shape of the curve, and that the scaling is always fixed so that the expression is equal to 2 at x = p. Examining the plot against p, it can be seen that the curve is almost flat across most of the range, but rises steeply below around p = 0.2 and above p = 0.8. This is as expected: it corresponds to the ends of the range, where the binomial distribution is highly skewed and so is not well approximated by a Gaussian, and also agrees with the literature on the variance stabilising transform for the binomial distribution, which states that it is only necessary to use this transform when the data contains proportions in the extremes of the range. Examining the plot against x, it can be seen that the curve is greater than 1 over a large range around the mean. The curve drops below 1 at the ends of the range, indicating that the errors on the Gaussian approximation after the variance stabilising transform are beginning to increase faster than those before the transform. However, this curve shows the ratio of the first differentials. The area between the curve and the line y = 1 therefore indicates the absolute errors involved, and since that area is greater between p = x and the point where y = 1 than between y = 1 and p = 0, the errors after the transform can never catch up with the errors before it. The variance stabilising transform therefore makes the Gaussian approximation more accurate across the entire range.
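The factor of two at x = p can be confirmed numerically from the definitions of R_B and R_A; a short sketch added for this revision, with an arbitrary p = 0.3:

```python
from math import asin, sqrt

def R_B(x, p):
    """Ratio of cubic to quadratic terms before the transform (Eqn. 42)."""
    return -(1 - 2 * p) * (x - p) / (3 * p * (1 - p))

def R_A(x, p):
    """Ratio of cubic to quadratic terms after the transform (Eqn. 45)."""
    q = 1 - p
    return -(sqrt(p / q) - sqrt(q / p)) * (asin(sqrt(x)) - asin(sqrt(p))) / 3

p, h = 0.3, 1e-6
# Finite-difference first derivatives at x = p.
dRB = (R_B(p + h, p) - R_B(p - h, p)) / (2 * h)
dRA = (R_A(p + h, p) - R_A(p - h, p)) / (2 * h)
assert abs(abs(dRB) / abs(dRA) - 2.0) < 1e-4  # Eqn. 50
```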

4 Conclusion

This report has demonstrated that the binomial distribution may be approximated with a Gaussian distribution at large N; that the transformation to \arcsin\sqrt{x} space is the variance stabilising transform for the binomial distribution; and that the Gaussian approximation is approximately twice as accurate after the variance stabilising transform for small values of N. The same behaviour was seen with the variance stabilising transform of the Poisson distribution: the Gaussian approximation is more accurate after the application of the transform. This effect is often seen with normalising or variance stabilising transforms: a transform intended for one purpose (variance stabilisation) also aids another (normalisation). However, it is not generally true, and furthermore the transformations that approximately achieve both ends tend to be compromises; there may exist other transformations that achieve, for example, better normalisation at the expense of no variance stabilisation. A fuller discussion of these matters, with examples, is given in [1]. This is a reminder of the fact that such transformations result only in better approximations of the data to Gaussian distributions, so that we can then more reliably use established methods which make such an assumption. Of course it would be possible, though more awkward, to have an exact likelihood calculation for the data, based directly upon the binomial distribution, in the same way that the 'Fisher exact test' is available for Poisson distributed data. Transform techniques must therefore be applied with an understanding of the practical consequences of this approximation.

There are several practical implications of the above results. The first is in the construction of chi-square similarity measures for data and a ratio predicted from theory (h_i and t_i) or data and data (h_i and k_i), which rely on the approximation of a binomial by a Gaussian for the construction of a "least-squares" measure. These functions are better estimated (particularly for small values of h_i and k_i) as

\chi^2 = 4\sum_i N_i\left(\arcsin\sqrt{h_i/N_i} - \arcsin\sqrt{t_i}\right)^2 \quad\text{and}\quad \chi^2 = 2\sum_i N_i\left(\arcsin\sqrt{h_i/N_i} - \arcsin\sqrt{k_i/N_i}\right)^2

In addition, the binomial can be better approximated by a Gaussian of the form given in Eqn. 38 than by the standard Gaussian of the form of Eqn. 17.
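The first of these measures might be sketched as follows (an editorial illustration; the function name chi2_arcsin and the example values are our own):

```python
from math import asin, sqrt

def chi2_arcsin(h, N, t):
    """Transformed chi-square between observed counts h_i (out of N_i
    trials) and theoretical proportions t_i, as in the first formula above."""
    return 4 * sum(Ni * (asin(sqrt(hi / Ni)) - asin(sqrt(ti))) ** 2
                   for hi, Ni, ti in zip(h, N, t))

h, N = [30, 50, 20], [100, 100, 100]
# Exact agreement between observed and predicted proportions gives zero.
assert chi2_arcsin(h, N, [0.3, 0.5, 0.2]) == 0.0
# Any disagreement gives a strictly positive value.
assert chi2_arcsin(h, N, [0.25, 0.5, 0.25]) > 0.0
```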

The arcsin transform and square root transforms [2] are also necessary for the construction of (frequentist) probability similarity measures [3].

Finally, a Gaussian random variate x with variance¹ 1/4N and mean \arcsin\sqrt{n/N} can be used to approximate a binomial variate y for use in Monte-Carlo simulations by applying the transform y = \sin^2(x).

¹Thanks to Iain Buchan for pointing out a long-standing typographical error in previous versions of this document.
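A sketch of that simulation recipe (an editorial addition; the parameter values are arbitrary, and p plays the role of the proportion n/N):

```python
import random
from math import asin, sin, sqrt

random.seed(1)

def approx_binomial_proportion(p, N):
    """Draw x ~ Gaussian(arcsin(sqrt(p)), 1/4N) and return y = sin^2(x),
    an approximate binomial proportion n/N."""
    x = random.gauss(asin(sqrt(p)), 1 / (2 * sqrt(N)))
    return sin(x) ** 2

N, p, trials = 1000, 0.3, 20000
mean = sum(approx_binomial_proportion(p, N) for _ in range(trials)) / trials
assert abs(mean - p) < 0.005  # the approximate variates centre on p
```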


References

[1] A. Stuart and K. Ord, "Kendall's Advanced Theory of Statistics Vol. 2a: Classical Inference and the Linear Model", pp. 753-759, Arnold, 1994.

[2] N.A. Thacker and P.A. Bromiley, "The Effects of a Square Root Transform on a Poisson Distributed Quantity", Tina Memo 2001-010, 2001.

[3] N.A. Thacker and P.A. Bromiley, "The Equal Variance Domain: Issues Surrounding the use of Probability Densities for Algorithm Construction", Tina Memo 2004-005, 2004.
