
Tina Memo No. 2002-007

Internal Report

The Effects of an Arcsin Square Root Transform on a

Binomial Distributed Quantity.

P.A. Bromiley and N.A. Thacker

Last updated: 13/6/2002

Imaging Science and Biomedical Engineering Division,

Medical School, University of Manchester,

Stopford Building, Oxford Road,

Manchester, M13 9PT.

paul.bromiley@man.ac.uk

Abstract

This document provides proofs of the following:

• The binomial distribution can be approximated with a Gaussian distribution at large values of N.

• The arcsin square-root transform is the variance stabilising transform for the binomial distribution.

• The Gaussian approximation for the binomial distribution is more accurate at smaller values of

N after the variance stabilising transform has been applied.

The conclusion contains some comments concerning the relationship between the variance stabilising

transform and the improved accuracy of the Gaussian approximation, which holds for both the binomial

and Poisson distributions.

1 Gaussian Approximation to the Binomial Distribution

The binomial distribution

P(n|N) = \frac{N!}{n!(N-n)!}\, p^n q^{N-n}    (1)

gives the probability of obtaining n successes out of N Bernoulli trials, where p is the probability of success and

q = 1−p is the probability of failure. The shape of the distribution approaches that of a Gaussian distribution at

large N as a consequence of the central limit theorem. In order to demonstrate this, it is necessary to expand the

distribution as a Taylor series around the maximum.

First, let n' be the position of the maximum of P(n), giving n = n' + η, and rather than expanding the distribution itself, expand the natural logarithm of the distribution. Expanding as a Taylor series

f(a+h) = f(a) + hf'(a) + \frac{h^2}{2!}f''(a) + ... + \frac{h^{n-1}}{(n-1)!}f^{(n-1)}(a) + ...    (2)

gives

\ln[P(n'+\eta)] = \ln[P(n')] + \eta B_1 + \frac{\eta^2}{2!}B_2 + \frac{\eta^3}{3!}B_3 + ...    (3)

where

B_k = \left[ \frac{d^k \ln[P(n)]}{dn^k} \right]_{n=n'}    (4)

Since this is an expansion around the position of the maximum, B_1 = 0 and B_2 is negative, so put B_2 = -|B_2|.

The next step is to find the position of the maximum. Taking the log of the distribution and applying Stirling’s

approximation

\ln(n!) \approx n\ln n - n

gives

\frac{d\ln[P(n)]}{dn} \approx \frac{d}{dn}\left[\ln N! - n\ln n + n - (N-n)\ln(N-n) + (N-n) + n\ln p + (N-n)\ln q\right]    (5)

= -\ln n + \ln(N-n) + \ln p - \ln q    (6)

Page 3

At the maximum this is equal to zero, giving

0 = \ln\left[\frac{N-n}{n}\,\frac{p}{q}\right]    (7)

or

1 = \frac{N-n}{n}\,\frac{p}{q}    (8)

Substituting n = n'

1 = \frac{N-n'}{n'}\,\frac{p}{q}    (9)

which reduces to

n' = Np    (10)
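This result is easy to check numerically. The sketch below (not part of the memo; N and p are arbitrary illustrative values) locates the maximum of the binomial pmf by direct evaluation and compares it with Np:

```python
from math import comb

# Locate the maximum of the binomial pmf by direct evaluation and
# compare it with the predicted position n' = Np.
# N = 100 and p = 0.3 are arbitrary illustrative values.
N, p = 100, 0.3
q = 1 - p

pmf = [comb(N, n) * p**n * q**(N - n) for n in range(N + 1)]
mode = max(range(N + 1), key=lambda n: pmf[n])
print(mode, N * p)  # the mode sits at (or immediately next to) Np
```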

Now the B_k of the Taylor expansion can be found. We know that B_1 = 0. B_2 is given by

B_2 = \left[\frac{d^2\ln[P(n)]}{dn^2}\right]_{n=n'} \approx \left[-\frac{1}{n} - \frac{1}{N-n}\right]_{n=n'} = -\frac{1}{Np(1-p)}    (11)

Similarly

B_3 = \left[\frac{d^3\ln[P(n)]}{dn^3}\right]_{n=n'} \approx \left[\frac{1}{n^2} - \frac{1}{(N-n)^2}\right]_{n=n'} = \frac{1-2p}{N^2p^2(1-p)^2}    (12)

and

B_4 = \left[\frac{d^4\ln[P(n)]}{dn^4}\right]_{n=n'} \approx \left[-\frac{2}{n^3} - \frac{2}{(N-n)^3}\right]_{n=n'} = -\frac{2(3p^2-3p+1)}{N^3p^3(1-p)^3}    (13)

The terms are getting smaller by O(1/N) each time, so truncate the Taylor series at the B_2 term:

P(n) \approx P(n')\, e^{-|B_2|\eta^2/2}    (14)

In order to find P(n'), impose the requirement of normalisation. Further, assume that the distribution can be treated as continuous, i.e.

\lim_{N\to\infty} \sum_{n=0}^{N} P(n) \approx \int P(n)\,dn = \int_{-\infty}^{\infty} P(n'+\eta)\,d\eta = 1    (15)

so, using the standard Gaussian integral,

\int_{-\infty}^{\infty} P(n')\, e^{-|B_2|\eta^2/2}\,d\eta = P(n')\sqrt{\frac{2\pi}{|B_2|}} = 1    (16)

so

P(n) = \frac{1}{\sqrt{2\pi Npq}}\, e^{-(n-Np)^2/2Npq}    (17)

Comparing this to the Gaussian distribution

P(n) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(n-\mu)^2/2\sigma^2}    (18)

it can be seen that, at values of the parameters where the assumptions made in the derivation hold (i.e. at large

N), the binomial distribution can be approximated by a Gaussian distribution with mean

\mu = Np

and standard deviation

\sigma = \sqrt{Npq}
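This approximation can be checked directly. The sketch below (not part of the memo; N and p are arbitrary illustrative values) compares the exact binomial pmf with the Gaussian density derived above:

```python
from math import comb, exp, pi, sqrt

# Compare the exact binomial pmf with the approximating Gaussian
# N(mu = Np, sigma^2 = Npq) over a few-sigma window around the mean.
# N = 400 and p = 0.4 are arbitrary illustrative values.
N, p = 400, 0.4
q = 1 - p
mu, sigma = N * p, sqrt(N * p * q)

def binom_pmf(n):
    return comb(N, n) * p**n * q**(N - n)

def gauss(n):
    return exp(-(n - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Largest absolute discrepancy within +/- 4 sigma of the mean
lo, hi = int(mu - 4 * sigma), int(mu + 4 * sigma)
err = max(abs(binom_pmf(n) - gauss(n)) for n in range(lo, hi + 1))
print(err)  # small compared with the peak value of ~0.04
```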


2 A Variance Stabilising Transform for the Binomial Distribution

The most commonly used variance stabilising transform for the binomial distribution is

y = \arcsin\sqrt{\frac{n}{N}}    (19)

Applying the change-of-variables formula,

g(y) = f[r^{-1}(y)]\,\frac{dr^{-1}(y)}{dy} \quad \text{if } y = r(x)    (20)

gives the distribution in the new space

P(y) = \frac{N!}{(N\sin^2 y)!\,(N - N\sin^2 y)!}\; 2N\sin y\cos y\; p^{N\sin^2 y} q^{(N - N\sin^2 y)}    (21)

It is again possible to demonstrate that a Gaussian distribution provides a good approximation to this distribution

at large N. As before, the aim is to expand the distribution as a Taylor series around the maximum

\ln[P(y'+z)] = \ln[P(y')] + zB_1 + \frac{z^2}{2!}B_2 + \frac{z^3}{3!}B_3 + ...    (22)

B_k = \left[\frac{d^k\ln[P(y)]}{dy^k}\right]_{y=y'}    (23)

where y' is the position of the maximum and z is the distance in y from it. Again, since this is an expansion around the maximum, B_1 must be zero and B_2 must be negative.

Taking the log of the distribution in y space and applying Stirling’s approximation

\alpha! \approx \alpha^\alpha e^{-\alpha}\sqrt{2\pi\alpha}    (24)

gives, after simplification

\ln[P(y)] \approx N\ln N + \frac{1}{2}\ln N + N\sin^2 y\,\ln p - N\sin^2 y\,\ln(N\sin^2 y) + N\ln q
- N\sin^2 y\,\ln q - N\ln(N\cos^2 y) + N\sin^2 y\,\ln(N\cos^2 y) + \ln\sqrt{\frac{2}{\pi}}    (25)

Differentiating gives

\frac{d\ln[P(y)]}{dy} \approx 2N\sin y\cos y\,\ln p - 2N\sin y\cos y\,\ln(N\sin^2 y) - 2N\sin y\cos y
- 2N\sin y\cos y\,\ln q + \frac{2N\sin y}{\cos y} - \frac{2N\sin^3 y}{\cos y} + 2N\sin y\cos y\,\ln(N\cos^2 y)    (26)

which reduces to

\frac{d\ln[P(y)]}{dy} \approx 2N\sin y\cos y\,\ln\frac{p}{q} + 2N\sin y\cos y\,\ln\frac{\cos^2 y}{\sin^2 y}    (27)

Setting this equal to zero gives the position of the maximum: dividing through by 2N\sin y\cos y gives

\ln\left(\frac{p}{q}\right) = \ln\left(\frac{\sin^2 y}{\cos^2 y}\right)    (28)

Since \cos^2 y = 1 - \sin^2 y and q = 1 - p, this gives the position of the maximum as

\sin^2 y = p = \frac{n}{N} \quad \text{so} \quad y' = \arcsin\sqrt{p}    (29)
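The position of this maximum can be verified numerically. The sketch below (not part of the memo; N and p are arbitrary values, and lgamma extends the factorials of Eqn. 21 to non-integer arguments) scans the transformed density on a fine grid and compares the maximiser with arcsin√p:

```python
from math import lgamma, log, sin, cos, asin, sqrt, pi

# Scan the transformed density P(y) of Eqn. 21 on a fine grid and
# locate its maximum, using lgamma to extend the factorials to
# non-integer arguments.  N = 200 and p = 0.3 are arbitrary values.
N, p = 200, 0.3
q = 1 - p

def log_P(y):
    m = N * sin(y)**2                      # n expressed through y
    return (lgamma(N + 1) - lgamma(m + 1) - lgamma(N - m + 1)
            + log(2 * N * sin(y) * cos(y))
            + m * log(p) + (N - m) * log(q))

ys = [i * (pi / 2) / 10000 for i in range(1, 10000)]
best = max(ys, key=log_P)
print(best, asin(sqrt(p)))  # the two agree closely
```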

Armed with the position of the maximum, we can go on to find the higher differentials and thus the B_k:

\frac{d^2\ln[P(y)]}{dy^2} \approx 2N\ln\left(\frac{p}{q}\right)\cos^2 y - 2N\ln\left(\frac{p}{q}\right)\sin^2 y + 2N\ln\left(\frac{\cos^2 y}{\sin^2 y}\right)\cos^2 y - 2N\ln\left(\frac{\cos^2 y}{\sin^2 y}\right)\sin^2 y - 4N    (30)


so

B_2 = -4N    (31)

Similarly

\frac{d^3\ln[P(y)]}{dy^3} \approx -8N\ln\left(\frac{p}{q}\right)\cos y\sin y - 8N\cos y\sin y\,\ln\left(\frac{\cos^2 y}{\sin^2 y}\right) + \frac{4N\sin y}{\cos y} - \frac{4N\cos y}{\sin y}    (32)

giving

B_3 = 4N\left[\sqrt{\frac{p}{q}} - \sqrt{\frac{q}{p}}\right]    (33)

and

\frac{d^4\ln[P(y)]}{dy^4} \approx 8N\ln\left(\frac{p}{q}\right)(\sin^2 y - \cos^2 y) + 8N\ln\left(\frac{\cos^2 y}{\sin^2 y}\right)(\sin^2 y - \cos^2 y) + 4N\left(\frac{1}{\sin^2 y} + \frac{1}{\cos^2 y}\right) + 16N    (34)

giving

B_4 = 4N\left[4 + \frac{1}{p} + \frac{1}{q}\right]    (35)

Each term of the Taylor expansion is again O(1/N) smaller than its predecessor (although this time the factors of N are hidden in the variable itself), implying that the approximation obtained by truncating to the lower-order terms becomes exact for large samples. Again, truncating the series at the B_2 term, plugging the B_k back into the expansion, and exponentiating gives

P(y) = P(y')\, e^{-|B_2|z^2/2}    (36)

In order for this to be normalised,

P(y') = \sqrt{\frac{|B_2|}{2\pi}}    (37)

So, using |B_2| = 4N, z = y - y' and y' = \arcsin\sqrt{p},

P(y) = \sqrt{\frac{2N}{\pi}}\, e^{-2N(y - \arcsin\sqrt{p})^2}    (38)

and so the distribution is a Gaussian with mean

\mu = \arcsin\sqrt{p}

and standard deviation

\sigma = \sqrt{\frac{1}{4N}}

The variance is now independent of the mean, and so the transform to arcsin√x space has been shown to be the variance stabilising transform for the binomial distribution. Further, the approximation of the probability density distribution to a Gaussian becomes exact for large samples. The fact that the variance depends on N is a consequence of the space being scaled by N.
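The stabilised variance can also be verified by simulation. The sketch below (not part of the memo; N, the trial count, and the random seed are arbitrary choices) draws binomial samples for several values of p and checks that the sample variance of y = arcsin√(n/N) stays close to 1/(4N) throughout:

```python
import random
from math import asin, sqrt

# Simulate Binomial(N, p) for several values of p and check that the
# sample variance of y = arcsin(sqrt(n/N)) stays close to 1/(4N)
# across the whole range of p.  N, the trial count, and the seed are
# arbitrary illustrative choices.
random.seed(1)
N, trials = 200, 3000

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / len(xs)

ratios = []
for p in [0.1, 0.5, 0.9]:
    ns = [sum(random.random() < p for _ in range(N)) for _ in range(trials)]
    ys = [asin(sqrt(n / N)) for n in ns]
    ratios.append(4 * N * sample_var(ys))
    print(p, ratios[-1])  # should stay near 1 for every p
```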

3 Accuracy of the Gaussian Approximation following the Transform

In order to demonstrate that the Gaussian approximation to the binomial distribution is more accurate after the

variance stabilising transform than before it, first notice that in both cases the approximation is exact at the

position of the maximum of the distribution. This can be seen from the Taylor expansions (Eqn. 3 and Eqn. 22): every term except the first contains some power of the distance from the maximum, and so falls to zero there; hence at that position the Gaussian approximations and the original binomial distributions are exactly equal. Secondly, notice that each term in the expansion is O(1/N) smaller than its predecessor in

both cases, and so the cubic term in the expansions dominates the error. The Gaussian approximations in both

cases required normalisation of the expansions, and so it is not sufficient to look at the cubic terms in isolation.

However, the linear (B1) terms were zero in both cases, and the quadratic (B2) terms were the only terms retained.

Therefore, we can examine the ratios of the cubic to the quadratic terms to give measures of the proportional errors

in each case. Furthermore, since the errors are zero at the positions of the maxima of the distributions, it is sufficient to find which of these ratios has the higher first differential at the maximum, i.e. which proportional error grows faster as we move away from that position.
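The claim can also be illustrated numerically. The sketch below (not part of the memo; N and p are arbitrary choices) sums the absolute error of the Gaussian approximation to the binomial pmf before and after the transform, mapping the transformed Gaussian back onto the integers via the Jacobian dy/dn = 1/(2√(n(N−n))):

```python
from math import comb, asin, sqrt, exp, pi

# Sum the absolute error of the Gaussian approximation to the binomial
# pmf before and after the arcsin square-root transform, at a modest N.
# The transformed Gaussian N(arcsin(sqrt(p)), 1/(4N)) is mapped back to
# a mass on each n via the Jacobian dy/dn = 1 / (2*sqrt(n*(N - n))).
# N = 50 and p = 0.1 are arbitrary illustrative values; the endpoints
# n = 0 and n = N are excluded because the Jacobian diverges there.
N, p = 50, 0.1
q = 1 - p

def pmf(n):
    return comb(N, n) * p**n * q**(N - n)

mu, var = N * p, N * p * q
err_before = sum(
    abs(pmf(n) - exp(-(n - mu)**2 / (2 * var)) / sqrt(2 * pi * var))
    for n in range(1, N))

mu_y = asin(sqrt(p))
err_after = sum(
    abs(pmf(n) - sqrt(2 * N / pi)
        * exp(-2 * N * (asin(sqrt(n / N)) - mu_y)**2)
        / (2 * sqrt(n * (N - n))))
    for n in range(1, N))

print(err_before, err_after)  # the transformed approximation is closer
```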
