
On computing the distribution function for the Poisson binomial distribution

Authors:

Abstract

The Poisson binomial distribution is the distribution of the sum of independent and non-identically distributed random indicators. Each indicator follows a Bernoulli distribution and the individual probabilities of success vary. When all success probabilities are equal, the Poisson binomial distribution is a binomial distribution. The Poisson binomial distribution has many applications in different areas such as reliability, actuarial science, survey sampling, econometrics, etc. The computing of the cumulative distribution function (cdf) of the Poisson binomial distribution, however, is not straightforward. Approximation methods such as the Poisson approximation and normal approximations have been used in literature. Recursive formulae also have been used to compute the cdf in some areas. In this paper, we present a simple derivation for an exact formula with a closed-form expression for the cdf of the Poisson binomial distribution. The derivation uses the discrete Fourier transform of the characteristic function of the distribution. We develop an algorithm that efficiently implements the exact formula. Numerical studies were conducted to study the accuracy of the developed algorithm and approximation methods. We also studied the computational efficiency of different methods. The paper is concluded with a discussion on the use of different methods in practice and some suggestions for practitioners.
Yili Hong
Department of Statistics
Virginia Tech
Blacksburg, VA 24061, USA
Key Words: Characteristic function; k-out-of-n system; Longevity risk; Normal
approximation; Sum of independent random indicators; Warranty returns.
1 Introduction
1.1 Motivation
The Poisson binomial distribution describes the distribution of the sum of independent and
non-identically distributed random indicators. Each indicator is a Bernoulli random variable
and the individual probabilities of success vary. A special case of the Poisson binomial
distribution is the ordinary binomial distribution, when all success probabilities are equal. The
Poisson binomial distribution has many applications in different areas such as reliability,
actuarial science, survey sampling, econometrics, and so on. The following gives examples from
different areas.
In some reliability applications, it is often of interest to predict the total number of
failures for a fleet of products in the field. Hong, Meeker, and McCalley (2009) considered
the prediction for the total number of field failures for a fleet of high-voltage power
transformers. Due to staggered entry of units into service, individual units in the field
have different failure probabilities at a specified future time. Thus the total number of
field failures follows a Poisson binomial distribution.
In actuarial science, the total payout of an insurance company is often related to the
Poisson binomial distribution. For example, Pitacco (2007) considered a one-year insurance
coverage providing only a death benefit for $n$ insureds. Let $C$ denote the payout
due at each death. The individual payout is either 0 or $C$ with probability $1 - p_j$ and $p_j$,
respectively, where the death probability $p_j$ varies from individual to individual. Assuming
that the individual lifetimes are independent, the total payout for those $n$ insureds
is $C$ times the total number of deaths, which follows the Poisson binomial distribution.
In econometrics, it is sometimes of interest to predict the number of corporation defaults
(e.g., Duffie, Saita, and Wang 2007). The default probabilities differ from corporation
to corporation because each corporation has its own unique situation on assets, debts,
stock returns and so on. The number of corporation defaults at a future time also follows
a Poisson binomial distribution.
In engineering, Fernández and Williams (2010) provided several interesting examples
such as multi-sensor fusion and reliability of $k$-out-of-$n$ systems, which are related to
the Poisson binomial distribution.
In survey sampling, Chen and Liu (1997) presented an example where the inclusion
probabilities of sampling units are different. The total number of units in the sample
follows a Poisson binomial distribution.
The Poisson binomial distribution also has wide applications in areas such as data
mining of uncertain databases (Tang and Peterson 2011), bioinformatics (Niida et al.
2012), and wind energy (Bossavy et al. 2012).
While the Poisson binomial distribution has many applications in different disciplines,
computing the cumulative distribution function (cdf) of the distribution is not straightforward.
Because the individual probabilities of success vary, the naive way of computing the
cdf by enumeration is not practical, even when the number of indicators is small (e.g.,
around 30). Approximation methods such as the Poisson approximation and normal approximations
have been used in the literature. There are situations, however, in which approximation
methods do not perform well. Thus it is desirable to have a method to compute the exact
values of the cdf. It is also useful to know in which situation approximation methods work
well. In applications such as predictions for the number of failures and corporation defaults,
the number of indicators is usually large. Thus efficiency of algorithms for computing the
exact values of the cdf is also important. This motivates us to provide efficient methods to
compute the exact values of the cdf of the Poisson binomial distribution.
1.2 Related Literature and This Work
The study of the Poisson binomial distribution has a long history. Le Cam (1960) provided an
upper bound for the error of the Poisson approximation. Normal approximations are widely
used in practice. Volkova (1996) gave a normal approximation with second order correction
and provided an upper bound for the error of the approximation. Hong, Meeker, and McCalley
(2009) and Hong and Meeker (2010) applied the approximation in Volkova (1996) to warranty
prediction applications. Recursive formulae are available in the literature to compute the exact
values of the cdf of the Poisson binomial distribution. For example, Barlow and Heidtmann
(1984) described a recursive formula for computing the cdf. Chen, Dempster, and Liu (1994)
provided another recursive formula. Details for these recursive formulae are described in
Section 2.5. Fernández and Williams (2010) gave a closed-form expression for the cdf using
the technique of polynomial interpolation and the discrete Fourier transform.
The contribution of this paper is summarized as follows.
• We propose a simple derivation for an exact formula for the cdf of the Poisson binomial
  distribution, which gives the same form as that in Fernández and Williams (2010).
• We develop an algorithm that efficiently implements the exact formula, which outperforms
  existing methods.
• Numerical studies were conducted to compare the accuracy of the algorithm and approximation
  methods. We also compared the computational efficiency of different methods.
• Based on the numerical studies, we provide a discussion on the advantages and disadvantages
  of different methods and some guidelines for practitioners.
The statistical software R (2012) is widely used. There was no package, however, for
computing the Poisson binomial distribution function. We developed an R package that
efficiently implements both exact methods and approximation methods. The package can be
downloaded from the R website, see Section 5 for more details.
1.3 Overview
The rest of the paper is organized as follows. Section 2 describes several exact methods for
computing the cdf and algorithms for their efficient implementations. Section 3 describes
several approximation methods based on the Poisson and normal approximations. Section 4
conducts a comprehensive numerical study to assess the performance of various methods
in terms of accuracy and efficiency. Section 5 discusses software implementation for both
the exact and approximation methods. Section 6 provides some concluding remarks and
suggestions for practitioners.
2 Exact Methods
2.1 Notation
Let $I_j$, $j = 1, \ldots, n$, be a series of $n$ independent and non-identically distributed random
indicators. In particular,
\[
I_j \sim \mathrm{Bernoulli}(p_j), \quad j = 1, \ldots, n, \tag{1}
\]
where $p_j = \Pr(I_j = 1)$ is the success probability of indicator $I_j$ and not all $p_j$'s are equal. The
Poisson binomial random variable $N$ is defined as the sum of independent and non-identically
distributed random indicators (i.e., $N = \sum_{j=1}^{n} I_j$). Note that $N$ takes values in $\{0, 1, \ldots, n\}$.
Let $\xi_k = \Pr(N = k)$, $k = 0, 1, \ldots, n$, be the probability mass function (pmf) for the Poisson
binomial random variable $N$. When all $p_j$'s are identical, the distribution of $N$ is a binomial
distribution. The cdf of $N$, denoted by $F_N(k) = \Pr(N \le k)$, $k = 0, 1, \ldots, n$, gives the
probability of having at most $k$ successes out of a total of $n$. The cdf $F_N(k)$ can be expressed
by (Wang 1993)
\[
F_N(k) = \sum_{m=0}^{k} \xi_m
       = \sum_{m=0}^{k} \left\{ \sum_{A \in \mathcal{F}_m} \prod_{j \in A} p_j \prod_{j \in A^c} (1 - p_j) \right\}, \tag{2}
\]
where $\mathcal{F}_m$ is the set of all subsets of $m$ integers that can be selected from $\{1, 2, 3, \ldots, n\}$ and
$A^c$ is the complement of set $A$ (i.e., $A^c = \{1, 2, 3, \ldots, n\} \setminus A$). Samuels (1965) presented a similar
formula to that in (2). In order to compute $F_N(k)$ in (2), one needs to enumerate all elements in
$\mathcal{F}_m$, which is not practical even when $n$ is small (e.g., $n = 30$). For example, when $n = 30$, $\mathcal{F}_{15}$
contains $30!/[15! \times (30 - 15)!] = 155{,}117{,}520$ elements. Thus efficient methods for computing
$F_N(k)$ are desirable.
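
To make the cost of (2) concrete, the following minimal Python sketch evaluates the pmf by brute-force enumeration of the subsets in $\mathcal{F}_m$. The function name `pmf_by_enumeration` and the example probabilities are illustrative assumptions, not part of the paper; the approach is only usable for very small $n$.

```python
from itertools import combinations
from math import comb, prod

def pmf_by_enumeration(p):
    """Evaluate xi_m, m = 0..n, by summing over every subset A of size m, as in (2)."""
    n = len(p)
    idx = range(n)
    xi = []
    for m in range(n + 1):
        total = 0.0
        for A in combinations(idx, m):
            in_A = set(A)
            # product of p_j over j in A times product of (1 - p_j) over j not in A
            total += prod(p[j] if j in in_A else 1.0 - p[j] for j in idx)
        xi.append(total)
    return xi

p = [0.1, 0.5, 0.9, 0.3]          # hypothetical success probabilities
xi = pmf_by_enumeration(p)
cdf = [sum(xi[: k + 1]) for k in range(len(p) + 1)]
print(cdf)
print(comb(30, 15))               # size of F_15 when n = 30
```

The number of terms grows combinatorially with $n$, which is why the remainder of this section looks for alternatives.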
2.2 Discrete Fourier Transform
In this section, we briefly introduce the discrete Fourier transform (DFT). For a sequence of
$n + 1$ complex numbers $\{y_0, y_1, \ldots, y_n\}$, the DFT transforms $\{y_0, y_1, \ldots, y_n\}$ into a sequence
of $n + 1$ complex numbers $\{z_0, z_1, \ldots, z_n\}$, where $z_k = \sum_{l=0}^{n} y_l \exp(-i\omega k l)$, $k = 0, 1, \ldots, n$, and
$\omega = 2\pi/(n + 1)$. The inverse discrete Fourier transform (IDFT), which recovers $\{y_0, y_1, \ldots, y_n\}$
from $\{z_0, z_1, \ldots, z_n\}$, is given by
\[
y_l = \frac{1}{n+1} \sum_{k=0}^{n} z_k \exp(i\omega l k), \quad l = 0, 1, \ldots, n. \tag{3}
\]
Applying the DFT to both sides of equation (3), one can also recover $\{z_0, z_1, \ldots, z_n\}$ from
$\{y_0, y_1, \ldots, y_n\}$. More details on the DFT can be found in Bracewell (2000, Chapter 11).
There are fast Fourier transform (FFT) algorithms to compute the DFT efficiently. The
most commonly-used algorithm is the Cooley-Tukey algorithm (Cooley and Tukey 1965).
There are also subroutines available in C or FORTRAN that implement FFT algorithms. See
Bracewell (2000, Chapter 11) for details on FFT algorithms.
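
As a small illustration of the DFT and IDFT pair defined above, the following Python sketch implements both sums directly and checks that the round trip recovers the input. It is a plain $O(n^2)$ evaluation, not an FFT, and the function names `dft` and `idft` and the test sequence are assumptions made for this example.

```python
import cmath

def dft(y):
    """z_k = sum_l y_l * exp(-i * omega * k * l), with omega = 2*pi/(n+1)."""
    m = len(y)                      # m = n + 1
    omega = 2.0 * cmath.pi / m
    return [sum(y[l] * cmath.exp(-1j * omega * k * l) for l in range(m))
            for k in range(m)]

def idft(z):
    """y_l = (1/(n+1)) * sum_k z_k * exp(i * omega * l * k), as in (3)."""
    m = len(z)
    omega = 2.0 * cmath.pi / m
    return [sum(z[k] * cmath.exp(1j * omega * l * k) for k in range(m)) / m
            for l in range(m)]

y = [1.0, 2.0, 0.5, -1.0, 3.0]     # arbitrary test sequence
y_back = idft(dft(y))
max_err = max(abs(a - b) for a, b in zip(y, y_back))
print(max_err)                      # round-trip error, at the level of round-off
```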
2.3 The DFT of the Characteristic Function of the Poisson Bino-
mial Distribution
In this section, we provide a derivation for a closed-form expression for the cdf of the Poisson
binomial distribution. Our approach is based on the characteristic function (CF) for the
Poisson binomial distribution (see, for example, Athreya and Lahiri 2006, Chapter 10 for
details on CF). Fernández and Williams (2010) provided the same closed-form expression for
the cdf, which was derived by using the polynomial interpolation technique and the DFT. Our
approach, however, is much simpler.
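The CF of the Poisson binomial random variable $N = \sum_{j=1}^{n} I_j$ is
\[
\varphi(t) = E[\exp(itN)] = \sum_{k=0}^{n} \xi_k \exp(itk)
           = E\left[\exp\left(it \sum_{j=1}^{n} I_j\right)\right]
           = \prod_{j=1}^{n} E[\exp(itI_j)]
           = \prod_{j=1}^{n} [1 - p_j + p_j \exp(it)], \tag{4}
\]
where $i = \sqrt{-1}$. Substituting $t = \omega l$, $l = 0, 1, \ldots, n$, into (4), where $\omega = 2\pi/(n + 1)$, one
obtains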
\[
\frac{1}{n+1} \sum_{k=0}^{n} \xi_k \exp(i\omega l k)
  = \frac{1}{n+1} \prod_{j=1}^{n} [1 - p_j + p_j \exp(i\omega l)]
  = \frac{1}{n+1} x_l, \quad l = 0, 1, \ldots, n, \tag{5}
\]
where $x_l = \prod_{j=1}^{n} [1 - p_j + p_j \exp(i\omega l)]$. Note that the left hand side of equation (5) is the IDFT
of the sequence $\{\xi_0, \xi_1, \ldots, \xi_n\}$. Applying the DFT to both sides of equation (5), one recovers
$\{\xi_0, \xi_1, \ldots, \xi_n\}$. In particular,
\[
\xi_k = \frac{1}{n+1} \sum_{l=0}^{n} \exp(-i\omega l k) \prod_{j=1}^{n} [1 - p_j + p_j \exp(i\omega l)]
      = \frac{1}{n+1} \sum_{l=0}^{n} \exp(-i\omega l k)\, x_l. \tag{6}
\]
The expression in equation (6) gives the same closed-form expression as in Fernández and
Williams (2010). From (6), the cdf of $N$ can be expressed as
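\[
F_N(k) = \sum_{m=0}^{k} \xi_m
       = \frac{1}{n+1} \sum_{l=0}^{n} \sum_{m=0}^{k} \exp(-i\omega l m)\, x_l
       = \frac{1}{n+1} \sum_{l=0}^{n} \frac{\{1 - \exp[-i\omega l (k+1)]\}\, x_l}{1 - \exp(-i\omega l)}. \tag{7}
\]
The last equality in (7) follows from the fact that $\exp(-i\omega l m)$, $m = 0, 1, \ldots, k$, is a geometric
sequence; for $l = 0$ the ratio is understood as its limiting value $k + 1$. We refer to the closed-form
expression in (7) for $F_N(k)$ as the DFT-CF method.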
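
The closed-form expressions (6) and (7) can be checked numerically with a direct evaluation. The Python sketch below is an illustration under stated assumptions: the function name `dftcf_pmf_cdf` is hypothetical, the sums are evaluated naively rather than with an FFT, and the $l = 0$ term of (7) is replaced by its limiting value $k + 1$.

```python
import cmath

def dftcf_pmf_cdf(p):
    """Evaluate the pmf via (6) and the cdf via (7) by direct summation."""
    n = len(p)
    m = n + 1
    omega = 2.0 * cmath.pi / m
    # x_l = prod_j [1 - p_j + p_j * exp(i * omega * l)]
    x = []
    for l in range(m):
        e = cmath.exp(1j * omega * l)
        v = 1.0 + 0j
        for pj in p:
            v *= (1.0 - pj) + pj * e
        x.append(v)
    pmf = [(sum(cmath.exp(-1j * omega * l * k) * x[l] for l in range(m)) / m).real
           for k in range(m)]
    cdf = []
    for k in range(m):
        total = (k + 1) * x[0]      # l = 0: geometric sum of k + 1 ones
        for l in range(1, m):
            r = cmath.exp(-1j * omega * l)
            total += (1.0 - r ** (k + 1)) / (1.0 - r) * x[l]
        cdf.append((total / m).real)
    return pmf, cdf

pmf, cdf = dftcf_pmf_cdf([0.5, 0.5])   # reduces to Binomial(2, 0.5)
print(pmf, cdf)
```

Taking the real part discards the imaginary round-off that remains after the complex sums.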
2.4 Efficient Implementation of the DFT-CF Method
In this section, we develop an efficient algorithm for computing the cdf $F_N(k)$ in (7). To
compute $\xi_k$, $k = 0, 1, \ldots, n$, one first needs to compute $x_l$. Let $x_l = a_l + i b_l$, $l = 0, 1, \ldots, n$, where $a_l$
and $b_l$ are the real and imaginary parts of $x_l$, respectively. From (5), $x_l = \sum_{k=0}^{n} \xi_k \exp(i\omega l k)$,
$l = 0, 1, \ldots, n$. Note that $x_0 = \sum_{k=0}^{n} \xi_k = 1$. Because all $\xi_k$'s are real numbers and
$\exp[i\omega(n + 1)k] = 1$, the conjugate of $x_l$ is
\[
\bar{x}_l = a_l - i b_l = \sum_{k=0}^{n} \xi_k \exp(-i\omega l k)
          = \sum_{k=0}^{n} \xi_k \exp[i\omega(n + 1 - l)k]
          = x_{n+1-l} = a_{n+1-l} + i b_{n+1-l}, \quad l = 1, \ldots, n.
\]
Thus $a_l = a_{n+1-l}$ and $b_l = -b_{n+1-l}$ for $l = 1, \ldots, n$. Let $z_j(l) = 1 - p_j + p_j \cos(\omega l) + i p_j \sin(\omega l)$,
$|z_j(l)|$ be the modulus of $z_j(l)$, and $\mathrm{Arg}[z_j(l)]$ be the principal value of the argument of $z_j(l)$.
Note that
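\[
\begin{aligned}
x_l &= \exp\left\{ \sum_{j=1}^{n} \log[z_j(l)] \right\}
     = \exp\left\{ \sum_{j=1}^{n} \log\left( |z_j(l)| \exp\{i\,\mathrm{Arg}[z_j(l)]\} \right) \right\} \\
    &= \exp\left\{ \sum_{j=1}^{n} \log|z_j(l)| \right\} \exp\left( i \sum_{j=1}^{n} \mathrm{Arg}[z_j(l)] \right) \\
    &= \exp\left\{ \sum_{j=1}^{n} \log|z_j(l)| \right\}
       \left( \cos\left\{ \sum_{j=1}^{n} \mathrm{Arg}[z_j(l)] \right\} + i \sin\left\{ \sum_{j=1}^{n} \mathrm{Arg}[z_j(l)] \right\} \right).
\end{aligned}
\]
Here $|z_j(l)| = \{[1 - p_j + p_j \cos(\omega l)]^2 + [p_j \sin(\omega l)]^2\}^{1/2}$ and
$\mathrm{Arg}[z_j(l)] = \mathrm{atan2}[p_j \sin(\omega l),\, 1 - p_j + p_j \cos(\omega l)]$. The function $\mathrm{atan2}(y, x)$ is defined as
\[
\mathrm{atan2}(y, x) =
\begin{cases}
\arctan(y/x), & x > 0, \\
\pi + \arctan(y/x), & y \ge 0,\ x < 0, \\
-\pi + \arctan(y/x), & y < 0,\ x < 0, \\
\pi/2, & y > 0,\ x = 0, \\
-\pi/2, & y < 0,\ x = 0, \\
0, & y = 0,\ x = 0.
\end{cases}
\]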
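Thus explicit expressions for $a_l$ and $b_l$ are
\[
a_l = d_l \cos\left\{ \sum_{j=1}^{n} \mathrm{Arg}[z_j(l)] \right\}
\quad \text{and} \quad
b_l = d_l \sin\left\{ \sum_{j=1}^{n} \mathrm{Arg}[z_j(l)] \right\}, \tag{8}
\]
where $d_l = \exp\left\{ \sum_{j=1}^{n} \log|z_j(l)| \right\}$, $l = 1, \ldots, n$. The following algorithm is used to compute
the cdf $F_N(k)$ for $k = 0, 1, \ldots, n$.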
Algorithm A:
1. Let $x_0 = 1$. For $l = 1, \ldots, \lceil n/2 \rceil$, compute the real and imaginary parts of $x_l$ by using the
formulae in (8). Here $\lceil \cdot \rceil$ is the ceiling function.
2. For $l = \lceil n/2 \rceil + 1, \ldots, n$, compute the real and imaginary parts of $x_l$ by using the formulae
$a_l = a_{n+1-l}$ and $b_l = -b_{n+1-l}$.
3. Apply the FFT algorithm to the set $\{x_0/(n + 1), x_1/(n + 1), \ldots, x_n/(n + 1)\}$ to obtain
$\{\xi_0, \xi_1, \ldots, \xi_n\}$.
4. Compute the cdf by using $F_N(k) = \sum_{m=0}^{k} \xi_m$, $k = 0, 1, \ldots, n$.
The above algorithm returns the values of the entire cdf by doing FFT once. Because there are
C or FORTRAN subroutines available to do the FFT, the implementation of Algorithm A
is not difficult. The FFT algorithm that is used for the implementation in this paper is due
to Singleton (1969), which is an FFT algorithm based on the Cooley-Tukey algorithm. The
original subroutine was written in FORTRAN and it was translated to C, which is available
in the R library.
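
The four steps of Algorithm A can be sketched in Python as follows. This is an illustration only, not the paper's C/R implementation: the function name `algorithm_a` is hypothetical, step 3 uses a direct $O(n^2)$ DFT in place of an FFT routine, and a guard for $|z_j(l)| = 0$ (where the logarithm is undefined) is an added assumption.

```python
import math

def algorithm_a(p):
    """Return (pmf, cdf) of the Poisson binomial distribution for probabilities p."""
    n = len(p)
    m = n + 1
    omega = 2.0 * math.pi / m
    a = [0.0] * m                   # real parts of x_l
    b = [0.0] * m                   # imaginary parts of x_l
    a[0] = 1.0                      # step 1: x_0 = 1
    half = math.ceil(n / 2)
    for l in range(1, half + 1):    # step 1: formulae in (8)
        log_mod, arg_sum = 0.0, 0.0
        for pj in p:
            re = 1.0 - pj + pj * math.cos(omega * l)
            im = pj * math.sin(omega * l)
            mod = math.hypot(re, im)
            if mod == 0.0:          # z_j(l) = 0 forces x_l = 0
                log_mod = -math.inf
                break
            log_mod += math.log(mod)
            arg_sum += math.atan2(im, re)
        d = math.exp(log_mod)       # exp(-inf) evaluates to 0.0
        a[l] = d * math.cos(arg_sum)
        b[l] = d * math.sin(arg_sum)
    for l in range(half + 1, m):    # step 2: conjugate symmetry
        a[l] = a[m - l]
        b[l] = -b[m - l]
    pmf = []
    for k in range(m):              # step 3: direct DFT standing in for the FFT
        s = 0.0
        for l in range(m):
            # real part of (a_l + i b_l) * exp(-i * omega * l * k)
            s += a[l] * math.cos(omega * l * k) + b[l] * math.sin(omega * l * k)
        pmf.append(s / m)
    cdf, run = [], 0.0
    for v in pmf:                   # step 4: cumulative sums
        run += v
        cdf.append(run)
    return pmf, cdf

pmf, cdf = algorithm_a([0.5] * 4)   # reduces to Binomial(4, 0.5)
print(pmf, cdf)
```

Working with sums of logarithms and arguments, rather than multiplying the $z_j(l)$ directly, follows the decomposition in (8); replacing the inner DFT loop with an FFT call would restore the efficiency the algorithm is designed for.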
2.5 Recursive Formulae
Recursive formulae (RF) are available in the literature to compute $F_N(k)$. Barlow and Heidtmann
(1984) described the following recursive formula. A better description of the algorithm is
available in Kuo and Zuo (2003, Chapter 7). Let $N_j = \sum_{m=1}^{j} I_m$ and $\xi_{k,j} = \Pr(N_j = k)$, where
the random indicator $I_m$ is defined in (1). Note that $N = N_n$ and $\xi_k = \xi_{k,n}$. The recursive
formula is given by
\[
\xi_{k,j} = (1 - p_j)\,\xi_{k,j-1} + p_j\,\xi_{k-1,j-1}, \quad 0 \le k \le n,\ 0 \le j \le n. \tag{9}
\]
The boundary conditions for (9) are $\xi_{-1,j} = \xi_{j+1,j} = 0$, $j = 0, 1, \ldots, n - 1$, and $\xi_{0,0} = 1$. We
refer to (9) as the RF1 method. The RF1 method can be computer memory demanding when
$n$ is large.
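
A minimal Python sketch of the recursion in (9) is given below. The function name `rf1_pmf` is an assumption for illustration. Unlike a table-based implementation that stores every $\xi_{k,j}$, this variant keeps only the column for the current $j$, which is a design choice of the sketch rather than a description of the implementation studied in this paper.

```python
def rf1_pmf(p):
    """pmf of N via xi_{k,j} = (1 - p_j) xi_{k,j-1} + p_j xi_{k-1,j-1}."""
    xi = [1.0]                      # xi_{0,0} = 1
    for pj in p:                    # add indicator I_j
        nxt = [0.0] * (len(xi) + 1)
        for k, v in enumerate(xi):
            nxt[k] += (1.0 - pj) * v    # I_j = 0: count stays at k
            nxt[k + 1] += pj * v        # I_j = 1: count moves to k + 1
        xi = nxt
    return xi

pmf = rf1_pmf([0.5, 0.5])
print(pmf)                          # Binomial(2, 0.5) pmf
```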
Chen, Dempster, and Liu (1994) introduced another recursive formula for computing $\xi_k$.
The algorithm requires all $p_j < 1$. In particular, the formula is given by
\[
\xi_0 = \prod_{j=1}^{n} (1 - p_j), \quad \text{and} \quad
\xi_k = \frac{1}{k} \sum_{l=1}^{k} (-1)^{l-1}\, t_l\, \xi_{k-l}, \quad k = 1, \ldots, n, \tag{10}
\]
where $t_l = \sum_{j=1}^{n} [p_j/(1 - p_j)]^l$. We refer to (10) as the RF2 method. This formula is sometimes
not numerically stable. This is caused by round-off error in $\xi_0$ and the explosion of the term
$[p_j/(1 - p_j)]^l$ in $t_l$, especially when $p_j$ is close to 1 and $n$ is large.
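
For small, well-behaved inputs, formula (10) can be sketched in Python as follows. The function name `rf2_pmf` is hypothetical, and the example deliberately uses moderate probabilities; the sketch makes no attempt to work around the numerical instability described above.

```python
from math import prod

def rf2_pmf(p):
    """pmf of N via formula (10); requires every p_j < 1."""
    n = len(p)
    w = [pj / (1.0 - pj) for pj in p]                 # odds p_j / (1 - p_j)
    t = [0.0] + [sum(wj ** l for wj in w) for l in range(1, n + 1)]  # t_1..t_n
    xi = [prod(1.0 - pj for pj in p)]                 # xi_0
    for k in range(1, n + 1):
        s = sum((-1) ** (l - 1) * t[l] * xi[k - l] for l in range(1, k + 1))
        xi.append(s / k)
    return xi

pmf = rf2_pmf([0.5, 0.5])
print(pmf)
```

The powers `wj ** l` are where the growth of $[p_j/(1 - p_j)]^l$ shows up in practice when some $p_j$ is close to 1.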
3 Approximation Methods
In this section, we describe several commonly-used approximation methods for computing
the cdf $F_N(k)$. Approximation methods are still widely used because of their computational
efficiency, especially when $n$ is large and the cdf $F_N(k)$ needs to be evaluated many times. For
example, in the prediction application in Hong, Meeker, and McCalley (2009), the cdf needs
to be evaluated $B = 10{,}000$ times in the calibration of prediction intervals for the number of
field failures. We will need moments or functions of moments of $N$ in the description of
approximation methods. The expectation, standard deviation, and skewness of the distribution
of $N$ are
\[
\mu = E(N) = \sum_{j=1}^{n} p_j, \qquad
\sigma = [\mathrm{Var}(N)]^{1/2} = \left[ \sum_{j=1}^{n} p_j (1 - p_j) \right]^{1/2}, \tag{11}
\]
\[
\gamma = [\mathrm{Var}(N)]^{-3/2}\, E[(N - \mu)^3] = \sigma^{-3} \sum_{j=1}^{n} p_j (1 - p_j)(1 - 2p_j),
\]
respectively.
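
The three quantities in (11) are simple sums over the $p_j$'s. A minimal Python sketch, with the hypothetical function name `moments`, is:

```python
import math

def moments(p):
    """Return (mu, sigma, gamma) of the Poisson binomial distribution, as in (11)."""
    mu = sum(p)
    var = sum(pj * (1.0 - pj) for pj in p)
    sigma = math.sqrt(var)
    gamma = sum(pj * (1.0 - pj) * (1.0 - 2.0 * pj) for pj in p) / sigma ** 3
    return mu, sigma, gamma

mu, sigma, gamma = moments([0.5] * 4)   # symmetric case: skewness is zero
print(mu, sigma, gamma)
```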
3.1 Poisson Approximation
In the literature, the Poisson distribution has been used to approximate the distribution of $N$,
which is referred to as the Poisson approximation (PA) method. In particular, the pmf of the
Poisson binomial distribution $\xi_k$ is approximated by
\[
\xi_k \approx \frac{\mu^k \exp(-\mu)}{k!}, \quad k = 0, 1, \ldots, n, \tag{12}
\]
where $\mu$ is defined in (11). By Le Cam's theorem (Le Cam 1960), the approximation error for
the PA method is $\sum_{k=0}^{n} \left| \xi_k - \mu^k \exp(-\mu)/(k!) \right| < 2 \sum_{j=1}^{n} p_j^2$. Thus the PA method only works
well when the expected number of successes $\mu$ is small. When $\mu$ is large, the performance of the
PA method is generally poor. See Section 4.2 for a numerical illustration of the performance
of the PA method.
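
The Le Cam bound can be checked numerically on a small example. In the Python sketch below, the function names, the reference pmf computed by sequential convolution, and the example probabilities are all assumptions made for illustration.

```python
import math

def exact_pmf(p):
    """Reference pmf by sequential convolution of the Bernoulli indicators."""
    xi = [1.0]
    for pj in p:
        nxt = [0.0] * (len(xi) + 1)
        for k, v in enumerate(xi):
            nxt[k] += (1.0 - pj) * v
            nxt[k + 1] += pj * v
        xi = nxt
    return xi

def poisson_approx_pmf(p):
    """Poisson approximation (12) for k = 0..n, built term by term to avoid large factorials."""
    mu = sum(p)
    term = math.exp(-mu)
    out = [term]
    for k in range(1, len(p) + 1):
        term *= mu / k              # mu^k exp(-mu) / k! from the previous term
        out.append(term)
    return out

p = [0.01, 0.02, 0.05, 0.03]        # hypothetical small success probabilities
err = sum(abs(a - b) for a, b in zip(exact_pmf(p), poisson_approx_pmf(p)))
bound = 2.0 * sum(pj * pj for pj in p)
print(err, bound)
```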
3.2 Normal Approximation
The normal approximation (NA) method is based on the central limit theorem. In particular,
the NA method with continuity correction approximates the cdf of a Poisson binomial
distribution by
\[
F_N(k) \approx \Phi\left( \frac{k + 0.5 - \mu}{\sigma} \right), \quad k = 0, 1, \ldots, n, \tag{13}
\]
where $\Phi(x)$ is the cdf of the standard normal distribution, and $\mu$ and $\sigma$ are defined in (11).
When $n$ is small, the performance of the normal approximation can be poor.
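
Approximation (13) needs only $\mu$, $\sigma$, and the standard normal cdf. A minimal Python sketch, in which the function names are assumptions and $\Phi$ is evaluated through the error function from the standard library, is:

```python
import math

def std_normal_cdf(x):
    """Phi(x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def na_cdf(p):
    """Normal approximation (13) to F_N(k), k = 0..n, with the 0.5 correction."""
    mu = sum(p)
    sigma = math.sqrt(sum(pj * (1.0 - pj) for pj in p))
    return [std_normal_cdf((k + 0.5 - mu) / sigma) for k in range(len(p) + 1)]

approx = na_cdf([0.5] * 4)
print(approx)
```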
3.3 Refined Normal Approximation
Volkova (1996) described a refined normal approximation (RNA) which makes a correction to
the skewness of the distribution of $N$. For the RNA method, the cdf $F_N(k)$ is approximated
by
\[
F_N(k) \approx G\left( \frac{k + 0.5 - \mu}{\sigma} \right), \quad k = 0, 1, \ldots, n, \tag{14}
\]
where $G(x) = \Phi(x) + \gamma (1 - x^2) \phi(x)/6$, $\phi(x)$ is the pdf of the standard normal distribution,
and $\gamma$ is defined in (11). In some situations, the values of the cdf approximated by the RNA
method can be outside $[0, 1]$. Thus those values less than 0 are corrected to 0 and those values
larger than 1 are corrected to 1.
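
The refinement in (14), including the truncation to $[0, 1]$, can be sketched in Python as below. The function name `rna_cdf` is hypothetical and the sketch is an illustration of the formula, not the packaged implementation.

```python
import math

def rna_cdf(p):
    """Refined normal approximation (14) to F_N(k), k = 0..n, truncated to [0, 1]."""
    mu = sum(p)
    sigma = math.sqrt(sum(pj * (1.0 - pj) for pj in p))
    gamma = sum(pj * (1.0 - pj) * (1.0 - 2.0 * pj) for pj in p) / sigma ** 3
    out = []
    for k in range(len(p) + 1):
        x = (k + 0.5 - mu) / sigma
        Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))          # standard normal cdf
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # standard normal pdf
        g = Phi + gamma * (1.0 - x * x) * phi / 6.0
        out.append(min(1.0, max(0.0, g)))                         # correct values outside [0, 1]
    return out

approx = rna_cdf([0.5] * 4)         # gamma = 0 here, so RNA coincides with NA
skewed = rna_cdf([0.05] * 10)
print(approx, skewed)
```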
4 Numerical Studies
4.1 Accuracy of the Implementations of Exact Methods
The DFT-CF, RF1 and RF2 methods can provide exact values of the cdf. It is, however, desir-
able to verify that the software implementations of these methods are correct. In this section,
we use the distribution of the sum of binomial random variables to verify the implementations
of these methods.
Note that the distribution of the sum of three independent and non-identically distributed
binomial random variables is a special case of the Poisson binomial distribution. The pmf in this
special case is
\[
\xi_k = \sum_{j=0}^{k} b_{k-j,n_3} \left( \sum_{i=0}^{j} b_{i,n_1} b_{j-i,n_2} \right)
      = \sum_{j=0}^{k} \sum_{i=0}^{j} b_{i,n_1} b_{j-i,n_2} b_{k-j,n_3}, \tag{15}
\]
where $n_1 + n_2 + n_3 = n$, and $b_{i,n_1}$, $b_{i,n_2}$, and $b_{i,n_3}$ are the pmfs of Binomial($n_1$, $p_1$), Binomial($n_2$, $p_2$),
and Binomial($n_3$, $p_3$), respectively. Here $p_1$, $p_2$, and $p_3$ are the success probabilities for these
three binomial distributions. In particular,
\[
b_{i,n_1} = \binom{n_1}{i} p_1^i (1 - p_1)^{n_1 - i}, \quad
b_{i,n_2} = \binom{n_2}{i} p_2^i (1 - p_2)^{n_2 - i}, \quad \text{and} \quad
b_{i,n_3} = \binom{n_3}{i} p_3^i (1 - p_3)^{n_3 - i}.
\]
The pmfs $b_{i,n_1}$, $b_{i,n_2}$, and $b_{i,n_3}$ can be accurately computed by using existing software. With
different values of $n_1, n_2, n_3$ and $p_1, p_2, p_3$, one can obtain various Poisson binomial distributions.
We use the total absolute error (TAE) between two cdfs as a metric for accuracy comparisons.
The TAE is defined by
\[
\mathrm{TAE} = \sum_{k=0}^{n} |F(k) - F_{\mathrm{bin}}(k)|,
\]
where $F(k)$ is a cdf computed by using one of the exact methods, and $F_{\mathrm{bin}}(k)$ is the cdf
computed by using the formula in (15). Table 1 shows the results from the accuracy study for
the DFT-CF, RF1 and RF2 methods. Various values of $n_1, n_2, n_3$ and $p_1, p_2, p_3$ were chosen
to generate different scenarios. The TAE for each scenario was computed. The TAEs are
generally less than $1 \times 10^{-10}$ for the DFT-CF and RF1 methods. Thus the results show that
the DFT-CF and RF1 methods can accurately compute the cdf for the Poisson binomial
distribution. The RF2 method does not work for most cases because the algorithm is not
numerically stable.

Table 1: Accuracy comparisons for the DFT-CF, RF1 and RF2 methods.
                                                          TAE
    n     n1    n2     n3     p1     p2     p3     DFT-CF      RF1         RF2
   30     10    10     10   0.500  0.500  0.500   1.6×10^-14  7.4×10^-15  7.4×10^-15
   30     10     5     15   0.500  0.500  0.500   1.3×10^-14  5.2×10^-15  5.2×10^-15
   30     10     5     15   0.010  0.500  0.990   1.4×10^-14  7.0×10^-16  na
  300    100    50    150   0.010  0.500  0.990   1.9×10^-12  4.7×10^-14  na
3,000  1,000   500  1,500   0.010  0.500  0.990   3.6×10^-10  1.1×10^-11  na
3,000  1,000   500  1,500   0.001  0.010  0.020   3.1×10^-11  9.4×10^-11  1.6×10^-10
3,000  1,000   500  1,500   0.999  0.990  0.980   1.4×10^-09  1.1×10^-14  na
3,000  1,000   500  1,500   0.001  0.500  0.999   3.4×10^-10  7.2×10^-12  na
3,000  1,000   500  1,500   0.300  0.500  0.700   3.8×10^-10  7.7×10^-11  na
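
The verification idea of this section can be reproduced in miniature. The Python sketch below builds the reference pmf from (15) by convolving three binomial pmfs, computes the Poisson binomial pmf for the corresponding $p_j$'s by sequential convolution (used here as a stand-in for the exact methods), and reports the TAE. All function names are assumptions for illustration, and the scenario mirrors one small row of Table 1 without claiming to reproduce its reported values.

```python
from math import comb

def binom_pmf(n, p):
    """pmf of Binomial(n, p) on 0..n."""
    return [comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(n + 1)]

def convolve(u, v):
    """Discrete convolution, i.e. the pmf of a sum of two independent counts."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out

def poisson_binomial_pmf(p):
    """Exact pmf by adding one Bernoulli indicator at a time."""
    xi = [1.0]
    for pj in p:
        xi = convolve(xi, [1.0 - pj, pj])
    return xi

def cumsum(x):
    out, s = [], 0.0
    for v in x:
        s += v
        out.append(s)
    return out

n1, n2, n3 = 10, 5, 15
p1, p2, p3 = 0.01, 0.5, 0.99
pmf_bin = convolve(convolve(binom_pmf(n1, p1), binom_pmf(n2, p2)), binom_pmf(n3, p3))
pmf_pb = poisson_binomial_pmf([p1] * n1 + [p2] * n2 + [p3] * n3)
tae = sum(abs(a - b) for a, b in zip(cumsum(pmf_pb), cumsum(pmf_bin)))
print(tae)
```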
4.2 Accuracy Comparisons for Approximation Methods
Being able to compute the exact values of the cdf $F_N(k)$ allows us to study the performance of
approximation methods. To see the performance of different approximation methods, we
simulate success probabilities $p_j$'s from various patterns. Figure 1 shows the six different patterns
in $p_j$'s used in this numerical study. These patterns in the $p_j$'s are generated from the uniform
distribution, beta distribution with various values of shape parameters, and mixtures of beta
distributions. For each pattern, various values of $n$, which is the number of random indicators
in $N$, were chosen to see the effect of $n$ on the accuracy of approximation methods. In particular,
the values of $n$ were chosen from $n$ = 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000,
and 15,000.
Figure 2 shows an illustration of computed cdfs by using various methods. Each sub-figure
is based on a set of $p_j$'s simulated from Pattern (b) in Figure 1 when $n$ = 10, 50, 200,
and 1,000, respectively. The x-axis is on the logarithm scale and the y-axis is on the scale
of the quantile function of the standard normal distribution (but labeled on the original
scales). For convenience of plotting, the location where $k = 0.5$ shows the value of the cdf at
$k = 0$. Note that the Poisson binomial distribution is a discrete distribution. Thus only those
points in Figure 2 show the values of the cdfs. Those segments that connect points are for
convenience of visual comparisons. The RNA method approximates the cdf well and the NA
method approximates the cdf moderately well (there are departures in the upper and lower
tails of the cdf). The cdf computed by the PA method deviates from the true cdf. Thus the
PA method does not perform well. The RF1 method gives the same values (agreeing to
the ninth decimal place) as the DFT-CF method. The RF2 method does not work because
it is not numerically stable. Thus the results for recursive formulae are not shown in Figure 2.
To make accuracy comparisons for different methods, Table 2 shows the average TAE of
1,000 sets of $p_j$'s simulated for each combination of patterns in $p_j$'s and values of $n$. In
particular, the TAE for a set of $p_j$'s is computed by
\[
\mathrm{TAE} = \sum_{k=0}^{n} |F(k) - F_N(k)|,
\]
where $F(k)$ is a cdf computed by using one of the approximation methods, and $F_N(k)$ is the
cdf computed by using the DFT-CF method. As we can see from the results in Table 2, the
PA method does not perform well for most cases. The PA method only works reasonably well
when $\mu$ is small, for example in Pattern (b) when $n \le 50$.
For the normal approximation methods, the RNA method performs better than the NA
method for almost all cases. For Patterns (b) and (c) where $N$ is highly skewed, the RNA
method performs much better than the NA method. When $n \ge 2{,}000$, the TAE for the
RNA method is generally less than 0.005. Thus the RNA method is recommended when an
approximation method needs to be used.
For all combinations of patterns in the $p_j$'s and values of $n$ considered in Table 2, both
the DFT-CF and RF1 methods provide results that agree to the ninth decimal place. The
RF2 method, however, does not work in most cases for the same reason mentioned previously.
Thus the results for the RF1 and RF2 methods are not shown in Table 2.
4.3 Efficiency Comparisons for Exact Methods
The computing time for the exact and approximation methods needs to be considered when $n$
is large. For each combination of patterns in $p_j$'s and values of $n$ as in Section 4.2, 1,000 sets
of $p_j$'s were simulated. Table 3 gives the average time for computing the entire cdf using the
RNA, DFT-CF and RF1 methods based on those 1,000 sets of $p_j$'s. The unit of time is the
second. The computations were done by using 64-bit R on a workstation. The workstation
has an Intel Xeon CPU (X5570, 2.93 GHz) and 24 GB of RAM installed.
The results in Table 3 show that the computing time for the RNA method is generally
negligible (less than four milliseconds). The computing time for both the DFT-CF and RF1
methods is generally negligible (less than ten milliseconds) when $n \le 500$. When $n \ge 1{,}000$,
[Figure 1 omitted. Six panels, each with $p_j$ on the x-axis from 0.0 to 1.0: (a) Uniform(0, 1);
(b) Beta(0.1, 3); (c) Beta(3, 0.1); (d) 0.5Beta(3, 0.1) + 0.5Beta(0.1, 3); (e) Beta(3, 3);
(f) 0.5Beta(3, 10) + 0.5Beta(10, 3).]
Figure 1: Six different patterns in the $p_j$'s used in the numerical study. Here Beta($a$, $b$) is the
probability density function of the beta distribution with shape parameters $a$ and $b$.
[Figure 2 omitted. Four panels, each comparing the DFT-CF, RNA, NA, and PA methods:
(a) $n$ = 10; (b) $n$ = 50; (c) $n$ = 200; (d) $n$ = 1,000.]
Figure 2: An illustration of computed cdfs with various methods for Pattern (b) in Figure 1,
when $n$ = 10, 50, 200, and 1,000. The x-axis is on the logarithm scale (the location where
$k = 0.5$ shows the value of the cdf at $k = 0$) and the y-axis is on the scale of the quantile
function of the standard normal distribution.
Table 2: Average TAE of 1,000 sets of $p_j$'s simulated for each combination of patterns in $p_j$'s
and values of $n$ for accuracy comparisons of approximation methods.
Pattern (a) (b)
Method RNA NA PA RNA NA PA
n= 10 0.0209 0.0281 0.7372 0.0300 0.0466 0.0563
n= 20 0.0147 0.0200 1.0728 0.0259 0.0708 0.0897
n= 50 0.0092 0.0124 1.6948 0.0216 0.0762 0.1420
n= 100 0.0065 0.0086 2.3924 0.0195 0.0873 0.2043
n= 200 0.0046 0.0061 3.3763 0.0148 0.0940 0.2912
n= 500 0.0029 0.0038 5.3303 0.0092 0.0930 0.4637
n= 1,000 0.0021 0.0027 7.5429 0.0064 0.0925 0.6521
n= 2,000 0.0015 0.0019 10.664 0.0045 0.0919 0.9315
n= 5,000 0.0009 0.0012 16.864 0.0028 0.0919 1.4632
n= 10,000 0.0007 0.0009 23.844 0.0020 0.0918 2.0727
n= 15,000 0.0005 0.0007 29.211 0.0016 0.0918 2.5353
Pattern (c) (d)
Method RNA NA PA RNA NA PA
n= 10 0.0401 0.0838 1.4915 0.0456 0.0623 1.5046
n= 20 0.0459 0.1165 1.9571 0.0574 0.0772 2.0599
n= 50 0.0381 0.1086 3.0302 0.0434 0.0535 3.1704
n= 100 0.0225 0.0971 4.4510 0.0272 0.0330 4.4456
n= 200 0.0149 0.0952 6.7313 0.0185 0.0225 6.2709
n= 500 0.0092 0.0932 12.005 0.0114 0.0138 9.8885
n= 1,000 0.0064 0.0922 18.630 0.0080 0.0095 13.970
n= 2,000 0.0045 0.0921 28.247 0.0056 0.0067 19.762
n= 5,000 0.0028 0.0917 46.583 0.0035 0.0043 31.226
n= 10,000 0.0020 0.0917 66.215 0.0025 0.0030 44.161
n= 15,000 0.0016 0.0918 81.116 0.0020 0.0024 54.087
Pattern (e) (f)
Method RNA NA PA RNA NA PA
n= 10 0.0155 0.0215 0.6144 0.0252 0.0262 0.7726
n= 20 0.0109 0.0151 0.8733 0.0169 0.0176 1.0813
n= 50 0.0068 0.0095 1.3818 0.0105 0.0109 1.7042
n= 100 0.0048 0.0066 1.9534 0.0074 0.0077 2.4101
n= 200 0.0034 0.0047 2.7627 0.0052 0.0054 3.4005
n= 500 0.0021 0.0030 4.3608 0.0033 0.0034 5.3738
n= 1,000 0.0015 0.0021 6.1632 0.0023 0.0024 7.6009
n= 2,000 0.0011 0.0015 8.7117 0.0016 0.0017 10.745
n= 5,000 0.0007 0.0009 13.771 0.0010 0.0011 16.988
n= 10,000 0.0005 0.0007 19.485 0.0007 0.0008 24.017
n= 15,000 0.0004 0.0005 23.869 0.0006 0.0006 29.425
the RF1 method requires more computing time than the DFT-CF method. The RF1 method
also requires more RAM. For example, when $n$ = 15,000, approximately 4 GB of memory is
needed for computing the entire cdf. The DFT-CF method, however, is less demanding in
memory. Thus the DFT-CF method is recommended for computing the exact values for the
cdf $F_N(k)$, especially when $n$ is large.
5 Software Implementation
The DFT-CF, RF1, RNA and NA methods have been implemented in R. The computation-
ally intensive components such as the FFT are implemented in C and are linked to R. The
R functions have been wrapped into an R package poibin which can be downloaded from
the Comprehensive R Archive Network (http://cran.r-project.org/). The R function in the
package for computing the cdf $F_N(k)$ is ppoibin(), which has an option that allows users to
specify the method for computing.
6 Concluding Remarks
In this paper, we focus on the computing of the distribution function for the Poisson binomial
distribution. We present a simple derivation for an exact formula with a closed-form expres-
sion. We develop an algorithm for efficient implementation of the exact formula and study
the advantages and disadvantages of various approximation methods. Numerical studies were
conducted to compare the accuracy of the exact and approximation methods. The DFT-CF,
RF1, RNA and NA methods have been implemented in an R package.
In practice, the DFT-CF method is generally recommended for computing. The RF1
method can also be used when $n < 1{,}000$, because there is not much difference in computing
time from the DFT-CF method. The RNA method is recommended when $n > 2{,}000$ and the
cdf needs to be evaluated many times. As shown in the numerical study, the RNA method
can approximate the cdf well when $n$ is large, and is more computationally efficient. The PA
method, however, is not recommended because its performance is generally poor. The RF2
method is not recommended either, because the algorithm is not numerically stable.
Acknowledgement
The author would like to thank the editor, an associate editor, and the referees, who provided
valuable comments that helped improve this paper. The author also would like to thank
William Q. Meeker and Qingyu Yang for their valuable comments on an earlier version of
Table 3: Computational efficiency comparisons for the RNA, DFT-CF and RF1 methods,
based on 1,000 sets of $p_j$'s simulated from each combination of patterns in $p_j$'s and values of
$n$. The unit of time is the second.
Pattern (a) (b)
Method RNA DFT-CF RF1 RNA DFT-CF RF1
n= 10 0.000 0.000 0.000 0.000 0.000 0.000
n= 20 0.000 0.000 0.000 0.000 0.000 0.000
n= 50 0.000 0.000 0.000 0.000 0.000 0.000
n= 100 0.000 0.000 0.000 0.000 0.000 0.000
n= 200 0.000 0.001 0.001 0.000 0.001 0.001
n= 500 0.000 0.008 0.005 0.000 0.006 0.006
n= 1,000 0.000 0.029 0.068 0.000 0.022 0.069
n= 2,000 0.001 0.111 0.185 0.000 0.084 0.181
n= 5,000 0.001 0.691 0.825 0.001 0.528 0.814
n= 10,000 0.002 2.735 3.377 0.002 2.100 3.307
n= 15,000 0.003 6.176 7.715 0.003 4.736 7.658
Pattern (c) (d)
Method RNA DFT-CF RF1 RNA DFT-CF RF1
n= 10 0.000 0.000 0.000 0.000 0.000 0.000
n= 20 0.000 0.000 0.000 0.000 0.000 0.000
n= 50 0.000 0.000 0.000 0.000 0.000 0.000
n= 100 0.000 0.000 0.000 0.000 0.000 0.000
n= 200 0.000 0.001 0.001 0.000 0.001 0.001
n= 500 0.000 0.006 0.005 0.000 0.006 0.005
n= 1,000 0.000 0.026 0.068 0.000 0.024 0.074
n= 2,000 0.000 0.098 0.185 0.000 0.094 0.193
n= 5,000 0.001 0.617 0.809 0.001 0.581 0.827
n= 10,000 0.002 2.445 3.337 0.002 2.271 3.359
n= 15,000 0.003 5.517 7.731 0.003 5.141 7.665
Pattern (e) (f)
Method RNA DFT-CF RF1 RNA DFT-CF RF1
n= 10 0.000 0.000 0.000 0.000 0.000 0.000
n= 20 0.000 0.000 0.000 0.000 0.000 0.000
n= 50 0.000 0.000 0.000 0.000 0.000 0.000
n= 100 0.000 0.000 0.000 0.000 0.000 0.000
n= 200 0.000 0.001 0.001 0.000 0.001 0.001
n= 500 0.000 0.007 0.005 0.000 0.007 0.004
n= 1000 0.000 0.029 0.067 0.000 0.027 0.069
n= 2,000 0.001 0.109 0.186 0.000 0.104 0.194
n= 5,000 0.001 0.689 0.859 0.001 0.654 0.912
n= 10,000 0.002 2.743 3.348 0.002 2.593 3.661
n= 15,000 0.002 6.181 7.741 0.003 5.847 8.626
17
the paper. The work was supported by funds from NSF Award CMMI-1068933 and the 2011
DuPont Young Professor Award.
References
Athreya, K. B. and S. N. Lahiri (2006). Measure Theory and Probability Theory. New York:
Springer.
Barlow, R. E. and K. D. Heidtmann (1984). Computing k-out-of-n system reliability. IEEE
Transactions on Reliability 33, 322–323.
Bossavy, A., R. Girard, and G. Kariniotakis (2012). Forecasting ramps of wind
power production with numerical weather prediction ensembles. Wind Energy,
doi: 10.1002/we.526.
Bracewell, R. (2000). The Fourier Transform & Its Applications (Third ed.). Singapore:
McGraw-Hill, Inc.
Chen, S. X. and J. S. Liu (1997). Statistical applications of the Poisson-binomial and con-
ditional Bernoulli distributions. Statistica Sinica 7, 875–892.
Chen, X.-H., A. P. Dempster, and J. S. Liu (1994). Weighted finite population sampling to
maximize entropy. Biometrika 81, 457–469.
Cooley, J. W. and J. W. Tukey (1965). An algorithm for the machine calculation of complex
Fourier series. Mathematics of Computation 19, 297–301.
Duffie, D., L. Saita, and K. Wang (2007). Multi-period corporate default prediction with
stochastic covariates. Journal of Financial Economics 83, 635–665.
Fernández, M. and S. Williams (2010). Closed-form expression for the Poisson-binomial
probability density function. IEEE Transactions on Aerospace and Electronic Systems 46,
803–817.
Hong, Y. and W. Q. Meeker (2010). Field-failure and warranty prediction based on auxiliary
use-rate information. Technometrics 52, 148–159.
Hong, Y., W. Q. Meeker, and J. D. McCalley (2009). Prediction of remaining life of power
transformers based on left truncated and right censored lifetime data. The Annals of
Applied Statistics 3, 857–879.
Kuo, W. and M. Zuo (2003). Optimal Reliability Modeling: Principles and Applications.
Hoboken, NJ: John Wiley & Sons, Inc.
Niida, A., S. Imoto, T. Shimamura, and S. Miyano (2012). Statistical model-based testing
to evaluate the recurrence of genomic aberrations. Bioinformatics 28, i115–i120.
Pitacco, E. (2007). Mortality and longevity: A risk management perspective. In IAA Life
Colloquium, Stockholm, available at
http://www.actuaries.org/LIFE/Events/Stockholm/Pitacco.pdf.
Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution. Pacific
Journal of Mathematics 10, 1181–1197.
R (2012). The R Project for Statistical Computing. http://www.r-project.org/.
Samuels, S. M. (1965). On the number of successes in independent trials. The Annals of
Mathematical Statistics 36, 1272–1278.
Singleton, R. C. (1969). An algorithm for computing the mixed radix fast Fourier transform.
IEEE Transactions on Audio and Electroacoustics AU-17, 93–103.
Tang, P. and E. A. Peterson (2011). Mining probabilistic frequent closed itemsets in un-
certain databases. In Proceedings of the 49th ACM Southeast Conference (ACMSE),
Kennesaw, GA, pp. 86–91.
Volkova, A. Y. (1996). A refinement of the central limit theorem for sums of independent
random indicators. Theory of Probability and its Applications 40, 791–794.
Wang, Y. H. (1993). On the number of successes in independent trials. Statistica Sinica 3,
295–312.