
# Group testing with nested pools


## Abstract and Figures

We iterate Dorfman's pool testing algorithm \cite{dorfman} to identify infected individuals in a large population, a classification problem. This is an adaptive scheme with nested pools: pools at a given stage are subsets of the pools at the previous stage. We compute the mean and variance of the number of tests per individual as a function of the pool sizes $m=(m_1,\dots,m_k)$ in the first $k$ stages; in the $(k+1)$-th stage all remaining individuals are tested. Denote by $D_k(m,p)$ the mean number of tests per individual, which we will call the cost of the strategy $m$. The goal is to minimize $D_k(m,p)$ and to find the optimizing values $k$ and $m$. We show that the cost of the strategy $(3^k,\dots,3)$ with $k\approx \log_3(1/p)$ is of order $p\log(1/p)$, and differs from the optimal cost by a fraction of this value. To prove this result we bound the difference between the cost of this strategy and the minimal cost when pool sizes take real values. We conjecture that the optimal strategy, depending on the value of $p$, is indeed of the form $(3^k,\dots,3)$ or of the form $(3^{k-1}4,3^{k-1},\dots,3)$, with a precise description for $k$. This conjecture is supported by inspection of a family of values of $p$. Finally, we observe that for these values of $p$ and the best strategy of the form $(3^k,\dots,3)$, the standard deviation of the number of tests per individual is of the same order as the cost. As an example, when $p=0.02$, the optimal strategy is $k=3$, $m=(27,9,3)$. The cost of this strategy is $0.20$, that is, the mean number of tests required to screen 100 individuals is 20.
Inés Armendáriz, Pablo A. Ferrari, Daniel Fraiman, Silvina Ponce Dawson
June 9, 2020
Keywords: Dorfman's retesting, Group testing, Nested pooled testing, Adaptive testing.
AMS Math Classification: Primary 62P10
1 Introduction
The outbreak of COVID-19 caused by the novel Coronavirus, SARS-CoV-2, has spread over almost all countries in the world [19]. Running diagnostic tests is a key tool not only for the treatment of those infected but also to make decisions on how to handle the spread of the epidemic within nations and communities. The possibility of running the current gold standard test, RT-qPCR, in sample pools was investigated in [25], finding that the identification of individuals infected with SARS-CoV-2 is in fact possible using mixtures of up to 32 individual samples. The use of more sensitive tests [24, 6] would likely improve this limit.
Inés Armendáriz: Universidad de Buenos Aires & IMAS-CONICET-UBA. Email: iarmend@dm.uba.ar
Pablo A. Ferrari: Universidad de Buenos Aires & IMAS-CONICET-UBA. Email: pferrari@dm.uba.ar
Daniel Fraiman: Universidad de San Andrés & CONICET. Email: dfraiman@udesa.edu.ar
Silvina Ponce Dawson: Universidad de Buenos Aires & IFIBA-CONICET-UBA. Email: silvina@df.uba.ar
arXiv:2005.13650v2 [math.ST] 6 Jun 2020
Dorfman [7] was the first to propose a group testing strategy, in 1943. The samples from $n$ individuals are pooled and tested together. If the test is negative, the $n$ individuals of the pool are cleared. If the test is positive, each of the $n$ individuals must be tested separately, and $n+1$ tests are required to test $n$ people. In the present note we focus on the mathematical aspects of a sequential multi-stage extension of Dorfman's algorithm. This scheme belongs to the family of adaptive models, in which the course of action chosen at each stage depends on the results of previous stages. Our approach assumes that each test is conclusive, i.e. there are no false positives or negatives.
Dorfman's 2-stage strategy was subsequently improved by Sterrett [23] to further reduce the number of tests and extended to more stages of group testing by Sobel and Groll [22] and Finucan [10]. Noticing that the optimal strategy depends on the fraction of infected individuals within the tested population, in [22] the approach was extended to estimate the infection probability $p$, and to situations with subpopulations characterized by different infection probabilities. References [16, 2, 3] proposed to use information on heterogeneous populations to improve Dorfman's algorithm. An extension of Dorfman's algorithm in which each group is tested several times to minimize testing errors was presented in [11]. The previous strategies are classified as adaptive. Adaptive strategies can lead to errors if the test gives false positives or negatives. When testing errors are present and/or tests are time consuming, it may be convenient to perform tests in parallel; these methods are called nonadaptive [20, 8, 14, 5, 4, 17, 1]. Nonadaptive testing is not necessarily free of errors. The impact of test sensitivity in both adaptive and nonadaptive testing was analyzed in [14].
Different pool testing strategies have been analyzed specifically for the case of SARS-CoV-2. In [12] Dorfman's algorithm is applied including the use of replicates to check for false negatives or positives. In [18] adaptive and nonadaptive methods that use binary splitting are compared numerically. The work in [21] evaluates numerically the performance of two-dimensional array pooling, comparing it with Dorfman's strategy.
Under Dorfman's strategy [7] the mean number of tests per pool of $n$ people is
$$1 + n\bigl(1-(1-p)^n\bigr), \tag{1}$$
where $p$ is the probability that an individual is infected, assuming that the events that different people are infected are independent. The first term above is the number of tests in the first stage: one test per pool. The second term accounts for the $n$ additional tests required in the second stage, one test per individual in the pool, when there is at least one infected individual, an event with probability $1-(1-p)^n$. Dividing by $n$ in (1), the cost of the scheme, that is the mean number of tests per individual, is then
$$D(n,p) = \frac{1}{n} + 1 - (1-p)^n. \tag{2}$$
The cost of the one-stage strategy consisting in testing every person in the pool is 1, hence Dorfman's strategy is worth pursuing only if $D(n,p)<1$. Solving for $p$, this means that
$$p < 1 - \frac{1}{e^{1/e}} \approx 0.3077992\ldots, \tag{3}$$
and in this case $n$ must be greater than or equal to 3. For small values of $p$ the value $n$ that minimizes $D$ as a function of $p$ is approximately $p^{-1/2}$ and $D(p^{-1/2},p)\approx 2\sqrt{p}$, see Feller [9].
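As a small numerical illustration of (2) and (3), the following sketch evaluates Dorfman's cost and searches for the best integer pool size. The function names and the search cap `n_max` are illustrative choices, not part of the paper.

```python
import math

def dorfman_cost(n: int, p: float) -> float:
    """Mean number of tests per individual, D(n, p) = 1/n + 1 - (1 - p)**n, as in (2)."""
    return 1.0 / n + 1.0 - (1.0 - p) ** n

def best_dorfman_pool(p: float, n_max: int = 1000):
    """Integer pool size in [2, n_max] with the smallest cost (n_max is an arbitrary cap)."""
    best_n = min(range(2, n_max + 1), key=lambda n: dorfman_cost(n, p))
    return best_n, dorfman_cost(best_n, p)

p = 0.01
n_opt, cost = best_dorfman_pool(p)
print("best integer pool size and cost:", n_opt, cost)
print("small-p approximations p**-0.5 and 2*sqrt(p):", p ** -0.5, 2 * math.sqrt(p))

# Threshold (3): above this infection probability pooling cannot beat individual testing.
threshold = 1 - math.exp(-1 / math.e)
print("threshold:", threshold)
```

For $p=0.01$ the integer search lands close to the approximation $p^{-1/2}=10$, and the cost is close to $2\sqrt{p}=0.2$.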
In this note we propose to consider a sequence of nested pools, a scheme mentioned by Sobel and Groll [22]. In the first stage test pools of size $m_1$; individuals in pools that tested negative are healthy. Pools that tested positive are split into (smaller) pools of size $m_2$, and these pools are tested in the second stage. Iterate until the $k$-th stage where the pools are of size $m_k$, and finally, in the $(k+1)$-th stage, test every remaining individual belonging to pools that tested positive in the previous stage. For each infection proportion $p$ and pool strategy given by the choice of $k$ and $m=(m_1,\dots,m_k)$, let
$$T_k(m,p) := \text{total number of tests per initial pool (of size } m_1\text{)},$$
$$D_k(m,p) := \frac{1}{m_1}\,E\,T_k = \text{cost of this strategy}.$$
We compute a precise formula for $T_k(m,p)$ and derive its mean and variance in §2.
Consider $D_k(m,p):\mathbb{R}^k_{\ge 0}\to\mathbb{R}$, that is, allow pool sizes to take real values. We show in §3.1 that for a given $p$ and fixed $k$, the minimum of $D_k(m,p)$ can be computed using the Lambert $W$ function. Unfortunately there is no amenable formula for this minimum cost as a function of $k$. On the other hand, the linearization $L_k(m,p)$ in $p$ of the cost function has a very simple form, and its minimum is achieved at $(e^k,\dots,e)$ with $k\approx\log(1/p)$, see §3.3. We note that $L_k(m,p)$ coincides with the cost of the simplified scheme proposed by Finucan [10].
Drawing inspiration from the study of the linear approximation, we consider in §3.4 the family of pool strategies $(3^k,\dots,3)$, $k\in\mathbb{N}$, which are associated to feasible testing schemes. We compute in Lemma 8 the optimal choice of $k$ for this family, which we denote by $k_3=k_3(p)$, and prove in Theorem 9 that the cost $D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)$ is of the same order as the minimum cost
$$D^\star(p) = \min_{k,m} D_k(m,p) = O\bigl(p\log(1/p)\bigr) \tag{4}$$
and bound their difference. Notice that the optimization in (4) is carried out over real-valued $m\in\mathbb{R}^k$, and the minimum attained is a lower bound to the optimum value achieved when considering actual testing schemes. Thus Theorem 9 provides a bound for the difference between $D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)$ and the cost of the optimal feasible strategy, and shows that both are $O(p\log(1/p))$.
In §3.5 we study by inspection the cost function restricted to feasible nested pooling strategies, for a family of values of $p\in[0.002,0.1]$. The results confirm that in most cases $(3^{k_3},\dots,3)$ is the optimal strategy, see Table 1. These findings can be used as a guide to designing concrete group testing strategies. In §3.6 we contrast these discrete optimization results with those obtained in the linearized optimization problem, Table 2.
After plotting many different strategies, we conjecture in §3.7 that the optimal strategy is given by $(3^j,\dots,3)$ when $p\in[\lambda_j,\rho_j]$, and $(3^{j-1}4,3^{j-1},\dots,3)$ when $p\in[\rho_{j+1},\lambda_j]$, where $\rho_1>\lambda_1>\rho_2>\lambda_2>\dots$ are given explicitly.
Lemma 1 in §2 gives a closed formula for $\mathrm{Var}\,T_k(m,p)$ in the particular case of interest here, when pool sizes $m_1,\dots,m_k$ are powers of a fixed quantity. We apply it to compute the standard deviation of the number of tests per person in the discrete optimization solution in §3.5. It turns out that, for the family of values of $p$ considered, the standard deviation is of the same order as the mean number of tests, Table 1.
The article is organized as follows. In §2 we define the pool strategy and compute the mean and variance of the random number of tests per individual. We consider the problem of optimizing the cost function in §3. We include three appendices with technical computations.
2 Nested strategy
We iterate Dorfman’s strategy using nested pools, that is, pools in each stage are obtained as a
partition of the positive-testing pools in the previous stage.
We work under the assumption that the events that different individuals in the population to be tested are infected are independent. Let $X=(X_1,\dots,X_N)$ with i.i.d. $X_i\sim\mathrm{Bernoulli}(p)$. When $X_i=1$ we say that the individual $i$ is infected. The parameter $p$ is the probability that an individual is infected. For any subset $A$ of the population, that is $A\subset\{1,\dots,N\}$, the function
$$\varphi_A(X) := \prod_{i\in A}(1-X_i) \tag{5}$$
is called a test. Notice that if there is no infected individual in $A$ then the test takes the value 1, and otherwise it vanishes. The goal is to reveal the values of $X$ as the result of a family of tests.
To describe Dorfman's strategy [7] in these terms, let $m_1=n$ denote the chosen size for the pools, let $W=\{X_i\}_{1\le i\le m_1}$ be one of the pools, and compute
$$\varphi := (1-X_1)\cdots(1-X_{m_1}). \tag{6}$$
This test is the first stage of the strategy. If $\varphi=1$ the pool has no infected individuals, and we conclude that $X_1=\dots=X_{m_1}=0$. If $\varphi=0$ we move on to the second stage, where we compute each $X_i$ individually. The number $T=T(X)$ of performed tests is a function of $X$ given by
$$T = 1 + m_1(1-\varphi), \tag{7}$$
and the cost (2) is
$$D(m_1,p) := \frac{1}{m_1}\,E\,T = \frac{1}{m_1} + 1 - q^{m_1}, \tag{8}$$
with
$$q := 1-p. \tag{9}$$
Let us now describe the 3-stage procedure. Denote by $m_1,m_2\ge 1$ the sizes of the pools in the first and second stage, respectively, where $m_1$ is a multiple of $m_2$. Let $W=\{X_i\}_{1\le i\le m_1}$ be the pool in the first stage. We partition this family into $\frac{m_1}{m_2}$ subsets $W_0,\dots,W_{\frac{m_1}{m_2}-1}$, by setting
$$W_i := \bigl\{X_{m_2 i+1},\dots,X_{m_2(i+1)}\bigr\},\qquad 0\le i\le \frac{m_1}{m_2}-1. \tag{10}$$
These are the pools of the second stage. Let
$$\varphi := \prod_{i=1}^{m_1}(1-X_i)\quad\text{and}\quad \varphi_i = \prod_{j=1}^{m_2}(1-X_{m_2 i+j}),\qquad 0\le i\le \frac{m_1}{m_2}-1. \tag{11}$$
As before, $\varphi=1$ if the pool $W$ has no infected individuals and 0 otherwise, and similarly $\varphi_i=1$ if and only if $W_i$ contains no infected individuals.
In order to reveal which of the variables $X_i$, $1\le i\le m_1$, equal 1, we propose the following sequential group testing scheme: first evaluate $\varphi$. If $\varphi=1$ then there are no infected individuals in $W$. If $\varphi=0$ then evaluate $\varphi_i$, $0\le i\le \frac{m_1}{m_2}-1$. This is the second stage of testing. Note that conditional on $\varphi=0$ there must be at least one index $i$ with $\varphi_i=0$; in this case we will say that the $i$-th pool is infected. Finally, in the third stage, apply the test to each individual belonging to an infected pool in the second stage. Denote by $T_2$ the number of tests for the sample $W$ under this 3-stage scheme. Note that we label $T$ according to the number $k$ of pooled stages (2 in this case) rather than with the total number of stages. $T_2$ is a function of $W=(X_1,\dots,X_{m_1})$. We get
$$T_2 = 1 + (1-\varphi)\frac{m_1}{m_2} + (1-\varphi)\,m_2\sum_{i=0}^{\frac{m_1}{m_2}-1}(1-\varphi_i). \tag{12}$$
Now $1-\varphi\sim\mathrm{Bernoulli}(1-q^{m_1})$, and $\{1-\varphi_i\}_{0\le i\le \frac{m_1}{m_2}-1}$ are independent, identically distributed random variables with distribution $\mathrm{Bernoulli}(1-q^{m_2})$, hence
$$I := \sum_{i=0}^{\frac{m_1}{m_2}-1}(1-\varphi_i)\sim\mathrm{Binomial}\Bigl(\frac{m_1}{m_2},\,1-q^{m_2}\Bigr). \tag{13}$$
Note that $\varphi=1$ implies $\varphi_i=1$, $i=0,\dots,\frac{m_1}{m_2}-1$, hence
$$\varphi\,(1-\varphi_i)\equiv 0\quad\text{and}\quad I\varphi\equiv 0. \tag{14}$$
Replacing in (12) we get
$$T_2 = 1 + (1-\varphi)\frac{m_1}{m_2} + m_2\sum_{i=0}^{\frac{m_1}{m_2}-1}(1-\varphi_i). \tag{15}$$
Expected number of tests per individual. Let $D_2(m_1,m_2,p) := \frac{1}{m_1}E\,T_2$ denote the cost of the 3-stage scheme with size $m_1$ pools in the first stage, size $m_2$ pools in the second stage, and probability of infection $p$. From (15) we get
$$D_2(m_1,m_2,p) = \frac{1}{m_1} + \frac{1}{m_2}\bigl(1-q^{m_1}\bigr) + 1 - q^{m_2}. \tag{16}$$
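In general, for the $(k+1)$-stage scheme, let us denote by $T_k$ the total number of tests that are needed to classify the variables in the pool $W$. We get
$$T_k = 1 + (1-\varphi)\frac{m_1}{m_2} + \frac{m_2}{m_3}\sum_{i=0}^{\frac{m_1}{m_2}-1}(1-\varphi_i) + \frac{m_3}{m_4}\sum_{i_1=0}^{\frac{m_1}{m_2}-1}\sum_{i_2=0}^{\frac{m_2}{m_3}-1}(1-\varphi_{i_1 i_2}) + \dots + m_k\sum_{i_1=0}^{\frac{m_1}{m_2}-1}\sum_{i_2=0}^{\frac{m_2}{m_3}-1}\cdots\sum_{i_{k-1}=0}^{\frac{m_{k-1}}{m_k}-1}\bigl(1-\varphi_{i_1 i_2\dots i_{k-1}}\bigr), \tag{17}$$
where the $(1-\varphi_{i_1 i_2\dots i_{k-1}})$ are i.i.d. $\mathrm{Bernoulli}(1-q^{m_k})$. Then, denoting $m=(m_1,\dots,m_k)$ and
$$D_k(m,p) := \frac{1}{m_1}\,E\,T_k, \tag{18}$$
the cost is
$$D_k(m,p) = \frac{1}{m_1} + 1 - q^{m_k} + \sum_{j=2}^{k}\frac{1}{m_j}\bigl(1-q^{m_{j-1}}\bigr). \tag{19}$$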
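Note that, if we denote by
$$T_k^j := \frac{m_{j-1}}{m_j}\sum_{i_1=0}^{\frac{m_1}{m_2}-1}\sum_{i_2=0}^{\frac{m_2}{m_3}-1}\cdots\sum_{i_{j-2}=0}^{\frac{m_{j-2}}{m_{j-1}}-1}\bigl(1-\varphi_{i_1 i_2\dots i_{j-2}}\bigr) \tag{20}$$
the number of tests performed at the $j$-th stage, then its mean is independent of $k$,
$$E\,T_k^j = \frac{m_1}{m_j}\bigl(1-q^{m_{j-1}}\bigr). \tag{21}$$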
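The cost formula (19) can be checked against a direct simulation of the nested scheme. The sketch below is only an illustration under the paper's assumptions (independent infections, error-free tests); the function names, the number of simulated pools and the random seed are arbitrary choices made here.

```python
import random

def nested_cost(m, p):
    """Cost D_k(m, p) of equation (19); m = (m_1, ..., m_k), each m_j a multiple of m_{j+1}."""
    q = 1.0 - p
    cost = 1.0 / m[0] + 1.0 - q ** m[-1]
    for j in range(1, len(m)):
        cost += (1.0 - q ** m[j - 1]) / m[j]
    return cost

def count_tests(pool, sub_sizes):
    """Tests spent on one pool (list of 0/1 infection flags); sub_sizes holds the later pool sizes."""
    if len(pool) == 1 or not any(pool):
        return 1  # an individual test, or a negative pool: one test and nothing more
    size = sub_sizes[0] if sub_sizes else 1  # after the last pooled stage, test individuals
    return 1 + sum(count_tests(pool[i:i + size], sub_sizes[1:])
                   for i in range(0, len(pool), size))

def simulate_cost(m, p, n_pools=20000, seed=0):
    """Monte Carlo estimate of the mean number of tests per individual."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_pools):
        pool = [1 if rng.random() < p else 0 for _ in range(m[0])]
        total += count_tests(pool, list(m[1:]))
    return total / (n_pools * m[0])

m, p = (27, 9, 3), 0.02
exact = nested_cost(m, p)
estimate = simulate_cost(m, p)
print("formula (19):", exact, "simulation:", estimate)
```

For $m=(27,9,3)$ and $p=0.02$ the formula gives a cost of about $0.198$, the value quoted in the abstract, and the simulated average should agree with it up to sampling noise.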
Variance. The variance of $T_k$ can be computed explicitly. We write down here the case $k=2$; the proof is given in Appendix A,
$$\mathrm{Var}\,T_2 = \frac{m_1^2}{m_2^2}\,q^{m_1}\bigl(1-q^{m_1}\bigr) + m_2 m_1\,q^{m_2}\bigl(1-q^{m_2}\bigr) + 2\,\frac{m_1^2}{m_2}\,q^{m_1}\bigl(1-q^{m_2}\bigr). \tag{22}$$
An important case is when the ratio between consecutive pool sizes is constant and given by the last pool size $m_k$. This case will be relevant for the linearized optimization problem in §3.3 and for the discrete optimization problem in §3.5.
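Lemma 1 (Variance when pool-sizes are powers of $m_k$). Let $\frac{m_j}{m_{j+1}}=\mu=m_k$, $1\le j\le k-1$, so that $m_j=\mu^{k-j+1}$. Then,
$$\mathrm{Var}\,T_k = \mu^2\sum_{i=1}^{k}\mu^{i-1}\bigl(1-q^{m_i}\bigr)\Bigl[q^{m_i} + 2\sum_{j=1}^{i-1}q^{m_j}\Bigr] \tag{23}$$
$$= \mu^2\Bigl\{\sum_{i=1}^{k}\mu^{i-1}\bigl(1-q^{\mu^{k-i+1}}\bigr)\Bigl[q^{\mu^{k-i+1}} + 2\sum_{j=1}^{i-1}q^{\mu^{k-j+1}}\Bigr]\Bigr\}. \tag{24}$$
This lemma is proved in Appendix A. Computations are similar for the general case, without assumptions on the sequence of pool sizes.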
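The variance formulas (22) and (23) are straightforward to evaluate numerically. The following sketch (function names are illustrative) computes the standard deviation of the number of tests per individual, $\frac{1}{m_1}\sqrt{\mathrm{Var}\,T_k}$, and checks that the two formulas agree for $k=2$ when $m=(\mu^2,\mu)$.

```python
import math

def variance_powers(mu, k, p):
    """Var T_k from (23) for pool sizes m_i = mu**(k - i + 1), i = 1, ..., k."""
    q = 1.0 - p
    m = [mu ** (k - i + 1) for i in range(1, k + 1)]  # m[0] = m_1, ..., m[k-1] = m_k
    total = 0.0
    for i in range(1, k + 1):
        q_i = q ** m[i - 1]
        earlier = sum(q ** m[j - 1] for j in range(1, i))  # sum of q**m_j over j < i
        total += mu ** (i - 1) * (1.0 - q_i) * (q_i + 2.0 * earlier)
    return mu ** 2 * total

def variance_two_stage(m1, m2, p):
    """Var T_2 from (22), for any m1 that is a multiple of m2."""
    q = 1.0 - p
    a, b = q ** m1, q ** m2
    return ((m1 / m2) ** 2 * a * (1 - a)
            + m1 * m2 * b * (1 - b)
            + 2 * m1 ** 2 / m2 * a * (1 - b))

# Standard deviation per individual for the strategy (27, 9, 3) at p = 0.02.
sigma = math.sqrt(variance_powers(3, 3, 0.02)) / 27
print("sigma for (27, 9, 3), p = 0.02:", sigma)

# (22) and (23) should coincide when k = 2 and m = (9, 3).
print(variance_powers(3, 2, 0.1), variance_two_stage(9, 3, 0.1))
```

The value of `sigma` can be compared with the $\sigma$ column of Table 1 for $p=0.02$.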
3 Optimization
In this section we search for the values of $k$ and $m=(m_1,\dots,m_k)$ which minimize $D_k(m,p)$. Throughout §3.1, §3.2 and §3.3 pool sizes are allowed to take values in $\mathbb{R}_{\ge 0}$. In §3.1 we show that it is possible to optimize the cost $D_k$ for fixed $p$ and $k$ using the Lambert $W$ function, and derive some estimates for the optimal strategy in §3.2. In §3.3 we optimize the linearization of the cost, and bound its difference with the cost. We optimize over the family of strategies $(3^k,\dots,3)$, $k\in\mathbb{N}$, in §3.4, and show that the cost of the best choice is of the same order as that of the optimal strategy (Theorem 9); we compare this cost with the linearized cost for the same strategy in §3.6. In §3.5 we optimize over nested pool sizes $(m_1,\dots,m_k)$ in $\mathbb{N}^k$ by inspection, for some values of $p$. Finally, in §3.7 we conjecture the optimal nested strategy for any $p\in[0,1]$.
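3.1 Exact optimization with $(m_1,\dots,m_k)\in\mathbb{R}^k_{\ge 0}$
We now optimize (19) for fixed $p$ and $k$, over the vector of nested pool sizes $(m_1,\dots,m_k)$ in $\mathbb{R}^k_{\ge 0}$. Denote
$$m_0=\infty,\qquad m_{k+1}=1,\qquad x_i = m_i\log q. \tag{25}$$
With this notation, (19) reads
$$D_k(m,p) = \log q\sum_{j=1}^{k+1}\frac{1-e^{x_{j-1}}}{x_j} =: h(x)\log q. \tag{26}$$
Consider $x=(x_1,\dots,x_k)\in\mathbb{R}^k_{\le 0}$. Since $\log q<0$, minimizing $D_k$ amounts to maximizing $h$, so we look for a maximum of $h$. Setting $\frac{\partial h}{\partial x_i}=0$, $i=1,\dots,k$, we get the following equations
$$x_j^2\,e^{x_j} = -x_{j+1}\bigl(1-e^{x_{j-1}}\bigr),\qquad j=1,\dots,k. \tag{27}$$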
This system can be solved exactly in terms of the principal branch $W_0$ of the Lambert function. As an example, for $k=2$ the two equations are
$$x_1^2 e^{x_1} = -x_2\bigl(1-e^{x_0}\bigr),\qquad x_2^2 e^{x_2} = -x_3\bigl(1-e^{x_1}\bigr), \tag{28}$$
or, using (25),
$$x_1^2 e^{x_1} = -x_2,\qquad x_2^2 e^{x_2} = -\bigl(1-e^{x_1}\bigr)\log q. \tag{29}$$
From the second equation we get
$$\frac{x_2}{2}\,e^{x_2/2} = -\frac12\sqrt{\bigl(1-e^{x_1}\bigr)\log(1/q)}.$$
The function $z\mapsto ze^z$ maps $(-\infty,0]$ to $[-e^{-1},0]$. Hence, if the right hand side of the above expression falls in the interval $[-e^{-1},0]$, we can solve
$$x_2 = 2W_0\Bigl(-\frac12\sqrt{\bigl(1-e^{x_1}\bigr)\log(1/q)}\Bigr). \tag{30}$$
The choice of the principal branch $W_0:[-e^{-1},0]\to[-1,0]$ of the Lambert function is required by the first equation in (29) and the fact that $\inf_{z\le 0}\bigl(-z^2e^z\bigr)=-4e^{-2}$. Plugging (30) in the left equation of (29), we have
$$x_1^2 e^{x_1} = -2W_0\Bigl(-\frac12\sqrt{\bigl(1-e^{x_1}\bigr)\log(1/q)}\Bigr). \tag{31}$$
For $p=0.04$, we use Mathematica [13] to get $x_1\approx-0.452041460261919$ and $x_2\approx-0.1300281628$. Dividing by $\log 0.96$ we obtain $m_1\approx 11.07347805$ and $m_2\approx 3.185247667$. Rounding to get a feasible strategy such that $m_1$ is a multiple of $m_2$, we get $m_1=12$ and $m_2=3$. This strategy turns out to be optimal among the strategies with $m_1\le 100$; its cost is shown in Table 1. When $p=0.08$ we get $m_1\approx 7.893901064$ and $m_2\approx 2.69028519$. These values are close to the strategies $(8,2)$ and $(9,3)$, and $D_2((9,3),0.08)\approx 0.508369323 < D_2((8,2),0.08)\approx 0.521990563$. In fact, $(9,3)$ is optimal among the strategies such that $m_1\le 100$.
This procedure can be carried out for any $k$. For instance, for $k=3$, denoting $V:[-4e^{-2},0]\to\mathbb{R}$ by $V(x)=2W_0\bigl(-\tfrac{\sqrt{-x}}{2}\bigr)$ and $g(z)=V^{-1}(z)=-z^2e^z$, we get
$$x_1 = V\Bigl(V\Bigl(\bigl(1-e^{x_1}\bigr)\,V\bigl(\bigl(1-e^{g(x_1)}\bigr)\log q\bigr)\Bigr)\Bigr). \tag{32}$$
Once we know $x_1$ we can derive $x_2$ and $x_3$ using (27).
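The paper solves (29) with Mathematica and the Lambert function. As an alternative illustration, the sketch below eliminates $x_2$ through the first equation of (29) and solves the remaining scalar equation by plain bisection, using only the standard library. This is a swapped-in technique, not the authors' method, and the bracket $[-0.6,-0.3]$ for $x_1$ is a hand-picked assumption that is only claimed to work for $p=0.04$.

```python
import math

def solve_two_stage(p, lo=-0.6, hi=-0.3, iters=200):
    """Solve the stationarity equations (29) for k = 2 by bisection on x_1.

    The default bracket [lo, hi] is an assumption chosen by hand for p = 0.04.
    """
    log_q = math.log(1.0 - p)

    def residual(x1):
        x2 = -x1 * x1 * math.exp(x1)  # first equation of (29): x_2 = -x_1^2 e^{x_1}
        # second equation of (29), written as left side minus right side
        return x2 * x2 * math.exp(x2) + (1.0 - math.exp(x1)) * log_q

    f_lo = residual(lo)
    if f_lo * residual(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = residual(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    x1 = 0.5 * (lo + hi)
    x2 = -x1 * x1 * math.exp(x1)
    return x1 / log_q, x2 / log_q  # m_i = x_i / log q, see (25)

m1, m2 = solve_two_stage(0.04)
print("real-valued optimum for p = 0.04:", m1, m2)
```

The result can be compared with the values $m_1\approx 11.07$ and $m_2\approx 3.19$ reported in the text; other values of $p$ would need a different bracket.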
3.2 Some estimates for the optimal strategy
In this subsection we establish bounds for the size $m_1$ of the first pool and the number $k$ of stages under an optimal strategy. We then show that, under some assumptions, the cost of a strategy may be lowered by adding one stage and reducing the ratio between two consecutive pool sizes. As a consequence of these results, we obtain a bound on the ratio of consecutive pool sizes of the optimal strategy.
For $k\in\mathbb{N}$, let
$$\mathbb{R}^{k,\downarrow}_{\ge 0} := \Bigl\{m=(m_1,\dots,m_k)\in\mathbb{R}^k:\ m_i\ge m_{i+1}\ge 1\ \text{and}\ \frac{m_i}{m_{i+1}}\ge 2,\ 1\le i\le k-1\Bigr\}. \tag{33}$$
Note that the constraints in this definition are naturally satisfied when the vector $m\in\mathbb{N}^k$ defines a nested strategy, while a generic $m\in\mathbb{R}^{k,\downarrow}_{\ge 0}$ will not be associated to a feasible nested strategy if any of its coordinates belongs to $\mathbb{R}_{\ge 0}\setminus\mathbb{N}$, or if any of the entries $m_i$ fails to be a multiple of the next coordinate $m_{i+1}$, $1\le i\le k-1$. For the rest of this section we consider the cost functions as mappings $D_k:\mathbb{R}^{k,\downarrow}_{\ge 0}\to\mathbb{R}$. The estimates in Lemmas 2 and 3 below will be useful in §3.4 to determine feasible and economical nested strategies.
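Lemma 2 (Bounds on the first pool size $m_1$ and stage number $k$). Let $p\in[0,1]$, and let $k^\star\in\mathbb{N}$ and $m^\star=(m^\star_1,\dots,m^\star_{k^\star})$ be minimizers of the cost function $D_k(m,p)$,
$$k^\star = \arg\min_{k\in\mathbb{N}}\ \min_{m\in\mathbb{R}^{k,\downarrow}_{\ge 0}} D_k(m,p),\qquad m^\star = \arg\min_{m\in\mathbb{R}^{k^\star,\downarrow}_{\ge 0}} D_{k^\star}(m,p). \tag{34}$$
Then
$$m^\star_1\le\frac{1}{\log(1/q)}\quad\text{and}\quad k^\star\le 1+\frac{|\log\log(1/q)|}{\log 2}. \tag{35}$$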
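Proof. The second inequality in (35) follows from the first. Indeed, by the conditions in (33),
$$1\le m^\star_{k^\star}\le\frac{1}{2^{k^\star-1}}\,m^\star_1\ \Longrightarrow\ \frac{\log(m^\star_1)}{\log 2}\ge k^\star-1\ \Longrightarrow\ k^\star\le 1+\frac{|\log\log(1/q)|}{\log 2}.$$
Now we show the first inequality. By optimality, $D_{k^\star}(m^\star,p)\le D_{k^\star-1}\bigl((m^\star_2,\dots,m^\star_{k^\star}),p\bigr)$, that is,
$$\frac{1}{m^\star_1}+\frac{1-q^{m^\star_1}}{m^\star_2}\le\frac{1}{m^\star_2},$$
which implies $m^\star_2\le m^\star_1\,q^{m^\star_1}$ and
$$m^\star_2\le\max_{x\ge 0}\,x\,q^x=\frac{e^{-1}}{\log(1/q)}. \tag{36}$$
Fix $(m^\star_2,\dots,m^\star_{k^\star})$ and minimize $D_{k^\star}\bigl((x,m^\star_2,\dots,m^\star_{k^\star}),p\bigr)$ over $x$. We find that the critical point $\widehat{m}_1$ satisfies
$$\widehat{m}_1=\frac{2}{\log q}\,W_0\Bigl(-\frac{\sqrt{\log(1/q)\,m^\star_2}}{2}\Bigr),$$
with $W_0$ the principal branch of the Lambert function. From (36) we get
$$-\frac{\sqrt{\log(1/q)\,m^\star_2}}{2}\ge-\frac{e^{-1/2}}{2}\ \Longrightarrow\ W_0\Bigl(-\frac{\sqrt{\log(1/q)\,m^\star_2}}{2}\Bigr)\ge-\frac12\quad\text{and}\quad\widehat{m}_1\le\frac{1}{\log(1/q)}.$$
It is easy to check that $\widehat{m}_1\ge 2m^\star_2$, hence by the optimality of $m^\star\in\mathbb{R}^{k^\star,\downarrow}_{\ge 0}$ we conclude that $\widehat{m}_1=m^\star_1$.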
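Lemma 3 (Bounded ratio between consecutive pool sizes). Let $k\in\mathbb{N}$ and $m=(m_1,\dots,m_k)\in\mathbb{R}^{k,\downarrow}_{\ge 0}$ such that
i) $m_1\le\frac{1}{\log(1/q)}$,
ii) $m_{i-1}=\ell\,m_i$, for some $\ell\ge 6$ and $i\in\{2,\dots,k\}$.
Let
$$m'=(m_1,\dots,\ell m_i,\,3m_i,\,m_i,\dots,m_k)\in\mathbb{R}^{k+1,\downarrow}_{\ge 0}.$$
Then
$$D_{k+1}(m',p)\le D_k(m,p). \tag{37}$$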
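Proof. From (19) we get
$$D_{k+1}(m',p)-D_k(m,p)=\frac{1-q^{\ell m_i}}{3m_i}+\frac{1-q^{3m_i}}{m_i}-\frac{1-q^{\ell m_i}}{m_i}\le 0$$
if and only if
$$1+2q^{\ell m_i}-3q^{3m_i}\le 0\iff x=q^{m_i}\ \text{satisfies}\ 1+2x^{\ell}-3x^3\le 0.$$
For $\ell=6$ we have
$$2x^6-3x^3+1=0,\ x\in\mathbb{R}\iff x=1\ \text{or}\ x=\frac{1}{\sqrt[3]{2}}$$
and
$$1+2q^{6m_i}-3q^{3m_i}\le 0\iff\frac{1}{\sqrt[3]{2}}\le q^{m_i}\le 1.$$
In order to show that this last inequality holds under the hypotheses of the lemma, note that $q^{m_i}\ge q^{\frac{1}{\ell\log(1/q)}}=e^{-1/\ell}\ge e^{-1/6}$ by i) and ii), so that
$$q^{m_i}\ge e^{-1/6}\ge\frac{1}{\sqrt[3]{2}}, \tag{38}$$
as wanted.
To prove the lemma for $\ell>6$, write
$$2x^{\ell}-3x^3+1=(x-1)\bigl(2x^{\ell-1}+\dots+2x^3-x^2-x-1\bigr)\le(x-1)\bigl(2x^5+2x^4+2x^3-x^2-x-1\bigr)=2x^6-3x^3+1\le 0,\qquad x\in\Bigl[\frac{1}{\sqrt[3]{2}},1\Bigr],$$
where the first inequality holds because $x-1\le 0$ and the discarded terms $2x^{\ell-1}+\dots+2x^6$ are nonnegative. The result follows from (38).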
Corollary 4 (Bounded ratio between consecutive pool sizes in $m^\star$). Let $p\in[0,1]$, and let $k^\star\in\mathbb{N}$ and $m^\star=(m^\star_1,\dots,m^\star_{k^\star})$ be minimizers of the cost function $D_k(m,p)$ as in (34). Then
$$m^\star_{i+1}>\frac16\,m^\star_i,\qquad 1\le i\le k^\star-1. \tag{39}$$
Proof. The conditions in Lemma 3 are satisfied by $m^\star$ by Lemma 2.
3.3 Linearization of the cost function
The exact computation of the optimal strategy becomes complicated as the number of stages increases. In this subsection we study the linearized version of the cost, which is easier to optimize and gives a good approximation to the cost for small $p$.
Let us fix $p$ and a stage number $k+1$. We linearize the expected number of tests per individual $D_k=D_k(m,p)$ obtained in (19):
$$D_k=\frac{1}{m_1}+1-e^{m_k\log q}+\sum_{j=2}^{k}\frac{1}{m_j}\bigl(1-e^{m_{j-1}\log q}\bigr)=L_k+\text{error}, \tag{40}$$
where the linear approximation $L_k=L_k(m,p)$ is given by
$$L_k:=\frac{1}{m_1}+m_k\,p+p\sum_{j=2}^{k}\frac{m_{j-1}}{m_j}. \tag{41}$$
The linearized cost $L_k$ coincides with the cost proposed by Finucan [10], who assumed that for suitable $p$ and $m_1$ there is at most one infected individual per pool at all stages; we give some details after Lemma 7.
In the next lemma we show that the cost is bounded above by the linearized cost, and provide an estimate for the difference. The result is proved in Appendix B.
Lemma 5 (Domination and error bounds). Let $p\in[0,\frac12]$. Let $m=(m_1,\dots,m_k)\in\mathbb{R}^k_{\ge 1}$. Then
1. The linearized cost is an upper bound to the cost,
$$D_k(m,p)\le L_k(m,p). \tag{42}$$
2. If $\frac{m_{i-1}}{m_i}\le\ell$ for $2\le i\le k+1$, with $m_{k+1}:=1$, then
$$|D_k(m,p)-L_k(m,p)|\le m_1\,\ell\,\log^2 q+\ell\,k\,p^2. \tag{43}$$
In particular, when $\frac{m_j}{m_{j+1}}=m_k$, $1\le j\le k-1$, equation (43) becomes
$$|D_k(m,p)-L_k(m,p)|\le m_k^{k+1}\log^2 q+m_k\,k\,p^2. \tag{44}$$
Define the optimal values for $L_k$ by
$$m^\sharp(k)=\bigl(m^\sharp_1(k),\dots,m^\sharp_k(k)\bigr):=\arg\min_{(m_1,\dots,m_k)\in\mathbb{R}^k_+}L_k\quad\text{and} \tag{45}$$
$$L^\sharp_k:=L_k\bigl(m^\sharp(k),p\bigr). \tag{46}$$
In the next two lemmas we compute the optimal linearized values, see also [10].
Lemma 6 (Optimal pool sizes). Let $p\in(0,1)$ and $k\in\mathbb{N}$, $k\ge 2$. Then
$$m^\sharp_j(k)=p^{-\frac{k-j+1}{k+1}},\qquad 1\le j\le k, \tag{47}$$
$$L^\sharp_k=(k+1)\,p^{\frac{k}{k+1}}. \tag{48}$$
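Proof. For $k\ge 3$ we get
$$\frac{\partial L_k}{\partial m_1}=-\frac{1}{m_1^2}+\frac{p}{m_2}, \tag{49}$$
$$\frac{\partial L_k}{\partial m_i}=-\frac{p\,m_{i-1}}{m_i^2}+\frac{p}{m_{i+1}},\qquad 2\le i\le k-1, \tag{50}$$
$$\frac{\partial L_k}{\partial m_k}=-\frac{p\,m_{k-1}}{m_k^2}+p. \tag{51}$$
In order to find critical points we look for values of $m_j$, $1\le j\le k$, where these derivatives vanish. We get
$$\frac{\partial L_k}{\partial m_k}=0\iff m_{k-1}=m_k^2\quad\text{from (51)}, \tag{52}$$
$$\frac{\partial L_k}{\partial m_i}=0\iff m_{i-1}=\frac{m_i^2}{m_{i+1}}\quad\text{for }2\le i\le k-1, \tag{53}$$
$$\frac{\partial L_k}{\partial m_1}=0\iff m_2=p\,m_1^2. \tag{54}$$
Given $m_k$ we use (52) and (53) to solve backwards in the index $i$, and we get
$$m_{k-1}=m_k^2,\qquad m_{k-2}=\frac{m_{k-1}^2}{m_k}=m_k^3,\qquad m_{k-3}=\frac{m_{k-2}^2}{m_{k-1}}=m_k^4$$
and, in general,
$$m_{k-j}=m_k^{j+1},\qquad 0\le j\le k-1. \tag{55}$$
We replace the values of $m_1$ and $m_2$ in (54) to obtain the equation
$$m_k^{k-1}=p\,m_k^{2k}\iff m_k=p^{-\frac{1}{k+1}}, \tag{56}$$
from where we get (47). We show in Appendix C that the Hessian matrix of $L_k$ evaluated at the critical point $\bigl(m^\sharp_1(k),\dots,m^\sharp_k(k)\bigr)$ is positive definite, and hence this is a minimum of $L_k$.
Substituting (47) in (41) yields $L^\sharp_k=p^{\frac{k}{k+1}}+k\,p\,p^{-\frac{1}{k+1}}=(k+1)\,p^{\frac{k}{k+1}}$.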
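We now optimize $L^\sharp_k$ as a function of $k$. Denote
$$L^\sharp=L^\sharp(p):=\min_{k\in\mathbb{R}_+}L^\sharp_k;\qquad k^\sharp:=\arg\min_{k\in\mathbb{R}_+}L^\sharp_k. \tag{57}$$
In general $k^\sharp\in\mathbb{R}\setminus\mathbb{N}$. Notice that when $k$ is not a positive integer it is not possible to define a vector $(m_1,\dots,m_k)$ at which to evaluate $L_k$.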
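Lemma 7 (Optimal number of stages). For any $p\in(0,1)$ we have
$$k^\sharp=\log\frac1p-1,\qquad L^\sharp=e\,p\log(1/p). \tag{58}$$
Furthermore, if $p=e^{-u}$ for some integer $u\ge 2$, then $k^\sharp=u-1$ and
$$L^\sharp=L_{k^\sharp}(m^\sharp,p),\quad\text{where } m^\sharp=(e^{u-1},\dots,e). \tag{59}$$
Proof of Lemma 7. We compute the derivative
$$\frac{\partial L^\sharp_k}{\partial k}=p^{1-\frac{1}{k+1}}\Bigl[1+\frac{\log p}{1+k}\Bigr],$$
which vanishes at $k=k^\sharp=\log\frac1p-1$. This is in fact a global minimum of $L^\sharp_k$, for a given value of $p$. We now replace this value in $L^\sharp_k$ to get
$$L^\sharp=p^{1-\frac{1}{\log(1/p)}}\log\frac1p=e\,p\log(1/p).$$
Under $p=e^{-u}$ we have $k^\sharp=u-1$, which replaced in (47) gives
$$m^\sharp_{k^\sharp}=e\quad\text{and}\quad m^\sharp_j=e^{k^\sharp-j+1}.$$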
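The closed forms of Lemmas 6 and 7 are easy to check numerically. The sketch below (function names are illustrative) evaluates the linearized cost (41), the optimal sizes (47) and the minimum (48), and compares them with $e\,p\log(1/p)$ for $p=e^{-3}$, where $k^\sharp=2$ is an integer.

```python
import math

def linearized_cost(m, p):
    """L_k(m, p) from equation (41)."""
    cost = 1.0 / m[0] + m[-1] * p
    for j in range(1, len(m)):
        cost += p * m[j - 1] / m[j]
    return cost

def optimal_linearized(p, k):
    """Pool sizes (47) and minimum value (48) of L_k for a fixed integer k >= 2."""
    sizes = [p ** (-(k - j + 1) / (k + 1)) for j in range(1, k + 1)]
    return sizes, (k + 1) * p ** (k / (k + 1))

u = 3
p = math.exp(-u)
sizes, l_min = optimal_linearized(p, u - 1)  # k_sharp = u - 1
print("optimal sizes:", sizes)               # expected to be (e**2, e)
print("L_k at the optimum:", linearized_cost(sizes, p), "closed form (48):", l_min)
print("e * p * log(1/p):", math.e * p * math.log(1 / p))
print("linearized cost of (9, 3):", linearized_cost((9, 3), p))
```

The last line evaluates the feasible strategy $(9,3)$, whose linearized cost is slightly above the real-valued minimum, as in the $u=3$ row of Table 2.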
Remark. Finucan [10] proposes to iterate Dorfman's strategy with not necessarily nested pools, under the assumption that at every stage each pool has at most one infected individual. Call $U$ the number of infected individuals in a population of size $N$; $U$ has Binomial$(N,p)$ distribution. The number of individuals to be tested in the $i$-th stage is $U m_{i-1}$, and the total number of tests is
$$\frac{N}{m_1}+U\frac{m_1}{m_2}+U\frac{m_2}{m_3}+\dots+U\frac{m_{k-1}}{m_k}+U m_k. \tag{60}$$
Dividing by $N$ and taking expectation, Finucan gets the linearized cost $L_k(m,p)$ defined in (41) and derives the results of Lemmas 6 and 7. He also shows that these optimal values maximize the information gain per test in the case that there is at most one infected individual per pool.
However, the hypothesis that there is at most one infected individual per pool is not satisfied for the optimal values (47). Indeed, when $m_1\approx 1/p$, the number of infected individuals per pool is approximately Poisson with mean 1. In any case, Finucan's cost provides an upper bound to the true cost of the strategy $(k,m)$, as it in fact computes the number of tests in the worst case scenario; this is proved rigorously in Lemma 5. This result can also be derived using an information-based approach, since the least informative case is that in which the infected samples are as uniformly distributed as possible which, in the case of interest here, corresponds to having at most one infected individual per pool at all stages.
3.4 The strategy $(3^{k_3},\dots,3)$
In this subsection we compute the $k$ that minimizes the cost $D_k\bigl((\mu^k,\dots,\mu),p\bigr)$ for a given $\mu\ge 0$, and prove in Theorem 9 that this optimal choice when $\mu=3$ leads to a cost of the same order as the minimum possible cost associated to $p$,
$$D^\star(p):=\min_{k\in\mathbb{N}}\ \min_{m\in\mathbb{R}^{k,\downarrow}_{\ge 0}}D_k(m,p). \tag{61}$$
The choice of pool sizes is inspired by the optimal results for the linearization of the cost obtained in the previous subsection.
Lemma 8. Among strategies $m=(\mu^k,\dots,\mu)$, the optimizing $k$ for the cost $D_k(m,p)$ is achieved at $k_\mu(p)$ defined by
$$k_\mu(p):=\Bigl\lfloor\log_\mu\frac{1}{\log_\mu\bigl(1/(1-p)\bigr)}\Bigr\rfloor. \tag{62}$$
Proof. From the definition (19) of $D_k$ we have
$$D_{k+1}(m,p)-D_k(m,p)=\frac{1}{\mu^{k+1}}-\frac{1}{\mu^k}+\frac{1}{\mu^k}\bigl(1-q^{\mu^{k+1}}\bigr)=\frac{1}{\mu^{k+1}}-\frac{1}{\mu^k}\,q^{\mu^{k+1}}.$$
This expression is greater than zero if and only if $\mu^{k+1}>\frac{1}{\log_\mu(1/q)}$, which implies that $k_\mu(p)$ given by (62) minimizes $D_k(m,p)$.
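Formula (62) can be compared with a direct search over $k$. The sketch below (function names and the search range `1..8` are illustrative choices) computes $k_\mu(p)$ and the cost (19) of the strategy $(\mu^k,\dots,\mu)$, and checks that the two approaches pick the same number of stages for $\mu=3$ and $p=0.02$.

```python
import math

def k_mu(p, mu=3):
    """Stage number (62): floor(log_mu(1 / log_mu(1 / (1 - p))))."""
    inner = math.log(1.0 / (1.0 - p), mu)
    return math.floor(math.log(1.0 / inner, mu))

def power_cost(mu, k, p):
    """Cost (19) of the strategy (mu**k, ..., mu)."""
    q = 1.0 - p
    m = [mu ** (k - i) for i in range(k)]  # m[0] = mu**k, ..., m[k-1] = mu
    return (1.0 / m[0] + 1.0 - q ** m[-1]
            + sum((1.0 - q ** m[j - 1]) / m[j] for j in range(1, k)))

p = 0.02
k_formula = k_mu(p)
k_search = min(range(1, 9), key=lambda k: power_cost(3, k, p))
print("k from (62):", k_formula, "k from direct search:", k_search)
print("cost of (27, 9, 3):", power_cost(3, 3, p))
```

For values of $p$ where the argument of the floor in (62) is very close to an integer, floating-point rounding could shift the result by one, so such borderline cases would need extra care.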
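Theorem 9 (Estimation of the cost and accuracy of the strategy $(3^{k_3},\dots,3)$). Let $p\in[0,\frac12]$, $k_3=k_3(p)$ as in (62). Then
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\le\frac{3}{\log 3}\,p\log(1/p)+p+5p^2, \tag{63}$$
and, with $D^\star(p)$ defined in (61),
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)-D^\star(p)\le\Bigl(\frac{3}{\log 3}-e\Bigr)p\log(1/p)+p+15p^2\log(1/p)+180p^3 \tag{64}$$
$$\le 0.013\,p\log(1/p)+p+O\bigl(p^2\log(1/p)\bigr).$$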
Proof. Since the cost of a strategy is bounded by its linearized cost (42) we have
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\le L_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)=\frac{1}{3^{k_3}}+3p\,k_3\quad\text{by (41) and the constant ratio }\frac{m_{j-1}}{m_j}=3. \tag{65}$$
Now
3k3log 3
3
1
log 1/q log 3
3
1
p(1 + p
2)log 3
3
1
p,(66)
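and
$$3p\,k_3\le 3p\log_3\frac{1}{\log_3(1/q)}=-\frac{3p}{\log 3}\bigl[\log\log(1/q)-\log\log 3\bigr]\le\frac{3}{\log 3}\,p\log(1/p)+\frac{3\log\log 3}{\log 3}\,p+5p^2. \tag{67}$$
Apply bounds (66) and (67) to (65) to obtain
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\le\frac{3}{\log 3}\,p\log(1/p)+p+5p^2, \tag{68}$$
which is (63).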
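We next derive a lower bound. Let $k^\star\in\mathbb{N}$ and $m^\star\in\mathbb{R}^{k^\star,\downarrow}_{\ge 0}$ be as in (34), so that $D^\star(p)=D_{k^\star}(m^\star,p)$. By Lemma 5, we have
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\ge D_{k^\star}(m^\star,p)\ge L_{k^\star}(m^\star,p)-m^\star_1\,\ell\,\log^2 q-\ell\,k^\star p^2, \tag{69}$$
where $\ell=\max_{2\le i\le k}\frac{m^\star_{i-1}}{m^\star_i}\le 6$ (Corollary 4), $k^\star\le 1+\frac{|\log\log(1/q)|}{\log 2}$ and $m^\star_1\le\frac{1}{\log(1/q)}$ (Lemma 2). Replace these bounds in (69) and use that by Lemma 7 $L_{k^\star}(m^\star,p)\ge e\,p\log(1/p)$, to get
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\ge D_{k^\star}(m^\star,p)\ge e\,p\log(1/p)-6\log(1/q)\log^2 q-6p^2\Bigl(1+\frac{|\log\log(1/q)|}{\log 2}\Bigr)$$
$$\ge e\,p\log(1/p)-6\log^3(1/q)-6p^2\Bigl(1+\frac{|\log\log(1/q)|}{\log 2}\Bigr)$$
$$\ge e\,p\log(1/p)-15p^2\log(1/p)-180p^3. \tag{70}$$
Combining (68) and (70), we conclude that
$$e\,p\log\bigl(\tfrac1p\bigr)-15p^2\log\bigl(\tfrac1p\bigr)-180p^3\le D_{k^\star}(m^\star,p)\le D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)\le\frac{3}{\log 3}\,p\log\bigl(\tfrac1p\bigr)+p+5p^2, \tag{71}$$
and in particular
$$D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)-D^\star(p)\le\Bigl(\frac{3}{\log 3}-e\Bigr)p\log(1/p)+p+15p^2\log(1/p)+180p^3\le 0.013\,p\log(1/p)+p+O\bigl(p^2\log(1/p)\bigr).$$
The result follows.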
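As a numerical sanity check of the upper bound (63), and of the constant $\frac{3}{\log 3}-e$ appearing in (64), one can evaluate both sides for a few values of $p$. The sketch below is only an illustration; the function names and the sample values of $p$ are choices made here.

```python
import math

def k3(p):
    """k_3(p) from (62) with mu = 3."""
    return math.floor(math.log(1.0 / math.log(1.0 / (1.0 - p), 3), 3))

def cost_powers_of_three(p):
    """Cost (19) of the strategy (3**k3, ..., 3)."""
    q = 1.0 - p
    k = k3(p)
    m = [3 ** (k - i) for i in range(k)]
    return (1.0 / m[0] + 1.0 - q ** m[-1]
            + sum((1.0 - q ** m[j - 1]) / m[j] for j in range(1, k)))

def upper_bound_63(p):
    """Right-hand side of (63)."""
    return 3.0 / math.log(3.0) * p * math.log(1.0 / p) + p + 5.0 * p * p

for p in (0.1, 0.02, 0.002, 0.0001):
    print(p, cost_powers_of_three(p), upper_bound_63(p), math.e * p * math.log(1.0 / p))

gap_constant = 3.0 / math.log(3.0) - math.e
print("3/log(3) - e =", gap_constant)
```

Each printed row lists $p$, the cost of $(3^{k_3},\dots,3)$, the bound (63) and the linearized optimum $e\,p\log(1/p)$ for comparison.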
3.5 Optimizing the cost by inspection
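We report here some results obtained by inspecting the values taken by $D_k$ over the family
$$B_k:=\bigl\{(m_1,\dots,m_k)\in(\mathbb{Z}\cap[2,100])^k:\ m_j=b_j\,m_{j+1}\ \text{for some } b_j\in\mathbb{N},\ j=1,\dots,k-1\bigr\}, \tag{72}$$
$k\in\{1,\dots,5\}$. That is, $B_k$ consists of vectors of $k$ pool sizes in the range $[2,100]\cap\mathbb{N}$ satisfying that each size $m_j$ is a multiple of the next stage size $m_{j+1}$. Denote by
$$m^\circ=\bigl(m^\circ_1(k),\dots,m^\circ_k(k)\bigr):=\arg\min_{(m_1,\dots,m_k)\in B_k}D_k \tag{73}$$
and
$$D^\circ_k:=D_k(m^\circ,p). \tag{74}$$
Optimizing now on $k$, let
$$D^\circ:=\min_{k\in\{1,\dots,5\}}D^\circ_k;\qquad k^\circ:=\arg\min_{k\in\{1,\dots,5\}}D^\circ_k;\qquad m^\circ_j:=m^\circ_j(k^\circ). \tag{75}$$
Consider the standard deviation of the number of tests per person
$$\sigma^\circ(p)=\sigma_{k^\circ}(m^\circ,p):=\frac{1}{m^\circ_1}\sqrt{\mathrm{Var}\,T_{k^\circ}(m^\circ,p)}. \tag{76}$$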
To compute the variance we use (22), or (24) in Lemma 1. We note that the standard deviation is of the same order as $D^\circ(p)$. Hence, if one performs, say, 100 independent realizations of the strategy, we can expect that the average number of tests per individual will be within $3/\sqrt{100}$, that is 30%, of the optimal value $D^\circ(p)$ (with probability .99, by the central limit theorem).
If we assume that $m_1$ does not exceed 100, then the values in (75) are optimal: they minimize the (non-linearized) cost $D$. We observe in Table 1 critical values for several instances of $p$; see Fig. 1 for a graphical representation.
| $p$ | $m^\circ_1$ | $m^\circ_2$ | $m^\circ_j$, $j\ge 3$ | $k^\circ$ | $D^\circ(p)$ | $\sigma^\circ(p)$ | $-\log_3(-\log_3(1-p))$ |
|---|---|---|---|---|---|---|---|
| 0.1 | 9 | 3 | | 2 | 0.5863043 | 0.4027611 | 2.133979 |
| 0.08 | 9 | 3 | | 2 | 0.5083693 | 0.3934454 | 2.346938 |
| 0.06 | 9 | 3 | | 2 | 0.4228622 | 0.3725679 | 2.618467 |
| 0.04 | 12 | 3 | | 2 | 0.3276941 | 0.3145522 | 2.997037 |
| 0.02 | 27 | 9 | 3 | 3 | 0.1979772 | 0.1997479 | 3.637304 |
| 0.01 | $3^4$ | $3^3$ | $3^{4-j+1}$ | 4 | 0.1179085 | 0.1059675 | 4.272842 |
| 0.008 | $3^4$ | $3^3$ | $3^{4-j+1}$ | 4 | 0.09877677 | 0.09875318 | 4.476873 |
| 0.006 | $3^4$ | $3^3$ | $3^{4-j+1}$ | 4 | 0.07876518 | 0.08931578 | 4.739649 |
| 0.004 | $3^5$ | $3^4$ | $3^{5-j+1}$ | 5 | 0.05722486 | 0.04901306 | 5.109633 |
| 0.002 | $3^5$ | $3^4$ | $3^{5-j+1}$ | 5 | 0.03220212 | 0.03821587 | 5.741475 |
| 0.0001 | $3^8$ | $3^7$ | $3^{8-j+1}$ | 8 | 0.002425894 | 0.002686147 | 8.469174 |
| 0.00001 | $3^{10}$ | $3^9$ | $3^{10-j+1}$ | 10 | 0.000305373 | 0.000363323 | 10.565118 |

Table 1: Upper 8 rows: optimal values of $D^\circ$ found by full inspection of $D(m,p)$ for all possible pool sizes $m$ in $B_k$ and $k\in\{1,2,3,4,5\}$. We notice that, except for $p=0.04$, we have $k^\circ=k_3$ and $m^\circ=(3^{k_3},\dots,3)$, see (62). Lower four (gray) rows ($p=.004$ to $p=.00001$): here $D^\circ(p)=D\bigl((3^{k_3},\dots,3),p\bigr)$, the cost of the strategies $(3^{k_3},\dots,3)$ for these values of $p$. We have not proved that those strategies are optimal but still they are feasible and their cost is computable. $D^\circ(p)$ is an upper bound of the optimal cost for these values.
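The inspection over $B_k$ described in (72) to (75) can be reproduced with a short exhaustive search. The sketch below is an illustration, not the authors' code; the function names are hypothetical. It only enumerates ratios $b_j\ge 2$, since repeating a pool size ($b_j=1$) adds a positive term to (19) and cannot lower the cost.

```python
def nested_cost(m, p):
    """Cost D_k(m, p) of equation (19)."""
    q = 1.0 - p
    return (1.0 / m[0] + 1.0 - q ** m[-1]
            + sum((1.0 - q ** m[j - 1]) / m[j] for j in range(1, len(m))))

def strategies(k, m_max=100, smallest=2):
    """Yield nested strategies (m_1, ..., m_k) with smallest <= m_k and m_1 <= m_max."""
    def extend(reversed_sizes):
        # reversed_sizes holds (m_k, m_{k-1}, ...), built from the last stage upwards
        if len(reversed_sizes) == k:
            yield tuple(reversed(reversed_sizes))
            return
        last = reversed_sizes[-1]
        for multiple in range(2 * last, m_max + 1, last):  # ratio b_j >= 2
            yield from extend(reversed_sizes + [multiple])
    for m_k in range(smallest, m_max + 1):
        yield from extend([m_k])

def best_strategy(p, k_max=5, m_max=100):
    """Cheapest nested strategy with at most k_max pooled stages and m_1 <= m_max."""
    candidates = (m for k in range(1, k_max + 1) for m in strategies(k, m_max))
    return min(candidates, key=lambda m: nested_cost(m, p))

for p in (0.04, 0.02):
    best = best_strategy(p)
    print(p, best, nested_cost(best, p))
```

The printed strategies and costs can be compared with the corresponding rows of Table 1.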
Figure 1: $D^\circ$ as a function of $p$ in log-log scale. Colored dots represent the values of $D^\circ$ for each $p$. Filled dots correspond to the true minimum value and empty dots correspond to $D_{k_3}\bigl((3^{k_3},\dots,3),p\bigr)$, at $p=10^{-5}$ and $p=10^{-4}$. The continuous line is the graph of the function $L^\sharp(p)=e\,p\log(1/p)$ obtained in Lemma 7.
3.6 Cost and linearized cost of strategy $(3^{k_3},\dots,3)$
We now consider infection probabilities of the form $p = e^{-u}$ for $u \in \mathbb{N}$ and compare
(i) the linearized cost of the optimal linearized strategy $L^\sharp(p) = L_{k^\sharp}(m^\sharp,p) = u e^{1-u}$, see (58) and (59). The choice $p = e^{-u}$ implies $k^\sharp = u-1 \in \mathbb{N}$, so we can define $m^\sharp = (e^{k^\sharp},\dots,e)$;
(ii) the linearized cost $L_{k_3}(m_3,p)$ of the strategy $m_3 := (3^{k_3},\dots,3)$, where $k_3$ is defined in (62);
(iii) the cost $D_{k_3}(m_3,p)$ of the strategy $m_3 := (3^{k_3},\dots,3)$, where $k_3$ is defined in (62). We concluded by inspection that for $p \ge e^{-5}$ this is the optimal nested strategy in $B_k$, $k \le 5$, defined in (72).
The choice of $p$ is motivated by the requirement that $k^\sharp = \log(1/p) - 1$ belong to $\mathbb{N}$, to be able to define $L_{k^\sharp}(m^\sharp,p)$, see the comment above Lemma 7. In this case $k^\sharp = u-1$. Notice that (42) implies
$$D_{k_3}(m_3,p) \le L_{k_3}(m_3,p). \tag{77}$$
Table 2 shows those values. For $2 \le u \le 5$ the values of $D_{k_3}(m_3,p)$ are optimal in the sets of strategies $B_k$ defined in (72), for $k \le 5$; this was observed by inspection. We include
(iv) the difference $L_{k_3}(m_3,p) - D_{k_3}(m_3,p)$;
(v) the bound (44) for this difference obtained in Lemma 5.
| $u$ | $p = e^{-u}$ | $k^\sharp$ | $k_3$ | $L_{k^\sharp}(m^\sharp,p)$ | $L_{k_3}(m_3,p)$ | $D_{k_3}(m_3,p)$ | $L_{k_3} - D_{k_3}$ | Bound (44) |
|---|---|---|---|---|---|---|---|---|
| 2 | 0.1353353 | 1 | 1 | 0.7357589 | 0.739339183 | 0.686871 | 0.052468183 | 0.245252580231 |
| 3 | 0.04978707 | 2 | 2 | 0.4060058 | 0.4098335 | 0.3759855 | 0.033848 | 0.0852901665983 |
| 4 | 0.01831564 | 3 | 3 | 0.1991483 | 0.2018778 | 0.1857311 | 0.0161467 | 0.0306978149437 |
| 5 | 0.006737947 | 4 | 4 | 0.09157819 | 0.09320104 | 0.08625753 | 0.00694351 | 0.011651778305 |
| 6 | 0.002478752 | 5 | 5 | 0.04042768 | 0.04129651 | 0.03843346 | 0.00286305 | 0.0045824219304 |
| 7 | 0.000911882 | 6 | 6 | 0.01735127 | 0.01778562 | 0.01662603 | 0.00115959 | 0.0018351805189 |
| 8 | 0.000335462 | 7 | 7 | 0.007295056 | 0.007501963 | 0.007036057 | 0.000465906 | 0.0007409542827 |
| 9 | 0.000123409 | 8 | 8 | 0.003019164 | 0.003114251 | 0.002927807 | 0.000186444 | 0.0003001742097 |
| 10 | 4.539993e-05 | 9 | 9 | 0.001234098 | 0.001276603 | 0.001202175 | 0.000074428 | 0.0001217702372 |
| 11 | 1.67017e-05 | 10 | 10 | 0.000499399 | 0.000517986 | 0.000488330 | 2.96557e-05 | 4.9423784149e-05 |
| 12 | 6.144212e-06 | 11 | 11 | 0.000200420 | 0.000201261 | 0.000196608 | 4.6534e-06 | 2.0063981836e-05 |
| 13 | 2.260329e-06 | 12 | 11 | 7.987476e-05 | 8.02359e-05 | 7.8386e-05 | 1.8499e-06 | 2.7153541193e-06 |
| 14 | 8.315287e-07 | 13 | 12 | 3.164461e-05 | 3.181671e-05 | 3.107271e-05 | 7.44e-07 | 1.1024045206e-06 |
| 15 | 3.059023e-07 | 14 | 13 | 1.247293e-05 | 1.255742e-05 | 1.225847e-05 | 2.9895e-07 | 4.4757599204e-07 |
| 16 | 1.125352e-07 | 15 | 14 | 4.894437e-06 | 4.935552e-06 | 4.815546e-06 | 1.20006e-07 | 1.8171748609e-07 |
| 17 | 4.139938e-08 | 16 | 15 | 1.913098e-06 | 1.932664e-06 | 1.88454e-06 | 4.8124e-08 | 7.3778218113e-08 |
| 18 | 1.522998e-08 | 17 | 16 | 7.451888e-07 | 7.542696e-07 | 7.349931e-07 | 1.92765e-08 | 2.9954367138e-08 |
| 19 | 5.602796e-09 | 18 | 17 | 2.893696e-07 | 2.934861e-07 | 2.85774e-07 | 7.7121e-09 | 1.2161645332e-08 |
| 20 | 2.061154e-09 | 19 | 18 | 1.120559e-07 | 1.138835e-07 | 1.10802e-07 | 3.0815e-09 | 4.9376984908e-09 |
Table 2: Comparison of the costs $L_{k^\sharp}(m^\sharp,p)$, $L_{k_3}(m_3,p)$ and $D_{k_3}(m_3,p)$; see (i)-(v) above for definitions. Only strategy $(k_3, m_3)$ corresponds to a practical application. For $2 \le u \le 5$ the values of $D_{k_3}(m_3,p)$ are optimal in the sets of strategies $B_k$ defined in (72), for $k \le 5$.
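A row of Table 2 can be recomputed as follows. This is a sketch under two assumptions: the cost has the form implied by (97), and the linearized cost is $L_k(m,p) = \frac{1}{m_1} + p\sum_{j=2}^{k+1} \frac{m_{j-1}}{m_j}$ with $m_{k+1} = 1$. The value of $k_3$ is passed in by hand from the table, since its definition (62) is not reproduced here; all function names are illustrative.

```python
import math

def cost(m, p):
    """Assumed D_k(m, p) = 1/m_1 + sum_{j=2}^{k+1} (1 - q^{m_{j-1}}) / m_j, m_{k+1} = 1."""
    q = 1.0 - p
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + sum((1.0 - q ** prev) / cur
                                for prev, cur in zip(sizes, sizes[1:]))

def linearized_cost(m, p):
    """Assumed L_k(m, p) = 1/m_1 + p * sum_{j=2}^{k+1} m_{j-1} / m_j, m_{k+1} = 1."""
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + p * sum(prev / cur for prev, cur in zip(sizes, sizes[1:]))

def powers_of_three(k):
    """The strategy (3^k, ..., 3)."""
    return tuple(3 ** j for j in range(k, 0, -1))

def table2_row(u, k3):
    """Return (p, L_sharp, L_k3, D_k3, L_k3 - D_k3) for p = e^{-u}; k3 is read from Table 2."""
    p = math.exp(-u)
    m3 = powers_of_three(k3)
    l_sharp = u * math.exp(1 - u)      # closed form quoted in item (i)
    l3 = linearized_cost(m3, p)
    d3 = cost(m3, p)
    return p, l_sharp, l3, d3, l3 - d3

print(table2_row(4, 3))
```

Evaluating `linearized_cost` at the real-valued strategy $(e^{u-1},\dots,e)$ gives $e^{1-u} + (u-1)e^{1-u} = u e^{1-u}$, which agrees with the closed form in item (i).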
3.7 Conjecturing the optimal strategy
After plotting the cost of different strategies we have observed that, depending on $p$, the cost is minimized by strategies $(3^k,\dots,3)$ or $(3^{k-1}4,3^{k-1},\dots,3)$. The transition from one of these strategies to another occurs at points $\lambda_k$ and $\rho_k$, where $\rho_1 := 1 - 3^{-1/3}$ (the solution of $D_1(3,p) = 1$) and, for $k \ge 1$,
$$\lambda_k := \text{solution } p \text{ in } [0,\rho_k) \text{ of } D_k((3^k,\dots,3),p) = D_k((3^{k-1}4,3^{k-1},\dots,3),p); \tag{78}$$
$$\rho_{k+1} := \text{solution } p \text{ in } [0,\lambda_k) \text{ of } D_{k+1}((3^{k+1},\dots,3),p) = D_k((3^{k-1}4,3^{k-1},\dots,3),p). \tag{79}$$
The solution of each equation is unique in the corresponding interval. Denote the space of nested feasible strategies of size $k$ by
$$\mathcal{N}_k := \bigl\{ m = (m_1,\dots,m_k) \in \mathbb{N}^k : m_j = b_j m_{j+1} \text{ for some } b_j \in \mathbb{N},\ j = 1,\dots,k-1 \bigr\}, \tag{80}$$
and let
$$D^{\mathrm{opt}}(p) := \min_{k \in \mathbb{N}} \min_{m \in \mathcal{N}_k} D_k(m,p). \tag{81}$$
Conjecture 1. The optimal cost $D^{\mathrm{opt}}(p)$ is realized by the strategy $k = 1$, $m = 1$ (no pooling) if $p \ge \rho_1$, and
$$D^{\mathrm{opt}}(p) = \begin{cases} D_k((3^k,\dots,3),p) & \text{if } \lambda_k \le p \le \rho_k,\\ D_k((3^{k-1}4,3^{k-1},\dots,3),p) & \text{if } \rho_{k+1} \le p \le \lambda_k. \end{cases} \tag{82}$$
This conjecture has been checked by J. M. Martinez for one million values of $p$ chosen uniformly in the interval $[0,0.31]$, using dynamic programming [15].
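For a handful of values of $p$ the conjecture can be spot-checked by exhaustive search. The sketch below is not the dynamic programming of [15]; it is a plain enumeration of the nested strategies in (80) with first pool size at most 100. It assumes the cost formula implied by (97), treats "no pooling" as cost 1, and only considers multipliers $b_j \ge 2$ and $m_k \ge 2$, since a repeated pool size would only add a redundant stage. All names are illustrative.

```python
def cost(m, p):
    """Assumed D_k(m, p) = 1/m_1 + sum_{j=2}^{k+1} (1 - q^{m_{j-1}}) / m_j, m_{k+1} = 1."""
    q = 1.0 - p
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + sum((1.0 - q ** prev) / cur
                                for prev, cur in zip(sizes, sizes[1:]))

def nested_strategies(max_first_pool):
    """Yield all tuples (m_1, ..., m_k) with m_j = b_j * m_{j+1}, b_j >= 2, m_k >= 2."""
    def extend(chain):
        yield tuple(chain)
        b = 2
        while b * chain[0] <= max_first_pool:
            # prepend a larger pool that is a multiple of the current first pool
            yield from extend([b * chain[0]] + chain)
            b += 1
    for last in range(2, max_first_pool + 1):
        yield from extend([last])

def best_nested(p, max_first_pool=100):
    """Return the cheapest nested strategy and its cost; (1,) stands for no pooling."""
    best_m, best_cost = (1,), 1.0      # individual testing costs one test per person
    for m in nested_strategies(max_first_pool):
        c = cost(m, p)
        if c < best_cost:
            best_m, best_cost = m, c
    return best_m, best_cost

for p in (0.1, 0.04, 0.02):
    print(p, best_nested(p))
```

The search bound of 100 on $m_1$ is an arbitrary choice made to keep the enumeration small; for smaller $p$ it would have to be raised.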
Computation of $\rho_k$, $\lambda_k$. Recalling $q = 1-p$, we have
$$D_k((3^k,\dots,3),p) - D_k((3^{k-1}4,3^{k-1},\dots,3),p) = 0$$
if and only if $x = q^{3^{k-1}}$ satisfies
$$1 + 12x^4 - 12x^3 = 0, \tag{83}$$
and
$$D_{k+1}((3^{k+1},\dots,3),p) - D_k((3^{k-1}4,3^{k-1},\dots,3),p) = 0$$
if and only if $y = q^{3^{k-1}}$ satisfies
$$12y^9 + 36y^3 - 36y^4 - 7 = 0. \tag{84}$$
Using Mathematica [13], we find that the relevant solutions of (83) and (84) are $x \approx 0.876057169753174$ and $y \approx 0.89015230755103751611$. Using $\lambda_k = 1 - x^{1/3^{k-1}}$ and $\rho_{k+1} = 1 - y^{1/3^{k-1}}$, we list some of those transition points in Table 3.
| $k$ | $\lambda_k$ | $\rho_k$ |
|---|---|---|
| 1 | 0.12394283024682595 | 0.30663872564936529515 |
| 2 | 0.043149364977271065 | 0.10984769244896248389 |
| 3 | 0.0145951023362655970 | 0.03804496086305981708 |
| 4 | 0.004888896470446 | 0.01284596585087308883 |
| 5 | 0.00163229509440177 | 0.004300456028242352465 |
| 6 | 0.0005443946765845142 | 0.001435545146496115234 |
| 7 | 0.00018149783166476752 | 0.0004787442082735720083 |
| 8 | 0.00006050293775328175 | 0.0001596068757573522991 |

Table 3: Transition points $\rho_k$ and $\lambda_k$ defined in (78) and (79).
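The transition points can also be obtained without the polynomial reduction, by solving the defining equations (78) and (79) numerically. The sketch below uses plain bisection on the cost difference; it assumes the cost formula implied by (97), and the helper names are illustrative. Bisection is used here only because each equation has a single root in its interval; it is not the method used by the authors, who report using Mathematica.

```python
def cost(m, p):
    """Assumed D_k(m, p) = 1/m_1 + sum_{j=2}^{k+1} (1 - q^{m_{j-1}}) / m_j, m_{k+1} = 1."""
    q = 1.0 - p
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + sum((1.0 - q ** prev) / cur
                                for prev, cur in zip(sizes, sizes[1:]))

def strategy_a(k):
    """(3^k, ..., 3)."""
    return tuple(3 ** j for j in range(k, 0, -1))

def strategy_b(k):
    """(4 * 3^(k-1), 3^(k-1), ..., 3)."""
    return (4 * 3 ** (k - 1),) + strategy_a(k - 1)

def bisect_root(f, lo, hi, iterations=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    sign_lo = f(lo) > 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def transition_points(k_max):
    """Return dictionaries lam[k], rho[k] following (78) and (79)."""
    rho = {1: 1.0 - 3.0 ** (-1.0 / 3.0)}
    lam = {}
    for k in range(1, k_max + 1):
        a, b, a_next = strategy_a(k), strategy_b(k), strategy_a(k + 1)
        lam[k] = bisect_root(lambda p: cost(a, p) - cost(b, p), 0.0, rho[k])
        rho[k + 1] = bisect_root(lambda p: cost(a_next, p) - cost(b, p), 0.0, lam[k])
    return lam, rho

lam, rho = transition_points(4)
for k in sorted(lam):
    print(k, lam[k], rho[k])
```

A quick consistency check is that $(1-\lambda_k)^{3^{k-1}}$ takes the same value for every $k$ and satisfies (83).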
We plot the strategies of Conjecture 1 for $1 \le k \le 7$ in Figs. 2 and 3.
Figure 2: Strategy cost as a function of $p$. Each line represents one of the strategies of Conjecture 1. The conjecture states that the optimal cost at $p$ is realized by the minimum of these curves at $p$. Dotted lines are strategies $(3^k,\dots,3)$, optimal for $p \in [\lambda_k,\rho_k]$, and full lines are strategies $(3^{k-1}4,3^{k-1},\dots,3)$, optimal for $p \in [\rho_{k+1},\lambda_k]$. Plots obtained using desmos.com software; detailed plot at https://www.desmos.com/calculator/tr3bek9hm0.
Figure 3: Details of Fig. 2. The figure on the left shows the transitions at point $\lambda_1 \approx 0.1239$ from strategy $k = 1$, $m = 3$ (black dotted line) to $k = 1$, $m = 4$ (black continuous line) and then at point $\rho_2 \approx 0.1098$ to strategy $k = 2$, $m = (9,3)$ (blue dotted line). The figure on the right shows the transition at point $\rho_7 \approx 0.0004787$ from strategy $k = 6$, $m = (3^5 4,3^5,\dots,3)$ (green continuous line) to $k = 7$, $m = (3^7,\dots,3)$ (orange dotted line). Plots obtained with desmos.com software.
Finally, the cost of the conjectured optimal strategy as a function of $p$ in log-log scale is shown in Fig. 4.
Figure 4: Cost $D^{\mathrm{opt}}(p)$ (vertical axis) of the conjectured optimal strategy as a function of $p$ (horizontal axis) in log-log scale. Blue segments correspond to $m_1^{\mathrm{opt}} = 4 \times 3^{k^{\mathrm{opt}}-1}$ and grey segments to $m_1^{\mathrm{opt}} = 3^{k^{\mathrm{opt}}}$.
A Computation of the variance
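We compute here $\operatorname{Var} T_k$. We start with $k+1 = 3$. From (15) we get
$$
\begin{aligned}
\operatorname{Var} T_2 &= \frac{m_1^2}{m_2^2}\, q^{m_1}\bigl(1-q^{m_1}\bigr) + m_2^2\,\frac{m_1}{m_2}\, q^{m_2}\bigl(1-q^{m_2}\bigr) && \text{(85)}\\
&\quad + 2 m_2^2 \sum_{i \ne j} \operatorname{Cov}(1-\varphi_i,\, 1-\varphi_j) && \text{(86)}\\
&\quad + 2 m_1 \sum_{i=0}^{\frac{m_1}{m_2}-1} \operatorname{Cov}(1-\varphi,\, 1-\varphi_i). && \text{(87)}
\end{aligned}
$$
The sum in (86) vanishes because $1-\varphi_j$ and $1-\varphi_k$ are independent if $j \ne k$, while
$$
\begin{aligned}
\operatorname{Cov}(1-\varphi,\, 1-\varphi_i) &= E\bigl[(1-\varphi)(1-\varphi_i)\bigr] - E[1-\varphi]\,E[1-\varphi_i]\\
&= E[1-\varphi_i] - \bigl(1-q^{m_1}\bigr)\bigl(1-q^{m_2}\bigr) \qquad \text{by (14)}\\
&= \bigl(1-q^{m_2}\bigr)\, q^{m_1}. && \text{(88)}
\end{aligned}
$$
Replacing in (85) we get
$$\operatorname{Var} T_2 = \frac{m_1^2}{m_2^2}\, q^{m_1}\bigl(1-q^{m_1}\bigr) + m_2 m_1\, q^{m_2}\bigl(1-q^{m_2}\bigr) + 2\,\frac{m_1^2}{m_2}\, q^{m_1}\bigl(1-q^{m_2}\bigr). \tag{89}$$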
The previous argument can be extended to several stages, as long as each pool size is a multiple
of the pool size in the following stage. This is the content of Lemma 1 in §2 which we prove next.
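Proof of Lemma 1. From (17) we get
$$
\begin{aligned}
\operatorname{Var} T_k &= \mu^2 \Bigl\{ \operatorname{Var}(1-\varphi) + \cdots + \sum_{i_1=0}^{\frac{m_1}{m_2}-1} \sum_{i_2=0}^{\frac{m_2}{m_3}-1} \cdots \sum_{i_{k-1}=0}^{\frac{m_{k-1}}{m_k}-1} \operatorname{Var}\bigl(1-\varphi_{i_1 i_2 \dots i_{k-1}}\bigr) \Bigr\} && \text{(90)}\\
&\quad + 2\mu^2 \Bigl\{ \sum_{i=0}^{\frac{m_1}{m_2}-1} \operatorname{Cov}\bigl(1-\varphi,\, 1-\varphi_i\bigr) + \cdots + \sum_{i_1=0}^{\frac{m_1}{m_2}-1} \sum_{i_2=0}^{\frac{m_2}{m_3}-1} \cdots \sum_{i_{k-1}=0}^{\frac{m_{k-1}}{m_k}-1} \operatorname{Cov}\bigl(1-\varphi,\, 1-\varphi_{i_1 i_2 \dots i_{k-1}}\bigr) \Bigr\} && \text{(91)}\\
&\quad + 2\mu^2 \Bigl\{ \sum_{i_1=0}^{\frac{m_1}{m_2}-1} \Bigl[ \sum_{i_2=0}^{\frac{m_2}{m_3}-1} \operatorname{Cov}\bigl(1-\varphi_{i_1},\, 1-\varphi_{i_1 i_2}\bigr) + \cdots + \sum_{i_2=0}^{\frac{m_2}{m_3}-1} \cdots \sum_{i_{k-1}=0}^{\frac{m_{k-1}}{m_k}-1} \operatorname{Cov}\bigl(1-\varphi_{i_1},\, 1-\varphi_{i_1 i_2 \dots i_{k-1}}\bigr) \Bigr] \Bigr\} && \text{(92)}\\
&\quad + \cdots\\
&\quad + 2\mu^2 \Bigl\{ \sum_{i_1=0}^{\frac{m_1}{m_2}-1} \cdots \sum_{i_{k-2}=0}^{\frac{m_{k-2}}{m_{k-1}}-1} \Bigl[ \sum_{i_{k-1}=0}^{\frac{m_{k-1}}{m_k}-1} \operatorname{Cov}\bigl(1-\varphi_{i_1 \dots i_{k-2}},\, 1-\varphi_{i_1 \dots i_{k-2} i_{k-1}}\bigr) \Bigr] \Bigr\}. && \text{(93)}
\end{aligned}
$$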
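The first line (90) follows by adding the variances of each of the sums in (17), and using that terms belonging to the same sum are independent, hence there are no covariance terms arising from each of the individual sums. We then compute the covariances between the different sums, and we take advantage of the fact that if $l < n$, then $1-\varphi_{i_1 \dots i_l}$ and $1-\varphi_{j_1 \dots j_n}$ are independent unless $i_r = j_r$ for all $1 \le r \le l$, and in this case
$$\operatorname{Cov}\bigl(1-\varphi_{i_1 \dots i_l},\, 1-\varphi_{i_1 \dots i_n}\bigr) = q^{m_{l+1}}\bigl(1-q^{m_{n+1}}\bigr), \tag{94}$$
by a computation similar to (88). Recall that $1-\varphi_{i_1 \dots i_j} \sim \mathrm{Bernoulli}\bigl(1-q^{m_{j+1}}\bigr)$, hence
$$\operatorname{Var}\bigl(1-\varphi_{i_1 \dots i_j}\bigr) = q^{m_{j+1}}\bigl(1-q^{m_{j+1}}\bigr). \tag{95}$$
Substituting (94) and (95) in the expression for the variance above, we have
$$
\begin{aligned}
\operatorname{Var} T_k &= \mu^2 \Bigl\{ q^{m_1}\bigl(1-q^{m_1}\bigr) + \mu\, q^{m_2}\bigl(1-q^{m_2}\bigr) + \cdots + \mu^{k-1} q^{m_k}\bigl(1-q^{m_k}\bigr) \Bigr\}\\
&\quad + 2\mu^2 \Bigl\{ \mu\, q^{m_1}\bigl(1-q^{m_2}\bigr) + \mu^2 q^{m_1}\bigl(1-q^{m_3}\bigr) + \cdots + \mu^{k-1} q^{m_1}\bigl(1-q^{m_k}\bigr) \Bigr\}\\
&\quad + 2\mu^2 \Bigl\{ \mu^2 q^{m_2}\bigl(1-q^{m_3}\bigr) + \cdots + \mu^{k-1} q^{m_2}\bigl(1-q^{m_k}\bigr) \Bigr\} && \text{(96)}\\
&\quad + \cdots\\
&\quad + 2\mu^2 \Bigl\{ \mu^{k-1} q^{m_{k-1}}\bigl(1-q^{m_k}\bigr) \Bigr\}.
\end{aligned}
$$
If we rewrite (96) by collecting all terms that have a factor $\bigl(1-q^{m_i}\bigr)$, $1 \le i \le k$, we get the expression in (23).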
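As an illustration, (96) can be evaluated for the strategies $(3^k,\dots,3)$ and compared with the $\sigma(p)$ column of Table 1. The sketch below assumes pool sizes $m_j = \mu^{k-j+1}$ with constant ratio $\mu$, and assumes that $\sigma(p)$ is the standard deviation per individual, $\sqrt{\operatorname{Var} T_k}/m_1$; both readings are inferred from the formulas and the tabulated values, and the function names are illustrative.

```python
import math

def variance_tk(k, mu, p):
    """Var T_k from (96), assuming pool sizes m_j = mu^(k-j+1), j = 1, ..., k."""
    q = 1.0 - p
    m = [mu ** (k - j + 1) for j in range(1, k + 1)]    # m[0] = m_1, ..., m[k-1] = m_k
    # first line of (96): variance terms mu^(j-1) q^{m_j} (1 - q^{m_j})
    total = sum(mu ** (j - 1) * q ** m[j - 1] * (1 - q ** m[j - 1])
                for j in range(1, k + 1))
    # remaining lines: covariance terms 2 mu^(n-1) q^{m_l} (1 - q^{m_n}) for l < n
    for l in range(1, k + 1):
        for n in range(l + 1, k + 1):
            total += 2 * mu ** (n - 1) * q ** m[l - 1] * (1 - q ** m[n - 1])
    return mu ** 2 * total

def sigma_per_individual(k, mu, p):
    """Assumed definition: standard deviation of T_k divided by the first pool size."""
    return math.sqrt(variance_tk(k, mu, p)) / mu ** k

print(sigma_per_individual(3, 3, 0.02))   # strategy (27, 9, 3)
print(sigma_per_individual(2, 3, 0.1))    # strategy (9, 3)
```

For $k = 2$ the function reduces to (89) with $m_1 = \mu^2$ and $m_2 = \mu$.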
B Error in the linear approximation
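Proof of Lemma 5. We have
$$D_k(m,p) - L_k(m,p) = 1 - e^{m_k \log q} - m_k p + \sum_{j=2}^{k} \frac{1}{m_j}\Bigl(1 - e^{m_{j-1}\log q} - m_{j-1}\, p\Bigr). \tag{97}$$
To show that the error is nonpositive it suffices to prove that
$$f(p) = 1 - e^{m_j \log q} - m_j p \le 0, \qquad 0 \le p < 1,$$
for any given $1 \le j \le k$. We have $f(0) = 0$ and
$$f'(p) = \frac{m_j}{q}\, e^{m_j \log q} - m_j = m_j\Bigl[\frac{e^{m_j \log q}}{q} - 1\Bigr] = m_j\Bigl[\frac{q^{m_j}}{q} - 1\Bigr] = m_j\bigl[q^{m_j-1} - 1\bigr] \le 0,$$
because $m_j \ge 1$. This implies that $f$ is nonincreasing, so $f(p) \le f(0) = 0$ for all $p$, and item i) in the lemma follows.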
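The decomposition (97) and the sign of the error can be checked numerically. The following sketch assumes the cost and linearized cost formulas implied by (97), namely $D_k(m,p) = \frac{1}{m_1} + \sum_{j=2}^{k+1}\frac{1}{m_j}(1-q^{m_{j-1}})$ and $L_k(m,p) = \frac{1}{m_1} + p\sum_{j=2}^{k+1}\frac{m_{j-1}}{m_j}$ with $m_{k+1} = 1$; the function names are illustrative.

```python
def cost(m, p):
    """Assumed D_k(m, p) = 1/m_1 + sum_{j=2}^{k+1} (1 - q^{m_{j-1}}) / m_j, m_{k+1} = 1."""
    q = 1.0 - p
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + sum((1.0 - q ** prev) / cur
                                for prev, cur in zip(sizes, sizes[1:]))

def linearized_cost(m, p):
    """Assumed L_k(m, p) = 1/m_1 + p * sum_{j=2}^{k+1} m_{j-1} / m_j, m_{k+1} = 1."""
    sizes = list(m) + [1]
    return 1.0 / sizes[0] + p * sum(prev / cur for prev, cur in zip(sizes, sizes[1:]))

def gap_via_97(m, p):
    """Right-hand side of (97), using e^{m log q} = q^m."""
    q = 1.0 - p
    k = len(m)
    total = 1.0 - q ** m[-1] - m[-1] * p              # term with m_k
    for j in range(2, k + 1):
        # m[j - 2] is m_{j-1} and m[j - 1] is m_j in the paper's 1-based indexing
        total += (1.0 - q ** m[j - 2] - m[j - 2] * p) / m[j - 1]
    return total

m = (27, 9, 3)
for p in (0.001, 0.02, 0.1, 0.3):
    print(p, gap_via_97(m, p), cost(m, p) - linearized_cost(m, p))
```

Each term of the sum is of the form $f(p)/m_j$ with $f \le 0$, which is why the printed gaps are never positive.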
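To prove item ii), note that by the inequality $|1 - e^x + x| \le \frac{x^2}{2}$ on $x \le 0$, we have
$$\Bigl| \frac{1 - e^{m_{j-1}\log q}}{m_j} + \frac{m_{j-1}}{m_j}\log q \Bigr| \le \frac{1}{2}\, \frac{m_{j-1}^2 \log^2 q}{m_j}.$$
Denote $m = (m_1,\dots,m_k)$ and recall the notation $m_{k+1} := 1$. Then
$$
\begin{aligned}
\Bigl| D_k(m,p) - \frac{1}{m_1} + m_k \log q + \sum_{j=2}^{k} \frac{m_{j-1}}{m_j}\log q \Bigr| &\le \frac{1}{2} \sum_{j=2}^{k+1} \frac{m_{j-1}^2 \log^2 q}{m_j}\\
&\le \frac{1}{2}\log^2 q \sum_{j=2}^{k+1} \frac{m_1}{2^{j-2}} \qquad \text{using } m_j \le \frac{m_1}{2^{j-1}}\\
&\le m_1 \log^2 q. && \text{(98)}
\end{aligned}
$$
On the other hand
$$\Bigl| L_k(m,p) - \frac{1}{m_1} + m_k \log q + \sum_{j=2}^{k} \frac{m_{j-1}}{m_j}\log q \Bigr| = |p + \log q| \sum_{j=2}^{k+1} \frac{m_{j-1}}{m_j} \le k p^2, \tag{99}$$
where the last line follows from the inequality $|x + \log(1-x)| \le x^2$, $0 \le x \le \frac{1}{2}$. The result follows from (98) and (99).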
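The two elementary inequalities used in this proof, and the first inequality in (98), can be checked on a grid. This is only a numerical sanity check under the assumed cost formula $D_k(m,p) = \frac{1}{m_1} + \sum_{j=2}^{k+1}\frac{1}{m_j}(1-q^{m_{j-1}})$, $m_{k+1} = 1$; it does not test the later steps of (98) or the bound in (99), and the function names are illustrative.

```python
import math

def exp_remainder_ok(x):
    """|1 - e^x + x| <= x^2 / 2 for x <= 0 (small slack for rounding)."""
    return abs(1.0 - math.exp(x) + x) <= x * x / 2.0 + 1e-15

def log_remainder_ok(x):
    """|x + log(1 - x)| <= x^2 for 0 <= x <= 1/2 (small slack for rounding)."""
    return abs(x + math.log(1.0 - x)) <= x * x + 1e-15

def lhs_98(m, p):
    """|D_k - 1/m_1 + log(q) * sum_{j=2}^{k+1} m_{j-1}/m_j| with m_{k+1} = 1."""
    q = 1.0 - p
    sizes = list(m) + [1]
    pooled = sum((1.0 - q ** prev) / cur for prev, cur in zip(sizes, sizes[1:]))
    linear = math.log(q) * sum(prev / cur for prev, cur in zip(sizes, sizes[1:]))
    return abs(pooled + linear)        # pooled equals D_k - 1/m_1

def rhs_98_first(m, p):
    """(1/2) * sum_{j=2}^{k+1} m_{j-1}^2 log^2(q) / m_j."""
    sizes = list(m) + [1]
    log_q = math.log(1.0 - p)
    return 0.5 * log_q ** 2 * sum(prev ** 2 / cur for prev, cur in zip(sizes, sizes[1:]))

print(lhs_98((27, 9, 3), 0.02), rhs_98_first((27, 9, 3), 0.02))
```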
C Positive deﬁnite Hessian matrix
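We prove here that the critical point (47) is indeed a minimum of $L_k$, for given $k$. The Hessian matrix of $L_k$ is a tridiagonal symmetric matrix $H_k$ with entries
$$H_{11} = \frac{2}{m_1^3}, \qquad H_{ii} = \frac{2 p\, m_{i-1}}{m_i^3}, \quad 2 \le i \le k,$$
$$H_{i\,i+1} = H_{i+1\,i} = -\frac{p}{m_{i+1}^2}, \quad 1 \le i \le k-1,$$
and $H_{ij} = 0$ if $|i-j| > 1$.
Let us denote $H^\sharp = H\bigl(m_1^\sharp(k),\dots,m_k^\sharp(k)\bigr)$. To simplify notation, let $\mu := p^{-\frac{1}{k+1}}$, so that $m_j^\sharp(k) = \mu^{k-j+1}$. We have
$$H^\sharp_{ii} = 2\mu\, p^3 \mu^{2i}, \quad 1 \le i \le k,$$
$$H^\sharp_{i\,i+1} = H^\sharp_{i+1\,i} = -p^3 \mu^{2(i+1)}, \quad 1 \le i \le k-1,$$
and $H^\sharp_{ij} = 0$ if $|i-j| > 1$.
Given $x = (x_1,\dots,x_k) \in \mathbb{R}^k$, let us define $y := \bigl(\mu x_1, \mu^2 x_2, \dots, \mu^k x_k\bigr)$. We compute
$$
\begin{aligned}
x^t H^\sharp x &= p^3 \Bigl[ \sum_{i=1}^{k} 2\mu\, \mu^{2i} x_i^2 - 2 \sum_{i=1}^{k-1} \mu^{2(i+1)} x_i x_{i+1} \Bigr]\\
&= p^3 \Bigl[ \sum_{i=1}^{k} 2\mu\, y_i^2 - 2 \sum_{i=1}^{k-1} \mu\, y_i y_{i+1} \Bigr]\\
&= \mu p^3 \Bigl[ y_1^2 + (y_1-y_2)^2 + (y_2-y_3)^2 + \cdots + (y_{k-1}-y_k)^2 + y_k^2 \Bigr] \ge 0,
\end{aligned}
$$
and $x^t H^\sharp x = 0$ if and only if $y_1 = y_2 = \dots = y_k = 0$, or, in terms of the original vector, $x = 0$. We conclude that $H^\sharp$ is positive definite.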
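The identity for the quadratic form can be verified numerically for a particular $k$ and $p$. The sketch below builds the tridiagonal matrix from the entries listed above, evaluated at $m_j^\sharp(k) = \mu^{k-j+1}$, compares $x^t H^\sharp x$ with the sum-of-squares expression for a random vector, and computes the leading principal minors with the standard three-term recurrence for tridiagonal matrices. It uses only the Python standard library; the function names are illustrative.

```python
import math
import random

def critical_hessian(k, p):
    """Hessian entries from the text, evaluated at m_j = mu^(k-j+1), mu = p^(-1/(k+1))."""
    mu = p ** (-1.0 / (k + 1))
    m = [mu ** (k - j + 1) for j in range(1, k + 1)]     # m[0] = m_1, ..., m[k-1] = m_k
    h = [[0.0] * k for _ in range(k)]
    h[0][0] = 2.0 / m[0] ** 3
    for i in range(1, k):
        h[i][i] = 2.0 * p * m[i - 1] / m[i] ** 3
        h[i - 1][i] = h[i][i - 1] = -p / m[i] ** 2
    return h, mu

def quadratic_form(h, x):
    """x^t H x for a dense list-of-lists matrix."""
    k = len(x)
    return sum(h[i][j] * x[i] * x[j] for i in range(k) for j in range(k))

def sum_of_squares(mu, p, x):
    """mu p^3 [y_1^2 + sum (y_i - y_{i+1})^2 + y_k^2] with y_i = mu^i x_i."""
    y = [mu ** (i + 1) * xi for i, xi in enumerate(x)]
    s = y[0] ** 2 + y[-1] ** 2 + sum((a - b) ** 2 for a, b in zip(y, y[1:]))
    return mu * p ** 3 * s

def leading_minors(h):
    """Leading principal minors of a symmetric tridiagonal matrix."""
    minors = [h[0][0]]
    prev2, prev1 = 1.0, h[0][0]
    for i in range(1, len(h)):
        cur = h[i][i] * prev1 - h[i - 1][i] ** 2 * prev2
        minors.append(cur)
        prev2, prev1 = prev1, cur
    return minors

random.seed(0)
h, mu = critical_hessian(3, 0.02)
x = [random.uniform(-1, 1) for _ in range(3)]
print(quadratic_form(h, x), sum_of_squares(mu, 0.02, x), leading_minors(h))
```

Positive leading principal minors give a second confirmation of positive definiteness, by Sylvester's criterion.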
Acknowledgements
We would like to thank Pablo Aguilar, Alejandro Colaneri, Hugo Menzella, Juliana Sesma, Sergio
Chialina for bringing this problem to our attention and encouraging us to study it. P.A.F. would
like to thank Luiz-Rafael Santos for comments and reference suggestions.
References
[1] M. Aldridge, L. Baldassini, and O. Johnson, Group testing algorithms: Bounds and simulations, IEEE Transactions on Information Theory, 60 (2014), pp. 3671–3687.
[2] C. R. Bilder and J. M. Tebbs, Pooled-testing procedures for screening high volume clinical specimens in heterogeneous populations, Statistics in Medicine, 31 (2012), pp. 3261–3268.
[3] M. S. Black, C. R. Bilder, and J. M. Tebbs, Optimal retesting configurations for hierarchical group testing, Journal of the Royal Statistical Society: Series C (Applied Statistics), 64 (2015), pp. 693–710.
[4] C. L. Chan, P. H. Che, S. Jaggi, and V. Saligrama, Non-adaptive probabilistic group testing with noisy measurements: Near-optimal bounds with efficient algorithms, in 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2011, pp. 1832–1839.
[5] A. De Bonis, New combinatorial structures with applications to efficient group testing with inhibitors, Journal of Combinatorial Optimization, 15 (2008), pp. 77–94.
[6] L. Dong, J. Zhou, C. Niu, Q. Wang, Y. Pan, S. Sheng, X. Wang, Y. Zhang, J. Yang, M. Liu, Y. Zhao, X. Zhang, T. Zhu, T. Peng, J. Xie, Y. Gao, D. Wang, Y. Zhao, X. Dai, and X. Fang, Highly accurate and sensitive diagnostic detection of SARS-CoV-2 by digital PCR, medRxiv, (2020).
[7] R. Dorfman, The detection of defective members of large populations, Ann. Math. Statist., 14 (1943), pp. 436–440.
[8] D.-Z. Du and F. K. Hwang, Pooling Designs and Nonadaptive Group Testing, World Scientific, 2006.
[9] W. Feller, An introduction to probability theory and its applications. Vol. I, John Wiley and Sons, 1968.
[10] H. M. Finucan, The blood testing problem, Journal of the Royal Statistical Society. Series C (Applied Statistics), 13 (1964), pp. 43–50.
[11] L. E. Graff and R. Roeloffs, Group testing in the presence of test error; an extension of the Dorfman procedure, Technometrics, 14 (1972), pp. 113–122.
[12] R. Hanel and S. Thurner, Boosting test-efficiency by pooled testing strategies for SARS-CoV-2, 2020.
[13] Wolfram Research, Inc., Mathematica, Version 12.0, Champaign, IL, 2019.
[14] H.-Y. Kim, M. G. Hudgens, J. M. Dreyfuss, D. J. Westreich, and C. D. Pilcher, Comparison of group testing algorithms for case identification in the presence of test error, Biometrics, 63 (2007), pp. 1152–1163.
[15] J. M. Martinez, Personal communication.
[16] C. S. McMahan, J. M. Tebbs, and C. R. Bilder, Informative Dorfman screening, Biometrics, 68 (2012), pp. 287–296.
[17] C. S. McMahan, J. M. Tebbs, and C. R. Bilder, Two-dimensional informative array testing, Biometrics, 68 (2012), pp. 793–804.
[18] C. Mentus, M. Romeo, and C. DiPaola, Analysis and applications of non-adaptive and adaptive group testing methods for COVID-19, medRxiv, (2020).
[19] World Health Organization, Coronavirus disease 2019 (COVID-19) situation report-82, April 11, 2020.
[20] R. M. Phatarfod and A. Sudbury, The use of a square array scheme in blood testing, Statistics in Medicine, 13 (1994), pp. 2337–2343.
[21] N. Sinnott-Armstrong, D. Klein, and B. Hickey, Evaluation of group testing for SARS-CoV-2 RNA, medRxiv, (2020).
[22] M. Sobel and P. A. Groll, Group testing to eliminate efficiently all defectives in a binomial sample, The Bell System Technical Journal, 38 (1959), pp. 1179–1252.
[23] A. Sterrett, On the detection of defective members of large populations, The Annals of Mathematical Statistics, 28 (1957), pp. 1033–1036.
[24] T. Suo, X. Liu, M. Guo, J. Feng, W. Hu, Y. Yang, Q. Zhang, X. Wang, M. Sajid, D. Guo, Z. Huang, L. Deng, T. Chen, F. Liu, K. Xu, Y. Liu, Q. Zhang, Y. Liu, Y. Xiong, G. Guo, Y. Chen, and K. Lan, ddPCR: a more sensitive and accurate tool for SARS-CoV-2 detection in low viral load specimens, medRxiv, (2020).
[25] I. Yelin, N. Aharony, E. Shaer-Tamar, A. Argoetti, E. Messer, D. Berenbaum, E. Shafran, A. Kuzli, N. Gandali, T. Hashimshony, Y. Mandel-Gutfreund, M. Halberthal, Y. Geffen, M. Szwarcwort-Cohen, and R. Kishony, Evaluation of COVID-19 RT-qPCR test in multi-sample pools, medRxiv, (2020).