ArticlePDF Available

A new pooling strategy for high-throughput screening: The Shifted Transversal Design

Authors:

Abstract and Figures

In binary high-throughput screening projects where the goal is the identification of low-frequency events, beyond the obvious issue of efficiency, false positives and false negatives are a major concern. Pooling constitutes a natural solution: it reduces the number of tests, while providing critical duplication of the individual experiments, thereby correcting for experimental noise. The main difficulty consists in designing the pools in a manner that is both efficient and robust: few pools should be necessary to correct the errors and identify the positives, yet the experiment should not be too vulnerable to biological shakiness. For example, some information should still be obtained even if there are slightly more positives or errors than expected. This is known as the group testing problem, or pooling problem. In this paper, we present a new non-adaptive combinatorial pooling design: the "shifted transversal design" (STD). It relies on arithmetics, and rests on two intuitive ideas: minimizing the co-occurrence of objects, and constructing pools of constant-sized intersections. We prove that it allows unambiguous decoding of noisy experimental observations. This design is highly flexible, and can be tailored to function robustly in a wide range of experimental settings (i.e., numbers of objects, fractions of positives, and expected error-rates). Furthermore, we show that our design compares favorably, in terms of efficiency, to the previously described non-adaptive combinatorial pooling designs. This method is currently being validated by field-testing in the context of yeast-two-hybrid interactome mapping, in collaboration with Marc Vidal's lab at the Dana Farber Cancer Institute. Many similar projects could benefit from using the Shifted Transversal Design.
Content may be subject to copyright.
BioMed Central
Page 1 of 13
(page number not for citation purposes)
BMC Bioinformatics
Open Access
Research article
A new pooling strategy for high-throughput screening: the Shifted
Transversal Design
Nicolas Thierry-Mieg*
Address: Laboratoire Logiciels-Systèmes-Réseaux, IMAG Institute, BP53, 38041 Grenoble Cedex 9, France
Email: Nicolas Thierry-Mieg* - Nicolas.Thierry-Mieg@imag.fr
* Corresponding author
Abstract
Background: In binary high-throughput screening projects where the goal is the identification of
low-frequency events, beyond the obvious issue of efficiency, false positives and false negatives are
a major concern. Pooling constitutes a natural solution: it reduces the number of tests, while
providing critical duplication of the individual experiments, thereby correcting for experimental
noise. The main difficulty consists in designing the pools in a manner that is both efficient and
robust: few pools should be necessary to correct the errors and identify the positives, yet the
experiment should not be too vulnerable to biological shakiness. For example, some information
should still be obtained even if there are slightly more positives or errors than expected. This is
known as the group testing problem, or pooling problem.
Results: In this paper, we present a new non-adaptive combinatorial pooling design: the "shifted
transversal design" (STD). It relies on arithmetics, and rests on two intuitive ideas: minimizing the
co-occurrence of objects, and constructing pools of constant-sized intersections. We prove that it
allows unambiguous decoding of noisy experimental observations. This design is highly flexible, and
can be tailored to function robustly in a wide range of experimental settings (i.e., numbers of
objects, fractions of positives, and expected error-rates). Furthermore, we show that our design
compares favorably, in terms of efficiency, to the previously described non-adaptive combinatorial
pooling designs.
Conclusion: This method is currently being validated by field-testing in the context of yeast-two-
hybrid interactome mapping, in collaboration with Marc Vidal's lab at the Dana Farber Cancer
Institute. Many similar projects could benefit from using the Shifted Transversal Design.
Background
With the availability of complete genome sequences, biol-
ogy has entered a new era. Relying on the sequencing data
of genomes, transcriptomes or proteomes, scientists have
been developing high-throughput screening assays and
undertaking a variety of large scale functional genomics
projects. While some projects involve quantitative meas-
urements, others consist in applying a basic yes-or-no test
to a large collection of samples or "objects", – be they
individuals, clones, cells, drugs, nucleic acid fragments,
proteins, peptides... A large class of these binary tests aims
at identifying relatively rare events. The main goal is of
course to obtain information as efficiently and as reliably
as possible. Typically, this is achieved by minimizing the
cost of the basic assay in terms of time and money, and
automating and parallelizing the experiments as much as
Published: 19 January 2006
BMC Bioinformatics2006, 7:28 doi:10.1186/1471-2105-7-28
Received: 17 June 2005
Accepted: 19 January 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/28
© 2006Thierry-Mieg; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0
),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 2 of 13
(page number not for citation purposes)
possible. A major difficulty stems from the fact that high-
throughput biological assays are usually somewhat noisy:
reproducibility is a known problem of microarray analy-
ses, and both false positive and false negative observations
are to be expected in binary type experiments. These
experimental artifacts should be identified and properly
treated. A clean way to deal with the issue consists in
repeating all tests several times, but this is usually prohib-
itively expensive and time-consuming. A more practical
approach, in the case of binary tests, consists in retesting
all positive results obtained in a first round. This strategy
identifies most of the false positives at a reduced cost, but
is powerless with regard to false negatives, leaving us in
need of a better solution.
In the case of binary experiments testing for rare events, an
intuitively appealing strategy consists in pooling the sam-
ples to minimize the number of tests. It requires three
conditions. First, the objects under scrutiny must be avail-
able individually, in a tagged form. For example, a cDNA
library in bulk is not exploitable, but a collection of cDNA
clones or of cloned coding regions, such as the one pro-
duced by the C. elegans ORFeome project [1], is fine. Sec-
ond, it must be possible to test a pool of objects in a single
assay and obtain a positive readout if at least one of the
objects is positive. For example, this is the case when
searching for a specific DNA sequence by PCR in a mixture
of molecules: a product will be amplified if at least one of
the pooled molecules contains the target sequence. Third,
pooling is especially desirable and efficient when the frac-
tion of expected positives is small (at most a few percent).
Under these conditions, pooling strategies can be applied,
and the difficulty then consists in choosing a "good" set of
pools. This being an intuitive but rather vague goal, it
must be formalized. A simple formulation, known as the
group testing problem (or pooling problem), is the fol-
lowing. Consider a set of n events which can be true or
false, represented by n Boolean variables. Let us call
"pool" a subset of variables. We define the value of a pool
as the disjunction (i.e., the logical OR operator) of the var-
iables that it contains. Let us assume that at most t varia-
bles are true. The goal is to build a set of v pools, where v
is small compared to n, such that by testing the values of
the v pools, one can unambiguously determine the values
of the n variables.
If the pools must be specified in a single step, rather than
incrementally by building on the results of previous tests,
the problem is called "non-adaptive". Although adaptive
designs can require fewer tests, non-adaptive pooling
designs are often better suited to high-throughput screen-
ing projects because they allow parallelization and facili-
tate automation of the experiments, and also because the
same pools can be used for all targets, thereby reducing
the total project cost.
The ability to deal with noisy observations is an important
added benefit to using a pooling system, compared to the
classical individual testing strategy. Indeed, noise detec-
tion and correction capabilities are inherent in any pool-
ing system, because each variable is present in several
pools, hence tested many times. Depending on the
expected noise level, the redundancy can be chosen at
will, and simply testing a few more pools than would be
necessary in the absence of noise results in robust error-
correction. It should be noted that minimization of the
number of pools and noise correction are two conflicting
goals: increasing noise tolerance generally requires testing
more pools. Designing a set of pools requires balancing
these two objectives, and finding the right compromise to
suit the application.
Other application-dependent constraints may be
imposed. In particular, the pool sizes are often limited by
the experimental setting. For example, in the context of
the C. elegans protein interaction mapping project led by
Marc Vidal [2,3], it is estimated that, using their high-
throughput two-hybrid protocol, reliable readouts can be
obtained with pools containing 400 AD-Y clones, or per-
haps up to 1000 by tweaking the assay (Marc Vidal, per-
sonal communication).
Many groups have used with some success variants of the
simple "grid" design, which consists in arraying the
objects on a grid and pooling the rows and columns [e.g.
[4-6]]. However, although it is better than no pooling, this
rudimentary design is vulnerable to noise and behaves
poorly when several objects are positive, in addition to
being far from optimal in terms of numbers of tests.
In answer to its shortcomings, more sophisticated error-
correcting pooling designs have been proposed. Some of
these designs are very efficient in terms of numbers of
tests, but lack the robustness and flexibility that most real
biological applications require. Others are more adapta-
ble and noise-tolerant at the expense of performance. In
this paper, we present a new pooling algorithm: the
"shifted transversal design" (STD). This design is highly
flexible: it can be tailored to allow the identification of
any number of positive objects and to deal with important
noise levels. Yet it is extremely efficient in terms of
number of tests, and we show that it compares favorably
to the previously described pooling designs.
The paper is organized as follows. After providing a formal
definition STD, we show that it constitutes an error-cor-
recting solution to the pooling problem. The theoretical
performance of STD is then evaluated and compared with
the main previously described deterministic pooling
designs. Finally, we summarize our results and discuss
future directions.
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 3 of 13
(page number not for citation purposes)
Results (1): the Shifted Transversal Design
Preliminaries
The following notations are used throughout this paper,
in accordance with the notations from [7].
Let n 2, and consider the set = {A
0
,...,A
n-1
} of n
Boolean variables.
We will call "pool" a subset of . We say that a pool is
"true", or "positive", if at least one of its elements is true.
Let us call "layer" a partition of .
Let q be a prime number, with q < n.
We define the "compression power" of q relative to n,
noted Γ(q,n), as the smallest integer γ such that q
γ
+1
n.
We will simply write Γ for Γ(q,n) whenever possible.
Let
σ
q
be the mapping of {0,1}
q
onto itself defined by:
Note that
σ
q
is a cyclic function of order q:
σ
q
q
is the iden-
tity function on {0,1}
q
.
The matrix representation
Any set of pools can be represented by a Boolean matrix,
as follows. Each column corresponds to one variable, and
each row to one pool. The cell (i,j) is true (value 1) if pool
i contains variable j, and false (value 0) otherwise.
Example
Consider the n = 9 variables = {A
0
, A
1
,...,A
8
}. The fol-
lowing matrix defines a set of 3 pools:
The pools are {A
0
,A
3
,A
6
} (defined by the first row),
{A
1
,A
4
,A
7
} (second row), and {A
2
,A
5
,A
8
} (third row). In
fact, this set of pools clearly constitutes a layer.
Definition of STD
A pooling design is a method to construct a set of pools.
When the set of pools can be partitioned into layers (i.e.
subsets which each form a partition of the set of varia-
bles), the pooling design is said to be "transversal". STD is
a transversal pooling design that rearranges the variables
from one layer to the next, with two intuitive goals in
mind. First, the number of pools in which any pair of var-
iables can occur (i.e. the co-occurrence of variables)
should be limited: this is essential for determining the var-
iables' values. The second aim is that the intersections
between pools should be of roughly constant size, in
order to maximize the mutual information obtained by
observing the pools' values and thus increase STD's effi-
ciency.
Given a prime number q with q < n, and k such that k
q+1, STD constructs a set STD(n; q; k) of pools composed
of k layers. When k q, the layers have a uniform con-
struction: they each contain q pools of n/q or (n/q)+1 var-
iables, and are globally interchangeable. In the special
case where k = q+1, the q homogeneous layers are supple-
mented with a singular layer, which has a specific con-
struction and is less regular, yet complements the others
nicely. A formal definition of STD(n; q; k) follows.
For every j {0,...,q}, let M
j
be a q × n Boolean matrix,
defined by its columns C
j,0
,...,C
j,n-1
as follows:
, and i {0,...,n-1}
where:
if j < q, and , where
the semi-bracket denotes the integer part.
Let L(j) be the set of pools of which M
j
is the matrix repre-
sentation. Note that each column C
j,i
has exactly one
occurrence of '1' and (q-1) occurrences of '0'. The index of
the '1' identifies the (single) pool of L(j) which contains
variable A
i
. Therefore, in a given set of pools L(j), each var-
iable is present in exactly one pool, that is to say L(j) con-
stitutes a partition of : L(j) is a layer.
Finally, for k {1,2,...,q+1}, STD(n; q; k) is defined as:
.
n
n
n
∀∈
=
(, , ){,},xx
x
x
x
x
x
x
q
q
q
q
q
1
1
2
1
1
01
σ
q
.
9
M
0
100100100
010010010
001001001
=
C
00
1
0
0
,
=
CC
ji q
sij
,
(,)
,
()
00
si j j
i
q
c
c
c
(, )=⋅
=
0
Γ
siq
i
q
(, )=
Γ
n
STD n; q; k() ()=
=
Lj
j
k
0
1
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 4 of 13
(page number not for citation purposes)
Example
Consider again the variables , and let q = 3 (hence Γ =
1). M
0
is as defined above, and M
1
, M
2
, M
3
are:
The corresponding layers of pools are the following:
Layer 0: L(0) = {{A
0
,A
3
,A
6
}, {A
1
,A
4
,A
7
}, {A
2
,A
5
,A
8
}}
Layer 1: L(1) = {{A
0
,A
5
,A
7
}, {A
1
,A
3
,A
8
}, {A
2
,A
4
,A
6
}}
Layer2: L(2) = {{A
0
,A
4
,A
8
}, {A
1
,A
5
,A
6
}, {A
2
,A
3
,A
7
}}
Layer3: L(3) = {{A
0
,A
1
,A
2
}, {A
3
,A
4
,A
5
}, {A
6
,A
7
,A
8
}}.
STD(9; 3; 2) is the following set of pools: STD(9; 3; 2) =
L(0) L(1).
Remark
The method builds at most q+1 layers: indeed, if we dis-
card the last particular layer L(q) and attempt to extend
the STD construction to any j, it becomes cyclic of order q:
for every j, L(j+q) = L(j).
Results (2): properties of STD
In this section, we establish an important theorem, lead-
ing to three corollaries which show that STD constitutes a
solution to the pooling problem described in the intro-
duction, and that it can be used to detect and correct noisy
observations. We then establish another property of STD,
which is noteworthy albeit not directly related to the pool-
ing problem.
Co-occurrence of variables
So far we have considered the variables that are contained
in a given pool. Dually, we may consider the set of pools
that contain a given variable. For k {1,2,...,q+1}, we will
note pools
k
(i) the set of pools of STD(n; q; k) that contain
variable A
i
:
i {0,...,n-1}, pools
k
(i) = {p STD(n; q; k) | A
i
p}.
Theorem 1
Recall that q is prime.
i
1
, i
2
{0,...,n-1}, [i
1
i
2
] [Card(pools
q+1
(i
1
)
pools
q+1
(i
2
)) Γ(q,n)].
Proof
see Methods section.
Example
Consider again the example n = 9, q = 3, k = 4, for which
the layers L(0), L(1), L(2), and L(3) are known (see
above). The set of pools containing A
0
is: pools
4
(0) =
{{A
0
,A
3
,A
6
},{A
0
,A
5
,A
7
},{A
0
,A
4
,A
8
},{A
0
,A
1
,A
2
}}.
One can easily see that A
0
is present exactly once with each
other variable. In fact, each pair of variables is present in
exactly 1 (= Γ(3,9)) pool, in conformity with theorem 1.
Remark
The property holds a fortiori when k < q+1, i.e. when con-
sidering STD(n; q; k) instead of STD(n; q; q+1).
A solution to the pooling problem
Corollary 1
Let t be an integer such that t·Γ(q,n) q. Let k = t·Γ+1,
and consider the set of pools STD(n; q; k). Suppose that
the value of each pool has been observed, and that there
are at most t positive variables in . Then, in the
absence of noise (i.e., if all pool values are correctly
observed), the value of every variable can be identified.
Proof
Consider the following algorithm, which tags variables as
negative or positive.
Algorithm 1: all the variables present in at least one nega-
tive pool are tagged negative; any variable present in at
least one positive pool where all other variables have been
tagged negative, is tagged positive.
We show that this algorithm correctly identifies the value
of each and every variable.
Let A
i
be a negative variable. A
i
is present in exactly k
pools: one pool in each layer. Theorem 1 asserts that no
variable other than A
i
is present in more than Γ of these
Γ+1 pools. Therefore, since at most t variables are posi-
tive, A
i
is present in at least one pool where no positive
variable is present. Consequently, examination of this
pool yields a negative answer (since all observations are
correct), which leads algorithm 1 to tag A
i
negative. This
shows that every negative variable is correctly tagged as
such.
n
M
M
1
2
100001010
010100001
001010100
100010001
01000
=
=
,
11100
001100010
111000000
000111000
000000111
3
=
,
M
.
n
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 5 of 13
(page number not for citation purposes)
Now let A
i
be a positive variable. Since we suppose here
that there are no observational errors, all pools containing
A
i
are positive: A
i
is not tagged negative. Again according
to theorem 1, no other variable is present in more than Γ
of these t·Γ+1 pools. Therefore, since there are at most t-
1 other positive variables, A
i
is present in at least Γ+1 pos-
itive pools where all other variables are negative, and are
tagged negative according to the above. This shows that
every positive variable is tagged correctly and uniquely.
Finally, since every positive pool must contain at least one
positive variable, and since no positive variable is tagged
negative, we can conclude that no negative variable can be
tagged positive (in addition to its negative tag): every neg-
ative variable is also uniquely tagged. This completes the
corollary's proof.
Example
Consider again our example STD(9; 3; 2) = {{A
0
,A
3
,A
6
},
{A
1
,A
4
,A
7
}, {A
2
,A
5
,A
8
}, {A
0
,A
5
,A
7
}, {A
1
,A
3
,A
8
},
{A
2
,A
4
,A
6
}}.
Let t = 1, and suppose that a single variable in is pos-
itive. For reasons of symmetry, the name of that variable
is inconsequent: all are equivalent. Let us suppose that the
only positive variable is A
8
. Then pools {A
0
,A
3
,A
6
},
{A
1
,A
4
,A
7
}, {A
0
,A
5
,A
7
}, and {A
2
,A
4
,A
6
} are negative,
which shows that variables A
0
, A
1
,...,A
7
are negative; and
pools {A
2
,A
5
,A
8
} and {A
1
,A
3
,A
8
} are positive, which each
prove that A
8
is positive (given that A
2
, A
5
, A
1
and A
3
have
been shown to be negative).
Remark
If more than t variables are positive, this fact is revealed:
clearly, at most n - (t+1) variables are tagged negative,
contrary to when there are at most t positives. In fact, all
tags produced by the above algorithm are still correct, but
some variables may not be tagged at all: these variables are
called "unresolved", or "ambiguous". It would be interest-
ing to know how many ambiguous variables are to be
expected, but this is a very hard problem to study analyti-
cally, particularly when one takes into account experimen-
tal noise. Instead, this issue can be suitably approached by
computer simulation.
Dealing with noise: error correction
As stated in the introduction, pooling designs have an
intrinsic potential for noise-correction, due to the redun-
dancy of variables. In the case of STD, this potential can
be taken advantage of by simply testing a few extra layers
of pools and using a modified algorithm, as shown here.
Corollary 2
Let t and E be integers such that t·Γ(q,n)+2·E q, and let
k = t·Γ+2·E+1. Consider the set of pools STD(n; q; k),
and suppose that the value of each pool has been
observed. Furthermore, suppose that there are at most t
positive variables in , and that there are at most E
observation errors. Then, all errors can be detected and
corrected, and the value of every variable can be identi-
fied.
Proof
Consider the following tagging algorithm.
Algorithm 2: all the variables present in at least E+1 nega-
tive pools are tagged negative; any variable present in at
least E+1 positive pools where all other variables have
been tagged negative, is tagged positive.
The proof is similar to that of corollary 1: we show that
algorithm 2 correctly and uniquely tags every variable. In
this case, theorem 1 shows that each negative variable is
necessarily present in at least 2·E+1 negative pools. Since
there are at most E observation errors, it follows that at
least E+1 of these negative pools are correctly observed.
Therefore, algorithm 2 tags all negative variables as such.
In addition, at most E pools containing a positive variable
can be observed negative; hence no positive variable is
tagged negative. Finally, since at most E pools containing
only negative variables can be observed positive, no nega-
tive variable is tagged positive:every negative variable is
correctly and uniquely tagged.
Conversely, a positive variable A
i
appears in at least
Γ+E+1 positive pools (since there are at most E errors),
of which at most (t-1)·Γ contain at least one other posi-
tive variable (according to theorem 1). Therefore A
i
is
present in at least (t·Γ+E+1) - (t-1)·Γ = Γ+E+1 positive
pools where all other variables are negative. Since these
negative variables have been correctly tagged as such (as
shown above), A
i
is tagged positive. This shows that algo-
rithm 2 also correctly and uniquely tags all positive varia-
bles.
Finally, any observation which is contradictory with the
obtained tagging is necessarily erroneous. In other words,
false negative and false positive observations are identi-
fied.
Remark
Few restrictions are imposed when choosing the value of
the parameter q: it must simply be a prime number
smaller than n. Consequently, STD can be used success-
fully even when very high noise levels are expected, by
picking a large value for q. Of course, as is to be expected
9
n
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 6 of 13
(page number not for citation purposes)
in low signal-to-noise situations, this high corrective
power comes at the price of lower compression perform-
ance, since larger q values mean more pools per layer.
Corollary 2 does not distinguish between the two types of
errors: false positives and false negatives. If we consider
them separately, the corrective power of STD can actually
be improved twofold, as shown below.
Corollary 3
Let t and E be integers such that Γ(q,n)+2·E q, and let
k = t·Γ+2·E+1. Consider the set of pools STD(n; q; k),
and suppose that the value of each pool has been
observed. Furthermore, suppose that there are at most t
positive variables in , and that there are at most E false
positive and E false negative observations. Then, all errors
can be detected and corrected, and the value of every var-
iable can be identified.
Proof
The proof of corollary 2 can be directly replicated, and
shows that algorithm 2 still tags all variables uniquely and
correctly. Indeed, since there are at most E false positives,
every negative variable is tagged as such; and since there
are at most E false negatives, no positive variable is tagged
negative. In addition, no negative variable is tagged posi-
n
Guaranteed error correction and detection properties of STDFigure 1
Guaranteed error correction and detection properties of STD. An experimenter, expecting up to t positives and E
errors, chooses a satisfactory prime number q and builds the set of pools STD(n; q; t·Γ+2·E+1), as specified in corollary 2.
Recall that n is the total number of variables and Γ is the compression power, i.e. the smallest
γ
such that q
γ+1
n. This figure
summarizes the behavior of these pools when the actual number of errors exceeds E, and distinguishes between the two types
of errors: false positives and false negatives. In the dark blue region, all errors are detected and corrected. In the intermediate
blue rectangles, correction is not guaranteed but detection is: in an unfavorable conformation of positives and errors, correc-
tion of all errors may fail, but this failure cannot go unnoticed, and the user can therefore plan additional experiments. In the
cyan square, detection is usually also guaranteed, except if E is very small (E < 2·Γ-1): in this case, the line y = 3·E+1-x splits the
square in two, and detection is only guaranteed in the bottom left portion, where the total number of errors is at most 3·E+1.
Finally, in the outer pale cyan zone, no guarantee is provided.
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 7 of 13
(page number not for citation purposes)
tive: this results from the facts that there are at most E false
positives and that no positive variable is tagged negative.
Finally, given that every negative variable is tagged nega-
tive, we can conclude that every positive variable is tagged
positive as long as there are less than E+Γ+1 false nega-
tives.
Error detection
If algorithm 2 tags some variables twice or not at all, or if
it tags more than t variables as positive, or if it identifies
more than E false positives or false negatives, then we
know that the conditions for corollaries 2 and 3 are not
satisfied. In this case the obtained tags may be incorrect,
but one is aware of the situation. However, if enough
excess errors are present, the tags can be wrong while
seeming to satisfy one of the corollaries' hypotheses; in
this case, the mistake is not detected. This leads to the fol-
lowing important question: in general, assuming there are
at most t positives, how many errors can be detected?
Examining the proof of corollary 3, if there are at most E
false positives and up to E+Γ false negatives, every variable
is correctly tagged, although some variables may be tagged
twice (i.e. both positive and negative). It follows that to
avoid detection, there must be at least E+Γ+1 false nega-
tives, or at least E+1 false positives. In fact, E+Γ+1 false
negatives can successfully remain undetected. On the
other hand, if there are E+1 false positives, a negative var-
iable may seem positive with only E fictitious false nega-
tives; but this would lead to t+1 putative positive
variables, hence detection is in fact not avoided. A
detailed analysis shows that escaping detection in this
case actually requires either Γ extra false positives, or
2·E+1 additional errors among which at least E+1 are
false negatives. Overall, ignoring the errors' types, we con-
clude that the detection of min(3·E+1, E+Γ) errors is
guaranteed. Typically Γ is 2 or 3, hence this guarantee is
not very strong; but it corresponds to a rare worst case sce-
nario, and in practice many more errors can usually be
detected.
The error correction and detection properties of STD are
summarized in Figure 1. From another angle, it is interest-
ing to know what happens if more than t variables are
positive. As long as there are at most E errors, all tags pro-
duced by algorithm 2 are still correct, although some var-
iables may not be tagged (i.e., they are unresolved).
Therefore the occurrence of more than t positives is
detected, as in the noiseless case. However, if there are
both more than E errors and more than t positives, prob-
lems may occur and escape detection (e.g., a positive var-
iable might be "mis-tagged" as negative). Some of these
problems reflect the natural limits of the STD pools, and
can only be avoided by using different STD parameters;
but some result from the rigidity of algorithm 2. In real
applications where the number of positives and errors will
probably exceed t and E in at least a few instances, more
sophisticated algorithms should be used.
Even redistribution of variables
We have just shown that STD constitutes a solution to the
pooling problem in the presence of experimental noise.
Although it digresses from the main focus of this paper,
the following theorem provides an interesting characteri-
zation of STD, basically showing that the STD layers work
well together, information-wise.
Theorem 2
Let m k q and consider a set of m pools {P
1
,...,P
m
}
STD(n; q; k), each belonging to a different layer. Then:
Proof
see Methods section.
Remarks
1.
λ
m
depends only on m and not on the choice of
P
1
,...,P
m
; hence this theorem can be expressed simply as
follows: each pool is redistributed evenly in every other
layer, and furthermore the intersection between any two
or more pools from different layers is also redistributed
evenly in the remaining layers. This property is very inter-
esting because it means that knowing that any given pool
is positive doesn't bring any information regarding which
pools of another layer will be positive; hence, the infor-
mation content of the other layers remains high.
2. Note that the theorem specifies k q rather than q+1:
the last layer that can be built with STD, L(q), is particular
and does not satisfy theorem 2.
Discussion
To evaluate and compare pooling designs, a fair perform-
ance measure is needed. A widely-used and reasonable
choice consists in considering the number of pools
required to guarantee the correction of all errors and the
identification of all variables' values: we call this the
"guarantee requirement". This criterion is used here to
study the behavior and performance of STD, and to com-
pare it to the main published deterministic error-correct-
ing pooling designs. Since most authors do not
distinguish between false positives and false negatives, we
only consider here the error correction power of STD as
stated in corollary 2, rather than the stronger result
expressed in corollary 3.
λλ λ
mm m
≤≤+ =
=
=
P
n
q
qq
h
h
m
c
cm
cm
1
1
1
,%.where
Γ
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 8 of 13
(page number not for citation purposes)
Guaranteed performance of STD
We define the "gain" of a design as the ratio between the
number of variables and the number of pools: n/v. The
gain is called "guaranteed gain" if the guarantee require-
ment is satisfied. This measure is particularly useful for
comparing settings where n varies.
Given the specifications of an application, i.e. values for n
(total number of objects to be tested), t (number of
expected positives), and E (expected number of errors to
be corrected), STD can propose many sets of pools, by
selecting various values for the parameter q and setting the
number of layers k accordingly (as specified by corollary
2). These pool sets are of different sizes, but all satisfy the
guarantee requirement. The optimal choice, q
opt
, is the
one with maximum guaranteed gain. Let q
min
be the
smallest possible q such that t·Γ(q,n)+2·E q, and let
Γ
max
= Γ(q
min
,n). At a fixed value for Γ, the number of lay-
ers k necessary to satisfy the guarantee requirement is con-
stant; therefore the best gain at fixed Γ is always obtained
with the smallest q whose compression is Γ. It follows that
q
opt
can be identified easily by finding the smallest q for
each value of Γ in {1,...,Γ
max
}, and calculating the corre-
sponding gain. In practice we often have q
opt
= q
min
, but
this is not compulsory, as illustrated by Table 1 in the case
n = 10000, t = 5, E = 0.
The above method allows to easily calculate the best guar-
anteed gain that STD can offer, in any specified (n,t,E) set-
ting. Therefore, the behavior of STD can be studied under
various angles. In particular, one interesting approach
consists in using fixed values for t and E, and studying the
evolution of the best guaranteed gain (obtained using
q
opt
) when n increases. For example, Table 2 displays the
number of pools necessary to identify three positives and
correct two errors, when the number of variables ranges
from 100 to 10
6
. When n increases, the gain increases sub-
stantially and fairly regularly: it is multiplied by a factor
ranging from 6 to 9 every time n gains an order of magni-
tude. Note that in a real application, the fact that the pool
sizes are generally constrained by practical considerations
can result in forcing to use values of q > q
opt
and hence
limit the gain.
Comparison with previous work
In this section, after a brief overview of the known con-
struction methods, we compare STD, in terms of flexibil-
ity and performance under the guarantee requirement, to
the main published error-correcting deterministic pooling
designs. In general, the guaranteed gains can be difficult to
compare analytically, because the numbers of pools and
variables can be defined by formulas that are often rather
involved. However, each paper describing a new design
typically holds a numerical example, which would hardly
Table 1: Choosing the optimal value for the number of pools per layer, q
q Γ kvgain
13 3 16 k > q+1, can't use these values
17 3 16 272 36.8
19 3 16 304 32.9
23 2 11 253 39.5
29 2 11 319 31.3
... 2 11 ... ...
97 2 11 1067 9.4
101 1 6 606 16.5
This table shows the gains obtained with various q values, when the total number of variables to be tested is n = 10000 and the number of expected
positives is t = 5, in a noiseless experiment (E = 0). Γ is the compression power (i.e. logarithm of n in base q, see Preliminaries in Results(1) section),
k is the number of layers, v is the number of pools (i.e. k·q), and the gain is defined as n/v. By construction, STD requires k q+1; and to guarantee
the identification of t positives while correcting E errors, section 3.3 showed that we must choose k = t·Γ+2·E+1; in this example, k = 5Γ+1. Often,
the smallest useable q (i.e., satisfying k q+1), q
min
, yields the highest gain, but this is not always the case. In this example, q
min
= 17, but q = 23
(smallest q such that Γ = 2) yields the highest gain: 39.5.
Table 2: Gains obtained when the identification of 3 positives and the correction of 2 errors is guaranteed (t = 3, E = 2)
nq
opt
pool size k v gain
100 11 9 8 88 1.1
1000 11 91 11 121 8.3
10
4
13 769 14 182 55
10
5
19 5263 14 266 376
10
6
19 52631 17 323 3096
For each value of n (total number of variables), the optimal q value q
opt
has been calculated, as well as the associated pool size, the number of layers
k, the total number of pools v, and the gain.
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 9 of 13
(page number not for citation purposes)
be disadvantageous to the described design. Therefore,
when the methods cannot be easily compared, it seems
fair to use each paper's numerical example for comparison
with STD. Note that the guarantee requirement cannot be
satisfied by random designs [e.g. [8]], which are conse-
quently not studied here.
Detailed reviews of deterministic pooling designs can be
found in [7,9,10], and we will only very briefly recapitu-
late them here. Broadly speaking, there are three main
construction methods: set packings, transversal designs,
and direct constructions. In fact, the non-adaptive pooling
problem is strongly connected to the problem of con-
structing superimposed codes [11], which was analyzed
forty years ago to deal with the questions of representing
rare document attributes in an information retrieval sys-
tem and of assigning channels to relieve congestion in
shared communications bands. The focus is different:
each variable is seen as a code word and the goal is to max-
imize the number of code words n at fixed length v rather
than the other way around; and these problems were
noiseless, contrary to our own situation where error-cor-
rection is critical. Yet [11] had already suggested construc-
tions of superimposed codes based on set packings, as
well as constructions based on q-nary codes (which are in
fact transversal designs) and on compositions of q-nary
codes (which are not transversal anymore, and are more
compact). Set packings, such as the designs presented in
[12], can yield very efficient designs, but are mainly lim-
ited to t 2 [7]. Transversal designs include the well-
known grid (or row-and-column) design. This design is
initially limited to identifying a single positive in the
absence of noise, and is not very efficient, but it has been
improved in two directions: hypercube designs [13] gen-
eralize it by considering higher dimension grids, and var-
ious methods [e.g. [14]] have been proposed to build
several "synergical" grids that work well together. Finally,
some authors have proposed direct constructions of error-
correcting pooling designs [15,16].
Note that STD, although directly constructed, is in fact a
transversal design. Furthermore, STD can be seen as a con-
structive definition of a q-nary code as proposed by [11],
i.e. a concatenated code where the inner code is simply the
unary code, and the outer code has some similarities with
a Reed-Solomon code [17]. Yet although related, the
methods are clearly different: for example, STD doesn't
produce useful pools if q is a prime power; on the other
hand, STD allows to build up to q+1 layers, whereas the
Reed-Solomon based construction can only build up to q-
1. Furthermore, STD produces efficient pools independ-
ently of the number of variables n, contrary to the Reed-
Solomon approach where one is faced with the difficult
problem of choosing a good subset of code words except
for some n values. The relationship between the two
approaches requires further investigations.
Set packing designs
Regarding set packing designs, the main results taking into
account error-correction are presented in [12]. The
authors exhibit Steiner designs that can identify up to t =
2 positives and in some instances correct many errors, and
prove that these designs are optimal when the construc-
tion is possible (it is only possible for very specific (n,E)
values). When these optimal designs exist, they are more
efficient than STD. The same authors describe a real-world
application in [18], where the goal is to screen a clone
map with n = 1530 and t = 2. They start off with a design
that can deal with 4368 variables while satisfying the guar-
antee requirement for t = 2 and E = 0. None of the optimal
designs from [12] can be used, but this initial design is
also based on a Steiner system and remains very efficient.
The authors then select 1530 of the 4368 variables to serve
as clones in their experiment. This was presumably done
because Steiner systems, even outside the optimality con-
ditions of [12], are not known for arbitrary values of n.
Although this reduces the resulting designs' performance,
they remain efficient and obviously still satisfy the guar-
antee requirement. Additionally, this strategy reduces the
sizes of pools, providing increased robustness (e.g., some
information can still be obtained if, exceptionally, three
objects are positive), and complying with the application-
imposed pool size constraints. In the example, n = 1530
and t = 2, and the authors propose two designs: one with
65 pools of approximately 118 clones each, and one with
54 pools of 142 clones. These numbers are very close to
what would be recommended with STD: we could pro-
pose STD(1530; 13; 5) which has 65 pools of 118 clones,
or STD(1530; 7; 7) with 49 pools of 218 clones. Note that
although STD(1530; 13; 5) has the same number of pools
and pool size as the first design proposed in [18], they are
in fact different: the latter is obtained by random sam-
pling from the Steiner design. All of these designs guaran-
tee the identification of 2 positives in the absence of noise.
Furthermore, although noise-tolerance is not guaranteed
in any of them, simulations we have performed suggest
that substantial error-rates can be corrected in the STD
designs, as is the case in the others. Therefore these
designs and STD appear to achieve very similar perform-
ances on these examples. However, it is important to note
that the only Steiner systems proven to be optimal con-
cern specific instances of the t = 1 and t = 2 cases. In more
general circumstances, designs derived from Steiner sys-
tems are not optimal, and their performance depends on
the problem specification (i.e. n, t, E values). For example,
considering the n = 10000, t = 5, E = 0 problem discussed
above and in Table 1, the smallest Steiner system that we
could identify (based on [19]) is S(3,24,530), which com-
prises 530 pools. In addition, there is no clear method for
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 10 of 13
(page number not for citation purposes)
choosing the Steiner system best suited to a given problem
specification: although we have searched extensively, we
cannot be certain that no better Steiner system exists for
this example. In contrast, finding the optimal STD param-
eters is straight-forward, as explained in the previous sec-
tion. In this case STD proposes a solution comprising 253
pools.
Transversal designs
An interesting generalization of the grid design is
described in [13]. The authors propose to array the varia-
bles in a D-dimensional cube, instead of the 2 dimensions
used in the standard grid design. Furthermore, they advise
that the length of the cube's side be chosen prime: let us
denote it q. A pool is then obtained from each hyper-
plane, so that the D-dimensional cube yields D layers of q
pools, each comprising up to n/q variables. To obtain
more layers, the authors propose a criterion to construct
"efficient transforming matrices" that produce additional
cubes, where variables are as shuffled as possible; in fact,
the purpose of their "efficiency" criterion is identical to
the "co-occurrence of variables" property satisfied by STD
(theorem 1). Seen like this, their system is clearly related
to STD: D is Γ+1, and although the authors do not inves-
tigate their design's behavior under the guarantee require-
ment, corollaries 1 and 2 can in essence be applied.
Furthermore, when the cube is "full", i.e. when n = q
D
,
their pools satisfy an analog of theorem 2 (i.e. they are
"information-efficient" in some sense). Note that this can-
not be the case when q is arbitrary; this may explain why
the authors limit their options for q to the smallest primes
larger than n
1/D
, for each D value. However, each D-
dimensional cube provides only D layers, and the pro-
posed criterion for building additional cubes is not sys-
tematic, so that the total number of layers that can be built
is unclear but seems much smaller than with STD. In addi-
tion, the authors don't take observational noise into
account (they do talk of "false positives", but are really
referring to what we call ambiguous variables). For these
reasons, we cannot rigorously compare the designs under
the guarantee requirement, but in general the fact that
STD can build more layers is clearly favorable, since it
allows dealing with a greater number of positives and/or
errors at any chosen q value. In a numerical example con-
cerning the screening of the CEPH YAC library, n = 72000
and the authors argue that the optimal dimension and
side length to use are D = 3 and q = 43, respectively. They
then exhibit a set of transforming matrices that allows the
construction of at most 3 additional cubes, yielding a total
of 12 layers. By contrast, using the same values for D and
q, STD can build up to 44 layers, which all satisfy the effi-
ciency criterion. We believe that some of these extra layers
could prove valuable, especially when allowing for exper-
imental noise. In addition, smaller values for q can be
used with STD (while still being information-efficient in
the sense of theorem 2), although simulations would be
necessary to choose the best value.
Two other transversal pooling designs, which generalize
the grid design by providing additional 2-dimensional
grids, are described in [14]: the "Union Jack" and the RCF
designs. In essence, they are very similar to STD when Γ =
1: writing q = n, they allow the construction of up to q+1
layers of pools (where each layer contains q pools of size
q) which satisfy the property that any pair of variables
appears in at most one pool. Theorem 1 shows that this
property, known as the "unique colinearity" condition, is
in fact verified by STD when Γ = 1 (in accord with q = n).
We can observe that these designs, as well as STD when Γ
= 1, are maximal under this condition, since each pair of
variables is in fact present in exactly one pool. Corollaries
1 and 2 can be applied, and show that they allow the iden-
tification of up to t positives while correcting E observa-
tion errors, provided that t+2·E+1 q+1. The
performance of the designs from [14] is therefore identical
to that of STD when Γ = 1. However, STD is superior to
these designs in two respects. First, their constructions are
only possible if q is prime and q5 mod 6 (using the RCF
construction), or if q is prime and q3 mod 4 (with the
Union Jack design). By contrast, STD only requires that q
is prime. Second, STD can be used with any compression
power, rather than being limited to Γ = 1. This flexibility
is an advantage, because STD can be customized to suit
more applications. Notably, when the fraction of positives
is small, the Union Jack and RCF designs perform less
well: the pools are too small, and observing that a pool is
negative brings little information. By contrast, pools in
STD can be very large (when choosing a small q), so that
every observation is informative. To illustrate this point,
let us consider the numerical example of [16] discussed
below, where the fraction of positives is particularly low
(n = 18,918,900 and t is 2 or 9). The best usable design
from [14] would be a Union Jack with q = 4363, and
would require a total of 13,089 pools for 2 positives – 77
times more than STD – and 43,630 pools to guarantee the
identification of 9 positives – 32 times more than STD.
Direct constructions
In [15], the author proposes a direct construction allow-
ing the detection of an arbitrary number of positives.
Although this design is not very efficient under the guar-
antee requirement, the author shows in [20] that the
pools designed for detecting 2 positives allow with high
probability the detection of more positives. A numerical
example, presented in [9], is the following. If n = 10
6
and
t = 5, using 946 pools guarantees the identification of 2
positives and successfully identifies up to 5 positives with
probability 97.1%. In comparison, under the guarantee
requirement (i.e. with probability 100%), STD(n; 11; 11)
contains 121 pools and identifies 2 positives, and STD(n;
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 11 of 13
(page number not for citation purposes)
23; 21), which comprises 483 pools, guarantees the iden-
tification of up to 5 positives.
Finally, another group [16] described two new classes of
non-adaptive pooling designs, which allow the detection
of any number of positives and the correction of half as
many errors. Following the idea from [20], they also show
that their designs for t = 2 have high probabilities of being
successful for more positives. In a numerical example,
they consider the case n = 18,918,900, and propose a
design with 5460 pools which guarantees the identifica-
tion of 2 positives, and can in addition identify up to 9
positives with 98.5% chance of success. By contrast,
STD(n; 13; 13) contains 169 pools and guarantees the
identification of 2 positives, and the identification of 9
positives is guaranteed with the 1369 pools of STD(n; 37;
37).
Conclusion
In this paper, we have presented a new pooling design: the
"shifted transversal design" (STD). We have proven that it
constitutes an error-correcting solution to the pooling
problem. This design is highly flexible: it can be tailored
to deal efficiently with many experimental settings (i.e.,
numbers of variables, positives and errors). Finally, under
a standard performance criterion, i.e. requiring that the
correction of all errors and the identification of all varia-
bles' values be guaranteed mathematically, we have
shown that STD compares favorably, in terms of numbers
of pools, to the main previously described deterministic
pooling designs.
This approach is being experimentally validated in collab-
oration with Marc Vidal's laboratory at the Dana Farber
Cancer Institute, Boston. In a pilot project, pools have
been generated with 940 AD-Y preys, using the
STD(940;13;13) design, and we are screening the 169
resulting pools against 50 different baits. This experiment
will provide estimations for the technical noise levels of
their high-throughput 2-hybrid protocol, in addition to
producing valuable interaction data and yielding a real-
world evaluation of STD.
Although this work is motivated by protein interaction
mapping, as we have been collaborating with Marc Vidal's
group for several years, its scope is certainly not limited to
high-throughput two-hybrid projects. Potential applica-
tions include a wide range of high-throughput PCR-based
assays such as gene knockout projects, drug screening
projects, and various proteomics studies. Furthermore,
this general problem certainly has applications outside
biology.
In practice, an important point is made in [20], where the
author shows that his pooling design can be used to detect
with high probability more positives than guaranteed.
Simulations we have performed show that this observa-
tion is also true with STD: the gains can be increased sub-
stantially if one tolerates a small fraction of ambiguous
variables that will need to be retested. However, these
considerations are outside the scope of this paper, because
we cannot study them analytically, but resort instead to
computer simulations. Yet using such a strategy in practice
with STD significantly improves the performance. For
example, consider the case n = 10000 and t = 5, and sup-
pose that the assay has an error-rate of 1%. To guarantee
the identification of all variables' values, one must use
483 pools (with q = 23 and k = 21). However, if one tol-
erates up to 10 ambiguous variables, even when overesti-
mating the error-rate to 2% for safety's sake, 143 pools
prove amply sufficient. It is clear that this "ambiguity-tol-
erant" approach should be preferred in practical applica-
tions. This approach and the corresponding computer
program, which performs simulations to select the STD
parameter values best suited for a given application and
includes original efficient algorithms for preparing the
pools and decoding the outcomes, will be discussed in
another paper.
Another interesting track will be to study the efficiency of
pooling designs from the point of view of Shannon's
information theory. We are planning to investigate STD's
behavior in this context. Theorem 2 could prove useful for
this.
Finally, the connection between STD and constructions
based on superimposed codes, e.g. q-nary Reed-Solomon
codes [11], warrants further studies.
Methods
Proof of theorem 1
Let i
1
,i
2
{0,...,n-1} with i
1
i
2
. Since each layer of pools
is a partition of , there cannot be more than one pool
per layer containing both A
i1
and A
i2
. Furthermore, there
exists a pool in layer L(j) that contains both A
i1
and A
i2
if
and only if the columns for A
i1
and A
i2
are equal in M
j
, that
is to say . Therefore the number of pools of
STD(n; q; q+1) that contain both i
1
and i
2
,
Card(pools
q+1
(i
1
) pools
q+1
(i
2
)), is the number of values of
j in {0,...,q} such that . However, the follow-
ing equivalencies hold j {0,...,q-1} :
n
CC
ji ji,,
12
=
CC
ji ji,,
12
=
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 12 of 13
(page number not for citation purposes)
Since q is prime, Z/qZ is a field, namely the Galois field
GF(q).
Furthermore, since i
1
i
2
, there exists at least one value c
{0,...,Γ} such that mod q. Indeed,
i
1
,i
2
{0,...,n-1} and n q
Γ+1
entails that
and ,
where % denotes the modulus (these are the unique
decompositions of i
1
and i
2
in base q). Hence,
. Supposing that
mod q for every c {0,...,Γ} would
lead to i
1
- i
2
= 0, which is contradictory with the hypoth-
esis that i
1
i
2
.
It follows that the above (1) can be seen as a non-zero pol-
ynomial (in j) of degree at most Γ on GF(q). As is well-
known, such a polynomial has at most Γ roots in GF(q).
That is to say, there are at most Γ values of j in {0,...,q-1}
such that a pool of L(j) contains both A
i1
and A
i2
. This
proves the theorem if . Furthermore, if
, the coefficient of j
Γ
in (1) is zero by defini-
tion of s(i,q), and (1) is of degree at most Γ-1. Therefore if
A
i1
and A
i2
are elements of the same pool in L(q), then
there are at most Γ-1 pools in L(0),...,L(q-1) that contain
both A
i1
and A
i2
. This concludes the proof of theorem 1.
Proof of theorem 2
Let j
1
,...,j
m
{0,...,k-1} be the layer numbers and p
1
,...,p
m
{0,...,q-1} be the pool indexes that define {P
1
,...,P
m
}:
for every h {1,...,m}, P
h
contains all variables of index i
{0,...,n-1} such that s(i,j
h
) p
h
mod q.
is the number of values i {0,...,n-1} such that:
h {1,...,m},
s(i,j
h
) p
h
mod q. Writing with
α
0
,...,
α
Γ
{0,...,q-1} (this is the unique decomposition of i in base
q), the above is equivalent to:
This system can be written:
If m Γ+1: consider the square sub-matrix composed of
the first Γ+1 rows of the left member. Since P
1
,...,P
m
belong to different layers, the j
h
values are all distinct.
Therefore, recalling that q is prime, this sub-matrix can be
seen as a Vandermonde matrix with elements in the
Galois field GF(q): it is nonsingular. This shows the exist-
ence of a unique tuple of values for
α
0
,...,
α
Γ
{0,...,q-1}
satisfying the first Γ+1 congruencies of (2). The remaining
m-(Γ+1) congruencies may or may not be satisfied with
these
α
0
,...,
α
Γ
values, and the corresponding
might be too large (i.e. n); but in any case, there is at
most one value of i satisfying the system: theorem 2 is
proved when m Γ+1 (given that in this case
λ
m
= 0).
Otherwise, m Γ: consider the square sub-matrix com-
posed of the first m columns of the left member. Again,
this sub-matrix is a Vandermonde matrix in GF(q), hence
it is nonsingular. Consequently, given any values for
α
m
,...,
α
Γ
, there exists a unique tuple of values for
α
0
,...,
α
m-
1
in {0,...,q-1} satisfying (2) (simply shift the terms in
α
m
,...,
α
Γ
to the right member). The question therefore
becomes: how many tuples of values for
α
m
,...,
α
Γ
exist,
such that where
α
0
,...,
α
m-1
are deter-
mined by
α
m
,...,
α
Γ
as explained above. To answer this,
CC sijsij q
j
i
q
j
ji ji
c
c
c
c
c
,,
(,) (,) mod
12
12
0
1
0
=⇔
⇔⋅
==
Γ ΓΓ
Γ
⇔⋅
=
i
q
q
j
i
q
i
q
c
c
c
cc
2
0
12
mod
()
0
1
mod q
i
q
i
q
cc
12
0
i
i
q
qq
c
c
c
1
1
0
=
=
%
Γ
i
i
q
qq
c
c
c
2
2
0
=
=
%
Γ
ii
i
q
i
q
qq
cc
c
c
12
12
0
−=
=
%
Γ
i
q
i
q
cc
12
0
CC
qi qi,,
12
CC
qi qi,,
12
=
P
h
h
m
=1
iq
c
c
=⋅
=
α
c
0
Γ
∀∈
()
=
hm jp
c
c
h
c
hq
{, , }, mod .12
0
α
Γ
1
1
11
2
1
2
01
jj j
jj j
p
p
mm m


Γ
Γ
Γ
α
α
mm
q
mod .
iq
c
c
=⋅
=
α
c
0
Γ
in=⋅<
=
α
c
c
c
q
0
Γ
Publish with Bio Med Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours — you keep the copyright
Submit your manuscript here:
http://www.biomedcentral.com/info/publishing_adv.asp
BioMedcentral
BMC Bioinformatics 2006, 7:28 http://www.biomedcentral.com/1471-2105/7/28
Page 13 of 13
(page number not for citation purposes)
consider the unique decomposition of n-1 in base q:
, where for c
{0,...,Γ}. Under this representation, it is clear that i < n,
i.e. i n-1, if and only if:
α
Γ
<
β
Γ
or
For each c {m,...,Γ}, the branch ending at (
α
c
<
β
c
) yields
β
c
·q
(c-m)
different tuples. Indeed, for d > c
α
d
=
β
d
in this
branch, and
α
0
,...,
α
m-1
are bound to
α
m
,...,
α
Γ
: there are
β
c
possible choices for
α
c
, and q choices each for
α
m
,...,
α
c-1
.
As to the final branch, it can yield at most one solution,
given that all the
α
values are set or bound in this branch.
Consequently, there are a total of or
λ
m
+1 solutions: theorem 2 is also proved when m Γ.
Acknowledgements
I thank Danielle Thierry-Mieg, Jean Thierry-Mieg, Laurent Trilling and Jean-
Louis Roch for stimulating discussions and for carefully reading the manu-
script, and an anonymous reviewer for insightful comments. This work was
funded by a BQR grant from the Institut National Polytechnique de Greno-
ble (INPG) to NT.
References
1. Reboul J, Vaglio P, Rual JF, Lamesch P, Martinez M, Armstrong CM, Li
S, Jacotot L, Bertin N, Janky R, Moore T, Hudson JR Jr, Hartley JL, Bra-
sch MA, Vandenhaute J, Boulton S, Endress GA, Jenna S, Chevet E,
Papasotiropoulos V, Tolias PP, Ptacek J, Snyder M, Huang R, Chance
MR, Lee H, Doucette-Stamm L, Hill DE, Vidal M: C. elegans ORFe-
ome version 1.1: experimental verification of the genome
annotation and resource for proteome-scale protein expres-
sion. Nat Genet 2003, 34(1):35-41.
2. Walhout A, Sordella R, Lu X, Hartley J, Temple GF, Brasch MA, Thi-
erry-Mieg N, Vidal M: Protein interaction mapping in C. elegans
using proteins involved in vulval development. Science 2000,
287:116-122.
3. Davy A, Bello P, Thierry-Mieg N, Vaglio P, Hitti J, Doucette-Stamm L,
Thierry-Mieg D, Reboul J, Boulton S, Walhout AJ, Coux O, Vidal M:
A protein-protein interaction map of the Caenorhabditis ele-
gans 26S proteasome. EMBO Rep 2001, 2(9):821-828.
4. Evans G, Lewis K: Physical mapping of complex genomes by
cosmid multiplex analysis. Proc Natl Acad Sci USA 1989,
86(13):5030-5034.
5. Zwaal R, Broeks A, van Meurs J, Groenen J, Plasterk RH: Target-
selected gene inactivation in Caenorhabditis elegans by using
a frozen transposon insertion mutant bank. Proc Natl Acad Sci
USA 1993, 90(16):7431-7435.
6. Cai W, Chen R, Gibbs R, Bradley A: A clone-array pooled shot-
gun strategy for sequencing large genomes. Genome Res 2001,
11(10):1619-1623.
7. Balding D, Bruno W, Knill E, Torney D: A comparative survey of
non-adaptive pooling designs. In Genetic mapping and DNA
sequencing New York: Springer; 1996:133-154.
8. Bruno W, Knill E, Balding D, Bruce D, Doggett NA, Sawhill WW,
Stallings RL, Whittaker CC, Torney DC: Efficient pooling designs
for library screening. Genomics 1995, 26:21-30.
9. Ngo H, Du DZ: A survey on combinatorial group testing algo-
rithms with applications to DNA library screening. DIMACS
Ser Discrete Math Theoret Comput Sci 2000, 55:171-182.
10. Du DZ, Hwang F: Combinatorial Group Testing and Its Appli-
cations, 2
nd
edn. Singapore: World Scientific; 2000.
11. Kautz W, Singleton H: Nonrandom binary superimposed codes.
IEEE Trans Inform Theory 1964, 10:363-377.
12. Balding D, Torney D: Optimal pooling designs with error cor-
rection. J Comb Theory Ser A 1996, 74:131-140.
13. Barillot E, Lacroix B, Cohen D: Theoretical analysis of library
screening using a N-dimensional pooling strategy. Nucl Acids
Res 1991, 19:6241-6247.
14. Chateauneuf M, Colbourn C, Kreher D, Lamken E, Torney D: Pool-
ing, lattice square, and union jack designs. Ann Comb 1999,
3:27-35.
15. Macula A: A simple construction of d-disjunct matrices with
certain constant weights. Discrete Math 1996, 162(1–3):311-312.
16. Ngo H, Du DZ: New constructions of non-adaptive and error-
tolerance pooling designs. Discrete Math 2002, 243(1–
3):161-170.
17. Reed I, Solomon G: Polynomial codes over certain finite fields.
J Soc Ind Appl Math 1960, 8:300-304.
18. Balding D, Torney D: The design of pooling experiments for
screening a clone map. Fungal genet, biol 1997, 21:302-307.
19. Colbourn C, Mathon R: Steiner systems. In The CRC Handbook of
Combinatorial Designs Edited by: Colbourn C, Dinitz J. Boca Raton:
CRC Press; 1996:66-75.
20. Macula A: Probabilistic nonadaptive group testing in the pres-
ence of errors and dna library screening. Ann Comb 1999,
3:61-69.
nq
c
c
c
−=
=
1
0
β
Γ
β
c
c
n
q
q=
1
%
and ( or
... and ( or
( and aq
mm
mm c
c=0
(
(
αβ αβ
αβ
αβ
ΓΓ ΓΓ
Γ
=<
<
=⋅
cc
n))...))).<
λβ
mc
cm
cm
q=⋅
=
Γ
... (1) Mv = w Brust and Brust BMC Bioinformatics (2023) 24:26 To see the role of these properties we describe the main processes for decoding the observed outcomes to the desired sample values (cf. [3,Corollary 1], too). ...
... Pooling designs whose set of pools can be partitioned into subsets which each partition all samples are called resolvable [6, I.1.6]. The so called "Shifted Transversal Brust and Brust BMC Bioinformatics (2023) 24:26 Design" (STD) [3] is a non-adaptive, combinatorial pool testing method designed for the application in binary high throughput screening and is an example of resolvable designs. A variation of STD, adapted and applied to high-throughput drug screening is presented in [12]. ...
... A central characteristic of STD is that pairs of samples should occur together in the same pool only a limited number of times (co-occurrence of samples). In [3] the co-occurrence is denoted by Ŵ and the relationship for the maximum number of t positives that are guaranteed to be discovered by a pooling scheme for a fixed value of q is given by Ŵt + 1 ≤ k where k is the number of layers. Since STD based on q generates up to q + 1 layers it holds that Ŵt ≤ q. ...
Article
Full-text available
Background Grouping samples with low prevalence of positives into pools and testing these pools can achieve considerable savings in testing resources compared with individual testing in the context of COVID-19. We review published pooling matrices, which encode the assignment of samples into pools and describe decoding algorithms, which decode individual samples from pools. Based on the findings we propose new one-round pooling designs with high compression that can efficiently be decoded by combinatorial algorithms. This expands the admissible parameter space for the construction of pooling matrices compared to current methods. Results By arranging samples in a grid and using polynomials to construct pools, we develop direct formulas for an Algorithm (Polynomial Pools (PP)) to generate assignments of samples into pools. Designs from PP guarantee to correctly decode all samples with up to a specified number of positive samples. PP includes recent combinatorial methods for COVID-19, and enables new constructions that can result in more effective designs. Conclusion For low prevalences of COVID-19, group tests can save resources when compared to individual testing. Constructions from the recent literature on combinatorial methods have gaps with respect to the designs that are available. We develop a method (PP), which generalizes previous constructions and enables new designs that can be advantageous in various situations.
... Design" (STD) [3] is a non-adaptive, combinatorial pool testing method designed for the application in binary high throughput screening and is an example of resolvable designs. A variation of STD, adapted and applied to high-throughput drug screening is presented in [12]. ...
... A central characteristic of STD is that pairs of samples should occur together in the same pool only a limited number of times (co-occurrence of samples). In [3] the co-occurrence is denoted by Γ and the relationship for the maximum number of t positives that are guaranteed to be discovered by a pooling scheme for a fixed value of q is given by Γt + 1 ≤ k where k is the number of layers. Since STD based on q generates up to q + 1 layers it holds that Γt ≤ q. ...
... Parameters of pooling designs from published methods and from PP (this work) for guaranteed detection of k = 2 positives among N = 961 samples. Table 4 lists the designs from PP (this work), STD [3], PPoL [13] and multipools [5] with the highest compression gain for these parameters. In this setting, designs generated by PP and STD are equivalent, the same applies to the designs generated by PPoL and multipools. ...
Preprint
Full-text available
Background: Grouping samples with low prevalence of positives into pools and testing these pools can achieve considerable savings in testing resources compared with individual testing in the context of COVID-19. We review published pooling matrices, which encode the assignment of samples into pools and describe decoding algorithms, which decode individual samples from pools. Based on the findings we propose new one-round pooling designs with high compression that can efficiently be decoded by combinatorial algorithms. This expands the admissible parameter space for the construction of pooling matrices compared to current methods. Results: By arranging samples in a grid and using polynomials to construct pools, we develop direct formulas for an Algorithm (Polynomial Pools (PP)) to generate assignments of samples into pools. Designs from PP guarantee to correctly decode all samples with up to a specified number of positive samples. PP includes recent combinatorial methods for COVID-19, and enables new constructions that can result in more effective designs. Conclusion: For low prevalences of COVID-19, group tests can save resources when compared to individual testing. Constructions from the recent literature on combinatorial methods have gaps with respect to the designs that are available. We develop a method (PP), which generalizes previous constructions and enables new designs that can be advantageous in various situations.
... To see the role of these properties we describe the main processes for decoding the observed outcomes to the desired sample values (cf. [3,Corollary 1], too). ...
... Pooling designs whose set of pools can be partitioned into subsets which each partition all samples are called resolvable [6, I.1.6]. The so called "Shifted Transversal Design" (STD) [3] is a non-adaptive, combinatorial pool testing method designed for the application in binary high throughput screening and is an example of resolvable designs. A variation of STD, adapted and applied to high-throughput drug screening is presented in [12]. ...
... A central characteristic of STD is that pairs of samples should occur together in the same pool only a limited number of times (co-occurrence of samples). In [3] the co-occurrence is denoted by Γ and the relationship for the maximum number of t positives that are guaranteed to be discovered by a pooling scheme for a fixed value of q is given by Γt + 1 ≤ k where k is the number of layers. Since STD based on q generates up to q + 1 layers it holds that Γt ≤ q. ...
Preprint
Full-text available
Grouping samples with low prevalence of positives into pools and testing these pools can achieve considerable savings in testing resources compared with individual testing in the context of COVID-19. We review published pooling matrices, which encode the assignment of samples into pools and describe decoding algorithms, which decode individual samples from pools. Based on the findings we propose new one-round pooling designs with high compression that can efficiently be decoded by combinatorial algorithms. This expands the admissible parameter space for the construction of pooling matrices compared to current methods. By arranging samples in a grid and using polynomials to construct pools, we develop direct formulas for an Algorithm (Polynomial Pools (PP)) to generate assignments of samples into pools. Designs from PP guarantee to correctly decode all samples with up to a specified number of positive samples. PP includes recent combinatorial methods for COVID-19, and enables new constructions that can result in more effective designs. For low prevalences of COVID-19, group tests can save resources when compared to individual testing. Constructions from the recent literature on combinatorial methods have gaps with respect to the designs that are available. We develop a method (PP), which generalizes previous constructions and enables new designs that can be advantageous in various situations.
... To see the role of these properties we describe the main processes for decoding the observed outcomes to the desired sample values (cf. [3,Corollary 1], too). ...
... Pooling designs whose set of pools can be partitioned into subsets which each partition all samples are called resolvable [6, I.1.6]. The so called "Shifted Transversal Design" (STD) [3] is a non-adaptive, combinatorial pool testing method designed for the application in binary high throughput screening and is an example of resolvable designs. A variation of STD, adapted and applied to high-throughput drug screening is presented in [12]. ...
... A central characteristic of STD is that pairs of samples should occur together in the same pool only a limited number of times (co-occurrence of samples). In [3] the co-occurrence is denoted by Γ and the relationship for the maximum number of t positives that are guaranteed to be discovered by a pooling scheme for a fixed value of q is given by Γt + 1 ≤ k where k is the number of layers. Since STD based on q generates up to q + 1 layers it holds that Γt ≤ q. ...
Preprint
Full-text available
Grouping samples with low prevalence of positives into pools and testing these pools can achieve considerable savings in testing resources compared with individual testing in the context of COVID-19. We review published pooling matrices, which encode the assignment of samples into pools and describe decoding algorithms, which decode individual samples from pools. Based on the findings we propose new one-round pooling designs with high compression that can efficiently be decoded by combinatorial algorithms. This expands the admissible parameter space for the construction of pooling matrices compared to current methods. By arranging samples in a grid and using polynomials to construct pools, we develop direct formulas for an Algorithm (Polynomial Pools (PP)) to generate assignments of samples into tests. Designs from PP guarantee to correctly decode all samples with up to a specified number of positive samples. PP includes recent combinatorial methods for COVID-19, and enables new constructions that can result in more effective designs. For low prevalences of COVID-19, group tests can save resources when compared to individual testing. Constructions from the recent literature on combinatorial methods have gaps with respect to the possibilities of designs. We develop a method (PP), which includes previous constructions and enables new designs that can be advantageous in various situations.
... These methods might not be efficient particularly in situations where the results need to be delivered quickly. Although adaptive pool testing methods might require fewer tests, non-hierarchical or "nonadaptive" pool testing schemes; where overlapping pool testing is completed in a single step, allow for parallel testing and do not require extra samples be extracted, which improves the testing efficiency [20,28]. ...
... Prior research developed several pools formation methods like the Shifted Transversal Pool Testing Design [28] which seeks to reduce the number of joint membership of individuals in any given pool, and at the same time generates pools that intersect in an equal number of locations. These two properties can improve the non-adaptive detection process significantly. ...
Article
Full-text available
Prior research on pool testing focus on developing testing methods with the main objective of reducing the total number of tests. However, pool testing can also be used to improve the accuracy of the testing process. The objective of this paper is to improve the accuracy of pool testing using the same number of tests as that of individual testing taking into consideration the probability of testing errors and pool multiplicity classification thresholds. Statistical models are developed to evaluate the impact of pool multiplicity classiffcation thresholds on pool testing accuracy using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The findings indicate that under certain conditions, pool testing multiplicity yields superior testing accuracy compared to individual testing without additional cost. The results reveal that selecting the multiplicity classification threshold is a critical factor in improving the pool testing accuracy and show that the lower the prevalence level the higher the gains in accuracy using multiplicity pool testing. The findings also indicate that performance can be improved using a batch size that is inversely proportional to the prevalence level. Furthermore, the results indicate that multiplicity pool testing not only improves the testing accuracy but also reduces the total cost of the testing process. Based on the findings, the manufacturer's test sensitivity has more significant impact on the accuracy of multiplicity pool testing compared to that of manufacturer's test specificity.
... Developing on the two-stage testing, the binary splitting method has been proposed, where the defective pools are sequentially divided into subpools, which are themselves tested and divided, and so on [11], [12]. From an experimental point of view, a method called shifted transversal for high-throughput screening has been proposed [13]. In addition, a pooling method using a paper-like device for multiple malaria infections has been developed for effective testing [14]. ...
... 10: 13: ...
Article
Full-text available
We study the inference problem in the noisy group testing to identify defective items from the perspective of the decision theory. We introduce Bayesian inference and consider the Bayesian optimal setting in which the true generative process of the test results is known. We demonstrate the adequacy of the posterior marginal probability in the Bayesian optimal setting as a diagnostic variable based on the area under the curve (AUC). Using the posterior marginal probability, we derive the general expression of the optimal cutoff value that yields the minimum expected risk function. Furthermore, we evaluate the performance of the Bayesian group testing without knowing the true states of the items: defective or non-defective. By introducing an analytical method from statistical physics, we derive the receiver operating characteristics curve, and quantify the corresponding AUC under the Bayesian optimal setting. The obtained analytical results precisely describes the actual performance of the belief propagation algorithm defined for single samples when the number of items is sufficiently large.
... For uniquely identifying and keeping track of every individual contribution to the pool, the designs with overlapping pools were found to be effective and accurate [2,[21][22][23]. Among the strategies that have been studied for assigning the individuals into overlapping pools, we found mentioned in the literature the DNA Sudoku approach [9] and the Shifted Transversal Design (STD) [24,25]. Both present a deterministic algorithm for recovering the individual test results from the pools. ...
... Both present a deterministic algorithm for recovering the individual test results from the pools. We have also noted other approaches as compressed sensing [2,24,26] which are particularly suitable for processing the rare variants and incorporate probabilities in the decoding step. Our design is a simple case of STD which partitions the samples to be pooled into repeated blocks, where each block corresponds to a pair of layers [20]. ...
Article
Full-text available
Background Despite continuing technological advances, the cost for large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to drop the amount of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, such as application in wildlife monitoring, plant and animal breeding, but it is in essence species-agnostic. The proposed approach consists in grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and a local likelihood-aware phasing algorithm closely modeled on MaCH that we implemented. Results We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thus it outperforms the Beagle imputation model which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts. Conclusions We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.
... We implemented the verification algorithms of group-testing aggregate entity authentication for EA[GT, AM X ], EEA[GT, AM X ], and EA[GT, AM H ]. We used the MAC function HMAC-SHA-256 for tagging and SHA-256 to aggregate tags for AM H . For GT, we adopted non-adaptive group testing and used d-disjunct matrices generated by the shifted transversal design (STD) [30], where d is the upper bound on the number of invalid entities. ...
Article
Full-text available
It is common to implement challenge-response entity authentication with a MAC function. In such an entity authentication scheme, aggregate MAC is effective when a server needs to authenticate many entities. Aggregate MAC aggregates multiple tags (responses to a challenge) generated by entities into one short aggregate tag so that the entities can be authenticated simultaneously regarding only the aggregate tag. Then, all associated entities are valid if the pair of a challenge and the aggregate tag is valid. However, a drawback of this approach is that invalid entities cannot be identified when they exist. To resolve the drawback, we propose group-testing aggregate entity authentication by incorporating group testing into entity authentication using aggregate MAC. We first formalize the security requirements and present a generic construction. Then, we reduce the security of the generic construction to that of aggregate MAC and group testing. We also enhance the generic construction to instantiate a secure scheme from a simple and practical but weaker aggregate MAC scheme. Finally, we show some results on performance evaluation.
Preprint
Background Group testing, combining the samples of multiple patients into a single pool to be tested for infection, is an approach to increase throughput in clinical diagnostic and population testing by reducing the number of tests required. In order to further increase the throughput and accuracy of these approaches, mathematicians regularly devise novel combinatorial methods. However, although these novel methods are easily validated in silico, they are often never imple- mented in diagnostic laboratories because of the lack of clear and standardised pathways to clinical validation. Methods We develop a standardised analytical workflow that makes use of high-throughput automation and virus-like particle standards to validate theoretical group testing approaches. We then utilise specially developed virus-like particles for SARS-CoV-2, Influenza A, Influenza B, and Respiratory Syncytial Virus (RSV) to develop and validate a novel multiplex group testing approach based on simulated annealing and Bayesian optimization. Our approach improves the inference of positive samples in group testing, leveraging the quantitative nature of RT-qPCR test results. Results Our results show a higher positive predictive value of our novel approach for the inference of positive samples compared to the standard approach using binary test outcomes. In large-scale surveillance testing our method can greatly reduce the number of false positive identifications. Our in vitro findings show the viability of group testing for multiplexed testing of respiratory infections and demonstrate the potential of a novel inference method. Both innovations increase the number of people that can be tested with the available resources, which is particularly important in low- resource settings.
Chapter
In this paper, we comprehensively study group testing aggregate signatures that have functionality of both keyless aggregation of multiple signatures and identifying an invalid message from the aggregate signature, in order to reduce a total amount of signature-size for lots of messages. Our contribution is (i) to formalize strong security notions including soundness for group testing aggregate signatures by taking into account related work such as fault-tolerant aggregate signatures and non-interactive aggregate MACs with detecting functionality (i.e., symmetric case); (ii) to construct group testing aggregate signatures from aggregate signatures in a generic and comprehensive way; and (iii) to present an aggregate signature scheme which we can apply to our generic construction of group testing aggregate signatures with the formalized security.KeywordsAggregate signatureDigital signatureGroup testing
Article
Full-text available
Simplified pooling designs employ rows, columns, and principal diagonals from square and rectangular plates. The requirement that every two samples be tested together in exactly one pool leads to a novel combinatorial configuration: The union jack design. Existence of union jack designs is settled affirmatively whenever the ordern is a prime andn3 (mod 4).
Article
Pooling (or “group testing”) designs for screening clone libraries for rare “positives” are described and compared. We focus on non-adaptive designs in which, in order both to facilitate automation and to minimize the total number of pools required in multiple screenings, all the pools are specified in advance of the experiments. The designs considered include deterministic designs, such as set-packing designs, the widely-used “row and column” designs and the more general “transversal” designs, as well as random designs such as “random incidence” and “random k-set” designs. A range of possible performance measures is considered, including the expected numbers of unresolved positive and negative clones, and the probability of a one-pass solution. We describe a flexible strategy in which the experimenter chooses a compromise between the random k-set and the set-packing designs. In general, the latter have superior performance while the former are nearly as efficient and are easier to construct.
Article
In this paper, we give an overview of Combinatorial Group Testing algo-rithms which are applicable to DNA Library Screening. Our survey focuses on several classes of constructions not discussed in previous surveys, provides a general view on pooling design constructions and poses several open questions arising from this view.
Article
We use the subset containment relation to construct a probabilistic nonadaptive group testing design and decoding algorithm that, in the presence of testing errors, identifies many positives in a population. We give a lower bound for the expected portion of positives identified as a function of an upper bound on the number of testing errors.
Article
We describe efficient methods for screening clone libraries, based on pooling schemes that we call "random k-sets designs." In these designs, the pools in which any clone occurs are equally likely to be any possible selection of k from the v pools. The values of k and v can be chosen to optimize desirable properties. Random k-sets designs have substantial advantages over alternative pooling schemes: they are efficient, flexible, and easy to specify, require fewer pools, and have error-correcting and error-detecting capabilities. In addition, screening can often be achieved in only one pass, thus facilitating automation. For design comparison, we assume a binomial distribution for the number of "positive" clones, with parameters n, the number of clones, and c, the coverage. We propose the expected number of resolved positive clones--clones that are definitely positive based upon the pool assays--as a criterion for the efficiency of a pooling design. We determine the value of k that is optimal, with respect to this criterion, as a function of v, n, and c. We also describe superior k-sets designs called k-sets packing designs. As an illustration, we discuss a robotically implemented design for a 2.5-fold-coverage, human chromosome 16 YAC library of n = 1298 clones. We also estimate the probability that each clone is positive, given the pool-assay data and a model for experimental errors.