GARD: A Genetic Algorithm for Recombination Detection
Sergei L. Kosakovsky Pond (a,∗), David Posada (b), Michael B. Gravenor (c),
Christopher H. Woelk (a) and Simon D.W. Frost (a)
(a) Department of Pathology, University of California San Diego, La Jolla, California, 92093, USA
(b) Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Universidad de Vigo, Vigo
(c) School of Medicine, University of Wales, Swansea, United Kingdom.
Motivation: Phylogenetic and evolutionary inference can be severely
misled if recombination is not accounted for, hence screening for it
should be an essential component of nearly every comparative study.
The evolution of recombinant sequences cannot be properly explained
by a single phylogenetic tree, but several phylogenies may be
used to correctly model the evolution of non-recombinant fragments.
Results: We developed a likelihood-based model selection procedure
that uses a genetic algorithm to search multiple sequence
alignments for evidence of recombination breakpoints and identify
putative recombinant sequences. GARD is an extensible and
intuitive method that can be run efficiently in parallel. Extensive
simulation studies show that the method nearly always outperforms
other available tools, both in terms of power and accuracy, and that
the use of GARD to screen sequences for recombination ensures
good statistical properties for methods aimed at detecting positive
selection.
Availability: Freely available. http://www.datamonkey.org/GARD/
Recombination can have a profound impact on the evolutionary
process and is of interest in its own right. In HIV-1, for instance,
recombination rates can rival mutation rates (Zhuang et al., 2002).
Recombination can adversely affect the power and accuracy of
fundamentally important tools of molecular evolutionary analyses:
phylogenetic reconstruction (Posada & Crandall, 2002), molecular
clock inference (Schierup & Hein, 2000) and the detection of
positively selected sites (Shriner et al., 2003). Consequently, reliable
tools for discovering recombination are a critical part of any
phylogenetic analysis. A diverse array of algorithms and software
tools for detection of recombination has been published. However,
when benchmarked on simulated (Posada & Crandall, 2001) and
biological (Posada, 2002) data, the methods often gave contradictory
results, and no definitive recommendation on which approach
should be considered the “gold standard” could be made. We
developed a robust and extensible approach, Genetic Algorithm
Recombination Detection (GARD), to screen multiple sequence
alignments for evidence of phylogenetic incongruence and to identify
the number and location of breakpoints and the sequences involved in
putative recombination events. Using simulated and biological data sets,
we have shown (Kosakovsky Pond et al., 2006) that GARD outperforms
the best currently available tools in terms of power and
accuracy in a wide range of evolutionary scenarios.
∗To whom correspondence should be addressed.
METHODS AND ALGORITHMS
We model recombinant sequences by allowing S ≥ 1 non-recombinant
alignment fragments, reconstructing a separate phylogenetic tree for each
fragment and evaluating the goodness-of-fit of the model using the small
sample Akaike’s Information Criterion (AICc; Sugiura, 1978) computed with
standard phylogenetic likelihood methods and point substitution models (see
Kosakovsky Pond et al. (2006) for details). The computationally challenging
component of the model is the search for the locations of S − 1 breakpoints,
a problem of O(L^S) complexity (L denotes the length of the alignment).
When S = 2, an exhaustive examination of all possible locations for the
single breakpoint can be undertaken. This single breakpoint (SBP) method
performs surprisingly well (Kosakovsky Pond et al., 2006) when a dichotomous
classification of alignments into recombinant or non-recombinant is
desired, and can be run quickly in a parallel computing environment.
When S > 2, we utilize an aggressive population-based hill-climber,
the CHC genetic algorithm (Eshelman, 1991), to search the space of breakpoint
locations, encoded as a binary vector of sorted concatenated breakpoint
positions. CHC always retains the most fit individual from the previous
generation and performs two basic operations on individuals currently in the
population:
1. When two individuals, b_1 and b_2, are picked to mate, their offspring is
equally likely to inherit bit b_i from either parent.
2. If the diversity of the sample (measured by the range of AICc scores
normalized by the score of the best individual) falls below a fixed
threshold, then all individuals in the population, excluding the most fit one,
have a proportion of randomly selected bits toggled.
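The two operations above can be illustrated with a minimal Python sketch. It is not the HyPhy implementation: the function names, the diversity threshold and the fraction of toggled bits are hypothetical choices made for illustration, and individuals are assumed to be equal-length lists of 0/1 bits scored by AICc (lower is better, and the best score is assumed to be non-zero).

```python
import random

def uniform_crossover(b1, b2, rng):
    """Operation 1: each offspring bit comes from either parent with probability 1/2."""
    return [x if rng.random() < 0.5 else y for x, y in zip(b1, b2)]

def diversity(scores):
    """Range of AICc scores normalized by the best (lowest) score."""
    best = min(scores)
    return (max(scores) - best) / abs(best)

def restart_if_converged(population, scores, rng,
                         threshold=0.001, toggle_fraction=0.25):
    """Operation 2: if diversity drops below `threshold` (a hypothetical value),
    toggle a fraction of randomly chosen bits in every individual except the
    fittest one. Otherwise return an unchanged copy of the population."""
    if diversity(scores) >= threshold:
        return [ind[:] for ind in population]
    best_idx = min(range(len(scores)), key=scores.__getitem__)
    n_bits = len(population[0])
    n_toggle = max(1, round(toggle_fraction * n_bits))
    new_population = []
    for i, ind in enumerate(population):
        child = ind[:]
        if i != best_idx:  # the most fit individual is always retained as is
            for pos in rng.sample(range(n_bits), n_toggle):
                child[pos] ^= 1
        new_population.append(child)
    return new_population
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing GA settings.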
For fixed S, the algorithm terminates if the best score remains unchanged
over 100 consecutive generations. A typical GA run considers 10^3 to 10^4
possible models before converging. To infer S, we start with S = 1 segment
and increase S by 1 for subsequent GA runs, until the AICc score of
the best model fails to improve further. GARD and SBP have been implemented
as HyPhy (Kosakovsky Pond et al., 2005) language scripts enabled
to run in an MPI environment. Presently, GARD is hosted on our 40-node
cluster and can be accessed via a Web front-end. Standalone scripts or
cluster installation instructions can be obtained from the authors upon request
and will be made available online if there is sufficient interest. The current
implementation, shown schematically in Figure 1, allows the user to:
1. Upload an alignment of sequences to screen. At present up to 50
aligned DNA/RNA sequences with up to 10000 nucleotides will be
accepted. Both numbers will be increased periodically.
2. Select an appropriate model of nucleotide evolution (Kosakovsky Pond
& Frost, 2005) and specify the distribution used to model site-to-site
variation in substitution rates.
3. Run SBP or GARD screens for recombination.
4. Visualize and download the results of recombination screens, including:
(i) the number and best location of inferred breakpoints, and
the improvement in AICc score achieved by the multiple breakpoint
model (if any); (ii) model averaged support for the location of
breakpoints, useful for assessing the degree of confidence; (iii) phylogenetic
trees inferred from each non-recombinant fragment; (iv) a NEXUS
file containing the alignment, inferred partitions and trees.
5. Download result files and HyPhy scripts needed for additional processing
and inference (e.g. for further tests of phylogenetic incongruence) and
run them locally.

Fig. 1. GARD and SBP server schematic flowchart and sample output.
Flowchart steps: upload and validate an alignment (FASTA, NEXUS, PHYLIP);
construct a NJ tree; SBP analysis or GARD analysis. Output panels: model
averaged support; breakpoint placement support using c-AIC.

© The Author (2006). Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com
Associate Editor: Christos Ouzounis
Bioinformatics Advance Access published November 16, 2006
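The text does not spell out how the model averaged support for breakpoint locations is computed. One common way to obtain such support, assumed here purely for illustration, is to convert the AICc scores of the examined models into Akaike weights and sum the weights of all models that place a breakpoint at a given site. The function names and the representation of a model as a tuple of breakpoint positions are hypothetical.

```python
import math

def akaike_weights(aicc_scores):
    """Convert AICc scores into normalized weights (best model gets the largest)."""
    best = min(aicc_scores)
    relative = [math.exp(-0.5 * (score - best)) for score in aicc_scores]
    total = sum(relative)
    return [r / total for r in relative]

def breakpoint_support(models, aicc_scores):
    """Sum the Akaike weights of every model containing each breakpoint.

    models: list of tuples of breakpoint positions, one tuple per model.
    Returns a dict mapping breakpoint position -> model averaged support.
    """
    support = {}
    for breakpoints, weight in zip(models, akaike_weights(aicc_scores)):
        for position in breakpoints:
            support[position] = support.get(position, 0.0) + weight
    return support
```

With three toy models such as `[(400,), (410,), (400, 900)]`, site 400 collects weight from two models, so near-optimal alternatives contribute to its support rather than being discarded.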
We intend to add new features and analysis options (e.g. protein sequence
analysis) with time.
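The model selection logic described in this section (AICc scoring, the exhaustive single breakpoint scan for S = 2, and the stepwise increase of S) can be sketched in Python as follows. This is an illustration under stated assumptions, not the HyPhy scripts: `score_fn` and `search_fn` are hypothetical callables standing in for the likelihood fitting and the GA search, `max_segments` is an arbitrary safety cap, and what counts as the sample size in AICc (for example, the number of alignment sites) is left to the caller.

```python
def aicc(log_likelihood, n_params, n_samples):
    """Small sample AIC: -2 logL + 2p * n / (n - p - 1)."""
    if n_samples - n_params - 1 <= 0:
        raise ValueError("AICc requires n_samples > n_params + 1")
    return (-2.0 * log_likelihood
            + 2.0 * n_params * n_samples / (n_samples - n_params - 1))

def single_breakpoint_scan(score_fn, candidate_sites):
    """Exhaustive S = 2 search: score every candidate site, keep the lowest AICc.

    score_fn takes a tuple of breakpoints and returns the AICc of that model.
    """
    scored = [(score_fn((site,)), site) for site in candidate_sites]
    best_score, best_site = min(scored)
    return (best_site,), best_score

def infer_segments(search_fn, max_segments=10):
    """Increase S from 1 until the best AICc stops improving.

    search_fn(S) returns (breakpoints, aicc) for the best model found with
    S segments, e.g. the scan above for S = 2 and a GA run for S > 2.
    """
    best_breakpoints, best_score = search_fn(1)
    for n_segments in range(2, max_segments + 1):
        breakpoints, score = search_fn(n_segments)
        if score >= best_score:  # no improvement: keep the previous model
            break
        best_breakpoints, best_score = breakpoints, score
    return best_breakpoints, best_score
```

Because each candidate site in the scan is scored independently, the S = 2 case parallelizes trivially, which is consistent with the remark that SBP runs quickly in a parallel environment.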
In practice many widely-used molecular analyses may be confounded
by the presence or absence of recombination. Hence, screening for
recombination should be an integral part of phylogenetic analyses. We have
developed an intuitive and powerful method for detecting evidence of
recombination in alignments of DNA sequences. It is able to provide
estimates for the number and location of breakpoints, and infer
segment-specific phylogenetic trees. GARD does not require a
non-recombinant reference alignment and recombination between ancestral
sequences is also accommodated. Arbitrarily complex models
of point substitution (e.g. those allowing site-to-site variation in
substitution rates, or codon models) can be easily incorporated. GARD
outperformed other methods in our benchmarks (Kosakovsky Pond et al., 2006)
and can be run in parallel on a cluster of
computers, and so is well suited to screen for recombination in large
This research was supported in part by the National Institutes of Health
(AI43638, AI47745, AI57167 and R01-GM66276), the University of California
Universitywide AIDS Research Program (IS 02-SD-701), and by a
Developmental Award to SDWF and SLKP (AI36214). DP was also supported
by the “Ramón y Cajal” program of the Spanish government.
Eshelman, L. J. (1991) The CHC adaptive search algorithm: How to do safe search
when engaging in nontraditional genetic recombination. In Foundations of Genetic
Algorithms (Spatz, B. M., ed.). Morgan Kaufmann, San Mateo, CA, pp. 265–283.
Kosakovsky Pond, S. L. & Frost, S. D. W. (2005) Datamonkey: Rapid detection of
selective pressure on individual sites of codon alignments. Bioinformatics, 21 (10),
Kosakovsky Pond, S. L., Frost, S. D. W. & Muse, S. V. (2005) HyPhy: Hypothesis
testing using phylogenies. Bioinformatics, 21 (5), 676–679.
Kosakovsky Pond, S. L., Posada, D., Gravenor, M. B., Woelk, C. & Frost, S.
D. W. (2006) Automated phylogenetic detection of recombination using a genetic
algorithm. Mol. Biol. Evol., doi:10.1093/molbev/msl051.
Posada, D. (2002) Evaluation of methods for detecting recombination from DNA
sequences: empirical data. Mol Biol Evol, 19 (5), 708–717.
Posada, D. & Crandall, K. A. (2001) Evaluation of methods for detecting recombination
from DNA sequences: computer simulations. Proc Nat Acad Sci, 98 (24), 13757–
Posada, D. & Crandall, K. A. (2002) The effect of recombination on the accuracy of
phylogeny estimation. J Mol Evol, 54 (3), 396–402.
Schierup, M. & Hein, J. (2000) Recombination and the molecular clock. Mol Biol Evol,
Shriner, D., Nickle, D. C., Jensen, M. A. & Mullins, J. (2003) Potential impact of
recombination on sitewise approaches for detecting positive natural selection. Genet
Res, 81, 115–121.
Sugiura, N. (1978) Further analysis of the data by Akaike’s information criterion and
the finite corrections. Comm Stat Theory Methods, A7, 13–26.
Zhuang, J., Jetzt, A. E., Sun, G., Yu, H., Klarmann, G., Ron, Y., Preston, B. D. &
Dougherty, J. P. (2002) Human immunodeficiency virus type 1 recombination: rate,
fidelity, and putative hot spots. J Virol, 76 (22), 11273–11282.