Computer Programs and Methodologies for the Simulation of DNA Sequence Data with Recombination

Frontiers
Frontiers in Genetics

Computer simulations are useful in evolutionary biology for hypothesis testing, to verify analytical methods, to analyze interactions among evolutionary processes, and to estimate evolutionary parameters. In particular, the simulation of DNA sequences with recombination may help in understanding the role of recombination in diverse evolutionary questions, such as the genome structure. Consequently, plenty of computer simulators have been developed to simulate DNA sequence data with recombination. However, the choice of an appropriate tool, among all currently available simulators, is critical if recombination simulations are to be biologically meaningful. This review provides a practical survival guide to commonly used computer programs and methodologies for the simulation of coding and non-coding DNA sequences with recombination. It may help in the correct design of computer simulation experiments of recombination. In addition, the study includes a review of simulation studies investigating the impact of ignoring recombination when performing various evolutionary analyses, such as phylogenetic tree and ancestral sequence reconstructions. Alternative analytical methodologies accounting for recombination are also reviewed.
MINI REVIEW ARTICLE
published: 01 February 2013
doi: 10.3389/fgene.2013.00009
Miguel Arenas*
Centre for Molecular Biology “Severo Ochoa,” Consejo Superior de Investigaciones Científicas, Madrid, Spain
Edited by:
Badri Padhukasahasram, Ford, USA
Reviewed by:
Bjørn Østman, Michigan State
University, USA
Marcos Perez-Losada, Centro de
Investigação em Biodiversidade e
Recursos Genéticos, Portugal
*Correspondence:
Miguel Arenas, Centre for Molecular
Biology “Severo Ochoa,” Consejo
Superior de Investigaciones
Científicas Universidad Autónoma
de Madrid, C/Nicolás Cabrera, 1,
28049 Cantoblanco, Madrid, Spain.
e-mail: marenas@cbm.uam.es
Keywords: simulation, recombination, recombination breakpoints, recombination hotspots, DNA sequences,
recombination phylogenetic bias
INTRODUCTION
Recombination constitutes a basic and dominant mechanism in
molecular evolution, increasing genetic diversity before natural
selection operates on the new sequence. Recombination is widespread
across nuclear genomes (e.g., Awadalla, 2003; Tsaousis et al.,
2005; Fraser et al., 2007; Gaut et al., 2007; Duret and Arndt, 2008)
and the importance of understanding it has long been recognized,
with crucial implications for genome structure (Reich et al.,
2001), phenotypic diversity (Zhang et al., 2002), and genetic
diseases (Daly et al., 2001). Moreover, ignoring recombination may
bias phylogenetic reconstructions (e.g., Posada, 2001; Posada and
Crandall, 2002; Beiko et al., 2008), and the derived inferences (e.g.,
Schierup and Hein, 2000a; Feil et al., 2001; Anisimova et al., 2003;
Arenas and Posada, 2010a,b,c).
The evolutionary importance of recombination (e.g., Robertson
et al., 1995; Lukashev, 2005) calls for its accurate detection
and measurement (see Martin et al., 2011). Although some analytical
methods have shown better overall performance than
others (Posada and Crandall, 2001; Wiuf et al., 2001), the choice
of an appropriate tool also depends on the particular analysis
(e.g., detection of recombination breakpoints or estimation of
recombination rates), computational costs (some methods are
computationally expensive), and the genetic marker. I recommend
two reviews to help users choose appropriate methods and
computer tools for recombination inference (Posada et al., 2002;
Martin et al., 2011).
Computer simulations aim to mimic real-world processes. They
allow the study of mechanisms that may alter processes or the
understanding of complex systems that are analytically intractable
(Peck, 2004). Indeed, the simulation of evolutionary histories is
commonly used for hypothesis testing (e.g., Arenas et al., 2008;
Pierron et al., 2011), to verify and compare analytical methods
or programs (e.g., Westesson and Holmes, 2009; Marttinen et al.,
2012), to analyze interactions among evolutionary processes (e.g.,
Arenas et al., 2012, 2013), or to estimate evolutionary parameters
(e.g., Wilson et al., 2009; Beaumont, 2010). Importantly, the choice
of an appropriate simulator is critical because simulations should
be as realistic as possible in order to mimic a given biological
scenario. Although several studies have already reviewed computer
simulators in population genetics from global perspectives (e.g.,
Liu et al., 2008; Arenas, 2012; Arenas and Posada, 2012; Hoban
et al., 2012), they have not explored particular methodologies for
the simulation of DNA sequences with recombination.
The present study provides an overview of the capabilities
of available computer tools and methodologies, and suggests
recommendations, for the simulation of DNA sequences with
recombination. It also describes some applications of simulated
datasets with recombination to show the importance of includ-
ing recombination in evolutionary analyses. Alternative analytical
methodologies that consider recombination are also suggested.
COMPUTER PROGRAMS FOR THE SIMULATION OF DNA
DATA UNDER RECOMBINATION
Recombination can be simulated by the two major simulation
approaches commonly used in population genetics, the forward
in time (forward-time, where the evolutionary history of an entire
population is simulated from the past to the present; e.g., Epperson
et al., 2010) and the coalescent (backward-time, which describes a
www.frontiersin.org February 2013 | Volume 4 | Article 9 | 1
Arenas Simulation of DNA recombination
backward in time genealogical process from a sample of genes to
a single ancestral copy; e.g., Nordborg, 2007; Wakeley, 2008). The
forward-time approach can simulate complex processes, including
gene–gene interactions and complex selection (e.g., Calafell et al.,
2001; Peng et al., 2007), but coalescent simulations are computationally
faster and can be recommended for extensive simulation
studies (e.g., Beaumont et al., 2002). Table 1 shows an overview
of currently available computer programs, for both coalescent
and forward-time approaches, to simulate DNA sequences with
recombination.
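To make the backward-time idea concrete, the following is a minimal sketch of how the number of ancestral lineages changes in a coalescent with recombination. It assumes the simplified formulation in which, with k lineages and time in coalescent units, coalescence occurs at rate k(k-1)/2 and recombination at rate k*rho/2, with every lineage treated as carrying ancestral material along the whole sequence. The function name and the parameter `rho` are illustrative and are not taken from any of the programs reviewed here.

```python
import random


def simulate_arg_events(n_samples, rho, rng):
    """Simulate the sequence of events of a simplified coalescent with recombination.

    With k ancestral lineages, coalescence occurs at rate k(k-1)/2 and
    recombination at rate k*rho/2 (assumed scaling, time in coalescent units).
    Returns a list of (time, event_type, lineages_after_event).
    """
    k = n_samples
    t = 0.0
    events = []
    while k > 1:
        coal_rate = k * (k - 1) / 2.0
        rec_rate = k * rho / 2.0
        total = coal_rate + rec_rate
        # Waiting time to the next event, going backward in time.
        t += rng.expovariate(total)
        if rng.random() < rec_rate / total:
            k += 1  # a recombination splits one lineage into two ancestors
            events.append((t, "recombination", k))
        else:
            k -= 1  # a coalescence merges two lineages into one ancestor
            events.append((t, "coalescence", k))
    return events


if __name__ == "__main__":
    for time, kind, k in simulate_arg_events(5, 1.0, random.Random(7)):
        print(f"{time:.3f}\t{kind}\t{k}")
```

Only the event history is tracked here; a real simulator also records which lineages are involved and where each breakpoint falls, which is what defines the embedded trees.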
SIMULATION OF CODING DNA SEQUENCES WITH RECOMBINATION
Direct simulation of coding DNA sequences with recombination
can only be performed with a few programs. Using the coalescent
approach, the programs Recodon (Arenas and Posada, 2007),
CodonRecSim (Anisimova et al., 2003), and NetRecodon (Arenas
and Posada, 2010a) allow such simulation, but only NetRecodon
does not force recombination breakpoints to occur between
codons, thus allowing more realistic simulations (see Arenas and
Posada, 2010a). Concerning the forward-time approach, only the
programs GenomePop (Carvajal-Rodriguez, 2008) and SFS_CODE
(Hernandez, 2008) implement the simulation of coding sequences
with recombination.
Evolutionary scenarios that are not implemented in these programs
can be simulated by the following alternative methodology,
which is based on the combination of two different simulators.
First, we simulate an evolutionary history with recombination [an
ancestral recombination graph (ARG, see Figure 1A), which contains
a tree for each recombinant fragment; Figures 1B–D]. This
procedure can be carried out using, for example, the program ms
(Hudson, 2002); see also other evolutionary history simulators in
Hoban et al. (2012). Next, we simulate the molecular evolution of each
coding fragment, according to a user-specified codon-substitution
model, along its corresponding simulated tree (further details in
Yang, 2006; Fletcher and Yang, 2009). Finally, we concatenate
the simulated coding fragments. The simulation of coding
sequence evolution along given trees can be performed, for example,
with the program INDELible (Fletcher and Yang, 2009); see
also other software in Arenas (2012) and Arenas and Posada (2012).
The limitation of this methodology is that recombination breakpoints
are always assumed to occur between codons and not within
codons.
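The bookkeeping around this two-step procedure can be sketched as follows. The sketch assumes that the evolutionary-history simulator prints one line per recombinant fragment of the form `[length]newick;`, which is how ms is understood to report per-segment trees when tree output is requested. All function names are hypothetical, and the codon-substitution step itself (e.g., a run of INDELible on each tree) is deliberately left out.

```python
import re


def parse_fragment_trees(lines):
    """Turn lines such as '[99](1:0.5,2:0.5);' into (start, end, newick) tuples.

    Coordinates are 0-based and end-exclusive; lines that do not match the
    assumed '[length]tree;' format are skipped.
    """
    pattern = re.compile(r"^\[(\d+)\](.+;)\s*$")
    fragments = []
    start = 0
    for line in lines:
        match = pattern.match(line.strip())
        if match is None:
            continue
        length = int(match.group(1))
        fragments.append((start, start + length, match.group(2)))
        start += length
    return fragments


def breakpoints_respect_codons(fragments):
    """True if every fragment starts on a codon boundary (start divisible by 3)."""
    return all(start % 3 == 0 for start, _end, _tree in fragments)


def concatenate_fragments(per_fragment_alignments):
    """Join per-fragment alignments ({taxon: sequence}) in fragment order."""
    taxa = list(per_fragment_alignments[0])
    return {t: "".join(aln[t] for aln in per_fragment_alignments) for t in taxa}


if __name__ == "__main__":
    trees = parse_fragment_trees(["[99](1:0.5,2:0.5);", "[102](1:0.7,2:0.7);"])
    print(trees, breakpoints_respect_codons(trees))
```

Checking `breakpoints_respect_codons` before the sequence-evolution step makes the limitation noted above explicit: fragments whose boundaries fall inside a codon cannot be evolved independently under a codon model.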
SIMULATION OF NUCLEOTIDE SEQUENCES WITH RECOMBINATION
A number of computer programs can directly simulate non-coding
DNA sequences under recombination (see Table 1). As in
the previous subsection, non-coding DNA sequences under
other evolutionary scenarios, which are not
covered in Table 1, can be simulated by combining two
computer tools. We can use a simulator of evolutionary
histories with recombination (e.g., ms or msms; Ewing and Hermisson, 2010)
followed by a simulator of non-coding DNA sequence evolution (e.g.,
INDELible; Seq-Gen, Rambaut and Grassly, 1997; EVOLVER, Yang,
1997; or indel-Seq-Gen, Strope et al., 2009).
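As an illustration of what the second tool does on each branch of a fragment tree, here is a minimal sketch of nucleotide evolution under the Jukes-Cantor (JC) model. It relies on the standard JC result that, for a branch of length d (expected substitutions per site), a site ends in a different nucleotide with probability 3/4 (1 - exp(-4d/3)), with the three alternatives equally likely. The function name is illustrative; dedicated simulators handle full trees, richer models, and rate variation.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def evolve_branch_jc69(sequence, branch_length, rng):
    """Evolve a nucleotide sequence along one branch under the JC model.

    branch_length is the expected number of substitutions per site.
    """
    # Probability that a site ends the branch in a different state.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    evolved = []
    for base in sequence:
        if rng.random() < p_change:
            # Under JC the three alternative nucleotides are equally likely.
            evolved.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            evolved.append(base)
    return "".join(evolved)


if __name__ == "__main__":
    print(evolve_branch_jc69("ACGTACGTACGT", 0.2, random.Random(11)))
```

Applying this function recursively from the root to the tips of each embedded tree, and then joining the fragments, reproduces the two-tool procedure in miniature.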
SIMULATION OF GENOMES WITH RECOMBINATION HOTSPOTS
It is known that the recombination rate is not homogeneous
throughout the genome and that some regions (hotspot regions)
are more likely to undergo recombination (e.g., Gabriel et al.,
2002; Zhuang et al., 2002). Consequently, recombination hotspots
should be considered for realistic genome simulation.
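One simple way to picture a hotspot model is a piecewise-constant map of relative recombination rates from which breakpoints are drawn. The sketch below is an assumption-laden illustration, not the algorithm of any program cited here: the rate values, the interval convention, and the function names are all hypothetical.

```python
import random


def build_rate_map(length, background, hotspots):
    """Relative recombination rate for each junction between adjacent sites.

    Junction i lies between sites i and i+1 (0-based), so a sequence of
    `length` sites has length-1 junctions. `hotspots` is a list of
    (start, end, rate) intervals over junctions, end-exclusive.
    """
    rates = [background] * (length - 1)
    for start, end, rate in hotspots:
        for i in range(start, min(end, length - 1)):
            rates[i] = rate
    return rates


def sample_breakpoints(rates, n_events, rng):
    """Draw breakpoint junctions with probability proportional to their rate."""
    junctions = range(len(rates))
    return sorted(rng.choices(junctions, weights=rates, k=n_events))


if __name__ == "__main__":
    # A 1,000-site region with one hotspot that is 50 times hotter than background.
    rate_map = build_rate_map(1000, 1.0, [(400, 420, 50.0)])
    print(sample_breakpoints(rate_map, 10, random.Random(5)))
```

With these illustrative values the 20 hotspot junctions carry about half of the total rate, so roughly half of the sampled breakpoints are expected to cluster there.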
The simulation of genomes with recombination requires
robust and memory-efficient simulators. Programs like fastsimcoal
(Excoffier and Foll, 2011) or mlcoalsim (Ramos-Onsins and
Mitchell-Olds, 2007) allow for efficient simulations of non-coding
genomic regions under recombination (including recombination
hotspots). However, these tools do not implement a variety of
substitution models (e.g., codon models), or particular evolutionary
mechanisms like selection; this may be problematic if we are trying
to mimic genome-wide data (see Arbiza et al., 2011).
Again, an alternative methodology consists of the use of two
simulators. A few programs currently implement the simulation
of recombination hotspots, namely, SNPsim (Wiuf and Posada,
2003), cosi (Schaffner et al., 2005), GENOME (Liang et al., 2007),
mbs (Teshima and Innan, 2009), and msHOT (Hellenthal and
Stephens, 2007). Although all these programs simulate particular
genetic markers (such as SNPs or STRs), DNA sequence evolu-
tion can be simulated upon phylogenetic trees produced by these
programs if we use the two-step procedure described above.
SIMULATION OF RECOMBINATION PHYLOGENETIC NETWORKS
In order to represent a full evolutionary history with recombina-
tion, phylogenetic networks should be used instead of forcing the
genealogy onto a single tree (Huson and Bryant, 2006). There are
two commonly used methodologies for the simulation of recom-
bination networks: direct simulation of the ARG (e.g., Figure 1A)
or combining the simulated trees for each recombinant fragment
(e.g., Figures 1B–D). To my knowledge, only two programs can
output a simulated ARG, namely, Serial NetEvolve (Buendia
and Narasimhan, 2006) and NetRecodon (Arenas and Posada,
2010a), where the ARG can be visualized and analyzed using the
NetTest web server (Arenas et al., 2010)¹. On the other hand,
trees can be combined to generate a network using tools like
CombineTrees (for a review, see Woolley et al., 2008)².
RECOMBINATION SIMULATION FOR ANALYZING THE
INFLUENCE OF RECOMBINATION ON PHYLOGENETIC
INFERENCES
This section outlines three computer simulation studies where
ignoring recombination leads to biased phylogenetic inferences.
Alternative phylogenetic inference methodologies considering
recombination are also suggested.
INFLUENCE OF RECOMBINATION ON PHYLOGENETIC TREE
RECONSTRUCTION
Schierup and Hein (2000a) simulated samples under the coa-
lescent with recombination (Hudson, 1983). Then, from the
simulated genealogy, they simulated nucleotide sequence evolu-
tion under the Jukes-Cantor (JC) and Kimura’s two-parameter
(K2P) nucleotide substitution models of evolution. The simu-
lated datasets were analyzed using programs for phylogenetic tree
¹ http://darwin.uvigo.es/software/nettest/
² http://applications.lanevol.org/combineTrees/
Frontiers in Genetics | Evolutionary and Population Genetics February 2013 | Volume 4 | Article 9 | 2
Table 1 | Commonly used software for direct simulation of DNA sequences under recombination.

| Program | Evolutionary history | Recombination algorithm | Recombination hotspots | Other evolutionary processes | Substitution model | Rate variation | Intracodon recombination | Indels | OS | Citation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CodonRecSim | Coalescent | SCR | No | No | Cod(b): GY94 | No | No | No | SC, Win | Anisimova et al. (2003) |
| Recodon/NetRecodon | Coalescent | SCR(a) | No | D, Pm | Nt: All; Cod(b): GY94 | G, I | Yes (NetRecodon) | No | All | Arenas and Posada (2007, 2010a) |
| SIMCOAL2 | Coalescent | SCR | Yes | D, Pm | Nt: JC, K2P | No | No | No | Linux, Win | Laval and Excoffier (2004) |
| Fastsimcoal | Coalescent | SMC | Yes | D, Pm | Nt: JC, K2P | No | No | No | Linux, Mac, Win | Excoffier and Foll (2011) |
| Mlcoalsim | Coalescent | SCR | Yes | D, Pm | Nt: JC, K2P | G, I | No | No | All | Ramos-Onsins and Mitchell-Olds (2007) |
| TREEVOLVE | Coalescent | SCR | No | D, Pm | Nt: All | G | No | No | SC, Mac | Grassly and Rambaut (1997) |
| SPLATCHE2 | Forward, coalescent | SCR | No | D, Pm | Nt: JC, K2P | No | No | No | Linux, Win | Ray et al. (2010) |
| GenomePop | Forward | CO | Yes | D, Pm, S | Nt: JC, GTR; Cod: MG94 | No | Yes | No | SC, Linux, Win | Carvajal-Rodriguez (2008) |
| SFS_CODE | Forward | CO, SB | Yes | D, Pm, S | Nt: All; Cod: Nt(c) | G | No | Yes | All | Hernandez (2008) |
| SimuPop | Forward | CO | Yes | D, Pm, S | Nt: All | No | No | Yes | All | Peng and Kimmel (2005) |

“Recombination algorithm”: “SCR” means the standard coalescent with recombination to simulate the ARG (Hudson, 1983); “SMC” indicates the sequential Markovian coalescent, which is an approximation of the SCR (further details in McVean and Cardin, 2005); “CO” means crossing over recombination model (see Padhukasahasram et al., 2008); “SB” indicates sex-biased recombination. “Other evolutionary processes”: “D,” “Pm,” and “S” mean demographics, population structure with migration, and selection, respectively. “Substitution model” refers to substitution models based on nucleotide “Nt” or codon “Cod” sequences; “Nt: All” means all nucleotide substitution models (JC, . . ., GTR). “Rate variation” indicates variable substitution rate across sites (G, gamma distribution; I, proportion of invariable sites). “Intracodon recombination” indicates if recombination breakpoints can occur at any codon position (Yes) or are forced to occur between codons (No). “OS” shows the availability of executable files in different operating systems (“All” means available for Macintosh, Windows, and Linux); “SC” means that the source code is available.
(a) The simulated ARG can be exported from NetRecodon and then can be visualized and analyzed using NetTest (Arenas et al., 2010).
(b) Under codon models, dN/dS can vary across codons.
(c) Coding sequences are simulated by nucleotide substitution models, avoiding stop codons.
FIGURE 1 | Example of an ancestral recombination graph (ARG) with the
corresponding embedded trees for each recombinant fragment. (A) ARG
based on two recombination events with breakpoints at positions 100 and
200. Dashed lines indicate branches for recombinant fragments. (B–D)
Embedded tree for each recombinant fragment. Note that topologies and
branch lengths may differ across trees. Finally, the simulation of sequence
evolution can be performed site by site along the corresponding tree (see
Yang, 2006; Fletcher and Yang, 2009).
reconstruction by both distance-based methods and maximum-likelihood
(ML) methods. Ignoring recombination biased the
inferred phylogenetic trees toward longer terminal branches,
shorter times to the most recent common ancestor (MRCA), and
incorrect topologies (Schierup and Hein, 2000a). In addition,
ignoring recombination led to overestimation of the substitution
rate heterogeneity, apparent homoplasies, and loss of the molecular
clock (Schierup and Hein, 2000a,b). Later, Posada (2001) analyzed
the molecular clock hypothesis on four empirical datasets.
In particular, the author applied a triplet likelihood ratio test (a test
for equality of evolutionary rates among three species, called a
relative-rate test, RRT), which is independent of topology and
might be unbiased by recombination. Results showed that recombinant
data did not fit the molecular clock well when
using classical likelihood ratio tests (LRT). However, the molecular
clock was not rejected when using the RRT. Thus, this test
could be recommended for testing a molecular clock in the presence
of recombination. In addition, phylogenetic incongruence
in empirical data was also observed as a consequence of ignoring
recombination (e.g., Worobey and Holmes, 1999; Feil et al.,
2001).
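For reference, the classical LRT of the molecular clock mentioned above is usually written in the following standard textbook form. The notation is introduced here for illustration and is not taken from the studies cited; n denotes the number of sequences, and the degrees of freedom reflect the n - 2 additional branch-length parameters of the unconstrained tree.

```latex
\Lambda = 2\,\bigl(\ln L_{\text{no clock}} - \ln L_{\text{clock}}\bigr),
\qquad
\Lambda \sim \chi^{2}_{\,n-2} \quad \text{under the clock hypothesis.}
```

Because the likelihoods are computed on a single tree, any misfit caused by recombination inflates the statistic, which is consistent with the rejection of the clock reported for recombinant data.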
These findings, consequently, suggest biases in derived evolutionary
analyses based on phylogenetic reconstructions that ignore
recombination. As an alternative, there are two methodologies of
phylogenetic reconstruction accounting for recombination:
• Inference of a single phylogenetic network (e.g., Figure 1A;
Griffiths and Marjoram, 1997; Huson and Bryant, 2006). Recombination
networks can be inferred by using computer programs
like SplitsTree (Huson, 1998; Huson and Bryant, 2006).
• Inference of a set of phylogenetic trees, where each tree corresponds
to the evolutionary history of each recombinant fragment
(e.g., Figures 1B–D). The methodology consists of the
detection of recombination breakpoints (for a review, see Martin
et al., 2011) followed by a phylogenetic tree reconstruction
for each recombinant fragment.
Both methodologies correctly account for recombination and
the choice should be based on the intended downstream application.
For example, the phylogenetic network allows easy visualization of
clades and phylogenetic relationships (e.g., Maughan and Redfield,
2009). By contrast, the simulation of sequence evolution requires
a phylogenetic tree for each recombinant fragment (e.g., Fletcher
and Yang, 2009).
INFLUENCE OF RECOMBINATION ON ANCESTRAL SEQUENCE
RECONSTRUCTION
Recently, Arenas and Posada (2010c) analyzed the effect of consid-
ering recombination on ancestral sequence reconstruction (ASR).
They performed extensive simulations of nucleotide, codon, and
amino acid data by using the coalescent with recombination
approach implemented in NetRecodon. They then reconstructed
ancestral sequences with different ASR methods (joint ML, mar-
ginal ML, and empirical Bayes). Results clearly indicated that
ignoring recombination biases the reconstruction of ancestral
sequences, regardless of the method or software used. This ASR
error can be partially reduced if recombination is considered
(Arenas and Posada, 2010c). The methodology consists of four
steps: the detection of recombination breakpoints, the recon-
struction of a phylogenetic tree for each recombinant fragment,
the reconstruction of ancestral fragments by using the corre-
sponding trees and, finally, the concatenation of the ancestral
fragments to generate the entire ancestral sequence. The Datamonkey
web server (Kosakovsky Pond and Frost, 2005)³ and the
Hyphy package (Kosakovsky Pond et al., 2005) have automated
the whole procedure described above to infer ancestral sequences
with consideration of recombination.
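The four steps can be sketched as a small pipeline. In the sketch below, breakpoints are assumed to be already known (step 1), and steps 2 and 3 are replaced by a column-wise majority consensus purely as a placeholder; this is not an ML or Bayesian ASR method and only marks where a real per-fragment reconstruction would be plugged in. All names are hypothetical.

```python
from collections import Counter


def split_alignment(alignment, breakpoints):
    """Split an alignment {taxon: sequence} at the given breakpoints.

    Each breakpoint is the 0-based position at which a new fragment starts.
    """
    length = len(next(iter(alignment.values())))
    bounds = [0] + sorted(breakpoints) + [length]
    return [
        {taxon: seq[a:b] for taxon, seq in alignment.items()}
        for a, b in zip(bounds, bounds[1:])
    ]


def consensus_placeholder(fragment):
    """Placeholder for tree building plus ASR: column-wise majority consensus."""
    columns = zip(*fragment.values())
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


def reconstruct_with_recombination(alignment, breakpoints,
                                   reconstruct=consensus_placeholder):
    """Reconstruct each fragment separately, then concatenate (step 4)."""
    fragments = split_alignment(alignment, breakpoints)
    return "".join(reconstruct(fragment) for fragment in fragments)


if __name__ == "__main__":
    aln = {"a": "AAACCC", "b": "AAACCT", "c": "AATCCC"}
    print(reconstruct_with_recombination(aln, [3]))
```

Passing a different `reconstruct` callable, for example a wrapper around an external ASR program run on the fragment-specific tree, would turn the placeholder into the methodology described above.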
Arenas and Posada (2010c) also analyzed empirical data, in
particular two datasets of the env region of HIV-1. They inferred
ancestral sequences both ignoring and considering recombination,
using the methodology described above, and observed a different
number of CTL epitopes depending on whether recombination
was considered or not.
INFLUENCE OF RECOMBINATION ON THE DETECTION OF MOLECULAR
ADAPTATION
The detection of molecular adaptation (based on the
non-synonymous/synonymous substitution rate ratio, hereafter
dN/dS) is commonly used at both global (entire sequence) and
local (codon) levels. Indeed, these analyses have commonly been
applied to datasets collected from highly recombinant viruses and
bacteria (e.g., Perez-Losada et al., 2009, 2011; Bozek and Lengauer,
2010). Several studies have shown the impact of recombination
on the estimation of dN/dS (e.g., Anisimova et al., 2003; Arenas
and Posada, 2010a). After simulating coding data under a
variety of codon-substitution models for heterogeneous selection
pressure (see Yang et al., 2000) and different levels of recombination,
these studies selected the heterogeneous codon models that best
fitted the simulated data by using LRTs. Results showed a weak
impact of recombination on the estimation of global dN/dS but
a strong effect on the estimation of local dN/dS, in particular
by increasing the number of falsely inferred positively selected sites (PSS).
An alternative methodology to reduce these errors consists of the
detection of recombination breakpoints followed by the reconstruction
of a phylogenetic tree for each recombinant fragment
and, finally, the estimation of dN/dS by using the corresponding
trees (see Kosakovsky Pond et al., 2006). This methodology was
applied in Perez-Losada et al. (2009, 2011). Again, the Datamonkey
web server and the Hyphy package have automated this whole
procedure to estimate dN/dS while accounting for recombination.
Recombination might also affect other evolutionary inferences.
For example, it could bias those analytical methods based on
the coalescent without recombination (e.g., BEAST; Drummond
and Rambaut, 2007). However, these influences have not yet been
rigorously evaluated.
³ http://www.datamonkey.org/
Another interesting question concerns the influence of recom-
bination on genetic diversity. Spencer et al. (2006) studied this
in humans and found that recombination only affects genetic
diversity at recombination hotspots. However, such hotspots
did not alter substitution rates, perhaps because recombination
rates were always low. By contrast, high recombination rates
(common in a variety of viruses and bacteria) may strongly
increase genetic diversity and give rise to novel lineages (e.g., He et al.,
2010).
At this point, I would suggest the approximate Bayesian computation
(ABC) approach (for a review, see Beaumont, 2010) to
estimate evolutionary parameters accounting for recombination.
ABC is based on computer simulations and provides an alternative
for those analyses where the likelihood function cannot be
computed. Simulations can be performed according to a prior distribution
for the recombination rate (among other parameters) and
then, by a rejection or a regression method, a posterior distribution
can be computed to obtain the parameter estimates (Beaumont
et al., 2002). For example, Wilson et al. (2009) applied ABC for joint
estimation of a set of evolutionary parameters, such as substitution
rate, dN/dS, and recombination rate. By this methodology, the
influence of recombination on other evolutionary mechanisms is
accounted for, but only if it is indeed implemented in the computer
simulator.
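The rejection variant of ABC can be sketched in a few lines. The example below is a toy under stated assumptions: the "simulator" is a hypothetical stand-in that returns a single summary statistic (a Poisson-distributed count of recombination events with mean equal to the candidate rate), the prior is uniform, and the tolerance is arbitrary. A real analysis would call a sequence simulator and compare several summary statistics.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (small means)."""
    limit = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def toy_simulator(rec_rate, rng):
    """Hypothetical stand-in for a simulator: returns one summary statistic."""
    return poisson(rec_rate, rng)


def abc_rejection(observed, prior_low, prior_high, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lies within `tolerance` of the data."""
    accepted = []
    for _ in range(n_draws):
        candidate = rng.uniform(prior_low, prior_high)  # draw from the prior
        simulated = toy_simulator(candidate, rng)       # simulate a dataset
        if abs(simulated - observed) <= tolerance:      # rejection step
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    posterior = abc_rejection(20, 0.0, 50.0, 5000, 2, random.Random(42))
    print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior distribution of the rate; tightening the tolerance improves the approximation at the cost of more simulations, which is why fast simulators matter for ABC.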
CONCLUSION
This review provides a practical guide to the state of the art in
software, and recommends methodologies, for simulating cod-
ing and non-coding sequence data with recombination, including
recombination hotspots. Currently, only a few programs implement
the direct simulation of coding data with recombination.
These programs will not cover every evolutionary scenario, but
this problem can be circumvented by the use of two simulators,
one for the evolutionary history and another for sequence evolu-
tion. It is also important to consider intracodon recombination
(Arenas and Posada, 2010a), because 2/3 of recombination events
are expected to occur within codons. By contrast, the simulation of
non-coding sequences with recombination can be performed by a
variety of programs. Here again, two simulators may be combined
where necessary.
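The 2/3 figure quoted above can be checked with a short count. The sketch assumes that breakpoints are equally likely at every junction between adjacent nucleotides of a coding sequence that starts in frame; the function name is illustrative.

```python
from fractions import Fraction


def intracodon_fraction(n_codons):
    """Fraction of internal nucleotide junctions that fall within a codon.

    Junction j (1-based) lies between nucleotides j and j+1; it separates
    two codons only when j is a multiple of 3.
    """
    n_sites = 3 * n_codons
    junctions = range(1, n_sites)
    within = sum(1 for j in junctions if j % 3 != 0)
    return Fraction(within, n_sites - 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, float(intracodon_fraction(n)))
```

For n codons the fraction is 2n / (3n - 1), which approaches 2/3 as the sequence gets longer.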
Among many other applications (e.g., Sun et al., 2011; Marttinen
et al., 2012), the simulation of DNA data with recombination
has been especially important for demonstrating the strong influence
of recombination on phylogenetic tree reconstruction and
derived analyses, such as ASR or dN/dS estimation. However, some
alternative methodologies have been developed for phylogenetic
inference accounting for recombination.
The current set of computer tools to simulate DNA sequences
with recombination can cover a wide range of evolutionary scenar-
ios. However, some scenarios are still difficult to simulate and will
require the development of more complex simulators. For exam-
ple, next-generation sequencing (NGS) technologies now deliver
fast and accurate genome sequences (Metzker, 2010) that may
call for simulations of entire genomes accounting for recombination
(including recombination hotspots; e.g., Westesson and
Holmes, 2009; Marttinen et al., 2012), as well as other evolutionary
mechanisms like natural selection. Indeed, the simulation of DNA
evolution should be performed by using different substitution
models for each genomic region (Arbiza et al., 2011). Moreover,
I would expect interactions between the different evolutionary
forces, such as joint influences of natural selection and recombination
on dN/dS (e.g., Anisimova et al., 2003; Kryazhimskiy and
Plotkin, 2008) or of structural protein energies on sequence evolution
(e.g., Bastolla et al., 2007; Arenas et al., 2009; Grahnen et al.,
2011). To my knowledge, there is currently no tool to simulate
sequences accounting for all these evolutionary features, including
interactions among them. On the other hand, there is also a
demand for fast simulations, in particular for applying ABC or
Bayesian model-choice approaches that require extensive simulations
(see recombination examples in Wilson et al., 2009; Nunes
and Balding, 2010; Sohn et al., 2012).
In conclusion, there is a need to innovate continuously in fast
and complex simulators of DNA sequences with recombination
and I expect future advances in this area.
ACKNOWLEDGMENTS
I want to thank Badri Padhukasahasram, Guest Associate Editor
of Frontiers in Evolutionary and Population Genetics, for the
invitation to contribute this review to the Research Topic
“Inference of recombination and gene-conversion from whole genome
sequence variation data.” I also want to thank the journal
Frontiers in Evolutionary and Population Genetics for a waiver to
cover publication costs. I thank Dr Richard M. Gunton for helpful
comments and two reviewers for insightful comments and
suggestions. I thank the Spanish Government for the “Juan de la
Cierva” fellowship, JCI-2011-10452.
REFERENCES
Anisimova, M., Nielsen, R., and Yang,
Z. (2003). Effect of recombination
on the accuracy of the likelihood
method for detecting positive selec-
tion at amino acid sites. Genetics 164,
1229–1236.
Arbiza, L., Patricio, M., Dopazo,H., and
Posada, D. (2011). Genome-wide
heterogeneity of nucleotide substi-
tution model fit. Genome Biol. Evol.
3, 896–908.
Arenas, M. (2012). Simulation of
molecular data under diverse
evolutionary scenarios. PLoS
Comput. Biol. 8:e1002495.
doi:10.1371/journal.pcbi.1002495
Arenas, M., Francois, O., Currat, M.,
Ray, N., and Excoffier, L. (2013).
Influence of admixture and pale-
olithic range contractions on current
European diversity gradients. Mol.
Biol. Evol. 30, 57–61.
Arenas, M., Patricio,M., Posada, D., and
Valiente, G. (2010). Characteriza-
tion of phylogenetic networks with
nettest. BMC Bioinformatics 11:268.
doi:10.1186/1471-2105-11-268
Arenas, M., and Posada, D. (2007).
Recodon: coalescent simulation of
coding DNA sequences with recom-
bination, migration and demog-
raphy. BMC Bioinformatics 8:458.
doi:10.1186/1471-2105-8-458
Arenas, M., and Posada, D. (2010a).
Coalescent simulation of intra-
codon recombination. Genetics 184,
429–437.
Arenas, M., and Posada, D. (2010b).
Computational design of central-
ized HIV-1 genes. Curr. HIV Res. 8,
613–621.
Arenas, M., and Posada, D. (2010c).
The effect of recombination on
the reconstruction of ancestral
sequences. Genetics 184, 1133–1139.
Arenas, M., and Posada, D. (2012).
“Simulation of coding sequence evo-
lution, in Codon Evolution, eds G.
M. Cannarozzi and A. Schneider
(Oxford: Oxford University Press),
126–132.
Arenas, M., Ray, N., Currat, M., and
Excoffier, L. (2012). Consequences
of range contractions and range
shifts on molecular diversity. Mol.
Biol. Evol. 29, 207–218.
Arenas, M., Valiente, G., and Posada, D.
(2008). Characterization of reticu-
late networks based on the coales-
cent with recombination. Mol. Biol.
Evol. 25, 2517–2520.
Arenas, M., Villaverde, M. C., and
Sussman, F. (2009). Prediction
and analysis of binding affinities
for chemically diverse HIV-1 PR
inhibitors by the modified SAFE_p
approach. J. Comput. Chem. 30,
1229–1240.
Awadalla, P. (2003). The evolutionary
genomics of pathogen recombina-
tion. Nat. Rev. Genet. 4, 50–60.
Bastolla, U., Porto, M., Roman, H. E.,
and Vendruscolo, M. (2007). Struc-
tural Approaches to Sequence Evolu-
tion. Berlin: Springer.
Beaumont, M. A. (2010). Approximate
Bayesian computation in evolution
and ecology. Annu. Rev. Ecol. Evol.
Syst. 41, 379–405.
Beaumont, M. A., Zhang, W., and
Balding, D. J. (2002). Approx-
imate Bayesian computation in
population genetics. Genetics 162,
2025–2035.
Beiko, R. G., Doolittle, W. F., and
Charlebois, R. L. (2008). The impact
of reticulate evolution on genome
phylogeny. Syst. Biol. 57, 844–856.
Bozek, K., and Lengauer, T. (2010).
Positive selection of HIV host fac-
tors and the evolution of lentivirus
genes. BMC Evol. Biol. 10:186.
doi:10.1186/1471-2148-10-186
Buendia, P., and Narasimhan, G.
(2006). Serial netevolve: a flexi-
ble utility for generating serially-
sampled sequences along a tree or
recombinant network. Bioinformat-
ics 22, 2313–2314.
Calafell, F.,Gr igorenko, E. L.,Chikanian,
A. A., and Kidd,K. K. (2001). Haplo-
type evolution and linkage disequi-
librium: a simulation study. Hum.
Hered. 51, 85–96.
Carvajal-Rodriguez, A. (2008).
GENOMEPOP: a program to
simulate genomes in popula-
tions. BMC Bioinformatics 9:223.
doi:10.1186/1471-2105-9-223
Daly, M. J., Rioux, J. D., Schaffner, S.
F., Hudson, T. J., and Lander, E. S.
(2001). High-resolution haplotype
structure in the human genome.
Nat. Genet. 29, 229–232.
Drummond, A. J., and Rambaut, A.
(2007). BEAST: Bayesian evolution-
ary analysis by sampling trees. BMC
Evol. Biol. 7:214. doi:10.1186/1471-
2148-7-214
Duret, L., and Arndt, P. F. (2008).
The impact of recombina-
tion on nucleotide substitu-
tions in the human genome.
PLoS Genet. 4:e1000071.
doi:10.1371/journal.pgen.1000071
Epperson, B. K., McRae, B. H., Scrib-
ner, K., Cushman, S. A., Rosenberg,
M. S., Fortin, M. J., et al. (2010).
Utility of computer simulations in
landscape genetics. Mol. Ecol. 19,
3549–3564.
Ewing, G., and Hermisson, J. (2010).
MSMS: a coalescent simulation
program including recombination,
demographic structure and selection
at a single locus. Bioinformatics 26,
2064–2065.
Excoffier, L., and Foll, M. (2011). Fast-
simcoal: a continuous-time coales-
cent simulator of genomic diver-
sity under arbitrarily complex evolu-
tionary scenarios. Bioinformatics 27,
1332–1334.
Feil, E. J., Holmes, E. C., Bessen, D. E.,
Chan, M.-S., Day, N. P. J., Enright,
M. C., et al. (2001). Recombination
within natural populations of path-
ogenic bacteria: Short-term empir-
ical estimates and long-term phy-
logenetic consequences. Proc. Natl.
Acad. Sci. U.S.A. 98, 182–187.
Fletcher, W., and Yang, Z. (2009).
INDELible: a flexible simulator of
biological sequence evolution. Mol.
Biol. Evol. 26, 1879–1888.
Fraser, C., Hanage, W. P., and Spratt,
B. G. (2007). Recombination and
the nature of bacterial speciation.
Science 315, 476–480.
Gabriel, S. B., Schaffner, S. F., Nguyen,
H., Moore, J. M., Roy, J., Blumen-
stiel, B., et al. (2002). The structure
of haplotype blocks in the human
genome. Science 296, 2225–2229.
Gaut, B. S., Wright, S. I., Rizzon, C.,
Dvorak, J., and Anderson, L. K.
(2007). Recombination: an under-
appreciated factor in the evolution
of plant genomes. Nat. Rev. Genet. 8,
77–84.
Grahnen, J. A., Nandakumar, P.,
Kubelka, J., and Liberles, D. A.
(2011). Biophysical and structural
considerations for protein sequence
evolution. BMC Evol. Biol. 11:361.
doi:10.1186/1471-2148-11-361
Grassly, N. C., and Rambaut, A. (1997).
Treevolve: A Program to Simulate the
Evolution of DNA Sequences Under
Different Population Dynamic Sce-
narios. Oxford: Department of Zool-
ogy, Wellcome Centre for Infectious
Disease, Oxford University.
Griffiths, R. C., and Marjoram, P.
(1997). “An ancestral recombination
graph,” in Progress in Population
Genetics and Human Evolution,
eds P. Donnelly and S. Tavaré (Berlin:
Springer-Verlag), 257–270.
He, C. Q., Ding, N. Z., He, M., Li,
S. N., Wang, X. M., He, H. B., et
al. (2010). Intragenic recombination
as a mechanism of genetic diver-
sity in bluetongue virus. J. Virol. 84,
11487–11495.
Hellenthal, G., and Stephens, M. (2007).
msHOT: modifying Hudson’s ms
simulator to incorporate crossover
and gene conversion hotspots.
Bioinformatics 23, 520–521.
Hernandez, R. D. (2008). A flexible
forward simulator for populations
subject to selection and demography.
Bioinformatics 24, 2786–2787.
Hoban, S., Bertorelle, G., and Gaggiotti,
O. E. (2012). Computer simulations:
tools for population and evolution-
ary genetics. Nat. Rev. Genet. 13,
110–122.
Hudson, R. R. (1983). Properties of a
neutral allele model with intragenic
recombination. Theor. Popul. Biol.
23, 183–201.
Hudson, R. R. (2002). Generating sam-
ples under a Wright-Fisher neutral
model of genetic variation. Bioinfor-
matics 18, 337–338.
Huson, D. H. (1998). Splitstree: ana-
lyzing and visualizing evolutionary
data. Bioinformatics 14, 68–73.
Huson, D. H., and Bryant, D. (2006).
Application of phylogenetic net-
works in evolutionary studies. Mol.
Biol. Evol. 23, 254–267.
Kosakovsky Pond, S. L., and Frost, S.
D. (2005). Datamonkey: rapid detection
of selective pressure on individual
sites of codon alignments.
Bioinformatics 21, 2531–2533.
Kosakovsky Pond, S. L., Frost, S. D., and
Muse, S. V. (2005). HYPHY: hypothesis
testing using phylogenies.
Bioinformatics 21, 676–679.
Kosakovsky Pond, S. L., Posada, D.,
Gravenor, M. B., Woelk, C. H., and
Frost, S. D. (2006). Automated phy-
logenetic detection of recombina-
tion using a genetic algorithm. Mol.
Biol. Evol. 23, 1891–1901.
Kryazhimskiy, S., and Plotkin, J. B.
(2008). The population genetics
of dN/dS. PLoS Genet. 4:e1000304.
doi:10.1371/journal.pgen.1000304
Laval, G., and Excoffier, L. (2004). SIM-
COAL 2.0: a program to simulate
genomic diversity over large recom-
bining regions in a subdivided popu-
lation with a complex history. Bioin-
formatics 20, 2485–2487.
Liang, L., Zollner, S., and Abecasis,
G. R. (2007). GENOME: a rapid
coalescent-based whole genome
simulator. Bioinformatics 23,
1565–1567.
Liu, Y., Athanasiadis, G., and Weale,
M. E. (2008). A survey of genetic
simulation software for population
and epidemiological studies. Hum.
Genomics 3, 79–86.
Lukashev, A. N. (2005). Role of
recombination in evolution of
enteroviruses. Rev. Med. Virol. 15,
157–167.
Martin, D. P., Lemey, P., and Posada,
D. (2011). Analysing recombination
in nucleotide sequences. Mol. Ecol.
Resour. 11, 943–955.
Marttinen, P., Hanage, W. P., Croucher,
N. J., Connor, T. R., Harris, S. R.,
Bentley, S. D., et al. (2012). Detection
of recombination events in bacter-
ial genomes from large population
samples. Nucleic Acids Res. 40, e6.
Maughan, H., and Redfield, R. J.
(2009). Tracing the evolution
of competence in Haemophilus
influenzae. PLoS ONE 4:e5854.
doi:10.1371/journal.pone.0005854
McVean, G. A., and Cardin, N. J. (2005).
Approximating the coalescent with
recombination. Philos. Trans. R. Soc.
Lond. B Biol. Sci. 360, 1387–1393.
Metzker, M. L. (2010). Sequencing
technologies – the next generation. Nat.
Rev. Genet. 11, 31–46.
Nordborg, M. (2007). “Coalescent theory,”
in Handbook of Statistical
Genetics, 3rd Edn, eds D. J. Balding,
M. Bishop, and C. Cannings
(Chichester: John Wiley & Sons Ltd),
843–877.
Nunes, M. A., and Balding, D. J. (2010).
On optimal selection of summary
statistics for approximate Bayesian
computation. Stat. Appl. Genet. Mol.
Biol. 9, 34.
Padhukasahasram, B., Marjoram, P.,
Wall, J. D., Bustamante, C. D.,
and Nordborg, M. (2008). Explor-
ing population genetic models
with recombination using efficient
forward-time simulations. Genetics
178, 2417–2427.
Peck, S. L. (2004). Simulation as exper-
iment: a philosophical reassessment
for biological modeling. Trends Ecol.
Evol. (Amst.) 19, 530–534.
Peng, B., Amos, C. I., and Kimmel,
M. (2007). Forward-time simula-
tions of human populations with
complex diseases. PLoS Genet. 3:e47.
doi:10.1371/journal.pgen.0030047
Peng, B., and Kimmel, M. (2005).
Simupop: a forward-time popula-
tion genetics simulation environ-
ment. Bioinformatics 21, 3686–3687.
Perez-Losada, M., Jobes, D. V., Sinangil,
F., Crandall, K. A., Arenas, M.,
Posada, D., et al. (2011). Phylo-
dynamics of HIV-1 from a phase
III AIDS vaccine trial in Bangkok,
Thailand. PLoS ONE 6:e16902.
doi:10.1371/journal.pone.0016902
Perez-Losada, M., Posada, D., Arenas,
M., Jobes, D. V., Sinangil, F., Berman,
P. W., et al. (2009). Ethnic differences
in the adaptation rate of HIV gp120
from a vaccine trial. Retrovirology 6,
67.
Pierron, D., Chang, I., Arachiche, A.,
Heiske, M., Thomas, O., Borlin, M.,
et al. (2011). Mutation rate switch
inside Eurasian mitochondrial hap-
logroups: impact of selection and
consequences for dating settlement
in Europe. PLoS ONE 6:e21543.
doi:10.1371/journal.pone.0021543
Posada, D. (2001). Unveiling the mol-
ecular clock in the presence of
recombination. Mol. Biol. Evol. 18,
1976–1978.
Posada, D., and Crandall, K. A.
(2001). Evaluation of methods for
detecting recombination from DNA
sequences: computer simulations.
Proc. Natl. Acad. Sci. U.S.A. 98,
13757–13762.
Posada, D., and Crandall, K. A. (2002).
The effect of recombination on the
accuracy of phylogeny estimation. J.
Mol. Evol. 54, 396–402.
Posada, D., Crandall, K. A., and Holmes,
E. C. (2002). Recombination in
evolutionary genomics. Annu. Rev.
Genet. 36, 75–97.
Rambaut, A., and Grassly, N. C. (1997).
Seq-gen: an application for the
Monte Carlo simulation of DNA
sequence evolution along phyloge-
netic trees. Comput. Appl. Biosci. 13,
235–238.
Ramos-Onsins, S. E., and Mitchell-
Olds, T. (2007). Mlcoalsim: multi-
locus coalescent simulations. Evol.
Bioinform. Online 3, 41–44.
Ray, N., Currat, M., Foll, M., and
Excoffier, L. (2010). SPLATCHE2: a
spatially explicit simulation frame-
work for complex demography,
genetic admixture and recombina-
tion. Bioinformatics 26, 2993–2994.
Reich, D. E., Cargill, M., Bolk, S., Ire-
land, J., Sabeti, P. C., Richter, D. J.,
et al. (2001). Linkage disequilibrium
in the human genome. Nature 411,
199–204.
Robertson, D. L., Sharp, P. M.,
McCutchan, F. E., and Hahn, B. H.
(1995). Recombination in HIV-1.
Nature 374, 124–126.
Schaffner, S. F., Foo, C., Gabriel, S.,
Reich, D., Daly, M. J., and Altshuler,
D. (2005). Calibrating a coales-
cent simulation of human genome
sequence variation. Genome Res. 15,
1576–1583.
Schierup, M. H., and Hein, J. (2000a).
Consequences of recombination on
traditional phylogenetic analysis.
Genetics 156, 879–891.
Schierup, M. H., and Hein, J. (2000b).
Recombination and the molecular
clock. Mol. Biol. Evol. 17, 1578–1579.
Sohn, K. A., Ghahramani, Z., and Xing,
E. P. (2012). Robust estimation of
local genetic ancestry in admixed
populations using a nonparamet-
ric Bayesian approach. Genetics 191,
1295–1308.
Spencer, C. C., Deloukas, P., Hunt,
S., Mullikin, J., Myers, S., Silver-
man, B., et al. (2006). The influ-
ence of recombination on human
genetic diversity. PLoS Genet. 2:e148.
doi:10.1371/journal.pgen.0020148
Strope, C. L., Abel, K., Scott, S. D.,
and Moriyama, E. N. (2009). Bio-
logical sequence simulation for test-
ing complex evolutionary hypothe-
ses: indel-seq-gen version 2.0. Mol.
Biol. Evol. 26, 2581–2593.
Sun, S., Evans, B. J., and Golding, G. B.
(2011). “Patchy-tachy” leads to false
positives for recombination. Mol.
Biol. Evol. 28, 2549–2559.
Teshima, K. M., and Innan, H. (2009).
Mbs: modifying Hudson’s ms soft-
ware to generate samples of DNA
sequences with a biallelic site
under selection. BMC Bioinformatics
10:166.
doi:10.1186/1471-2105-10-166
Tsaousis, A. D., Martin, D. P.,
Ladoukakis, E. D., Posada, D.,
and Zouros, E. (2005). Widespread
Recombination in Published Animal
mtDNA Sequences. Mol. Biol. Evol.
22, 925–933.
Wakeley, J. (2008). Coalescent Theory:
An Introduction. Greenwood Village:
Roberts and Company Publishers.
Westesson, O., and Holmes, I. (2009).
Accurate detection of recombinant
breakpoints in whole-genome
alignments. PLoS Comput. Biol.
5:e1000318.
doi:10.1371/journal.pcbi.1000318
Wilson, D. J., Gabriel, E., Leatherbarrow,
A. J., Cheesbrough, J., Gee, S., Bolton,
E., et al. (2009). Rapid evolution and
the importance of recombination to
the gastroenteric pathogen Campy-
lobacter jejuni. Mol. Biol. Evol. 26,
385–397.
Wiuf, C., Christensen, T., and Hein, J.
(2001). A simulation study of the
reliability of recombination detec-
tion methods. Mol. Biol. Evol. 18,
1929–1939.
Wiuf, C., and Posada, D. (2003).
A coalescent model of recom-
bination hotspots. Genetics 164,
407–417.
Woolley, S. M., Posada, D., and
Crandall, K. A. (2008). A com-
parison of phylogenetic network
methods using computer sim-
ulation. PLoS ONE 3:e1913.
doi:10.1371/journal.pone.0001913
Worobey, M., and Holmes, E. C. (1999).
Evolutionary aspects of recombina-
tion in RNA viruses. J. Gen. Virol. 80,
2535–2543.
Yang, Z. (1997). PAML: a program
package for phylogenetic analysis by
maximum likelihood. Comput. Appl.
Biosci. 13, 555–556.
Yang, Z. (2006). Computational Molec-
ular Evolution. Oxford: Oxford Uni-
versity Press.
Yang, Z., Nielsen, R., Goldman, N., and
Pedersen, A.-M. K. (2000). Codon-
substitution models for heteroge-
neous selection pressure at amino
acid sites. Genetics 155, 431–449.
Zhang, Y. X., Perry, K., Vinci, V. A.,
Powell, K., Stemmer, W. P., and
del Cardayre, S. B. (2002). Genome
shuffling leads to rapid phenotypic
improvement in bacteria. Nature
415, 644–646.
Zhuang, J., Jetzt, A. E., Sun, G., Yu,
H., Klarmann, G., Ron, Y., et al.
(2002). Human immunodeficiency
virus type 1 recombination: rate,
fidelity, and putative hot spots. J.
Virol. 76, 11273–11282.
Conflict of Interest Statement: The
author declares that the research was
conducted in the absence of any
commercial or financial relationships
that could be construed as a potential
conflict of interest.
Received: 20 November 2012; accepted:
17 January 2013; published online: 01
February 2013.
Citation: Arenas M (2013) Computer
programs and methodologies for the sim-
ulation of DNA sequence data with
recombination. Front. Genet. 4:9.
doi: 10.3389/fgene.2013.00009
This article was submitted to Frontiers in
Evolutionary and Population Genetics, a
specialty of Frontiers in Genetics.
Copyright © 2013 Arenas. This is an
open-access article distributed under the
terms of the Creative Commons Attribu-
tion License, which permits use, distrib-
ution and reproduction in other forums,
provided the original authors and source
are credited and subject to any copy-
right notices concerning any third-party
graphics etc.
... Recombination is a fundamental evolutionary force for most organisms, especially virus and bacteria (Perez-Losada et al., 2015). Recombination events generate an ancestral recombination graph (ARG) (Arenas, 2013b;Marjoram, 1997, 1996) where genomic fragments can present different evolutionary histories (Fig. 5) and that should be carefully considered for inferring phylogenetic trees (Schierup and Hein, 2000a;Arenas and Posada, 2010b;Mallo et al., 2016;Posada and Crandall, 2002). The number of simulated recombination events is based on the population recombination rate r¼ 4Nrl, where r is the recombination rate per generation per site and l is the sequence length. ...
... In addition to the coalescent, some other approaches have been developed to simulate evolutionary histories in population genetics (Arenas, 2012;Hoban et al., 2012). One of the most relevant is the forward in time approach, which simulates the evolutionary history of the entire population from the past to the present Carvajal-Rodriguez, 2010). ...
... Computer simulations of the coalescent with recombination (Arenas, 2013a), followed by the simulation of codon evolution upon the simulated coalescent histories (Arenas and Posada, 2012), were applied in several studies to analyze the influence of ignored recombination on the estimation of selection (the nonsynonymous (dN) to synonymous (dS) substitution ratio, dN/dS or o) Posada, 2010a, 2014a;Anisimova et al., 2003;Shriner et al., 2003). These studies found that ignored recombination biases the estimation of dN/dS by generating false positively selected sites. ...
... Here I created an ABM that enabled me to investigate and identify the evolutionary, demographic and genomic conditions under which AOD can occur, using realistic scenarios. Even though several simulation programs have been developed to model evolutionary processes (Hoban et al., 2012) and some specialize in simulating recombination (Arenas, 2013) and others can be highly customised and modified (Haller & Messer, 2019), I decided to develop my own simulation program with greater flexibility that allowed me to: ...
Thesis
Full-text available
Thesis Abstract Genetic differentiation is a vital aspect of population genetics and is a direct consequence of evolutionary forces acting on genetic diversity. By interpreting patterns of genetic differentiation, we can detect, infer, and estimate the extent to which natural selection, genetic drift, and gene flow affect genetic diversity. In this thesis, estimation of genetic differentiation is used as a tool to answer the following questions, three mainly theoretical, and the other an applied study on platypus conservation.
... Note that we assumed neutral evolution in the coalescent evolutionary history and selection in the protein evolution because to our knowledge no current simulation framework implements the simulation of these processes (evolutionary history and molecular evolution) under a same selection process. Thus, this assumption is commonly made in population genetics (see the reviews Yang 2006;Arenas 2012Arenas , 2013Arenas and Posada 2012;Hoban et al. 2012). Finally, the folding free energy of the simulated protein sequences was estimated with the program DeltaGREM (Bastolla 2014) based on the protein folding stability model described in Minning, et al. (2013) and also adopted in Arenas et al. (2015). ...
Article
Full-text available
Genetic recombination is a common evolutionary mechanism that produces molecular diversity. However, its consequences on protein folding stability have not attracted the same attention as in the case of point mutations. Here, we studied the effects of homologous recombination on the computationally predicted protein folding stability for several protein families, finding less detrimental effects than we previously expected. Although recombination can affect multiple protein sites, we found that the fraction of recombined proteins that are eliminated by negative selection because of insufficient stability is not significantly larger than the corresponding fraction of proteins produced by mutation events. Indeed, although recombination disrupts epistatic interactions, the mean stability of recombinant proteins is not lower than that of their parents. On the other hand, the difference of stability between recombined proteins is amplified with respect to the parents, promoting phenotypic diversity. As a result, at least one third of recombined proteins present stability between those of their parents, and a substantial fraction have higher or lower stability than those of both parents. As expected, we found that parents with similar sequences tend to produce recombined proteins with stability close to that of the parents. Finally, the simulation of protein evolution along the ancestral recombination graph with empirical substitution models commonly used in phylogenetics, which ignore constraints on protein folding stability, showed that recombination favors the decrease of folding stability, supporting the convenience of adopting structurally constrained models when possible for inferences of protein evolutionary histories with recombination.
... We would presume that limiting the range even more would cause the knot to turn to itself or an unknot even more often. Looking at the accuracy of our model, we could look into the Dr. Arena's paper on varying accuracy of DNA models: Trying out different applications to model DNA with Cre activity will allow us to narrow down our errors as well [1]. Furthermore, we dismissed the fact that CRE only acts on negative writhe, and a next step in an accurate simulation would be to include that. ...
Poster
Full-text available
This study was undertaken to better understand the enzymatic activity of DNA recombinases, specifically CRE-lox recombinase reactions. To model this enzyme activity, a simulation through the topological modeling program KnotPlotTM and the BFACF algorithm was used to study recombination on all knots with up to seven crossings. The data collected from the computational simulations were analyzed to produce a transition probability matrix in order to predict the topological transformations of DNA knots treated with Cre Recombinase. The probability matrix was used to determine the steady state for the recombination reaction and the efficiency of each knot transforming into unknot. The accuracy of each simulation trial was analyzed and compared to experimental data performed on the transformation of PSC1.3i by Cre recombinase. This research has potential pharmaceutical applications that can be furthered to improve the efficiency of enzyme activity in transforming circular DNA chains into the unknotted form.
... These could help understand evolutionary or genetic consequences of the processes that are not analytically tractable. As a result, a number of genetic sequence simulators have been developed; those available before 2012 are reviewed in [1,254,255]. Below, we briefly describe a conceptual framework of coding sequence simulation and then consider novel developments in this area. ...
Chapter
Full-text available
Cost and time of genome sequencing have plummeted over the last decade. This leads to explosive growth of genetic databases and development of novel sequencing-based approaches to study various biological phenomena. The database growth was particularly beneficial for investigation of protein-coding sequences at the codon level, requiring the access to large sets of related genomes. Such studies are expected to illuminate biological forces that shape primary structure of coding sequences and predict their evolutionary trajectories more precisely. In addition to fundamental interest, codon usage studies are of ample practical value, for example, in drug discovery and genomic medicine areas. Nevertheless, the depth of our understanding of codon-related issues is currently shallower as compared to what we know about nucleotide and amino acid sequences. Besides the lack of adequate datasets in the early days of molecular biology, codon usage studies, in our opinion, suffer from underdevelopment of easy-to-use tools to analyze and visualize how codon sequence changes along the gene and across the homologous genes in course of evolution. In this review, we aim to describe main areas of codon usage studies with an emphasis on the tools that allow visual interpretation of the data. We discuss underlying principles of different approaches, what kind of statistics lends confidence in their results and what has to be done to further boost the field of codon usage research.
... Computer simulations are useful in evolutionary biology for hypothesis testing, for verifying analytical methods, for analyzing interactions among evolutionary processes, and they are widely used in different disciplines. In general, computer simulations allow for the study of complex systems, including those analytically intractable [127]. Here, we use forward simulation models from primitive machines to advance translation machines, mimicking the biosynthetic processes for the origin of the genetic code, and for testing our hypothesis of the coevolution of the translation machines and the genetic code. ...
Article
Full-text available
Information is the currency of life, but the origin of prebiotic information remains a mystery. We propose transitional pathways from the cosmic building blocks of life to the complex prebiotic organic chemistry that led to the origin of information systems. The prebiotic information system, specifically the genetic code, is segregated, linear, and digital, and it appeared before the emergence of DNA. In the peptide/RNA world, lipid membranes randomly encapsulated amino acids, RNA, and peptide molecules, which are drawn from the prebiotic soup, to initiate a molecular symbiosis inside the protocells. This endosymbiosis led to the hierarchical emergence of several requisite components of the translation machine: transfer RNAs (tRNAs), aminoacyl-tRNA synthetase (aaRS), messenger RNAs (mRNAs), ribosomes, and various enzymes. When assembled in the right order, the translation machine created proteins, a process that transferred information from mRNAs to assemble amino acids into polypeptide chains. This was the beginning of the prebiotic information age. The origin of the genetic code is enigmatic; herein, we propose an evolutionary explanation: the demand for a wide range of protein enzymes over peptides in the prebiotic reactions was the main selective pressure for the origin of information-directed protein synthesis. The molecular basis of the genetic code manifests itself in the interaction of aaRS and their cognate tRNAs. In the beginning, aminoacylated ribozymes used amino acids as a cofactor with the help of bridge peptides as a process for selection between amino acids and their cognate codons/anticodons. This process selects amino acids and RNA species for the next steps. The ribozymes would give rise to pre-tRNA and the bridge peptides to pre-aaRS. Later, variants would appear and evolution would produce different but specific aaRS-tRNA-amino acid combinations. 
Pre-tRNA designed and built pre-mRNA for the storage of information regarding its cognate amino acid. Each pre-mRNA strand became the storage device for the genetic information that encoded the amino acid sequences in triplet nucleotides. As information appeared in the digital languages of the codon within pre-mRNA and mRNA, and the genetic code for protein synthesis evolved, the prebiotic chemistry then became more organized and directional with the emergence of the translation and genetic code. The genetic code developed in three stages that are coincident with the refinement of the translation machines: the GNC code that was developed by the pre-tRNA/pre-aaRS /pre-mRNA machine, SNS code by the tRNA/aaRS/mRNA machine, and finally the universal genetic code by the tRNA/aaRS/mRNA/ribosome machine. We suggest the coevolution of translation machines and the genetic code. The emergence of the translation machines was the beginning of the Darwinian evolution, an interplay between information and its supporting structure. Our hypothesis provides the logical and incremental steps for the origin of the programmed protein synthesis. In order to better understand the prebiotic information system, we converted letter codons into numerical codons in the Universal Genetic Code Table. We have developed a software, called CATI (Codon-Amino Acid-Translator-Imitator), to translate randomly chosen numerical codons into corresponding amino acids and vice versa. This conversion has granted us insight into how the genetic code might have evolved in the peptide/RNA world. There is great potential in the application of numerical codons to bioinformatics, such as barcoding, DNA mining, or DNA fingerprinting. We constructed the likely biochemical pathways for the origin of translation and the genetic code using the Model-View-Controller (MVC) software framework, and the translation machinery step-by-step. 
While using AnyLogic software, we were able to simulate and visualize the entire evolution of the translation machines, amino acids, and the genetic code.
... Computer simulations are useful in evolutionary biology for hypothesis testing, for verifying analytical methods, for analyzing interactions among evolutionary processes, and are widely used in different disciplines. In general, computer simulations allow the study of complex systems, including those analytically intractable [121]. Here we use forward simulation models from primitive machines to advance translation machines, mimicking the biosynthetic processes for the origin of the genetic code, and for testing our hypothesis of the coevolution of the translation machines and the genetic code. ...
Preprint
Information is the currency of life, but the origin of prebiotic information remains a mystery. We propose transitional pathways from the cosmic building blocks of life to the complex prebiotic organic chemistry that led to the origin of information systems. The prebiotic information system, specifically the genetic code, is segregated, linear, and digital and probably appeared during biogenesis four billion years ago. In the peptide/RNA world, lipid membranes randomly encapsulated amino acids, RNA, and protein molecules, drawn from the prebiotic soup, to initiate a molecular symbiosis inside the protocells. This endosymbiosis led to the hierarchical emergence of several requisite components of the translation machine: tRNAs, aaRS, mRNAs, and ribosomes. When assembled in the right order, the translation machine created biosynthetic polypeptides, a process that transferred information from mRNAs to proteins. This was the beginning of the prebiotic information age. The molecular attraction between tRNA and amino acids led to different stages of the translation machines and the genetic code. tRNA is an ancient molecule that designed and built mRNA for storing the information of its cognate amino acid. Each mRNA strand became the storage device for the genetic information that encoded the amino acid sequences in triplet nucleotides. As information appeared in the digital languages of the codon within mRNA, and the genetic code for protein synthesis evolved, the prebiotic chemistry then became more organized and directional. The origin of the genetic code is enigmatic; herein we propose an evolutionary explanation: the demand for a wide range of specific enzymes in the peptide/RNA world was the main selective pressure for the origin of information-directed protein synthesis. We review three main concepts on the origin and evolution of the genetic code: the stereochemical theory, the coevolution theory, and the adaptive theory. 
These three theories are compatible with our coevolution model of the translation machines and the genetic code. We suggest biosynthetic pathways as the origin of the specific translation machines which provided the framework for the origin of the genetic code. During translation, the genetic code developed in three stages coincident with the refinement of the translation machines: GNC code developed by the pre-tRNA/pre-aaRS /pre-mRNA machine, SNS code by the tRNA/aaRS/mRNA machine, and finally the universal genetic code by the tRNA/aaRS/mRNA/ribosome machine. Our hypothesis provides the logical and incremental steps for the origin of the programmed protein synthesis. In order to understand the prebiotic information system better, we converted letter codons into numerical codons in the Universal Genetic Code Table. We have developed a software called CATI (Codon-Amino Acid-Translator-Imitator) to translate randomly chosen numerical codons into corresponding amino acids and vice versa. This conversion has granted us insight into how the translation might have worked in the peptide/RNA world. There is great potential in the application of numerical codons to bioinformatics such as barcoding, DNA mining, or DNA fingerprinting. We constructed the likely biochemical pathways for the origin of translation and the genetic code using the Model-View-Controller (MVC) software framework, and the translation machinery step-by-step. Using AnyLogic software we were able to simulate and visualize the entire evolution of the translation machines and the genetic code. The results indicate that the emergence of the information age from the peptide/RNA world was a watershed event in the origin of life about four billion years ago.
... Indeed, because of its rapid simulation and realistic population genetics modeling, the coalescent is a very useful approach when extensive simulations are required, for example in studies based on ABC or Bayesian approaches. For further details about approaches and frameworks to simulate evolutionary histories, we recommend the following reviews [46][47][48]. Interestingly, the forward-time and coalescent approaches were combined into the simulator SPLATCHE, allowing a rapid simulation of the evolutionary history of a sample accounting for evolutionary processes acting at the whole population level [49,50]. ...
Article
Full-text available
Selecting among alternative scenarios of human evolution is nowadays a common methodology to investigate the history of our species. This strategy is usually based on computer simulations of genetic data under different evolutionary scenarios, followed by a fitting of the simulated data with the real data. A recent trend in the investigation of ancestral evolutionary processes of modern humans is the application of genetic gradients as a measure of fitting, since evolutionary processes such as range expansions, range contractions, and population admixture (among others) can lead to different genetic gradients. In addition, this strategy allows the analysis of the genetic causes of the observed genetic gradients. Here, we review recent findings on the selection among alternative scenarios of human evolution based on simulated genetic gradients, including pros and cons. First, we describe common methodologies to simulate genetic gradients and apply them to select among alternative scenarios of human evolution. Next, we review previous studies on the influence of range expansions, population admixture, last glacial period, and migration with long-distance dispersal on genetic gradients for some regions of the world. Finally, we discuss this analytical approach, including technical limitations, required improvements, and advice. Although here we focus on human evolution, this approach could be extended to study other species.
... Computer simulations are useful in evolutionary biology for hypothesis testing, for verifying analytical methods, and for analyzing interactions among evolutionary processes, and they are widely used in different disciplines. In general, computer simulations allow the study of complex systems, including those analytically intractable [121]. Here we use forward simulation models of primitive machines to advance translation machines, mimicking the biosynthetic processes for the origin of the genetic code, and for testing our hypothesis of the coevolution of the translation machines and the genetic code. ...
Preprint
The Late Heavy Bombardment Period (4.1 to 3.8 billion years ago) of heightened impact cratering activity on young Earth is likely the driving force for the origin of life. During the Eoarchean, asteroids such as carbonaceous chondrites delivered the building blocks of life and water to early Earth. Asteroid collisions created innumerable hydrothermal crater lakes in the Eoarchean crust which inadvertently became the perfect cradle for prebiotic chemistry. These hydrothermal crater lakes were filled with cosmic water and the building blocks of life, forming a thick prebiotic soup. The unique combination of exogenous delivery of extraterrestrial building blocks of life, and the endogenous biosynthesis in hydrothermal impact crater lakes very likely gave rise to life. A new symbiotic model for the origin of life within the hydrothermal crater lakes is here proposed. In this scenario, life arose around four billion years ago through five hierarchical stages of increasing molecular complexity: cosmic, geologic, chemical, information, and biological. During the prebiotic synthesis, membranes first appeared in the hydrothermal crater lakes, followed by the simultaneous origin of RNA and protein molecules, creating the RNA/protein world. These proteins were noncoded protein enzymes that facilitated chemical reactions. RNA molecules formed in the hydrothermal crater basin by polymerization of the nucleotides on the montmorillonite mineral substrate. Similarly, the initial synthesis of abiotic protein enzymes was mediated by the condensation of amino acids on pyrite surfaces. The regular wet-dry cycles within the crater lakes assisted further concentration, condensation, and polymerization of the RNAs and proteins. Lipid membranes randomly encapsulated amino acids, RNA, and protein molecules from the prebiotic soup to initiate a molecular symbiosis inside the protocells; this led to the hierarchical emergence of several cell components. 
As the role of protein enzymes became essential for catalytic processes in the RNA/protein world, Darwinian selection from noncoded to coded protein synthesis led to translation systems and the genetic code, heralding the information stage. In this stage, the biochemical pathways suggest the successive emergence of translation machineries such as tRNAs, aaRS, mRNAs, and of ribosomes for protein synthesis. The molecular attraction between tRNA and amino acid led to the emergence of translation machinery and the genetic code. tRNA is an ancient molecule that created mRNA for the purpose of storing amino acid information like a digital strip. Each mRNA strand became the storage device for genetic information that encoded the amino acid sequences in triplet nucleotides. As information became available in the digital languages of the codon within mRNA, biosynthesis became less random and more organized and directional. Before the emergence of the ribosome, the original translation machinery was simpler than that of today. We review three main concepts on the origin and evolution of the genetic code: the stereochemical theory, the coevolution theory, and the adaptive theory. We believe that these three theories are not mutually exclusive, but are compatible with our coevolution model of translation machines and the genetic code. We suggest biosynthetic pathways as the origin of the translation machine that provided the framework for the origin of the genetic code. During translation, the genetic code developed in three stages coincident with the refinement of the translation machinery: a GNC code with four codons and four amino acids arose during interactions of pre-tRNA/pre-aaRS/pre-mRNA; an SNS code consisting of 16 codons and 10 amino acids appeared during the tRNA/aaRS/mRNA interaction; and finally the universal genetic code evolved with the emergence of the tRNA/aaRS/mRNA/ribosome machine. 
The universal code consists of 64 codons and 20 amino acids, with a redundancy that minimizes errors in translation. To address the question of the origin of the biological information system in the RNA/protein world, we converted letter codons into numerical codons in the Universal Genetic Code Table. We developed software called CATI (Codon-Amino Acid-Translator-Imitator) to translate randomly chosen numerical codons into corresponding amino acids and vice versa, gaining insight into how translation might have worked in the RNA/protein world. We simulated, step by step, the likely biochemical pathways for the origin of translation and the genetic code, as well as the translation machinery, using the Model-View-Controller (MVC) software framework. We used AnyLogic software to simulate and visualize the evolution of the translation machines and the genetic code. We conclude that the emergence of the information age from the RNA/protein world was a watershed event in the origin of life about four billion years ago.
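The codon-to-number conversion mentioned in this abstract can be illustrated with a minimal Python sketch. The numbering scheme CATI actually uses is not given here, so the base-4 encoding below (T=0, C=1, A=2, G=3, codon number = 16·b1 + 4·b2 + b3) and all function names are assumptions made for illustration only; the amino acid string is the standard genetic code in TCAG order.

```python
from itertools import product

# Assumed digit assignment (hypothetical, not taken from CATI): T=0, C=1, A=2, G=3.
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Letter-codon lookup table built from the two strings above.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_to_number(codon):
    """Map a letter codon (DNA or RNA alphabet) to a number in 0..63."""
    b1, b2, b3 = (BASES.index(b) for b in codon.upper().replace("U", "T"))
    return 16 * b1 + 4 * b2 + b3

def number_to_codon(number):
    """Inverse of codon_to_number."""
    return BASES[number // 16] + BASES[(number // 4) % 4] + BASES[number % 4]

def translate_number(number):
    """Amino acid (one-letter code) encoded by a numerical codon."""
    return AMINO_ACIDS[number]

def codons_for(amino_acid):
    """All letter codons that encode the given amino acid (the reverse direction)."""
    return [number_to_codon(n) for n, aa in enumerate(AMINO_ACIDS) if aa == amino_acid]
```

Listing `codons_for` for each amino acid makes the redundancy of the code visible: several numerical codons map to the same residue.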
... ProteinEvolver implements the following steps. First, a phylogenetic tree is either specified by the user or is internally simulated under the coalescent model [37] extended with recombination (including recombination hotspots following Posada and Wiuf [38] and an adaptation of the intracodon recombination algorithm [39,40] to simulate protein evolution with recombination (see Fig. 1)), demographics (population growth rate and demographic periods), longitudinal sampling, and user-specified population structure with migration [41,42]. Second, a protein sequence is assigned to the most recent common ancestor (MRCA), or grand MRCA (GMRCA) if recombination is simulated, and is evolved forward in time, from the root to the tip nodes, along the phylogeny (Fig. 1) [43]. ...
Chapter
Phylogenetic inference from protein data is traditionally based on empirical substitution models of evolution that assume that protein sites evolve independently of each other and under the same substitution process. However, it is well known that the structural properties of a protein site in the native state affect its evolution, in particular the sequence entropy and the substitution rate. Starting from the seminal proposal by Halpern and Bruno, where structural properties are incorporated in the evolutionary model through site-specific amino acid frequencies, several models have been developed to tackle the influence of protein structure on sequence evolution. Here we describe stability-constrained substitution (SCS) models that explicitly consider the stability of the native state against both unfolded and misfolded states. One of them, the mean-field model, provides an independent sites approximation that can be readily incorporated in maximum likelihood methods of phylogenetic inference, including ancestral sequence reconstruction. Next, we describe its validation with simulated and real proteins and its limitations and advantages with respect to empirical models that lack site specificity. We finally provide guidelines and recommendations to analyze protein data accounting for stability constraints, including computer simulations and inferences of protein evolution based on maximum likelihood. Some practical examples are included to illustrate these procedures.
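The two steps quoted in the citation context for this chapter (simulate a tree under the coalescent, then evolve a sequence from the root to the tips) can be sketched in Python as follows. This is a simplified illustration, not ProteinEvolver's implementation: it uses the basic Kingman coalescent without recombination, demographics, or migration, and it replaces empirical or stability-constrained substitution models with an equal-rates amino acid model. All function names and the `rate` parameter are hypothetical.

```python
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def simulate_coalescent(n, rng):
    """Step 1: Kingman coalescent for n samples (time in coalescent units).
    Returns the root id, a parent -> (child, child) map, and node times."""
    active = list(range(n))              # lineages still to coalesce
    children = {}
    time = {i: 0.0 for i in range(n)}    # tips are sampled at time 0
    t, next_id = 0.0, n
    while len(active) > 1:
        k = len(active)
        t += rng.expovariate(k * (k - 1) / 2)   # waiting time with k lineages
        a, b = rng.sample(active, 2)            # pick the pair that merges
        active.remove(a)
        active.remove(b)
        children[next_id] = (a, b)
        time[next_id] = t
        active.append(next_id)
        next_id += 1
    return active[0], children, time

def evolve_branch(seq, length, rate, rng):
    """Equal-rates model (an assumption): each site changes at `rate` per unit
    time, and every change picks a different amino acid uniformly."""
    if rate <= 0:
        return seq
    out = list(seq)
    for i in range(len(out)):
        t = rng.expovariate(rate)
        while t < length:
            out[i] = rng.choice([a for a in AMINO if a != out[i]])
            t += rng.expovariate(rate)
    return "".join(out)

def evolve_on_tree(root_seq, root, children, time, rate, rng):
    """Step 2: assign root_seq to the MRCA and evolve it down to the tips."""
    seqs = {root: root_seq}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            branch = time[node] - time[child]
            seqs[child] = evolve_branch(seqs[node], branch, rate, rng)
            stack.append(child)
    return seqs

if __name__ == "__main__":
    rng = random.Random(2013)
    root, children, time = simulate_coalescent(5, rng)
    seqs = evolve_on_tree("MKTAYIAKQR", root, children, time, 0.2, rng)
    for tip in range(5):
        print(tip, seqs[tip])
```

With recombination, different alignment regions would follow different trees, so the root sequence would be assigned at the GMRCA and each region evolved along its own marginal tree; that extension is left out of this sketch.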
Book
Structural requirements constrain the evolution of biological entities at all levels, from macromolecules to their networks, right up to populations of biological organisms. Classical models of molecular evolution, however, are focused at the level of the symbols - the biological sequence - rather than that of their resulting structure. Now recent advances in understanding the thermodynamics of macromolecules, the topological properties of gene networks, the organization and mutation capabilities of genomes, and the structure of populations make it possible to incorporate these key elements into a broader and deeply interdisciplinary view of molecular evolution. This book gives an account of such a new approach, through clear tutorial contributions by leading scientists specializing in the different fields involved.
Article
Phylogenetic studies based on DNA sequences typically ignore the potential occurrence of recombination, which may produce different alignment regions with different evolutionary histories. Traditional phylogenetic methods assume that a single history underlies the data. If recombination is present, can we expect the inferred phylogeny to represent any of the underlying evolutionary histories? We examined this question by applying traditional phylogenetic reconstruction methods to simulated recombinant sequence alignments. The effect of recombination on phylogeny estimation depended on the relatedness of the sequences involved in the recombinational event and on the extent of the different regions with different phylogenetic histories. Given the topologies examined here, when the recombinational event was ancient, or when recombination occurred between closely related taxa, one of the two phylogenies underlying the data was generally inferred. In this scenario, the evolutionary history corresponding to the majority of the positions in the alignment was generally recovered. Very different results were obtained when recombination occurred recently among divergent taxa. In this case, when the recombinational breakpoint divided the alignment in two regions of similar length, a phylogeny that was different from any of the true phylogenies underlying the data was inferred.
Article
In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution.
Article
Reconstructing the evolutionary history of biological sequences will provide a better understanding of mechanisms of sequence divergence and functional evolution. Long-term sequence evolution includes not only substitutions of residues but also more dynamic changes such as insertion, deletion, and long-range rearrangements. Such dynamic changes make reconstructing sequence evolution history difficult and affect the accuracy of molecular evolutionary methods, such as multiple sequence alignments (MSAs) and phylogenetic methods. In order to test the accuracy of these methods, benchmark datasets are required. However, currently available benchmark datasets have limitations in their sizes, and the evolutionary histories of the included sequences are unknown. These are serious drawbacks for benchmarks. Such problems can be solved by simulating sequences to create benchmark datasets with known evolutionary history. However, currently available simulation methods do not allow biologically realistic dynamic sequence evolution. We introduced indel-Seq-Gen version 1.0 (iSGv1.0), a program that simulates realistic evolutionary processes of protein sequences with insertions and deletions (indels). iSGv1.0 allows the user to simulate multiple subsequences according to different evolutionary parameters, tracks all evolutionary events including indels and outputs the "true" MSA of the simulated sequences. With indel-Seq-Gen version 2.0 (iSGv2.0), we aimed at simulating evolution of highly divergent DNA sequences and protein superfamilies. iSGv2.0 adds lineage-specific evolution, motif conservation, indel tracking, subsequence length constraints, and incorporates coding and non-coding DNA evolution. We uncovered a flaw in the modeling of indels used in current state-of-the-art methods, and fixed it by using a novel discrete stepping procedure. 
Finally, we developed a new MSA scoring metric called the gap profile score that utilizes insertion and deletion placements to evaluate MSA accuracy. Using a series of benchmark alignments created with iSGv2.0, we examined the performance of our scoring method against currently used character-based scoring metrics, including the sum of pairs score. We examined how well the scoring metric output correlates with accuracy of phylogenetic reconstruction. We show that the gap profile score opens a novel way to gauge the efficacy of MSA reconstructions, potentially opening the door to the research of better models of indel placement into MSA reconstruction methods.
Article
The evolutionary history of a set of taxa is usually represented by a phylogenetic tree, and this model has greatly facilitated the discussion and testing of hypotheses. However, it is well known that more complex evolutionary scenarios are poorly described by such models. Further, ...
Article
This paper describes a model of a gene as a continuous length of DNA represented by the interval [0,1]. The ancestry of a sample of genes is complicated by possible recombination events, where a gene can have two parent genes. An analogue of Kingman’s coalescent process [J. F. C. Kingman, Stochastic Processes Appl. 13, 235-248 (1982; Zbl 0491.60076)], in which the ancestry of a sample of genes at a single locus is described by a stochastic binary tree, is a stochastic ancestral recombination graph, with vertices where coalescent or recombination events occur. All the information about ancestry is contained in this graph. The sample DNA lengths have marginal ancestral trees at each point in [0,1] which are imbedded in the graph. An upper bound is found for the expected number of distinct most recent common ancestors of these trees, and the expected maximum waiting time to these ancestors.
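The ancestral recombination graph described in this abstract can be illustrated with a short Python sketch of its event sequence. It assumes the usual coalescent time scaling, in which k lineages coalesce at rate k(k-1)/2 and recombine at rate kρ/2, with ρ the scaled recombination rate of the whole gene; the function name and the event-tuple layout are hypothetical. The sketch records only when each vertex of the graph occurs and, for recombinations, a uniform breakpoint on [0,1]; it does not build the graph itself or extract the marginal trees.

```python
import random

def simulate_arg_events(n, rho, rng):
    """Trace the number of ancestral lineages back in time until a single
    ancestor remains. Returns a list of (time, kind, breakpoint) tuples."""
    k, t = n, 0.0
    events = []
    while k > 1:
        coal_rate = k * (k - 1) / 2      # any pair of lineages may coalesce
        rec_rate = k * rho / 2           # any lineage may recombine
        total = coal_rate + rec_rate
        t += rng.expovariate(total)      # time to the next event
        if rng.random() < coal_rate / total:
            k -= 1                       # two lineages merge into one parent
            events.append((t, "coalescence", None))
        else:
            k += 1                       # one lineage splits into two parents
            events.append((t, "recombination", rng.random()))
    return events

if __name__ == "__main__":
    rng = random.Random(7)
    events = simulate_arg_events(n=6, rho=1.0, rng=rng)
    n_rec = sum(1 for _, kind, _ in events if kind == "recombination")
    print(len(events), "events,", n_rec, "recombinations")
```

Because coalescence is quadratic in k while recombination is linear, the process reaches a single lineage with probability one, and the number of coalescences always exceeds the number of recombinations by n - 1.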