LOSITAN: a workbench to detect molecular adaptation based on a FST-outlier method. BMC Bioinforma 9:323

Article in BMC Bioinformatics 9(1):323 · February 2008
DOI: 10.1186/1471-2105-9-323 · Source: PubMed
BioMed Central
BMC Bioinformatics
Open Access
Software
LOSITAN: A workbench to detect molecular adaptation based on a Fst-outlier method
Tiago Antao*1, Ana Lopes2, Ricardo J Lopes3, Albano Beja-Pereira3 and Gordon Luikart3,4
Address: 1Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK, 2REQUIMTE, Departamento de Química, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 687, 4169-007 Porto, Portugal, 3CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, Universidade do Porto, Portugal and 4Division of Biological Sciences, University of Montana, Missoula, MT 59812, USA
Email: Tiago Antao* - tiago.antao@liverpool.ac.uk; Ana Lopes - anablopes@gmail.com; Ricardo J Lopes - ricardolopes@mail.icav.up.pt;
Albano Beja-Pereira - albanobp@gmail.com; Gordon Luikart - gordon.luikart@mso.umt.edu
* Corresponding author
Abstract
Background: Testing for selection is becoming one of the most important steps in the analysis of multilocus population genetics data sets. Existing applications are difficult to use, leaving many non-trivial, error-prone tasks to the user.
Results: Here we present LOSITAN, a selection detection workbench based on a well-evaluated Fst-outlier detection method. LOSITAN greatly facilitates correct approximation of model parameters (e.g., genome-wide average, neutral Fst), provides data import and export functions, iterative contour smoothing and generation of graphics in an easy-to-use graphical user interface. LOSITAN is able to use modern multi-core processor architectures by locally parallelizing fdist, reducing computation time by half in current dual-core machines and with almost linear performance gains in machines with more cores.
Conclusion: LOSITAN makes selection detection feasible for a much wider range of users, even for large population genomic datasets, by providing both an easy-to-use interface and the essential functionality to complete the whole selection detection process.
Published: 28 July 2008
BMC Bioinformatics 2008, 9:323 doi:10.1186/1471-2105-9-323
Received: 15 February 2008
Accepted: 28 July 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/323
© 2008 Antao et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Understanding the contribution of selection and molecular adaptation in shaping genome-wide variation is among the most exciting and widely researched problems, with many applications ranging from human health to conservation of endangered species. Among the many selection detection strategies [1], Fst-outlier approaches are becoming widely used [2,3] because they are important not only for studying the genetic basis of adaptation but also for eliminating non-neutral outlier loci from data sets before computing most population genetic parameters (e.g., Fst, Nm, Ne) that require neutral loci [4]. This is particularly important at a time when data sets with information from hundreds of loci are becoming fairly common.
One such Fst method is described in [2,5] (but see also [6] and [7]) and is implemented in the fdist program. It can be used for any codominant genetic molecular markers, including microsatellites, Single Nucleotide Polymorphisms (SNPs) and allozymes. This method evaluates the relationship between Fst and He (expected heterozygosity), describing the expected distribution of Wright's inbreeding coefficient Fst vs. He under an island model of migration [8] with neutral markers. This distribution is used to identify outlier loci that have excessively high or low Fst compared to neutral expectations. Such outlier loci are candidates for being subject to selection.
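The outlier idea described above can be illustrated with a minimal Python sketch. It is not the fdist algorithm: it assumes that a set of simulated neutral (He, Fst) pairs is already available, and it uses a simple, hypothetical fixed-width heterozygosity window to compare each observed locus with the simulated loci of similar He. The function names and the `window` parameter are assumptions made for illustration only.

```python
from bisect import bisect_left


def neutral_quantile(simulated, he, fst, window=0.05):
    """Empirical quantile of `fst` among simulated neutral loci whose
    heterozygosity lies within `window` of `he` (hypothetical windowing)."""
    nearby = sorted(f for h, f in simulated if abs(h - he) <= window)
    if not nearby:
        raise ValueError("no simulated loci near He=%.3f" % he)
    # Fraction of nearby simulated Fst values strictly below the observed one.
    return bisect_left(nearby, fst) / len(nearby)


def classify_loci(simulated, observed, ci=0.99, window=0.05):
    """Label each observed locus as 'high outlier', 'low outlier' or 'neutral'.

    simulated: list of (He, Fst) pairs from neutral simulations
    observed:  dict mapping locus name -> (He, Fst)
    """
    lower = (1.0 - ci) / 2.0
    upper = 1.0 - lower
    labels = {}
    for name, (he, fst) in observed.items():
        q = neutral_quantile(simulated, he, fst, window)
        if q > upper:
            labels[name] = "high outlier"   # candidate for divergent selection
        elif q < lower:
            labels[name] = "low outlier"    # candidate for balancing selection
        else:
            labels[name] = "neutral"
    return labels
```

Calling `classify_loci(simulated, observed)` with the default 99% interval flags only loci whose Fst falls in the extreme 0.5% tails of the simulated values at comparable heterozygosity.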
Using fdist can be a challenging task for those not familiar with command-line applications, and it requires a specific data format not used by other applications [9]. Furthermore, several independent runs are usually needed to tune parameters (e.g., to determine the appropriate average Fst) before a final execution is made, a process that is prone to human-introduced mistakes. Although fdist is not one of the most computationally intensive programs available, it can still take up to one hour for a single run (especially if smooth contours for confidence intervals are required), and in most cases multiple runs are needed for parameter tuning. Large population genomic datasets can take even longer. In this context, fdist requires experienced computer users, and its usage is error prone (e.g., by incorrectly converting data files or not approximating average Fst appropriately).
Implementation
We designed LOSITAN (LOoking for Selection In a TANgled dataset), a selection detection workbench constructed around fdist. LOSITAN is a Java Web Start application coded mostly in Jython with a small part in Java, allowing direct execution from the web. LOSITAN provides the following features:
1. Easy to use interface (Figure 1), directly usable from the
web.
Figure 1. LOSITAN console. Screen shot showing run parameters (bottom panel) and a graphical output with the simulated confidence area for neutral loci (middle color band) with loci from the original empirical dataset represented as dots. Outliers are tagged with labels.
2. Data import in Genepop [10] format.
3. Generation of graphics in several formats (PNG, SVG
and PDF).
Graphics can be generated in several formats (covering both bitmap and vector formats) and parametrized in many ways (from choosing colours to deciding which labels are printed, among others). A completely unedited example of a PNG output is presented in Figure 2.
4. Data export in a format suitable for import into statistical packages like R [11] or commonly used spreadsheet software.
In case the user wishes to analyze the data further or have total flexibility in generating graphics, LOSITAN makes available both the computed confidence intervals and the Fst and heterozygosity values for each locus.
A simple R script is supplied to facilitate loading the data into R. Loading into spreadsheet software is done simply by importing the data as a tab-delimited file.
5. Choice of which populations and/or loci are studied.
6. Approximating mean neutral Fst (in the real dataset) by removing potentially selected loci.
The initial mean dataset Fst is often not neutral in the sense that (initially unknown) selected loci are often included in the computation. LOSITAN can optionally be run once to determine a first candidate subset of selected loci in order to remove them from the computation of the neutral Fst. This value will be, in most cases, a better approximation of the neutral Fst [5]. The procedure works as follows: LOSITAN is run a first time, using all loci to estimate the mean neutral Fst. After the first run, all loci that are outside the desired confidence interval (e.g., the 99% CI) are removed and the mean neutral Fst is computed again using only the putatively neutral loci that were not removed. A second and final run of LOSITAN, using all loci, is then conducted using the last computed mean. This procedure lowers the bias in the estimation of the mean neutral Fst by removing the most extreme loci from the estimation. Naturally, all loci are present in the last run and have their estimated selection status reported.
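The two-pass procedure described for feature 6 can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not LOSITAN's code: `run_selection_test` is a hypothetical callable standing in for a complete fdist-based run, and the dataset mean is taken as a plain arithmetic mean of per-locus Fst values, which is a simplification of the estimator actually used.

```python
def mean_fst(loci):
    """Plain arithmetic mean of per-locus Fst values (a simplification)."""
    return sum(loci.values()) / len(loci)


def two_pass_neutral_fst(loci, run_selection_test):
    """Approximate the neutral mean Fst by excluding first-pass outliers.

    loci: dict mapping locus name -> Fst
    run_selection_test: hypothetical callable (loci, target_fst) -> set of
        locus names falling outside the confidence interval
    Returns (neutral_mean, final_outliers).
    """
    # First run: all loci contribute to the mean used for the simulation.
    first_mean = mean_fst(loci)
    first_outliers = run_selection_test(loci, first_mean)

    # Recompute the mean using only putatively neutral loci.
    neutral = {name: f for name, f in loci.items() if name not in first_outliers}
    if not neutral:
        raise ValueError("every locus was flagged as an outlier")
    neutral_mean = mean_fst(neutral)

    # Second and final run: all loci are tested again with the new mean.
    final_outliers = run_selection_test(loci, neutral_mean)
    return neutral_mean, final_outliers
```

Note that the excluded loci are only left out of the mean; they are still passed to the final run, so every locus receives a selection status.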
7. Approximating average simulated Fst to the average value found in the real dataset even when the experimental conditions are far from the ones where the theoretical formula Fst = 1/(4Nm + 1) holds (e.g., a low number of demes or the usage of the stepwise mutation model, common in microsatellite markers).
To be able to (optionally) approximate the average Fst in conditions far from the theoretical optimum, LOSITAN starts by running fdist for 10,000 realizations using the theoretical value and calculating the average simulated Fst. If the value is too far from the real average Fst, LOSITAN uses a bisection approximation algorithm, running 10,000 realizations for every tentative bisection point. The algorithm works by iteratively slicing the interval of possible Fst values (i.e., between 0 and 1) in half at each iteration and choosing the mean of the bounds on each iteration (with the exception of the first iteration, where one of the extremes is chosen). An example is provided to make the approach clearer:
In a certain demographic scenario we want to simulate a neutral Fst of 0.08. The algorithm starts by trying 0.08. If the result is higher than desired, then 0.0 will be tried (creating an absolute lower bound). After that, 0.04 (i.e., (0.0 + 0.08)/2) will be tried. If the result is too low, 0.06 will be used next (i.e., (0.04 + 0.08)/2). The process repeats until the error margin is acceptable.
In practical terms the method was able to converge to the desired value in all cases tested (a completely trivial bisection approach is not possible, as the method for computing Fst is stochastic and results might vary for the same input conditions).
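The bisection search described for feature 7 might look roughly like the Python sketch below. It is a simplified illustration, not the LOSITAN implementation: `simulate_mean_fst` is a hypothetical callable standing in for a 10,000-realization fdist run, the tolerance and iteration limit are assumed values, and, unlike the description above, the sketch does not spend a simulation on the extreme bound itself but moves directly to the midpoint.

```python
def tune_fst(simulate_mean_fst, target, tol=0.005, max_iter=30):
    """Search for the input Fst whose simulated average matches `target`.

    simulate_mean_fst: hypothetical callable taking the Fst supplied to the
        simulator and returning the average simulated Fst (possibly noisy)
    target: average Fst observed in the real dataset
    """
    lower, upper = 0.0, 1.0   # absolute bounds on possible Fst values
    trial = target            # first attempt: the theoretical value itself
    for _ in range(max_iter):
        result = simulate_mean_fst(trial)
        if abs(result - target) <= tol:
            return trial
        if result > target:
            upper = trial     # simulated Fst too high: search below
        else:
            lower = trial     # simulated Fst too low: search above
        trial = (lower + upper) / 2.0
    raise RuntimeError("no convergence within %d iterations" % max_iter)
```

For a target of 0.08 whose first attempt comes out too high, the next trial is 0.04; if that is too low, 0.06 follows, matching the sequence in the worked example. Because real simulation results are stochastic, a generous tolerance and an iteration cap are needed to guarantee termination.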
8. Iterative smoothing of confidence interval contours.
Contour smoothing is achieved by running fdist an extra
5,000 realizations. The user can request smoothing an
unlimited number of times until the result is deemed sat-
isfactory.
9. Ability to use multiple CPU cores and processors when
running fdist.
To be able to use multiple cores, LOSITAN divides the number of desired simulation repeats among all available cores (although the application detects the number of existing cores, the user is able to change the number of concurrent processes). This is possible because fdist simulation runs are independent, thus making parallelization a simple task. Tests show a near-linear relationship between the number of cores used and performance gains; the existing 5–10% penalty is due mainly to joining the partial results together. LOSITAN, although directly executable from the web, is a client-side application, and all computationally intensive operations occur on the user's computer and not on the server.
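The partitioning strategy described for feature 9 can be sketched in Python as below. This is an assumption-laden illustration rather than LOSITAN's Jython code: `run_chunk` is a hypothetical callable that would launch one external simulation process for a given number of realizations and return its results as a list. Threads are used here only to wait on such external processes concurrently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_realizations(total, workers):
    """Divide `total` realizations as evenly as possible among `workers`."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def run_parallel(total, run_chunk, workers=None):
    """Run independent simulation chunks concurrently and join the results.

    run_chunk: hypothetical callable (n_realizations) -> list of results,
        e.g. a wrapper that starts one external simulation process
    workers: number of concurrent runs; defaults to the detected core count
    """
    if total <= 0:
        raise ValueError("total must be positive")
    workers = workers or os.cpu_count() or 1
    chunks = [n for n in split_realizations(total, workers) if n > 0]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_results = list(pool.map(run_chunk, chunks))
    # Joining the partial results is the only sequential step.
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

Because the chunks share no state, the only coordination needed is the final merge, which is consistent with the small joining penalty reported above.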
10. Automatic and transparent download of the latest ver-
sion of fdist.
We maintain the latest version of the fdist application on
the server, which is downloaded transparently by the cli-
ent application whenever there is a new version. At the
time of this writing the supported version is fdist2.
The interface includes tips for all the less obvious param-
eters and enforces constraints for all the user inputs which
the system can infer are not correct.
Results and discussion
In a beta test release to users, the feedback was generally very positive, stressing essentially that the application is easy to use, makes it easy to input and output data, and deals with non-trivial parameter determination like calculating neutral Fst. Most importantly, it made users aware of issues in data analysis that they were not aware of. For example, users were not aware of how to estimate the genome-wide average neutral Fst from their empirical data set by removing one or a few strong outlier loci and then recomputing the average Fst. Although LOSITAN helps avoid many pitfalls involved with using Fst-outlier approaches in general, it is not able to solve fundamental issues regarding these approaches; for instance, the non-linear behavior of Fst = 1/(4Nm + 1) when Fst approaches zero can make it difficult to detect low Fst-outliers, especially when selection is not strong. As such, an easy-to-use application should not be seen by users as an excuse to avoid critical reasoning about the whole selection detection process. Feedback from users also allows us to chart possible future work, like supporting dominant markers or supporting other selection detection approaches like [3].
Our solution to use all the available computing power on new multi-core hardware is an example of an "embarrassingly simple parallel" computation approach. We contend that having a simple approach is a good principle: the point of this application is to make all computational power available to the users and not to develop new concurrent algorithms. A simple, highly efficient, elegant and less bug-prone approach is what responds to the users' needs, as the objective of this work is not to develop new algorithms, but to use existing ones.
Conclusion
LOSITAN is built along the principles set out in [12], namely that intuitiveness and user empowerment should be fundamental guidelines for software construction targeting biologists. This is done not only by supplying an easy-to-use web interface for an otherwise hard-to-use application, but also by allowing the use of widely utilized population genetic data formats, automating the tuning of nuisance parameters and lowering the computational costs on modern hardware. In addition, strong emphasis is put on trying to avoid errors in the usage of the software, both by enforcing constraints and by giving suggestions on less obvious features. This will lower the barriers to usage of the underlying application, allowing for a wider user base which will be able to concentrate more on the biological problems and less on unnecessary application complexity.
Figure 2. PNG Output. Graphical output from LOSITAN in PNG format without any post-editing.
We are at the dawn of the era of multi-core computing. The vast majority of existing software cannot make use of the extra computational power made available on new machines. Our approach, based on partitioning a computationally intensive task into smaller ones, can be used to leverage the extra computational power even without changing existing code in applications which can be broken into smaller, independently running units. This partitioning approach can be performed in some cases by users on existing software or by programmers in new applications that take advantage of multiple cores. With the current trend of supplying many more cores with new computers, strategies like the one presented here will be mandatory in order to take full advantage of all the existing processing power. LOSITAN is one of the first of many applications to explore the multi-core programming paradigm.
Planned future developments include the addition of other Fst-outlier methods and simulation facilities to explore the effects of different demographic scenarios on Fst variance and the detection of outliers. All the code to handle GenePop and fdist file formats and applications was also donated to the Biopython project and is publicly available starting from version 1.44.
Availability and requirements
Project name: LOSITAN
Project home page: http://popgen.eu/soft/lositan
Development site: http://code.google.com/p/lositan/
Operating systems: Platform independent
Programming language: Java and Jython
Other requirements: Browser with Java Web Start to run over the internet (software can be run locally).
Windows: At least Windows 2000 and Java 1.6.
Mac OS X: 10.4 (Tiger) and Java 1.5 (most current 10.4 installations will require a freely available Java update).
Linux: Java 1.6 and the free GNU C compiler.
License: GNU GPL
Any restrictions to use by non-academics: None
Authors' contributions
TA is the lead architect and main developer of LOSITAN, and drafted this publication. AB-P and GL both drafted the original idea of developing LOSITAN and, together with TA and RJL, contributed to discussions, planning and writing of this manuscript. RJL developed the web page and tutorials, and AL developed the code for multi-core detection, graphics and data export.
Acknowledgements
This work was partially supported by the Bill & Melinda Gates Foundation
(grant #39777).
TA was supported by research grant SFRH/BD/30834/2006, RJL by SFRH/
BPD/14953/2004 and AB-P by SFRH/BPD/17822/2004 and this work was
supported by POCI/CVT/567558/2004 all from Fundacao para a Ciencia e
Tecnologia (FCT), Portugal. GL was supported by the Luso-American
Foundation, UP, CIBIO and research grant PTDC/BIA-BDE/65625/2006
from FCT.
References
1. Nielsen R: Molecular signatures of natural selection. Annual
Reviews in Genetics 2005, 39:197-218.
2. Beaumont MA: Adaptation and speciation: what can Fst tell us?
Trends Ecol Evol 2005, 20(8):435-440.
3. Vitalis R, Dawson K, Boursot P: Interpretation of variation
across marker loci as evidence of selection. Genetics 2001,
158(4):1811-1823.
4. Luikart G, England PR, Tallmon D, Jordan S, Taberlet P: The power
and promise of population genomics: from genotyping to
genome typing. Nat Rev Genet 2003, 4(12):981-994.
5. Beaumont MA, Nichols RA: Evaluating loci for use in the genetic
analysis of population structure. Proceedings of the Royal Society B
1996, 263:1619-1626.
6. Cavalli-Sforza LL: Population Structure and Human Evolution.
Proc R Soc Lond B Biol Sci 1966, 164:362-379.
7. Lewontin RC, Krakauer J: Letters to the editors: Testing the
heterogeneity of F values. Genetics 1975, 80(2):397-398.
8. Wright S: Evolution in Mendelian Populations. Genetics 1931,
16(2):97-159.
9. Excoffier L, Heckel G: Computer programs for population
genetics data analysis: a survival guide. Nat Rev Genet 2006,
7(10):745-758.
10. Raymond M, Rousset F: GENEPOP: population genetics soft-
ware for exact tests and ecumenicism. Journal of Heredity 1995,
86:248-249.
11. R Development Core Team: R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria; 2007. [ISBN 3-900051-07-0].
12. Kumar S, Dudley J: Bioinformatics software for biologists in the
genomics era. Bioinformatics 2007, 23(14):1713-1717.