Sequence-based association and selection scans identify drug resistance loci in the Plasmodium falciparum malaria parasite

Article in Proceedings of the National Academy of Sciences 109(32):13052-7 · July 2012
DOI: 10.1073/pnas.1210585109 · Source: PubMed
Through rapid genetic adaptation and natural selection, the Plasmodium falciparum parasite--the deadliest of those that cause malaria--is able to develop resistance to antimalarial drugs, thwarting present efforts to control it. Genome-wide association studies (GWAS) provide a critical hypothesis-generating tool for understanding how this occurs. However, in P. falciparum, the limited amount of linkage disequilibrium hinders the power of traditional array-based GWAS. Here, we demonstrate the feasibility and power improvements gained by using whole-genome sequencing for association studies. We analyzed data from 45 Senegalese parasites and identified genetic changes associated with the parasites' in vitro response to 12 different antimalarials. To further increase statistical power, we adapted a common test for natural selection, XP-EHH (cross-population extended haplotype homozygosity), and used it to identify genomic regions associated with resistance to drugs. Using this sequence-based approach and the combination of association and selection-based tests, we detected several loci associated with drug resistance. These loci included the previously known signals at pfcrt, dhfr, and pfmdr1, as well as many genes not previously implicated in drug-resistance roles, including genes in the ubiquitination pathway. Based on the success of the analysis presented in this study, and on the demonstrated shortcomings of array-based approaches, we argue for a complete transition to sequence-based GWAS for small, low linkage-disequilibrium genomes like that of P. falciparum.
Daniel J. Park, Amanda K. Lukens, Daniel E. Neafsey, Stephen F. Schaffner, Hsiao-Han Chang, Clarissa Valim, Ulf Ribacke, Daria Van Tyne, Kevin Galinsky, Meghan Galligan, Justin S. Becker, Daouda Ndiaye, Souleymane Mboup, Roger C. Wiegand, Daniel L. Hartl, Pardis C. Sabeti, Dyann F. Wirth, and Sarah K. Volkman
Broad Institute of MIT and Harvard, Cambridge, MA 02142;
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138;
Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, MA 02115;
Faculty of Medicine and Pharmacy, Université Cheikh Anta Diop de Dakar, BP 5005, Dakar Fann, Sénégal; and
School for Nursing and Health Sciences, Simmons College, Boston, MA 02115
Contributed by Daniel L. Hartl, June 21, 2012 (sent for review March 26, 2012)
The malaria parasite Plasmodium falciparum imposes a tremendous disease burden on human societies and is responsible for 1.2 million deaths annually (1). Current efforts to eradicate malaria depend on the continued success of antimalarial drugs (2); however, the emergence of drug-resistant parasites threatens to hamper global health efforts to control and eliminate the disease. Understanding the genetic basis of these adaptations will be necessary to maintain effective global health policies in the face of an ever-changing pathogen.
A key to elucidating the genetic basis of drug resistance is identifying the specific genes associated with the phenotype. In human studies of this kind, the genome-wide association study (GWAS) has overtaken the classic candidate gene approach, made affordable by the use of genotyping arrays (or SNP arrays) that measure only a subset of variants in the genome (3). This optimization is only possible because of the extensive correlation between genetic markers (linkage disequilibrium or LD) in the human genome, which allows the subset of SNPs on an array to act as proxies for other markers not present; this process is known as tagging (4).
In P. falciparum, however, array-based GWAS is severely limited by the relatively short extent of LD (5–8). Lacking that correlation between genetic markers, genotyping arrays usually cannot detect associations with untyped markers, effectively limiting inferences to markers actually present on the array; even the highest density P. falciparum array reported to date found that LD between adjacent markers on the array was too weak for tagging in African populations (6). Consequently, current P. falciparum arrays cannot confidently capture all causal variants for important phenotypes.
The rapidly decreasing cost of whole-genome sequencing offers a promising solution. In principle, working with a whole-genome sequence allows one to directly assay all mutations segregating in the population, obviating the detection problems associated with short LD. Discovering mutations directly also avoids the ascertainment bias inherent to arrays, bias that is exacerbated when SNP discovery and genotyping are performed in different populations (9). Additionally, the small size of the P. falciparum genome (23 Mb, roughly the size of a human exome) makes sequencing it potentially 100-fold cheaper than whole-genome sequencing in humans. As malaria sequencing projects become cost-competitive with genotyping arrays, whole-genome sequencing has the potential to become the most effective approach to performing association studies in malaria.
Here, we test the hypothesis that whole-genome sequencing will identify SNP associations not detected by classic array-based approaches. We apply this method to identify loci in the P. falciparum genome that are associated with antimalarial drug resistance and compare the approach to a standard array-based GWAS. We improve the statistical power of this analysis by adapting a commonly used selection test, the cross-population extended haplotype homozygosity (XP-EHH) test (10), and use it as an association test for positively selected phenotypes. These approaches identify a number of candidate loci associated with antimalarial drug resistance, including genes in the ubiquitination pathway, suggesting that alteration of the parasite's ability to modulate stress may contribute to evasion of drug pressure and development of resistance in P. falciparum.
Author contributions: D.J.P., A.K.L., D.E.N., S.F.S., R.C.W., D.L.H., P.C.S., D.F.W., and S.K.V. designed research; D.V.T., M.G., J.S.B., and S.K.V. performed research; D.N. and S.M. contributed new reagents/analytic tools; D.J.P., A.K.L., D.E.N., S.F.S., H.-H.C., C.V., K.G., and S.K.V. analyzed data; and D.J.P., A.K.L., D.E.N., S.F.S., U.R., D.L.H., P.C.S., D.F.W., and S.K.V. wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
Data deposition: The SNP data have been deposited at dbSNP, projects/SNP (batch id Pf_0004 from submitter BROAD-GENOMEBIO), are accessible via the Broad Institute, 12-supple-1.zip, and have also been deposited in PlasmoDB v9.1, The consensus calls for the whole genome are available via the Broad Institute, ftp://ftp.broadinstitute.
To whom correspondence may be addressed. E-mail:, dhartl@, or
D.L.H., P.C.S., D.F.W., and S.K.V. contributed equally to this work.
This article contains supporting information online at
August 7, 2012, vol. 109, no. 32
Forty-Five Parasite Genomes and the Absence of LD. We chose a population in a West African region near Dakar, Senegal and culture-adapted 45 P. falciparum parasites recently isolated from malaria-infected patients. This population is particularly relevant for these studies because it has recently been exposed to multiple, changing drug regimens as clinical resistance to traditional drugs has emerged (11). We obtained whole-genome sequence data and generated high-quality consensus base calls for an average of 83% of each genome. This process produces 225,623 segregating SNPs, of which 25,757 met our call rate and minor allele frequency criteria for further study (see Methods). Sequence-based SNP calling in P. falciparum is technically challenging because of its extremely AT-rich genome (12, 13). In light of this finding, we validated our sequence-based approach against array-based methods by using a previously described SNP array (6) to genotype 24 of the 45 isolates. Of the 74,656 SNPs assayed by the array, 4,653 meet our call rate and minor allele frequency criteria. We observe nearly perfect concordance between Affymetrix genotypes and sequence genotypes (see Methods).
Our data demonstrate that SNPs in P. falciparum have very little ability to tag neighboring SNPs because of the short LD in the African population from which they were sampled. Although some portions of the genome exhibit significant LD, over 62% of the SNPs in the genome have no LD (r² < 0.05) between adjacent SNPs, and 87% of the SNPs have insufficient LD to tag their neighbor (Fig. 1A) using the criterion derived from human studies (r² < 0.8) (4). To measure tagging ability directly, we simulate genotyping arrays of various sizes by sampling random subsets of SNPs from our sequence data. We find that the simulated arrays are not able to tag a significant portion of unassayed markers, a result in stark contrast to the performance of human arrays (Fig. 1B). The tagging performance of our own Affymetrix array (tagging only 22.6% of segregating SNPs in Senegal) is even lower than that of simulated arrays of similar size (Fig. 1B), most likely because of population-based ascertainment biases (9) that were not modeled in our idealized approach. These findings lead us to conclude that array-based studies in P. falciparum will rarely be able to detect signals resulting from mutations not present on the array.
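As an illustration of the adjacent-SNP LD summary described above, the following is a minimal sketch rather than the pipeline used in the study. It assumes haploid genotypes coded 0/1 with `None` for missing calls; the function names `r_squared` and `adjacent_ld_summary` are hypothetical.

```python
def r_squared(a, b):
    """r^2 between two biallelic SNPs coded 0/1 across haploid samples.

    Samples with a missing call (None) at either SNP are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    p_a = sum(x for x, _ in pairs) / n          # frequency of allele 1 at SNP a
    p_b = sum(y for _, y in pairs) / n          # frequency of allele 1 at SNP b
    p_ab = sum(1 for x, y in pairs if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # one SNP is monomorphic among the shared samples
    d = p_ab - p_a * p_b                        # LD coefficient D
    return d * d / denom


def adjacent_ld_summary(snps, thresholds=(0.05, 0.2, 0.8)):
    """Fraction of adjacent SNP pairs whose r^2 falls below each threshold."""
    values = [r_squared(snps[i], snps[i + 1]) for i in range(len(snps) - 1)]
    return {t: sum(v < t for v in values) / len(values) for t in thresholds}


# Toy data (not from the study): three SNPs typed in four parasites.
toy_snps = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
summary = adjacent_ld_summary(toy_snps)
```

Applied to SNPs ordered by genomic position, the fraction below 0.8 corresponds to the share of SNPs that cannot tag their neighbor under the criterion quoted in the text.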
Sequence-Based GWAS. The goal of these studies is to identify genomic changes associated with changes in parasite response to antimalarial drugs, as measured in the set of 45 independent P. falciparum isolates. We assayed the cultured parasites for in vitro drug responses (measured by IC50) to 12 standard antimalarials: amodiaquine, artemisinin, atovaquone, chloroquine, dihydroartemisinin, halofantrine, lumefantrine, mefloquine, piperaquine, primaquine, pyrimethamine, and quinine. Responses to these antimalarials constitute the 12 phenotypes used in our association studies (Fig. S1). Not surprisingly, drugs with similar chemical structures (e.g., halofantrine, lumefantrine, and mefloquine) show a strong correlation in responses (Fig. S2), as has previously been observed (6, 7), and provide the opportunity for cross-validation of SNPs identified in association studies.
[Fig. 1 image not reproduced. Panel A axes: LD to adjacent SNP (r²) against % of SNPs. Panel B axes: % of total SNPs sampled against % of total SNPs tagged.]
Fig. 1. Simulated P. falciparum arrays are unable to tag SNPs not present on the array. (A) A histogram of LD between adjacent SNPs from sequenced P. falciparum (black). The vast majority of markers have little to no LD with their neighbors (62% of SNPs have r² < 0.05, 76% have r² < 0.2, and 87% have r² < 0.8). This finding contrasts with human studies, where much more of the genome shows moderate to strong LD between neighboring SNPs (gray). (B) Simulated genotyping marker sets of various sizes are plotted against the percentage of the entire sequenced marker set that they are able to tag (with r² ≥ 0.8). The dashed, identity line depicts the theoretical scenario where all SNPs are in complete linkage equilibrium and no SNP tags another. Because this is true of 87% of SNPs in the malaria sequence data, the increase is almost linear (black dots). This finding contrasts with the array tagging performance seen in human studies (gray dots), where only a small fraction of markers are needed to tag the bulk of the genome, a principle upon which the array-based GWAS depends. The open triangle depicts the actual performance of the Affymetrix-based Broad Institute P. falciparum SNP array (6).
To test associations between SNP genotypes and drug response, we use efficient mixed-model association (EMMA). EMMA is a quantitative association approach well-suited for small sample sizes and partially inbred organisms, such as the malaria parasite (14). It is a commonly used tool among mixed-model GWAS approaches (15) and has recently demonstrated effectiveness with P. falciparum drug studies (6). After correcting for multiple testing (Bonferroni correction for 25,757 SNPs, P < 2 × 10⁻⁶), EMMA is able to detect a number of previously known markers of drug resistance, such as four nonsynonymous SNPs in pfcrt (conferring amino acid changes: N75E/K, K76T, Q271E, R371I) (16, 17) associated with chloroquine response, one pfmdr1 SNP (conferring amino acid change: N86Y) (18, 19) associated with halofantrine, lumefantrine, and mefloquine response, and three dhfr SNPs (conferring amino acid changes: N51I, C59R, S108N) (20) associated with pyrimethamine response. We note here that, although mitochondrial and apicoplast genomes were also sequenced, no significant associations were found and the known mitochondrial mutations associated with atovaquone resistance in cytochrome b (codons 268, 133, and 280) (21, 22) were fixed in all 45 individuals for the drug-sensitive alleles. In all, EMMA detects 34 significant SNPs associated with parasite response to five drugs (Fig. S3). Most of these SNPs are in or near previously known associations (8), and five are previously unknown associations with pyrimethamine response (Dataset S1).
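The Bonferroni threshold quoted for the EMMA scan follows from dividing the family-wise error rate by the number of SNPs tested. The short sketch below works through that arithmetic for the SNP counts reported in the text and in the Fig. 2 caption; the helper name `bonferroni_threshold` is illustrative and is not part of EMMA.

```python
from math import log10


def bonferroni_threshold(alpha, n_tests):
    """Per-test P value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# SNP counts quoted in the text (25,757) and in the Fig. 2 caption.
snp_counts = {
    "Array 24": 7068,
    "Seq 24": 17278,
    "Seq 45": 25159,
    "all filtered SNPs": 25757,
}

thresholds = {name: bonferroni_threshold(0.05, n) for name, n in snp_counts.items()}
for name, p in thresholds.items():
    # Print the cutoff and its position on a -log10 axis.
    print(f"{name}: P < {p:.2e} (-log10 = {-log10(p):.2f})")
```

Because the cutoff shrinks as more SNPs are tested, a sequence-based scan pays a stricter per-SNP threshold than an array-based one, in exchange for typing the causal variants directly.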
Although these sequence-based findings validate the previously known relationship between the pfmdr1 gene and parasite responses to halofantrine, lumefantrine, and mefloquine, it is notable that this association is not detectable by our SNP array (Fig. 2), as the array lacks any markers in pfmdr1 with a sufficiently high minor allele frequency. This finding exemplifies the type of association that can be missed by arrays because of limited LD. Additionally, the agreement between these three drugs at this locus provides validation of this result with respect to structurally related drugs.
Using Haplotype-Based Selection Tests for Association. To test the hypothesis that drug resistance is largely driven by positive selection, we searched for long haplotypes associated with selection for drug resistance using the XP-EHH test (10). This selection test has not previously been used as a GWAS tool, but it is well suited for this purpose when we presume that the phenotype we are studying is under positive selection. Although this assumption is not valid for most human-based GWAS for noncommunicable diseases, it is very likely to be the case when studying parasite genomes for resistance adaptations to widely used drugs, which represent a strong selective pressure. Used in this way, the XP-EHH test identifies areas in the genome where resistant parasites show much longer haplotypes than sensitive parasites, indicative of recent positive selection on the resistant population. In our data, the test detects a number of signals, including pfcrt and dhfr, as well as a number of other hits spanning a total of 32 genomic regions across 11 drugs (Fig. 3, Fig. S4, and Dataset S1). Seventeen of these regions are indicative of selection in the drug-resistant population, whereas 15 are consistent with selection in the drug-sensitive population. With the exception of the regions containing pfcrt and dhfr, none of these loci were detected by EMMA alone.
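To make the idea of comparing haplotype lengths between resistant and sensitive parasites concrete, here is a simplified sketch of an XP-EHH-style statistic. It is not the implementation used in the study: it assumes complete haploid haplotypes with no missing data, integrates EHH across all supplied SNPs without the usual truncation rules, and omits the genome-wide normalization that turns the log ratio into a z-score. All function names are hypothetical.

```python
from collections import Counter
from math import log


def ehh(haplotypes, core, other):
    """Probability that two randomly chosen haplotypes are identical at
    every SNP between index `core` and index `other` (inclusive)."""
    lo, hi = min(core, other), max(core, other)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    n = len(haplotypes)
    total_pairs = n * (n - 1) / 2
    same_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return same_pairs / total_pairs


def integrated_ehh(haplotypes, positions, core):
    """Trapezoidal integral of EHH over physical position, in both
    directions away from the core SNP."""
    total = 0.0
    for step in (-1, 1):
        i = core
        prev = ehh(haplotypes, core, core)
        while 0 <= i + step < len(positions):
            j = i + step
            cur = ehh(haplotypes, core, j)
            total += 0.5 * (prev + cur) * abs(positions[j] - positions[i])
            prev, i = cur, j
    return total


def unnormalized_xpehh(resistant, sensitive, positions, core):
    """Log ratio of integrated EHH; positive values mean longer
    haplotypes in the resistant group."""
    return log(integrated_ehh(resistant, positions, core)
               / integrated_ehh(sensitive, positions, core))


# Toy data (not from the study): one long shared haplotype among resistant
# parasites, and more diverse haplotypes among sensitive parasites.
positions = [0, 10, 20]
resistant = [[1, 1, 1]] * 4
sensitive = [[0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
score = unnormalized_xpehh(resistant, sensitive, positions, core=1)
```

In this toy case the resistant group is identical across the whole window while the sensitive group is not, so the score is positive, which matches the sign convention described for Fig. 3.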
Although this approach does not detect the known pfmdr1 locus, this is consistent with our expectations because of the nature of the test. The N86Y mutation in pfmdr1 confers increased susceptibility (18, 19) to many drugs compared with the wild-type allele. As such, this SNP would not be an expected candidate for positive natural selection on a novel variant, the type of selection XP-EHH is designed to detect. Moreover, the absence of a pfmdr1 signal from the XP-EHH test is consistent with the lack of findings in this gene from previous genomic scans for positive selection based on the relative EHH, iHS (integrated haplotype statistic), and XP-EHH tests in multiple populations (5, 6, 23).
In searching for long haplotypes, the XP-EHH test typically identifies a large number of significant SNPs in close proximity to each other. These regions often span many tens of kilobases and several annotated genes. This result is expected because the process of positive natural selection increases the prevalence of both the selected variant and nearby variants, generating local regions of extended haplotypes. Thus, although XP-EHH strongly implicates these 32 regions as areas of phenotype-associated positive selection, by itself it is usually unable to localize the source of this selection to a specific gene. We use P values from EMMA to improve signal localization by identifying the strongest signals of association within each region. This approach allows us to suggest a possible gene or mutation as a focus of phenotype-specific positive selection for each identified region (Dataset S1) and is reminiscent of earlier approaches that intersect selection and association results (23, 24).
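The localization step described above, picking the strongest single-marker association inside each selection region, can be sketched as follows. This is an illustrative outline under assumed data structures (tuples of chromosome, position, and P value); the function name and the toy coordinates are hypothetical and do not come from Dataset S1.

```python
def localize_regions(regions, snp_pvalues):
    """For each (chrom, start, end) region, return the SNP with the smallest
    association P value inside it, or None if no tested SNP falls there.

    snp_pvalues is a list of (chrom, position, p_value) tuples.
    """
    best = {}
    for region in regions:
        chrom, start, end = region
        inside = [s for s in snp_pvalues
                  if s[0] == chrom and start <= s[1] <= end]
        best[region] = min(inside, key=lambda s: s[2]) if inside else None
    return best


# Toy inputs (hypothetical coordinates and P values).
regions = [("chrA", 1000, 5000), ("chrB", 200, 900)]
snps = [
    ("chrA", 1500, 1e-3),
    ("chrA", 3200, 1e-7),   # strongest association inside the chrA region
    ("chrA", 9000, 1e-9),   # stronger, but outside the region
    ("chrC", 3200, 1e-10),  # different chromosome
]
focus = localize_regions(regions, snps)
```

A region with no tested SNPs maps to `None`, which keeps such regions visible for manual follow-up instead of silently dropping them.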
A more comprehensive examination of the regions under drug-associated selection reveals discrete biological pathways and processes that may be particularly important as mediators of drug response in P. falciparum (SI Results). The 59 genes in these 32 regions can be functionally classified as surface molecules or transporters, genome maintenance or transcriptional regulation, metabolic enzymes including lipid metabolizers, and members of the ubiquitin proteasome system. Most surface molecule-associated mutations and intergenic mutations are localized to intrachromosomal clusters containing var, rifin, and stevor genes, and a number of genes are found among molecules modulating ubiquitination, lipid metabolism, or folate metabolism. Members of these pathways are also represented in the large region of pyrimethamine-specific selection on chromosome 6, where it is difficult to localize the focus of selection. Collectively, these findings argue that certain biological processes in general, and genes in the ubiquitination and lipid metabolism pathways in particular, play important roles in modulating drug responses in P. falciparum.
[Fig. 2 image not reproduced. Three panels (Array 24, Seq 24, Seq 45), each plotted against chr 5 position (Mb).]
Fig. 2. Mefloquine association signals around the known drug resistance locus pfmdr1. EMMA results are shown for all of chromosome 5 with P values for each SNP on a -log10 scale against physical position. The array-based study (Array 24) does not detect any association at the known pfmdr1 locus because of a lack of marker coverage within the gene and of sufficient LD around the gene. The sequence-based study with the same 24 samples (Seq 24) detects the expected hit at 0.96 Mb. Including all samples from the sequence-based study (Seq 45) increases the strength of this signal. The dashed line indicates the Bonferroni-corrected significance threshold (P = 0.05, genome-wide SNP counts are 7,068, 17,278, and 25,159, respectively).
Fig. 3. Significant signals of drug-associated selection across five antimalarial drugs. XP-EHH results are shown using a Manhattan-inspired plot, with SNP z-scores plotted against genomic position, with each chromosome colored separately. Positive z-scores suggest selection in drug-resistant parasites, negative z-scores suggest selection in sensitive parasites. The dashed lines indicate the two-sided Bonferroni significance thresholds (P = 0.025 and 0.975). Only drugs with significant hits are shown here; z-score and quantile-quantile plots for all drugs are shown in Fig. S4.
Complete genome sequencing provides many advantages over array-based genotyping for association studies. These advantages include the ability to directly type the causal allele, the increased detection power from increased marker density, and the ability to overcome ascertainment biases that arise when studying different populations with a fixed marker set. In P. falciparum, the lack of tagging ability because of the near absence of long-range LD limits the utility of arrays for association studies. Furthermore, the small genome size of P. falciparum brings the cost of whole-genome sequencing to approximate parity with traditional genotyping arrays, and recent advances in pathogen-specific DNA-enrichment and host-specific DNA-depletion techniques for clinical samples make the sequence-based GWAS approach more accessible and cost-effective than ever before (13, 25).
We introduce a selection-association approach based on the XP-EHH selection test. Although this approach may not be appropriate for many association studies, it is sensible when the phenotype under study is under strong selection, which is likely the case for drug resistance in pathogens. As a haplotype-based test that takes advantage of multiple, adjacent SNPs, it has the advantage of being more sensitive than single-marker approaches like EMMA, given the same sample size (4). In addition to detecting new signals of drug-associated selection, we also find that the directional nature of the test statistic, a z-score, provides useful information about whether the selection is associated with drug sensitivity or resistance. Consequently, we also introduce an alternative visualization of the output: a Manhattan-like plot of z-scores, instead of -log10 P values, to illustrate the directionality of the signals (Fig. 3). In our data, we observed a tendency for many drugs (artemisinin, dihydroartemisinin, primaquine, halofantrine, lumefantrine, and mefloquine) to show highly significant signals of selection for drug sensitivity at pfcrt, the gene known to be responsible for chloroquine resistance (Fig. S4). Although, in principle, this type of signal may result from selection toward drug sensitivity, in this particular case it most likely results from the general pattern of anticorrelation between chloroquine and these six other drugs (Fig. S2). Additionally, the absence of a significant chloroquine sensitivity signal at pfcrt is consistent with reports that the return of chloroquine-sensitive parasites in Africa did not result from a classic selective sweep (26). In either case, the Manhattan-like z-score plots allow us to note the presence of these drug-sensitivity signals while keeping them visually separate from the drug resistance signals on which we wish to focus.
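The directional reading of z-scores can be illustrated with a short sketch that derives two-sided Bonferroni cutoffs on the z scale and labels a score by its sign. The number of tests used here (25,757) is an assumption taken from the filtered SNP count; the text does not state how many SNPs entered the XP-EHH correction. The function names are hypothetical.

```python
from statistics import NormalDist


def two_sided_z_thresholds(n_tests, alpha=0.05):
    """Lower and upper z cutoffs for a two-sided test at family-wise level
    alpha, splitting alpha equally between the two tails."""
    tail = alpha / 2 / n_tests
    upper = -NormalDist().inv_cdf(tail)  # standard normal quantile
    return -upper, upper


def classify(z, n_tests):
    """Label a z-score using the sign convention described for Fig. 3."""
    lower, upper = two_sided_z_thresholds(n_tests)
    if z >= upper:
        return "selection in resistant parasites"
    if z <= lower:
        return "selection in sensitive parasites"
    return "not significant"


# Assumed number of tests: the 25,757 filtered SNPs.
lower, upper = two_sided_z_thresholds(25757)
```

Plotting z-scores against position with these two horizontal cutoffs keeps sensitivity-associated and resistance-associated signals on opposite sides of zero.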
Our approaches identify a significant number of loci associated with changes in drug response (Dataset S1). The strongest of these loci contain previously known mediators of resistance, such as the mutations in pfcrt, pfmdr1, and dhfr. Curation of our remaining results using a variety of gene and protein prediction algorithms and literature searches (27) points to several cellular processes and pathways of potential interest, including the ubiquitin proteasome system, lipid metabolism, and folate metabolism (Dataset S1). We argue that these findings point to biological processes used by the parasite to survive drug pressure or circumvent the action of antimalarial compounds. Other genes of interest include those encoding three ABC transporters, a class of transporters known to modulate drug responses in other organisms (28), and genes proposed to modulate chromatin (29, 30), DNA repair (31, 32), or RNA binding (33), pathways that have been shown to potentially be altered in response to drug pressure.
A number of the signals of recent positive selection are unique to pyrimethamine-resistant parasites. Although the known resistance locus, dhfr, is present among these, there are even stronger signals of pyrimethamine-associated selection on chromosome 6 and chromosome 12. The region on chromosome 6 contains two previously uncharacterized genes proposed to participate in folate metabolism (PFF1360w and PFF1490w), as well as five genes encoding proteins acting as either chaperones or in ubiquitination (PFF1365c, PFF1485w, PFF1445c, PFF1415c, and PFF1505w), and three genes encoding molecules likely to modulate lipid metabolism (PFF1350c, PFF1375c-a/b, and PFF1420w). In the chromosome 12 region, the XP-EHH test produces significant P values for eight SNPs over a 15-kb region spanning five adjacent genes. The extended haplotypes surrounding these SNPs continue even further, spanning 28 kb and 14 genes in total (Fig. 4A). These results present challenges for experimental validation, as the goal of association studies is to generate a small number of testable hypotheses about molecular mechanisms. Fortunately, the use of EMMA P values in this region can assist in localizing the signal. We find that the strongest EMMA SNP coincides with the strongest XP-EHH SNP, which is a nonsynonymous mutation in PFL2100w, a putative ubiquitin-conjugating enzyme (E2) (Fig. 4B). Additionally, a significant, pyrimethamine-specific selection signal on chromosome 8 is entirely contained within MAL8P1.23 [a putative HECT (homologous to the E6-AP carboxyl terminus) ubiquitin ligase E3] (Dataset S1), another gene in the ubiquitin-mediated pathway (34). Given the role of this pathway in directing protein degradation and recycling, it is possible that alterations in these genes create changes in stress responses or protein turnover of key resistance modulators that allow the parasite to survive under drug pressure.
Fig. 4. Localizing the pyrimethamine-associated selection signal on chromosome 12. (A) Defining the region: XP-EHH identifies eight genome-wide significant SNPs in close proximity on chromosome 12. Each of these eight SNPs represents the center of an area of extended haplotype homozygosity, as measured by the EHH statistic. Haplotype decay for resistant parasites is plotted for each of these eight SNPs, which defines a larger region from 1.807 Mb to 1.835 Mb in which the causal mutation may exist. This region spans 28 kb and 14 genes. (B) Localizing the signal: focusing within this region, we use single-marker association signals from EMMA to localize the signal. The most significant EMMA SNP coincides with the most significant XP-EHH SNP and localizes to an E398D amino acid change in PFL2100w (ubiquitin conjugating enzyme E2).
The evolution of drug resistance in the natural setting is likely to be a multistep process and our work potentially identifies key pathways involved in this process. Field-based evidence has demonstrated a reduced fitness for drug-resistant parasites in the absence of drug pressure, and laboratory-based work has demonstrated the relative fitness of different mutational changes in target enzymes. Our findings point to potential compensatory mutations in a pathway related to protein stability and turnover, and it is tempting to speculate that such adaptations enable the expression of a resistant phenotype, such as has been observed in yeast (35). Although molecular approaches are required to validate the role of this pathway in modulating drug response, these results demonstrate the potential for sequence-based GWAS approaches to identify pathways, in addition to individual genes, that may be responsible for the phenotype of interest.
Ultimately, all association results require experimental validation and follow-up work to explore possible mechanisms of action. Association studies, even in their ideal form, simply generate hypotheses based on correlations. However, improved methods for association studies can significantly reduce the necessary validation work by reducing false-positive rates, increasing study-detection power, and improving localization ability. This study successfully pilots the use of whole-genome sequence data for association studies in malaria and demonstrates significant advantages in detection power over array-based studies. We strongly recommend that future association studies in low-LD, small-genome organisms adopt the sequence-based GWAS approach as well, given the relative costs. We additionally demonstrate the effectiveness of the XP-EHH selection test as an association test for phenotypes under positive selection. Finally, we combine data from both tests to localize long signals and reduce the number of hypotheses for follow-up validation. This combined approach identifies more candidate loci than single-marker tests alone.
Materials and Methods
Sequencing. Parasites were obtained from patients with uncomplicated mild malaria in Senegal from 2001 to 2009 under ethical approval from the Institutional Review Board at the Harvard School of Public Health under protocol #16330-106 with informed consent for the study. Parasites were culture-adapted by standard methods (36) and genomic DNA was extracted from 45 single-clone samples. Samples were determined to be monoclonal and genetically distinct by a 24 SNP molecular barcode (37). Genomic DNA was sequenced using Illumina Hi-Seq machines. The first 12 parasites were sequenced with 76-bp single-end reads and the remaining 33 were sequenced with paired-end reads ranging from 76 bp to 101 bp in length. The median sequence coverage depth was 144.8× after alignment (ranging from 32× to 400×). Reads were aligned with the Burrows-Wheeler Aligner (BWA) v0.5.9-r16 against the 3D7 reference assembly (PlasmoDB v7.1). A consensus sequence was called for each strain using the GATK Unified Genotyper v1.2.3-g61b89e2 (38) with the following parameters: -A AlleleBalance -stand_emit_conf 0 --output_mode EMIT_ALL_SITES. Bases were then removed if they exhibited poor quality (GQ less than 30 or QUAL less than 60) or if they called a heterozygous genotype. This process left consensus calls for 56–91% of the genome (83% median) for each of 45 individuals. Of these sites, 225,623 positions are polymorphic among the 45 individuals. Of these SNPs, only 25,757 had genotypes in at least 36 individuals (80% call rate) and were nonsingletons (i.e., minor allele count > 1 or minor allele frequency > 4%). All analyses are based on this set of 25,757 SNPs. SNP data are available in dbSNP as batch Pf_0004 from submitter BROAD-GENOMEBIO. SNPs have been deposited at PlasmoDB (27) v9.1 to allow easy searching and visualization in combination with other malaria genomic data sets. SNP data can also be found in ref. 39. Consensus calls for the whole genome are available in ref. 40.
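The SNP filtering criteria stated above (genotypes in at least 36 of 45 individuals and a minor allele count above 1) can be expressed as a small predicate. This is a sketch under assumed conventions, with calls coded as allele labels and `None` for a missing or removed call; it is not the authors' filtering script, and the name `passes_filters` is hypothetical.

```python
from collections import Counter


def passes_filters(calls, min_called=36, min_minor_count=2):
    """Return True if a SNP meets the call-rate and nonsingleton criteria.

    calls: one entry per individual; None marks a missing call.
    min_called: minimum number of individuals with a genotype
        (36 of 45 corresponds to the 80% call rate in the text).
    min_minor_count: minimum minor allele count (2 excludes singletons).
    """
    called = [c for c in calls if c is not None]
    if len(called) < min_called:
        return False
    counts = Counter(called)
    if len(counts) != 2:
        return False  # monomorphic or not biallelic among called samples
    return min(counts.values()) >= min_minor_count


# Toy SNPs across 45 individuals (not from the study).
kept = [0] * 34 + [1] * 2 + [None] * 9        # 36 calls, minor count 2
low_call = [0] * 33 + [1] * 2 + [None] * 10   # only 35 calls
singleton = [0] * 35 + [1] + [None] * 9       # 36 calls, minor count 1
```

The call-rate cutoff is passed as an integer count so that the 36-of-45 boundary is not affected by floating-point rounding.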
Principal component analysis was conducted using the program SMARTPCA (41) in the EIGENSOFT 3.0 package. We applied a local LD correction (nsnpldregress = 2) and found no significant eigenvectors in the population.
Tagging Analysis. Tagging analysis in Fig. 1B was generated by using PLINK
(42) to find tagging SNPs for each SNP that were within 10 kb and had an r²
of at least 0.8. We then simulated genotyping arrays by randomly sampling
subsets of SNPs of varying sizes and calculating the fraction of total SNPs
that are tagged by the subset. We first reduced the sequence data to 40 random
individuals to simulate ascertainment bias against low allele-frequency
markers, then randomly sampled markers that were still polymorphic among
the smaller population size to simulate a genotyping array. We simulated 19
different array sizes, ranging from 5% of the sequenced SNPs (1,227) to 95%
of the sequenced SNPs (22,087). Two hundred simulations per array size
were run and the result was highly consistent: 95% confidence intervals
were too small to visualize on the figure. Simulations for the human genome
were based on 60 diploid individuals of European descent (CEU) from
HapMap release 23a. Each iteration chose 54 random individuals to simulate
ascertainment bias and filtered SNPs to an 80% call rate and to nonsingletons.
Our Affymetrix array was able to tag 5,508 SNPs in our sequence data using
the 4,894 SNPs on the array that overlapped with the 25,757 SNPs in our
sequence data (open triangle in Fig. 1B). Histograms in Fig. 1A are binned
into 20 evenly spaced bins of r² from 0 to 1. The plot is normalized such that
the sum of all bars in each histogram is equal to 1 to show the relative
proportions of SNPs in each bin. Simulation data are provided in ref. 39.
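The array simulation described in this section can be illustrated with a short sketch. It assumes the tagging relationships have already been computed (for example, by PLINK) and are available as a dictionary. The function names are hypothetical, SNPs on the array are assumed to count as tagged, and the ascertainment step (subsampling individuals first) is omitted for brevity.

```python
import random


def fraction_tagged(tags, array_snps):
    """Fraction of all SNPs that are on the array or tagged by an array SNP.

    `tags` maps each SNP id to the set of SNP ids that tag it, e.g. those
    within 10 kb with r^2 >= 0.8 (assumed to be computed beforehand).
    """
    array_snps = set(array_snps)
    covered = sum(1 for snp, partners in tags.items()
                  if snp in array_snps or partners & array_snps)
    return covered / len(tags)


def simulate_arrays(tags, array_fraction, n_sims=200, seed=0):
    """Mean tagged fraction over random arrays of a given relative size."""
    rng = random.Random(seed)
    snps = sorted(tags)
    k = round(array_fraction * len(snps))  # number of SNPs on the array
    results = [fraction_tagged(tags, rng.sample(snps, k))
               for _ in range(n_sims)]
    return sum(results) / n_sims
```

Calling `simulate_arrays(tags, f)` for a series of fractions `f` between 0.05 and 0.95 would produce one curve comparable in spirit to those in Fig. 1B, subject to the simplifications noted above.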
Drug Assays. Drug assays were performed as previously described (43) with
slight modifications for 384-well format (SI Methods). The ranges of drug
concentrations are shown in Fig. S1, and the IC50 data, along with raw input
data for all association tests, are provided in ref. 39.
EMMA. Single-marker association tests were run using EMMA (14). Because
not all drugs have complete phenotype data for all 45 individuals, SNPs were
additionally filtered to those that met our previous call rate and minor allele
criteria among the subset of samples for which drug data exist. This filtering
results in 23,000–25,180 SNPs for any given drug. Log-transformed IC50 values
were used for this quantitative test. Biological replicates of drug data were
presented to EMMA as multiple individuals from the same genetic strain, which
allows EMMA to use the additional data to discern heritable phenotypic
variance from nonheritable variance (15) and mimics the use of clonally
identical parasites in other studies (44, 45). Significance was defined as SNPs
that exceeded a Bonferroni-corrected threshold of P < 0.05 and also survived
60% of jackknife simulations. EMMA results were jackknifed by drawing
200 random subsets of 38 samples and requiring a false-discovery
rate-corrected significance of Q < 0.1. SNPs that passed this threshold in 60%
of jackknife simulations were considered to be robust against false-positives
because of small sample-size effects.
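The jackknife procedure can be sketched as below. This is an illustration under stated assumptions rather than the code used in the study: `test_fn` is a hypothetical caller-supplied function that runs the association test on a subset of samples and returns the SNPs passing the per-subset threshold (Q < 0.1 in the text), and the function names are invented for the example.

```python
import random


def jackknife_support(samples, test_fn, subset_size=38, n_iter=200, seed=0):
    """Fraction of jackknife subsets in which each SNP is called significant.

    `test_fn(subset)` is a hypothetical callable returning the set of SNP ids
    that pass the per-subset significance threshold.
    """
    rng = random.Random(seed)
    hits = {}
    for _ in range(n_iter):
        subset = rng.sample(samples, subset_size)  # sample without replacement
        for snp in test_fn(subset):
            hits[snp] = hits.get(snp, 0) + 1
    return {snp: count / n_iter for snp, count in hits.items()}


def robust_snps(full_hits, support, min_support=0.6):
    """SNPs significant in the full data and in >= min_support of subsets."""
    return {snp for snp in full_hits if support.get(snp, 0.0) >= min_support}
```

Here `full_hits` would be the set of SNPs exceeding the Bonferroni-corrected threshold in the complete data set, so that `robust_snps` applies both criteria named in the text.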
XP-EHH. Selection-association tests were run using the XP-EHH test (10). Each
drug defined a partitioning of samples into two subpopulations (sensitive
and resistant) based on cutoffs shown in Fig. S1 and provided in ref. 39
(SI Methods). XP-EHH requires a recombination map as input, which we
constructed with LDhat v2.1 (46) (SI Methods). XP-EHH also requires fully
imputed genotypes. Imputation was performed using PHASE 2.1.1 (47),
producing 29,605 nonsingleton SNPs (SI Methods).
XP-EHH computes a significance value for each SNP in the genome, assuming
that SNP comprises the haplotype core of selection. Because the
test identifies long haplotypes, it results in a large number of genome-wide
significant SNPs (defined by Bonferroni-corrected P < 0.05) in clustered
stretches of the genome. We reduced the set of significant SNPs to a set of
significant genomic regions by taking each significant core SNP, computing
a window around each one where EHH decayed to 0.05, and merging
overlapping windows. This process resulted in a smaller list of significant
regions for each drug (Dataset S1). Regions were further filtered by
removing those that did not contain at least one core SNP that survived 50%
of jackknife simulations. XP-EHH results were jackknifed by drawing
200 random subsets of 38 samples and requiring a Bonferroni-corrected
significance of P < 0.1.
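The reduction from significant core SNPs to significant regions amounts to an interval merge followed by a filter. The sketch below assumes that the EHH decay window around each core SNP has already been computed as a (start, end) pair on a single chromosome. The function names and data layout are hypothetical, and windows that merely touch are treated as overlapping.

```python
def merge_windows(windows):
    """Merge overlapping (start, end) windows on one chromosome.

    Each window is the interval around a significant core SNP over which
    EHH has not yet decayed to the cutoff; computing it is not shown here.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous region: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def filter_regions(regions, core_support, min_support=0.5):
    """Keep regions containing a core SNP with enough jackknife support.

    `core_support` maps core SNP position -> fraction of jackknife subsets
    in which that SNP remained significant.
    """
    return [(s, e) for s, e in regions
            if any(s <= pos <= e and frac >= min_support
                   for pos, frac in core_support.items())]
```

Sorting the windows first means a single pass suffices, because any window that overlaps an earlier region must overlap the most recently added one.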
Genotyping Arrays. A subset of 25 parasites was also hybridized to an
Affymetrix array containing 74,656 markers (6). SNPs were called using BRLMM-P
from Affy Power Tools v1.10.2 and filtered according to the same methods as
Van Tyne et al. (6), resulting in 15,075 validated SNPs, 8,778 of which were
polymorphic among the 25 individuals from Senegal. SNP coordinates were
converted from PlasmoDB v5.0 coordinates to v7.1 coordinates using
whole-genome nucmer alignments (48). Concordance between array and sequencing
data was measured for the set of markers in which genotype calls existed by
both methods. For 24 samples, nearly perfect concordance between Affymetrix
genotypes and sequence genotypes was observed (averaging 99.2% concordance,
with all 24 samples above 98.2% concordance). This level of concordance is
similar to what is observed with technical replicate hybridizations of the
same DNA sample (6). One sample, SenP19.04.c, reported a 28.2% mismatch rate,
suggestive of a sample identification error, and was removed from the
analysis. EMMA analyses were run on the array data using the same filters and
procedures as for sequence data described above, using 4,514–4,653 SNPs per
drug phenotype. Results are shown in Fig. S5. Array data for these 24 samples
are available from ref. 39.
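The concordance measure described above, restricted to markers called by both platforms, can be written as a short function. This is an illustrative sketch only: the function name and the dictionary representation of genotype calls are assumptions made for the example.

```python
def concordance(array_calls, seq_calls):
    """Fraction of matching genotypes among markers called by both methods.

    Both arguments map marker id -> allele, with None (or a missing key)
    meaning no call. Returns None if no marker is called by both methods.
    """
    shared = [m for m in array_calls
              if array_calls[m] is not None and seq_calls.get(m) is not None]
    if not shared:
        return None
    matches = sum(array_calls[m] == seq_calls[m] for m in shared)
    return matches / len(shared)
```

A sample whose value falls far below the others, as reported for SenP19.04.c, would stand out immediately under this measure.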
ACKNOWLEDGMENTS. We thank the sample collection team in Senegal,
including Younouss Diedhiou, Lamine Ndiaye, Amadou Moctar Mbaye, Baba
Dieye, Moussa Dieng Sarr, Papa Diogoye Sene, and Ngayo Sy; the technical
staff at the Harvard School of Public Health who maintained parasite cultures,
including Kayla Barnes, Dave Rosen, Kate Fernandez, and Gilberto Ramirez;
members of the P.C.S. laboratory for a careful review of our manuscript,
including Kristian Andersen, Chris Edwards, Chris Matranga, Rachel Sealfon,
Jesse Shapiro, Ilya Shlyakhter, Matt Stremlau, and Shervin Tabrizi; and those
who made contributions to the community database that facilitated
biological curation of candidate genes presented in this work. This
study is supported by the Bill and Melinda Gates Foundation; National
Institutes of Health (NIH) Grant 1R01AI075080-01A1; the Ellison Medical
Foundation; the Exxon-Mobil Foundation; the NIH Fogarty International Center;
the National Institute of Allergy and Infectious Diseases, and Broad Scientific
Planning and Allocation of Resources Committee (SPARC); a National Science
Foundation Graduate Research Fellowship (to D.J.P.); and fellowships from the
Burroughs Wellcome and Packard Foundations (to P.C.S.).
1. Murray CJL, et al. (2012) Global malaria mortality between 1980 and 2010: A
systematic analysis. Lancet 379:413–431.
2. malERA Consultative Group on Drugs (2011) A research agenda for malaria
eradication: Drugs. PLoS Med 8:e1000402.
3. Altshuler DM, Daly MJ, Lander ES (2008) Genetic mapping in human disease. Science
4. de Bakker PIW, et al. (2005) Efficiency and power in genetic association studies.
Nat Genet 37:1217–1223.
5. Mu J, et al. (2010) Plasmodium falciparum genome-wide scans for positive selection,
recombination hot spots and resistance to antimalarial drugs. Nat Genet 42:268–271.
6. Van Tyne D, et al. (2011) Identification and functional validation of the novel
antimalarial resistance locus PF10_0355 in Plasmodium falciparum. PLoS Genet 7:
7. Yuan J, et al. (2011) Chemical genomic profiling for antimalarial therapies, response
signatures, and molecular targets. Science 333:724–729.
8. Volkman SK, Neafsey DE, Schaffner SF, Park DJ, Wirth DF (2012) Harnessing genomics
and genome biology to understand malaria biology. Nat Rev Genet 13:315–328.
9. Albrechtsen A, Nielsen FC, Nielsen R (2010) Ascertainment biases in SNP chips affect
measures of population divergence. Mol Biol Evol 27:2534–2547.
10. Sabeti PC, et al.; International HapMap Consortium (2007) Genome-wide detection
and characterization of positive selection in human populations. Nature 449:913–918.
11. Mouzin E, Thior PM, Diouf MB, Sambou B (2010) Focus on Senegal. Progress & impact
series, no. 4 (WHO, Geneva, Switzerland). Available at
12. Oyola SO, et al. (2012) Optimizing Illumina next-generation sequencing library
preparation for extremely AT-biased genomes. BMC Genomics 13:1.
13. Melnikov A, et al. (2011) Hybrid selection for sequencing pathogen genomes from
clinical samples. Genome Biol 12:R73.
14. Kang HM, et al. (2008) Efficient control of population structure in model organism
association mapping. Genetics 178:1709–1723.
15. Price AL, Zaitlen NA, Reich DE, Patterson N (2010) New approaches to population
stratification in genome-wide association studies. Nat Rev Genet 11:459–463.
16. Fidock DA, et al. (2000) Mutations in the P. falciparum digestive vacuole
transmembrane protein PfCRT and evidence for their role in chloroquine resistance.
Mol Cell 6:861–871.
17. Wootton JC, et al. (2002) Genetic diversity and chloroquine selective sweeps in
Plasmodium falciparum. Nature 418:320–323.
18. Duraisingh MT, et al. (2000) The tyrosine-86 allele of the pfmdr1 gene of Plasmodium
falciparum is associated with increased sensitivity to the anti-malarials mefloquine
and artemisinin. Mol Biochem Parasitol 108:13–23.
19. Nkhoma S, et al. (2009) Parasites bearing a single copy of the multi-drug resistance
gene (pfmdr-1) with wild-type SNPs predominate amongst Plasmodium falciparum
isolates from Malawi. Acta Trop 111:78–81.
20. Nair S, et al. (2003) A selective sweep driven by pyrimethamine treatment in
southeast Asian malaria parasites. Mol Biol Evol 20:1526–1536.
21. Kessl JJ, Meshnick SR, Trumpower BL (2007) Modeling the molecular basis of
atovaquone resistance in parasites and pathogenic fungi. Trends Parasitol 23:494–501.
22. Dong CK, et al. (2011) Identification and validation of tetracyclic benzothiazepines as
Plasmodium falciparum cytochrome bc1 inhibitors. Chem Biol 18:1602–1610.
23. Cheeseman IH, et al. (2012) A major genome region underlying artemisinin resistance
in malaria. Science 336:79–82.
24. Kudaravalli S, Veyrieras JB, Stranger BE, Dermitzakis ET, Pritchard JK (2009) Gene
expression levels are a target of recent natural selection in the human genome. Mol
Biol Evol 26:649–658.
25. Venkatesan M, et al. (2012) Using CF11 cellulose columns to inexpensively and
effectively remove human DNA from Plasmodium falciparum-infected whole blood
samples. Malar J 11:41.
26. Laufer MK, et al. (2010) Return of chloroquine-susceptible falciparum malaria in
Malawi was a reexpansion of diverse susceptible parasites. J Infect Dis 202:801–808.
27. Aurrecoechea C, et al. (2009) PlasmoDB: A functional genomic database for malaria
parasites. Nucleic Acids Res 37(Database issue):D539–D543.
28. Leprohon P, Légaré D, Ouellette M (2011) ABC transporters involved in drug
resistance in human parasites. Essays Biochem 50:121–144.
29. Cui L, Miao J (2010) Chromatin-mediated epigenetic regulation in the malaria
parasite Plasmodium falciparum. Eukaryot Cell 9:1138–1149.
30. Coleman BI, Duraisingh MT (2008) Transcriptional control and gene silencing in
Plasmodium falciparum. Cell Microbiol 10:1935–1946.
31. Castellini MA, et al. (2011) Malaria drug resistance is associated with defective DNA
mismatch repair. Mol Biochem Parasitol 177:143–147.
32. Tarique M, Satsangi AT, Ahmad M, Singh S, Tuteja R (2012) Plasmodium falciparum
MLH is schizont stage specific endonuclease. Mol Biochem Parasitol 181:153–161.
33. Meng X, et al. (2012) Cytoplasmic Metadherin (MTDH) provides survival advantage
under conditions of stress by acting as RNA-binding protein. J Biol Chem 287:
34. Ponts N, et al. (2008) Deciphering the ubiquitin-mediated pathway in apicomplexan
parasites: A potential strategy to interfere with parasite virulence. PLoS ONE 3:e2386.
35. Jarosz DF, Lindquist S (2010) Hsp90 and environmental stress transform the adaptive
value of natural genetic variation. Science 330:1820–1824.
36. Trager W, Jensen JB (1976) Human malaria parasites in continuous culture. Science
37. Daniels R, et al. (2008) A general SNP-based molecular barcode for Plasmodium
falciparum identification and tracking. Malar J 7:223.
38. McKenna A, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for
analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303.
39. Broad Institute (2012) Tagging simulation data, drug data, PLINK-formatted input
data for both sequence and array data, recombination maps, imputed genotypes,
GWAS outputs, and R code for generating all figures. Available at ftp://ftp.broadinstitute.
40. Broad Institute (2012) Consensus sequence calls for each of 45 strains and 23 million
bases. VCF file is bgzip compressed and indexed by tabix and vcftools (.tbi and .vcdx
files are also in this directory). Available at
41. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS
Genet 2:e190.
42. Purcell S, et al. (2007) PLINK: a tool set for whole-genome association and
population-based linkage analyses. Am J Hum Genet 81:559–575.
43. Plouffe D, et al. (2008) In silico activity profiling reveals the mechanism of action of
antimalarials discovered in a high-throughput screen. Proc Natl Acad Sci USA 105:
44. Anderson TJC, et al. (2010) Inferred relatedness and heritability in malaria parasites.
Proc Biol Sci 277:2531–2540.
45. Anderson TJC, et al. (2010) High heritability of malaria parasite clearance rate
indicates a genetic basis for artemisinin resistance in western Cambodia. J Infect Dis
46. McVean G, Awadalla P, Fearnhead P (2002) A coalescent-based method for detecting
and estimating recombination from gene sequences. Genetics
47. Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype
reconstruction from population genotype data. Am J Hum Genet 73:1162–1169.
48. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes.
Genome Biol 5:R12.
Park et al. PNAS | August 7, 2012 | vol. 109 | no. 32