Sequence-based association and selection scans identify drug resistance loci in the Plasmodium falciparum malaria parasite
Through rapid genetic adaptation and natural selection, the Plasmodium falciparum parasite--the deadliest of those that cause malaria--is able to develop resistance to antimalarial drugs, thwarting present efforts to control it. Genome-wide association studies (GWAS) provide a critical hypothesis-generating tool for understanding how this occurs. However, in P. falciparum, the limited amount of linkage disequilibrium hinders the power of traditional array-based GWAS. Here, we demonstrate the feasibility and power improvements gained by using whole-genome sequencing for association studies. We analyzed data from 45 Senegalese parasites and identified genetic changes associated with the parasites' in vitro response to 12 different antimalarials. To further increase statistical power, we adapted a common test for natural selection, XP-EHH (cross-population extended haplotype homozygosity), and used it to identify genomic regions associated with resistance to drugs. Using this sequence-based approach and the combination of association and selection-based tests, we detected several loci associated with drug resistance. These loci included the previously known signals at pfcrt, dhfr, and pfmdr1, as well as many genes not previously implicated in drug-resistance roles, including genes in the ubiquitination pathway. Based on the success of the analysis presented in this study, and on the demonstrated shortcomings of array-based approaches, we argue for a complete transition to sequence-based GWAS for small, low linkage-disequilibrium genomes like that of P. falciparum.
Daniel J. Park, Amanda K. Lukens, Daniel E. Neafsey, Stephen F. Schaffner, Hsiao-Han Chang, Clarissa Valim, Daria Van Tyne, Kevin Galinsky, Meghan Galligan, Justin S. Becker, Daouda Ndiaye, Roger C. Wiegand, Daniel L. Hartl, Pardis C. Sabeti, Dyann F. Wirth, and Sarah K. Volkman
Broad Institute of MIT and Harvard, Cambridge, MA 02142;
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138;
Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, MA 02115;
Faculty of Medicine and Pharmacy, Université Cheikh
Anta Diop de Dakar, BP 5005, Dakar Fann, Sénégal; and
School for Nursing and Health Sciences, Simmons College, Boston, MA 02115
Contributed by Daniel L. Hartl, June 21, 2012 (sent for review March 26, 2012)
The malaria parasite Plasmodium falciparum imposes a tremendous
disease burden on human societies and is responsible
for 1.2 million deaths annually (1). Current efforts to eradicate
malaria depend on the continued success of antimalarial drugs
(2); however, the emergence of drug-resistant parasites threatens
to hamper global health efforts to control and eliminate the
disease. Understanding the genetic basis of these adaptations
will be necessary to maintain effective global health policies in
the face of an ever-changing pathogen.
A key to elucidating the genetic basis of drug resistance is
identifying the speciﬁc genes associated with the phenotype. In
human studies of this kind, the genome-wide association study
(GWAS) has overtaken the classic candidate gene approach,
made affordable by the use of genotyping arrays (or SNP arrays)
that measure only a subset of variants in the genome (3). This
optimization is only possible because of the extensive correlation
between genetic markers (linkage disequilibrium or LD) in the
human genome, which allows the subset of SNPs on an array to
act as proxies for other markers not present; this process is
known as “tagging” (4).
In P. falciparum, however, array-based GWAS is severely
limited by the relatively short extent of LD (5–8). Lacking that
correlation between genetic markers, genotyping arrays usually
cannot detect associations with untyped markers, effectively
limiting inferences to markers actually present on the array; even
the highest density P. falciparum array reported to date found
that LD between adjacent markers on the array was too weak for
tagging in African populations (6). Consequently, current
P. falciparum arrays cannot conﬁdently capture all causal var-
iants for important phenotypes.
The rapidly decreasing cost of whole-genome sequencing
offers a promising solution. In principle, working with a whole-
genome sequence allows one to directly assay all mutations
segregating in the population, obviating the detection problems
associated with short LD. Discovering mutations directly also
avoids the ascertainment bias inherent to arrays, bias that is
exacerbated when SNP discovery and genotyping are performed
in different populations (9). Additionally, the small size of the
P. falciparum genome (23 Mb, roughly the size of a human
exome), makes it potentially 100-fold cheaper than whole-
genome sequencing in humans. As malaria sequencing projects
become cost-competitive with genotyping arrays, whole-genome
sequencing has the potential to become the most effective
approach to performing association studies in malaria.
Here, we test the hypothesis that whole-genome sequencing
will identify SNP associations not detected by classic array-based
approaches. We apply this method to identify loci in the P. fal-
ciparum genome that are associated with antimalarial drug re-
sistance and compare the approach to a standard array-based
GWAS. We improve the statistical power of this analysis by
adapting a commonly used selection test, the cross-population
extended haplotype homozygosity (XP-EHH) test (10), and use
it as an association test for positively selected phenotypes. These
approaches identify a number of candidate loci associated with
antimalarial drug resistance, including genes in the ubiquitination
pathway, suggesting that alteration of the parasite's ability to
modulate stress may contribute to evasion of drug pressure and
development of resistance in P. falciparum.
Author contributions: D.J.P., A.K.L., D.E.N., S.F.S., R.C.W., D.L.H., P.C.S., D.F.W., and S.K.V.
designed research; D.V.T., M.G., J.S.B., and S.K.V. performed research; D.N. and S.M.
contributed new reagents/analytic tools; D.J.P., A.K.L., D.E.N., S.F.S., H.-H.C., C.V., K.G.,
and S.K.V. analyzed data; and D.J.P., A.K.L., D.E.N., S.F.S., U.R., D.L.H., P.C.S., D.F.W., and
S.K.V. wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
Data deposition: The SNP data have been deposited at dbSNP, www.ncbi.nlm.nih.gov/projects/SNP
(batch id Pf_0004 from submitter BROAD-GENOMEBIO), are accessible via
the Broad Institute, ftp://ftp.broadinstitute.org/pub/malaria/pnas-park-2012-suppfile-1.zip,
and have also been deposited in PlasmoDB v9.1, http://plasmodb.org/. The consensus
calls for the whole genome are available via the Broad Institute, ftp://ftp.broadinstitute.
To whom correspondence may be addressed. E-mail: email@example.com, dhartl@oeb.harvard.edu,
or firstname.lastname@example.org.
D.L.H., P.C.H., D.F.W., and S.K.V. contributed equally to this work.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.
August 7, 2012
no. 32 www.pnas.org/cgi/doi/10.1073/pnas.1210585109
Forty-Five Parasite Genomes and the Absence of LD. We chose
a population in a West African region near Dakar, Senegal and
culture-adapted 45 P. falciparum parasites recently isolated from
malaria-infected patients. This population is particularly relevant
for these studies because it has recently been exposed to multiple,
changing drug regimens as clinical resistance to traditional drugs
has emerged (11). We obtained whole-genome sequence data and
generated high-quality consensus base calls for an average of 83%
of each genome. This process produces 225,623 segregating SNPs,
of which 25,757 met our call rate and minor allele frequency cri-
teria for further study (see Methods). Sequence-based SNP calling
in P. falciparum is technically challenging because of its extremely
AT-rich genome (12, 13). In light of this finding, we validated our
sequence-based approach against array-based methods by using
a previously described SNP array (6) to genotype 24 of the 45
isolates. Of the 74,656 SNPs assayed by the array, 4,653 meet our
call rate and minor allele frequency criteria. We observe nearly
perfect concordance between Affymetrix genotypes and sequence
genotypes (see Methods).
Our data demonstrate that SNPs in P. falciparum have very
little ability to tag neighboring SNPs because of the short LD in
the African population from which they were sampled. Although
some portions of the genome exhibit signiﬁcant LD, over 62% of
the SNPs in the genome have no LD ($r^2$ < 0.05) between adjacent
SNPs, and 87% of the SNPs have insufficient LD to tag their
neighbor (Fig. 1A) using the criterion derived from human
studies ($r^2$ < 0.8) (4). To measure tagging ability directly, we
simulate genotyping arrays of various sizes by sampling random
subsets of SNPs from our sequence data. We ﬁnd that the sim-
ulated arrays are not able to tag a signiﬁcant portion of unas-
sayed markers, a result in stark contrast to the performance of
human arrays (Fig. 1B). The tagging performance of our own
Affymetrix array (tagging only 22.6% of segregating SNPs in
Senegal) is even lower than simulated arrays of similar size (Fig.
1B), most likely because of population-based ascertainment
biases (9) that were not modeled in our idealized approach.
These ﬁndings lead us to conclude that array-based studies in
P. falciparum will rarely be able to detect signals resulting from
mutations not present on the array.
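The adjacent-SNP LD summary described above can be illustrated with a minimal sketch. It assumes haploid genotypes coded 0/1 per isolate, with missing calls stored as None; the function names and the toy data are hypothetical and are not the pipeline used in this study, which relied on PLINK (see Methods).

```python
from typing import List, Optional, Sequence

Allele = Optional[int]  # 0 or 1 for a called haploid genotype, None if missing


def r_squared(a: Sequence[Allele], b: Sequence[Allele]) -> Optional[float]:
    """Squared allelic correlation between two biallelic haploid SNPs."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    n = len(pairs)
    p_a = sum(x for x, _ in pairs) / n
    p_b = sum(y for _, y in pairs) / n
    p_ab = sum(1 for x, y in pairs if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # one SNP is monomorphic among the shared isolates
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def fraction_below(snps: List[Sequence[Allele]], threshold: float) -> Optional[float]:
    """Fraction of adjacent SNP pairs whose r^2 falls below the threshold."""
    values = [r_squared(snps[i], snps[i + 1]) for i in range(len(snps) - 1)]
    values = [v for v in values if v is not None]
    if not values:
        return None
    return sum(v < threshold for v in values) / len(values)


# Toy data (hypothetical): four SNPs in genomic order, typed in four isolates.
toy_snps = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 0, None],
]
print(fraction_below(toy_snps, 0.8))
```

Applied to real data, calling `fraction_below` with thresholds of 0.05 and 0.8 would give the kind of percentages quoted in the text.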
Sequence-Based GWAS. The goal of these studies is to identify
genomic changes associated with changes in parasite response to
antimalarial drugs, as measured in the set of 45 independent
P. falciparum isolates. We assayed the cultured parasites for
in vitro drug responses (measured by IC$_{50}$) to 12 standard
antimalarials: amodiaquine, artemisinin, atovaquone, chloroquine,
dihydroartemisinin, halofantrine, lumefantrine, mefloquine,
piperaquine, primaquine, pyrimethamine, and quinine. These
antimalarials constitute the 12 phenotypes used in our association
studies (Fig. S1). Not surprisingly, drugs with similar chemical
structures (e.g., halofantrine, lumefantrine, and meﬂoquine)
show a strong correlation in responses (Fig. S2), as has pre-
viously been observed (6, 7), and provide the opportunity for
cross-validation of SNPs identiﬁed in association studies.
To test associations between SNP genotypes and drug re-
sponse, we use efﬁcient mixed-model association (EMMA).
EMMA is a quantitative association approach well-suited for
small sample sizes and partially inbred organisms, such as the
malaria parasite (14). It is a commonly used tool among mixed-
model GWAS approaches (15) and has recently demonstrated
effectiveness with P. falciparum drug studies (6). After correcting
for multiple testing (Bonferroni correction for 25,757 SNPs,
$P < 2 \times 10^{-6}$), EMMA is able to detect a number of previously
known markers of drug resistance, such as four nonsynonymous
SNPs in pfcrt (conferring amino acid changes: N75E/K, K76T,
Q271E, R371I) (16, 17) associated with chloroquine response,
one pfmdr1 SNP (conferring amino acid change: N86Y) (18, 19)
associated with halofantrine, lumefantrine, and meﬂoquine re-
sponse, and three dhfr SNPs (conferring amino acid changes:
N51I, C59R, S108N) (20) associated with pyrimethamine re-
sponse. We note here that, although mitochondrial and apico-
plast genomes were also sequenced, no signiﬁcant associations
were found and the known mitochondrial mutations associated
with atovaquone resistance in cytochrome b (codons 268, 133,
and 280) (21, 22) were ﬁxed in all 45 individuals for the drug-
sensitive alleles. In all, EMMA detects 34 signiﬁcant SNPs as-
sociated with parasite response to ﬁve drugs (Fig. S3). Most of
these SNPs are in or near previously known associations (8), and
[Fig. 1 plot. (A) x-axis: LD to adjacent SNP ($r^2$); y-axis: % of SNPs. (B) x-axis: % of total SNPs sampled; y-axis: % of total SNPs tagged.]
Fig. 1. Simulated P. falciparum arrays are unable to tag SNPs not present on the array. (A) A histogram of LD between adjacent SNPs from sequenced
P. falciparum (black). The vast majority of markers have little to no LD with their neighbors (62% of SNPs have $r^2$ ≤ 0.05, 76% have $r^2$ ≤ 0.2, and 87% have
$r^2$ ≤ 0.8). This finding contrasts with human studies, where much more of the genome shows moderate to strong LD between neighboring SNPs (gray).
(B) Simulated genotyping marker sets of various sizes are plotted against the percentage of the entire sequenced marker set that they are able to tag (with
$r^2$ ≥ 0.8). The dashed identity line depicts the theoretical scenario where all SNPs are in complete linkage equilibrium and no SNP tags another. Because this is
true of 87% of SNPs in the malaria sequence data, the increase is almost linear (black dots). This finding contrasts with the array tagging performance seen in
human studies (gray dots), where only a small fraction of markers are needed to tag the bulk of the genome, a principle upon which the array-based GWAS
depends. The open triangle depicts the actual performance of the Affymetrix-based Broad Institute P. falciparum SNP array (6).
ﬁve are previously unknown associations with pyrimethamine
response (Dataset S1).
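The multiple-testing threshold quoted above follows directly from the Bonferroni rule. The short sketch below reproduces the arithmetic for the 25,757 SNPs tested; the function name is illustrative only.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 25_757)
print(f"{threshold:.2e}")      # 1.94e-06, reported in the text as P < 2 x 10^-6
print(-math.log10(threshold))  # height of the threshold on a -log10(P) axis
```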
Although these sequence-based ﬁndings validate the pre-
viously known relationship between the pfmdr1 gene and parasite
responses to halofantrine, lumefantrine, and meﬂoquine, it is
notable that this association is not detectable by our SNP array
(Fig. 2), as the array lacks any markers in pfmdr1 with a sufficiently
high minor allele frequency. This finding exemplifies the
type of association that can be missed by arrays because of
limited LD. Additionally, the agreement between these three
drugs at this locus provides validation of this result with respect
to structurally related drugs.
Using Haplotype-Based Selection Tests for Association. To test the
hypothesis that drug resistance is largely driven by positive se-
lection, we searched for long haplotypes associated with selec-
tion for drug resistance using the XP-EHH test (10). This
selection test has not previously been used as a GWAS tool, but
it is well suited for this purpose when we presume that the
phenotype we are studying is under positive selection. Although
this assumption is not valid for most human-based GWAS for
noncommunicable diseases, it is very likely to be the case when
studying parasite genomes for resistance adaptations to widely
used drugs, which represent a strong selective pressure. Used in
this way, the XP-EHH test identiﬁes areas in the genome where
resistant parasites show much longer haplotypes than sensitive
parasites, indicative of recent positive selection on the resistant
population. In our data, the test detects a number of signals,
including pfcrt and dhfr, as well as a number of other hits span-
ning a total of 32 genomic regions across 11 drugs (Fig. 3, Fig. S4,
and Dataset S1). Seventeen of these regions are indicative of
selection in the drug-resistant population, whereas 15 are con-
sistent with selection in the drug-sensitive population. With the
exception of the regions containing pfcrt and dhfr, none of these
loci were detected by EMMA alone.
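To make the idea of comparing haplotype lengths between resistant and sensitive parasites concrete, the following is a simplified sketch of an unstandardized XP-EHH-style score. It assumes phased haploid haplotypes coded 0/1, integrates EHH over physical rather than genetic distance, and omits the truncation rules and genome-wide standardization of the published test (10); all names and toy data are hypothetical.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, end):
    """Chance that two randomly drawn haplotypes are identical from core to end (inclusive)."""
    lo, hi = sorted((core, end))
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(len(haplotypes), 2)


def integrated_ehh(haplotypes, positions, core):
    """Trapezoid area under the EHH curve, extending in both directions from the core SNP."""
    start = ehh(haplotypes, core, core)
    area = 0.0
    for step in (1, -1):
        prev_pos, prev_val = positions[core], start
        i = core + step
        while 0 <= i < len(positions):
            val = ehh(haplotypes, core, i)
            area += 0.5 * (prev_val + val) * abs(positions[i] - prev_pos)
            prev_pos, prev_val = positions[i], val
            i += step
    return area


def unstandardized_xpehh(resistant, sensitive, positions, core):
    """Log ratio of integrated EHH; positive values mean longer haplotypes in resistant parasites."""
    return log(integrated_ehh(resistant, positions, core)
               / integrated_ehh(sensitive, positions, core))


# Toy data (hypothetical): four SNPs at 1-kb spacing, four haplotypes per group.
positions = [0, 1000, 2000, 3000]
resistant = [[1, 1, 1, 1]] * 4  # one long shared haplotype
sensitive = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1]]
print(unstandardized_xpehh(resistant, sensitive, positions, core=1))
```

In this toy case the resistant group shares a single haplotype, so the score is positive, which corresponds to the direction interpreted in the text as selection in the drug-resistant population.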
Although this approach does not detect the known pfmdr1 lo-
cus, this is consistent with our expectations because of the nature
of the test. The N86Y mutation in pfmdr1 confers increased sus-
ceptibility (18, 19) to many drugs compared with the wild-type
allele. As such, this SNP would not be an expected candidate for
positive natural selection on a novel variant, the type of selection
XP-EHH is designed to detect. Moreover, the absence of a pfmdr1
signal from the XP-EHH test is consistent with the lack of ﬁndings
in this gene from previous genomic scans for positive selection
based on the relative EHH, iHS (integrated haplotype statistic),
and XP-EHH tests in multiple populations (5, 6, 23).
In searching for long haplotypes, the XP-EHH test typically
identiﬁes a large number of signiﬁcant SNPs in close proximity to
each other. These regions often span many tens of kilobases and
several annotated genes. This result is expected because the
process of positive natural selection increases the prevalence of
both the selected variant as well as of nearby variants, generating
local regions of extended haplotypes. Thus, although XP-EHH
strongly implicates these 32 regions as areas of phenotype-as-
sociated positive selection, by itself it is usually unable to localize
the source of this selection to a speciﬁc gene. We use P values
from EMMA to improve signal localization by identifying the
strongest signals of association within each region. This approach
allows us to suggest a possible gene or mutation as a focus of
phenotype-speciﬁc positive selection for each identiﬁed region
(Dataset S1) and is reminiscent of earlier approaches that in-
tersect selection and association results (23, 24).
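The localization step described above, choosing the strongest EMMA signal inside each XP-EHH region, can be sketched as follows. The data layout (regions as chromosome, start, end; P values keyed by chromosome and position) and every number in the toy input are assumptions made for illustration, apart from the chromosome 12 interval, which is taken from Fig. 4.

```python
def localize(regions, emma_p):
    """For each (chrom, start, end) region, return the position of the SNP with the smallest EMMA P value."""
    best = {}
    for chrom, start, end in regions:
        inside = [(p, pos) for (c, pos), p in emma_p.items()
                  if c == chrom and start <= pos <= end]
        best[(chrom, start, end)] = min(inside)[1] if inside else None
    return best


# Region boundaries from Fig. 4 (1.807 Mb to 1.835 Mb on chromosome 12).
regions = [("chr12", 1_807_000, 1_835_000)]

# Hypothetical SNP positions and P values, for illustration only.
emma_p = {
    ("chr12", 1_810_500): 3e-4,
    ("chr12", 1_821_300): 2e-7,
    ("chr12", 1_900_000): 1e-9,  # stronger, but outside the region
    ("chr6", 1_821_300): 1e-8,   # same coordinate, different chromosome
}
print(localize(regions, emma_p))
```

The EMMA P values are used here only to rank SNPs within a region that XP-EHH has already flagged, so they do not need to reach genome-wide significance on their own.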
A more comprehensive examination of the regions under
drug-associated selection reveals discrete biological pathways
and processes that may be particularly important as mediators of
drug response in P. falciparum (SI Results). The 59 genes in these
32 regions can be functionally classiﬁed as surface molecules or
transporters, genome maintenance or transcriptional regulation,
metabolic enzymes including lipid metabolizers, and members
of the ubiquitin proteasome system. Most surface molecule-
associated mutations and intergenic mutations are localized to
intrachromosomal clusters containing var, riﬁn, and stevor genes,
and a number of genes are found among molecules modulating
ubiquitination, lipid metabolism, or folate metabolism. Members
of these pathways are also represented in the large region of
pyrimethamine-speciﬁc selection on chromosome 6, where it is
difﬁcult to localize the focus of selection. Collectively, these
ﬁndings argue that certain biological processes in general, and
genes in the ubiquitination and lipid metabolism pathways in
particular, play important roles in modulating drug responses in P. falciparum.
Complete genome sequencing provides many advantages over
array-based genotyping for association studies. These advantages
include the ability to directly type the causal allele, the increased
detection power from increased marker density, and the ability
to overcome ascertainment biases that arise when studying dif-
ferent populations with a ﬁxed marker set. In P. falciparum, the
lack of tagging ability because of the near absence of long-range
[Fig. 2 plot: three stacked panels sharing the x-axis, chr 5 position (Mb), from 0.0 to 1.2.]
Fig. 2. Mefloquine association signals around the known drug resistance
locus pfmdr1. EMMA results are shown for all of chromosome 5 with P values
for each SNP on a $-\log_{10}$ scale against physical position. The array-based
study (Array 24) does not detect any association at the known pfmdr1 locus
because it lacks both marker coverage within the gene and sufficient LD
around the gene. The sequence-based study with the same 24 samples (Seq
24) detects the expected hit at 0.96 Mb. Including all samples from the
sequence-based study (Seq 45) increases the strength of this signal. The
dashed line indicates the Bonferroni-corrected significance threshold (P =
0.05; genome-wide SNP counts are 7,068, 17,278, and 25,159, respectively).
Fig. 3. Signiﬁcant signals of drug-associated selection across ﬁve antima-
larial drugs. XP-EHH results are shown using a Manhattan-inspired plot, with
SNP z-scores plotted against genomic position, with each chromosome col-
ored separately. Positive z-scores suggest selection in drug-resistant para-
sites, negative z-scores suggest selection in sensitive parasites. The dashed
lines indicate the two-sided Bonferroni signiﬁcance thresholds (P = 0.025 and
0.975). Only drugs with signiﬁcant hits are shown here; z-score and quantile-
quantile plots for all drugs are shown in Fig. S4.
LD limits the utility of arrays for association studies. Further-
more, the small genome size of P. falciparum brings the cost of
whole-genome sequencing to approximate parity with traditional
genotyping arrays, and recent advances in pathogen-speciﬁc
DNA-enrichment and host-speciﬁc DNA-depletion techniques
for clinical samples make the sequence-based GWAS approach
more accessible and cost-effective than ever before (13, 25).
We introduce a selection-association approach based on the
XP-EHH selection test. Although this approach may not be
appropriate for many association studies, it is sensible when the
phenotype under study is under strong selection, which is likely
the case for drug resistance in pathogens. As a haplotype-based
test that takes advantage of multiple, adjacent SNPs, it has the
advantage of being more sensitive than single-marker approaches
like EMMA, given the same sample size (4). In addition to de-
tecting new signals of drug-associated selection, we also ﬁnd that
the directional nature of the test statistic, a z-score, provides
useful information about whether the selection is associated with
drug sensitivity or resistance. Consequently, we also introduce an
alternative visualization of the output: a Manhattan-like plot of
z-scores, instead of $-\log_{10}$ P values, to illustrate the directionality
of the signals (Fig. 3). In our data, we observed a tendency for
many drugs (artemisinin, dihydroartemisinin, primaquine,
halofantrine, lumefantrine, and mefloquine) to show highly significant
signals of selection for drug sensitivity at pfcrt, the gene
known to be responsible for chloroquine resistance (Fig. S4).
Although, in principle, this type of signal may result from se-
lection toward drug sensitivity, in this particular case it most
likely results from the general pattern of anticorrelation between
chloroquine and these six other drugs (Fig. S2). Additionally, the
absence of a signiﬁcant chloroquine sensitivity signal at pfcrt is
consistent with reports that the return of chloroquine-sensitive
parasites in Africa did not result from a classic selective sweep
(26). In either case, the Manhattan-like z-score plots allow us to
note the presence of these drug-sensitivity signals while keeping
them visually separate from the drug resistance signals on which
we wish to focus.
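The directional reading of the z-score plot can be sketched in a few lines. The example assumes the XP-EHH scores are standardized so that they can be compared against a standard normal distribution, and that a two-sided Bonferroni correction is applied across all SNPs; the function names are illustrative.

```python
from statistics import NormalDist


def z_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided Bonferroni cutoff on |z| for n_tests standard-normal scores."""
    per_tail = (alpha / 2) / n_tests
    return NormalDist().inv_cdf(1 - per_tail)


def classify(z: float, cutoff: float) -> str:
    """Interpret the sign of a significant score, following the convention of Fig. 3."""
    if z >= cutoff:
        return "selection in drug-resistant parasites"
    if z <= -cutoff:
        return "selection in drug-sensitive parasites"
    return "not significant"


cutoff = z_threshold(25_757)
for z in (6.1, -5.4, 2.0):  # hypothetical scores
    print(z, classify(z, cutoff))
```

Keeping the sign of the statistic, rather than collapsing it into a P value, is what allows the two classes of signal to be drawn on opposite sides of the same axis.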
Our approaches identify a signiﬁcant number of loci associated
with changes in drug response (Dataset S1). The strongest of these
loci contain previously known mediators of resistance, such as the
mutations in pfcrt, pfmdr1, and dhfr. Curation of our remaining
results using a variety of gene and protein prediction algorithms
and literature searches (27) points to several cellular processes and
pathways of potential interest, including the ubiquitin proteasome
system, lipid metabolism, and folate metabolism (Dataset S1). We
argue that these ﬁndings point to biological processes used by the
parasite to survive drug pressure or circumvent the action of an-
timalarial compounds. Other genes of interest include those
encoding three ABC transporters—a class of transporters known
to modulate drug responses in other organisms (28)—and genes
proposed to modulate chromatin (29, 30), DNA repair (31, 32), or
RNA binding (33), pathways that have been shown to potentially
be altered in response to drug pressure.
A number of the signals of recent positive selection are unique
to pyrimethamine-resistant parasites. Although the known re-
sistance locus, dhfr, is present among these, there are even
stronger signals of pyrimethamine-associated selection on chro-
mosome 6 and chromosome 12. The region on chromosome 6
contains two previously uncharacterized genes proposed to
participate in folate metabolism (PFF1360w and PFF1490w), as
well as ﬁve genes encoding proteins acting as either chaperones or in
ubiquitination (PFF1365c, PFF1485w, PFF1445c, PFF1415c, and
PFF1505w), and three genes encoding molecules likely to modulate
lipid metabolism (PFF1350c, PFF1375c-a/b, and PFF1420w). In the
chromosome 12 region, the XP-EHH test produces signiﬁcant P
values for eight SNPs over a 15-kb region spanning ﬁve adjacent
genes. The extended haplotypes surrounding these SNPs continue
even further, spanning 28 kb and 14 genes in total (Fig. 4A). These
results present challenges for experimental validation, as the goal of
association studies is to generate a small number of testable
hypotheses about molecular mechanisms. Fortunately, the use of
EMMA P values in this region can assist in localizing the signal. We
ﬁnd that the strongest EMMA SNP coincides with the strongest XP-
EHH SNP, which is a nonsynonymous mutation in PFL2100w,
a putative ubiquitin-conjugating enzyme (E2) (Fig. 4B). Addition-
ally, a signiﬁcant, pyrimethamine-speciﬁc selection signal on chro-
mosome 8 is entirely contained within MAL8P1.23 [a putative
Fig. 4. Localizing the pyrimethamine-associated selection signal on
chromosome 12. (A) Defining the region: XP-EHH identifies eight genome-wide
signiﬁcant SNPs in close proximity on chromosome 12. Each of these eight
SNPs represents the center of an area of extended haplotype homozygosity,
as measured by the EHH statistic. Haplotype decay for resistant parasites is
plotted for each of these eight SNPs, which deﬁnes a larger region from
1.807 Mb to 1.835 Mb in which the causal mutation may exist. This region
spans 28 kb and 14 genes. (B) Localizing the signal: focusing within this re-
gion, we use single-marker association signals from EMMA to localize the
signal. The most signiﬁcant EMMA SNP coincides with the most signiﬁcant
XP-EHH SNP and localizes to an E398D amino acid change in PFL2100w
(ubiquitin conjugating enzyme E2).
HECT (homologous to the E6-AP carboxyl terminus) ubiquitin li-
gase E3] (Dataset S1), another gene in the ubiquitin-mediated
pathway (34). Given the role of this pathway in directing protein
degradation and recycling, it is possible that alterations in these
genes create changes in stress responses or protein turnover of key
resistance modulators that allow the parasite to survive under drug pressure.
The evolution of drug resistance in the natural setting is likely
to be a multistep process and our work potentially identiﬁes key
pathways involved in this process. Field-based evidence has
demonstrated a reduced ﬁtness for drug-resistant parasites in the
absence of drug pressure, and laboratory-based work has dem-
onstrated the relative ﬁtness of different mutational changes in
target enzymes. Our ﬁndings point to potential compensatory
mutations in a pathway related to protein stability and turnover,
and it is tempting to speculate that such adaptations enable the
“expression” of a resistant phenotype, such as has been observed
in yeast (35). Although molecular approaches are required to
validate the role of this pathway in modulating drug response,
these results demonstrate the potential for sequence-based
GWAS approaches to identify pathways, in addition to individual
genes, that may be responsible for the phenotype of interest.
Ultimately, all association results require experimental valida-
tion and follow-up work to explore possible mechanisms of action.
Association studies, even in their ideal form, simply generate hy-
potheses based on correlations. However, improved methods for
association studies can signiﬁcantly reduce the necessary validation
work by reducing false-positive rates, increasing study-detection
power, and improving localization ability. This study successfully
pilots the use of whole-genome sequence data for association
studies in malaria and demonstrates signiﬁcant advantages in de-
tection power over array-based studies. We strongly recommend
that future association studies in low-LD, small-genome organisms
adopt the sequence-based GWAS approach as well, given the
relative costs. We additionally demonstrate the effectiveness of the
XP-EHH selection test as an association test for phenotypes under
positive selection. Finally, we combine data from both tests to lo-
calize long signals and reduce the number of hypotheses for follow-
up validation. This combined approach identiﬁes more candidate
loci than with single-marker tests alone.
Materials and Methods
Sequencing. Parasites were obtained from patients with uncomplicated mild
malaria in Senegal from 2001 to 2009 under ethical approval from the In-
stitutional Review Board at the Harvard School of Public Health under
protocol #16330-106 with informed consent for the study. Parasites were
culture-adapted by standard methods (36) and genomic DNA was extracted
from 45 single-clone samples. Samples were determined to be monoclonal
and genetically distinct by a 24 SNP molecular barcode (37). Genomic DNA
was sequenced using Illumina Hi-Seq machines. The ﬁrst 12 parasites were
sequenced with 76-bp single-end reads and the remaining 33 were se-
quenced with paired-end reads ranging from 76 bp to 101 bp in length. The
median sequence coverage depth was 144.8× after alignment (ranging from
32× to 400×). Reads were aligned with the Burrows-Wheeler Aligner (BWA)
v0.5.9-r16 against the 3D7 reference assembly (PlasmoDB v7.1). A consensus
sequence was called for each strain using the GATK Uniﬁed Genotyper
v1.2.3-g61b89e2 (38) with the following parameters: -A AlleleBalance
-stand_emit_conf 0 --output_mode EMIT_ALL_SITES. Bases were then removed
if they exhibited poor quality (GQ less than 30 or QUAL less than 60) or if
they called a heterozygous genotype. This process left consensus calls for 56–
91% of the genome (83% median) for each of 45 individuals. Of these sites,
225,623 positions are polymorphic among the 45 individuals. Of these SNPs,
only 25,757 had genotypes in at least 36 individuals (80% call rate) and were
nonsingletons (i.e., minor allele count > 1 or minor allele frequency > 4%).
All analyses are based on this set of 25,757 SNPs. SNP data are available in
dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/) as batch Pf_0004 from
submitter BROAD-GENOMEBIO. SNPs have been deposited at PlasmoDB (27)
v9.1 to allow easy searching and visualization in combination with other
malaria genomic data sets. SNP data can also be found in ref. 39. Consensus
calls for the whole genome are available in ref. 40.
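The site-level filter described above (80% call rate, nonsingleton minor allele) can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the function name `filter_snps`, the genotype encoding (one haploid allele string per sample, `None` for a call removed by quality filtering), and the site labels are assumptions introduced here.

```python
from collections import Counter

def filter_snps(genotypes, min_call_rate=0.8, min_minor_count=2):
    """Keep sites with enough called samples and a nonsingleton minor allele.

    genotypes: dict mapping a site label to a list of haploid calls, one per
    sample, with None marking a missing call (hypothetical encoding).
    """
    kept = []
    for site, calls in genotypes.items():
        called = [c for c in calls if c is not None]
        if len(called) / len(calls) < min_call_rate:
            continue  # fails the call-rate criterion
        counts = Counter(called)
        if len(counts) < 2:
            continue  # monomorphic among the called samples
        # Minor allele count = count of the second most common allele.
        minor_count = counts.most_common()[1][1]
        if minor_count >= min_minor_count:
            kept.append(site)
    return kept
```

With 45 samples, `min_call_rate=0.8` corresponds to genotypes in at least 36 individuals, and `min_minor_count=2` corresponds to a minor allele count greater than 1.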
Principal component analysis was conducted using the program
SMARTPCA (41) in the EIGENSOFT 3.0 package. We applied a local LD cor-
rection (nsnpldregress = 2) and found no signiﬁcant eigenvectors in
Tagging Analysis. Tagging analysis in Fig. 1B was generated by using PLINK
(42) to find tagging SNPs for each SNP that were within 10 kb and had r²
≥ 0.8. We then simulated genotyping arrays by randomly sampling subsets
of SNPs of varying subset sizes and calculating the fraction of total SNPs that
are tagged by the subset. We ﬁrst reduced the sequence data to 40 random
individuals to simulate ascertainment bias against low allele-frequency
markers, then randomly sampled markers that were still polymorphic among
the smaller population size to simulate a genotyping array. We simulated 19
different array sizes, ranging from 5% of the sequenced SNPs (1,227) to 95%
of the sequenced SNPs (22,087). Two-hundred simulations per array size
were run and the result was highly consistent: 95% conﬁdence intervals
were too small to visualize on the ﬁgure. Simulations for the human ge-
nome were based on 60 diploid individuals of European descent (CEU) from
HapMap release 23a. Each iteration chose 54 random individuals to simulate
ascertainment bias and filtered SNPs to an 80% call rate and to nonsingletons.
Our Affymetrix array was able to tag 5,508 SNPs in our sequence data using
the 4,894 SNPs on the array that overlapped with the 25,757 SNPs in our
sequence data (open triangle in Fig. 1B). Histograms in Fig. 1A are binned
into 20 evenly spaced bins of r² from 0 to 1. The plot is normalized such that
the sum of all bars in each histogram is equal to 1 to show the relative
proportions of SNPs in each bin. Simulation data are provided at ftp://ftp.
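The array-simulation step of the tagging analysis can be sketched as below. The sketch assumes the tagging relationships have already been computed (for example, pairs within 10 kb that meet the LD threshold) and omits the down-sampling of individuals used to mimic ascertainment bias. The names `fraction_tagged` and `simulate`, and the layout of the `tags` dictionary, are hypothetical.

```python
import random

def fraction_tagged(tags, array_size, rng):
    """Fraction of all SNPs covered by a random 'array' of array_size SNPs.

    tags: dict mapping each SNP to the set of SNPs it tags (assumed to be
    precomputed, with every tagged SNP also present as a key). A SNP on the
    array always counts as covering itself.
    """
    snps = list(tags)
    array = rng.sample(snps, array_size)
    covered = set(array)
    for snp in array:
        covered |= tags[snp]
    return len(covered) / len(snps)

def simulate(tags, array_size, n_sims=200, seed=0):
    """Mean covered fraction over n_sims random arrays of one size."""
    rng = random.Random(seed)
    results = [fraction_tagged(tags, array_size, rng) for _ in range(n_sims)]
    return sum(results) / n_sims
```

Calling `simulate` once per array size gives one curve of the kind plotted against array size; the default of 200 simulations matches the number described above.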
Drug Assays. Drug assays were performed as previously described (43) with
slight modifications for 384-well format (SI Methods). The range of drug
concentrations is shown in Fig. S1, and the IC50 data, along with raw input
data for all association tests, are provided in ref. 39.
EMMA. Single marker association tests were run using EMMA (14). Because
not all drugs have complete phenotype data for all 45 individuals, SNPs are
additionally ﬁltered to those that met our previous call rate and minor allele
criteria among the subset of samples for which drug data exist. This filtering
results in 23,000–25,180 SNPs for any given drug. Log(IC50) values were
used for this quantitative test. Biological replicates of drug data were pre-
sented to EMMA as multiple individuals from the same genetic strain, which
allows EMMA to use the additional data to discern heritable phenotypic
variance from nonheritable variance (15) and mimics the use of clonally
identical parasites in other studies (44, 45). Significance was defined as SNPs
that exceeded a Bonferroni-corrected threshold of P < 0.05 and also survived
60% of jackknife simulations. EMMA results were jackknifed by drawing
200 random subsets of 38 samples and requiring a false-discovery rate-corrected
significance of Q < 0.1. SNPs that passed this threshold in 60% of
jackknife simulations were considered to be robust against false-positives
because of small sample-size effects.
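The two-part significance rule (a Bonferroni threshold on the full data plus jackknife support) can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: `passing_snps` is a hypothetical callback that stands in for rerunning the association test on a subset of samples and returning the SNPs that pass the per-subset threshold, and the function names are introduced here.

```python
import random
from collections import Counter

def jackknife_support(samples, passing_snps, n_subsets=200, subset_size=38, seed=0):
    """Fraction of random subsets in which each SNP passes.

    passing_snps: hypothetical callable taking a list of samples and
    returning the set of SNPs significant in that subset.
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_subsets):
        subset = rng.sample(samples, subset_size)  # sampling without replacement
        hits.update(passing_snps(subset))
    return {snp: n / n_subsets for snp, n in hits.items()}

def robust_hits(full_pvalues, support, n_tests, alpha=0.05, min_support=0.6):
    """SNPs below the Bonferroni threshold that also have enough support."""
    threshold = alpha / n_tests  # Bonferroni correction
    return sorted(
        snp for snp, p in full_pvalues.items()
        if p < threshold and support.get(snp, 0.0) >= min_support
    )
```

With roughly 25,000 SNPs tested, the Bonferroni threshold at alpha = 0.05 is about 2 × 10⁻⁶ per SNP.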
XP-EHH. Selection-association tests were run using the XP-EHH test (10). Each
drug defined a partitioning of samples into two “subpopulations” (“sensitive”
and “resistant”) based on cutoffs shown in Fig. S1 and provided at
(SI Methods). XP-EHH requires a recombination map as input, which we
constructed with LDhat v2.1 (46) (SI Methods). XP-EHH also requires fully
imputed genotypes. Imputation was performed using PHASE 2.1.1 (47),
producing 29,605 nonsingleton SNPs (SI Methods).
XP-EHH computes a signiﬁcance value for each SNP in the genome, as-
suming that SNP comprises the haplotype “core” of selection. Because the
test identiﬁes long haplotypes, it results in a large number of genome-wide
signiﬁcant SNPs (deﬁned by Bonferroni-corrected P < 0.05) in clustered
stretches of the genome. We reduced the set of signiﬁcant SNPs to a set of
signiﬁcant genomic regions by taking each signiﬁcant core SNP, computing
a window around each one where EHH decayed to 0.05, and merging
overlapping windows. This process resulted in a smaller list of signiﬁcant
regions for each drug (Dataset S1). Regions were further ﬁltered by re-
moving those which did not contain at least one core SNP that survived 50%
of jackknife simulations. XP-EHH results were jackknifed by performing
200 random subsets of 38 samples and requiring a Bonferroni-corrected
signiﬁcance of P < 0.1.
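The reduction of significant core SNPs to significant regions amounts to merging overlapping intervals. The sketch below assumes the window around each core SNP (where EHH decays to 0.05) has already been computed and is given as a `(chromosome, start, end)` tuple; the function name `merge_windows` and the tuple layout are assumptions made for illustration.

```python
def merge_windows(windows):
    """Merge overlapping (chrom, start, end) windows into regions.

    Windows on the same chromosome are merged when the next window starts
    at or before the end of the current region.
    """
    merged = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Sorting first means each window needs to be compared only with the most recent region, so a cluster of many significant core SNPs collapses into a single entry.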
Genotyping Arrays. A subset of 25 parasites was also hybridized to an Affy-
metrix array containing 74,656 markers (6). SNPs were called using BRLMM-P
from Affy Power Tools v1.10.2 and ﬁltered according to the same methods as
Van Tyne, et al. (6), resulting in 15,075 validated SNPs, 8,778 of which were
polymorphic among the 25 individuals from Senegal. SNP coordinates were
converted from PlasmoDB v5.0 coordinates to v7.1 coordinates using whole
genome nucmer alignments (48). Concordance between array and sequencing
data was measured for the set of markers in which genotype calls existed by
both methods. Nearly perfect concordance between Affymetrix genotypes and
sequence genotypes was observed for 24 samples (averaging 99.2%
concordance, with all 24 samples above 98.2% concordance).
This level of concordance is similar to what is observed with technical
replicate hybridizations of the same DNA sample (6). One sample, SenP19.04.c,
reported a 28.2% mismatch rate, suggestive of a sample identiﬁcation error,
and was removed from the analysis. EMMA analyses were run on the array
data using the same ﬁlters and procedures as for sequence data described
above, using 4,514–4,653 SNPs per drug phenotype. Results are shown in
Fig. S5. Array data for these 24 samples are available from ref. 39.
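The per-sample concordance measure described above can be sketched as follows, restricted to markers with a genotype call from both methods. The function name `concordance` and the dictionary encoding (marker to allele, `None` for no call) are assumptions introduced for this example.

```python
def concordance(array_calls, seq_calls):
    """Fraction of identical calls over markers called by both methods.

    Returns None when no marker has a call in both inputs.
    """
    shared = [
        m for m in array_calls
        if m in seq_calls
        and array_calls[m] is not None
        and seq_calls[m] is not None
    ]
    if not shared:
        return None
    matches = sum(array_calls[m] == seq_calls[m] for m in shared)
    return matches / len(shared)
```

The mismatch rate for a sample is one minus this value, which is the quantity that would expose a likely sample identification error.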
ACKNOWLEDGMENTS. We thank the sample collection team in Senegal,
including Younouss Diedhiou, Lamine Ndiaye, Amadou Moctar Mbaye, Baba
Dieye, Moussa Dieng Sarr, Papa Diogoye Sene, and Ngayo Sy; the technical
staff at the Harvard School of Public Health who maintained parasite cultures,
including Kayla Barnes, Dave Rosen, Kate Fernandez, and Gilberto Ramirez;
members of the P.C.S. laboratory for a careful review of our manuscript,
including Kristian Andersen, Chris Edwards, Chris Matranga, Rachel Sealfon,
Jesse Shapiro, Ilya Shlyakhter, Matt Stremlau, and Shervin Tabrizi; and those
who made contributions to the community database, PlasmoDB.org, that
facilitated biological curation of candidate genes presented in this work. This
study is supported by the Bill and Melinda Gates Foundation; National Insti-
tutes of Health (NIH) Grant 1R01AI075080-01A1; the Ellison Medical Founda-
tion; the Exxon-Mobil Foundation; the NIH Fogarty International Center; the
National Institute of Allergy and Infectious Diseases, and Broad Scientific
Planning and Allocation of Resources Committee (SPARC); a National Science
Foundation Graduate Research Fellowship (to D.J.P.); and fellowships from the
Burroughs Wellcome and Packard Foundations (to P.C.S.).
1. Murray CJL, et al. (2012) Global malaria mortality between 1980 and 2010: A sys-
tematic analysis. Lancet 379:413–431.
2. malERA Consultative Group on Drugs (2011) A research agenda for malaria eradica-
tion: Drugs. PLoS Med 8:e1000402.
3. Altshuler DM, Daly MJ, Lander ES (2008) Genetic mapping in human disease. Science
4. de Bakker PIW, et al. (2005) Efﬁciency and power in genetic association studies. Nat
5. Mu J, et al. (2010) Plasmodium falciparum genome-wide scans for positive selection,
recombination hot spots and resistance to antimalarial drugs. Nat Genet 42:268–271.
6. Van Tyne D, et al. (2011) Identiﬁcation and functional validation of the novel anti-
malarial resistance locus PF10_0355 in Plasmodium falciparum. PLoS Genet 7:
7. Yuan J, et al. (2011) Chemical genomic proﬁling for antimalarial therapies, response
signatures, and molecular targets. Science 333:724–729.
8. Volkman SK, Neafsey DE, Schaffner SF, Park DJ, Wirth DF (2012) Harnessing genomics
and genome biology to understand malaria biology. Nat Rev Genet 13:315–328.
9. Albrechtsen A, Nielsen FC, Nielsen R (2010) Ascertainment biases in SNP chips affect
measures of population divergence. Mol Biol Evol 27:2534–2547.
10. Sabeti PC, et al.; International HapMap Consortium (2007) Genome-wide detection
and characterization of positive selection in human populations. Nature 449:913–918.
11. Mouzin E, Thior PM, Diouf MB, Sambou B (2010) Focus on Senegal. Progress & impact
series, no. 4, (WHO, Geneva, Switzerland). Available at http://www.path.org/publications/
12. Oyola SO, et al. (2012) Optimizing Illumina next-generation sequencing library
preparation for extremely AT-biased genomes. BMC Genomics 13:1.
13. Melnikov A, et al. (2011) Hybrid selection for sequencing pathogen genomes from
clinical samples. Genome Biol 12:R73.
14. Kang HM, et al. (2008) Efﬁcient control of population structure in model organism
association mapping. Genetics 178:1709–1723.
15. Price AL, Zaitlen NA, Reich DE, Patterson N (2010) New approaches to population
stratiﬁcation in genome-wide association studies. Nat Rev Genet 11:459–463.
16. Fidock DA, et al. (2000) Mutations in the P. falciparum digestive vacuole trans-
membrane protein PfCRT and evidence for their role in chloroquine resistance. Mol
17. Wootton JC, et al. (2002) Genetic diversity and chloroquine selective sweeps in Plas-
modium falciparum. Nature 418:320–323.
18. Duraisingh MT, et al. (2000) The tyrosine-86 allele of the pfmdr1 gene of Plasmodium
falciparum is associated with increased sensitivity to the anti-malarials meﬂoquine
and artemisinin. Mol Biochem Parasitol 108:13–23.
19. Nkhoma S, et al. (2009) Parasites bearing a single copy of the multi-drug resistance
gene (pfmdr-1) with wild-type SNPs predominate amongst Plasmodium falciparum
isolates from Malawi. Acta Trop 111:78–81.
20. Nair S, et al. (2003) A selective sweep driven by pyrimethamine treatment in south-
east Asian malaria parasites. Mol Biol Evol 20:1526–1536.
21. Kessl JJ, Meshnick SR, Trumpower BL (2007) Modeling the molecular basis of atova-
quone resistance in parasites and pathogenic fungi. Trends Parasitol 23:494–501.
22. Dong CK, et al. (2011) Identiﬁcation and validation of tetracyclic benzothiazepines as
Plasmodium falciparum cytochrome bc1 inhibitors. Chem Biol 18:1602–1610.
23. Cheeseman IH, et al. (2012) A major genome region underlying artemisinin resistance
in malaria. Science 336:79–82.
24. Kudaravalli S, Veyrieras JB, Stranger BE, Dermitzakis ET, Pritchard JK (2009) Gene
expression levels are a target of recent natural selection in the human genome. Mol
Biol Evol 26:649–658.
25. Venkatesan M, et al. (2012) Using CF11 cellulose columns to inexpensively and ef-
fectively remove human DNA from Plasmodium falciparum-infected whole blood
samples. Malar J 11:41.
26. Laufer MK, et al. (2010) Return of chloroquine-susceptible falciparum malaria in
Malawi was a reexpansion of diverse susceptible parasites. J Infect Dis 202:801–808.
27. Aurrecoechea C, et al. (2009) PlasmoDB: A functional genomic database for malaria
parasites. Nucleic Acids Res 37(Database issue):D539–D543.
28. Leprohon P, Légaré D, Ouellette M (2011) ABC transporters involved in drug
resistance in human parasites. Essays Biochem 50:121–144.
29. Cui L, Miao J (2010) Chromatin-mediated epigenetic regulation in the malaria para-
site Plasmodium falciparum. Eukaryot Cell 9:1138–1149.
30. Coleman BI, Duraisingh MT (2008) Transcriptional control and gene silencing in
Plasmodium falciparum. Cell Microbiol 10:1935–1946.
31. Castellini MA, et al. (2011) Malaria drug resistance is associated with defective DNA
Mol Biochem Parasitol 177:143–147.
32. Tarique M, Satsangi AT, Ahmad M, Singh S, Tuteja R (2012) Plasmodium falciparum
MLH is schizont stage speciﬁc endonuclease. Mol Biochem Parasitol 181:153–161.
33. Meng X, et al. (2012) Cytoplasmic Metadherin (MTDH) provides survival advantage
under conditions of stress by acting as RNA-binding protein. J Biol Chem 287:
34. Ponts N, et al. (2008) Deciphering the ubiquitin-mediated pathway in apicomplexan
parasites: A potential strategy to interfere with parasite virulence. PLoS ONE 3:e2386.
35. Jarosz DF, Lindquist S (2010) Hsp90 and environmental stress transform the adaptive
value of natural genetic variation. Science 330:1820–1824.
36. Trager W, Jensen JB (1976) Human malaria parasites in continuous culture. Science
37. Daniels R, et al. (2008) A general SNP-based molecular barcode for Plasmodium fal-
ciparum identiﬁcation and tracking. Malar J 7:223.
38. McKenna A, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for
analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303.
39. Broad Institute (2012) Tagging simulation data, drug data, PLINK-formatted input
data for both sequence and array data, recombination maps, imputed genotypes,
GWAS outputs, and R code for generating all ﬁgures. Available at ftp://ftp.broadinstitute.
40. Broad Institute (2012) Consensus sequence calls for each of 45 strains and 23 million
bases. VCF ﬁle is bgzip compressed and indexed by tabix and vcftools (.tbi and .vcﬁdx
ﬁles are also in this directory). Available at ftp://ftp.broadinstitute.org/pub/malaria/
41. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS
42. Purcell S, et al. (2007) PLINK: a tool set for whole-genome association and population-
based linkage analyses. Am J Hum Genet 81:559–575.
43. Plouffe D, et al. (2008) In silico activity proﬁling reveals the mechanism of action of
antimalarials discovered in a high-throughput screen. Proc Natl Acad Sci USA 105:
44. Anderson TJC, et al. (2010) Inferred relatedness and heritability in malaria parasites.
Proc Biol Sci 277:2531–2540.
45. Anderson TJC, et al. (2010) High heritability of malaria parasite clearance rate in-
dicates a genetic basis for artemisinin resistance in western Cambodia. J Infect Dis
46. McVean G, Awadalla P, Fearnhead P (2002) A coalescent-based method for detecting
and estimating recombination from gene sequences. Genetics
47. Stephens M, Donnelly P (2003) A comparison of bayesian methods for haplotype
reconstruction from population genotype data. Am J Hum Genet 73:1162–1169.
48. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes.
Genome Biol 5:R12.
Park et al. PNAS
August 7, 2012