Pathway-based genetic analysis of preterm birth
Alper Uzun a,b,c, Andrew T. Dewan d, Sorin Istrail c, James F. Padbury a,b,⁎
a Women and Infants Hospital of Rhode Island, Department of Pediatrics, Providence, RI, USA
b Brown Alpert Medical School, Providence, RI, USA
c Brown University, Center for Computational Molecular Biology, Providence, RI, USA
d Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven, CT, USA
Article info
Received 28 August 2012
Accepted 25 December 2012
Available online 6 January 2013
Abstract
The incidence of preterm birth in the United States is now 12%. Multiple genes, gene networks, and variants have been
associated with this disease. Using a custom database for preterm birth (dbPTB) with a refined set of genes
extensively curated from the literature and biological databases, we analyzed GWAS data on preterm birth with complete
genotype data on nearly 2000 preterm and term mothers. We used both the curated genes and a
genome-wide approach to carry out a pathway-based analysis. Nineteen significant pathways that
withstood FDR correction for multiple testing were identified using both the curated genes and the
genome-wide approach. The analysis based on the curated genes was more significant than the genome-wide analysis
in 15 of the 19 pathways. This approach demonstrates the use of a validated set of genes, in the analysis of
otherwise unsuccessful GWAS data, to identify gene–gene interactions in a way that enhances statistical
power and discovery.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
Genome-wide association studies (GWAS) enable investigation of
the genetic associations underlying complex diseases without a priori
hypotheses [1,2]. Advances in high-throughput genotyping, sequenc-
ing technology and developments in computational power have
enhanced the feasibility of large case-controlled studies and reduced
costs . Since they have the potential of identifying novel genetic
variants, GWAS have become a popular approach to the investigation
of complex diseases. By the second quarter of 2011, there were 1449
reports in the Catalog of Published Genome-Wide Association Studies
(http://www.genome.gov/gwastudies/) covering hundreds of associ-
ations of common genetic variants with complex traits . These
reports have provided valuable insights into the genetic architecture
of disease, including inflammatory bowel disease, macular degenera-
tion, and obesity [5–7]. Nonetheless, GWAS for complex diseases have
had only measured success. Although many loci have been identified
and replicated in GWAS, many studies have failed to identify
significant associations. Likewise, the genetic markers that have
been identified through the GWAS approach are rarely functional
variants in the diseases with which they are associated. In addition,
most common variants that are identified by GWAS are responsible
for only a small portion of the genetic variation and thus there
remains a large amount of “missing heritability” [8,9]. If the “common
disease common variant hypothesis” underlying the GWAS approach
does not explain the genetic contributions to complex diseases then
what does [8,10]? It is likely that rare variants and/or genetic interac-
tions, epistasis, underlie a significant portion of the ‘missing heritabil-
ity’ not revealed by conventional GWAS analyses [11–13]. It is also
likely that complex mechanisms and higher orders of gene–gene
interactions underlie the pathogenesis of many (most) complex
diseases and lead to variations/alterations of the phenotype [14–17].
Identification of multiple genes contributing to disease pathogenesis
may help in understanding the effects on phenotype and in the
search for missing heritability . Nonetheless, the GWAS-based
interrogation of large numbers of anonymous single nucleotide polymorphisms
(SNPs) severely limits power, thus weakening our computational
ability to examine combinatorial gene–gene interactions.
We are interested in the genetic contribution(s) to preterm
birth. Preterm birth is an important, poorly understood clinical
problem [22–25]. The incidence of preterm birth (PTB) in the
United States is now 12%, or 1 in 8 women . It creates enormous
clinical, economic and psychological burdens. The pathogenesis has
remained elusive. Clinical tests and interventions to identify the
patients at risk for preterm birth have relied heavily on assessment
of common pathways associated with labor, such as myometrial
contractility, cervical ripening, and decidual/membrane activation
pathways. However, most of these interventions have proven ineffec-
tive. Multiple genes, gene networks, and variants have been associated
with preterm birth; however, single genes and pathways and simple
patterns of inheritance are inadequate to explain the pathogenesis of
the majority of preterm births . The pathogenesis of PTB may be
Genomics 101 (2013) 163–170
⁎ Corresponding author at: Department of Pediatrics, Women & Infants Hospital, 101
Dudley Street, Providence, RI 02905, USA. Fax: +1 401 453 7571.
E-mail address: JPadbury@wihri.org (J.F. Padbury).
better understood if the analysis incorporated a more complex model
that entails a host of genes, with environmental triggers
overlaying the genetics as well.
We have developed an approach for identifying a parsimonious
set of genes for the study of preterm birth validated by a priori bi-
ological information. We used a semantic data mining and natural
language processing approach to extract all published articles re-
lated to preterm birth . Then, the genes identified from public
databases and archives of expression arrays were aggregated with
the gene set curated from the literature. Lastly, pathway analysis
was used to impute genes from pathways identified during
curation. The curated articles and collected genetic information
form a unique resource for investigators interested in preterm
birth, the Database for Preterm Birth (dbPTB), publicly-accessible
at http://ptbdb.cs.brown.edu/dbPTBv1.php. Recently, results from
a genome-wide study of preterm birth, “GENEVA,” became available
in dbGaP . The dataset includes phenotypic information
and complete genotype data on nearly 2000 mothers, ranging
from 20 to 42 weeks of gestation. Since it has been demonstrated
that the genetic risk of preterm birth segregates heavily to the maternal
genome, we have concentrated our analysis only on maternal genotype
information . Using the curated genes from dbPTB, we have
analyzed the GENEVA data set from mothers only. The results of the refined
curated genes were further analyzed by gene set enrichment analysis.
2. Results
2.1. SNP basic association analysis
We applied standard case/control allelic testing in Plink v1.07 to
analyze the association of individual SNPs with preterm birth. In
the first analysis, we only used SNPs that belonged to the curated
dbPTB genes . We included SNPs within the genomic region
encompassing each gene as well as SNPs within 5 kb upstream or
downstream. Of the 617 genes identified in dbPTB, 551 were mapped
onto the Illumina 660 quad platform encompassing 9077 tag SNPs. In
the second analysis, we ran a genome-wide comparison using all of
the SNPs on the Illumina platform (n=560,768 SNPs). Preterm
women were divided into three gestational age categories: less
than 30 weeks of gestation (n=92), less than 34 weeks of gestation
(n=446), and less than 37 weeks of gestation (n=884) and
compared with women who delivered greater than or equal to
38 weeks of gestation (n=960). A Manhattan plot showing the re-
sults for all three preterm gestational age groups across the dbPTB
SNPs is shown in Fig. 1. As can be seen from this representative plot
of chromosome 1, there were multiple regions where associations
were seen for all three patient groups. Similar results were seen for
the other chromosomes. While several regions demonstrated −log P
values greater than 2.5 (P < 0.0032), no significant single variants were
identified in the dbPTB set of curated genes that withstood Bonferroni's
correction for multiple comparisons (P < 5.5×10^−6). The lowest
P-value was 1.45×10^−4 for the SNP (rs5742637) which belongs to the
insulin-like growth factor 1 gene (IGF1). In the genome-wide analysis
only a single variant reached the Bonferroni-corrected significance
threshold (P < 8.9×10^−8). The P-value for this SNP (rs12682166) was
4.99×10−8. This SNP did not map within any known gene nor were
2.2. GSEA analysis of dbPTB genes
Gene set enrichment analysis was carried out using the 9077 SNPs
within the 515 curated dbPTB genes. We compared the cases less than
30 weeks of gestation with the controls described in the previous
section. For the GSEA analysis we selected the following analytical
options: tag SNPs within 5 kb upstream and downstream of each
gene ; gene sets including “canonical pathways,
GO biological process, GO molecular function, GO cellular component;”
and gene set sizes ranging from 5 to 200. For comparison,
we performed GSEA on the genome-wide SNP data (n=560,768
SNPs). All parameters were kept the same when running the genome-wide analysis.
From the dbPTB-based pathway analysis, we identified a total of 30
pathways with high confidence values (false discovery rate, FDR, corrected
for multiple comparisons, below 0.05; see Supplemental Table 1). From
the whole genome based pathway analysis 39 pathways with high
confidence were identified. When we compared the analyses, there
were 37 shared pathways. A view illustrating the pathway analysis
results for both analyses is shown in Fig. 2 and Table 1. The vertical
axis represents the −log P values for each of the pathways. The statis-
tical values for the pathways identified by dbPTB are shown in dark
blue and for the whole genome analysis are shown in light blue.
Each of the pathways is shown along the horizontal axis. The thresh-
old value of −log P=1.3 corresponds to an FDR P value less than 0.05
which has already been adjusted for multiple comparisons. There
were nineteen shared pathways that reached significance by either
analysis. The dbPTB analysis showed greater significance in 15 out of
the 19 shared significant pathways (Fig. 2). There were 13 significant
pathways that were only identified using the dbPTB curated set of
genes and 33 significant pathways that were only identified in the
whole genome analysis. Remarkably, most of the pathways involving
inflammation were either shared or identified only by the dbPTB-
based pathway analysis (Supplemental Table 1). Prominent among
the results from the genome-wide analysis were metabolomic pathways,
including phospholipase A2 activity, amino acid derivative
biosynthetic processes, the trans-Golgi network, the enzyme inhibitor
pathway, lipase activity, nitrogen compound biosynthetic processes,
carboxyl esterase activity, and hypotaurine metabolism
(Supplemental Table 2). A summary comparison
of the results of the pathway analyses from both dbPTB and
the genome-wide data is shown in Supplemental Table 3 and Supple-
mental Fig. 1.
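As an illustration of how two pathway analyses of this kind can be compared, the following minimal Python sketch groups pathways into shared, dbPTB-only and genome-wide-only sets and converts FDR values to the −log scale used in Fig. 2. The pathway names and FDR values are hypothetical placeholders, not results from this study, and the function names are our own.

```python
import math

# Hypothetical FDR values for illustration only; NOT results from the study.
dbptb_fdr = {"pathway_a": 0.004, "pathway_b": 0.03, "pathway_c": 0.20}
genome_fdr = {"pathway_a": 0.02, "pathway_c": 0.01, "pathway_d": 0.04}

FDR_CUTOFF = 0.05  # 'high confidence' threshold quoted in the text


def neg_log10(p):
    """Convert a P or FDR value to the -log10 scale plotted in Fig. 2."""
    return -math.log10(p)


def compare_pathways(dbptb, genome, cutoff=FDR_CUTOFF):
    """Split pathways into shared and analysis-specific groups."""
    shared = sorted(set(dbptb) & set(genome))
    # Shared pathways that reach the cutoff in at least one analysis.
    significant_shared = [p for p in shared
                          if min(dbptb[p], genome[p]) < cutoff]
    # Of those, pathways where the curated-gene analysis has the lower FDR.
    dbptb_stronger = [p for p in significant_shared if dbptb[p] < genome[p]]
    return {
        "shared": shared,
        "dbptb_only": sorted(set(dbptb) - set(genome)),
        "genome_only": sorted(set(genome) - set(dbptb)),
        "significant_shared": significant_shared,
        "dbptb_stronger": dbptb_stronger,
    }


if __name__ == "__main__":
    print(compare_pathways(dbptb_fdr, genome_fdr))
    # The plotting threshold: -log10(0.05) is approximately 1.3.
    print(round(neg_log10(FDR_CUTOFF), 2))
```

The cutoff of 0.05 corresponds to the −log P threshold of 1.3 described above; any pathway bar above that line in a plot like Fig. 2 would appear in `significant_shared` here.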
3. Discussion
Although there have been some successes, GWAS-based approaches
have failed to provide comprehensive explanations for the
genetic basis of many complex diseases . There are many
challenges in identification of the causative genes. As noted above
for complex diseases, gene–gene interactions are a far more likely
model as complex molecular networks and metabolic pathways are
involved in polygenic diseases [31,33–35]. For our approach, we
took into consideration the a priori biological information about
genes involved in preterm birth from the published literature and
from available expression arrays. In addition to these initial steps
we included pathway analysis to impute additional genes likely to
be involved from pathways identified during curation. Combining
these three sources powered the curated gene set for our disease of
interest, preterm birth . Although we increased our power by focusing on
a smaller number of comparisons, none of the identified single gene
variants reached statistical significance. By employing pathway
based permutation testing we identified important genes and their
variants in this important disorder. Moreover, by using a more
parsimonious, curated set of genes or variants with demonstrated
biological significance, we greatly enhance our statistical power.
This was most evident in the statistical validation of pathways
involved in inflammation. Those pathways were not evident in the
genome wide analysis but were solely identified using the curated
set of genes for permutation testing.
Since a portion of the ‘missing heritability’ is likely explained by
gene–gene interactions, we employed a pathway-based approach to
analyze the results from the large GWAS on preterm birth . Our
pathway-based approach used the SNPs selected for the dbPTB set of
genes and the whole genome from the GENEVA data. In order to enhance
our likelihood of success by selecting the most “extreme phenotype”,
we restricted our analysis to a comparison of controls who delivered
at 38 weeks of gestation or later with patients who delivered at
30 weeks of gestation or earlier.
A. Uzun et al. / Genomics 101 (2013) 163–170
Fig. 1. Manhattan plots of three preterm gestational age groups. We analyzed the single SNP association with genes from dbPTB by comparing the controls (38 gestational weeks and
higher) and three gestational groups (37, 34, and 30 gestational weeks or less). Manhattan plots showing the results for three preterm gestational age groups. The Y-axis indicates the −log
37 gestational weeks or less.
In order to generate the “P-values” needed for the pathway analysis,
we first carried out single variant analysis using both the dbPTB curated
genes and whole genome data. As already noted, we did not find a
significant single variant associated with any known gene using either
the dbPTB curated gene set or the genome-wide data. By comparison,
the pathway-based approach yielded rich and significant
results which replicate the findings from other studies . Among
the ranked list of SNPs in the dbPTB curated gene analysis, the best
SNP (rs5742637) mapped onto the IGF1 gene. IGF1 was identified in
the dbPTB gene set from a single manuscript which sought candidate
genes associated with coagulation and inflammatory pathways in
preterm birth . In that report, 1536 SNPs in 130 candidate genes
were interrogated and IGF1 was one of the significant findings. In
the pathway analysis, there were a total of 3 significant pathways
which included IGF1. These included the erythrocyte differentiation
pathway, prostate cancer and PIP3 signaling in cardiac myocytes.
These overlap with the pathways with which IGF1 has been more broadly
associated, as listed in the preterm birth database, which include
hypertrophic cardiomyopathy, the mTOR signaling pathway and
prostate cancer. Also of note, IGF1R was identified in the preterm
birth database as associated with preterm birth. This was the result of
a large linkage analysis done in the Finnish population . In the lat-
ter report, the association of IGF1R with preterm birth was verified by
haplotype analysis in a larger, independent group of patients . It is
likely that the failure to identify IGF1R in both our curated gene study
and the pathway analysis is due to the omission of these tag SNPs on
our genotyping platform. Nonetheless, the importance of this path-
way is suggested by our results and others . IGF1 and its signaling
pathway were included in previous candidate analyses because of
their participation in the decidua-chorioamniotic, and systemic in-
flammation signaling pathways which involve the PI3kinase and
mTOR signaling pathways , both of which were prominent in
The pathways identified in our analysis are not independent but
instead show a rich network of connectivity. This can best be seen
graphically in Fig. 3. Gephi was used to make the network maps .
In this figure, the blue nodes represent the shared pathways and orange
nodes those pathways only identified by dbPTB. Likewise the genes
forming the connectivity are displayed. AKT1 was the most connected
gene, being identified as contributing to a significant role in preterm
birth in 15 pathways (Fig. 4). An alternative way to view the strength of the pathway analysis is not to look
solely at the gene contributing to the most pathways, but to identify
Fig. 2. Comparison of the pathway analysis results from dbPTB genes and from genome wide analysis. The vertical axis represents the −log P values for each of the pathways. The
statistical values for dbPTB are shown in dark gray bars and for the whole genome analysis are shown in light gray bars. Each of the pathways is shown along the horizontal
axis. The threshold value of −log P=1.3 corresponds to an FDR P value less than 0.05 which has already been corrected for multiple comparisons.
Table 1. The significant pathways from dbPTB in comparison to the analysis from the whole genome.
The table shows the gene set FDR values of the significant pathways from dbPTB
in comparison to the analysis from the whole genome. The threshold of FDR < 0.25 denotes
the confidence of ‘possible’ or ‘hypothesis’, while the threshold of FDR < 0.05 is
regarded as ‘high confidence’ or ‘with statistical significance’.
Gene sets    Gene set FDR⁎    Gene set FDR⁎ for whole genome
Arginine and proline metabolism
Cardiac EGF pathway
Electron carrier activity
Fatty acid metabolism
Arginine and proline metabolism
Arachidonic acid metabolism
P53 signaling pathway
Toll like receptor signaling pathway
Long term depression
Small cell lung cancer
Multi organism process
Nitrogen compound metabolic process
Regulation of blood pressure
Regulation of kinase activity
Regulation of protein kinase activity
Regulation of transferase activity
Response to biotic stimulus
Response to other organism
SIG PIP3 signaling in cardiac myocytes
Smooth muscle contraction
which pathways had the most genes contributing to their significance
and which genes these were. These results are shown in Figs. 5A and
B. Breast cancer estrogen signaling and oxidoreductase activity pathways
each had 10 different genes contributing significantly to their
involvement in preterm birth in the dbPTB analysis.
While our approach to pathway analysis was hypothesis-free,
inspection of the genes which showed a significant relationship to
preterm birth reveals that several of the traditional mechanisms for
preterm birth were highly represented. This includes inflammation
and metabolomic disorders. Infection and inflammation have been
strongly linked to preterm birth . Genes involved in inflammatory
mechanisms that emerged from the pathway analysis include: IL6,
TGFB2, NOS1, NFKB1, AKT1, IRAK1, TLR3, TLR7, TP53, IFNG, and AR.
These are all important genes, receptors and signaling elements in
inflammation. Many of the associations of these inflammatory genes
with preterm birth emerged from their involvement in other related
pathways including GSK3 signaling, small cell lung cancer, organism
processes in response to biotic stimulus and PI3K signaling, protein
serine3 kinase activity and the NFAT pathway (Fig. 3). These genes
were not identified in the only other published candidate and
pathway-based interrogation although, in the latter study, pathways
associated with inflammation were identified including JAK-STAT
signaling, MAP kinase signaling, T cell receptor signaling and the Toll-
like receptor signaling pathway .
Metabolomics has recently been identified as an emerging tech-
nology that may provide clues to the pathogenesis of preterm birth
that were not previously apparent [38,39]. In a recent report, gas
chromatography–mass spectrometry was used to profile low molecu-
lar weight compounds in amniotic fluid of patients delivering with
preterm birth with and without intra-amniotic inflammation . A
classification profile was developed which subsequently allowed
Fig. 3. Network map of significant pathways for dbPTB genes. The blue nodes represent the shared pathways and orange nodes the pathways only identified by dbPTB. Likewise the
genes (green symbols) forming the connectivity are displayed.
correct classification of patients with preterm birth. We identified
several pathways involved in metabolomics that may provide a clue
to the genetic architecture underlying the role of metabolomics in
preterm birth. These include electron carrier activity, arginine and pro-
line metabolism, the signaling pathways involved in GSK3 and PI3K,
tyrosine metabolism, response to biotic stimuli, the oxidoreductase
pathway (Fig. 5B), protein oligomerization and serine threonine kinase
activity. Of the genes associated with these pathways, the most prominent
were NOS1 (which is also involved in inflammation), protein
kinase C-alpha and ALK, associated with phosphotransferase activity
alcohol group as acceptor, kinase activity, and transferase activity
transferring phosphorus containing groups. The strength of our
approach can be seen through the inclusion of NOS1. NOS1 was not
identified during the literature curation process or during the aggrega-
able now is the inclusion of NOS1 in multiple pathways through the
GSEA analysis. NOS1 contributed prominently to the significance of
the smooth muscle contraction pathway, the oxidoreductase activity
pathway, arginine and proline metabolism, and small cell lung cancer. In
similar fashion, AKT1 was included in the dbPTB set of curated genes
through pathway imputation. What is likewise remarkable is that
ed to significance in the pathway analysis (Fig. 4). AKT1 contributed to
identification of signaling pathways as diverse as GSK3 signaling, tight
junctions, prostate cancer, small cell lung cancer, PI3K signaling in
cardiac myocytes, and telomerase pathways. AKT1 was also prominent in
the pathways that were shared between the dbPTB analysis and the
whole genome-based pathway analysis, including PI3K signaling,
HSA04150 mTOR signaling, the eIF4 pathway, protein serine3 kinase activity,
phosphotransferase activity to alcohol groups, general kinase activity,
melanoma, transferase activity transferring phosphorus-containing
groups, and the NFAT pathway.
We are especially interested in comparing the pathway results from
the dbPTB curated genes and the pathway results from the genome-
wide analysis. As noted above, while there were 37 shared pathways,
there were 33 significant pathways that were only identified in the
whole genome-based pathway analysis. Prominent among these were
metabolomic pathways, including phospholipase A2 activity, amino
acid derivative biosynthetic processes, the trans-Golgi network, the
enzyme inhibitor pathway, lipase activity, nitrogen compound
biosynthetic processes, carboxyl esterase activity, and
hypotaurine metabolism. In contrast, most of the pathways involving
inflammation were either shared or identified only by the dbPTB-
based pathway analysis. This demonstrates the strength of our hybrid
Fig. 4. The most connected gene. AKT1 was the most connected gene, being identified as contributing to a significant role in preterm birth in 15 pathways.
Fig. 5. The most connected pathways. Breast cancer estrogen signaling and oxidoreductase
activity pathways each had 10 different genes contributing significantly to
their involvement in preterm birth. A. Breast cancer estrogen signaling. B. Oxidoreductase activity.
approach to identification of relevant genes and pathways in complex
diseases. There have been other efforts to collate information on
preterm birth. PTBGene is a publicly available database which stores
published information on genetic associations with preterm birth .
The database currently includes 84 genes with 189 polymorphisms.
Using meta analyses these investigators reported 5 significant variants.
Four of them were maternal and one was in the newborn.
In summary, we used a bioinformatically-driven strategy to
identify a parsimonious set of genes associated with preterm birth.
By aggregating genes from literature curation, publicly available
databases (most often from transcriptome-wide analysis) and then
using pathway-based imputation, we identified 617 genes for which
there was a priori biological evidence for involvement in preterm
birth. The tag SNPs associated with these genes were then used in
traditional candidate gene association testing using data from the
GENEVA genome-wide association study. While we increased our
power by focusing on a smaller number of comparisons, none of the
identified single gene variants reached statistical significance. We
did, however, corroborate the best of those curated genes, IGF1, in
the pathway-based analyses in both the dbPTB pathway analysis and
the genome-wide analysis. The database for preterm birth was built
to support analysis of gene/gene interactions. With
extremely large sets of SNPs, it is clearly computationally expensive to
carry out even pair-wise comparisons of genes. Moreover, the
knowledge-based association of genetic variation with disease dictates
that not all variants interact with each other. Rather,
gene/gene interactions occur on the basis of known biological infor-
mation. This body of information has been built into robust databases
including KEGG, Biocarta, DAVID, GO and Ingenuity . Although
pathway-based analysis methods help us in understanding and eval-
uating GWAS data, improvements are forthcoming. Better summary
statistics will help to evaluate the results more robustly as described
in Wang et al. . Likewise, gene-level P-values, which usually
depend on SNP association tests, are limited by the number and choice
of SNPs on the arrays. Another limitation is the difficulty of identifying which
SNP is the best representative of a given gene by considering not
only the best association P-values but also the combined effects of
SNPs in linkage disequilibrium .
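To make the computational argument about pair-wise comparisons concrete, the short Python sketch below counts the number of unordered SNP pairs for the two SNP sets described in this paper (560,768 genome-wide SNPs and 9077 tag SNPs in the curated genes). It is only a back-of-the-envelope illustration; the function name is ours, and it counts SNP pairs rather than gene pairs.

```python
import math

N_GENOME_SNPS = 560_768  # SNPs on the genotyping platform (from the text)
N_DBPTB_SNPS = 9_077     # tag SNPs in the curated dbPTB genes (from the text)


def n_pairs(n_snps):
    """Number of unordered pairs among n_snps markers: n choose 2."""
    return math.comb(n_snps, 2)


if __name__ == "__main__":
    genome_pairs = n_pairs(N_GENOME_SNPS)
    dbptb_pairs = n_pairs(N_DBPTB_SNPS)
    print(f"genome-wide pairs: {genome_pairs:.3e}")
    print(f"curated-set pairs: {dbptb_pairs:.3e}")
    print(f"reduction factor:  {genome_pairs / dbptb_pairs:.0f}")
```

Restricting the search to the curated set shrinks the pair-wise search space by more than three orders of magnitude, which is the practical motivation for a parsimonious gene set.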
The gene set enrichment approach allowed us to interrogate the
known biological associations annotated by several of these data-
bases. By permutation testing, we compared the association of the
single nucleotide polymorphisms tagging the genes in our dataset
and their association in cases and controls. Even given the anticipated
improvements in pathway-based methods, the results were extraor-
dinary. We identified a large number of significant pathways in
which biologically relevant curated genes and their associated variants
showed significant segregation between preterm and
full term births. Moreover, the curated genes from the dbPTB dataset
gave much stronger associations than the genome wide analysis in
all but a few of these pathways. These results provide important con-
firmation of the role of genetic architecture in the risk of preterm
birth. They also provide important mechanistic insights and curated
genes which are suitable for future genetic association testing or
ideal targets for more thorough evaluation including targeted
re-sequencing. We recognize that, due to the lack of a replication
dataset, this study should be considered hypothesis generating and
that these results will need to be replicated in an appropriate dataset.
4. Materials and methods
4.1. dbPTB; the database for preterm birth
We identified 186 genes using the literature-based curation, 215
genes from publicly available databases and an additional 216
genes from the pathway-based imputation . These 617 genes
represent a robust set of genes for which there is good prior biological
evidence for involvement in preterm birth .
4.2. The Gene Environment Association Studies initiative (GENEVA) data
We analyzed the single nucleotide polymorphism (SNP) genotyping
data from a prospective cohort study in Denmark. The data were
derived from the Gene Environment Association Studies initiative
(GENEVA) funded by the trans-NIH Genes, Environment, and Health
Initiative (GEI) . The data from GENEVA consist of approximately
information from a genome-wide case/control study using approximately
1000 preterm mother–child pairs. There are also data from 1000
control mother–child pairs where the child was born at greater than or
equal to 38 weeks of gestation. All data were deposited into the Data-
base for Genotypes and Phenotypes (dbGaP) . Genome wide SNP
genotyping was performed using the Illumina Human660W-Quad_v1_A
(n=560,768 SNPs) at the Center for Inherited Disease Research,
Baltimore, MD. As reported in the data set release, genotypes were not
reported for any SNP which had a call rate less than 85% or which had
more than 1 replicate error as defined with the HapMap control
4.3. SNP association testing in PLINK
We ran basic SNP association tests in PLINK to obtain individual
marker P-values . The basic association test is based on comparing
allele frequencies between cases and controls. PLINK is a free, open-
source whole genome association analysis toolset which performs a
range of basic, large-scale analyses . The SNP-association analyses
were conducted in PLINK using only curated-genes from dbPTB as
well as using all the SNPs from the genome-wide analysis. For these
analyses, the study “controls” consisted of the 960 mothers who had
delivered at 38 weeks of gestation or higher. For comparison we
carried out the same curated gene analysis using three different
patient groups from the GENEVA study. We analyzed the single SNP
association with PTB by comparing the controls with the 884 patients
delivering at less than 37 weeks, the 446 patients delivering at less than
34 weeks, and the 92 patients delivering at less than 30 weeks.
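The basic allelic test and the Bonferroni thresholds quoted in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration, assuming that the "basic association test" is an uncorrected 1-df chi-square test on a 2×2 table of allele counts; it is not PLINK's implementation, and the allele counts in the usage example are hypothetical.

```python
import math


def allelic_chi2(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """1-df chi-square test on a 2x2 allele-count table (no continuity correction).

    Returns (chi-square statistic, P-value).
    """
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    row_case = case_a1 + case_a2
    row_ctrl = ctrl_a1 + ctrl_a2
    col_a1 = case_a1 + ctrl_a1
    col_a2 = case_a2 + ctrl_a2
    if min(row_case, row_ctrl, col_a1, col_a2) == 0:
        return 0.0, 1.0  # monomorphic marker or empty group: no test possible
    chi2 = (n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2
            / (row_case * row_ctrl * col_a1 * col_a2))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical allele counts: 92 cases (184 alleles), 960 controls (1920 alleles).
    chi2, p = allelic_chi2(70, 114, 500, 1420)
    print(f"chi2={chi2:.2f}, P={p:.2e}")
    # Thresholds for the curated SNP set and the genome-wide SNP set.
    print(f"{bonferroni_threshold(9077):.1e}")
    print(f"{bonferroni_threshold(560768):.1e}")
```

With α = 0.05, dividing by 9077 and by 560,768 tests reproduces the thresholds of 5.5×10^−6 and 8.9×10^−8 used in the single-variant analysis.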
4.4. Gene set enrichment analysis
In recent years, gene set enrichment analysis (GSEA, ) has
become increasingly popular to support analysis of gene–gene inter-
actions and to help in understanding the individual contribution(s)
of biological pathways to genetic architecture. GSEA employs a new
way of considering the SNPs in GWAS . Instead of analyzing single
SNPs individually, GSEA tests disease association with genetic
variants in functionally related genes by analyzing the genes that
belong to the same pathway which may represent the possible SNP
or gene association with complex diseases . We first performed
GSEA on GENEVA GWAS data, using the SNP P-values from the
dbPTB curated genes with i-GSEA4GWAS. The i-GSEA4GWAS web
server implements i-GSEA to explore GWAS data efficiently.
“i-GSEA” is an enhanced application and extension of GSEA. The program
runs the analysis in three steps. First, it maps the variants to the
genes; each gene is represented by the −log (P-value) of closely spaced
SNPs in the gene. Second, i-GSEA is performed to identify the pathways
correlated to traits based on the distribution of enrichment scores
generated by permutation. FDR is calculated and used to correct for
multiple testing. The threshold of FDR < 0.25 denotes the confidence
of ‘possible’ or ‘hypothesis’, while the threshold of FDR < 0.05 is
regarded as ‘high confidence’ or ‘with statistical significance’. Finally,
the program lists significant pathways, the genes within those path-
ways which contributed to the significant associations along with
the SNPs that contributed to the association.
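The three steps above can be sketched in simplified form as follows. This is a toy illustration and not the i-GSEA4GWAS algorithm: we assume each gene is scored by the largest −log10(P) among SNPs within 5 kb of the gene, we use an unweighted Kolmogorov-Smirnov-style running sum as the enrichment score, and we permute gene scores rather than SNP labels. All gene names, coordinates, SNP identifiers and P-values are hypothetical.

```python
import math
import random


def gene_scores(snps, genes, flank=5000):
    """Score each gene by the largest -log10(P) among SNPs within `flank` bp."""
    scores = {}
    for name, (chrom, start, end) in genes.items():
        hits = [-math.log10(p) for (_snp_id, c, pos, p) in snps
                if c == chrom and start - flank <= pos <= end + flank]
        if hits:
            scores[name] = max(hits)
    return scores


def enrichment_score(scores, gene_set):
    """Unweighted running-sum statistic over genes ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_in = sum(1 for g in ranked if g in gene_set)
    n_out = len(ranked) - n_in
    if n_in == 0 or n_out == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked:
        running += 1.0 / n_in if g in gene_set else -1.0 / n_out
        best = max(best, running)
    return best


def permutation_p(scores, gene_set, n_perm=1000, seed=0):
    """Empirical P-value from shuffling scores across genes (a simplification)."""
    rng = random.Random(seed)
    observed = enrichment_score(scores, gene_set)
    names, values = list(scores), list(scores.values())
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if enrichment_score(dict(zip(names, values)), gene_set) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Hypothetical genes: name -> (chromosome, start, end).
GENES = {"GENE_A": ("1", 10000, 20000), "GENE_B": ("1", 50000, 60000),
         "GENE_C": ("2", 10000, 20000), "GENE_D": ("2", 80000, 90000)}
# Hypothetical SNPs: (id, chromosome, position, P-value).
SNPS = [("rs_h1", "1", 6000, 1e-4), ("rs_h2", "1", 15000, 0.2),
        ("rs_h3", "1", 55000, 1e-3), ("rs_h4", "2", 12000, 0.5),
        ("rs_h5", "2", 85000, 0.3), ("rs_h6", "1", 3000, 1e-9)]

if __name__ == "__main__":
    scores = gene_scores(SNPS, GENES)
    print(permutation_p(scores, {"GENE_A", "GENE_B"}))
```

In a real analysis the empirical P-values for all gene sets would then be corrected for multiple testing (for example by FDR, as described above) before applying the 0.25 and 0.05 thresholds.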
Supplementary data to this article can be found online at http://
AU built the original curation database and web based interfaces,
participated in weekly curation meetings, had direct involvement in
all genetic analysis, and wrote and edited the paper.
ATD provided oversight and guidance on the initial genetic ap-
proaches, contributed to single variant analysis, gave feedback and
guidance on GSEA analysis, and edited the paper.
SI aided in identification of tag SNPs, provided alternative semantic
data mining results to cross-check the original data, participated
in monthly discussions of the evolving data and analyses,
contributed to the original funding, and edited the paper.
JFP developed the initial concepts, participated in the weekly
curation meetings, had direct involvement in all genetic analyses,
and wrote and edited the paper.
This work was supported by the National Foundation March of
Dimes Prematurity Initiative # 21-FY08-563, and the National Institutes
of Health Grants NIH-5T35HL094308-02 and NIH-NCRR P20
RR018728. GWAS data for preterm birth were analyzed from the
Genome-Wide Association Studies of Prematurity and Its Complications,
dbGaP study accession number “phs000103.v1.p1”. This
research was conducted using computational resources and services
at the Center for Computation and Visualization, Brown University.
 W.G. Feero, A.E. Guttmacher, F.S. Collins, Genomic medicine—an updated primer,
N. Engl. J. Med. 362 (2010) 2001–2011.
 F.S. Collins, E.D. Green, A.E. Guttmacher, M.S. Guyer, A vision for the future of
genomics research, Nature 422 (2003) 835–847.
 M.M. Iles, Genome-wide association studies, Methods Mol. Biol. 713 (2011)
 L.A. Hindorff, J. MacArthur, A. Wise, H.A. Junkins, P.N. Hall, A.K. Klemm, T.A.
Manolio, A Catalog of Published Genome-Wide Association Studies, in, 2012.
 C.G. Mathew, New links to the pathogenesis of Crohn disease provided by
genome-wide association scans, Nat. Rev. Genet. 9 (2008) 9–14.
 A. Dewan, M. Liu, S. Hartman, S.S. Zhang, D.T. Liu, C. Zhao, P.O. Tam, W.M. Chan, D.S.
Lam, M. Snyder, C. Barnstable, C.P. Pang, J. Hoh, HTRA1 promoter polymorphism in
wet age-related macular degeneration, Science 314 (2006) 989–992.
 J.T. Glessner, J.P. Bradfield, K. Wang, N. Takahashi, H. Zhang, P.M. Sleiman, F.D.
Mentch, C.E. Kim, C. Hou, K.A. Thomas, M.L. Garris, S. Deliard, E.C. Frackelton, F.G.
Otieno, J. Zhao, R.M. Chiavacci, M. Li, J.D. Buxbaum, R.I. Berkowitz, H. Hakonarson,
S.F. Grant, A genome-wide study reveals copy number variants exclusive to childhood
obesity cases, Am. J. Hum. Genet. 87 (2010) 661–666.
 D.B. Goldstein, Common genetic variation and human traits, N. Engl. J. Med. 360
 E.T. Cirulli, D.B. Goldstein, Uncovering the roles of rare variants in common disease
through whole-genome sequencing, Nat. Rev. Genet. 11 (2010) 415–425.
 J.N. Hirschhorn, Genome-wide association studies—illuminating biologic pathways,
N. Engl. J. Med. 360 (2009) 1699–1701.
 T.A. Manolio, F.S. Collins, N.J. Cox, D.B. Goldstein, L.A. Hindorff, D.J. Hunter, M.I.
McCarthy, E.M. Ramos, L.R. Cardon, A. Chakravarti, J.H. Cho, A.E. Guttmacher, A.
Kong, L. Kruglyak, E. Mardis, C.N. Rotimi, M. Slatkin, D. Valle, A.S. Whittemore,
M. Boehnke, A.G. Clark, E.E. Eichler, G. Gibson, J.L. Haines, T.F.C. Mackay, S.A.
McCarroll, P.M. Visscher, Finding the missing heritability of complex diseases,
Nature 461 (2009) 747–753.
 R. Cowper-Sallari, M.D. Cole, M.R. Karagas, M. Lupien, J.H. Moore, Layers of epistasis:
genome-wide regulatory networks and network approaches to genome-wide
association studies, Wiley Interdiscip. Rev. Syst. Biol. Med. 3 (2011) 513–526.
 E.E. Eichler, J. Flint, G. Gibson, A. Kong, S.M. Leal, J.H. Moore, J.H. Nadeau, Missing
heritability and strategies for finding the underlying causes of complex disease,
Nat. Rev. Genet. 11 (2010) 446–450.
 T. Hu, N.A. Sinnott-Armstrong, J.W. Kiralis, A.S. Andrew, M.R. Karagas, J.H. Moore,
Characterizing genetic interactions in human disease association studies using
statistical epistasis networks, BMC Bioinforma. 12 (2011) 364.
study data in bipolar disorder reveal genes mediating ion channel activity and
synaptic neurotransmission, Hum. Genet. 125 (2008) 63–79.
 T.G. Lesnick, S. Papapetropoulos, D.C. Mash, J. Ffrench-Mullen, L. Shehadeh, M. de
approach to a complex disease: axon guidance and Parkinson disease, PLoS Genet. 3
 T.G. Lesnick, E.J. Sorenson, J.E. Ahlskog, J.R. Henley, L. Shehadeh, S. Papapetropoulos,
D.M. Maraganore, Beyond Parkinson disease: amyotrophic lateral sclerosis and the
axon guidance pathway, PLoS One 3 (2008) e1449.
 G. Gibson, Hints of hidden heritability in GWAS, Nat. Genet. 42 (2010) 558–560.
 J. Gui, A.S. Andrew, P. Andrews, H.M. Nelson, K.T. Kelsey, M.R. Karagas, J.H. Moore,
A robust multifactor dimensionality reduction method for detecting gene–gene
interactions with application to the genetic analysis of bladder cancer susceptibility,
Ann. Hum. Genet. 75 (1) (Jan. 2011) 20–28.
 H.J. Cordell, Detecting gene–gene interactions that underlie human diseases, Nat.
Rev. Genet. 10 (2009) 392–404.
 J.H. Moore, Detecting, characterizing, and interpreting nonlinear gene–gene
interactions using multifactor dimensionality reduction, Adv. Genet. 72
 K.M. Adams, D.A. Eschenbach, The genetic contribution towards preterm delivery,
Semin. Fetal Neonatal. Med. 9 (2004) 445–452.
 K.S. Crider, N. Whitehead, R.M. Buus, Genetic variation associated with preterm
birth: a HuGE review, Genet. Med. 7 (2005) 593–604.
 R. Menon, S.J. Fortunato, P. Thorsen, S. Williams, Genetic associations in preterm
birth: a primer of marker selection, study design, and data analysis, J. Soc.
Gynecol. Investig. 13 (2006) 531–541.
 L.J. Muglia, M. Katz, The enigma of spontaneous preterm birth, N. Engl. J. Med. 362
 C.R. Weinberg, M. Shi, The genetics of preterm birth: using what we know to design
better association studies, Am. J. Epidemiol. 170 (2009) 1373–1381.
 C.P. Weiner, C.W. Mason, Y. Dong, I.A. Buhimschi, P.W. Swaan, C.S. Buhimschi,
Human effector/initiator gene sets that regulate myometrial contractility during
term and preterm labor, Am. J. Obstet. Gynecol. 202 (2010), (474 e471–474 e).
 A. Uzun, A. Laliberte, J. Parker, C. Andrew, E. Winterrowd, S. Sharma, S. Istrail, J.F.
Padbury, dbPTB: a database for preterm birth, Database (Oxford) 2012 (2012)
 M.D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A.
Kiang, J. Paschall, L. Phan, N. Popova, S. Pretel, L. Ziyabari, M. Lee, Y. Shao, Z.Y.
Wang, K. Sirotkin, M. Ward, M. Kholodov, K. Zbicz, J. Beck, M. Kimelman, S.
Shevelev, D. Preuss, E. Yaschenko, A. Graeff, J. Ostell, S.T. Sherry, The NCBI
dbGaP database of genotypes and phenotypes, Nat. Genet. 39 (2007) 1181–1186.
 H.A. Boyd, G. Poulsen, J. Wohlfahrt, J.C. Murray, B. Feenstra, M. Melbye, Maternal
contributions to preterm delivery, Am. J. Epidemiol. 170 (2009) 1358–1364.
 S.B. Gabriel, S.F. Schaffner, H. Nguyen, J.M. Moore, J. Roy, B. Blumenstiel, J. Higgins,
M. DeFelice, A. Lochner, M. Faggart, S.N. Liu-Cordero, C. Rotimi, A. Adeyemo, R.
Cooper, R. Ward, E.S. Lander, M.J. Daly, D. Altshuler, The structure of haplotype
blocks in the human genome, Science 296 (2002) 2225–2229.
 J. McClellan, M.-C. King, Genetic heterogeneity in human disease, Cell 141 (2010)
 K. Wang, M. Li, H. Hakonarson, Analysing biological pathways in genome-wide
association studies, Nat. Rev. Genet. 11 (2010) 843–854.
 K. Zhang, S. Cui, S. Chang, L. Zhang, J. Wang, i-GSEA4GWAS: a web server for
gene set enrichment analysis to genome-wide association study, Nucleic Acids Res.
38 (2010) W90–W95.
 E.E. Schadt, Molecular networks as sensors and drivers of common human diseases,
Nature 461 (2009) 218–223.
 R. Haataja, M.K. Karjalainen, A. Luukkonen, K. Teramo, H. Puttonen, M. Ojaniemi, T.
Varilo, B.P. Chaudhari, J. Plunkett, J.C. Murray, S.A. McCarroll, L. Peltonen, L.J. Muglia,
A. Palotie, M. Hallman, Mapping a new spontaneous preterm birth susceptibility
gene, IGF1R, using linkage, haplotype sharing, and association analysis, PLoS Genet.
7 (2011) e1001293.
 M. Bastian, S. Heymann, M. Jacomy, Gephi: an open source software for exploring and
manipulating networks, International AAAI Conference on Weblogs and Social
 R. Romero, J. Espinoza, F. Gotsch, J.P. Kusanovic, L.A. Friel, O. Erez, S. Mazaki-Tovi,
N.G. Than, S. Hassan, G. Tromp, The use of high-dimensional biology (genomics,
transcriptomics, proteomics, and metabolomics) to understand the preterm parturition
syndrome, BJOG 113 (Suppl. 3) (2006) 118–135.
 S. Gracie, C. Pennell, G. Ekman-Ordeberg, S. Lye, J. McManaman, S. Williams, L.
Palmer, M. Kelley, R. Menon, M. Gravett, An integrated systems biology approach
to the study of preterm birth using “-omic” technology—a guideline for research,
BMC Pregnancy Childbirth 11 (2011) 71.
 S.M. Dolan, M.V. Hollegaard, M. Merialdi, A.P. Betran, T. Allen, C. Abelow, J. Nace,
B.K. Lin, M.J. Khoury, J.P. Ioannidis, S. Bagade, X. Zheng, R.A. Dubin, L. Bertram, D.R.
Velez Edwards, R. Menon, Synopsis of preterm birth genetic association studies:
the preterm birth genetics knowledge base (PTBGene), Public Health Genomics
13 (2010) 514–523.
 V.K. Ramanan, L. Shen, J.H. Moore, A.J. Saykin, Pathway analysis of genomic data:
concepts, methods and prospects for future development, Trends Genet. 28 (7)
(Jul. 2012) 323–332.
 GENEVA (Gene–Environment Association Studies), in, 2012.
 S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M.A. Ferreira, D. Bender, J. Maller,
P. Sklar, P.I. de Bakker, M.J. Daly, P.C. Sham, PLINK: a tool set for whole-genome
association and population-based linkage analyses, Am. J. Hum. Genet. 81