Novel genes identified in a high-density genome
wide association study for nicotine dependence
Laura Jean Bierut1,*, Pamela A.F. Madden1, Naomi Breslau2, Eric O. Johnson3,
Dorothy Hatsukami4, Ovide F. Pomerleau5, Gary E. Swan6, Joni Rutter7, Sarah Bertelsen1,
Louis Fox1, Douglas Fugman8, Alison M. Goate1, Anthony L. Hinrichs1, Karel Konvicka9,
Nicholas G. Martin10, Grant W. Montgomery10, Nancy L. Saccone1, Scott F. Saccone1,
Jen C. Wang1, Gary A. Chase11, John P. Rice1and Dennis G. Ballinger9
1Department of Psychiatry, Washington University School of Medicine, 660 South Euclid, Box 8134, St Louis, MO
63110, USA,2Michigan State University, East Lansing, MI, USA,3Research Triangle Institute International, Research
Triangle Park, NC, USA,4University of Minnesota, Minneapolis, MN, USA,5University of Michigan, Ann Arbor, MI,
USA,6SRI International, Menlo Park, CA, USA,7National Institute on Drug Abuse, Rockville, MD, USA,8Rutgers
University, Piscataway, NJ, USA,9Perlegen Sciences, Mountain View, CA, USA,10Queensland Institute of Medical
Research, Herston QLD, Australia and11Penn State College of Medicine, Hershey, PA, USA
Received August 11, 2006; Revised and Accepted November 15, 2006
Tobacco use is a leading contributor to disability and death worldwide, and genetic factors contribute in part
to the development of nicotine dependence. To identify novel genes for which natural variation contributes to
the development of nicotine dependence, we performed a comprehensive genome wide association study
using nicotine dependent smokers as cases and non-dependent smokers as controls. To allow the efficient,
rapid, and cost effective screen of the genome, the study was carried out using a two-stage design. In the first
stage, genotyping of over 2.4 million single nucleotide polymorphisms (SNPs) was completed in case and
control pools. In the second stage, we selected SNPs for individual genotyping based on the most significant
allele frequency differences between cases and controls from the pooled results. Individual genotyping was
performed in 1050 cases and 879 controls using 31 960 selected SNPs. The primary analysis, a logistic
regression model with covariates of age, gender, genotype and gender by genotype interaction, identified
35 SNPs with P-values less than 1024(minimum P-value 1.53 3 1026). Although none of the individual
findings is statistically significant after correcting for multiple tests, additional statistical analyses support
the existence of true findings in this group. Our study nominates several novel genes, such as Neurexin 1
(NRXN1), in the development of nicotine dependence while also identifying a known candidate gene,
the b3 nicotinic cholinergic receptor. This work anticipates the future directions of large-scale genome
wide association studies with state-of-the-art methodological approaches and sharing of data with the
Tobacco use, primarily through cigarette smoking, is respon-
sible for about five million deaths annually, making it the
largest cause of preventable mortality in the world (1), and
nicotine is the component in tobacco that is responsible for
the maintenance of smoking. Because of increasing tobacco
use in developing nations, it is predicted that the death toll
worldwide will rise to more than 10 million per year by 2020.
In the USA, 21% of adults were current smokers in 2004,
with 23% of men and 19% of women smoking (2). Each
year, ?440 000 people die of a smoking-related illness (3).
The economic burden of smoking is correspondingly high.
Annual costs are estimated at $75 billion in direct medical
# The Author 2006. Published by Oxford University Press. All rights reserved.
For Permissions, please email: firstname.lastname@example.org
*To whom correspondence should be addressed. Tel: þ1 3143623492; Fax: þ1 3143624247; Email: email@example.com
Human Molecular Genetics, 2007, Vol. 16, No. 1
Advance Access published on December 7, 2006
by guest on June 1, 2013
expenses and $92 billion in lost productivity. The prevalence of
cigarette smoking has decreased over the last 30 years in the
USA, primarily through smokers’ successful efforts to quit.
Yet, the rate of smoking cessation among adults has been
slowing since the mid-1990s underscoring the limitations of
current treatments for smoking. In addition, adolescents con-
tinue to initiate cigarette use, with 21% of high school students
reporting cigarette smoking in the last month (4).
Smoking behaviors, including onset of smoking, smoking
persistence (current smoking versus past smoking) and nico-
tine dependence, cluster in families (5), and large twin
studies indicate that this clustering reflects genetic factors
(6–10). Previous approaches have used genetic linkage
studies (11–14) and candidate gene tests (15–17) to identify
chromosomal regions and specific genetic variants suspected
to be involved in smoking and nicotine dependence. We
have extended the search for genetic factors by performing a
high-density whole genome association study using a case-
control design in unrelated individuals to identify common
genetic variants that contribute to the transition from cigarette
smoking to the development of nicotine dependence.
The final sample of 1050 nicotine dependent case subjects and
lation stratification, and no evidence of admixture was observed.
Quality control measures were applied to the individually
genotyped SNPs and 31 960 SNPs were available for analysis.
The most significant findings are presented in Table 1 for
those SNPs with a P-value of less than 1024. Several genes
not previously implicated in the development of nicotine
dependence are listed and their hypothesized mechanism of
involvement is discussed below. The most significant result
was observed with rs2836823 (P-value ¼ 1.53 ? 1026). This
SNP is intergenic, as are several of the top findings. A SNP
was defined as ‘intergenic’ if it was not physically in a gene
or within 10 kb of a known transcribed region. See Figure 1
for an overview of the individual genotyping results.
Because of the dense genome-wide scope of our study, the
interpretation of these P-values was complicated by the large
number of statistical tests. Approximately 2.4 million SNPs
were examined in the pooled screening stage. Although this
is a large sample with nearly 2000 subjects, no SNP showed
a genome-wide significant P-value after Bonferroni correction
for multiple tests. Yet, several independent lines of evidence
provided support that true genetic associations were identified
in this top group of SNPs.
We used the agreement of direction of effect for the top
SNPs in the Stage I samples (those included in the pooled
genotyping, n ¼ 948) as compared with those samples added
in Stage II (n ¼ 981) as a measure of evidence for real associ-
ations within the dataset. If there were no true associations in
the data, the expectation would be a random assortment of
effect direction between the two sample sets. In contrast, 30
of the top 35 SNPs in the Stage I samples show the same direc-
tion of effect in the additional Stage II sample set. This level
of agreement was highly significant, with a P-value of
1.1 ? 1025from the binomial distribution indicating the
error rate associated with rejecting the hypothesis of chance
agreement. Thus, our top SNPs were enriched for real and
reproducible allele frequency differences between cases and
Further evidence for the presence of true associations came
from comparison of these results with a candidate gene study
conducted simultaneously (described in the companion paper
by Saccone et al. (18). The b3 nicotinic receptor candidate
gene, CHRNB3, the most significant finding in the candidate
gene study, was also tagged by SNPs identified in the
genome wide association study. This gene has a strong prior
probability of a relationship with nicotine dependence, and
the likelihood of any of the candidate genes in the study by
Saccone and colleagues being selected in the top group of
SNPs in the genome wide association study is less than 5%.
To investigate the accuracy of pooled genotyping estimates
of the allele frequency differences between cases and controls,
we examined the relationship between the pooled and individ-
ual genotyping results. The pooled genotyping indeed
enriched the selected set of SNPs for sizable allele frequency
differences between cases and controls included in the pooled
study. When P-values were computed from individual geno-
types using only Stage I samples, there is a strong enrichment
of small P-values (Fig. 2A). If the pooled genotyping was not
at all successful, the distribution of P-values would be
uniform, and if the pooling was completely accurate, then
only small P-values would be present in the individual geno-
typing stage assessed in this sample subset. As seen in
Figure 2A, our results lie between these extremes. We also
examined the P-values of the samples added into the Stage II,
which were not in the pooling step. Because these Stage II
samples are an independent random sample from the case and
control populations, they are not expected to show the same
allele frequency differences as Stage I samples where those
differences are due to sampling error. Thus, their P-values
should be uniformly distributed except for possible real associ-
ations, which would be consistent between the two sets of
samples. This is seen in Figure 2B. The graph is fairly
uniform with only a slight increase in small P-values.
In addition, we directly compared allele frequency estimates
based on the pooled genotyping with those based on individual
genotyping. As seen in Figure 3, the majority of the allele fre-
quency estimates from the pooled and individual genotyping
results lie along the diagonal. A similar finding is seen if case
orcontrol samples are examinedseparately. Wecomputeda cor-
relation of 87% between allele frequencies estimated from the
case pooled genotyping and allele frequencies computed in the
individual genotyping sample of cases from Stage I (case sub-
jects n ¼ 482). Similarly, there was an 84% correlation of
allele frequencies seen in the comparison of the pooled and indi-
vidual genotyping in the control sample from Stage I (control
subjects n ¼ 466). When we compared the allele frequency
differences between cases and controls in pools (which is
implicitly large because the SNPs were selected for individual
genotyping) with the difference between cases and controls in
the individual genotyping, we found a 58% correlation. This
indicates a high level of concordance between the pooled and
individual genotyping results; thus, the pooled genotyping was
successful in identifying SNPs that would show allele frequency
differences in individually genotyped case and control subjects.
Human Molecular Genetics, 2007, Vol. 16, No. 125
by guest on June 1, 2013
Lastly, we examined potential differences between the US
and Australian samples. A comparison of cases and controls
from the two populations did not show any significant differ-
ences by gender or stratification results.
Smoking contributes to the morbidity and mortality of a large
component of the population and twin studies provide strong
evidence that genetic factors contribute substantially to the
risk of developing nicotine dependence. This is the first
high density, genome wide association study with the goal
to identify common susceptibility or resistance gene variants
for nicotine dependence.
Several novel genes were identified in this study as potential
Neurexin1(NRXN1).There were atleast twosignals inNRXN1
(Table 2). The SNP rs10490162 is weakly correlated with the
other two SNPs that were genotyped in the gene (maximum
pair wise correlation is r2¼ 0.45 with the other two SNPs,
which were found to be in strong disequilibrium with each
Table 1. SNPs with primary model P-value , 0.0001. Listed genes are within 10 kb of the SNP position
SNPGeneChrPos(bp) Risk Allelea
Primary P-valueMale odds ratio
Female odds ratio
AGER, PBX2, AGER
?Significantly different Odds Ratio for men and women.
aThe risk allele is chosen arbitrarily to be the allele more prevalent in cases to facilitate comparison of effect sizes across SNPs. This does not imply
that the effect of the variant is known in any case; the other allele could be protective. In addition, the alleles could be complementary to those reported
in dbSNP (see online SNP information).
bTwo Chr 10 SNPs with r2correlation of 0.89.
cThe allele frequency for rs999 is quite different in these data than reported in dbSNP; this may represent a failure to accurately genotype this SNP in
dTwo Chr 2 SNPs with r2correlation of 0.91 (the other two Chr 2 SNPs have pair-wise correlations of ,50%).
eNine Chr 14 SNPs with minimum pair-wise r2correlation of .0.85.
fFour Chr 9 SNPs with minimum pair-wise r2correlation of .0.85.
gTwo Chr 5 SNPs with r2correlation of 0.99 (the other two Chr 5 SNPs are uncorrelated).
hFour Chr 11 SNPs with minimum pair-wise r2correlation of .0.95.
iTwo Chr 8 SNPs with r2correlation of 1.
26 Human Molecular Genetics, 2007, Vol. 16, No. 1
by guest on June 1, 2013
other). Interestingly, another neurexin gene, Neurexin 3
(NRXN3), was reported as a susceptibility gene for polysub-
stance addiction in a pooled genome wide association study
by Uhl and colleagues (19). In addition, the most significant
SNP in NRXN3 in our study, rs2221299, had a P-value of
0.0034. While there was substantially less evidence for associ-
ation with NRXN3 in our study, the fact that two independent
studies of substance dependence found evidence of association
with neurexin genes merits further investigation.
The neurexin gene family is a group of polymorphic cell
surface proteins expressed primarily in neurons that function
in cell–cell interactions and are required for normal neuro-
transmitter release (20). Neurexins are important factors in
GABAergic and glutamatergic synapse genesis and are the
differentiation. NRXN1 and NRXN3 are among the largest
known human genes, and they utilize at least two promoters
and alternatively spliced exons to produce thousands of distinct
mRNA transcripts and protein isoforms. It is hypothesized that
synaptic specialization. Because substance dependence is
modeled as a relative imbalance of excitatory and inhibitory
neurotransmission (or related to ‘disinhibition’) (21), the neur-
exin genes are plausible new candidate genes that contribute to
the neurobiology of dependence through the regulated choice
between excitatory or inhibitory pathways. Biological charac-
terization of these genes may define a role of neural develop-
ment or neurotransmitter release and dependence.
as a potential contributor to nicotine dependence. Interestingly,
three independent genetic linkage studies of smoking (11–13)
identified a region on chromosome 9 near this gene. This
gene appears to control the cycling of proteins through the
cell membrane, and there are numerous alternative transcripts.
Variants in the VPS13A gene cause progressive neurodegenera-
tion and red cell acanthocytosis (22). Another novel gene for
further study is TRPC7 (transient receptor potential canonical)
channel which encodes a subunit of multimeric calcium chan-
nels (23). A recent study using animal model indicated that
TRPC channels can functionally regulate nicotine-induced
neuronal activity in the locomotion circuitry (24).
There are several other genes tagged by the top SNPs. An
alpha catenin gene, CTNNA3, inhibits Wnt signaling and has
variants that affect the levels of plasma amyloid beta protein
(Abeta42) in Alzheimer’s disease families (25), though other
reports fail to find an association with Alzheimer’s disease
(26). The CLCA1 gene encodes a calcium-activated chloride
channel that may contribute to the pathogenesis of asthma
(27) and chronic obstructive pulmonary disease (28). While
none of these genes has a known relationship to nicotine
metabolism or mechanism of action, they are involved in
Figure 1. P-values of genome-wide association scan for genes that affect the risk of developing nicotine dependence. 2log10(p) is plotted for each SNP in
chromosomal order. The spacing between SNPs on the plot is based on physical map length. The horizontal lines show P-values for logistic analysis. The vertical
lines show chromosomal boundaries. Black diamonds represent SNPs that result in non-synonymous amino acid changes.
Human Molecular Genetics, 2007, Vol. 16, No. 1 27
by guest on June 1, 2013
brain and lung function and therefore have plausible biological
relationships to smoking behavior and dependence. Replica-
tion of these findings and additional biological characteriz-
ation of these variants and genes may solidify these
In addition to the novel genes implicated in the genome
wide association study, a classic candidate gene, the b3
nicotinic receptor (CHRNB3) is among the top group. The
nicotinic receptors are a family of ligand-gated ion channels
that mediate fast signal transmission at synapses. Nicotine is
an agonist of these receptors that produce physiological
The SNPs were tested for varying gender effects as part
of the primary analytic model. Several of the top SNPs had
significantly different odds ratios for men and women
(Table 1). It is clear from epidemiological data that there are
significant gender differences in the risk for the development
of dependence, and this study provides evidence that separate
genes may contribute to the development of nicotine depen-
dence in men and women. Following the primary analyses,
we further analyzed the top ranked SNPs to determine if
there was evidence for other modes of transmission, such as
recessive or dominant models. There was no evidence for
improvement in the fit for either of these models for any of
the SNPs in the top group.
The maximum effect size for these top associated SNPs
is an odds ratio of 2.53. These estimates are likely to be
overestimates of the true population values due to the
‘jackpot effect’ of many multiple comparisons. Several
alternatives exist for correction of these estimates, but have
not been applied to these data. The effect size estimates are
consistent with multiple genes of modest effect contributing
to the development of dependence.
This genome wide association study is a first step in a
large-scale genetic examination of nicotine dependence. Our
analytic plan was determined a priori so that we would be
able to interpret the results most clearly. We purposefully
chose to examine the entire sample as the primary analysis,
rather than use a split sample design because we felt that
this had the greatest power to detect true findings (29).
Though we have evidence of true results in this study, confir-
mation in an independent sample is crucial.
Many other issues will need to be addressed in the future
examination of these data. For example, smoking and nicotine
dependence are correlated with many other disorders, such as
alcohol dependence and major depressive disorder (30–33).
Preliminary analyses of our sample have confirmed that this
clustering of other disorders with nicotine dependence is
present in our sample. In addition, nicotine dependence can
be defined by other measures, such as the American Psychia-
tric Association criteria in the Diagnostic and Statistical
Manual, Version IV (DSM-IV) (34). Previous work has
shown that though different measures of nicotine dependence
are correlated, there is not perfect overlap because the
Fagerstro ¨m Test for Nicotine Dependence (FTND) and
DSM-IV definitions focus on different features of dependence
(35). The FTND is a measure that focuses on physiological
dependence, whereas the DSM-IV dependence includes cogni-
tive and behavioral aspects of dependence. Different classifi-
cation by FTND and DSM-IV nicotine dependence is also
seen in our sample with 75% of our cases (FTND ?4) and
24% of our controls (FTND ¼ 0) affected with DSM-IV nico-
tine dependence. As we move forward with additional
analyses, which will include comorbid disorders and varying
definitions of nicotine dependence, we hope to explicate
some of the individual features that contribute to these find-
ings of association.
In summary, efforts to understand nicotine dependence are
important so that new approaches can be developed to reduce
tobacco use, especially cigarette smoking. This systematic
survey of the genome nominates novel genes, such as
NRXN1, that increase an individual’s risk of transitioning
from smoking to nicotine dependence. The continued genetic
and biological characterization of these genes will help in
understanding the underlining causality of nicotine dependence
and may provide novel drug development targets for smoking
cessation. These variants also may be involved in addictive
Figure 2. (A) Distribution of P-values from the Stage I sample of the 31 960
individually genotyped SNPs that were selected from pooled genotyping stage.
The distribution shows that the pooled genotyping produced an enrichment of
SNPs with small P-values. A uniform distribution from 0–1 would be
expected if there were no correlation between pooled genotyping and individ-
ual genotyping. (B) Distribution of P-values from the additional samples
added in Stage II. The distribution is fairly uniform with only a slight enrich-
ment of small P-values.
28Human Molecular Genetics, 2007, Vol. 16, No. 1
by guest on June 1, 2013
behavior in general. The current pharmacological treatments
for nicotine dependence continue to produce only limited absti-
nence success, and the tailoring of medications to promote
smoking cessation to an individual’s genetic background may
significantly increase the efficacy of treatment. Our work is
part of an emerging body of knowledge that may facilitate per-
sonalized approaches in the practice of medicine through
large-scale study of genetic variants. Novel targets can now
be studied and hopefully will facilitate the development of
improved treatment options to alleviate this major health
burden and reduce smoking-related deaths.
MATERIALS AND METHODS
The purpose of this study was to identify genes contributing to
the progression from smoking to the development of nicotine
dependence. As a result, the study examined the phenotypic
contrast between nicotine dependent subjects and individuals
who smoked but never developed nicotine dependence.
All subjects (1050 cases and 879 controls) were selected
from two ongoing studies: the Collaborative Genetic Study
of Nicotine Dependence, a US-based sample (St Louis,
Detroit and Minneapolis), and the Nicotine Addiction
Genetics study, an Australian-based, European-Ancestry
sample. The US sample was recruited through telephone
screening of community-based subjects to determine eligi-
bility for recruitment as case (current FTND ?4) or control
status. Qualifying subjects were invited to participate in the
genetic study. The Australian participants were enrolled at
the Queensland Institute of Medical Research as families
and spouses of the Australian Twin Panel.
The Institutional Review Board approved both studies, and
all subjects provided informed consent to participate. Blood
samples were collected from each subject for DNA analysis
and submitted together with electronic phenotypic data to
the NIDA Center for Genetic Studies, which manages the
sharing of research data in accordance with NIH guidelines.
All subjects were self-identified as being of European
descent. See Table 3 for further demographic details.
Equivalent assessments were performed at both sites. A per-
sonal interview that comprehensively assessed nicotine
dependence using several different criteria such as the
Fagerstro ¨m Test for Nicotine Dependence (36) and the Diag-
nostic and Statistical Manual of Mental Disorders-IV (34) was
Figure 3. Scatter plot of the allele frequencies from pooling and individual genotyping from the Stage I sample.
Human Molecular Genetics, 2007, Vol. 16, No. 1 29
by guest on June 1, 2013
Case definitions of nicotine dependence
The focus of this study was a case-control design of unrelated
individuals for a genetic association study of nicotine depen-
dence. Cases were defined by a commonly used definition of
nicotine dependence, a FTND score of 4 or more when
smoking the most (maximum score of 10) (36). No significant
difference was observed in FTND score between the US and
Australian samples (mean FTND: 6.43 for US and 6.06 for
Control subject status was defined as an individual who
smoked (defined by smoking at least 100 cigarettes during
their lifetime), yetnever
FTND ¼ 0). Historically, the threshold of smoking 100 or
more cigarettes has been used in survey research as a
definition of a ‘smoker’. With the selection of controls who
smoked, the study focused on those genetic effects related to
the transition from smoking to the development of nicotine
dependence. Additional data from the Australian twin panels
supports this designation of a control status. Among
monozygotic twins who smoked, the rate of nicotine depen-
dence, defined as a score of 4 or more using the Heavy
Smoking Index (HSI-an abbreviated version of the FTND)
(37), was lowest in those whose co-twin had an HSI score
of 0; lower even than in those whose co-twin had experimen-
ted with cigarettes, but never became a smoker, or those
whose co-twin had never smoked even a single cigarette
DNA was extracted from whole blood and EBV transformed
cell lines and was aliquoted and stored frozen at 2808C
until distributed to the genotyping labs.
Table 2. All SNPs individually genotyped in the genes NRNX1 and VPS13A
ratio (95% CI)
ratio (95% CI)
bPrimary 2df P-value from the logistic regression analysis.
Table 3. Distribution of sex, age, FTND score, and recruitment site in cases
Cases (n ¼ 1050) Controls (n ¼ 879)
Table 4. Prevalence of nicotine dependence in monozygotic twins
Co-twin smoking history Respondent % nicotine
dependent among smokers
Smoked 1–2 times
Smoked 3–20 times
Smoked 21–99 times
Smoked 100 times or more, HSI ¼ 0
Smoked 100 times or more, HSI ¼ 1
Smoked 100 times or more, HSI ¼ 2
Smoked 100 times or more, HSI ¼ 3
Smoked 100 times or more, HSI ¼ 4
Smoked 100 times or more, HSI ¼ 5
Smoked 100 times or more, HSI ¼ 6
30Human Molecular Genetics, 2007, Vol. 16, No. 1
by guest on June 1, 2013
To allow the efficient, rapid and cost-effective screening of
over 2.4 million SNPs, we performed a whole genome associ-
ation study using a two-stage design.
Stage I—pooled genotyping high-density oligonucleotide
In Stage I, 482 cases and 466 control DNA samples from US
and Australian subjects of European ancestry were selected for
study. To examine potential population stratification, we per-
formed a STRUCTURE analysis (38) using 295 individually
genotyped SNPs. The selected SNPs were roughly evenly
spaced across the autosomes and were selected for stratifica-
tion analyses (39). The STRUCTURE program identifies sub-
populations of individuals who are genetically similar through
a Markov chain Monte Carlo sampling procedure using
markers selected across the genome. There was no evidence
of population admixture. Cases and controls were then
placed in pools for genotyping of 2.4 million SNPs, and
estimates of allele frequency differences between case and
control pools were determined.
Pooled genotyping was performed using eight cases and
eight control pools. DNA was quantified using Pico
Green. The concentrations were normalized and verified to
within a coefficient of variation of , 10%. Equimolar
amounts of DNA from ?60 individuals were placed into
each of the 16 pools. An individual’s sample was included
in only one pool. The 16 pools were hybridized to 49 chip
designs to interrogate 2 427 354 SNPs across the whole
Determination of pooled allele frequency estimates
Allele frequencies were approximated using the intensities
collected from the high-density oligonucleotide arrays. A
SNP’s allele frequency p was a ratio of the relative
total amount of DNA, and thus can have values between
0 and 1:
reference allele to the
where CRef and CAlt are the concentrations of reference
allele and alternate allele, respectively. As probe intensities
were directly related to the concentrations of the SNP
alleles, the p ˆ computed from the intensities of reference
and alternate features was a good approximation of the
true allele frequency p. The p ˆ value was computed from
the trimmed mean intensities of perfect match features,
after subtracting a measure of background computed from
trimmed means of intensities of mismatch features:
^ p ¼
MMÞ þ ðITM
ITMwas the trimmed mean of perfect match or mismatch
intensities for a given allele and strand denoted by the sub-
script. The trimmed mean disregarded the highest and the
lowest intensity from the five perfect match intensities and
also from the five mismatch intensities in the 40-feature
tilings before computing the arithmetic mean.
Three quality control metrics were developed to assess the
reliability of the intensities for a SNP on an array scan. The
first metric, concordance, evaluated the presence of a target
for a SNP. The second metric, signal to background ratio,
related the amount of specific and non-specific binding, esti-
mated from the intensities of perfect match and mismatch fea-
tures. The third metric tracked the number of features in each
SNP tiling that had saturated intensities. Cutoffs were applied
to all three metrics, and SNP feature sets that did not pass were
discarded from further evaluation.
Concordance was computed independently for both refer-
ence and alternate allele feature sets, then a maximum was
taken of the two values. For each allele at each offset for
both the forward and reverse strand feature sets, the identity
of the brightest feature was noted. The concordance for a par-
ticular allele was computed as a ratio of the number of times
the perfect match feature was the brightest to the total number
of offsets over the forward and reverse strands. In the 40
feature SNP tiling each allele was represented by 20 features,
distributed along five offsets and forward and reverse strands.
was the number of times for allele X when the perfect
match feature was brighter than the mismatch feature over all
offsets and both strands, then:
concordance ¼ max
SNP feature sets with concordance ,0.9 were discarded from
Signal to background ratio was the ratio between the ampli-
tude of signal computed from trimmed means of perfect match
feature intensities and amplitude of background computed
from trimmed means of mismatch feature intensities. The
signal and background were computed as follows:
Human Molecular Genetics, 2007, Vol. 16, No. 1 31
by guest on June 1, 2013
The trimmed mean intensities ITMfor both the perfect match
and mismatch feature sets were obtained as described above.
SNP feature sets with signal/background ,1.5 were discarded
from further evaluations.
The number of saturated features was computed as
the number of features that reached the highest intensity poss-
ible for the digitized numeric intensity value. SNPs with
number of saturated features .0 were discarded from
Stage II SNP selection
Computation of empirical P-values to evaluate each SNP’s
association independently. Corrected t-test P-values were
computed similarly to regular t-test P-values. For testing of
the difference between average case p ˆ and average control p ˆ,
the standard error was corrected by a chip design-specific
additive constant. The additive constant was obtained by mini-
mizing the coefficient of variation of the t-tests for each chip
design. This standard error additive constant ensured that
SNP selection was not biased to low or high standard errors,
as there was no prior evidence that SNPs with low or high
standard errors were more or less likely to be associated
with the phenotype. The empirical P-values were computed
from ranks of the corrected t-test P-values for each chip
design by dividing the rank by the total number of passing
SNPs on the chip design. See Figure 4 for a distribution of
SNP selection criteria. The SNPs were selected from among
SNPs that had at least two passing p ˆ values for cases and con-
trols. Selected SNPs mapped onto human genome build 35 and
had successfully designed assays. An empirical P-value cutoff
of 0.0196 was used to select SNPs.
Stage II individual genotyping
For individual genotyping, we designed a custom array to inter-
rogate 41 402 SNPs that included SNPs selected from the
pooled genotyping (39 213) and stratification and quality
control SNPs (2189). In Stage II, we performed individual gen-
otyping on the original case and control samples and additional
case and control subjects of European descent, for a final
sample size of 1929 individuals (1050 cases and 879 controls).
Individual genotypes were determined by clustering all SNP
scans in the two-dimensional space defined by reference and
alternate perfect match trimmed mean intensities. Trimmed
mean intensities were computed as described above in
section ‘Determination of Pooled Allele Frequency Esti-
mates’. The genotype clustering procedure was an iterative
algorithm developed as a combination of K-means and con-
strained multiple linear regressions. The K-means at each
step reevaluated the cluster membership representing distinct
diploid genotypes. The multiple linear regressions minimized
the variance in p ˆ within each cluster while optimizing the
regression lines’ common intersect. The common intersect
defined a measure of common background that was used to
adjust the allele frequencies for the next step of K-means.
The K-means and multiple linear regression steps were iter-
ated until the cluster membership and background estimates
converged. The best number of clusters was selected by max-
imizing the total likelihood over the possible cluster counts of
1, 2 and 3 (representing the combinations of the three possible
diploid genotypes). The total likelihood was composed of data
likelihood and model likelihood. The data likelihood was
determined using a normal mixture model for the distribution
of p ˆ around the cluster means. The model likelihood was cal-
culated using a prior distribution of expected cluster positions,
resulting in optimal p ˆ positions of 0.8 for the homozygous
reference cluster, 0.5 for the heterozygous cluster and 0.2
for the homozygous alternate cluster.
A genotyping quality metric was compiled for each geno-
type from 15 input metrics that described the quality of the
SNP and the genotype. The genotyping quality metric corre-
lated with a probability of having a discordant call between
the Perlegen platform and outside genotyping platforms
(i.e. non-Perlegen HapMap project genotypes). A system of
10 bootstrap aggregated regression trees was trained using
an independent data set of concordance data between Perlegen
genotypes and HapMap project genotypes. The trained predic-
tor was then used to predict the genotyping quality for each of
the genotypes in this data set.
Hardy–Weinberg equilibrium (HWE) was tested separately
for cases and controls. SNPs that did not follow HWE at a
level of P-value , 10215in either cases or controls were dis-
carded. There were 859 and 797 autosomal SNPs excluded
because of this extreme disequilibrium in cases and controls,
respectively, and 765 of these SNPs were common to both
Figure 4. Plot of distributions of standard errors of SNPs selected using differ-
ent criteria. The plot illustrates that delta p ˆ cutoff selects preferentially SNPs
with high standard errors of delta p ˆ, regular t-test preferentially selects SNPs
with low standard errors and the corrected t-test is centered on the standard
error distribution from all SNPs.
32Human Molecular Genetics, 2007, Vol. 16, No. 1
by guest on June 1, 2013
groups. This level of deviation from HWE indicates issues
with SNP genotyping and clustering. Because association
with the phenotype can result in SNPs not being in HWE,
SNPs with HWE P-values between 1024and 10215were
visually inspected, and where problems with clustering were
detected, the SNP was discarded from further analysis. This
results in 31 960 SNPs available for analysis.
In order to avoid false positive results due to cryptic popu-
lation stratification in the larger sample, we repeated a
STRUCTURE analysis in the expanded sample of 1929
subjects (38) using genotype data for 289 well performing
SNPs (39). This again revealed no evidence of population
admixture. In addition, the non-inflated Q–Q plot of test
statistics in the Stage II only samples (Fig. 5) indicates a
lack of population admixture correlated with case control
The covariates available for individuals were sex, age, site
(USA or Australia) and sample (first or second). Prior to
performing genetic analyses, inspection of the data indicated
that the covariates of gender and recruitment site were import-
ant predictors of case and control status and were used as
covariates in the logistic regression model.
We developed an a priori analytic strategy so that we could
then interpret our results and avoid issues of multiple testing
from using varying methods of analysis. We chose to
examine the total sample of 1929 individuals in the primary
analysis because this had the greatest power to detect true find-
ings (29). For our primary single SNP association analyses, we
used logistic regression to incorporate the significant covari-
ates sex and site (USA and Australia), and tested the effect
of genotype together with a genotype-by-sex interaction
term using a standard likelihood-ratio x2statistic with two
degrees of freedom. This approach allowed us to detect
SNPs having gender-specific effects as well as SNPs with
similar effects in males and females. For these primary ana-
lyses, we coded genotype according to the number of ‘risk’
alleles (0, 1 or 2) where the risk allele was defined to be the
allele having higher frequency in cases than in controls. This
coding was additive on the log scale and thus corresponded
to a multiplicative genetic model. The full model was com-
pared to a reduced model including gender and recruitment
site only, and significance was assessed by a x2test with
two degrees of freedom. The resulting P-values were used to
rank the SNPs.
Following these primary analyses, we further analyzed the
top ranked SNPs to determine if there was significant evidence
for alternative modes of transmission such as dominant or
The authors wish to acknowledge the contributions of advisors
to this project. The NIDA Genetics Consortium, with Jonathan
Pollock, and NICSNP committees were vital to the success of
the research. The Data Analysis Committee helped oversee
analyses for the genome wide association studies and investi-
gated methodological issues in association analyses. Further,
the committee assisted in data management and data sharing
functions. In addition to the authors, committee members
included Andrew Bergen, Gerald Dunn, Mary Jeanne Kreek,
Huijun Ring, Lei Yu and Hongyu Zhao. At Perlegen Sciences,
we would like to acknowledge the work of Laura Stuve, Curtis
Kautzer, the genotyping laboratory, Laura Kamigaki, the
samplegroup,and John Blanchard,Geoff Nilsen, and the bioin-
formatics and data quality groups for excellent technical and
infrastructural support for this work performed under NIDA
Contract HHSN271200477471C. This work is supported by
NIH grants CA89392 from the National Cancer Institute,
DA12854 and DA015129 from the National Institute on Drug
Abuse, and the contract N01DA-0-7079 from NIDA. We are
greatly appreciative for the
preparation from Sherri Fisher. In memory of Theodore
Reich, founding Principal Investigator of COGEND; we are
indebted to his leadership in the establishment of COGEND,
and acknowledge his seminal scientific contributions to the
Data Access: Phenotypes and genotypes are available through
the NIDA Genetics Consortium to the scientific community at
the time of publication (http://nidagenetics.org).
Figure 5. Q–Q plot of logistic regression ANOVA deviance produced from
samples added to Stage I samples at Stage II. Because these samples are inde-
pendent of Stage I samples used for the SNP selection from pooled genotyping
the test statistic is expected to largely follow the null distribution (Chi-square
distribution with two degrees of freedom). Due to the lower power of this
sample set compared to the combined set of samples and the small effect
sizes found in this study, any possible associations are not expected to
cluster together at low P-values, thereby changing the linear shape of this
Q–Q plot. The dotted line represents 95% point-wise confidence envelope
of expected null distribution.
Human Molecular Genetics, 2007, Vol. 16, No. 1 33
by guest on June 1, 2013
Conflict of Interest statement. Dennis G. Ballinger and Karel
Konvicka are employed by Perlegen Sciences, Inc. With the
exception of D. Ballinger and K. Konvicka, none of the
authors or their immediate families are currently involved
with, or have been involved with, any companies, trade associ-
ations, unions, litigants or other groups with a direct financial
interest in the subject matter or materials discussed in this
manuscript in the past five years.
1. WHO (2006), The facts about smoking and health.http://
2. CDC (2005) Annual smoking-attributable mortality, years of potential life
lost, and productivity losses—United States, 1997–2001. Morb. Mortal.
Wkly Rep., 54, 625–628.
3. CDC (2005) Cigarette smoking among adults—United States, 2004.
Morb. Mortal. Wkly Rep., 54, 1121–1124.
4. CDC (2004) Cigarette use among high school students—United States,
1991–2003. Morb. Mortal. Wkly Rep., 53, 499.
5. Bierut, L.J., Dinwiddie, S.H., Begleiter, H., Crowe, R.R., Hesselbrock, V.,
Nurnberger, J.I., Jr., Porjesz, B., Schuckit, M.A. and Reich, T. (1998)
Familial transmission of substance dependence: alcohol, marijuana,
cocaine, and habitual smoking: a report from the Collaborative Study on
the genetics of alcoholism. Arch. Gen. Psychiatry, 55, 982–988.
6. Carmelli, D., Swan, G.E., Robinette, D. and Fabsitz, R. (1992) Genetic
influence on smoking—a study of male twins. N. Engl. J. Med., 327,
7. Heath, A.C. and Martin, N.G. (1993) Genetic models for the natural
history of smoking: evidence for a genetic influence on smoking
persistence. Addict. Behav., 18, 19–34.
8. True, W.R., Xian, H., Scherrer, J.F., Madden, P.A., Bucholz, K.K., Heath,
A.C., Eisen, S.A., Lyons, M.J., Goldberg, J. and Tsuang, M. (1999)
Common genetic vulnerability for nicotine and alcohol dependence in
men. Arch. Gen. Psychiatry, 56, 655–661.
9. Madden, P.A., Heath, A.C., Pedersen, N.L., Kaprio, J., Koskenvuo, M.J.
and Martin, N.G. (1999) The genetics of smoking persistence in men and
women: a multicultural study. Behav. Genet., 29, 423–431.
10. Lessov, C.N., Martin, N.G., Statham, D.J., Todorov, A.A., Slutske, W.S.,
Bucholz, K.K., Heath, A.C. and Madden, P.A. (2004) Defining nicotine
dependence for genetic research: evidence from Australian twins.
Psychol. Med., 34, 865–879.
11. Li, M.D., Ma, J.Z., Cheng, R., Dupont, R.T., Williams, N.J., Crews, K.M.,
Payne, T.J. and Elston, R.C. (2003) A genome-wide scan to identify loci
for smoking rate in the framingham heart study population. BMC Genet.,
4 (Suppl. 1), S103.
12. Bierut, L.J., Rice, J.P., Goate, A., Hinrichs, A.L., Saccone, N.L.,
Foroud, T., Edenberg, H.J., Cloninger, C.R., Begleiter, H., Conneally,
P.M. et al. (2004) A genomic scan for habitual smoking in families of
alcoholics: common and specific genetic factors in substance dependence.
Am. J. Med. Genet. A, 124, 19–27.
13. Gelernter, J., Liu, X., Hesselbrock, V., Page, G.P., Goddard, A. and
Zhang, H. (2004) Results of a genomewide linkage scan: support for
chromosomes 9 and 11 loci increasing risk for cigarette smoking.
Am. J. Med. Genet. B Neuropsychiatry Genet., 128, 94–101.
14. Swan, G.E., Hops, H., Wilhelmsen, K.C., Lessov-Schlaggar, C.N., Cheng,
L.S., Hudmon, K.S., Amos, C.I., Feiler, H.S., Ring, H.Z., Andrews, J.A.
et al. (2006) A genome-wide screen for nicotine dependence susceptibility
loci. Am. J. Med. Genet. B Neuropsychiatry Genet., 141, 354–360.
15. Li, M.D., Beuten, J., Ma, J.Z., Payne, T.J., Lou, X.Y., Garcia, V., Duenes,
A.S., Crews, K.M. and Elston, R.C. (2005) Ethnic- and gender-specific
association of the nicotinic acetylcholine receptor alpha4 subunit gene
(CHRNA4) with nicotine dependence. Hum. Mol. Genet., 14, 1211–1219.
16. Beuten, J., Ma, J.Z., Payne, T.J., Dupont, R.T., Crews, K.M., Somes, G.,
Williams, N.J., Elston, R.C. and Li, M.D. (2005) Single- and multilocus
allelic variants within the GABA(B) receptor subunit 2 (GABAB2) gene
are significantly associated with nicotine dependence. Am. J. Hum. Genet.,
17. Feng, Y., Niu, T., Xing, H., Xu, X., Chen, C., Peng, S., Wang, L. and
Laird, N. (2004) A common haplotype of the nicotine acetylcholine
receptor alpha 4 subunit gene is associated with vulnerability to nicotine
addiction in men. Am. J. Hum. Genet., 75, 112–121.
18. Saccone, S.F., Hinrichs, A.L., Saccone, N.L., Chase, G.A., Konvicka, K.,
Madden, P.A.F., Breslau, N., Johnson, E.O., Hatsukami, D., Pomerleau,
O. et al. (2006) Cholinergic nicotinic receptor genes implicated in a
nicotine dependence association study targeting 348 candidate genes with
3713 SNPs. Hum. Mol. Genet., 16, 36–49.
19. Liu, Q.R., Drgon, T., Walther, D., Johnson, C., Poleskaya, O., Hess, J. and
Uhl, G.R. (2005) Pooled association genome scanning: validation and use
to identify addiction vulnerability loci in two samples. Proc. Natl Acad.
Sci. U.S.A., 102, 11864–11869.
20. Craig, A.M., Graf, E.R. and Linhoff, M.W. (2006) How to build a central
synapse: clues from cell culture. Trends Neurosci., 29, 8–20.
21. Iacono, W.G., Carlson, S.R., Malone, S.M. and McGue, M. (2002) P3
event-related potential amplitude and the risk for disinhibitory disorders in
adolescent boys. Arch. Gen. Psychiatry, 59, 750–757.
22. Dobson-Stone, C., Danek, A., Rampoldi, L., Hardie, R.J., Chalmers, R.M.,
Wood, N.W., Bohlega, S., Dotti, M.T., Federico, A., Shizuka, M. et al.
(2002) Mutational spectrum of the CHAC gene in patients with
chorea-acanthocytosis. Eur. J. Hum. Genet., 10, 773–781.
23. Zagranichnaya, T.K., Wu, X. and Villereal, M.L. (2005) Endogenous
channels in HEK-293 cells. J. Biol. Chem., 280, 29559–29569.
24. Feng, Z., Li, W., Ward, A., Piggott, B.J., Larkspur, E.R., Sternberg, P.W.
and Xu, X.Z. (2006) A C. elegans model of nicotine-dependent behavior:
regulation by TRP-family channels. Cell, 127, 621–633.
25. Ertekin-Taner, N., Ronald, J., Asahara, H., Younkin, L., Hella, M.,
Jain, S., Gnida, E., Younkin, S., Fadale, D., Ohyagi, Y. et al. (2003)
Fine mapping of the alpha-T catenin gene to a quantitative trait locus on
chromosome 10 in late-onset Alzheimer’s disease pedigrees. Hum. Mol.
Genet., 12, 3133–3143.
26. Busby, V., Goossens, S., Nowotny, P., Hamilton, G., Smemo, S.,
Harold, D., Turic, D., Jehu, L., Myers, A., Womick, M. et al. (2004)
Alpha-T-catenin is expressed in human brain and interacts with the Wnt
signaling pathway but is not responsible for linkage to chromosome 10 in
Alzheimer’s disease. Neuromolecular Med., 5, 133–146.
27. Jeulin, C., Guadagnini, R. and Marano, F. (2005) Oxidant stress stimulates
Ca2þ -activated chloride channels in the apical activated membrane of
cultured nonciliated human nasal epithelial cells. Am. J. Physiol. Lung
Cell. Mol. Physiol., 289, L636–L646.
28. Hegab, A.E., Sakamoto, T., Uchida, Y., Nomura, A., Ishii, Y., Morishima,
Y., Mochizuki, M., Kimura, T., Saitoh, W., Massoud, H.H. et al. (2004)
CLCA1 gene polymorphisms in chronic obstructive pulmonary disease.
J. Med. Genet., 41, e27.
29. Skol, A.D., Scott, L.J., Abecasis, G.R. and Boehnke, M. (2006) Joint
analysis is more efficient than replication-based analysis for two-stage
genome-wide association studies. Nat. Genet., 38, 209–213.
30. Breslau, N., Novak, S.P. and Kessler, R.C. (2004) Daily smoking and the
subsequent onset of psychiatric disorders. Psychol. Med., 34, 323–333.
31. Breslau, N., Novak, S.P. and Kessler, R.C. (2004) Psychiatric disorders
and stages of smoking. Biol. Psychiatry, 55, 69–76.
32. Grant, B.F., Hasin, D.S., Chou, S.P., Stinson, F.S. and Dawson, D.A.
(2004) Nicotine dependence and psychiatric disorders in the United
States: results from the national epidemiologic survey on alcohol and
related conditions. Arch. Gen. Psychiatry, 61, 1107–1115.
33. Lasser, K., Boyd, J.W., Woolhandler, S., Himmelstein, D.U.,
McCormick, D. and Bor, D.H. (2000) Smoking and mental illness: a
population-based prevalence study. JAMA, 284, 2606–2610.
34. American Psychiatric Association (1994) Diagnostic and Statistical
Manual of Mental Disorders, 4th edn. American Psychiatric Association,
35. Breslau, N. and Johnson, E.O. (2000) Predicting smoking cessation and
major depression in nicotine-dependent smokers. Am. J. Public Health,
36. Heatherton, T.F., Kozlowski, L.T., Frecker, R.C. and Fagerstro ¨m, K.O.
(1991) The Fagerstro ¨m test for nicotine dependence: a revision of the
Fagerstro ¨m tolerance questionnaire. Br. J. Addict., 86, 1119–1127.
37. Heatherton, T.F., Kozlowski, L.T., Frecker, R.C., Rickert, W. and
Robinson, J. (1989) Measuring the heaviness of smoking: using
self-reported time to the first cigarette of the day and number of cigarettes
smoked per day. Br. J. Addict., 84, 791–799.
34Human Molecular Genetics, 2007, Vol. 16, No. 1
by guest on June 1, 2013
38. Pritchard, J.K., Stephens, M. and Donnelly, P. (2000) Inference of
population structure using multilocus genotype data. Genetics, 155,
39. Hinds, D.A., Stokowski, R.P., Patil, N., Konvicka, K., Kershenobich, D.,
Cox, D.R. and Ballinger, D.G. (2004) Matching strategies for genetic
association studies in structured populations. Am. J. Hum. Genet., 74,
40. Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger,
D.G., Frazer, K.A. and Cox, D.R. (2005) Whole-genome patterns of
common DNA variation in three human populations. Science, 307,
Human Molecular Genetics, 2007, Vol. 16, No. 1 35
by guest on June 1, 2013