Molecular Genetics of Addiction and Related Heritable Phenotypes

Molecular Neurobiology Branch, National Institutes of Health (NIH), Intramural Research Program (IRP), National Institute on Drug Abuse (NIDA), Baltimore, MD 21224, USA.
Annals of the New York Academy of Sciences (Impact Factor: 4.38). 11/2008; 1141(1):318-81. DOI: 10.1196/annals.1441.018
Genome-wide association (GWA) can elucidate molecular genetic bases for human individual differences in complex phenotypes that include vulnerability to addiction. Here, we review (a) evidence that supports polygenic models with (at least) modest heterogeneity for the genetic architectures of addiction and several related phenotypes; (b) technical and ethical aspects of importance for understanding GWA data, including genotyping in individual samples versus DNA pools, analytic approaches, power estimation, and ethical issues in genotyping individuals with illegal behaviors; (c) the samples and the data that shape our current understanding of the molecular genetics of individual differences in vulnerability to substance dependence and related phenotypes; (d) overlaps between GWA data sets for dependence on different substances; and (e) overlaps between GWA data for addictions versus other heritable, brain-based phenotypes that include bipolar disorder, cognitive ability, frontal lobe brain volume, the ability to successfully quit smoking, neuroticism, and Alzheimer's disease. These convergent results identify potential targets for drugs that might modify addictions and play roles in these other phenotypes. They add to evidence that individual differences in the quality and quantity of brain connections make pleiotropic contributions to individual differences in vulnerability to addictions and to related brain disorders and phenotypes. A "connectivity constellation" of brain phenotypes and disorders appears to receive substantial pathogenic contributions from individual differences in a constellation of genes whose variants provide individual differences in the specification of brain connectivities during development and in adulthood. Heritable brain differences that underlie addiction vulnerability thus lie squarely in the midst of the repertoire of heritable brain differences that underlie vulnerability to other common brain disorders and phenotypes.


Genome-Wide Association Approaches Identify “Connectivity Constellation” and Drug Target Genes with Pleiotropic Effects
George R. Uhl, Tomas Drgon, Catherine Johnson, Chuan-Yun Li, Carlo Contoreggi, Judith Hess, Daniel Naiman, and Qing-Rong Liu
Molecular Neurobiology Branch, National Institutes of Health (NIH), Intramural
Research Program (IRP), National Institute on Drug Abuse (NIDA),
Baltimore, Maryland, USA
Center for Bioinformatics, College of Life Sciences, Peking University, Beijing, China
Office of the Clinical Director, NIH-IRP (NIDA), Baltimore, Maryland, USA
Department of Mathematics, Johns Hopkins University, Baltimore, Maryland, USA
Key words: pleiotropic; cell adhesion; Monte Carlo
Address for correspondence: George R. Uhl, Molecular Neurobiology,
Box 5180, Baltimore, MD 21224. Voice: +410-550-2843x146,
fax: +410-550-1535.
Ann. N.Y. Acad. Sci. 1141: 318–381 (2008).
© 2008 New York Academy of Sciences.
doi: 10.1196/annals.1441.018
Genome-wide association (GWA) is now in-
creasingly the method of choice for identifying
allelic variants that contribute to complex ge-
netic disorders, especially those with polygenic
bases (i.e., those derived from effects at many
gene loci, each with modest effects, as well
as from environmental determinants).
Substance dependence was one of
the first complex phenotypes for which repli-
cated association-based genome scanning data
were reported.
A torrent of information
is now available from GWA studies of a num-
ber of other complex, brain-based phenotypes
that display substantial heritability and are un-
likely (based on linkage study results) to result
from many common gene variants that produce
large effects.
A number of these other her-
itable, brain-based phenotypes co-occur with
addictions and are thus good candidates to dis-
play genetic overlaps with addiction.
No single approach to designing GWA
studies or to analyzing GWA data is now
universally accepted. There is now no uni-
versal standard for considering GWA results
significant in ways that allow us to identify
polygenic allelic variants in reasonably sized
single experiments. Here, we describe specific
sets of working hypotheses about the genetic
architecture of addiction (e.g., the vulnerability
to develop dependence on an addictive
substance). These hypotheses are also useful
for considering the molecular genetic bases for
other common, complex phenotypes that, like
addictions, display both substantial evidence
for heritability and little evidence for large
influences from any single gene (e.g., single-
gene, Mendelian influences or oligogenic
effects that come from a few genetic loci, each
with moderate effects on the phenotype). We
then detail experimental design and analytic
approaches that arise from working hypotheses
about underlying genetic architecture and
likely sources of false positive results.
A number of samples provide the bases
for these analyses. We focus on clusters of
genomic markers whose allele frequencies
distinguish control individuals from those with
substance dependence or addiction-related
phenotypes. We describe identification of
chromosomal regions that contain clusters of
such nominally positive results in replicate
samples for addiction vulnerability. We then
describe evidence for generalization that arises
from identification of overlapping chromoso-
mal locations of clustered positive results for
different phenotypes. These data thus support
pleiotropic influences (i.e., contributions of the
same allelic variants to multiple phenotypes)
of common allelic variants on several of
the brain-based phenotypes. The data thus
document overlapping heritable influences on
several interesting brain phenotypes.
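The case-versus-control comparison of allele frequencies described above can be sketched, for a single SNP, as a 2 × 2 allelic chi-square test. This is a minimal illustration only and is not the analysis pipeline used in the studies reviewed here; the function name `allelic_chi_square` and the allele counts are hypothetical.

```python
import math

def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    Rows are cases and controls; columns are counts of allele A and allele B.
    Returns (chi_square_statistic, nominal_p_value).
    """
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 1,000 case chromosomes and 1,000 control chromosomes.
chi2, p = allelic_chi_square(case_a=560, case_b=440, ctrl_a=500, ctrl_b=500)
print(f"chi-square = {chi2:.2f}, nominal p = {p:.4f}")
```

A single nominally significant SNP of this kind carries little weight on its own; the approach described in the text looks for clusters of such results that recur in replicate samples.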
We focus here on clinical phenotypes that
co-occur with addiction and a structural brain
phenotype: individual differences in frontal cor-
tical volume. Twin studies document sizable
heritable components for individual differences
in the volumes of brain regions. High heri-
tabilities are especially evident for individual
differences in frontal and temporal cerebral
cortical regions.
Volumes of these brain re-
gions have been reported to be reduced in
substance-dependent individuals.
Evidence from functional magnetic resonance
imaging (fMRI) and positron emission
tomography (PET) studies of individuals with
substance dependence and related phenotypes
identifies functional differences in these brain
regions. We thus focus on this “frontal cortical
volume” phenotype.
A number of the genes identified here en-
code classical “druggable” targets for phar-
macological modulation, including enzymes,
receptors, and transporters. Other genes en-
code cell adhesion–related molecules. We dis-
cuss genes in each of these classes below.
Utility of GWA in Examining the
Molecular Bases of Heritable Influences
on Vulnerability to Addiction and
Related Phenotypes
One way to view GWA is in relationship to
linkage-based genome scanning because most
of the efforts to positionally clone gene variants
for complex human disorders (e.g., those that
are likely to be caused by multiple genetic and
environmental factors) have used linkage-based
methods. Linkage asks how addiction pheno-
types and genetic markers (typically genotyped
approximately every 1/400th–1/1,000th of the
genome) move together through pedigrees of
closely related individuals. Chromosomal re-
gions that contain marker alleles that move
through pedigrees together with the trait are
said to be linked to the trait. Many loci with
nominally significant linkage to addiction
phenotypes have been identified. Several of the
loci identified in independent linkage studies
do overlap. However,
the large numbers of reported linkage-based
studies of addictions yield large numbers of
nominally positive results that cover virtually all
chromosomes. These widespread results may
not converge more than expected by chance, as
we have documented in recent analyses of the
reported data for linkage to smoking.
Such inconsistent linkage data are consis-
tent with the idea that most of the genetic
architecture for human addiction vulnerabil-
ity is polygenic in most populations. A growing
consensus now holds that GWA (also termed
whole-genome association or association genome scan-
ning) is more likely than linkage approaches
to yield positive results in polygenic complex
disorders, such as addictions.
Association asks how addiction phenotypes and
genetic markers (genotyped approximately ev-
ery 1/500,000th to every 1/1,000,000th of the
genome in current data sets) are found together
in nominally unrelated individuals (although
we are all distantly related to each other, of
course). We and others have developed these
methods, relying on the increasing densities of
single nucleotide polymorphism (SNP) markers
that can be assessed using SNP chip microar-
rays of increasing sophistication.
GWA gains power as densities of genomic
markers increase. Association identifies much
smaller chromosomal regions than linkage-
based approaches, thus allowing us to iden-
tify variants in specific genes rather than in
large chromosomal regions. GWA fosters pool-
ing strategies that preserve confidentiality and
reduce costs, as we discuss below.
GWA also provides ample genomic controls.
Proper genomic controls can minimize the
chances that disease versus control differences
are confounded by occult stratification, such as
the stratification that might arise from unin-
tended occult ethnic mismatches between dis-
ease and control samples.
A Probable Underlying Genetic
Architecture for Addiction and Other
Heritable Brain-based Phenotypes
We approach analyses of the molecular ge-
netic bases of addiction and related disorders
from perspectives that are based on sets of
working hypotheses concerning the underly-
ing genetic architectures of these disorders and
phenotypes. In general, the best experimental
design and analytic approaches will probably
arise from the best possible working hypotheses
concerning (a) the genetic architectures of the
disorders or phenotypes being evaluated, (b)
the population genetics of the samples being
tested, and (c) the anticipated association sig-
nals. It is also desirable to consider and provide
controls for alternative hypotheses that might
explain systematic differences between disease
and control samples. Alternative hypotheses in-
clude (a) unintended stratification, such as that
based on racial or ethnic differences between
disease and control samples, as noted above;
(b) uneven distribution of noise in some assays
such that SNPs with the largest variance might
be identified rather than the SNPs whose allelic
frequencies truly differ between disease and
control individuals; (c) stochastic, chance dif-
ferences between disease and control samples
(at least some of these are highly likely in any
single study that uses many repeated compar-
isons); and (d) sampling issues such that genet-
ics related to the ways in which the samples are
ascertained and obtained (e.g., features such as
differential willingness to consent) are identified
rather than true disease versus control differ-
ences. Many of these concerns become more
prominent as hundreds of thousands or millions
of repeated comparisons are made using single
sample sets. Moreover, many of these concerns
may become more acute as attempts to rapidly
assemble larger and larger sample sizes drive
investigators to include subsamples that may
well contribute occult heterogeneities to over-
all disease and/or control samples (see below).
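The scale of the repeated-comparisons concern can be made concrete with simple arithmetic. The sketch below assumes, purely for illustration, that every SNP is tested independently at a nominal threshold of 0.05 and that no SNP is truly associated. It also shows a Bonferroni-corrected per-test threshold, a standard correction that the text does not itself propose and that is conservative when neighboring SNPs are correlated. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected count of nominally significant tests if every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that holds the family-wise error rate at family_alpha."""
    return family_alpha / n_tests

# Marker counts in the range mentioned in the text for current GWA data sets.
for n_snps in (500_000, 1_000_000):
    n_false = expected_false_positives(n_snps, alpha=0.05)
    threshold = bonferroni_threshold(n_snps)
    print(f"{n_snps} SNPs: about {n_false:.0f} chance positives at p < 0.05; "
          f"Bonferroni per-test threshold {threshold:.1e}")
```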
Family, Adoption, and Twin Data that
Support Substantial Polygenic
Heritability for Addictions
Current models for the genetic architecture
of substance dependence in the population are
based on information from (a) family, adoption,
and twin data that support substantial heritability
for addictions; (b) twin data that compare
concordance in genetically identical monozygotic
twins with concordance in genetically half-identical
dizygotic twins and that document that most of this
heritable influence is not substance-specific;
and (c) linkage-based and GWA studies that
fail to provide evidence for genes of major effect
(i.e., for any single gene whose variants produce
substantial differences in addiction vulnerabil-
ity) for substance dependence.
Support for the idea that vulnerability to ad-
dictions is a complex trait with strong genetic
influences that are largely shared by abusers
of different legal and illegal addictive
substances comes from classical genetic studies.
Family studies document that first-degree
relatives (e.g., siblings) of addicts display a
greater risk for developing substance depen-
dence than more distant relatives.
Adoption studies find greater similarities in levels of
substance abuse between adoptees and biolog-
ical relatives than between adoptees and mem-
bers of the adoptive families.
In twin studies,
differences in concordance between genetically
identical and fraternal twins also support her-
itability for vulnerability to addictions.
Twin data allow quantitation of the amount of
addiction vulnerability, about half, that is heritable.
Twin data also support the idea that the
environmental influences on addiction vulnerability
that are not shared among members of
twin pairs are much larger than those that are
shared by members of twin pairs (e.g., e² > c²
in virtually every such study). Most environmental
influences on human addiction vulnerability
are thus likely to come from outside of
the immediate family environment (Fig. 1).

Figure 1. Pie graph model for the genetic architecture
of human vulnerability to dependence on addictive
substances. Polygenic additive genetic influences
and environmental influences that are largely
those that are not shared between members of sibships
are depicted. Potential roles for g × g and
g × e interactions are not depicted here.
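One common way to obtain such estimates from twin data is Falconer's approximation, which compares the trait correlations of monozygotic and dizygotic pairs. The sketch below is an illustration under stated assumptions, not the model-fitting procedure of the twin studies cited here: the function name `falconer_ace` is hypothetical, the correlations are invented values chosen only to reproduce the qualitative pattern described in the text, and for a diagnosis such as dependence the correlations would ordinarily be estimated on an underlying liability scale.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer-style variance components from twin correlations.

    a2: additive genetic share, c2: shared-environment share,
    e2: nonshared-environment share (which also absorbs measurement error).
    """
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2

# Hypothetical monozygotic and dizygotic twin correlations.
a2, c2, e2 = falconer_ace(r_mz=0.55, r_dz=0.30)
print(f"a2 = {a2:.2f}, c2 = {c2:.2f}, e2 = {e2:.2f}")
```

With these invented inputs, roughly half of the variance is assigned to additive genetic influences and the nonshared environmental share exceeds the shared one.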
Twin Data Document that Most
of this Heritable Influence Is Not
Substance-Specific, but Provides
Higher-Order Pharmacogenomics
We are fortunate to have data from stud-
ies of identical versus fraternal twin pairs that
evaluate the degree to which one twin’s de-
pendence on a substance enhances the chance
that his or her co-twin will become dependent
on a substance of a different class. Results of
these analyses document that most of the ge-
netic influences on addiction vulnerability are
common to dependence on multiple different
substances, although others do appear to be
substance-specific. We have suggested the following
levels of analysis for pharmacogenomics
and pharmacogenetics: (a) primary pharmacogenomics,
which describes the genetics of individual
differences in the absorption, distribution,
metabolism, and/or excretion of a drug; (b)
secondary pharmacogenomics, which describes indi-
vidual differences in drug targets, such as the
G protein–coupled receptors, transporters, and
ligand-gated ion channels that are the primary
targets of opiates, psychostimulants, and bar-
biturates, respectively; and (c) higher-order phar-
macogenomics, which addresses individual dif-
ferences in postreceptor drug responses. Such
postreceptor drug responses are more likely to
be common to actions of abused substances that
come from several different chemical classes
and act at distinct primary receptor or trans-
porter sites in the brain. Based on the twin
data currently available, we thus postulate that
much, if not most, of the human genetics of
addiction vulnerability represents higher-order pharmacogenomics.
Failure to Document Evidence
for Substance-Dependence Genes
of Major Effect in Most Populations
Few careful studies have examined the ways
in which most human addiction vulnerabilities
move through families (e.g., segregation analyses).
No such study indicates a major gene effect on
addiction vulnerability in most current populations,
with one exception: the “flushing syndrome”
variants at the aldehyde dehydrogenase (ALDH) and
alcohol dehydrogenase loci in Asian individuals
do provide genes of major effect in this population.
Individuals with these gene variants are at
lower risk for becoming dependent on alcohol
than individuals with other genotypes in Chinese
and other populations. Homozygous ALDH2*2
individuals are strongly protected from alcohol
dependence. This locus thus provides a
good example of primary pharmacogenomics,
though in a restricted population.
Quantity–frequency data for smoking also
provide evidence for a replicable secondary
pharmacogenomic effect of moderate magni-
tude. Several studies have shown that markers
in the chromosome 15 gene cluster, which en-
codes the α3, α5, and β4 nicotinic acetylcholine
receptors, display different allelic frequencies in
heavy versus light smokers.
This chromo-
some 15 locus probably provides a good exam-
ple of secondary pharmacogenomics because
it has not been associated as reproducibly with
dependence on other substances.
Linkage-based analyses for addiction vul-
nerabilities would be expected to reproducibly
identify many of the genes whose variants exert
major influences on human addiction vulnera-
bility. However, existing linkage data for human
dependence on alcohol, nicotine, and a number
of other substances fail to provide any highly re-
producible results that would support a role for
any major gene locus.
These results add to the conclusion that no lo-
cus individually appears to contribute a large
fraction of the vulnerability to dependence on
any addictive substance, with one caveat: these
data come from subjects with largely European
ethnic or racial backgrounds.
Nevertheless, as with many complex human disorders
in which initial hopes for a tractable (e.g.,
oligogenic) underlying genetic architecture sup-
ported the use of linkage approaches, the link-
age peaks that are identified in each individ-
ual study may be more likely to arise on other
bases when the underlying architecture is in
fact polygenic. Apparent linkage signals identi-
fied in single studies might result from polygenic
influences from several genes that happen to lie
near each other on human chromosomes or are
found on stochastic bases when there is no true
major effect from any single gene variant, for example.
Current Models for the Genetic
Architecture of Human Dependence
Current models for the genetic architecture
of human dependence on legal and illegal ad-
dictive substances in the population thus postu-
late that each is affected roughly 50% by poly-
genic influences, that is, by variants in more
than one individual gene, each of which con-
tributes modest amounts to this overall genetic
vulnerability. Such genetic architectural models
posit that many of these genetic vulnerabilities
increase the risk for addiction to several phar-
macologic classes of abused substances, but that
some of these genetic influences are specific to
drugs of one class.
Analyses of twin data for vulnerability to develop
dependence on a substance fit with large,
additive genetic components (a²), large components
for nonshared environmental influences
(e²), and small components for c² terms that
represent familial or other environmental influences
that are shared between members of
the twin pair.
What about the possibil-
ity that large interactions could occur between
these genetic and environmental terms (G × E
interactions), invalidating additive models for
genetic and environmental contributions? G ×
E correlations of three types have been described.
In one terminology, passive G × E
correlation occurs when parents transmit both
genes and environmental influences that are
relevant for a trait, active G × E correlation
occurs where subjects of a certain genotype
actively select environments that are correlated
with that genotype, and reactive G × E
correlation occurs when an individual’s genotype
provides different reactions in response to
the environment. Small values for c², the effects
of common environments shared by members
of sibpairs, appear to provide evidence against
passive G × E correlations. On these bases, active
and reactive G × E correlations remain
of theoretical interest. However, one influential
notion suggests that G × E correlations
are best regarded as parts of the genetic variance
because “the non-random aspects of the
environment are...consequence(s) of the genotype.”
Large interactions between genetic and envi-
ronmental components would probably lead to
(a) differences in estimates of heritability from
samples obtained in different environments and
(b) differences in molecular genetic findings in
individuals from different environments. Data
from studies of twins who were sampled from a
number of different environments are nevertheless
largely convergent. Such convergence supports
relatively modest upper limits on (G × E)
interactions between genetic and environmen-
tal influences on addiction vulnerability. Mod-
est G × E influences are also consistent with
molecular genetic results that identify substan-
tial overlaps between the molecular genetics
of vulnerability to dependence on illegal sub-
stances in samples from substantially different
environments, such as the United States and
Asia (see below).
Gene–gene interactions (G × G) of some
magnitude appear likely, a priori, to make at
least some contributions to addiction vulnerability.
However, in the presence of substantial
epistasis, G × G interactions in which specific
alleles at one gene locus are required for ex-
pression of the effects of allelic variants at a
second gene locus, segregation analysis data
might provide uneven patterns of familiality.
With considerable epistasis, second-degree relatives
(e.g., grandparents, aunts, or uncles) of addicts would be less
likely to display specific combinations of G × G
alleles than first-degree relatives (e.g., siblings).
Substance-dependence rates would thus drop
more precipitously between first- and second-
degree relatives of addicts than they would if
most risk alleles exerted largely independent
effects on addiction vulnerability.
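The expected difference between these two patterns can be illustrated with recurrence risk ratios (the risk to a relative of an affected individual divided by the population risk). This framework is not part of the text above and is introduced here only as a sketch. It assumes no dominance variance, so that the excess risk ratio at each locus halves with each degree of relationship, and it models epistasis in the simplest way, by letting per-locus risk ratios multiply. The function names and the per-locus values are hypothetical.

```python
from math import prod

def lambda_at_degree(lambda_first, degree):
    """Risk ratio for relatives of a given degree when (lambda - 1) halves per degree."""
    return 1.0 + (lambda_first - 1.0) / 2 ** (degree - 1)

def multiplicative_lambda(per_locus_lambda_first, degree):
    """Overall risk ratio when per-locus risk ratios multiply (a simple epistatic model)."""
    return prod(lambda_at_degree(lam, degree) for lam in per_locus_lambda_first)

# Hypothetical first-degree risk ratios for two interacting loci.
loci = [2.0, 2.0]
lam1 = multiplicative_lambda(loci, degree=1)
lam2_epistatic = multiplicative_lambda(loci, degree=2)
# Prediction if loci act independently (additively), given the same first-degree ratio.
lam2_independent = lambda_at_degree(lam1, degree=2)
print(lam1, lam2_epistatic, lam2_independent)
```

Under these assumptions the interacting model predicts a lower risk ratio in second-degree relatives than the independent-effects model does for the same first-degree ratio, which is the steeper decline that the text describes.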
Only limited family data allow for a com-
parison of concordance in first- versus second-
degree relatives. However, the existing evidence
does not support less concordance in second-
degree relatives than we would anticipate based
on the observed concordance in first-degree rel-
atives and the assumption that most risk alleles
produce largely independent effects.
The Genetic Architecture for Substance
Dependence in Individuals
What about the genetic architecture for sub-
stance dependence in individuals? Both between-
locus heterogeneity and within-locus heterogeneity are
likely. If we follow the implications of polygenic
models for addiction vulnerability, we can in-
fer that each dependent individual might dis-
play a nearly distinct set of risk-elevating or
risk-reducing allelic variants. As an illustrative
example, we might postulate that (a) an indi-
vidual must display at least 50 risk alleles to
robustly elevate his or her likelihood of acquir-
ing a substance-dependence disorder and (b)
200 genes contain common allelic variants that
can augment addiction risk. Under such cir-
cumstances, it is easy to see that the exact ge-
netic recipe for addiction vulnerability found in
one addicted individual might be replicated in
only a relatively small number of other addicted
individuals. Such an underlying genetic archi-
tecture would be consistent with the failure of
linkage-based methods to provide reproducible
results in addictions because linkage relies on
the identification of consistent patterns in the
ways that specific DNA markers and pheno-
types move through many families that display
high densities of the disorder.
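The illustrative figures above (50 risk alleles drawn from 200 genes) can be explored numerically. The sketch below makes strong simplifying assumptions that go beyond the text: each gene contributes at most one risk allele, each affected individual carries exactly 50, and all genes are equally likely to contribute. The function name `mean_shared_risk_genes` and the simulation settings are hypothetical.

```python
import math
import random

N_GENES = 200     # genes with common risk variants (figure from the example above)
N_REQUIRED = 50   # risk alleles carried by an affected individual (same example)

# Number of distinct 50-gene combinations that can be drawn from 200 genes.
n_recipes = math.comb(N_GENES, N_REQUIRED)

def mean_shared_risk_genes(n_pairs=2000, seed=0):
    """Average number of risk genes shared by two randomly drawn affected individuals."""
    rng = random.Random(seed)
    genes = range(N_GENES)
    total = 0
    for _ in range(n_pairs):
        first = set(rng.sample(genes, N_REQUIRED))
        second = set(rng.sample(genes, N_REQUIRED))
        total += len(first & second)
    return total / n_pairs

print(f"possible combinations: about {n_recipes:.2e}")
print(f"mean shared risk genes per pair: {mean_shared_risk_genes():.1f}")
```

Under these assumptions the number of possible combinations is on the order of 10^47, and two affected individuals share on average about a quarter of their risk genes (50 × 50 / 200 = 12.5), so exact matches between individuals would be very rare.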
As noted above, the best-documented ge-
netic heterogeneity for addictions comes from
the chromosome 4 major gene effects found
in poorly alcohol-metabolizing (“flushing”)
Asian individuals.
The best-documented
substance-specific influence comes from the
chromosome 15 nicotinic acetylcholinergic
receptor gene cluster. Other examples of
between-locus genetic heterogeneity and of
genes whose variants exert substance-specific
effects on use and/or dependence probably ex-
ist but have yet to be elucidated.
We also postulate that within-locus het-
erogeneity is likely, though not yet clearly
documented in addiction, to our knowledge.
Many common Mendelian disorders and rarer
Mendelian phenocopies of common disorders
display substantial heterogeneity within their
pathogenic loci. For example, a number of vari-
ants of the same CFTR gene produce cys-
tic fibrosis disorders,
and α synuclein mis-
sense variants and copy number variants can
each provide phenocopies of idiopathic Parkinson’s
disease.
Evidence for within-locus het-
erogeneity in complex disorders is just begin-
ning to accrue; such evidence now includes
data from neurexin gene family variants in
Epigenetics and Individual Differences
in Vulnerability to Addiction
and Related Phenotypes
Epigenetics is now used with both classi-
cal and recently revised definitions. Classical
definitions of epigenetic emphasize influences of
variations that are not encoded in the primary
DNA sequence but are nevertheless inherited;
in other words, “a change in the state of ex-
pression of a gene that does not involve a mu-
tation, but that is nevertheless inherited in the
absence of the signal (or event) that initiated the change.”
However, more recent definitions
of epigenetic emphasize gene regulatory mechanisms
that do not alter the primary DNA sequence
and deemphasize the documentation of heritability.
In the context of this review, heritable epi-
genetic influences are most relevant. One ex-
ample of a classical, heritable epigenetic in-
fluence is imprinting, in which information is
conveyed from parent to child through mecha-
nisms that include DNA methylation or histone
acetylation. These mechanisms retain the pri-
mary DNA sequence but can dramatically alter
the function of specific genes. DNA methyla-
tion at CpG sequences in the promoter regions
of genes can profoundly alter gene transcrip-
tion. Because methylation during the course of
maternal oocyte (or paternal sperm) develop-
ment is key to this process, familial patterns
of gender-specific transmission can provide ev-
idence for this subset of heritable epigenetic in-
fluence. The modest quality of current family
data sets for addiction renders them a relatively
weak basis for any strong inferences concerning
parent-of-origin effects. Nevertheless, no segregation
data of which we are aware support
strong parent-of-origin effects on substance de-
pendence. Thus, although there are obvious
and large roles for nonheritable epigenetic in-
fluences in the biology of addiction, no current
compelling evidence exists for any strong ef-
fects of overall heritable epigenetic influences,
as classically defined. We nevertheless need to
be alert for such influences as we unravel the
effects of variants in specific genes.
The Nature of the Allelic Variants Likely
to Contribute to Individual Differences
in Vulnerability to Addiction
and Related Phenotypes
The analytic strategies described below are
based on postulates that common disease or com-
mon allele models hold for many of the variants
that alter vulnerabilities to addiction and re-
lated phenotypes. Rare variants may also ex-
plain significant fractions of the genetic risk
for common diseases. However, increasing ev-
idence supports roles in addiction vulnerabil-
ity for allelic variants that are currently com-
mon and that are thus likely to be old in an
evolutionary sense. Data indicating that such
variants can be identified in diseased individu-
als from European, African, and Asian genetic
backgrounds also point, in general, to variants
of substantial age. How could genetic selection act
on such common functional allelic variants over
the large number of generations that are im-
plied by this substantial age? It is conceivable
that some currently common allelic variants
could exert polygenic influences on addiction
vulnerability without exerting positive or neg-
ative selective effects during lengthy evolution-
ary histories. However, it also seems likely that
many allelic variants that influence addiction
vulnerability can provide balancing selection; in
other words, the effects of these variants may
be favorable in some individuals, organs, or
circumstances and unfavorable in others. Bal-
ancing selection might thus maintain relatively
high frequencies of multiple functional allelic
variants in the population over long periods of time.
The biology of some genes might allow for
common, functional allelic variants that could
escape selective pressures or exert balancing se-
lection over many generations. However, other
genes might not be able to harbor such allelic
variations without engendering selective pres-
sures that would reduce the frequency of one of
the allelic variants in the population over time.
Common allelic variants that are able to influ-
ence addiction vulnerability are thus likely to
be restricted to a subset of genes whose prod-
ucts are involved in addictive processes. An im-
portant consequence of this logic follows: if a
gene fails to display variants that influence vul-
nerability to addiction, the gene’s products are
not at all excluded from involvement in ad-
diction. On the other hand, convincing data
implicating a gene’s common variants in ad-
diction should prompt us to consider mech-
anisms whereby such variants might provide
balancing (i.e., both positive and negative) se-
lective influences in the differing environments
through which the ancestors of current human
populations have passed.
How does this discussion of common disease
or common allele hypotheses relate to the pos-
tulates of genetic heterogeneity noted above?
None of the points in the above discussion
about common alleles and common variants
precludes, or even reduces the likelihood of,
contributions of rarer (or even private) allelic
variants, including those that have arisen more
recently in evolutionary time. Recently arising
variations would be much more likely to persist
for a number of generations in the face of even
moderately negative influences on survival or
fertility. Indeed, based on experience with other
genetic disorders, it may be worthwhile to ac-
tively search for effects of rarer phenocopy vari-
ants in genes that are initially identified based
on common, and evolutionarily older, allelic variants.
A rarer copy number variant might
contribute to addiction vulnerability by altering
the levels of expression of a gene that also has
more common allelic variants that alter expres-
sion via SNPs in other gene elements, for example.
Such considerations support searches
within identified loci for molecular genetic het-
erogeneity relevant to addiction.
In the analyses presented in this review, we
focus on addiction-associated allelic variants
that lie within genes. Evolutionarily old com-
mon haplotypes (i.e., groups of nearby variants
that travel together through generations) that
lie within genes are among the most likely to be
tagged by SNP markers that are represented on
current microarrays. Haplotypes that involve
genes are thus among the most likely variants
to exist in currently reported data sets. It seems
reasonable to postulate that many of these al-
lelic variants that lie within genes provide reg-
ulatory variants that alter expression or regula-
tion. Other variants are likely to alter mRNA
half-lives or mRNA splicing. Variants that al-
ter mRNA splicing could occur at the locus of
the affected gene (cis) or at genes at different
loci that alter generic mRNA splicing processes
(trans). Reproducible association of A2BP1 gene
variants with addiction vulnerability, for example,
provides a good candidate for trans effects
on mRNA splicing because this gene’s product
regulates splicing and thus is likely to modify the
functions of a number of other genes expressed
in the brain. It also seems likely that a minority
of the addiction-associated variants will involve
missense effects on expressed proteins.
It also seems likely that many addiction-
associated variants will lie outside of genes, at
least as we currently understand them. Loci re-
producibly associated with diabetes and body
mass, for example, lack the conventional hall-
marks of genes, such as expressed sequences.
Although the analyses in this review focus on
the identification of variants within genes, we
should also remain alert for the roles of inter-
genic variations in chromosomal regions that
lie between currently understood genes.
Samples for Genome Studies of
Human Addiction Vulnerabilities
and Related Phenotypes
Sample 1: European American
Polysubstance Abusers and Controls
European Americans volunteered to be sub-
jects in research conducted at the National In-
stitutes of Health (NIH) National Institute on
Drug Abuse (NIDA) Intramural Research Pro-
gram (IRP) in Baltimore, Maryland, based on
word-of-mouth referrals and newspaper adver-
tisements. Volunteers self-reported their eth-
nicities, provided drug-use histories, and pro-
vided Diagnostic and Statistical Manual (DSM)
diagnoses of substance-use disorders.
“Abusers” displayed heavy lifetime use of ille-
gal substances and dependence on at least
one illegal substance. “Controls” displayed nei-
ther abuse nor dependence on any addictive
substance and reported no significant lifetime
histories of use of any addictive substance. In-
dividuals with intermediate levels of lifetime
substance use without dependence were thus
not included in analyses of substance depen-
dence, although a number of them were in-
cluded among samples studied for cognitive
abilities (see Samples 15 and 16, below). Con-
trol individuals thus combined those who had
no lifetime experience with any addictive sub-
stance with those who had modest to moderate
exposures to legal addictive substances.
Figure 2. Venn diagram of overlapping genetic
contributions to several of the phenotypes discussed
here based on GWA data sets of about 500,000
SNPs. Note that the area of overlap in the figure
does not necessarily represent the extent of overlap
in the data sets. See text for more details.
Sample 2: African American
Polysubstance Abusers and Controls
African Americans who volunteered for re-
search at the NIH IRP (NIDA) were also char-
acterized and separated into “abusers,” “con-
trols,” and individuals with intermediate levels
of lifetime substance use, as noted above.
Samples 1 and 2: Efficacies of Recruiting
Subjects for Association Studies and
Comparison with Recruitment for
Linkage-Based Studies
Limited currently available data document
the features of drug abuse research volunteers
who might consent to participate in molecu-
lar genetic studies for linkage or association.
We thus describe the details of the recruitment
of subjects for these studies at the NIH (IRP)
facility in Baltimore, MD, during a 30-month
period (J. H., G. R. Uhl et al., unpublished ob-
servations). During this period, 13,969 individ-
uals were screened by telephone and 2633 were
interviewed in person for all (genetic and non-
genetic) studies at this research facility. This
group included 68% African American, 29%
European American, and 1% Hispanic indi-
viduals; 72% were men and their average age
was 35 years.
Six hundred thirteen unrelated proband in-
dividuals from the group of interviewed sub-
jects were offered participation in this genetic
study, based on the availability of screening
resources. No individual who was offered par-
ticipation during this time period refused to
participate. The individuals who accepted par-
ticipation had an average age of 34 years
and 72% were men; this group included 72%
African American and 26% European Ameri-
can individuals as well as 2% of other ethnic-
ities, based on self-report. Subjects accepting
research participation in this study thus appear
representative of this overall research volunteer
population in this area. They also share some
characteristics of the drug-abusing population
in Baltimore, based on population trends and
1981 data from the Baltimore Epidemiological
Catchment Area site, which identified 57% of
the men and 41% of the women who displayed
substance abuse and/or dependence as Euro-
pean American (J. Anthony, personal commu-
nication, 1998).
Each volunteering subject was offered three
choices concerning family member contacts.
One hundred twelve probands (18%) provided
Type I consents, which allowed investigators to
contact their family members; 73% of these in-
dividuals were men, 66% were African Amer-
ican, 30% were European American, and 3%
self-reported other ethnicities. Three hundred
twelve probands (51%) gave Type II consents,
stating that they would contact their family
members; 72% of these individuals were men,
74% were African American, 23% were Euro-
pean American, and 3% reported other ethnic-
ities. One hundred eighty-nine (31%) refused
family member contacts; 72% of these individ-
uals were men, 74% were African American,
23% were European American, and 3% re-
ported other ethnicities.
For 33% of the pedigrees for which the
proband had provided Type I consents, at least
one member could be reached by telephone
or mail. At least one member of 12.5% of the
pedigrees for which the proband had provided
Type II consent called an investigator and kept
an appointment for study participation.
Of the 79 pedigrees from which family mem-
bers made and kept appointments for study
participation over this 30-month time period,
75% had African American probands and 69%
had male probands. The sizes of the potentially
accessible sibships in the pedigrees of these in-
dividuals, as described by the probands, were:
1 (22%), 2 (23%), 3 (13%), 4 (11%), 5 (12%), 6
(9%), 7 (7%), 8 (1%), and 11 (2%). The num-
bers of accessible parents from these pedigrees
were 0 (10%), 1 (28%), and 2 (62%). Average
families for which more than the proband were
accessible thus had 1.5 accessible parents and
about 3.5 accessible siblings.
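The averages quoted above follow from the reported distributions. A minimal sketch of that arithmetic is given here; the percentages are transcribed from the text, and the helper name weighted_mean is hypothetical.

```python
# Reported distributions (value -> percent of pedigrees), transcribed from the text.
SIBSHIP_PCT = {1: 22, 2: 23, 3: 13, 4: 11, 5: 12, 6: 9, 7: 7, 8: 1, 11: 2}
PARENTS_PCT = {0: 10, 1: 28, 2: 62}


def weighted_mean(pct_by_value):
    """Mean of a discrete distribution given as {value: percent}."""
    total = sum(pct_by_value.values())
    return sum(value * pct for value, pct in pct_by_value.items()) / total


mean_sibs = weighted_mean(SIBSHIP_PCT)
mean_parents = weighted_mean(PARENTS_PCT)
print(f"accessible siblings per pedigree: {mean_sibs:.2f}")
print(f"accessible parents per pedigree:  {mean_parents:.2f}")
```

Both values land close to the rounded figures of about 3.5 siblings and 1.5 parents given in the text.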
Over this 30-month time period, DNA
and clinical information were collected from
two-thirds of the members of the average pedi-
gree from which any member came for an
appointment. More complete sampling was ob-
tained from smaller than from larger pedi-
grees. Of the 79 pedigrees for which DNA
and clinical information were obtained, 54
had 2 members, 15 had 3 members, 7 had
4 members, and 3 had 5 members. By the
end of this period, DNA and clinical informa-
tion were successfully collected from 2.5 mem-
bers of the average pedigree. Fourteen of these
pedigrees contained same-gender siblings, and
15 contained opposite-gender siblings within
5 years of the age of the proband who were
discordant for drug abuse phenotype. Eigh-
teen siblings were within 5 years of age of the
proband and concordant for substance abuse
phenotype. Thirteen siblings differed from the
probands in age by more than 5 years and
were discordant for phenotype, whereas four
were concordant.
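The figure of 2.5 sampled members per pedigree can be checked against the pedigree counts reported in this paragraph. The short sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Number of pedigrees by count of members sampled, transcribed from the text.
PEDIGREES_BY_SIZE = {2: 54, 3: 15, 4: 7, 5: 3}

n_pedigrees = sum(PEDIGREES_BY_SIZE.values())
n_members = sum(size * count for size, count in PEDIGREES_BY_SIZE.items())
mean_members = n_members / n_pedigrees
print(n_pedigrees, n_members, round(mean_members, 2))
```

The counts sum to the 79 pedigrees described, and the mean rounds to the 2.5 members quoted.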
Reliability of pedigree structure reports was
assessed by comparing family structure in-
formation provided by the proband and an-
other first-degree relative informant from these
79 families. Parents agreed (completely) with
100% of the proband’s pedigree assignments,
whereas siblings agreed (completely) with 70%
of all possible pedigrees. Disagreements were
largely a result of differential reporting of half
versus full siblings.
Reliability of information about drug histo-
ries was assessed by comparing drug use survey
(DUS) and family history/research diagnostic
criteria (FH/RDC) estimates of drug use by
the proband with FH/RDC estimations from
first-degree relatives. Seventy-three percent of
parental evaluations agreed (completely) with
the proband’s evaluations of all pedigree mem-
bers’ status with respect to all abused sub-
stances. Eighty-one percent of sibling evalu-
ations were concordant (completely) with the
proband’s evaluations of his or her own drug
use. Differences were most prominent in both
cases as a result of family member underesti-
mation and/or underreporting of offspring or
sibling drug use.
An estimate of the validity of DUS quan-
tity/frequency estimates was obtained by com-
paring DUS ratings for substances with crite-
ria identified as strongly heritable in work by
Tsuang and colleagues, based on subjects’
reports that the substances were “never used,”
“used fewer than 5 times,” or “used more than
5 times.” All individuals who denied using al-
cohol, cocaine, heroin, cannabis, or nicotine
when questioned as part of the addiction sever-
ity index screening received “0” scores on the
DUS for the appropriate substance. All but two
of the individuals who scored 2+ or 3+ on the
DUS reported use more than five times with
a separate instrument on a separate occasion.
One individual who reported use of a substance
five or fewer times during screening obtained a
2+ and one obtained a 3+ score on the DUS
for cannabis. Individuals who reported one to
five lifetime uses during screening with the ad-
diction severity index received 1+ DUS ratings
on several occasions. The percentage of indi-
viduals who were rated as 1+ on DUS and who
reported use of substances between one and five
times on a different scale was 7%, 12%, 24%,
32%, and 39% for alcohol, nicotine, cocaine,
cannabis, and heroin, respectively.
Although it is encouraging that none of the
individuals who were offered participation in
this study refused to participate themselves,
only one-fifth of the probands agreed, during
initial evaluations, to provide permission for
research staff to directly contact other mem-
bers of their families. Individuals who agreed
to contact their own family members did so at
rates substantially lower than the contact rates
achieved when family contacts were initiated by
research staff members. Such differential par-
ticipation may well provide occult differences
between samples collected for association ver-
sus linkage studies.
It may also be important to consider that
these subjects consented to studies in which
genotyping was performed using pooling ap-
proaches that provide maximal protection from
research risks (see below). Samples collected
in studies that propose to conduct unlimited
high-density individual genotyping in any num-
ber of different laboratories might well expe-
rience different consent rates from some sub-
groups of participants, providing an additional
confounding factor.
Samples 1 and 2: Genotyping
The primary data reviewed here are based
on assessments of allele frequencies in mul-
tiple DNA pools, each of which contained
equal amounts of DNA from 20 individuals of
the same racial and phenotype group. Each
DNA pool was assessed on four sets of four
arrays, two from 100K and two from 500K
Affymetrix microarray sets. We also document
some of the features of assessments using 1M
SNP Affymetrix 6.0 arrays.
The 600K methods used for these samples
revealed correlations (r = 0.95) between pooled
and individual genotyping in extensive valida-
tion studies (see below). Much of this variation
(1%–2%) can be attributed to the variations
in pipetting and DNA quantification required
for pool construction (T. D. and D. W., unpub-
lished observations, 2002–2006). The remain-
der of the variance is reduced for arrays used
at the end of a study, in comparison to validat-
ing experiments that use data from some of the
first arrays of the type that were studied within
this laboratory. The correlation (r = 0.95) thus
probably reflects an upper limit of the variance
noted in actual disease versus control compar-
isons. Even higher correlations between indi-
vidual genotyping and pooled genotyping data
sets can be revealed by 1M SNP assays (Liu
et al., in preparation).
Data for each SNP provide a score that
results in a continuous measure of the per-
centage of its two alleles in the DNA from
the hetero- and homozygous individuals rep-
resented in each pool. For SNPs that displayed
nominally significant differences between ad-
dicted and control individuals in 600K studies,
the correlation between the magnitude of the
differences in pooled versus individual geno-
typing approaches was r ≈ 0.9. Variance from
array-to-array (assessing the same DNA pool)
and variance from pool-to-pool were mod-
est, around 3%, suggesting that the validity of
these pooling data is good. Results from 1M
SNP assays provide even more modest pool-to-
pool and array-to-array variances (Liu et al., in
preparation).
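As a rough illustration of how measurement noise limits the agreement between pooled and individual genotyping, the sketch below simulates pools of 20 individuals and correlates the true pool allele frequency with a noisy pooled estimate. It is not the validation procedure used for these samples: the Gaussian noise model, the noise_sd values, and the function names are all assumptions made for the example.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_pool_vs_individual(n_snps=2000, pool_size=20, noise_sd=0.03, seed=0):
    """Correlate true pool allele frequencies with noisy pooled estimates.

    noise_sd is a hypothetical stand-in for pipetting, quantification, and
    array error; it is not a value taken from the validation studies.
    """
    rng = random.Random(seed)
    individual, pooled = [], []
    chromosomes = 2 * pool_size
    for _ in range(n_snps):
        p = rng.uniform(0.05, 0.95)  # assumed population allele frequency
        count = sum(rng.random() < p for _ in range(chromosomes))
        true_freq = count / chromosomes  # what individual genotyping would give
        estimate = true_freq + rng.gauss(0.0, noise_sd)
        individual.append(true_freq)
        pooled.append(min(1.0, max(0.0, estimate)))  # keep frequencies in [0, 1]
    return pearson_r(individual, pooled)


r_low_noise = simulate_pool_vs_individual(noise_sd=0.03)
r_high_noise = simulate_pool_vs_individual(noise_sd=0.15)
print(f"r with noise_sd=0.03: {r_low_noise:.3f}")
print(f"r with noise_sd=0.15: {r_high_noise:.3f}")
```

Under these assumptions the correlation falls as the noise term grows, which is the qualitative pattern the text describes when it attributes part of the pooled-versus-individual discrepancy to pool construction.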
Sample 3: COGA European American
Alcohol-Dependent Subjects
and Controls
Unrelated individuals sampled from pedi-
grees collected by the Collaborative Study on
the Genetics of Alcoholism (COGA) provide an
interesting sample for several reasons. Depen-
dence on alcohol and other substances has been
carefully characterized in these individuals us-
ing validated instruments. Unrelated control in-
dividuals free from substance abuse or depen-
dence diagnoses, largely individuals who marry
into these pedigrees, are available. We thus
identified 120 unrelated alcohol-dependent in-
dividuals and 160 unrelated unaffected con-
trols with self-reported European American
ancestry.
Sample 3: Genotyping
Allele frequencies were assessed in 14 DNA
pools, each containing equal amounts of DNA
from 20 individuals of the same phenotype
group, using four sets of four arrays, two from
100K and two from 500K Affymetrix SNP ar-
rays, using approaches that were extensively
validated as noted above.
Sample 4: Taiwanese
Subjects and Controls
Unrelated subjects recruited in Taipei in-
cluded 140 methamphetamine-dependent in-
dividuals independently diagnosed by each of
two psychiatrists using DSM–IV criteria
and 240 matched Han Chinese controls who
denied any history of use of illegal drugs and de-
nied any history of psychotic symptoms. Thirty
percent of the subjects were women and their
average age was 32.5 ± 10 years. Depen-
dent individuals reported methamphetamine
use more than 20 times per year or de-
scribed well-documented methamphetamine
psychosis with lower levels of regular use. They
denied histories of psychosis either prior to
methamphetamine use or in relation to other
psychedelic drugs. Most reported use of at least
one other addictive substance. Controls denied
illegal drug use or psychotic symptoms and
were matched for gender and age.
Sample 5: JGIDA Japanese
Subjects and Controls
Twenty-one percent of the Japanese sub-
jects were women and their average age was
40 years. One hundred methamphetamine-
dependent subjects were inpatients or outpa-
tients of psychiatric hospitals in the regions
that participate in the Japanese Genetics Ini-
tiative for Drug Abuse (JGIDA). These subjects
met ICD-10-DCR criteria F15.2 and F15.5
for methamphetamine dependence in indepen-
dent diagnoses made by each of two trained
psychiatrists based on interviews and review
of records. Ninety-one percent revealed his-
tories of methamphetamine psychosis, 89%
used methamphetamine intravenously, 62%
also abused organic solvents, and most abused
at least one other substance. Subjects who
displayed clinical diagnoses of schizophrenia,
other psychotic disorders, or organic mental
syndromes were excluded. Control subjects in-
cluded 100 age-, gender-, and geographically
matched staff recruited at the same institutions
who denied use of any illegal substance, abuse
of or dependence on any legal substance, any
psychotic psychiatric illness, or any family his-
tory of substance dependence or psychotic psy-
chiatric illness during interviews with trained
psychiatrists.
Samples 4 and 5: Genotyping
We assessed allele frequencies for metham-
phetamine-dependent and control subjects in
DNA pools, each containing equal amounts of
DNA from 20 individuals of the same pheno-
type group, on four sets of arrays, two arrays
from 100K and two from 500K Affymetrix sets
(Sample 5) or two arrays from 500K Affymetrix
sets (Sample 4).
Sample 6: Australian and U.S.
Dependent versus Nondependent
Smokers of European Ancestry
Dependent smokers of European ances-
try were diagnosed using the Fagerström
Test for Nicotine Dependence criteria (FTND
score ≥ 4) and were compared with smok-
ers who did not display dependence (FTND
scores = 0). About one-quarter of the indi-
viduals who displayed dependence by FTND
standards did not display DSM–IV nicotine de-
pendence. About one-quarter of the individuals
used in this study as controls did display DSM
nicotine dependence.
Sample 6: Genotyping
Allele frequency data were assessed in 16
DNA pools, each containing equal amounts of
DNA from 60 individuals using a single set of
49 arrays that assessed 2,427,354 SNPs. Meth-
ods used for these samples revealed overall cor-
relations (r = 0.85) between pooled and indi-
vidual genotyping for all SNPs. However, much
more modest correlations (r = 0.58) were found
between dependent versus nondependent data
derived from pooled versus individual genotyp-
ing for the SNPs that displayed nominally sig-
nificant differences. Individual genotyping fol-
lowed up nominally positive results for 39,213
SNPs in 1050 dependent and 879 nondepen-
dent smokers. The convergence analyses pre-
sented here use data from the subset of these
39,213 SNPs that lie within genes.
Sample 7: WTCCC Subjects with Bipolar
Disorder and Controls
GWA for bipolar disorder compared con-
trols to 1868 U.K. individuals of European de-
scent with bipolar disorders from the Wellcome
Trust Case Control Consortium (WTCCC).
Bipolar mood disorders were diagnosed using
Research Diagnostic Criteria. Uncharacterized
control samples included (a) 1480 individuals
from a 1958 birth cohort sample, (b) 1458 in-
dividuals from a UK Blood Service sample of
consenting blood donors, and (c) individuals
with disease phenotypes whose genetics were
deemed unlikely to overlap with the genetics of
bipolar disorder.
Sample 7: Genotyping
Genotyping for the 436,604 autosomal SNPs
analyzed was performed using Affymetrix
500K arrays with allele calls made by a CHI-
AMO algorithm with an a posteriori probability
threshold of at least 0.9.
Of the 469,557 SNPs
assessed, 436,604 could be assigned confident
chromosomal localizations. A P value for each
SNP was determined based on its χ² test for
significance of allele frequency differences in
bipolar versus control (i.e., control plus other
disease) subjects.
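A χ² test of allele frequency differences can be sketched as a Pearson test on a 2 × 2 table of allele counts (cases versus controls, allele A versus allele B). The code below is a generic illustration with hypothetical counts and a hypothetical function name, not the WTCCC analysis pipeline; it assumes both alleles are observed, so that no expected count is zero.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) and P value for a 2x2 table of allele counts."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    cells = (
        (case_a, row_case, col_a),
        (case_b, row_case, col_b),
        (ctrl_a, row_ctrl, col_a),
        (ctrl_b, row_ctrl, col_b),
    )
    chi2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: 60/40 in cases versus 40/60 in controls.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, P = {p_value:.4f}")
```

The P value uses the identity that a χ² variable with one degree of freedom is the square of a standard normal variable, which avoids any dependency beyond the standard library.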
Sample 8: NIMH Subjects with Bipolar
Disorder and Controls
GWA was assessed in controls compared
with 461 unrelated bipolar I probands of
self-reported European American ancestry
who were selected from families that in-
cluded at least one affected sibling pair
who participated in the National Institute
of Mental Health (NIMH) Genetics Initia-
tive. Probands were
assigned a “confident” diagnosis of DSM–IV
bipolar I disorder by each of two trained clini-
cians. Five hundred sixty-three unrelated con-
trol individuals of European American ancestry
who failed to display evidence for DSM–IV cri-
teria for major depression, any history of bipo-
lar disorder, or any history of psychosis were
recruited by a marketing firm.
Sample 9: German Subjects with Bipolar
Disorder and Controls
GWA for 536,288 autosomal SNPs was per-
formed in controls compared with 772 bipolar
I patients diagnosed using DSM–IV criteria
who were recruited from consecutive hos-
pital admissions.
Eight hundred seventy-six
population-based controls were randomly re-
cruited; individuals with histories of affective
disorder or schizophrenia were excluded.
Samples 8 and 9: Genotyping
For genotyping, NIMH samples (Sample 8)
were divided into seven bipolar and nine con-
trol pools of 50–80 subjects per pool. German
samples (Sample 9) were divided into 13 bipo-
lar and 10 control pools of 42–60 subjects per
pool. SNP allelic distributions were assessed us-
ing duplicate Illumina HumanHap550 assays
(Illumina Inc., La Jolla, CA, USA). Normal-
ized allele frequencies were calculated from raw
intensity data averaged across duplicate pools
to obtain a relative allele frequency estimate for
each SNP in each pool. SNPs with allele fre-
quencies that displayed greater than 2% vari-
ance between replicate pools were excluded.
Pool-to-pool variation within phenotypes was
compared with phenotype-to-phenotype differ-
ences using t-tests.
Sample 10: Unrelated Members
of NHLBI Twin Pairs for Assessment
of Frontal Brain Volume
Two hundred forty-two unrelated individu-
als were selected randomly from members of
twin pairs from a population-based registry of
European American male World War II vet-
eran twin pairs who received volumetric MRI
studies as part of the National Heart, Lung, and
Blood Institute (NHLBI) Twin Study. When
studies were performed, subject age averaged
72.6 years, and they reported an average of
13.6 years of education. Frontal lobar volumes,
corrected for intracranial volumes, were ob-
tained in ways that produced interrater reli-
abilities greater than 0.90.
Sample 10: Genotyping
For genotyping, DNA samples were carefully
quantitated and combined into 12 pools that
each represented about 20 subjects based on
estimates for frontal brain volume corrected for
total cranial volume. We thus constructed four
DNA pools from individuals with the lowest
estimates of total frontal brain volumes, four
pools from individuals with intermediate vol-
umes, and four pools from individuals with the
highest estimates of total frontal brain volumes.
We subjected these DNA pools to Affymetrix
500K genotyping as noted above and used t-
tests to compare differences between the high-
est and lowest brain volume groups. SNPs that
displayed nominally significant t values and that
also displayed rank order of allelic frequencies
with either highest tercile > intermediate ter-
cile > lowest tercile or the converse are included
in these analyses.
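The two inclusion criteria described above (a nominally significant t value for highest versus lowest terciles, plus a consistent rank order across all three terciles) can be sketched as a simple filter. The function names and pool values below are hypothetical, and the default critical value assumes an equal-variance t-test on four versus four pools (6 degrees of freedom, two-sided 5% level), which the text does not specify.

```python
T_CRITICAL_6DF = 2.447  # two-sided 5% critical value of Student's t with 6 df


def mean(values):
    return sum(values) / len(values)


def is_rank_ordered(low, mid, high):
    """True when tercile means rise or fall strictly from lowest to highest."""
    return low < mid < high or low > mid > high


def keep_snp(low_pools, mid_pools, high_pools, t_value, t_critical=T_CRITICAL_6DF):
    """Apply both filters: nominal significance and consistent rank order."""
    nominally_significant = abs(t_value) >= t_critical
    ordered = is_rank_ordered(mean(low_pools), mean(mid_pools), mean(high_pools))
    return nominally_significant and ordered


# Hypothetical allele frequencies for the four pools in each tercile.
low = [0.30, 0.31, 0.29, 0.30]
mid = [0.35, 0.34, 0.36, 0.35]
high = [0.40, 0.41, 0.39, 0.40]
print(keep_snp(low, mid, high, t_value=20.0))
```

Requiring the intermediate tercile to fall between the extremes guards against SNPs whose high-versus-low difference is not part of a graded relationship with frontal volume.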
Sample 11: Framingham Study
Participants for Assessment
of Frontal Brain Volume
Subjects were 705 stroke- and dementia-free
participants in the Framingham study. The av-
erage age of subjects was 62 ± 9 years, and
50% were men. Subjects received volumetric
brain MRI studies that were analyzed as noted
above for the NHLBI subjects.
Sample 11: Genotyping
Genotyping provided data from 70,987 au-
tosomal SNPs using Affymetrix 100K arrays.
Allele frequencies for SNPs that displayed (a)
minor allele frequencies ≥ 0.10, (b) genotype
success ≥ 0.80, and (c) Hardy–Weinberg equi-
librium P ≥ 0.001 were used. A generalized
estimating equation provided corrections for
familial relatedness and other covariates.
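The three SNP quality filters can be illustrated with a short sketch that reads each threshold as a minimum value. The Hardy–Weinberg test shown is a plain 1-df χ² goodness-of-fit test; whether the Framingham analysis used this or an exact test is not stated here, and the function names and genotype counts are hypothetical.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_attempted,
              min_maf=0.10, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply minor allele frequency, genotype success, and HWE filters."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    if min(freq_a, 1.0 - freq_a) < min_maf:
        return False
    if n_called / n_attempted < min_call_rate:
        return False
    return hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p


print(passes_qc(25, 50, 25, n_attempted=100))  # common SNP in equilibrium
print(passes_qc(50, 0, 50, n_attempted=100))   # no heterozygotes at all
```

Checking the minor allele frequency first means monomorphic SNPs never reach the Hardy–Weinberg test.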
Sample 12: European American
Smokers Who Successfully versus
Unsuccessfully Quit Smoking
in Trials in Philadelphia, PA,
Washington, DC, and Buffalo, NY
European American smokers who success-
fully versus unsuccessfully quit smoking in tri-
als in Philadelphia, Pennsylvania, Washing-
ton D.C., and Buffalo, New York responded
to advertising and physician referrals for
help in smoking cessation.
Subjects aged
18–65 enrolled in randomized clinical trials
for smoking cessation accompanied by stan-
dardized behavioral counseling that used a
blinded, placebo-controlled trial of bupropion
(300 mg/day) or matching placebo for 10 weeks
or an open-label trial of nicotine nasal spray
versus nicotine patch for 8 weeks.
One hun-
dred twenty-six individuals with biochemically
confirmed abstinence for at least the 7 days
prior to assessments at both 8 weeks and
24 weeks were contrasted with 140 unsuccessful
quitters who were not abstinent at either time
point.
Sample 13: European American
Smokers Who Successfully versus
Unsuccessfully Quit Smoking
in Trials in North Carolina
Participants received either active nicotine
(21 mg/day) or placebo skin patches for two
weeks before the targeted quit date as well
as mecamylamine (10 mg/day p.o.) prior to
the target quit-smoking date.
After the quit
date, participants were randomly assigned to
mecamylamine (10 mg/day) versus matching
placebo, and to 21 mg/24 h versus 42 mg/24 h
nicotine skin patch doses. Fifty-five individu-
als reported continuous abstinence from smok-
ing when assessed 6 weeks after the quit date
with biochemical confirmation; 79 were not abstinent.
Sample 14: European American
Smokers Who Successfully versus
Unsuccessfully Quit Smoking
in Trials in Rhode Island
Participants engaged in a 10-week, double-
blind, placebo-controlled trial of placebo or
bupropion (150 mg/day for the first 3 days,
then 300 mg/day) with a target quit date 1 week
following initiation of drug or placebo.
Individuals with biochemically confirmed ab-
stinence for at least the 7 days prior to the end
of treatment and at a 24-week assessment were
contrasted with 90 unsuccessful quitters who
were not abstinent at either time point.
Samples 12–14: Genotyping
Genotyping for samples 12–14 used
Affymetrix 500K arrays and multiple pools of
DNA samples (n = 16 to 20), as noted above.
Sample 15: African American Individuals
with Different Levels of General
Cognitive Ability
Research volunteers with a variety of differ-
ent levels of lifetime use of addictive substances
volunteered for research protocols at the NIH
(NIDA) facility in Baltimore, Maryland, as de-
scribed above. Eighteen pools were constructed
with DNA from 20 individuals each; 33% of the
subjects were women, and the average age was
32.1 years (range: 18–65 years). Mean cogni-
tive function scores estimated from the Ship-
ley Institute of Living scales ranged from “IQ”
equivalents of 75.9 to 109.2 for the individuals
in these DNA pools.
Sample 16: African American Individuals
with Different Levels of General
Cognitive Ability
Eleven pools were constructed with DNA
from 16 individuals each; 34% of subjects were
women, and the average age was 31.8 years
(range: 18–65 years). Mean estimated “IQ”
scores from the Shipley Institute of Living scales
ranged from 79.1 to 112.3 for the individuals
represented in these DNA pools.
Samples 15 and 16: Genotyping
Genotyping for samples 15–16 used
Affymetrix 500K arrays and pools of DNA
samples (n = 16 to 20), as noted above. The
nominal significance of the correlations be-
tween pool-to-pool differences in assessments
of allele frequency and pool-to-pool differences
in Shipley scores was assessed for each SNP.
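The per-SNP assessment described above can be sketched as a Pearson correlation, across pools, between the estimated allele frequency and the mean Shipley score, followed by the usual t statistic for a correlation with n - 2 degrees of freedom. The pool values and function names are hypothetical, and the use of a t statistic for nominal significance is an assumption made for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n_pools):
    """t statistic for testing r = 0; undefined when |r| equals 1."""
    return r * math.sqrt((n_pools - 2) / (1.0 - r * r))


# Hypothetical data for one SNP: allele frequency and mean score in six pools.
allele_freq = [0.21, 0.24, 0.26, 0.25, 0.30, 0.33]
shipley_iq = [78.0, 85.5, 90.1, 95.3, 101.7, 108.4]

r = pearson_r(allele_freq, shipley_iq)
t = correlation_t(r, len(allele_freq))
print(f"r = {r:.3f}, t = {t:.2f} on {len(allele_freq) - 2} df")
```

With 11 to 18 pools per sample, each SNP contributes only that many data points, so individual correlations are best read as nominal screening statistics.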
Sample 17: Subjects with Alzheimer’s
Disease versus Controls: Brain Donors
Subjects were 1086 brain donors who were
at least 65 years old at death (with a mean
age at death of 82 years); 43% of subjects were
men. Brains and clinical data met patholog-
ical criteria for Alzheimer’s disease (AD) or
control status. DNA samples were subjected
to Affymetrix 500K genotyping. From the files
TGEN_WGA_DATA_ recode_ped.txt,
genotype calls for 552 control individuals and 859
individuals with AD allowed us to (a) calculate
values for the AD versus control differences
for each SNP, (b) select SNPs that displayed
values with P < 0.05 as nominally positive,
and (c) assess which of these nominally positive
SNPs fell into chromosomal clusters such that
at least three nominally positive SNPs repre-
senting both array types lie no more than 25 kb
from each other.
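The clustering criterion in step (c) can be sketched as follows. This example assumes one particular reading of the rule: nominally positive SNPs on the same chromosome are chained together while each lies within 25 kb of the previous one, and a chain counts as a cluster when it has at least three SNPs drawn from both array types. The function names, array labels, and positions are hypothetical.

```python
def _qualifies(cluster, min_snps):
    """A cluster needs enough SNPs and SNPs from at least two array types."""
    return len(cluster) >= min_snps and len({snp[2] for snp in cluster}) >= 2


def find_clusters(positive_snps, max_gap_bp=25_000, min_snps=3):
    """Group (chromosome, position_bp, array_type) tuples into clusters."""
    clusters, current = [], []
    for snp in sorted(positive_snps):
        if current:
            same_chrom = snp[0] == current[-1][0]
            close = snp[1] - current[-1][1] <= max_gap_bp
            if not (same_chrom and close):
                if _qualifies(current, min_snps):
                    clusters.append(current)
                current = []
        current.append(snp)
    if _qualifies(current, min_snps):
        clusters.append(current)
    return clusters


# Hypothetical nominally positive SNPs.
snps = [
    ("1", 1_000, "Nsp"), ("1", 20_000, "Sty"), ("1", 40_000, "Nsp"),
    ("1", 200_000, "Nsp"),
    ("2", 5_000, "Nsp"), ("2", 6_000, "Nsp"), ("2", 7_000, "Nsp"),
]
clusters = find_clusters(snps)
print(len(clusters), clusters)
```

In this example only the first three chromosome 1 SNPs form a cluster; the chromosome 2 group is dense enough but comes from a single array type.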
Sample 18: Subjects with Alzheimer’s
Disease versus Controls:
Memory Clinic Participants
Seven hundred fifty-three individuals with
AD and 736 control subjects with European
ancestry were recruited in Canadian memory
clinics. Probable AD was diagnosed by clinical
criteria, and controls were selected who dis-
played no histories of memory impairment or
any impairment on neuropsychological tests.
DNA samples were subjected to Affymetrix
500K genotyping. From files available through
the GlaxoSmithKline Clinical Trial Register
(available at
observational/studylist.asp), P values
for each SNP, derived from Fisher’s exact tests,
were extracted and data were analyzed as de-
scribed above.
Sample 19: Individuals with Scores
on Tests of Neuroticism
One thousand thirty-eight individuals from
southwestern England sites with European
backgrounds and with high neuroticism (N)
scores on the revised Eysenck Personality Ques-
tionnaire and 1016 individuals with low N
scores were studied; 63% of these subjects were
women. A replication sample (61% female) in-
cluded 831 high versus 702 low N individ-
uals. Genotyping of eight pools of DNA
from mouth swabs compared (a) men with
high N scores (n = 112), (b) men with low
N scores (n = 158), (c) men with very high
N scores (n = 245), (d) men with very low N
scores (n = 238), (e) women with high N scores
(n = 320), (f) women with low N scores (n = 205),
(g) women with very high N scores (n = 340),
and (h) women with very low N scores (n = 436).
Very high or low N scores were defined as more
than 1.5 SD from the mean score adjusted to
age and sex (on average, 2 SD), whereas high
and low N scores were between 1 and 1.5 SD
from the mean score (on average, 1.3 SD). Data
for relative allele score from the 452,574 SNPs
from 100 and 500K Affymetrix arrays with mi-
nor allele frequencies above 5% were obtained
from five replicate arrays used to assess each
pool.
Selected Methodological Issues
Genotyping Using Pooled DNA
Genotyping for GWA studies can be per-
formed in either individual samples or in pools
of DNA from individuals with the same racial
or ethnic background and the same pheno-
type. Pooling strategies have several advan-
tages. Pooling fits well with association genetics,
can allow for efficient allele typing, can preserve
confidentiality, and can reduce costs.
Not all pooling strategies are alike. Single-pool
strategies seek differences in allele frequencies
by comparing data between a single pool of
DNA from diseased individuals and a single
pool of DNA from control individuals. Such
results generate hypotheses. However, such de-
signs, and related designs with very few pools,
provide little ability to differentiate between
(a) the variability between disease and control
samples and (b) the variability within disease
samples or within control samples.
We focus here instead on multiple-pool strate-
gies. With careful attention to a large number of
small details, these approaches can provide ac-
curate allele typing. Multiple-pool approaches
provide estimates of (a) differences between dis-
ease and control samples, (b) variability within
disease samples, and (c) variability within con-
trol samples. Assessment of the differences
between disease and control in the context of as-
sessments of the variability within disease sam-
ples and within control samples allows us to
use standard statistical approaches to assess the
significance of the results.
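As a minimal sketch of why multiple pools permit standard statistics, the following Python computes a Welch t statistic for a single SNP from pool-level relative allele frequencies. The pool values, the function name `welch_t`, and the choice of the unequal-variance (Welch) form are illustrative assumptions, not necessarily the exact procedure used in the studies reviewed here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(disease_pools, control_pools):
    """Return (t, df) comparing pool-level allele frequencies for one SNP."""
    n1, n2 = len(disease_pools), len(control_pools)
    # Pool-to-pool (within-group) sample variances.
    v1, v2 = variance(disease_pools), variance(control_pools)
    se2 = v1 / n1 + v2 / n2
    t = (mean(disease_pools) - mean(control_pools)) / sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical relative allele frequencies, one value per DNA pool.
disease = [0.42, 0.45, 0.44, 0.47, 0.43]
control = [0.38, 0.40, 0.37, 0.41, 0.39]
t_stat, dof = welch_t(disease, control)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With only one pool per group, `statistics.variance` raises an error because no within-group variance can be estimated, which mirrors the limitation of single-pool designs described above.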
Here, we provide assessment of several of
the steps necessary to validate and characterize
features of the power and sensitivity of multiple-
pool genome-scanning strategies. Used with
high densities of genomic markers, carefully
performed multiple-pool studies can provide
increased study feasibility and preserve virtu-
ally absolute genetic confidentiality with only
modest effects on the sensitivity and specificity
of GWA.
(a) DNA quality, quantity, and contam-
ination. Care in assessing and maintain-
ing the quantity and quality of DNA in
every sample is crucial for pooling studies.
Rough DNA quantitation procedures that
are routinely used in most genotyping lab-
oratories are likely to introduce such sub-
stantial errors that many of the apparent
disease versus control differences will ac-
tually arise from occult over- or underrep-
resentation of genotypes of individuals with
misquantitated DNA. Uneven DNA quality
can provide the same selective over- and un-
derrepresentation of genotypes of selected
individuals in each pool, leading to more
false positives and less sensitivity for detec-
tion of true positive results. Contamination
of pooled DNA samples with even a modest
amount of DNA from laboratory personnel
or other sources can also provide difficulties
for pooling procedures.
(b) Numbers of individuals per pool and
numbers of pools. To obtain maximal
benefits from multiple-pool GWA, one must
meet the following requirements: (a) The
numbers of individuals in each pool should
be sufficient that even sophisticated analy-
ses of pooled data cannot reconstruct indi-
vidual identities or genotypes. Treatments
of this subject suggest that pools need to
contain more than four or five individuals
for maximal confidentiality protection.
(b) The numbers of individuals in each pool
should allow significant cost and time sav-
ings compared with individual genotyping.
(c) The numbers of pools should be suffi-
cient to provide good estimates of pool-to-
pool variability, which can then be used to
compare to the differences between disease
and control individuals using standard sta-
tistical tests.
Multiple genotyping assessments of each
DNA pool can aid the precision of esti-
mates of pool-to-pool variability as well. We
have used three to four microarrays to assess
DNA samples from each pool. These num-
bers are based on preliminary studies that
seek to optimize estimates of “true” rela-
tive allele frequencies at acceptable cost. We
construct each pool using DNA from 20 in-
dividuals of the same self-reported ethnicity
and the same disease or “control” pheno-
type. Thus, we obtain results in quadrupli-
cate at 1/5–1/7 the reagent costs of indi-
vidual genotyping using a single array set
per person.
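The cost figure can be checked with simple arithmetic. The sketch below assumes that reagent cost scales with the number of array sets used and ignores the cost of quantitating DNA and constructing pools; the function name is hypothetical.

```python
from fractions import Fraction


def pooled_cost_fraction(individuals_per_pool, arrays_per_pool):
    """Array sets used per person under pooling, relative to one set per person."""
    return Fraction(arrays_per_pool, individuals_per_pool)


for arrays in (3, 4):
    frac = pooled_cost_fraction(20, arrays)
    print(f"{arrays} arrays per 20-person pool: {frac} of the individual-genotyping array cost")
```

Four arrays per 20-person pool gives 1/5 of the individual-genotyping cost, and three arrays gives 3/20, which lies between 1/6 and 1/7.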
(c) Number of different replicate sam-
ples. Power calculations assess the likeli-
hood that an experiment can detect a dif-
ference of a certain magnitude in a spe-
cific SNP. Experiments require reasonable
levels of protection against false positive results, α.
They also require reasonable power, 1 − β.
GWA requires many repeated measures;
considerations of α and β thus need to be
applied to hundreds of thousands or mil-
lions of SNPs.
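To make the scale of the multiple-comparison problem concrete, the following sketch computes the number of nominally significant SNPs expected by chance and the corresponding Bonferroni threshold. It assumes, for illustration only, that every SNP is null and that tests are independent; linkage disequilibrium between markers violates the second assumption in real data.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of nominally significant tests if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P value threshold that holds the family-wise error rate at alpha."""
    return alpha / n_tests


for n_snps in (500_000, 1_000_000):
    print(
        f"{n_snps:>9,} SNPs: about {expected_false_positives(n_snps):,.0f} chance hits "
        f"at nominal 0.05; Bonferroni threshold {bonferroni_threshold(n_snps):.1e}"
    )
```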
One approach to the dilemma raised by
these large numbers of multiple compar-
isons has been to propose single studies with
increasingly large sample sizes. However,
accretion of very large samples is expensive.
Attempts to assemble large samples from
smaller subsamples collected at various sites
also run greater and greater risks of incor-
porating increasing numbers of occult het-
erogeneities that could provide confound-
ing influences on the results obtained.
The approaches that we outline here rely
on initial use of achievable sample sizes that
may be more likely to be more homoge-
neous. Initial samples can nominate sets of
SNP markers, genomic regions, and genes
that can be studied in additional indepen-
dent, replicate samples. A requirement that
genes display SNPs whose allelic frequen-
cies distinguish disease from control individ-
uals in multiple samples is one of the few as-
surances against false positive results that is
likely to pass ultimate statistical muster and
also to yield feasible experimental designs.
There is a downside to this approach: rates
of false negative results are also unavoidably
elevated by requirements for replication.
(d) Validation studies. Validation studies as-
sess the fits between individual and pooled
genotyping in a number of different ways.
These include assessments of the con-
cordance between results from sense and
antisense probes for the same SNP and con-
cordance between results for the same SNPs
obtained using arrays of different types.
The same DNA samples can be pooled
multiple times, and the same pool can be
analyzed multiple times, to further assess
We focus here on a core validating test for
pooling. This core test comes from analyses
of the relationships between (a) observed allele ratios, that is, the background-subtracted, normalized hybridization intensity ratio values obtained from different pools of DNA samples; and (b) expected allele ratios, that is, the fractions of, for example, “A” and “B” alleles obtained from individual genotypes.
We and others have compared allelic determinations from individual DNA samples versus results from pools with equal or differing amounts of DNA from small numbers or larger numbers of control individuals using HuSNP; Affymetrix 10K, 100K, 500K, and 1 million SNP products; Perlegen arrays; and Illumina 300 and 500K arrays.
Using 500K Affymetrix arrays, we have evaluated pooling using equal and varying amounts of DNA from CEPH individuals. Overall data for 150,000 SNPs from these comparisons produce correlations between pooled and individually determined genotypes of r = 0.95 (Fig. 3). These overall results derive from studies of 5475 informative Nsp I and 6230 informative Sty I SNPs in experiments in which equal amounts of DNA from homozygotes were mixed (correlation ca. 0.9); 10,032 informative Nsp I and 10,249 informative Sty I SNPs in experiments in which these same DNA samples were mixed in 1:1, 1:5, and 1:15 ratios (correlations of 0.96 and 0.98, respectively); and 31,201 informative Nsp I and 39,827 Sty I SNPs for studies of one homozygote and one heterozygote (correlations of 0.89 and 0.92, respectively). When we compare these results with those reported using long-range PCR products, using the reported procedure of eliminating the 9% of SNPs that yielded the more problematic correlations, the overall correlation for the remaining SNPs is 0.98. Correlations using 1M Affymetrix SNP arrays are at least as strong, with r > 0.98.

Figure 3. Validation of SNP genotyping in DNA pools (From 12). The relationship (R = 0.95) between individual and pooled genotyping using 500K SNP Affymetrix arrays provides an opportunity to assess the sensitivity of pooled genotyping. Because these validation experiments were the first ones performed with new array sets, these data provide a lower limit. Current results from 1M SNP arrays (6.0) provide relationships of around R = 0.98 (Drgon et al.
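The core validating test described above, comparing observed allele ratios from pools with the ratios expected from individual genotypes, can be sketched as follows. The genotypes, the 1:5 mixing ratio applied here, the observed values, and the function names are all hypothetical and serve only to show the shape of the calculation.

```python
from math import sqrt


def expected_b_fraction(genotypes, dna_amounts=None):
    """Expected "B" allele fraction in a pool, from individual genotypes.

    genotypes: strings such as "AA", "AB", "BB", one per individual.
    dna_amounts: relative DNA contributed by each individual (equal if None).
    """
    if dna_amounts is None:
        dna_amounts = [1.0] * len(genotypes)
    weighted = sum(w * g.count("B") / 2 for g, w in zip(genotypes, dna_amounts))
    return weighted / sum(dna_amounts)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical genotypes of two individuals at five SNPs, mixed in a 1:5 ratio.
snp_genotypes = [("AA", "BB"), ("BB", "AA"), ("AB", "AA"), ("BB", "BB"), ("AA", "AA")]
expected = [expected_b_fraction(g, [1, 5]) for g in snp_genotypes]
# Hypothetical normalized hybridization ratios measured on the pooled DNA.
observed = [0.80, 0.21, 0.10, 0.97, 0.04]
r = pearson_r(observed, expected)
print(f"r = {r:.3f}")
```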
Power Assessments
Approaches to assessing power of GWA have
used a variety of assumptions about the fre-
quencies of disease-causing alleles, the hetero-
geneity and penetrance of disease-causing al-
leles, marker frequencies, and the nature and
distribution of linkage disequilibrium across the
genomic intervals surveyed.
Many ap-
proaches to this problem use linkage disequilib-
rium distributions identified in HapMap sam-
ples, even though these HapMap individuals
represent only very small subsets of several cur-
rent human populations.
We have been impressed by the variability in
the detailed distribution of linkage disequilibrium
across different genomic loci in different samples.
Figure 4. The power of GWA as assessed using Gene Detective. Power is simulated here with 620,000 diallelic markers for samples with n = 400 cases and n = 400 controls with nominal 0.05 α levels. Note the striking relationship between power and effect size. The power to detect effects that would produce odds ratios of less than 1.2-fold is modest, whereas the power to detect effects as high as 1.7-fold is relatively good.
We have also been impressed by
the potential to model approximations of this
variability, on average, using simple functions.
Under these circumstances, estimates of effects
of sample size, locus-specific effect sizes for the
underlying functional alleles, genetic hetero-
geneity, penetrance, and marker density can
produce reasonable models that can allow as-
sessments of the effects of variation in these
parameters on power.
We have focused on diallelic markers and
disease/no disease phenotypes. We have devel-
oped a model to simulate the effects of vary-
ing these parameters that has resulted in the
program Gene Detective. We can use this ap-
proach (see Supplement for details) to simulate
the approximately 620,000 diallelic markers re-
ported to date for samples with n = 400 cases
and n = 400 controls with nominal 0.05 α lev-
els (Fig. 4). We can observe effects of sample
size, heterogeneity/penetrance ratios, marker
minor allele frequencies, and disease frequen-
cies. Such effects are each relatively modest over
a reasonable range of values for genome-wide
distributions of linkage disequilibrium. How-
ever, the relationship between power and effect
size is striking. The power to detect effects that
would produce odds ratios of less than 1.2-fold
is modest, whereas the power to detect effects
as high as 1.7-fold is relatively good.
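The steep dependence of power on effect size can be illustrated with a much simpler calculation than the Gene Detective simulations. The sketch below is not Gene Detective: it is a textbook normal approximation for a two-sided, two-proportion test on allele counts at a single directly typed variant, at a nominal per-SNP α. It ignores linkage disequilibrium between marker and functional allele, heterogeneity, and penetrance, and the control allele frequency of 0.3 is an arbitrary assumption.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a control frequency and an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def allelic_power(control_freq, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate two-sided power of a two-proportion z test on allele counts."""
    norm = NormalDist()
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each diploid individual contributes two alleles.
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p0) / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


for odds_ratio in (1.1, 1.2, 1.3, 1.5, 1.7):
    power = allelic_power(0.3, odds_ratio, n_cases=400, n_controls=400)
    print(f"odds ratio {odds_ratio:.1f}: power = {power:.2f}")
```

Even under these favorable simplifications, power at an odds ratio of 1.2 stays below one half with 400 cases and 400 controls, whereas power at 1.7 approaches 1.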
Increments in marker density from 620,000
to 1,000,000 and increases in sample sizes from
n = 400 to n = 2000 each for cases and
controls improve power (Fig. 5). However, the
steep relationships between power and effect
sizes are also found in these simulations. Such
Figure 5. The power of GWA as assessed using Gene Detective. Power is simulated here with 1,000,000 diallelic markers for samples with n = 2000 cases and n = 2000 controls with nominal 0.05 α levels. Note that the striking relationship between power and effect size is retained.
results underscore the distinctions that we have
made above concerning analytic approaches to
oligogenic disorders, in which variants at in-
dividual gene loci produce relatively large dif-
ferences in risk, versus polygenic disorders, in
which the effects at each locus are likely to be small.
These power calculations apply only to the
initial genome scans. As we note above, replicate
studies aid in distinguishing false positive
from true positive associations but also increase
the cumulative number of false negative results.
Limits on the precision of the power calcu-
lations that derive from the approach outlined
here include limits on the precision of estima-
tion of the parameters whose estimation is re-
quired. Several of these parameters can be es-
timated based on substantial empirical data.
These include the size of the genome or ge-
nomic segment under consideration, the sizes
of the samples of disease and nondisease control
subjects studied, and the value for α desired.
The frequency of the disease, P [D], in the
population under study is available from epi-
demiological studies. It is thus important that
the samples for association genome scanning
display characteristics similar to those of the
populations in which disease probabilities have
been determined, so that estimates of P [D] are
as accurate as possible.
To assess the statistical power of analysis, we have also used more standard power calculations. Results from the program PS v2.1.31 with (a) α = 0.05, (b) sample sizes equal to the numbers of pools from the current data set, (c) mean abuser/control differences of 0.05 and 0.1, and (d) standard deviations from the SNPs that provided the largest differences between control and abuser populations are used in several of the discussions below. We have also used data from the Genetic Power Calculator for some analyses.
Achieving Significant Genome-Wide
Association in Single Samples versus
Seeking Replication and Generalization
in Multiple Samples
(a) Single-sample approaches. As noted
above, GWA gains power to detect variants
in more and more of the genome as more
and more genetic markers, generally SNPs
and/or copy number variants, are assayed.
Because many hundreds of thousands of
SNPs and/or copy number variants are
assayed in current data sets, stringent ap-
proaches to correct for the large number of
multiple comparisons are needed.
No clear-cut consensus has been reached
regarding the ability of any single method
to produce only true results from any single
sample. One approach to concerns about
the large numbers of comparisons that are
key components of GWA focuses on achiev-
ing genome-wide significance in single samples.
Single samples that demonstrate genome-
wide significance in this way must con-
tain single SNPs whose association displays
a striking nominal P value, often in the
neighborhood of about 10
results may be the most likely to be pub-
lished in prominent journals. However, in
most studies with findings of this statisti-
cal magnitude, effects of variants at a sin-
gle locus are sufficiently large that linkage
studies also provided significant evidence at
the same chromosomal locus.
For oligogenic contributions to common, complex disorders, seeking association with genome-wide significance in single samples thus provides a reasonable approach. When a single gene has a large effect, a number of corrections for multiple comparisons can be applied without creating many false negatives. Bonferroni corrections for multiple comparisons are advocated by some investigators, although they are generally considered to provide a conservative correction. False discovery rate corrections can also be applied. Permutation and Monte Carlo tests provide additional approaches.
As the expected effect of each locus falls
from the large effects characteristic of oli-
gogenic influences to the small effects that
characterize polygenic influences, however,
the sample sizes needed to generate P values
in these ranges provide a daunting problem.
Costs of individually genotyping such large
samples become limiting in all but the best-
supported enterprises.
The risks of intro-
ducing occult heterogeneities increase when
subsamples are collected at a variety of dis-
tinct sites.
As more occult heterogeneities
are included—because disease and control
samples need to be assembled from more
and more diverse sources to achieve a suffi-
cient sample size—more and more of the
results obtained may well represent false
positives based on such occult sample het-
erogeneity for genetic background or for
heritable traits that are not (nominally) be-
ing studied.
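The Bonferroni and false discovery rate corrections mentioned under single-sample approaches can be compared on a small example. The sketch below implements the standard Bonferroni rule and the Benjamini-Hochberg step-up procedure; the list of P values is hypothetical, and Benjamini-Hochberg is used here as one common false discovery rate method, not necessarily the one used in any study cited.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each test whose P value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered P value satisfies p_(k) <= k * q / m.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_k = rank
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical nominal P values for ten SNPs.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni_reject(p))
n_bh = sum(benjamini_hochberg_reject(p))
print(f"Bonferroni rejections: {n_bonf}; Benjamini-Hochberg rejections: {n_bh}")
```

The Bonferroni rule is the more conservative of the two on these values, consistent with the characterization above.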
(b) Replicate sample approaches. Here
we use an alternative analytic approach that
focuses on stepwise assessments. These step-
wise analyses address the problem of multi-
ple testing by seeking nominally significant
results that can be replicated in several independent
samples. We can assess the significance
of these replicated, nominally significant
results through the use of Monte Carlo
methods that correct for multiple comparisons.
It is important to emphasize the ways in
which the stepwise analyses presented here
identify, first, evidence for genes that con-
tain haplotypes found at different frequen-
cies in single disease versus control sam-
ple comparisons and, second, evidence for
genes that display haplotypes with such dif-
ferent frequencies in multiple samples that
results are unlikely to be due to chance.
We thus (a) first identify nominally signif-
icant SNPs in each sample, (b) identify
the clustering of such SNPs (within small
chromosomal regions) in each sample, (c)
seek replication, identifying small genomic
areas in which clusters from multiple repli-
cate samples from the same phenotype
also identify clustered nominally significant
SNPs, and (d) seek generalization, identify-
ing genes that contain clusters of nominally
positive SNPs from studies of related, genet-
ically determined phenotypes.
The criterion used here identifies clustering
based on chromosomal position. This ap-
proach allows direct comparison between
data sets that assess different sets of SNPs in
samples that may well differ in the details
of their patterns of linkage disequilibrium.
The Monte Carlo simulation methods used
here do not make assumptions about the
underlying distribution of the data assessed.
Monte Carlo methods provide empirical P
values based on repeated random samples
from the actual data sets analyzed. Such
approaches are especially useful when we
seek to assess the significance of apparently
reproducible results from convergent data
from multiple independent data sets that
differ from each other in sample size, num-
ber and types of genomic markers, racial
or ethnic background of the subjects, and
other key features. No alternative method
of which we are aware provides as tractable
a method for assessing the significance of re-
sults obtained in multiple samples without
assumptions about underlying distributions
of the data as do Monte Carlo approaches.
We use 10,000 Monte Carlo trials in cir-
cumstances in which moderately high sig-
nificance is anticipated, and 100,000 trials
in circumstances in which extremely high
significance is anticipated.
This approach seeks to identify genes with
variants that are likely to play roles in ad-
diction and in related phenotypes. This ap-
proach allows for locus heterogeneity and
thus does not use the more stringent crite-
rion that the same SNP is required to display
nominal significance in each of the samples
in which association data are said to support
association at a specific gene locus. This ap-
proach allows for differences in the phase of
association and thus does not use the more
stringent criterion that the same allele of the
SNP (or haplotype) must be associated with
nominal significance in each of the sam-
ples in which association data are said to
support association at a specific gene locus.
The approach allows for different details
of the patterns of linkage disequilibrium
between marker and functional haplotype
from sample to sample. The approach al-
lows us to combine data sets in which differ-
ent marker sets are used. With each of these
limitations, it is clear that subsequent follow-
up analyses are required. Analyses in the
same and in additional independent sam-
ples are required to untangle any locus het-
erogeneity, to unequivocally identify which
individual SNPs are associated, and to iden-
tify pathological haplotypes and the phases
with which they are associated with pheno-
types in samples from different racial and
ethnic backgrounds. Although we describe
examples of such follow-up studies for the
NrCAM and neurexin 3 (NRXN3) genes
below, it is important to note the limited
numbers of genes for which such confirma-
tory follow-up data are available. In many
circumstances, we believe that this sort of
follow-up requires molecular biologic, be-
havioral, and other evidence to buttress the
data that come from association genetics.
Stepwise Approaches to Analyses
(a) Determination of nominally significant
markers. Nominal P values that
come from t-tests (for pooled data), χ²
(for individual genotype frequencies), or
ρ (for correlational approaches) statistics
delineate the nominal significance of the
differences between disease and control
groups for each SNP. For pooled assess-
ments, proper definition of the pool-to-
pool variability is crucial for proper assign-
ment of the appropriate nominal t value.
However, the continuous results that come
from pooled data sets do provide the ad-
ditional statistical power characteristic of
statistics based on continuous measures.
(b) Identifying chromosomal clusters of
nominally significant markers in sin-
gle samples. We focus on the SNPs whose
chromosomal positions can be accurately
determined. Because gender ratios differ
substantially in many of these data sets, we
omit data from sex chromosomes for most
of these samples.
We focus on clusters of nominally positive
autosomal SNPs that lie within 100 or 25 kb
of each other, depending on the density of
markers available; the latter figure is closer
to the average haplotype block length in
the samples studied here. To use a valuable
technical control that is possible with
Affymetrix 500K reagents, we require that
SNPs in each cluster come from both
Sty I and Nsp I array types where possible.
In assessment of the data from each
sample set, these criteria thus provide some
assurance that haplotypes do occur at dif-
fering frequencies in disease versus controls.
These criteria provide significant technical
controls, based on requirements that multi-
ple nearby SNPs must display positive re-
sults and that these positive results must
come from two array types.
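The clustering criteria above can be sketched in a few lines. This is one possible reading, assumed for illustration: SNPs are chained into a cluster when each lies within the distance limit of its nearest neighbor on the same chromosome, a cluster needs at least two SNPs, and the array-type requirement is enforced strictly. The tuple layout, parameter names, and example positions are hypothetical.

```python
def find_clusters(snps, max_gap_bp=25_000, min_snps=2, require_both_arrays=True):
    """Group nominally significant SNPs into positional clusters.

    snps: iterable of (chromosome, position_bp, array_type) tuples,
          where array_type is, for example, "Nsp" or "Sty".
    """
    clusters, current = [], []
    for snp in sorted(snps):
        # Start a new cluster on a chromosome change or a gap above the limit.
        if current and (snp[0] != current[-1][0] or snp[1] - current[-1][1] > max_gap_bp):
            clusters.append(current)
            current = []
        current.append(snp)
    if current:
        clusters.append(current)

    kept = []
    for cluster in clusters:
        if len(cluster) < min_snps:
            continue
        if require_both_arrays and len({snp[2] for snp in cluster}) < 2:
            continue
        kept.append(cluster)
    return kept


# Hypothetical nominally significant autosomal SNPs.
hits = [
    (1, 100_000, "Nsp"), (1, 110_000, "Sty"), (1, 130_000, "Nsp"),
    (1, 900_000, "Sty"),
    (2, 50_000, "Nsp"), (2, 60_000, "Nsp"),
]
clusters = find_clusters(hits)
print(clusters)
```

In this example, the three chromosome 1 SNPs form the only retained cluster: the isolated SNP is dropped, and the chromosome 2 pair is dropped because both SNPs come from the same array type.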
It is important to note that, if stochastic
events produce a nominally significant as-
sociation at a given SNP in a single sample,
linkage disequilibrium with nearby SNPs
might provide a cluster of several SNPs with
nominal significance in this single sample
on stochastic grounds alone. Control for the
possibility that these differences in haplo-
type frequencies result from stochastic dif-
ferences between samples thus awaits the
next analytic step (c, below).
We test the nonrandomness of clustering
of nominally significant SNPs using Monte
Carlo simulations. We can also use these
approaches to identify the nonrandomness
of clustering within genes. For each sim-
ulation trial, a random set of SNPs from
the database that contains the results from
these studies is subjected to the same ana-
lytic procedures that had been used for the
actual data analysis. The number of trials
for which the results from the randomly
selected set of SNPs match or exceed
the results actually observed from the SNPs
identified in the current study is tabulated.
Empirical P values are calculated by divid-
ing the number of trials for which the ob-
served results are matched or exceeded by
the total number of Monte Carlo simulation
trials performed. This method examines the
properties of the actual SNPs contained in
each data set. It is therefore relatively ro-
bust despite the uneven distributions of SNP
markers across the genome, differences in
linkage disequilibrium across the genome
in different samples, and the different SNPs
genotyped using different assays.
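The Monte Carlo procedure described above can be sketched as follows. The clustering statistic used here (the number of SNPs with a same-chromosome neighbor within 25 kb), the evenly spaced marker map, and the function names are simplifying assumptions made for illustration; the actual analyses apply the full analytic procedure to each random SNP set.

```python
import random


def clustered_count(positions, max_gap_bp=25_000):
    """Number of SNPs with a neighbor on the same chromosome within max_gap_bp."""
    pts = sorted(positions)
    count = 0
    for i, (chrom, pos) in enumerate(pts):
        near_prev = i > 0 and pts[i - 1][0] == chrom and pos - pts[i - 1][1] <= max_gap_bp
        near_next = (
            i + 1 < len(pts) and pts[i + 1][0] == chrom and pts[i + 1][1] - pos <= max_gap_bp
        )
        if near_prev or near_next:
            count += 1
    return count


def empirical_p(all_snps, observed_hits, n_trials=10_000, seed=0):
    """Fraction of random SNP sets whose clustering matches or exceeds the observed value."""
    rng = random.Random(seed)
    observed = clustered_count(observed_hits)
    matched = 0
    for _ in range(n_trials):
        # Draw a random set of the same size from the SNPs actually assayed.
        trial = rng.sample(all_snps, len(observed_hits))
        if clustered_count(trial) >= observed:
            matched += 1
    return matched / n_trials


# Hypothetical marker map: 2000 SNPs spaced 10 kb apart on one chromosome.
all_snps = [(1, i * 10_000) for i in range(2000)]
# Hypothetical nominally significant SNPs forming two tight clusters.
hits = [(1, i * 10_000) for i in list(range(100, 105)) + list(range(900, 905))]
p_value = empirical_p(all_snps, hits, n_trials=2000, seed=1)
print(f"empirical P = {p_value}")
```

Because the empirical P value is a count divided by the number of trials, its smallest nonzero value is 1/10,000 with 10,000 trials and 1/100,000 with 100,000 trials, which is why more trials are needed when very high significance is anticipated.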
(c) Identifying the clustered, nominally
positive SNPs with the strongest posi-
tive support from several replication
data sets. We next seek convergence be-
tween data from several replicate samples.
We focus on samples that test the same un-
derlying hypothesis (i.e., that common al-
lelic variants contribute to genetic compo-
nents of vulnerability to develop substance
dependence). Some of these samples and
their matched controls differ from each
other on other bases (e.g., racial or eth-
nic background or primary substances of
abuse). We thus use replication here in a re-
stricted sense that allows us to reserve use
of the term generalization to denote identi-
fication of genes whose pleiotropic influ-
ences are evident in studies of other heri-
table phenotypes that often co-occur with
addictions (see below). Obviously, an aspect
of generalization also applies when compar-
ing data from (a) polysubstance-dependent
versus control samples collected from indi-
viduals of two racial or ethnic backgrounds
with (b) methamphetamine-dependent
versus control samples collected from
individuals with a third racial or ethnic
background (see below).
Analyses focus on genes identified by clus-
tered positive results from several samples.
This approach, rather than a focus on indi-
vidual SNPs whose informativeness might
differ across samples, allows for some de-
gree of genetic heterogeneity and for some
sample-to-sample differences in the detailed
patterns of linkage disequilibrium.
Clustering of positive results in the same
gene in each of several independent sam-
ples is much less likely to represent purely
stochastic effects than observations made in
any single sample. Such clustering in multiple
samples is more likely to reflect true
differences related to the phenotype of interest,
such as dependence on addictive substances.
However, it is important to emphasize again that
these criteria are aimed at the identification
of genes, rather than a precise definition
of exact disease-associated haplotypes. We
thus allow the phase of association to differ
between samples at this level of analysis. De-
tailed studies of the phase of association can
provide a very valuable fine mapping tool
to allow for the identification of the exact
pathogenic haplotype.
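The gene-level replication criterion described above can be sketched as a simple tally. Matching is done on the gene, not on the individual SNP or allele, which is what allows for locus heterogeneity and for differences in the phase of association. The sample names, gene labels, and the threshold of two samples are hypothetical placeholders.

```python
from collections import Counter


def replicated_genes(clusters_by_sample, min_samples=2):
    """Return genes that contain clustered positive SNPs in at least min_samples samples.

    clusters_by_sample: dict mapping a sample name to the collection of gene labels
    that contain a cluster of nominally positive SNPs in that sample.
    """
    counts = Counter()
    for genes in clusters_by_sample.values():
        counts.update(set(genes))  # count each gene at most once per sample
    return {gene: n for gene, n in counts.items() if n >= min_samples}


# Hypothetical cluster-containing genes in three independent samples.
clusters_by_sample = {
    "sample_1": {"GENE_A", "GENE_B", "GENE_C"},
    "sample_2": {"GENE_A", "GENE_C", "GENE_D"},
    "sample_3": {"GENE_A", "GENE_E"},
}
replicated = replicated_genes(clusters_by_sample)
print(replicated)
```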
(d) Identifying the clustered, nominally
positive SNPs with the strongest
positive support from several gen-
eralization data sets. To seek possible
generalization of results, we have sought
chromosomal locations where the clus-
tered positive data from several substance-
dependence GWA samples lie near clus-
tered, nominally positive (and reproducibly
positive) results from studies of other re-
lated, heritable phenotypes.
Bayesian approaches to these analyses sug-
gest that the stronger the evidence for co-
heritabilities of substance dependence and
these related phenotypes, the higher the
likelihood that molecular genetic studies
will demonstrate true overlaps.
We fo-
cus on phenotypes that display good evi-
dence for heritability from classical genetic
studies, including evidence that complex ge-
netics plays substantial etiologic roles. We
focus first on heritable phenotypes that co-
occur with addictions at frequencies much
greater than those that we would expect if
they were independent of each other. For ex-
ample, even though substance dependence
and bipolar disorder are both common,
the product of their population frequencies
does not nearly explain the approximately
two-thirds of bipolar individuals who report
abuse of or dependence on an addictive substance.
Twin data that compare co-occurrence fre-
quencies in monozygotic versus dizygotic
twin pairs provide evidence for shared
heritability of some of these phenotypes.
For other phenotypes, the magnitude of
genetic influences and the frequency of
co-occurrence with substance dependence
indicate the likelihood of pleiotropic in-
fluences of some of the same allelic vari-
ants on both phenotypes. Finally, we also
present here the idea that transitive genetic
approaches may also identify evidence for
generalization of effects of some pleiotropic
influences. If substance dependence and in-
termediate heritable phenotypes share ge-
netic overlap and co-occur, then a third
heritable phenotype that is documented to
co-occur with the intermediate phenotype
might also share substantial heritability with
addiction vulnerability.
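The independence argument made above for substance dependence and bipolar disorder can be written in one line. The symbols SD (substance abuse or dependence) and BD (bipolar disorder) are introduced here only for illustration.

```latex
P(\mathrm{SD} \cap \mathrm{BD}) = P(\mathrm{SD})\,P(\mathrm{BD})
\quad\Longrightarrow\quad
P(\mathrm{SD} \mid \mathrm{BD})
  = \frac{P(\mathrm{SD} \cap \mathrm{BD})}{P(\mathrm{BD})}
  = P(\mathrm{SD})
```

If the two phenotypes were independent, the fraction of bipolar individuals reporting substance abuse or dependence would simply equal the population frequency of substance abuse or dependence. The reported fraction of approximately two-thirds is what the text contrasts with that expectation.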
Examples of heritable phenotypes for which
twin data document shared genetic de-
terminants include frontal lobe brain vol-
ume and cognitive abilities.
Examples of heritable phenotypes for which
co-occurrence makes shared genetics highly
likely, a priori, include substance dependence
and bipolar disorder.
A transitive ge-
netic approach could be applied to the
shared genetics of substance dependence
and cognitive abilities or brain volume on
one hand, and the likely shared genetics
of cognitive ability or brain volume and
vulnerability to AD, on the other hand.
Data from cognitive function and frontal
brain volume genetics thus provide poten-
tial intermediate phenotypes to link the ge-
netics of addiction with that of AD. Such
links might or might not have been antic-
ipated, based on equivocal evidence from
current epidemiologic studies.
As we seek to document the extent of the
generalization of effects of alleles that were
initially identified in studies of addiction, we
test the null hypothesis that clustered posi-
tive results from the GWA data from addic-
tion vulnerability do not converge with the
chromosomal positions of clustered nomi-
nally positive SNPs in comparisons of other
phenotypes, such as individuals with bipo-
lar disorder versus control samples. Monte
Carlo simulations that test this null hy-
pothesis sample from data within the SNP
data sets noted above. As noted above,
100,000 trials allow estimates of the signifi-
cance of the generalization of the effects of
the alleles identified in studies of addiction.
(e) Controls for the alternative possibil-
ities that results could come from oc-
cult racial or ethnic stratification or
assay noise. Several alternative hypothe-
ses might explain observed results. To test
some of these alternative hypotheses, we
compare the clustered positive SNPs from
different samples with SNPs that display
the largest allele frequency differences in
appropriate control data sets. These com-
parison data sets include those that contrast
allele frequencies in (a) European American
versus African American control individu-
als from NIDA samples,
(b) Japanese ver-
sus Han Chinese individuals from HapMap
samples (JPT, Japanese from Tokyo; HCB,
Han Chinese from Beijing), (c) control in-
dividuals sampled in different portions of
the United Kingdom,
and (d) SNPs that
display the largest variances from array to array.
We can thus compare data from
the true comparisons in our experiments to
similarly analyzed data from samples that
test alternative hypotheses, providing sub-
stantial additional control evidence.
(f) Results from alternative approaches:
principal components analysis and
hierarchical clustering. Several alterna-
tive approaches to the analysis of GWA
data sets can also provide interesting re-
sults that assess the structure of the pool-
to-pool variance, based on data from 500
or 600K SNP sets. Principal components
analysis (PCA) of the pool-to-pool differ-
ences in 500K data from European Amer-
ican, African American, and Asian sam-
ples divides the data along these racial and
ethnic lines, as we might expect. However,
PCA also subdivides data from experiments
studying two distinct U.S. samples of nomi-
nally equivalent genetic background: Sam-
ple 1, NIDA European American subjects
recruited in Baltimore, Maryland versus
Sample 3, COGA European American sub-
jects recruited in St. Louis, Missouri, the
Bronx, New York, San Diego, California,
and other sites. Similarly, these PCA anal-
yses separate the samples of Asian subjects
recruited in Japan from those recruited in
or near Taipei who are self-characterized as
Han Chinese. Each of these results under-
scores the need for extremely careful match-
ing of the racial and ethnic backgrounds of
control and disease samples.
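The kind of separation described above can be illustrated with a minimal sketch. It assumes a small table of pooled allele-frequency estimates (one row per pool, one column per SNP) and extracts only the first principal component by power iteration. The toy values, the function name, and the use of power iteration are illustrative assumptions, not the analysis pipeline used for the 500K data.

```python
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (loadings, scores) for the first principal component.

    rows: list of equal-length lists, one row per DNA pool and one
    column per SNP (hypothetical allele-frequency estimates).
    """
    n, d = len(rows), len(rows[0])
    # Center each SNP column on its mean across pools.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    def project(vec):
        # Score of each pool along the direction vec.
        return [sum(c[j] * vec[j] for j in range(d)) for c in centered]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        scores = project(v)
        # Multiply by X^T X without forming the covariance matrix.
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, project(v)


# Toy data (assumed): three pools from one recruitment site, three from another.
toy_pools = [
    [0.20, 0.50, 0.80, 0.40],
    [0.22, 0.49, 0.79, 0.41],
    [0.21, 0.51, 0.81, 0.39],
    [0.40, 0.50, 0.60, 0.40],
    [0.41, 0.52, 0.61, 0.42],
    [0.39, 0.49, 0.59, 0.41],
]
loadings, scores = first_principal_component(toy_pools)
site_a, site_b = scores[:3], scores[3:]
```

In this toy example the two groups of pools fall on opposite sides of zero along the first component, which is the pattern that would flag imperfectly matched samples. The sign of a principal component is arbitrary, so only the separation is meaningful.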
Hierarchical clustering is most conveniently
limited to data from individual genes.
Large, 500K SNP data sets provide sub-
stantial limitations based on computer time.
When we examine data from several genes
using this approach in pools from a sin-
gle racial or ethnic background, we can
identify relatively clear patterns of sepa-
ration between data from pools contain-
ing substance-dependent individuals and
data from pools containing control indi-
viduals. These hierarchical clustering ap-
proaches are independent of the principal
analyses noted above. These results reassure
us that modest association signals can be
identified in many of these addiction-
associated genes using a variety of different
statistical approaches.
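A minimal sketch of hierarchical clustering restricted to the SNPs of a single gene is given here. It assumes each pool is summarized by a short vector of allele-frequency estimates and uses average linkage with Euclidean distance. The linkage rule, the distance measure, and the toy values are assumptions chosen for illustration; the text does not specify them.

```python
def euclidean(a, b):
    """Euclidean distance between two allele-frequency profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering; returns lists of pool indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all cross-cluster pairs of pools.
                dists = [
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy data (assumed): pools 0-1 substance dependent, pools 2-3 control,
# each described by three SNPs within one hypothetical gene.
gene_profiles = [
    [0.30, 0.62, 0.45],
    [0.31, 0.60, 0.46],
    [0.22, 0.70, 0.38],
    [0.23, 0.71, 0.37],
]
pool_groups = average_linkage(gene_profiles, n_clusters=2)
```

The pairwise search is cubic in the number of pools in this naive form, which is acceptable for a few dozen pools and a handful of SNPs but illustrates why whole-array data sets become costly in computer time.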
Ethical Issues in High-Density
Genotyping of Individuals Selected
Based on Self-Reported Illegal Behaviors
Individuals who are individually genotyped
in relationship to addiction and related pheno-
types are subject to a number of potential risks.
Some of these risks are shared with individuals
who are subjected to high-density genotyping
in relationship to other disorders and pheno-
types. Other risks are more likely to come to
the fore in studies of illegal behaviors.
Concerns relating to insurability, employa-
bility, paternity determination, and providing
(or not providing) genotyped individuals with
access to their genotypes and/or genetic coun-
seling are shared by individuals with other com-
plex disorders.
Pending legislation in the
United States may mitigate several of these con-
cerns, and they are reviewed elsewhere. We
therefore will not consider these issues further.
High-density individual genotyping of DNA
from individuals who are addicted to illegal
substances raises additional issues. Many of
these individuals are likely to have experienced
involvement in criminal activities that goes be-
yond the use of illegal substances. Because the
risks of high-density individual genotyping in
this population have not been generally dis-
cussed elsewhere, we discuss several points that
may inform thinking about these special ethical
issues.
Increasingly ubiquitous DNA testing re-
lated to criminal activities lies at the heart
of these concerns. In the United States, each
state has a DNA database that collects information
from crime scenes and from individuals
convicted of particular offenses. The
Combined DNA Index System (CODIS) operates
local, state, and national DNA profile
databases of convicted offenders, unsolved
crime scenes, and missing persons. Numerous
suspects have been identified through
matches between DNA profiles from crime
scenes and profiles from convicted offenders.
A relevant website reports that the “success
of CODIS is demonstrated by the thousands
of matches that have linked serial cases to
each other and cases that have been solved
by matching crime scene evidence to known
convicted offenders.” The European Union is
just one of the other international entities with
a similar system.
Core CODIS data come from genotypes at
13 simple sequence length polymorphic (SSLP)
loci. These loci lie near SNP markers with in-
formation about virtually all of these loci, pro-
viding a ready means of translating between
SNP and SSLP genotypes. Other mitochon-
drial, sex chromosome, and autosomal markers
are also genotyped on substantial numbers of
these DNA samples.
A recent, October 2007, analysis of the
CODIS-linked DNA index system revealed
individually identifying genotype profiles for
more than 5 million convicted offenders, as well
as almost 200,000 DNA profiles from crime
scenes.
Almost 40% of men and 15% of women in
cohorts from the areas of Baltimore from which
Sample 1 and Sample 2 research volunteers
come had experienced significant adult crimi-
nal justice system involvement (e.g., incarcera-
tion as adults) by the time they reached their
late 20s (N. Ialongo, personal communication,
2008). It thus seems reasonable to conclude
that several of the more than 3400 research
volunteers from Samples 1 and 2 might be at potential
risk for matches with crime scene DNA pro-
files. Similar potential risks might also be in-
curred through genetic study participation by
individuals who report dependence on illegal
substances in other parts of the United States.
Although this problem is not unique to studies
of the genetics of illegal behaviors, it appears to
be much more likely in this area than in most
other areas of complex genetics.
Our laboratory, along with most other laboratories
that work in this field, has established
elaborate means of coding and of providing physical
and electronic protections for the electronic
and paper records that might identify our research
volunteers. Subjects are protected by
confidentiality certificates obtained through the
Department of Health and Human Services.
Data from these studies are analyzed and reported
in ways that do not identify individual
subjects.
However, the strongest protection for in-
dividuals who volunteer for this work comes
from development and use of DNA pooling ap-
proaches. Because these approaches never gen-
erate high densities of genotypes for any indi-
vidual, it is impossible to abuse or misuse these
pooled data for unintended purposes. Pooling
approaches provide these research volunteers
with the strongest confidentiality protections
currently available. Pooling may also merit in-
creasing attention in other settings in which the
risks of DNA-based personal identification are
of significant concern.
Results: Clustering within
Individual Samples and
Convergence between Replicate
Samples for the Same Phenotypes
Polysubstance Dependence versus
Control (Samples 1 and 2)
NIDA substance dependence samples were
analyzed by selecting nominally positive
SNPs that displayed P values < 0.05 for com-
parisons between substance-dependent and
control samples within both European Amer-
ican and African American samples. We as-
sessed the extent to which nominally positive
SNPs that were identified in both samples by
SNPs represented on at least two different ar-
ray types cluster together in small chromo-
somal regions. Clusters contain at least three
SNPs that display P < 0.05 in both samples
and lie within 100 kb of each other. In this
data set, 6666 of the 639,401 tested SNPs
displayed reproducible, nominally significant
abuser versus control allele frequency differ-
ences (P < 0.05) in both samples. The criterion
that the same SNP display nominally significant
differences in both samples is more stringent
than criteria used in other comparisons
(see below). This criterion was applied to reduce
the number of false positive results, but does
not allow as much within-locus heterogene-
ity. Of these 6666 reproducibly positive SNPs,
1158 were within 320 chromosomal clusters;
184 of these clusters identified 244 annotated
genes.
Monte Carlo simulation trials that assess the
probability that these results are attributable
to chance showed that none of the 100,000
simulation trials identified as many SNPs dis-
playing nominally positive differences between
substance-dependent and control samples in
both European and African American sam-
ples as observed here (thus P < 0.00001).
Of the 100,000 Monte Carlo simulation trials,
each of which began by selecting 6666 random
SNPs, 2100 provided chromosomal clustering
as marked as that observed for the true repro-
ducibly positive SNPs (P = 0.021).
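The clustering criterion and the Monte Carlo test described above can be sketched as follows. The sketch makes several assumptions that the text does not confirm: "within 100 kb of each other" is read as a chain of neighboring positive SNPs whose successive gaps do not exceed the limit, the test statistic is the number of SNPs that fall into clusters, and each trial draws the same number of SNPs at random from the tested set. Function names and toy coordinates are hypothetical.

```python
import random
from collections import defaultdict


def find_clusters(positive, min_snps=3, max_gap=100_000):
    """Group positive SNPs, given as (chromosome, position), into clusters.

    A cluster is a run of at least min_snps positive SNPs on one
    chromosome in which each successive gap is at most max_gap bases
    (an assumed reading of "within 100 kb of each other").
    """
    by_chrom = defaultdict(list)
    for chrom, pos in positive:
        by_chrom[chrom].append(pos)
    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] <= max_gap:
                run.append(pos)
            else:
                if len(run) >= min_snps:
                    clusters.append((chrom, run))
                run = [pos]
        if len(run) >= min_snps:
            clusters.append((chrom, run))
    return clusters


def clustered_snp_count(positive, **kw):
    """Number of positive SNPs that lie inside any cluster."""
    return sum(len(run) for _, run in find_clusters(positive, **kw))


def monte_carlo_p(all_snps, n_positive, observed, trials=1000, seed=0, **kw):
    """Fraction of random draws that cluster at least as much as observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(all_snps, n_positive)
        if clustered_snp_count(sample, **kw) >= observed:
            hits += 1
    return hits / trials


# Toy data (assumed): 200 tested SNPs spaced 50 kb apart on one chromosome.
toy_snps = [("chr1", i * 50_000) for i in range(200)]
toy_positive = [("chr1", 0), ("chr1", 50_000), ("chr1", 100_000), ("chr1", 5_000_000)]
observed = clustered_snp_count(toy_positive)
p_value = monte_carlo_p(toy_snps, len(toy_positive), observed, trials=500)
```

When no trial matches the observed value, the result is reported as a bound of one over the number of trials, which is how "none of the 100,000 simulation trials" becomes P < 0.00001 in the text. When some trials do match, the empirical value is the matching fraction, for example 2100/100,000 = 0.021.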
Methamphetamine-Dependent versus
Control (Samples 4 and 5)
Nominally positive SNPs from each of
these two samples clustered together, with no
more than a 25-kb separation between nomi-
nally positive SNPs, more than anticipated by
chance. In Sample 4, 846 clusters contained
3749 of the 15,569 nominally positive SNPs
and, in Sample 5, 1787 clusters contained 8388
of the 25,538 nominally positive SNPs. Such
clustering was not found in any Monte Carlo
simulation trial (P < 0.0001 for both Sample 4
and Sample 5).
When we evaluated the genes identified
by clustered, nominally positive results from
Samples 4 and 5, we obtained evidence for
replication and results that could not be ex-
pected by chance alone. This criterion for genes
to be identified by clustered, nominally posi-
tive SNPs from each of two samples is not as
stringent as the criterion that the same SNPs
produce nominally positive results in each of
two samples. It does allow for within-locus het-
erogeneity. The degree of convergent identi-
fication of genes by data from each of these
two samples was never observed by chance in
any of 100,000 Monte Carlo simulation trials
(P < 0.00001).
Bipolar Disorder versus Control
(Samples 7, 8 and 9)
WTCCC Bipolar Disorder versus Control
Of the 426,604 SNPs analyzed in the
WTCCC bipolar disorder collection, 28,192
displayed χ² values with P < 0.05.
Of these,
12,560 SNPs fell into 1775 clusters in which at
least 4 SNPs, each of which displayed P < 0.05
(and were sampled on at least two array types),
lay within 25 kb of each other. Monte Carlo
simulation trials that assessed the probability
that these results were attributable to chance
showed that none of the 100,000 simulation
trials identified as many clusters of SNPs that
displayed nominally positive differences be-
tween bipolar disorder and control (control and
other disease) samples as were actually identi-
fied (thus, P < 0.00001).
NIMH Bipolar Disorder versus Control
Of the 536,288 SNPs analyzed in the NIMH
bipolar collection,
32,835 displayed t values
with P < 0.05. Of these SNPs, 9971 fell into
1770 clusters in which at least 4 SNPs, each
of which displayed t values corresponding to
P < 0.05, lay within 25 kb of each other.
Monte Carlo simulation trials that assessed
the probabilities that these results were due to
chance found that none of the 100,000 simula-
tion trials identified as many clustered SNPs
that displayed nominally positive differences
between bipolar disorder and control samples
as were actually identified from this work (thus,
P < 0.00001).
German Bipolar Disorder versus Control
Of the 532,835 SNPs analyzed in the Ger-
man samples, 27,057 displayed t values that
corresponded to P < 0.05. Of these SNPs,
6110 fell into 1137 clusters in which at least
4 SNPs, each of which displayed t values cor-
responding to P < 0.05, lay within 25 kb of
each other.
Monte Carlo simulation trials that
assessed the probability that these results were
due to chance found that none of the 100,000
simulation trials identified as many clustered
SNPs that displayed nominally positive differ-
ences between bipolar disorder and control
samples as were observed in this work (thus,
P < 0.00001).
Convergent Data from at Least Two of
the Three Bipolar Disorder versus
Control Comparisons
Simulation trials assessed the likelihood that
the clusters of nominally positive SNPs from at
least two of these three bipolar samples iden-
tified the same genes. Monte Carlo simulation
trials that assessed the probability that these
results were attributable to chance found that
none of the 100,000 simulation trials identi-
fied as many clustered SNPs displaying nomi-
nally positive differences between bipolar dis-
order and control samples in the same genes in
multiple samples as we actually observed (thus,
P < 0.00001).
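The test for convergent gene identification across samples can be illustrated with a simplified sketch. It assumes each sample contributes a set of genes identified by its clusters and asks how often random gene sets of the same sizes share as many genes in at least two sets. Drawing genes uniformly is an assumption made for brevity: it ignores gene size and SNP density, whereas the simulations described in the text sample from the SNP data sets themselves. All gene labels are placeholders.

```python
import random
from collections import Counter


def genes_in_at_least(gene_sets, k=2):
    """Genes that appear in at least k of the supplied gene sets."""
    counts = Counter(g for s in gene_sets for g in set(s))
    return {g for g, c in counts.items() if c >= k}


def convergence_p(all_genes, gene_sets, k=2, trials=1000, seed=0):
    """Return (observed shared-gene count, empirical P from random sets)."""
    rng = random.Random(seed)
    observed = len(genes_in_at_least(gene_sets, k))
    hits = 0
    for _ in range(trials):
        # Each simulated sample gets as many random genes as the real one.
        simulated = [rng.sample(all_genes, len(set(s))) for s in gene_sets]
        if len(genes_in_at_least(simulated, k)) >= observed:
            hits += 1
    return observed, hits / trials


# Toy data (assumed): 1000 placeholder genes and three samples of 20 genes,
# two of which share 10 genes.
all_genes = [f"gene{i}" for i in range(1000)]
sample_a = [f"gene{i}" for i in range(0, 20)]
sample_b = [f"gene{i}" for i in range(10, 30)]
sample_c = [f"gene{i}" for i in range(100, 120)]
shared, p_convergence = convergence_p(all_genes, [sample_a, sample_b, sample_c])
```

Because the criterion works at the level of genes rather than individual SNPs, different SNPs within the same gene can contribute in different samples, which is the allowance for within-locus heterogeneity noted earlier.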
Frontal Brain Volume
(Samples 10 and 11)
In 500K Affymetrix data from unrelated
members of NHLBI twin pairs, 10,266 SNPs
provided nominally positive results (G.R. Uhl
et al., submitted). Of these SNPs, 583 fell
into 169 chromosomal clusters, each of which
contained at least three nominally posi-
tive SNPs that lie within 25 kb of each
other and come from both the Sty I and
Nsp I array types. Monte Carlo trials did
not identify such a degree of clustering