
Assessment and ascertainment in psychiatric molecular genetics: challenges and opportunities for cross-disorder research

Psychiatric disorders are highly comorbid, heritable, and genetically correlated [1–4]. The primary objective of cross-disorder psychiatric genetics research is to identify and characterize both the shared genetic factors that contribute to convergent disease etiologies and the unique genetic factors that distinguish between disorders [4, 5]. This information can illuminate the biological mechanisms underlying comorbid presentations of psychopathology, improve nosology and prediction of illness risk and trajectories, and aid the development of more effective and targeted interventions. In this review we discuss how estimates of comorbidity and identification of shared genetic loci between disorders can be influenced by how disorders are measured (phenotypic assessment) and the inclusion or exclusion criteria in individual genetic studies (sample ascertainment). Specifically, the depth of measurement, source of diagnosis, and time frame of disease trajectory have major implications for the clinical validity of the assessed phenotypes. Further, biases introduced in the ascertainment of both cases and controls can inflate or reduce estimates of genetic correlations. The impact of these design choices may have important implications for large meta-analyses of cohorts from diverse populations that use different forms of assessment and inclusion criteria, and subsequent cross-disorder analyses thereof. We review how assessment and ascertainment affect genetic findings in both univariate and multivariate analyses and conclude with recommendations for addressing them in future research.
EXPERT REVIEW OPEN
Na Cai (1,2,3,32), Brad Verhulst (4,32), Ole A. Andreassen (5,6,7), Jan Buitelaar (8,9),
Howard J. Edenberg (10,11), John M. Hettema (12), Michael Gandal (13,14),
Andrew Grotzinger (15,16), Katherine Jonas (17), Phil Lee (18,19), Travis T. Mallard (20,21),
Manuel Mattheisen (22,23,24), Michael C. Neale (12,25), John I. Nurnberger Jr (11,26,27),
Wouter Peyrout (28,29), Elliot M. Tucker-Drob (30), Jordan W. Smoller (20,21,31) and
Kenneth S. Kendler (12,25)
© The Author(s) 2024
Molecular Psychiatry; https://doi.org/10.1038/s41380-024-02878-x
INTRODUCTION
The comorbidity between psychiatric disorders stems, at least in
part, from overlapping genetic factors. Understanding the genetic
etiology of psychiatric outcomes can illuminate the common
biological mechanisms that contribute to comorbid presentations
of psychopathology, delineate distinct psychiatric disorders, and
aid the development of more effective and targeted interventions.
We focus on binary diagnoses of psychiatric disorders to link the
implications of our recommendations to clinically validated
outcomes and remain consistent with existing psychiatric genetics
research. It could be argued that dimensional characterizations of
psychopathology have more statistical power and are more
capable of dissecting symptom heterogeneity. Nevertheless, we
emphasize the clinical validity of categorical diagnoses, which
have been used extensively in psychiatric genetic analyses, to
situate our discussion within the context of large genomic
initiatives such as the Psychiatric Genomics Consortium (PGC).
Our goal in this review is to summarize current research and
perspectives on how assessment and ascertainment strategies
impact genetic findings for both individual disorders (Table 1) and,
in turn, cross-disorder genetic sharing and the genetics of
comorbidity and disease trajectories (Table 2). After examining
the impact of various common assessment and ascertainment
methods, we conclude with recommendations for collecting new
genomic data and conducting rigorous genetic analyses in the
future.
Received: 19 June 2024 Revised: 7 November 2024 Accepted: 16 December 2024
1 Helmholtz Pioneer Campus, Helmholtz Munich, Neuherberg, Germany.
2 Computational Health Centre, Helmholtz Munich, Neuherberg, Germany.
3 School of Medicine and Health, Technical University of Munich, Munich, Germany.
4 Department of Psychiatry and Behavioral Sciences, Texas A&M University, College Station, TX, USA.
5 Centre of Precision Psychiatry, University of Oslo, Oslo, Norway.
6 Division of Mental Health and Addiction, Oslo University Hospital, Oslo, Norway.
7 KG Jebsen Centre for Neurodevelopmental disorders, University of Oslo, Oslo, Norway.
8 Department of Cognitive Neuroscience, Donders Institute for Brain, Cognition and Behavior, Radboud University Medical Center, Nijmegen, The Netherlands.
9 Karakter Child and Adolescent University Center, Nijmegen, The Netherlands.
10 Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN, USA.
11 Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA.
12 Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA.
13 Departments of Psychiatry and Genetics, University of Pennsylvania, Philadelphia, PA, USA.
14 Lifespan Brain Institute at Penn Med and the Children's Hospital of Philadelphia, Philadelphia, PA, USA.
15 Institute for Behavioral Genetics, University of Colorado Boulder, Boulder, CO, USA.
16 Department of Psychology and Neuroscience, University of Colorado Boulder, Boulder, CO, USA.
17 Department of Psychiatry & Behavioral Health, Stony Brook University, Stony Brook, NY, USA.
18 Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
19 Department of Psychiatry, Harvard Medical School, Boston, MA, USA.
20 Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
21 Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
22 Department of Community Health and Epidemiology and Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada.
23 Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital of Munich, Munich, Germany.
24 Department of Biomedicine, Aarhus University, Aarhus, Denmark.
25 Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA.
26 Department of Psychiatry, Indiana University School of Medicine, Indianapolis, IN, USA.
27 Stark Neurosciences Research Institute, Indiana University School of Medicine, Indianapolis, IN, USA.
28 Department of Psychiatry, Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands.
29 Amsterdam Public Health, Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands.
30 Department of Psychology, University of Texas at Austin, Austin, TX, USA.
31 Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
32 These authors contributed equally: Na Cai, Brad Verhulst.
email: kenneth.kendler@vcuhealth.org
ASSESSMENT OF INDIVIDUAL PSYCHIATRIC DISORDERS
Diagnoses of psychiatric disorders used in genomic studies are
obtained from a variety of research designs: structured interviews
administered by clinicians or trained research staff, self-administered
questionnaires on current and lifetime worst-episode symptoms [6,7],
self-reports of a prior or current diagnosis or treatment [7], and
diagnostic codes from electronic health records (EHRs) [8–10] or
registries [11]. The reliability and validity of psychiatric diagnoses
depend on how these assessment strategies vary across three primary domains.
The first domain is the depth of clinical detail on which a
diagnosis is based. Structured interviews epitomize deep
phenotyping, providing rich information on the clinical characteristics
used to assign a diagnosis. Established instruments, such as
the Structured Clinical Interview for DSM (SCID) [12] or the
Composite International Diagnostic Interview (CIDI) [13], assess all
symptoms, functional impairment, and exclusion criteria required
for a DSM [14] or ICD [15] diagnosis. Some studies use the
Operational Criteria Checklist [16], which leverages multiple
operational diagnostic systems to enable consensus best-estimate
procedures [17]. Such "deep" phenotyping was widely
applied in the initial phases of the Psychiatric Genomics
Consortium (PGC) meta-analyses [18–20]. This approach results
in diagnoses that reflect current clinical standards and enables
investigations into clinical heterogeneity. Supplementing deep
phenotyping with assessments of other relevant psychiatric
disorders, personality traits [21], early life factors [22], and stressful
life events [23] further enables investigations into psychological
and environmental correlates of disorders [24]. Conversely,
"shallow" assessments allow us to quickly obtain large, inexpensive
samples, accelerating gene-discovery efforts by increasing
statistical power. However, shallow assessments, such as very
short screening tools (one- to four-item scales) [25], while
correlated with structured interviews, often yield high false
positive rates [25], jeopardizing their clinical validity. Between
these extremes exists a spectrum of assessment methods that vary
in their depth, including self-reported symptom-based questionnaires,
self-reported professional diagnoses or treatment, diagnostic
codes (ICD9, ICD10), hospital visits, prescription records, and
insurance claims based on clinical assessments from EHRs. These
assessment techniques vary widely in their reliability and validity.
For example, diagnoses based on brief internet surveys may have
questionable clinical validity while, for some disorders, online
assessment instruments that assess a full set of diagnostic criteria
have better psychometric properties [26]. Alternatively, diagnoses
derived from prescriptions of restricted drugs, such as clozapine
for treatment-resistant schizophrenia (SZ), can be highly valid
[18,27,28]. Lower levels of reliability and clinical
validity of shallow assessments may result in the misclassification
of sub-clinical respondents as cases, influencing both genetic
associations with the primary diagnosis and subsequent genetic
correlations with comorbid conditions [29–32].
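The dilution that misclassification can cause is easy to illustrate with a toy
simulation. The sketch below is not taken from any cited study: it assumes a
simple liability-threshold model, a standardized genetic score that explains a
fraction `h2` of liability, and a "shallow" assessment that labels every true
case correctly but also mislabels a random fraction of non-cases as cases. All
names and parameter values (`simulate`, `h2`, `threshold`,
`false_positive_rate`) are hypothetical, and real misdiagnosis need not be random.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def simulate(n=100_000, h2=0.2, threshold=1.28, false_positive_rate=0.10, seed=1):
    """Compare the case-control difference in a genetic score under a
    'deep' (error-free) and a 'shallow' (false-positive-prone) assessment.

    Assumed model: liability = sqrt(h2) * score + sqrt(1 - h2) * noise, and
    a person is a true case when liability exceeds `threshold`.
    """
    rng = random.Random(seed)
    deep_cases, deep_controls = [], []
    shallow_cases, shallow_controls = [], []
    for _ in range(n):
        score = rng.gauss(0, 1)
        liability = h2 ** 0.5 * score + (1 - h2) ** 0.5 * rng.gauss(0, 1)
        is_case = liability > threshold
        # Deep assessment: labels match true status.
        (deep_cases if is_case else deep_controls).append(score)
        # Shallow assessment: true cases kept, some non-cases mislabelled.
        labelled_case = is_case or rng.random() < false_positive_rate
        (shallow_cases if labelled_case else shallow_controls).append(score)
    deep_diff = mean(deep_cases) - mean(deep_controls)
    shallow_diff = mean(shallow_cases) - mean(shallow_controls)
    return deep_diff, shallow_diff


if __name__ == "__main__":
    deep, shallow = simulate()
    print(f"deep assessment difference:    {deep:.3f}")
    print(f"shallow assessment difference: {shallow:.3f}")
```

Under these assumed parameters the shallow-assessment difference should come out
clearly smaller than the deep-assessment difference, which is the attenuation
described above. Non-random misclassification (for example, toward a comorbid
disorder) would need a richer model than this sketch.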
Table 1. Summary of assessment and ascertainment strategies for individual psychiatric disorder diagnoses in case-control cohorts, volunteer-based biobanks, and electronic health records (EHRs) or registries.
Columns: Case-control | Volunteer-based biobanks | Electronic health records (EHRs), registries

General: Sample size
  Case-control: 100–1000s
  Volunteer-based biobanks: 100,000s
  EHRs, registries: Millions
General: Sampling
  Case-control: Hospital-based clinical samples may be more severe than community samples
  Volunteer-based biobanks: Volunteers may be healthier and have higher socioeconomic status than population average
  EHRs, registries: Individuals accessing health system may be more unwell and have more comorbidity than population average
Assessment: Depth
  Case-control: Usually deep, using fully structured interviews
  Volunteer-based biobanks: Varies, ranging from shallow single-item self-reports of illness, diagnosis, or treatment to deeper self-administered symptom-based questionnaires
  EHRs, registries: Usually a diagnostic code, doctors' notes may be available, secondary information such as prescription codes may be available
Assessment: Source
  Case-control: Clinicians or trained interviewers assess patients, parents, and/or teachers
  Volunteer-based biobanks: Self-reported questionnaires (online or paper), or interviews at a collection center
  EHRs, registries: Clinician, nurses, insurance codes
Assessment: Timeframe
  Case-control: Cross-sectional
  Volunteer-based biobanks: Cross-sectional, sometimes repeated assessments through voluntary recontact
  EHRs, registries: Longitudinal, length of follow-up based on the frequency of contact with the health system
Ascertainment: Case ascertainment
  Case-control: Usually severe, may have screened and excluded other psychiatric conditions
  Volunteer-based biobanks: Cases may be less severe with less loss of function; comorbidity information may be available
  EHRs, registries: Cases may be more severe and comorbid, there usually is comorbidity information
Ascertainment: Control ascertainment
  Case-control: Usually screened, may exclude cases of other psychiatric conditions
  Volunteer-based biobanks: Sometimes screened, but may contain some cases due to mis-reporting or memory biases
  EHRs, registries: Usually unscreened, may contain some cases due to diagnostic biases
Table 2. Sources of biases in individual disorder genetics, their potential impact on cross-disorder genetic studies, mitigating factors in analyses, and recommendations for future data collection taking these biases and their effects into account.
Columns: Source of bias | Potential effect on cross-disorder analysis | Strategies for mitigation | Future data collection recommendations

Depth
  Source of bias: Diagnoses from shallow phenotyping assessments are usually less valid, incurring high levels of misdiagnosis that may not be random
  Potential effect on cross-disorder analysis: Nonspecific genetic effects on individual disorders; inflated rG between disorders in the case of bidirectional misdiagnosis between two disorders; mixture of unknown biases in shared genetic effects identified between two disorders
  Strategies for mitigation: Assess the replicability or heterogeneity of effects across assessment strategies; assess specificity of polygenic risk scores (PRS); methods to combine different measures while maintaining specificity: (1) LT-FH, (2) MTAG, (3) Genomic SEM, (4) imputation
  Future data collection recommendations: Use deep phenotyping where possible; use brief self-report versions of full diagnostic criteria if only shallow phenotyping is possible; repeated assessments; expand collection of data to include non-DSM symptoms and non-diagnostic information to supplement clinical characteristics obtained
Source
  Source of bias: Diagnoses made by different sources may have different levels of validity and biases; concordance between diagnoses made by different sources may differ by disorder
  Future data collection recommendations: Assessments by trained mental health professionals who are familiar with the relevant symptomatology of individual disorders and their usual comorbidities; establish quality control of interviewers; complement interviews with doctors' notes, prescription and other medical records. For online assessment, avoid single item screens and use brief measures that assess full diagnostic criteria
Timeframe
  Source of bias: Diagnoses made in different timeframes may reflect subsyndromal states or lifetime liabilities to disorder; effects compound with those from source of info and depth of assessment; effect of timeframe may be disorder specific
  Future data collection recommendations: Focus on diagnoses made with assessments of lifetime, not current, symptoms; repeated assessments
EHRs
  Source of bias: Only capture those who interact with the healthcare system, who may be unhealthier while having higher socioeconomic status than the population
  Potential effect on cross-disorder analysis: Genetics of individual disorders unrepresentative of disorder in the population; rG between disorders may be inflated if they share common ascertainment patterns; disorders with different levels of dysfunction may not share ascertainment patterns, leading to deflation in their rGs
  Strategies for mitigation: Epidemiologically verify disorder validity using known relationships with non-clinical factors; inverse probability (IP) weightings to improve representativeness (up-weighting participants with features identified to be associated with lower participation)
  Future data collection recommendations: Collection of non-clinical epidemiological information, collection of repeated measurements; for international studies, pay attention to translations of assessment instruments and, when possible, assess your success through measurement non-invariance techniques
Biobanks
  Source of bias: Only capture those who volunteer to participate, who may be healthier, better educated, and of higher socioeconomic status than the population
Case-control cohorts
  Source of bias: Biased toward treatment-seeking, high severity, excess comorbidity, and treatment non-responsiveness
  Potential effect on cross-disorder analysis: Exaggerated case-control differences; may proportionally deflate rG between disorders, depending on genetic sharing between high severity forms of both disorders
  Strategies for mitigation: Assess extent of biases to understand differences between inclusion and exclusion criteria
  Future data collection recommendations: Design an ascertainment frame for cases that avoids oversampling of those with severe and/or treatment resistant illness
Case-control cohorts (continued)
  Source of bias: The exclusion of prior or lifetime diagnoses of other disorders
  Potential effect on cross-disorder analysis: Deflates rG between disorders, as those with shared genetics are explicitly removed
  Future data collection recommendations: Collect information on prior or lifetime diagnosis of other disorders to assess their impact on individual disorder liability and cross disorder sharing
The second domain is the source of the assessment. Assessments
of psychiatric disorders may come from clinicians (e.g.,
psychiatrists, other physicians, psychologists), trained research
staff, self-reports, and relative or teacher reports [33–37]. The
reliability and clinical validity of the psychiatric assessments vary
as a function of the expertise of the interviewer, especially if the
training or background of the interviewers enables them to create
a sense of safety or rapport that allows the respondent to answer
honestly, even for embarrassing topics. Consistency between
trained psychiatrists and primary care physicians varies but is
often high with repeated examinations [38,39]. Diagnostic
interviews conducted by trained research staff using semi-structured
interviews such as the CIDI have been shown to have
high validity when compared with structured interviews by
clinicians [40]. However, diagnoses based on clinician ratings
show significant differences from those relying on self-report
[41,42], with self-reports often being more severe [43,44].
Furthermore, genetic analyses find that self-reports [45,46]
capture non-specific genetic effects and miss a significant portion
of the genetic contributions to clinically defined disorders [45–48].
It remains unknown whether differences in validity between
clinical and self-report diagnoses can be compensated for by
repeated assessments [49]. Notably, the validity of self-reports can
be influenced by disease-, symptom- and individual-specific
factors that depend on a respondent's comprehension of the
questionnaire, motivation, and ability to answer accurately [50].
These self-report biases may be related to personality traits [51] or
specific psychiatric symptoms [52] (which may influence disorder
vulnerability), potentially impacting the reliability and
generalizability of research findings.
The third domain is the time frame of the assessment. Genomic
studies have started to explore how genetic variants affect
temporal features of psychiatric disorders. Notably, lifetime
diagnoses tend to be more heritable than current diagnoses
[53,54]. Genetic analyses demonstrate that self-reported current
symptoms assessed by the Patient Health Questionnaire 9 are more
reflective of subsyndromal dysphoria that is related to stressful life
events and neuroticism, while self-reported worst-episode symptoms
assessed through the CIDI Short Form [55] show greater
genetic sharing with major depressive disorder (MDD). This
suggests that using current symptoms to identify genetic contributions
to disorders is likely to yield findings with low specificity, and that
current symptoms may be best limited to making current diagnoses.
Alternatively, lifetime symptoms and diagnoses may be modestly
affected by inaccurate recollections or other features of
state-dependent memory [56]. The combination of over- and
under-reporting due to selective recall introduces an unpredictable
mixture of biases that depend on the lifetime prevalence of
subsyndromal symptoms and is confounded with the source of the
information (i.e., self-report vs clinician assessment) [57]. Genomic
studies have also started to explore other temporal features of
psychiatric disorders. For example, age at
onset or recurrence can reflect differences in genetic risk [58–60],
and the timing of assessment relative to disorder onset can
substantially affect genetic findings. More targeted analyses that
isolate the effects of different time scale factors are needed.
As effect sizes of associations between individual genetic
variants and psychiatric phenotypes are usually small, we need
large sample sizes to obtain reproducible results. This means
meta-analyzing data spanning all three assessment domains. The
justification for integrating potentially heterogeneous phenotypes
is usually based on high genetic correlations (rGs) between them.
However, there are notable differences in the rGs among
assessments of different disorders. The reported rGs between SZ
samples collected through different means and populations are
high (>0.9) [61,62] while the rGs between MDD samples are as
low as 0.59 [10]. Ignoring this variability may skew our understanding
of the genetic architecture of individual disorders, rGs
between disorders, and downstream analyses such as
tissue-enrichment of the SNP-based heritability ($h^2_{\mathrm{SNP}}$), and prioritization
of GWAS findings for fine-mapping and drug-target identification.
Table 2. continued
Columns: Source of bias | Potential effect on cross-disorder analysis | Strategies for mitigation | Future data collection recommendations

Screened "super-normal" controls
  Source of bias: Screened for the disorder being studied and other psychiatric disorders not screened out in cases
  Potential effect on cross-disorder analysis: Exaggerates case-control differences; produces spurious co-aggregation between disorders; inflates rG proportional to the population prevalence of the two disorders
  Strategies for mitigation: Predicting disorder liability for unscreened controls
  Future data collection recommendations: Use representative (not super-normal) controls
Unscreened controls
  Source of bias: Containing cases of the target disorder at approximately the population prevalence
  Potential effect on cross-disorder analysis: Genetic associations of the GWAS are downwardly biased, with the magnitude of the bias increasing for more prevalent disorders in the population, affecting rG between disorders accordingly
  Future data collection recommendations: Screen controls as much as possible
How strictly should individual or cross-disorder psychiatric
genetics research rely on deep, clinician-assessed diagnoses based
on established DSM criteria rather than shallow, self-reported
symptoms or EHRs? The DSM is neither perfect nor immutable and
is periodically revised based on advances in the understanding of
the etiology of the disorders. DSM criteria do not, nor are they
designed to, exhaustively capture the diagnostic complexity of
any specific disorder [63,64]. However, DSM-based diagnoses
correspond with current best-practice patient care, providing
reliable assessments and underscoring their clinical validity for
translating research into beneficial patient outcomes. Nevertheless,
dichotomizing individuals into cases and controls discards
potentially valuable information regarding disease severity,
thereby potentially reducing the power to detect genetic
associations. Alternatively, self-reported questionnaires are less
expensive to administer, allowing researchers to collect substantially
more data, increasing statistical power at the potential cost
of clinical reliability and validity. Thus, it is important to consider
supplementing data on current diagnostic criteria with additional
measures, such as self-reports, to identify additional factors that
may play an important role in refining the diagnostic formulations
and subtypes of psychiatric disorders. In many ways deep,
clinician-assessed diagnoses complement shallow, self-reported
measures, and vice versa. The challenge will be to integrate
seemingly disparate assessment methods in a way that maximizes
the clinical validity of structured interviews and the recruitment
potential of self-reported measures. As such, understanding how
different assessment procedures affect empirical findings will
streamline the integration of genomic evidence into future DSM
revisions [65], with the goal of using epistemic iteration to refine
diagnostic criteria [66,67].
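The information lost by dichotomizing can be made concrete with a small
simulation. This is a hedged sketch rather than an analysis from the review: it
assumes a liability-threshold model in which a standardized genetic score
explains a fraction `h2` of a continuous liability, and it treats that liability
as if it were directly observable, which a real dimensional severity measure
only approximates. The function names and parameter values are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def variance_explained(n=50_000, h2=0.2, threshold=1.28, seed=3):
    """Return r^2 of a genetic score with (continuous liability, binary diagnosis).

    Assumed model: liability = sqrt(h2) * score + sqrt(1 - h2) * noise;
    diagnosis = 1 when liability exceeds `threshold`, else 0.
    """
    rng = random.Random(seed)
    scores, liabilities, diagnoses = [], [], []
    for _ in range(n):
        score = rng.gauss(0, 1)
        liability = h2 ** 0.5 * score + (1 - h2) ** 0.5 * rng.gauss(0, 1)
        scores.append(score)
        liabilities.append(liability)
        diagnoses.append(1.0 if liability > threshold else 0.0)
    r2_continuous = pearson(scores, liabilities) ** 2
    r2_binary = pearson(scores, diagnoses) ** 2
    return r2_continuous, r2_binary


if __name__ == "__main__":
    r2_cont, r2_bin = variance_explained()
    print(f"variance explained, continuous liability: {r2_cont:.3f}")
    print(f"variance explained, binary diagnosis:     {r2_bin:.3f}")
```

In this toy setting the score should explain noticeably less variance in the
binary diagnosis than in the continuous liability, so a larger sample would be
needed to detect the same effect with case-control status alone.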
ASCERTAINING CASES AND CONTROLS FOR INDIVIDUAL DISORDERS
Case ascertainment
Strategies for identifying and recruiting individuals who meet
diagnostic criteria for a psychiatric disorder can influence genetic
associations and their interpretations [68]. Ascertainment for
genomic studies primarily occurs in three forms: targeted
recruitments of cases with a specific disorder from clinical or research
settings, sampling from EHRs, and population-based sampling.
While ascertainment strategies are theoretically independent of
assessment methods and the prevalence of the target phenotype,
practical constraints can confound these design factors.
Early in the psychiatric GWAS era, genomic studies primarily
relied on targeted recruitments, requiring the coordination of
networks of mental health professionals to screen patients for a
target disorder, typically employing deep phenotyping [69–71].
This strategy was effective for the initial GWAS of rare disorders,
particularly SZ [72] and bipolar disorder (BD) [73]. Importantly,
participants recruited from clinical settings frequently exhibit
more severe illness than their counterparts in EHR and
population-based studies [74–76]. Targeted approaches are typically the best
way to obtain large numbers of cases of relatively rare disorders
[77,78]. One concern with this approach is whether such samples
are representative, or biased toward treatment-seeking, severity,
excess comorbidity, and/or treatment non-responsiveness. In
addition, the exclusion of cases with other comorbid disorders
(common among core PGC cohorts) likely affects the target disorder's
profile of genetic sharing, dependent on the patterns of comorbidity.
Nonetheless, these ascertainment techniques, underscored by
rigorous assessment methods, contributed to the success of the
early PGC GWAS efforts.
National registries [79,80] and EHRs [81–83] record healthcare
information for everyone in their catchment, making them
effective ascertainment strategies for identifying common and
rare disorders. Patient diagnostic codes available through these
resources can, in some instances, have high validity. For example,
several follow-up clinical studies of cases [84,85] of SZ [86,87],
BD [31], and obsessive compulsive disorder (OCD) [88] in Swedish
and Danish registries and American EHRs have demonstrated
strong validation against DSM criteria. Some EHRs have
comprehensive doctors' notes from individual interviews, which
if carefully coded can augment case-control outcomes for
genetic analyses [32,89,90]. Diagnostic data from EHRs and
registries, however, can be heterogeneous. First, some healthcare
systems use billing codes and base insurance claims or
reimbursements on diagnostic assignment, while others do not.
These incentive structures can create systematic biases in code
assignment [91,92]. Second, diagnoses inferred from administrative
sources (e.g., pharmaceutical records) are indirect, adding
uncertainty into the "case" phenotype. Third, different diagnostic
biases, such as those related to search satisfaction (leading to
underdiagnosis of comorbidities) and diagnostic momentum
(sticking to a previous or working diagnosis even when it is
erroneous) may differentially affect specific psychiatric disorders
[44,93].
EHRs and registries, however, may not be representative,
capturing only those who interact with the healthcare system,
and may oversample individuals with comorbidities and increased
access to healthcare [94,95]. This results in a disproportionate
number of unhealthy individuals in EHRs, depending on the
specific psychiatric disorder [96]. Further, EHRs based on insurance
records, common in the US, may bias the presence of diagnosis or
diagnostic classifications due to variable mobility, socioeconomic
status and access to healthcare. This ascertainment problem can
lead to biased estimates of polygenic score effect sizes. EHRs
also substantially under-represent early-onset disorders such as
autism spectrum disorder [97], especially in females, though
correlates later in life may be informative [84,85]. These
ascertainment problems affect the representativeness of the
samples and can significantly bias cross-disorder genetic
results [96]. Finally, registries or
EHRs may not contain information that provides a psychosocial
context for the patient's illness. Nonetheless, innovative ways to
utilize EHR and registry data have potential for case identification
[98–100].
Population-based biobanks are a common non-targeted means
to collect data on psychiatric disorders [101]. They have proven
particularly useful for genomic analyses of common psychiatric
disorders that are amenable to large-scale data collection using
self-administered questionnaires with varying depth and time
frames of assessment [45,55]. However, population-based
recruitment is sensitive to healthy volunteer biases. For example,
the UK Biobank [102] invited approximately 9 million individuals
to participate but only recruited 500,000 respondents (5.5%
response rate), who are more likely to be older, female, living in
less socioeconomically deprived areas, and reporting fewer
physical and mental health conditions than the general population
in the UK [101,103]. Many studies have shown that this
"healthy volunteer bias" distorts the associations among phenotypes
[104–106], and with genetic variants [107] that are
associated with self-selection. Notably, several genetic variants
that are associated with self-selection are also associated with
psychiatric disorders [108–111]. Unless adequately mitigated
through statistical approaches [104,112–115] or validated through
experimental means [112], genetic findings from volunteer
samples may compound biases [104]. Despite these limitations,
population-based biobanks have made important contributions to
progress in psychiatric genetics.
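One family of statistical approaches to volunteer bias is inverse probability
weighting, in which participants from under-represented groups are up-weighted.
The sketch below illustrates the idea only; it is not the procedure used by any
cited study. It assumes a single binary covariate (`high_ses`) that drives both
disorder risk and participation, and it assumes the participation probabilities
are known exactly, whereas in practice they must be estimated from reference
data. All names and numbers are hypothetical.

```python
import random


def ipw_demo(n=200_000, seed=11):
    """Return (population prevalence, naive sample prevalence, IPW prevalence).

    Assumed toy population: half of people have high socioeconomic status
    (SES); high-SES people are less often affected but more likely to volunteer.
    """
    rng = random.Random(seed)
    participation = {True: 0.10, False: 0.02}  # P(volunteers | high_ses)
    risk = {True: 0.10, False: 0.30}           # P(affected | high_ses)
    population_cases = 0
    sample = []  # (affected, weight) for each volunteer
    for _ in range(n):
        high_ses = rng.random() < 0.5
        affected = rng.random() < risk[high_ses]
        population_cases += affected
        if rng.random() < participation[high_ses]:
            # Weight each volunteer by the inverse of their participation probability.
            sample.append((affected, 1.0 / participation[high_ses]))
    naive = sum(affected for affected, _ in sample) / len(sample)
    weighted = (sum(affected * weight for affected, weight in sample)
                / sum(weight for _, weight in sample))
    return population_cases / n, naive, weighted


if __name__ == "__main__":
    truth, naive, weighted = ipw_demo()
    print(f"population prevalence: {truth:.3f}")
    print(f"naive sample estimate: {naive:.3f}")
    print(f"IPW estimate:          {weighted:.3f}")
```

With these assumed inputs, the naive sample estimate should understate the
population prevalence, while the weighted estimate should land close to it. The
correction is only as good as the participation model, so unmeasured drivers of
self-selection would leave residual bias.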
Control ascertainment
While the recruitment and assessment of cases dominate
ascertainment debates, the selection of controls poses
underappreciated methodological issues [116–118]. In clinical
ascertainment, case and control participants are typically recruited
independently, so case-control differences may be driven by both
disease liability and ascertainment procedures. While the ascertainment
biases discussed above regarding the selection of cases apply
to the selection of controls in a broad sense, there are several
control-specific ascertainment factors that deserve attention. Most
importantly, to identify meaningful case-control differences, controls
should resemble cases in all characteristics except for the
absence of the disorder for which cases are selected. Controls
selected on this principle are referred to as "normal" controls.
However, the collection of controls in many genetic studies
does not follow this principle, and the strategies used are not
always adequately reported [74,75]. In particular, many psychiatric
GWAS use “super-normal controls” who are screened for the
disorder being studied and other psychiatric disorders that are
not screened out of cases [119,120]. Epidemiological studies have
shown that the use of super-normal controls not only exaggerates
case-control differences but can induce familial/genetic correla-
tions in the absence of any true relationships [120]. In family
studies, the use of super-normal controls produces spurious co-
aggregation between disorders, with the magnitude of the bias
increasing proportional to the population prevalence of screened-
out correlated disorders [121]. Simulation studies demonstrate
that the symmetrical use of super-normal controls in GWAS of two
disorders inflates rG proportional to the population prevalence of
the two disorders and the simulated magnitude of the association
[122]. For example, if parallel GWASs of MDD and SUD were
conducted that included the opposite disorder in the cases but
excluded it from the controls, the resulting MDD-SUD rG
would be overestimated.
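The mechanism can be illustrated with a minimal simulation sketch. It assumes a simple liability-threshold model in which disorders A and B are fully independent; the prevalences (15%), the heritability of B (0.5), the sample size, and all function and variable names are hypothetical choices for illustration and are not taken from the cited simulation studies.

```python
import random
from statistics import NormalDist, mean


def simulate_control_screening(n=200_000, prev_a=0.15, prev_b=0.15,
                               h2_b=0.5, seed=1):
    """Compare the mean genetic liability for disorder B between cases of
    disorder A and two kinds of controls, when A and B are independent."""
    rng = random.Random(seed)
    thr_a = NormalDist().inv_cdf(1 - prev_a)  # liability threshold for A
    thr_b = NormalDist().inv_cdf(1 - prev_b)  # liability threshold for B

    g_cases, g_normal, g_super = [], [], []
    for _ in range(n):
        liab_a = rng.gauss(0, 1)              # independent of B by construction
        g_b = rng.gauss(0, h2_b ** 0.5)       # genetic part of the B liability
        liab_b = g_b + rng.gauss(0, (1 - h2_b) ** 0.5)
        has_a, has_b = liab_a > thr_a, liab_b > thr_b
        if has_a:
            g_cases.append(g_b)               # cases of A are not screened for B
        else:
            g_normal.append(g_b)              # "normal" controls: free of A only
            if not has_b:
                g_super.append(g_b)           # "super-normal": free of A and B

    return {
        "cases_vs_normal": mean(g_cases) - mean(g_normal),
        "cases_vs_super_normal": mean(g_cases) - mean(g_super),
    }


result = simulate_control_screening()
print(result)
```

Under these assumptions, the difference against normal controls should be close to zero, whereas the difference against super-normal controls should be clearly positive: cases of A appear to carry excess genetic liability for B only because B was screened out of the controls.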
The problem here, simply put, is that the case-vs-super-normal-control
difference reflects not only case-control differences for the
target disorder but also differences in any traits or diseases that were
asymmetrically screened out of the control group. This will upwardly
bias GWAS effect sizes as a function of the prevalence of the
diseases that are disproportionately screened out of the controls,
compounding biases in analyses that use the summary statistics
[122]. To further complicate the situation, some studies not only
screen controls based on their own phenotype but also on the
phenotypes of close relatives [123]. Alternatively, because screen-
ing potential controls can be effortful and expensive, unscreened
controls have been used in some psychiatric GWAS [124,125]. In
this scenario, the control group may contain cases of the target
disorder at approximately the population prevalence. Here,
without appropriate correction, genetic associations are down-
wardly biased, with the magnitude of the bias increasing for more
prevalent disorders in the population [126].
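The direction of this bias can be shown with a short arithmetic sketch. It assumes that unscreened controls are a random population sample, so that a fraction equal to the prevalence are unrecognized cases; the allele frequencies and function names are hypothetical.

```python
def unscreened_control_frequency(p_case, p_unaffected, prevalence):
    """Expected allele frequency in unscreened controls, assuming they are a
    random population sample containing cases at the population prevalence."""
    return prevalence * p_case + (1 - prevalence) * p_unaffected


def attenuated_difference(p_case, p_unaffected, prevalence):
    """Case-control allele frequency difference when controls are unscreened."""
    return p_case - unscreened_control_frequency(p_case, p_unaffected, prevalence)


# True difference between cases and unaffected individuals is 0.30 - 0.25 = 0.05.
for prevalence in (0.01, 0.15):
    diff = attenuated_difference(0.30, 0.25, prevalence)
    print(prevalence, round(diff, 4))
```

Under this assumption the observed difference equals the true difference multiplied by (1 - prevalence), so the attenuation is negligible for rare disorders and grows with prevalence.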
GOING FORWARD
In GWAS meta-analyses, most of the samples for common
disorders (e.g., MDD) are population-ascertained with shallow
phenotyping, whereas those for less common disorders (e.g. SZ,
BD) are predominantly clinically ascertained or obtained through
EHRs and registries. Thus, biases in GWAS meta-analysis may
operate differently across disorders. This complicates cross-
disorder analyses, where shared genetic effects across disorders
may reflect an unknown mixture of biases due to the different
assessment and ascertainment strategies and true etiologic
overlap between diagnostic entities. While misdiagnosis influences
rGs between genetically related disorders [127], simulation
studies suggest that an implausibly high level of misdiagnosis [3]
would be required to account for the observed rGs between most
pairs of psychiatric disorders in the absence of true genetic
overlap. Nevertheless, lower levels of case misclassification can
inflate rG especially when misdiagnosis occurs for both disorders,
and the magnitude of inflation depends on the magnitude of the
rGs between disorders [45] and their prevalence. Finally, inflation
of rGs can result from other sources including cross-trait
assortative mating [128]. While some of these biases may cancel
each other out, accurately identifying the source of pleiotropy and
comorbidity remains essential for illuminating the shared genetic
architecture of psychiatric disorders. In this section, we summarize
ways to reduce or quantify biases that affect assessment and
ascertainment strategies in both individual and cross-disorder
genetic findings and give recommendations for future data
collection efforts.
Refining phenotypes
Phenotypic quality control substantially increases the validity of
psychiatric diagnoses, including applying stringent clinical criteria
[45], requiring multiple endorsements from different assessment
strategies [49,129], and ensuring consistency of endorsements
across time [130]. For example, correcting for mis-reports in
different measures of alcohol use increases the rGs across different
assessment strategies from 0.79 to >0.9 [130].
We now have a wide range of tools to quantify and compare
the genetic architectures of the same disorder collected through
different assessment and ascertainment strategies [131]. At the
individual locus level, we can assess the replicability or hetero-
geneity of effects across assessment strategies [19,28,132]. At the
genome-wide level, we can assess whether SNP-heritability
estimates of the same disorder are similar across different study
designs, and whether rGs among them are close to unity
[10,45,62]. We can further assess whether polygenic risk scores
(PRSs) from each assessment or ascertainment strategy robustly
associate with scores from the other strategies [8,62]. A recently
derived metric called PRS Pleiotropy takes these approaches
further, by assessing how well a PRS predicts the disorder of
interest relative to other phenotypes (available in biobanks and
EHRs) [133]. With PRS Pleiotropy as a means to assess specificity,
we can identify clinically valid shallow phenotypes (e.g. clozapine
treatment for SZ [18,27,28]) to include in GWAS meta-analyses.
While no single test provides unambiguous evidence of bias,
consistency across multiple tests provides convergent evidence of
stable genetic effects.
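As one concrete instance of a locus-level heterogeneity check, the sketch below computes Cochran's Q for the effect of a single variant estimated under different assessment strategies. This is a standard meta-analytic statistic offered here as an illustration, not necessarily the procedure used in the cited studies; the effect sizes and standard errors are hypothetical.

```python
import math


def cochran_q(betas, ses):
    """Inverse-variance weighted pooled effect, Cochran's Q, and its degrees
    of freedom for per-strategy effect estimates of one variant."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, q, len(betas) - 1


def chi2_sf_df1(q):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom
    (valid only when exactly two estimates are compared)."""
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical log odds ratios for the same variant from a deeply phenotyped
# cohort and from a shallowly phenotyped biobank.
pooled, q, df = cochran_q([0.10, 0.02], [0.02, 0.02])
p_het = chi2_sf_df1(q)
print(pooled, q, df, p_het)
```

A small heterogeneity p-value suggests that the variant's effect differs between assessment strategies. With more than two strategies, the p-value would need a chi-square distribution with the returned degrees of freedom.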
We can also utilize statistical methods that combine genetic
effects from shallow and deep measures to maximally leverage all
data collected for improving GWAS power while maintaining
reasonable specificity. These methods include LT-FH [134] (which
models family history-based liability to disease), MTAG [135] (a
meta-analytical approach leveraging information from collateral
GWAS phenotypes with high rG to target GWAS), and Genomic
SEM [136] (a framework for modeling genetic covariance structure
that can be used to specify common and unique genetic factors
underlying a system of GWAS phenotypes and perform GWAS
discovery on those factors). In contrast to methods that require
carefully choosing input phenotypes, multiple-phenotype imputa-
tion presents a relatively agnostic way to boost sample sizes for
deep measures of a disorder (usually available in only a subset of
individuals in a biobank) [133,137]. Exploring different imputation
approaches, especially non-linear models, can further allow us to
utilize more data modalities (multi-omics [138–141], imaging
[142,143], data from smartphones and wearable devices
[144,145]). Further methodological developments applied to
time-censored and longitudinal data in EHRs may help to refine
diagnostic accuracy beyond missing value imputation [29,92].
Accounting for ascertainment biases
As biases are prevalent and unavoidable, developing methods to
assess and control for them is critical for obtaining generalizable
findings [96]. One way to address a known bias, such as
sex-differential participation, is to stratify GWAS and all subsequent
analyses by the known factor [114,146]. However,
psychiatric disorders and relevant comorbid traits are unlikely to
be biased by a single factor as straightforward as sex-differential
participation, and stratification by factors that are also genetically
regulated may induce collider biases [107,113].
Several studies have proposed the use of inverse probability (IP)
weightings (up-weighting participants with features identified to
be associated with lower participation) [113,147,148] to improve
representativeness of relationships identied between variables of
interest (and interactions between them) in participants of
volunteer-based biobanks [96,104,146]. This approach has been
shown to improve the robustness of GWAS findings, rGs, and
results of Mendelian randomization (MR) [115]. Notably, IP
weighting relies on training feature selection models using
variables affecting participation that are available in both the
unrepresentative dataset (e.g., the UK Biobank) and a representa-
tive dataset from the same population (e.g. the UK Census
microdata [104]). As misspecification of IP weightings may
introduce further biases [113], feature selection for IP models will
vary across different psychiatric disorders based on disease
severity and other known risk factors [115,128,130,146]. Further,
under some circumstances IP weighting may reduce power [149].
Despite these limitations, this approach can be applied to correct
for participation biases in EHRs and cohort studies. Of note, as we
move towards analyzing disease trajectories that involve diag-
nostic conversions and comorbidities, we need to address a
specific form of ascertainment bias: the index event bias [113]. For
example, genetic effects identified as associated with late-onset
BD (the disease incidence) in MDD cases would be biased by
genetic effects associated with MDD diagnosis (the index event)
[150,151]. However, the utility of existing correction methods in
investigations into comorbidities among psychiatric disorders is
limited, as they assume no
correlation or interaction between SNP effects on disease
progression and incidence. Methods for identifying, clustering,
and correcting for incidence have been developed [152,153], but
like IP weighting methods, they are currently low in power.
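The core idea of IP weighting can be sketched in its simplest form, post-stratification on a single categorical variable. The strata, counts, and population shares below are hypothetical; the cited approaches instead estimate participation probabilities from many variables using models trained on a representative reference dataset.

```python
from collections import Counter


def ip_weights(sample_strata, population_share):
    """Weight each participant by the ratio of their stratum's share in the
    population to its share in the sample (inverse participation odds)."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return [population_share[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical volunteer sample that under-represents the low-SES stratum:
# 200 low-SES participants (60 affected) and 800 high-SES (80 affected).
strata = ["low_ses"] * 200 + ["high_ses"] * 800
outcome = [1] * 60 + [0] * 140 + [1] * 80 + [0] * 720
population_share = {"low_ses": 0.5, "high_ses": 0.5}  # assumed reference shares

weights = ip_weights(strata, population_share)
naive = sum(outcome) / len(outcome)
weighted = weighted_mean(outcome, weights)
print(naive, weighted)
```

In this toy example the unweighted prevalence (0.14) understates the value implied by the assumed population composition (0.20). The weighted estimate recovers the latter only because participation here depends on the weighting variable alone, which is the kind of specification assumption the text cautions about.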
Quantifying and correcting for ascertainment biases is an active
area of research [113]. Nevertheless, novel methods are likely to
remain imperfect. As such, sensitivity analyses of genetic
associations are recommended to identify the bounds of worst-
case biases and the minimal level of bias necessary to account for
the genetic findings [154].
Investigating disease trajectories and comorbidities from a
genomic perspective
While most psychiatric disorders have clear developmental
components, developmental processes are just beginning to be
integrated into genomic analyses. Genetic studies of disease
trajectories have become more feasible with the increased
availability of data from biobanks, EHRs and registries linked with
genetic data that may inform the interrelated development of
multiple disorders. Self-reports of first diagnosis from the UK
Biobank [155], for instance, enable the examination of temporal
factors that may affect the comorbidity between symptom criteria
for anxiety disorders and MDD [156] as well as their comorbidities
with non-psychiatric phenotypes [157,158]. Alternatively,
repeated measurements from EHR or registry records provide
the longitudinal elements necessary for prospective genomic
studies [159,160]. Furthermore, there are now large genotyped
prospective samples, not relying on retrospective data [161,162].
When considering the trajectory of disease progression, how
patients are sampled also has major implications for genetic
analyses and comorbidity. A recent longitudinal Swedish study of
cases of MDD, BD, and SZ (using recorded discharges from the
Swedish registry) concluded that “Over time clinical diagnosis and
genetic risk profiles became increasingly consilient” [58]. These
results suggest that genetic correlations between BD and SZ may
be higher in cases examined early versus later in their course of
illness. What might be termed diagnostic error could in part reflect
the clinical development of the disorders over time [59,60].
Records of clinical diagnoses of psychiatric disorders from
millions of individuals in the Swedish and Danish registries have
shown high, though variable, rates of comorbidities between
different pairs of psychiatric disorders [163–165], corroborated by
findings from a Colombian EHR study [147]. Studies using
polygenic risk scores (PRS) [166,167] or family genetic risk scores
(FGRS) [168–171] can investigate patterns of shared genetic risk
between pairs of disorders or their comorbidities. Many interesting
insights confirm previous expectations: FGRS of disorders vary
in their ability to predict comorbid disorders as would be expected
from variation in the prevalence of individual disorders and
genetic correlations between them [165]; MDD cases with higher
FGRS for BD have an elevated rate of conversion to a BD diagnosis
(also generally true for other pairs of disorders) [58]; multinomial
logistic regression using both PRS and FGRS is able to identify
genetic heterogeneities among cases of MDD [170] and ADHD
with different comorbid disorders [166]. Some findings, however,
defy previous expectations and offer new opportunities for
expanding our understanding of psychiatric disorders: other
non-affective psychoses are found to have much lower SZ FGRS
than expected, calling into question their inclusion in SZ analyses
[168]. To date, psychiatric GWAS has not typically stratified
analyses by different patterns of comorbidity. Following from the
PRS and FGRS genetic heterogeneity results, this represents a
promising avenue for future cross-disorder genomic research to
evaluate the extent to which different comorbid presentations
implicate unique biological pathways.
Most psychiatric genetic studies to date have taken a cross-
sectional disease-centric approach, focusing on investigations into
genetic contributions to individual disorders while ignoring
current comorbidities or subsequent conversions to other
disorders. We would hypothesize that phenotypes that share
similar trajectories also share genetic (in addition to environmental)
precursors. Not all diagnostic switches (defined to be
conversions among disorders that are exclusion criteria for each
other in the DSM [172]) may pass this validity test, as they are
based entirely on DSM-defined exclusion criteria that may be
arbitrary. Disease trajectory analyses, therefore, present important
opportunities for improving and refining disease nosology and
DSM criteria. In fact, taking the trajectory-centric approach may
enable us to get traction on potential biases that might otherwise
inflate (or deflate) estimates of apparent pleiotropy, such as
cross-sectional misclassifications of two diagnoses with frequent
transitions [173] (e.g. BD and MDD, psychotic disorders and
affective psychoses), and age-related differences in genetic
correlations. Accordingly, we need strategies for keeping analyses
tractable without losing resolution. This may require identifying
biologically interesting questions, defining relevant phenotypes
[58], designing useful data formats [164], and developing
necessary statistical metrics [174]. Statistical approaches devel-
oped for assessing multimorbidity across the entire disease
classification tree, currently employed on first-diagnoses or
inpatient data in the UK Biobank, may also be customized to
accommodate diagnostic criteria specific to psychiatric disorders,
or longitudinal trajectory data in EHRs and registries [175–177].
Recommendations for future data collections
Integrating data from disparate assessment and ascertainment
strategies will continue to pose challenges to psychiatric genetics
in the foreseeable future. While little can be done to alter the
study design choices of existing data, we hope that in planning
future genomic data collection efforts, researchers will consider
the implications that assessment and ascertainment techniques
have on the validity, severity, comorbidity and genetic sharing
across psychiatric disorders.
Diagnostic validity for individual disorders is a necessary but
insufficient condition for any phenotyping approach. Cases and
controls in new cohorts, especially when collected through
different strategies, should demonstrate similar epidemiological
relationships with known risk and protective factors in the
population they are obtained from. For example, SZ cases should
show a range of characteristics including male excess, mean age
of onset in early to mid-20s, and present evidence of poor
premorbid social or educational functioning and impaired social
functioning, in addition to the canonically assessed key symptoms.
Further, tests of the specificity of identified genetic risk (see
above) are also critical.
Deep phenotyping studies will play a vital role in dissecting and
understanding findings from heterogeneous meta-analyses, buttressing
the translation of psychiatric molecular genetic results
into diagnostic and treatment regimens. This is particularly
important for cross-disorder genetic studies, as shallow phenotyp-
ing may be less accurate for some disorders than others. For these,
we recommend: (i) expanding symptom assessment beyond DSM
or ICD criteria to permit the measurement of other relevant clinical
and non-clinical dimensions and/or subtypes that may not be
captured by standard criteria, (ii) hiring trained mental health
interviewers familiar with the relevant symptomatology of the
case sample, (iii) establishing rigorous quality control procedures
for interviewers such as monitored interview recordings by trained
editors, and (iv) where possible, especially for more severe
disorders, complementing interviews with reviews of relevant
clinical records. For such studies we would also recommend
consensus all-sources diagnostic procedures.
Conversely, studies that use non-clinician assessment
approaches will continue to play a key role in recruiting large
samples that are necessary for genomic analyses. For these
studies, we recommend: (i) avoiding single item screens and prior
treatment- or diagnosis-based questions (e.g., “Have you ever
been diagnosed with …”) in favor of brief self-report versions of
full diagnostic criteria, some of which have been validated in
genetic designs [178,179]; (ii) remaining cognizant of the
potential for misdiagnosis especially with regard to false positives
and negatives for standard screens for psychotic symptoms [180];
(iii) recognizing the impact of the time-frame of assessment,
recalling that, overall, lifetime measures are likely to be more
genetically informative, and (iv) utilizing modular assessment
designs that allow participants to be recontacted to obtain more
detailed assessments where necessary or followed-up for long-
itudinal assessments and trajectory analyses.
Selecting ascertainment strategies for psychiatric genomic
investigations will likely be guided by the researchers’ access to
data. However, it is important to keep the corresponding
ascertainment biases in mind when analyzing genomic data.
Furthermore, we recommend (i) using representative (not super-
normal) controls, (ii) developing an ascertainment frame for cases
that avoids oversampling severe and/or treatment resistant illness
unless that is a specific focus of the design, and (iii) when possible,
assessing phenotypes through measurement non-invariance
techniques.
Finally, we call for greater efforts recruiting cohorts from
diverse ancestries and environments. Most genetic studies have
been performed on individuals of European descent who have
relatively easy access to healthcare. Not only do we need to
increase data collection in previously underrepresented commu-
nities, we must also pay careful attention to the translation of
assessment instruments and, where necessary, design and
benchmark new data collection protocols to address language
and cultural differences. Further, with the increasing use of
electronic health records in genetic research, we would like to
urge the greater research community, not just in psychiatric
genetics [82,181–183], to investigate the social determinants
that bias representation of different communities in these
resources [184,185]. Such biases can skew our understanding
of disorder risk and comorbidities, and if uncorrected, result in
increasing healthcare disparities [186].
CONCLUSIONS
Over the last 15 years, robust genetic associations have been
identified for numerous psychiatric disorders, both under the
auspices of the PGC and in independent studies. As we move into
an era of historically large sample sizes in the genomic sciences, it
is essential that we avoid assuming that larger samples will
overcome biases and remain vigilant to the challenges associated
with various measurement and ascertainment approaches in
studies contributing to large meta-analyses. The translation of
genetic findings into novel diagnostic techniques and treatment
regimens for psychiatric disorders is predicated on valid
assessment techniques and unbiased ascertainment strategies,
as well as statistical methods to analyze genomic data. The aim,
which we should always keep in mind, is identifying loci affecting
risk for the disorders and disaggregating pleiotropic from
disorder-specific variants. This will enable us to understand the
biological mechanisms of individual psychiatric disorders and their
comorbidity and serve as the foundation for improvements in
diagnoses and individualized treatments of patients living with
mental illness.
REFERENCES
1. Kendler KS, Aggen SH, Knudsen GP, Røysamb E, Neale MC, Reichborn-Kjennerud
T. The structure of genetic and environmental risk factors for syndromal and
subsyndromal common DSM-IV axis I and all axis II disorders. Am J Psychiatry.
2011;168:29–39.
2. Pettersson E, Lichtenstein P, Larsson H, Song J, Attention Deficit/Hyperactivity
Disorder Working Group of the iPSYCH-Broad-PGC Consortium, Autism Spectrum
Disorder Working Group of the iPSYCH-Broad-PGC Consortium, Bipolar
Disorder Working Group of the PGC, Eating Disorder Working Group of the PGC,
Major Depressive Disorder Working Group of the PGC, Obsessive Compulsive
Disorders and Tourette Syndrome Working Group of the PGC, Schizophrenia
CLOZUK, Substance Use Disorder Working Group of the PGC, Agrawal A, et al.
Genetic influences on eight psychiatric disorders based on family data of 4 408
646 full and half-siblings, and genetic data of 333 748 cases and controls.
Psychol Med. 2019;49:1166–73.
3. Brainstorm Consortium, Anttila V, Bulik-Sullivan B, Finucane HK, Walters RK, Bras
J, et al. Analysis of shared heritability in common disorders of the brain. Science.
2018;360:eaap8757.
4. Grotzinger AD, Mallard TT, Akingbuwa WA, Ip HF, Adams MJ, Lewis CM, et al.
Genetic architecture of 11 major psychiatric disorders at biobehavioral,
functional genomic and molecular genetic levels of analysis. Nat Genet.
2022;54:548–59.
5. Cross-Disorder Group of the Psychiatric Genomics Consortium. Electronic
address: plee0@mgh.harvard.edu, Cross-Disorder Group of the Psychiatric
Genomics Consortium. Genomic relationships, novel loci, and pleiotropic
mechanisms across eight psychiatric disorders. Cell. 2019;179:1469–82.e11.
6. Howard DM, Adams MJ, Shirali M, Clarke T-K, Marioni RE, Davies G, et al.
Genome-wide association study of depression phenotypes in UK Biobank
identifies variants in excitatory synaptic pathways. Nat Commun. 2018;9:1470.
7. Hyde CL, Nagle MW, Tian C, Chen X, Paciga SA, Wendland JR, et al. Identification
of 15 genetic loci associated with risk of major depression in individuals of
European descent. Nat Genet. 2016;48:1031–6.
8. Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, et al.
Genome-wide association analyses identify 44 risk variants and refine the
genetic architecture of major depression. Nat Genet. 2018;50:668–81.
9. Howard DM, Adams MJ, Clarke T-K, Hafferty JD, Gibson J, Shirali M, et al.
Genome-wide meta-analysis of depression identifies 102 independent variants
and highlights the importance of the prefrontal brain regions. Nat Neurosci.
2019;22:343–52.
10. Levey DF, Stein MB, Wendt FR, Pathak GA, Zhou H, Aslan M, et al. Bi-ancestral
depression GWAS in the Million Veteran Program and meta-analysis in >1.2
million individuals highlight new therapeutic directions. Nat Neurosci.
2021;24:954–63.
11. Schork AJ, Won H, Appadurai V, Nudel R, Gandal M, Delaneau O, et al. A
genome-wide association study of shared risk across psychiatric disorders
implicates gene regulation during fetal neurodevelopment. Nat Neurosci.
2019;22:353–61.
12. First MB, Williams JBW, Karg RS, Spitzer RL. SCID-5-CV: Structured Clinical
Interview for DSM-5 Disorders: Clinician Version. American Psychiatric Pub; 2015.
13. Wittchen HU. Reliability and validity studies of the WHO-Composite
International Diagnostic Interview (CIDI): a critical review. J Psychiatr Res.
1994;28:57–84.
14. Diagnostic and Statistical Manual of Mental Disorders: DSM-5. Amer Psychiatric
Pub Incorporated; 2013.
15. World Health Organization. The International Statistical Classification of Diseases
and Health Related Problems ICD-10: Tenth Revision. Volume 2: Instruction
Manual. World Health Organization; 2004.
16. Azevedo MH, Soares MJ, Coelho I, Dourado A, Valente J, Macedo A, et al. Using
consensus OPCRIT diagnoses. An efficient procedure for best-estimate lifetime
diagnoses. Br J Psychiatry. 1999;175:154–7.
17. Leckman JF, Sholomskas D, Thompson WD, Belanger A, Weissman MM. Best
estimate of lifetime psychiatric diagnosis: a methodological study. Arch Gen
Psychiatry. 1982;39:879–83.
18. Schizophrenia Working Group of the Psychiatric Genomics Consortium.
Biological insights from 108 schizophrenia-associated genetic loci. Nature.
2014;511:421–7.
19. Mullins N, Forstner AJ, O’Connell KS, Coombes B, Coleman JRI, Qiao Z, et al.
Genome-wide association study of more than 40,000 bipolar disorder cases
provides new insights into the underlying biology. Nat Genet.
2021;53:817–29.
20. Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium,
Ripke S, Wray NR, Lewis CM, Hamilton SP, Weissman MM, et al. A mega-analysis
of genome-wide association studies for major depressive disorder.
Mol Psychiatry. 2013;18:497–511.
21. Eysenck HJ, Eysenck SBG. Eysenck personality inventory. PsycTESTS Dataset.
2016.
22. Parker G, Tupling H, Brown LB. A parental bonding instrument. Br J Med Psychol.
1979;52:1–10.
23. Goodman LA, Corcoran C, Turner K, Yuan N, Green BL. Stressful life events
screening questionnaire. PsycTESTS Dataset. 2011.
24. CONVERGE consortium. Sparse whole-genome sequencing identifies two loci
for major depressive disorder. Nature. 2015;523:588–91.
25. Mitchell AJ, Coyne JC. Do ultra-short screening instruments accurately detect
depression in primary care? A pooled analysis and meta-analysis of 22 studies.
Br J Gen Pract. 2007;57:144–51.
26. van Ballegooijen W, Riper H, Cuijpers P, van Oppen P, Smit JH. Validation of
online psychometric instruments for common mental health disorders: a
systematic review. BMC Psychiatry. 2016;16:45.
27. Rees E, Walters JTR, Georgieva L, Isles AR, Chambert KD, Richards AL, et al.
Analysis of copy number variations at 15 schizophrenia-associated loci. Br J
Psychiatry. 2014;204:108–14.
28. Hamshere ML, Walters JTR, Smith R, Richards AL, Green E, Grozeva D, et al.
Genome-wide significant associations in schizophrenia to ITIH3/4, CACNA1C
and SDCCAG8, and extensive replication of associations reported by the
Schizophrenia PGC. Mol Psychiatry. 2013;18:708–12.
29. Smoller JW. The use of electronic health records for psychiatric phenotyping
and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177:601–12.
30. Madden JM, Lakoma MD, Rusinak D, Lu CY, Soumerai SB. Missing clinical and
behavioral health data in a large electronic health record (EHR) system. J Am
Med Inform Assoc. 2016;23:1143–9.
31. Sellgren C, Landén M, Lichtenstein P, Hultman CM, Långström N. Validity of
bipolar disorder hospital discharge diagnoses: file review and multiple register
linkage in Sweden. Acta Psychiatr Scand. 2011;124:447–53.
32. Castro VM, Minnier J, Murphy SN, Kohane I, Churchill SE, Gainer V, et al.
Validation of electronic health record phenotyping of bipolar disorder cases and
controls. Am J Psychiatry. 2015;172:363–72.
33. Thapar A, Harrington R, Ross K, McGuffin P. Does the definition of ADHD affect
heritability? J Am Acad Child Adolesc Psychiatry. 2000;39:1528–36.
34. Overgaard KR, Oerbeck B, Friis S, Pripp AH, Aase H, Zeiner P. Predictive validity
of attention-deficit/hyperactivity disorder from ages 3 to 5 Years. Eur Child
Adolesc Psychiatry. 2022;31:1–10.
35. Merwood A, Greven CU, Price TS, Rijsdijk F, Kuntsi J, McLoughlin G, et al.
Different heritabilities but shared etiological influences for parent, teacher and
self-ratings of ADHD symptoms: an adolescent twin study. Psychol Med.
2013;43:1973–84.
36. Ip HF, van der Laan CM, Krapohl EML, Brikell I, Sánchez-Mora C, Nolte IM, et al.
Genetic association study of childhood aggression across raters, instruments,
and age. Transl Psychiatry. 2021;11:413.
37. Van der Laan CM, Ip HF, Schipper M, Hottenga J-J, Krapohl EML, Brikell I, et al.
Meta-analysis of genome wide association studies on childhood ADHD
symptoms and diagnosis reveals 17 novel loci and 22 potential effector genes.
bioRxiv. 2024.
38. Kendler KS, Ohlsson H, Bacanu S, Sundquist J, Sundquist K. Differences in
genetic risk score profiles for drug use disorder, major depression, and ADHD as
a function of sex, age at onset, recurrence, mode of ascertainment, and
treatment. Psychol Med. 2023;53:3448–60.
39. Mitchell AJ, Vaze A, Rao S. Clinical diagnosis of depression in primary care: a
meta-analysis. Lancet. 2009;374:609–19.
40. Kessler RC, Abelson J, Demler O, Escobar JI, Gibbon M, Guyer ME, et al. Clinical
calibration of DSM-IV diagnoses in the World Mental Health (WMH) version of
the World Health Organization (WHO) Composite International Diagnostic
Interview (WMHCIDI). Int J Methods Psychiatr Res. 2004;13:122–39.
41. Sayer NA, Sackeim HA, Moeller JR, Prudic J, Devanand DP, Coleman EA, et al. The
relations between observer-rating and self-report of depressive
symptomatology. Psychol Assess. 1993;5:350–60.
42. von Glischinski M, von Brachel R, Thiele C, Hirschfeld G. Not sad enough for a
depression trial? A systematic review of depression measures and cut points in
clinical trial registrations. J Affect Disord. 2021;292:36–44.
43. Thombs BD, Kwakkenbos L, Levis AW, Benedetti A. Addressing overestimation
of the prevalence of depression based on self-report screening questionnaires.
CMAJ. 2018;190:E44–E49.
44. Fried EI, Flake JK, Robinaugh DJ. Revisiting the theoretical and methodological
foundations of depression measurement. Nat Rev Psychol. 2022;1:358–68.
45. Cai N, Revez JA, Adams MJ, Andlauer TFM, Breen G, Byrne EM, et al. Minimal
phenotyping yields genome-wide association signals of low specificity for major
depression. Nat Genet. 2020;52:437–47.
46. Davies MR, Buckman JEJ, Adey BN, Armour C, Bradley JR, Curzons SCB, et al.
Comparison of symptom-based versus self-reported diagnostic measures of
anxiety and depression disorders in the GLAD and COPING cohorts. J Anxiety
Disord. 2022;85:102491.
47. Kendler KS, Gardner CO, Neale MC, Aggen S, Heath A, Colodro-Conde L, et al.
Shared and specic genetic risk factors for lifetime major depression, depressive
symptoms and neuroticism in three population-based twin samples. Psychol
Med. 2019;49:274553.
48. Dahl A, Thompson M, An U, Krebs M, Appadurai V, Border R, et al. Phenotype
integration improves power and preserves specicity in biobank-based genetic
studies of major depressive disorder. Nat Genet. 2023;55:208293.
49. Glanville KP, Coleman JRI, Howard DM, Pain O, Hanscombe KB, Jermy B, et al.
Multiple measures of depression to enhance validity of major depressive dis-
order in the UK Biobank. BJPsych Open. 2021;7:e44.
50. Stone AA, Bachrach CA, Jobe JB, Kurtzman HS, Cain VS. The Science of Self-
report: Implications for Research and Practice. Psychology Press; (1999).
51. Kendler KS, Prescott CA, Jacobson K, Myers J, Neale MC. The joint analysis of
personal interview and family history diagnoses: evidence for validity of diag-
nosis and increased heritability estimates. Psychol Med. 2002;32:82942.
52. Heath AC, Neale MC, Kessler RC, Eaves LJ, Kendler KS. Evidence for genetic
influences on personality from self-reports and informant ratings. J Pers Soc
Psychol. 1992;63:85–96.
53. Cheesman R, Major Depressive Disorder Working Group of the Psychiatric
Genomics Consortium, Purves KL, Pingault J-B, Breen G, Rijsdijk F, et al.
Extracting stability increases the SNP heritability of emotional problems in
young people. Transl Psychiatry. 2018;8:223.
54. Zavos HMS, Gregory AM, Eley TC. Longitudinal genetic analysis of anxiety
sensitivity. Dev Psychol. 2012;48:204–12.
55. Huang L, Tang S, Rietkerk J, Appadurai V, Krebs MD, Schork AJ, et al. Polygenic
analyses show important differences between MDD symptoms collected using
PHQ9 and CIDI-SF. Biol Psychiatry. 2023. 4 December 2023.
https://doi.org/10.1016/j.biopsych.2023.11.021.
56. Brewin CR, Andrews B, Gotlib IH. Psychopathology and early experience: a
reappraisal of retrospective reports. Psychol Bull. 1993;113:82–98.
57. Levis B, Benedetti A, Ioannidis JPA, Sun Y, Negeri Z, He C, et al. Patient Health
Questionnaire-9 scores do not accurately estimate depression prevalence:
individual participant data meta-analysis. J Clin Epidemiol. 2020;122:115–28.e1.
58. Kendler KS, Ohlsson H, Sundquist J, Sundquist K. Relationship of family genetic
risk score with diagnostic trajectory in a Swedish national sample of incident
cases of major depression, bipolar disorder, other nonaffective psychosis, and
schizophrenia. JAMA Psychiatry. 2023;80:241–9.
59. Feng Y-CA, Ge T, Cordioli M, Ganna A, Smoller JW, Neale BM, et al. Findings and
insights from the genetic investigation of age of first reported occurrence for
complex disorders in the UK Biobank and FinnGen. bioRxiv. (2020).
60. Baker E, Leonenko G, Schmidt KM, Hill M, Myers AJ, Shoai M, et al. What does
heritability of Alzheimer's disease represent? PLoS One. 2023;18:e0281440.
61. Lam M, Chen C-Y, Li Z, Martin AR, Bryois J, Ma X, et al. Comparative genetic
architectures of schizophrenia in East Asian and European populations. Nat
Genet. 2019;51:1670–8.
62. Pardiñas AF, Holmans P, Pocklington AJ, Escott-Price V, Ripke S, Carrera N, et al.
Common schizophrenia alleles are enriched in mutation-intolerant genes and in
regions under strong background selection. Nat Genet. 2018;50:381–9.
63. Kendler KS. DSM disorders and their criteria: how should they inter-relate?
Psychol Med. 2017;47:2054–60.
64. Kendler KS. The Phenomenology of Major Depression and the
Representativeness and Nature of DSM Criteria. Am J Psychiatry. 2016;173:771–80.
65. Kendler KS. A history of the DSM-5 Scientific Review Committee. Psychol Med.
2013;43:1793–1800.
66. Chang H. Inventing Temperature: Measurement and Scientific Progress. Oxford
University Press on Demand; (2004).
67. Kendler KS, Parnas J. Philosophical Issues in Psychiatry II: Nosology. OUP Oxford;
(2012).
68. Trzaskowski M, Mehta D, Peyrot WJ, Hawkes D, Davies D, Howard DM, et al.
Quantifying between-cohort and between-sex genetic heterogeneity in major
depressive disorder. Am J Med Genet B Neuropsychiatr Genet. 2019;180:439–47.
69. Bjornson-Benson WM, Stibolt TB, Manske KA, Zavela KJ, Youtsey DJ, Buist AS.
Monitoring recruitment effectiveness and cost in a clinical trial. Control Clin
Trials. 1993;14:52S–67S.
70. Flint J, Chen Y, Shi S, Kendler KS, CONVERGE consortium. Epilogue: Lessons from
the CONVERGE study of major depressive disorder in China. J Affect Disord.
2012;140:1–5.
71. Lovato LC, Hill K, Hertert S, Hunninghake DB, Probstfield JL. Recruitment for
controlled clinical trials: literature summary and annotated bibliography. Control
Clin Trials. 1997;18:328–52.
72. Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium.
Genome-wide association study identifies five new schizophrenia loci. Nat
Genet. 2011;43:969–76.
73. Psychiatric GWAS Consortium Bipolar Disorder Working Group. Large-scale
genome-wide association analysis of bipolar disorder identifies a new
susceptibility locus near ODZ4. Nat Genet. 2011;43:977–83.
74. Lopez R, Scheutz F, Errboe M, Baelum V. Selection bias in case-control studies on
periodontitis: a systematic review. Eur J Oral Sci. 2007;115:339–43.
75. Malay S, Chung KC. How to use outcomes questionnaires: pearls and pitfalls. Clin
Plast Surg. 2013;40:261–9.
76. Legge SE, Pardiñas AF, Woolway G, Rees E, Cardno AG, Escott-Price V, et al.
Genetic and Phenotypic Features of Schizophrenia in the UK Biobank. JAMA
Psychiatry. 2024;81:681–90.
77. Taherdoost H. Sampling methods in research methodology; How to choose a
sampling technique for research. SSRN Electron J. 2016. 2016.
https://doi.org/10.2139/ssrn.3205035.
78. Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee SH, Ripke S,
Neale BM, Faraone SV, Purcell SM, et al. Genetic relationship between five
psychiatric disorders estimated from genome-wide SNPs. Nat Genet.
2013;45:984–94.
79. Schmidt M, Schmidt SAJ, Sandegaard JL, Ehrenstein V, Pedersen L, Sørensen HT.
The Danish National Patient Registry: a review of content, data quality, and
research potential. Clin Epidemiol. 2015;7:449–90.
80. Ludvigsson JF, Andersson E, Ekbom A, Feychting M, Kim J-L, Reuterwall C, et al.
External review and validation of the Swedish national inpatient register. BMC
Public Health. 2011;11:450.
81. All of Us Research Program Investigators, Denny JC, Rutter JL, Goldstein DB,
Philippakis A, Smoller JW, et al. The “All of Us” Research Program. N Engl J Med.
2019;381:668–76.
82. Roden DM, Pulley JM, Basford MA, Bernard GR, Clayton EW, Balser JR, et al.
Development of a large-scale de-identified DNA biobank to enable personalized
medicine. Clin Pharmacol Ther. 2008;84:362–9.
83. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA, et al. The
electronic medical records and genomics (eMERGE) network: past, present, and
future. Genet Med. 2013;15:761–71.
84. Engelhard MM, Henao R, Berchuck SI, Chen J, Eichner B, Herkert D, et al.
Predictive value of early autism detection models based on electronic health record
data collected before age 1 year. JAMA Netw Open. 2023;6:e2254303.
85. Amit G, Bilu Y, Sudry T, Avgil Tsadok M, Zimmerman DR, Baruch R, et al. Early
prediction of autistic spectrum disorder using developmental surveillance data.
JAMA Netw Open. 2024;7:e2351052.
86. Lichtenstein P, Björk C, Hultman CM, Scolnick E, Sklar P, Sullivan PF. Recurrence
risks for schizophrenia in a Swedish national cohort. Psychol Med.
2006;36:1417–25.
87. Ekholm B, Ekholm A, Adolfsson R, Vares M, Osby U, Sedvall GC, et al. Evaluation
of diagnostic procedures in Swedish patients with schizophrenia and related
psychoses. Nord J Psychiatry. 2005;59:457–64.
88. Rück C, Larsson KJ, Lind K, Perez-Vigil A, Isomura K, Sariaslan A, et al. Validity and
reliability of chronic tic disorder and obsessive-compulsive disorder diagnoses
in the Swedish National Patient Register. BMJ Open. 2015;5:e007520.
89. Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, et al.
Predicting seizure recurrence after an initial seizure-like episode from routine
clinical notes using large language models: a retrospective cohort study. Lancet
Digit Health. 2023;5:e882–e894.
90. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large
language model for electronic health records. NPJ Digit Med. 2022;5:194.
91. Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PRO, Bernstam EV, et al.
Caveats for the use of operational electronic health record data in comparative
effectiveness research. Med Care. 2013;51:S30–S37.
92. Abul-Husn NS, Kenny EE. Personalized medicine and the power of electronic
health records. Cell. 2019;177:58–69.
93. Croskerry P. The importance of cognitive errors in diagnosis and strategies to
minimize them. Acad Med. 2003;78:775–80.
94. Swanson JM. The UK Biobank and selection bias. Lancet. 2012;380:110.
95. Berkson J. Limitations of the application of fourfold table analysis to hospital
data. Biom Bull. 1946;2:47.
96. Lee YH, Thaweethai T, Sheu Y-H, Feng Y-CA, Karlson EW, Ge T, et al. Impact of
selection bias on polygenic risk score estimates in healthcare settings. Psychol
Med. 2023;53:7435–45.
97. Dueñas HR, Seah C, Johnson JS, Huckins LM. Implicit bias of encoded variables:
frameworks for addressing structured bias in EHR-GWAS data. Hum Mol Genet.
2020;29:R33–R41.
98. Goldstein ND. A Researcher's Guide to Using Electronic Health Records: From
Planning to Presentation. CRC Press; (2023).
99. Beaulieu-Jones BK. Machine Learning Methods to Identify Hidden Phenotypes in
the Electronic Health Record. (2017).
100. Polubriaginof FCG, Vanguri R, Quinnies K, Belbin GM, Yahi A, Salmasian H, et al.
Disease heritability inferred from familial relationships reported in medical
records. Cell. 2018;173:1692–704.e11.
101. Davis KAS, Coleman JRI, Adams M, Allen N, Breen G, Cullen B, et al. Mental
health in UK Biobank - development, implementation and results from an online
questionnaire completed by 157 366 participants: a reanalysis. BJPsych Open.
2020;6:e18.
102. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an
open access resource for identifying the causes of a wide range of complex
diseases of middle and old age. PLoS Med. 2015;12:e1001779.
103. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al.
Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank
Participants With Those of the General Population. Am J Epidemiol.
2017;186:1026–34.
104. van Alten S, Domingue BW, Galama T, Marees AT. Reweighting the UK Biobank
to reflect its underlying sampling population substantially reduces pervasive
selection bias due to volunteering. bioRxiv. (2022).
105. Baltes PB, Mayer KU. Die Berliner Altersstudie. Akademie Verlag; (1999).
106. Batty GD, Gale CR, Kivimäki M, Deary IJ, Bell S. Comparison of risk factor
associations in UK Biobank against representative, general population based studies
with conventional response rates: prospective cohort study and individual
participant meta-analysis. BMJ. 2020;368:m131.
107. Munafò MR, Tilling K, Taylor AE, Evans DM, Davey Smith G. Collider scope: when
selection bias can substantially influence observed associations. Int J Epidemiol.
2018;47:226–35.
108. Mignogna G, Carey CE, Wedow R, Baya N, Cordioli M, Pirastu N, et al. Patterns of
item nonresponse behavior to survey questionnaires are systematic and have a
genetic basis. bioRxiv. (2022).
109. Tyrrell J, Zheng J, Beaumont R, Hinton K, Richardson TG, Wood AR, et al. Genetic
predictors of participation in optional components of UK Biobank. Nat Commun.
2021;12:886.
110. Martin J, Tilling K, Hubbard L, Stergiakouli E, Thapar A, Davey Smith G, et al.
Association of genetic risk for schizophrenia with nonparticipation over time in
a population-based cohort study. Am J Epidemiol. 2016;183:1149–58.
111. Adams MJ, Hill WD, Howard DM, Dashti HS, Davis KAS, Campbell A, et al. Factors
associated with sharing e-mail information and mental health survey
participation in large population cohorts. Int J Epidemiol. 2020;49:410–21.
112. Rothman KJ, Gallacher JEJ, Hatch EE. Why representativeness should be
avoided. Int J Epidemiol. 2013;42:1012–4.
113. Mitchell RE, Hartley AE, Walker VM, Gkatzionis A, Yarmolinsky J, Bell JA, et al.
Strategies to investigate and mitigate collider bias in genetic and Mendelian
randomisation studies of disease progression. PLoS Genet. 2023;19:e1010596.
114. Lee H, Han B. A theory-based practical solution to correct for sex-differential
participation bias. Genome Biol. 2022;23:138.
115. Schoeler T, Speed D, Porcu E, Pirastu N, Pingault J-B, Kutalik Z. Participation bias
in the UK Biobank distorts genetic associations and downstream analyses. Nat
Hum Behav. 2023;7:1216–27.
116. Hodge SE, Subaran RL, Weissman MM, Fyer AJ. Designing case-control studies:
decisions about the controls. Am J Psychiatry. 2012;169:785–9.
117. Lubin JH, Gail MH. Biased selection of controls for case-control analyses of
cohort studies. Biometrics. 1984;40:63–75.
118. Wacholder S, McLaughlin JK, Silverman DT, Mandel JS. Selection of controls in
case-control studies. I. Principles. Am J Epidemiol. 1992;135:1019–28.
119. Chen TJH, Blum K, Mathews D, Fisher L, Schnautz N, Braverman ER, et al. Are
dopaminergic genes involved in a predisposition to pathological aggression?
Hypothesizing the importance of “super normal controls” in psychiatricgenetic
research of complex behavioral disorders. Med Hypotheses. 2005;65:703–7.
120. Schwartz S, Susser E. The use of well controls: an unhealthy practice in
psychiatric research. Psychol Med. 2011;41:1127–31.
121. Kendler KS. Toward a scientific psychiatric nosology. Strengths and limitations.
Arch Gen Psychiatry. 1990;47:969–73.
122. Kendler KS, Chatzinakos C, Bacanu S-A. The impact on estimations of genetic
correlations by the use of super-normal, unscreened, and family-history
screened controls in genome wide case-control studies. Genet Epidemiol.
2020;44:283–9.
123. Wray NR, Pergadia ML, Blackwood DHR, Penninx BWJH, Gordon SD, Nyholt DR,
et al. Genome-wide association study of major depressive disorder: new results,
meta-analysis, and lessons learned. Mol Psychiatry. 2012;17:36–48.
124. Kirov G, Zaharieva I, Georgieva L, Moskvina V, Nikolov I, Cichon S, et al. A
genome-wide association study in 574 schizophrenia trios using DNA pooling.
Mol Psychiatry. 2009;14:796–803.
125. O'Donovan MC, Craddock N, Norton N, Williams H, Peirce T, Moskvina V, et al.
Identification of loci associated with schizophrenia by genome-wide association
and follow-up. Nat Genet. 2008;40:1053–5.
126. Peyrot WJ, Boomsma DI, Penninx BWJH, Wray NR. Disease and polygenic
architecture: avoid trio design and appropriately account for unscreened control
subjects for common disease. Am J Hum Genet. 2016;98:382–91.
127. Wray NR, Lee SH, Kendler KS. Impact of diagnostic misclassification on
estimation of genetic correlations using genome-wide genotypes. Eur J Hum Genet.
2012;20:668–74.
128. Border R, Athanasiadis G, Buil A, Schork AJ, Cai N, Young AI, et al. Cross-trait
assortative mating is widespread and inflates genetic correlation estimates.
Science. 2022;378:754–61.
129. Jermy BS, Glanville KP, Coleman JRI, Lewis CM, Vassos E. Exploring the genetic
heterogeneity in major depression across diagnostic criteria. Mol Psychiatry.
2021;26:7337–45.
130. Xue A, Jiang L, Zhu Z, Wray NR, Visscher PM, Zeng J, et al. Genome-wide
analyses of behavioural traits are subject to bias by misreports and longitudinal
changes. Nat Commun. 2021;12:20211.
131. van Rheenen W, Peyrot WJ, Schork AJ, Lee SH, Wray NR. Genetic correlations of
polygenic disease traits: from theory to practice. Nat Rev Genet. 2019;20:567–81.
132. Han B, Eskin E. Random-effects model aimed at discovering associations in
meta-analysis of genome-wide association studies. Am J Hum Genet.
2011;88:586–98.
133. Dahl A, Thompson M, An U, Krebs M, Appadurai V, Border R, et al. Phenotype
integration improves power and preserves specificity in biobank-based genetic
studies of MDD. bioRxiv. (2022).
134. Hujoel MLA, Gazal S, Loh P-R, Patterson N, Price AL. Liability threshold modeling
of case-control status and family history of disease increases association power.
Nat Genet. 2020;52:541–7.
135. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait
analysis of genome-wide association summary statistics using MTAG. Nat Genet.
2018;50:229–37.
136. Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, Mallard TT, Hill WD, et al.
Genomic structural equation modelling provides insights into the multivariate
genetic architecture of complex traits. Nat Hum Behav. 2019;3:513–25.
137. An U, Pazokitoroudi A, Alvarez M, Huang L, Bacanu S, Schork AJ, et al. Deep
learning-based phenotype imputation on population-scale biobank data
increases genetic discoveries. Nat Genet. 2023;55:2269–76.
138. PsychENCODE Consortium, Akbarian S, Liu C, Knowles JA, Vaccarino FM,
Farnham PJ, et al. The PsychENCODE project. Nat Neurosci. 2015;18:1707–12.
139. Wang D, Liu S, Warrell J, Won H, Shi X, Navarro FCP, et al. Comprehensive
functional genomic resource and integrative model for the human brain.
Science. 2018;362:eaat8464.
140. Gandal MJ, Zhang P, Hadjimichael E, Walker RL, Chen C, Liu S, et al.
Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and
bipolar disorder. Science. 2018;362:eaat8127.
141. Gandal MJ, Haney JR, Parikshak NN, Leppa V, Ramaswami G, Hartl C, et al.
Shared Molecular Neuropathology Across Major Psychiatric Disorders Parallels
Polygenic Overlap. Focus. 2019;17:66–72.
142. Opel N, Goltermann J, Hermesdorf M, Berger K, Baune BT, Dannlowski U.
Cross-disorder analysis of brain structural abnormalities in six major psychiatric
disorders: a secondary analysis of mega- and meta-analytical findings from the
ENIGMA consortium. Biol Psychiatry. 2020;88:678–86.
143. Hettwer MD, Lariviere S, Park B-Y, van den Heuvel OA, Schmaal L, Andreassen
OA, et al. Coordinated cortical thickness alterations across psychiatric
conditions: A transdiagnostic ENIGMA study. bioRxiv. (2022).
144. Balliu B, Douglas C, Shenhav L, Wu Y, Seok D, Chatzopoulou D, et al.
Personalized mood prediction from patterns of behavior collected with smartphones.
bioRxiv. (2022).
145. Freimer NB, Mohr DC. Integrating behavioural health tracking in human
genetics research. Nat Rev Genet. 2019;20:129–30.
146. Pirastu N, Cordioli M, Nandakumar P, Mignogna G, Abdellaoui A, Hollis B, et al.
Genetic analyses identify widespread sex-differential participation bias. Nat
Genet. 2021;53:663–71.
147. Griffith GJ, Morris TT, Tudball MJ, Herbert A, Mancano G, Pike L, et al. Collider
bias undermines our understanding of COVID-19 disease risk and severity. Nat
Commun. 2020;11:5749.
148. Gkatzionis A, Burgess S. Contextualizing selection bias in Mendelian
randomization: how bad is it likely to be? Int J Epidemiol. 2019;48:691–701.
149. Cole SR, Hernán MA. Constructing inverse probability weights for marginal
structural models. Am J Epidemiol. 2008;168:656–64.
150. Dudbridge F, Allen RJ, Sheehan NA, Schmidt AF, Lee JC, Jenkins RG, et al.
Adjustment for index event bias in genome-wide association studies of
subsequent events. Nat Commun. 2019;10:1561.
151. Cai S, Hartley A, Mahmoud O, Tilling K, Dudbridge F. Adjusting for collider bias in
genetic association studies using instrumental variable methods. Genet
Epidemiol. 2022;46:303–16.
152. Mahmoud O, Dudbridge F, Davey Smith G, Munafo M, Tilling K. A robust
method for collider bias correction in conditional genome-wide association
studies. Nat Commun. 2022;13:619.
153. Qi G, Chatterjee N. Mendelian randomization analysis using mixture models for
robust and efficient estimation of causal effects. Nat Commun. 2019;10:1941.
154. Cinelli C, LaPierre N, Hill BL, Sankararaman S, Eskin E. Robust Mendelian
randomization in the presence of residual population stratification, batch effects
and horizontal pleiotropy. Nat Commun. 2022;13:1093.
155. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK
Biobank resource with deep phenotyping and genomic data. Nature.
2018;562:203–9.
156. Thorp JG, Campos AI, Grotzinger AD, Gerring ZF, An J, Ong J-S, et al.
Symptom-level modelling unravels the shared genetic architecture of anxiety and
depression. Nat Hum Behav. 2021;5:1432–42.
157. Nakada S, Ho FK, Celis-Morales C, Jackson CA, Pell JP. Individual and joint
associations of anxiety disorder and depression with cardiovascular disease: A
UK Biobank prospective cohort study. Eur Psychiatry. 2023;66:e54.
158. Qiao Y, Ding Y, Li G, Lu Y, Li S, Ke C. Role of depression in the development of
cardiometabolic multimorbidity: Findings from the UK Biobank study. J Affect
Disord. 2022;319:260–6.
159. Han X, Hou C, Yang H, Chen W, Ying Z, Hu Y, et al. Disease trajectories and
mortality among individuals diagnosed with depression: a community-based
cohort study in UK Biobank. Mol Psychiatry. 2021;26:6736–46.
160. Mulugeta A, Zhou A, King C, Hyppönen E. Association between major
depressive disorder and multiple disease outcomes: a phenome-wide Mendelian
randomisation study in the UK Biobank. Mol Psychiatry. 2020;25:1469–76.
161. Magnus P, Birke C, Vejrup K, Haugan A, Alsaker E, Daltveit AK, et al. Cohort
Profile Update: The Norwegian Mother and Child Cohort Study (MoBa). Int J
Epidemiol. 2016;45:382–8.
162. Havdahl A, Wootton RE, Leppert B, Riglin L, Ask H, Tesli M, et al. Associations
between pregnancy-related predisposing factors for offspring
neurodevelopmental conditions and parental genetic liability to
attention-deficit/hyperactivity disorder, autism, and Schizophrenia: The Norwegian Mother, Father and
Child Cohort Study (MoBa). JAMA Psychiatry. 2022;79:799–810.
163. Plana-Ripoll O, Pedersen CB, Holtz Y, Benros ME, Dalsgaard S, de Jonge P, et al.
Exploring comorbidity within mental disorders among a Danish National
population. JAMA Psychiatry. 2019;76:259–70.
164. Krebs MD, Themudo GE, Benros ME, Mors O, Børglum AD, Hougaard D, et al.
Associations between patterns in comorbid diagnostic trajectories of
individuals with schizophrenia and etiological factors. Nat Commun.
2021;12:6617.
165. Kendler KS, Ohlsson H, Sundquist J, Sundquist K. Selecting cases of major
psychiatric and substance use disorders in Swedish national registries on the basis
of clinical features to maximize the strength or specificity of the genetic risk.
Mol Psychiatry. 2023. 2023. https://doi.org/10.1038/s41380-023-02156-2.
166. LaBianca S, Brikell I, Helenius D, Loughnan R, Mefford J, Palmer CE, et al.
Polygenic profiles define aspects of clinical heterogeneity in attention deficit
hyperactivity disorder. Nat Genet. 2023. 2023.
https://doi.org/10.1038/s41588-023-01593-7.
167. Musliner KL, Krebs MD, Albiñana C, Vilhjalmsson B, Agerbo E, Zandi PP, et al.
Polygenic risk and progression to bipolar or psychotic disorders among
individuals diagnosed with unipolar depression in early life. Am J Psychiatry.
2020;177:936–43.
168. Kendler KS, Ohlsson H, Sundquist J, Sundquist K. Family genetic risk scores and
the genetic architecture of major affective and psychotic disorders in a Swedish
national sample. JAMA Psychiatry. 2021;78:735–43.
169. Kendler KS, Ohlsson H, Sundquist J, Sundquist K. The patterns of family genetic
risk scores for eleven major psychiatric and substance use disorders in a
Swedish national sample. Transl Psychiatry. 2021;11:326.
170. Dybdahl Krebs M, Georgii Hellberg K-L, Lundberg M, Appadurai V, Ohlsson H,
Pedersen EM, et al. PA-FGRS is a novel estimator of pedigree-based genetic
liability that complements genotype-based inferences into the genetic
architecture of major depressive disorder. bioRxiv. (2023).
171. Dybdahl Krebs M, Appadurai V, Georgii Hellberg K-L, Ohlsson H, Steinbach J,
Pedersen E, et al. The relationship between genotype- and phenotype-based
estimates of genetic liability to psychiatric disorders, in practice and in theory.
bioRxiv. (2023).
172. De la Hoz JF, Arias A, Service SK, Castaño M, Diaz-Zuluaga AM, Song J, et al.
Electronic health records reveal transdiagnostic clinical features and diverse
trajectories of serious mental illness. bioRxiv. (2022).
173. Bromet E, Andrade LH, Hwang I, Sampson NA, Alonso J, de Girolamo G, et al.
Cross-national epidemiology of DSM-IV major depressive episode. BMC Med.
2011;9:90.
174. Studer M, Ritschard G. What matters in differences between life trajectories: a
comparative review of sequence dissimilarity measures. J R Stat Soc Ser A Stat
Soc. 2016;179:481–511.
175. Cortes A, Dendrou CA, Motyer A, Jostins L, Vukcevic D, Dilthey A, et al. Bayesian
analysis of genetic association across tree-structured routine healthcare data in
the UK Biobank. Nat Genet. 2017;49:1311–8.
176. Cortes A, Albers PK, Dendrou CA, Fugger L, McVean G. Identifying cross-disease
components of genetic risk across hospital data in the UK Biobank. Nat Genet.
2020;52:126–34.
177. Zhang Y, Jiang X, Mentzer AJ, McVean G, Lunter G. Topic modeling identifies
novel genetic loci associated with multimorbidities in UK Biobank. Cell Genom.
2023;3:100371.
178. Kendler KS, Pedersen NL, Neale MC, Mathé AA. A pilot Swedish twin study of
affective illness including hospital- and population-ascertained subsamples:
results of model fitting. Behav Genet. 1995;25:217–32.
179. Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, 23andMe Research Team,
the Substance Use Disorder Working Group of the Psychiatric Genomics
Consortium, Adams MJ, et al. Genome-Wide Association Study Meta-Analysis of the
Alcohol Use Disorders Identification Test (AUDIT) in Two Population-Based
Cohorts. Am J Psychiatry. 2019;176:107–18.
180. Kendler KS, Gallagher TJ, Abelson JM, Kessler RC. Lifetime prevalence,
demographic risk factors, and diagnostic validity of nonaffective psychosis as
assessed in a US community sample. The National Comorbidity Survey. Arch Gen
Psychiatry. 1996;53:1022–31.
181. All of Us Research Program Genomics Investigators. Genomic data in the all of
us research program. Nature. 2024;627:340–6.
182. Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and
scale: genetic architecture of 2068 traits in the VA Million Veteran Program.
Science. 2024;385:eadj1182.
183. Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, et al. Toward a
fine-scale population health monitoring system. Cell. 2021;184:2068–83.e11.
184. Smith MA, Gigot M, Harburn A, Bednarz L, Curtis K, Mathew J, et al. Insights into
measuring health disparities using electronic health records from a statewide
network of health systems: A case study. J Clin Transl Sci. 2023;7:e54.
185. Yan C, Zhang X, Yang Y, Kang K, Were MC, Embí P, et al. Differences in health
professionals' engagement with electronic health records based on inpatient
race and ethnicity. JAMA Netw Open. 2023;6:e2336383.
186. Hsu C-Y, Yang W, Parikh RV, Anderson AH, Chen TK, Cohen DL, et al. Race,
genetic ancestry, and estimating kidney function in CKD. N Engl J Med.
2021;385:1750–60.
ACKNOWLEDGEMENTS
BV is supported by the Brain and Behavior Research Foundation (BBRF 31397). OAA is a
consultant to Cortechs.ai and Precision Health, and received speaker's honorarium
from Lundbeck, Janssen, Otsuka and Sunovion. He is supported by the Research
Council of Norway (#324499, #324252, #296030), NIH 1R01MH124839, KG Jebsen
Stiftelsen (SKGJ-MED-021), European Union's Horizon 2020 RIA grant (#964874). JB is
supported by the EU-AIMS (European Autism Interventions) and AIMS-2-TRIALS
programs which receive support from Innovative Medicines Initiative Joint
Undertaking Grant No. 115300 and 777394, the resources of which are composed of
financial contributions from the European Union's FP7 and Horizon2020 Programs,
and from the European Federation of Pharmaceutical Industries and Associations
(EFPIA) companies' in-kind contributions, and AUTISM SPEAKS, Autistica and SFARI;
and by the Horizon2020 supported programs CANDY Grant No. 847818, and R2D2
Grant No. 101057385. AG was supported by NIH Grant R01MH120219. KJ is a
consultant to Allia Health. PL was supported by R01MH119243. TTM was supported
by K08MH135343. EMTD was supported by R01MH120219. KSK was supported by
R01MH130665 and U01MH126798. These funders had no role in the design of the
study; in the collection, analyses, or interpretation of data; in the writing of the
manuscript, or in the decision to publish the results. Any views expressed are those of
the author(s) and not necessarily those of the funders.
AUTHOR CONTRIBUTIONS
NC, BV and KSK outlined the specific issues reviewed in this paper and prepared the
first draft of the manuscript. OA, JB, HE, JH, MG, AG, KJ, PL, TM, MM, MN, JN, WP, ET-D,
and JW participated in the initial discussions on developing this paper, reviewed the
initial draft and provided important input into the final document which was
reviewed and approved by all authors.
COMPETING INTERESTS
The authors declare no competing interests.
ADDITIONAL INFORMATION
Correspondence and requests for materials should be addressed to
Kenneth S. Kendler.
Reprints and permission information is available at
http://www.nature.com/reprints
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims
in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party
material in this article are included in the article's Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2024
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
One of the justifiable criticisms of human genetic studies is the underrepresentation of participants from diverse populations. Lack of inclusion must be addressed at-scale to identify causal disease factors and understand the genetic causes of health disparities. We present genome-wide associations for 2068 traits from 635,969 participants in the Department of Veterans Affairs Million Veteran Program, a longitudinal study of diverse United States Veterans. Systematic analysis revealed 13,672 genomic risk loci; 1608 were only significant after including non-European populations. Fine-mapping identified causal variants at 6318 signals across 613 traits. One-third ( n = 2069) were identified in participants from non-European populations. This reveals a broadly similar genetic architecture across populations, highlights genetic insights gained from underrepresented groups, and presents an extensive atlas of genetic associations.
Article
Full-text available
Over the last ten years, there has been considerable progress in using digital behavioral phenotypes, captured passively and continuously from smartphones and wearable devices, to infer depressive mood. However, most digital phenotype studies suffer from poor replicability, often fail to detect clinically relevant events, and use measures of depression that are not validated or suitable for collecting large and longitudinal data. Here, we report high-quality longitudinal validated assessments of depressive mood from computerized adaptive testing paired with continuous digital assessments of behavior from smartphone sensors for up to 40 weeks on 183 individuals experiencing mild to severe symptoms of depression. We apply a combination of cubic spline interpolation and idiographic models to generate individualized predictions of future mood from the digital behavioral phenotypes, achieving high prediction accuracy of depression severity up to three weeks in advance ( R ² ≥ 80%) and a 65.7% reduction in the prediction error over a baseline model which predicts future mood based on past depression severity alone. Finally, our study verified the feasibility of obtaining high-quality longitudinal assessments of mood from a clinical population and predicting symptom severity weeks in advance using passively collected digital behavioral data. Our results indicate the possibility of expanding the repertoire of patient-specific behavioral measures to enable future psychiatric research.
Article
Full-text available
Attention deficit hyperactivity disorder (ADHD) is a complex disorder that manifests variability in long-term outcomes and clinical presentations. The genetic contributions to such heterogeneity are not well understood. Here we show several genetic links to clinical heterogeneity in ADHD in a case-only study of 14,084 diagnosed individuals. First, we identify one genome-wide significant locus by comparing cases with ADHD and autism spectrum disorder (ASD) to cases with ADHD but not ASD. Second, we show that cases with ASD and ADHD, substance use disorder and ADHD, or first diagnosed with ADHD in adulthood have unique polygenic score (PGS) profiles that distinguish them from complementary case subgroups and controls. Finally, a PGS for an ASD diagnosis in ADHD cases predicted cognitive performance in an independent developmental cohort. Our approach uncovered evidence of genetic heterogeneity in ADHD, helping us to understand its etiology and providing a model for studies of other disorders.
Biobanks often contain several phenotypes relevant to diseases such as major depressive disorder (MDD), with partly distinct genetic architectures. Researchers face complex tradeoffs between shallow (large sample size, low specificity/sensitivity) and deep (small sample size, high specificity/sensitivity) phenotypes, and the optimal choices are often unclear. Here we propose to integrate these phenotypes to combine the benefits of each. We use phenotype imputation to integrate information across hundreds of MDD-relevant phenotypes, which significantly increases genome-wide association study (GWAS) power and polygenic risk score (PRS) prediction accuracy of the deepest available MDD phenotype in UK Biobank, LifetimeMDD. We demonstrate that imputation preserves specificity in its genetic architecture using a novel PRS-based pleiotropy metric. We further find that integration via summary statistics also enhances GWAS power and PRS predictions, but can introduce nonspecific genetic effects depending on input. Our work provides a simple and scalable approach to improve genetic studies in large biobanks by integrating shallow and deep phenotypes.
Biobanks that collect deep phenotypic and genomic data across many individuals have emerged as a key resource in human genetics. However, phenotypes in biobanks are often missing across many individuals, limiting their utility. We propose AutoComplete, a deep learning-based imputation method to impute or ‘fill in’ missing phenotypes in population-scale biobank datasets. When applied to collections of phenotypes measured across ~300,000 individuals from the UK Biobank, AutoComplete substantially improved imputation accuracy over existing methods. On three traits with notable amounts of missingness, we show that AutoComplete yields imputed phenotypes that are genetically similar to the originally observed phenotypes while increasing the effective sample size by about twofold on average. Further, genome-wide association analyses on the resulting imputed phenotypes led to a substantial increase in the number of associated loci. Our results demonstrate the utility of deep learning-based phenotype imputation to increase power for genetic discoveries in existing biobank datasets.