Using PhenX Measures to Identify Opportunities for Cross-Study Analysis
Huaqin Pan,1∗ Kimberly A. Tryka,2 Daniel J. Vreeman,3,4 Wayne Huggins,1 Michael J. Phillips,1 Jayashri P. Mehta,2†
Jacqueline H. Phillips,3 Clement J. McDonald,5 Heather A. Junkins,6 Erin M. Ramos,6 and Carol M. Hamilton1
1RTI International, Research Triangle Park, North Carolina; 2National Center for Biotechnology Information, National Library of Medicine, National
Institutes of Health, Bethesda, Maryland; 3Regenstrief Institute, Inc., Indianapolis, Indiana; 4Indiana University School of Medicine, Indianapolis,
Indiana; 5Lister Hill National Center for Biomedical Communication, National Library of Medicine, National Institutes of Health, Bethesda,
Maryland; 6National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
For the Deep Phenotyping Special Issue
Received 3 November 2011; accepted revised manuscript 28 February 2012.
Published online 13 March 2012 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22074
The PhenX Toolkit provides researchers with recommended, well-established,
low-burden measures suitable for human subject research. The database
of Genotypes and Phenotypes (dbGaP) is the data repository for a variety
of studies funded by the National Institutes of Health, including
genome-wide association studies. The dbGaP requires that investigators
provide a data dictionary of study variables as part of the data
submission process. Thus, dbGaP is a unique resource that can
help investigators identify studies that share the same or
similar variables. As a proof of concept, variables from 16
studies deposited in dbGaP were mapped to PhenX measures.
Soon, investigators will be able to search dbGaP
using PhenX variable identifiers and find comparable and
related variables in these 16 studies. To enhance effective
data exchange, PhenX measures, protocols, and variables
were modeled in Logical Observation Identifiers Names
and Codes (LOINC®). PhenX domains and measures are
also represented in the Cancer Data Standards Registry
and Repository (caDSR). Associating PhenX measures
with existing standards (LOINC® and caDSR) and mapping
to dbGaP study variables extends the utility of these
measures by revealing new opportunities for cross-study analysis.
Hum Mutat 33:849–857, 2012. Published 2012 Wiley Periodicals, Inc.
KEY WORDS: phenotype; environmental exposure; epi-
demiologic methods; GWAS
Additional Supporting Information may be found in the online version of this article.
†Current address: Merck & Co., Inc., North Wales, Pennsylvania.
∗Correspondence to: Huaqin Pan, RTI International, 3040 Cornwallis Road, P.O. Box
12194, Research Triangle Park, NC 27709. E-mail: email@example.com
Contract grant sponsor: This work was supported by the National Human Genome
Research Institute and the American Recovery and Reinvestment Act through NHGRI
U01 HG004597-01 to RTI International; by the National Library of Medicine through
HHSN2762008000006C and National Center for Research Resources 3UL1RR025761-
02S6 to Regenstrief Institute; and by the Intramural Research Program of the National
Institutes of Health.
The influx of genome-wide association studies (GWAS) has led
outcomes. More than 1,000 publications are currently included in
the Catalog of Published GWAS [Hindorff et al., 2009]. Despite the
notypic and environmental measurements has limited the ability to
combine data from GWAS and other large-scale genomic and epi-
seemingly disparate studies with similar underlying risk factors, in-
creasing statistical power so that relatively modest or more complex
Lubin, 1999; Khoury et al., 2009]. However, in most longitudinal
clinical studies, each investigator develops a set of clinical variables
that are not the same across studies.
In response to a clear need for standard measures of phe-
notypes and exposures, PhenX (consensus measures for Pheno-
types and eXposures) engaged 21 working groups of experts to
identify high-quality, relatively low-burden, well-established mea-
sures of phenotypes and exposures. These measures were vetted by
the scientific community prior to inclusion in the PhenX Toolkit
(https://www.phenxtoolkit.org). The PhenX Toolkit provides re-
searchers with a source of standard measures suitable for a variety
of study designs in population-based research. Because the PhenX
Toolkit provides a variety of high-quality measures, investigators
can come to the Toolkit and select measures to expand their study,
especially to add measures that are beyond the primary research
focus of the study.
The nomenclature for the PhenX Toolkit was defined by the
PhenX Steering Committee and is shown in Table 1. Currently,
the PhenX Toolkit includes 295 measures spanning 21 research do-
mains [Hamilton et al., 2011; Hendershot et al., 2011]. A measure
correspond to many items in the other data sets described in this
Challenges in phenotype harmonization have been widely recog-
nized, and efforts have been made in this emerging research field
[Bennett et al., 2011; Fortier et al., 2010]. To help address these
problems, all 295 PhenX measures have been mapped to multi-
ple resources, including the database of Genotypes and Phenotypes
(dbGaP; http://www.ncbi.nlm.nih.gov/gap/), Logical Observation
Identifiers Names and Codes (LOINC®; http://loinc.org/), and the
Published 2012 Wiley Periodicals, Inc. ∗This article is a US Government work and, as such, is in the public domain of the United States of America.
Table 1. PhenX Toolkit Nomenclature
PhenX Toolkit nomenclature; Definition; Example

PhenX domain
Definition: A PhenX domain is a field of research with a unifying theme and easily enumerated quantitative and qualitative measures (e.g., demographics, anthropometrics, organ systems, complex diseases, and lifestyle factors).
Example: Alcohol, tobacco and other substances

PhenX measure
Definition: A PhenX measure refers broadly to a standardized way of capturing data on a certain characteristic of or relating to a study subject.
Example: Alcohol—lifetime use

PhenX protocol
Definition: A PhenX protocol is a standard procedure recommended by a working group for investigators to collect and record a PhenX measure.
Example: Alcohol Use Disorder and Associated Disabilities Interview Schedule, Fourth Edition Version (AUDADIS-IV)

PhenX variable
Definition: A PhenX variable is developed for each data element collected by the PhenX protocols. PhenX variables can be found in the PhenX data dictionary files at the “My Toolkit” page when the PhenX protocol is selected. For each PhenX variable, the data dictionary includes a unique variable name, variable ID, variable description, and other related attributes such as data types, units, and permitted values when applicable.
Example: PX030101_lifetime_use—In your entire life, have you had at least one drink of any kind of alcohol, not counting small tastes or sips?

PhenX collection
Definition: A PhenX collection is a collection of measures with a shared characteristic, target population, or topic. The measures included in a collection may cut across research domains.
This article describes how PhenX measures were integrated into
these standards and demonstrates the utility of this approach.
Database of Genotypes and Phenotypes
The dbGaP database, which was created by the National Center
level genotype, sequence, and phenotype data, and the associations
between them [Mailman et al., 2007]. dbGaP currently contains
enough to PhenX variables that they could be considered compara-
in PhenX variables find similar variables in dbGaP, we developed a
proof of concept, variables from 16 completed studies deposited in
dbGaP were mapped to PhenX measures. These results will be fully
incorporated in dbGaP and will bring to light additional opportu-
nities for cross-study analysis.
Investigators who submit data to dbGaP will be asked to identify
PhenX variables as part of the data submission process. PhenX
variables will then be highlighted as such in dbGaP. Because dbGaP
was established before PhenX measures were developed, none of
the studies currently in dbGaP used PhenX protocols. However,
we know that there are many variables in dbGaP that are similar,
or nearly identical, to PhenX variables and potentially could be
combined with data collected using PhenX protocols as well as with
each other. Although it is possible to run full-text searches within
the dbGaP database to find data that are similar, experience tells
us that the full-text searches for variables are likely to return large
numbers of false positives. For example, a search on “education”
will return more than 10,000 variables.
To make it easier for researchers to find non-PhenX variables
that might be compared or combined with PhenX variables, scien-
tists from PhenX and dbGaP investigated the feasibility of mapping
dbGaP variables to PhenX variables. The first attempt began by ex-
amining four dbGaP studies, so that we could begin to develop the
process and refine our ideas about what it means to map one vari-
studies, including the variable description and a link to the variable
report page on the dbGaP website. Using this information and all
of the information available on the PhenX Toolkit, each scientist
generated his or her own set of mappings for the dbGaP variables.
Many factors such as measurement concept, protocol, code cate-
gory (answer list), and measurement unit were discussed. Based on
these discussions, the team decided on the following two levels of mapping:
• Comparable: The data collected in these variables are con-
ceptually the same and should be able to be compared either
directly or after a straightforward transformation/conversion.
Examples of comparable variables are:
– variables whose data were collected using the same protocol; or
– variables that are recognized as producing the same data
(e.g., age) or producing data that can be easily trans-
formed (e.g., measured weight in kilograms and pounds,
or “birth date” and “birth year”), although they do not
share identical protocols.
• Related: The data in these variables cannot be directly com-
pared, but could be compared after further manipulation. Ex-
amples of related variables are:
– variables that were collected using different protocols,
but that measure similar properties (e.g., “measured
weight” and “self-reported weight”);
– multiple PhenX variables that might need to be com-
bined to reflect a single dbGaP variable (e.g., “weight”
and “height” for “body mass index”); or
– variables that have different qualifiers (e.g., “since last
visit” vs. “have you ever,” “regularly” vs. “at least once a
week,” or “hormone therapy” vs. “hormone therapy with
a specific hormone name”).
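The two mapping levels can be expressed as a small data structure. The following Python sketch is illustrative only and is not the dbGaP implementation: the class names, the optional `transform` field, and the variable labels (`measured_weight_lb`, `self_reported_weight`, `PX_weight_kg`) are hypothetical. It shows a comparable mapping that needs only a straightforward unit conversion and a related mapping that cannot be compared directly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class MappingLevel(Enum):
    COMPARABLE = "comparable"  # same concept; direct or simple conversion
    RELATED = "related"        # similar concept; needs further manipulation


@dataclass(frozen=True)
class Mapping:
    dbgap_variable: str   # hypothetical dbGaP variable label
    phenx_variable: str   # hypothetical PhenX variable label
    level: MappingLevel
    # Optional conversion from the dbGaP value to the PhenX unit (assumption).
    transform: Optional[Callable[[float], float]] = None


LB_PER_KG = 2.20462  # pounds per kilogram


def pounds_to_kilograms(pounds: float) -> float:
    return pounds / LB_PER_KG


measured = Mapping("measured_weight_lb", "PX_weight_kg",
                   MappingLevel.COMPARABLE, pounds_to_kilograms)
self_reported = Mapping("self_reported_weight", "PX_weight_kg",
                        MappingLevel.RELATED)


def harmonize(mapping: Mapping, value: float) -> Optional[float]:
    """Return a value in PhenX units, or None when no direct comparison exists."""
    if mapping.level is not MappingLevel.COMPARABLE:
        return None
    return mapping.transform(value) if mapping.transform else value
```

Returning `None` for related mappings is a design choice in this sketch: it forces the analyst to decide how, or whether, the related data should be manipulated before comparison.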
It is possible that a dbGaP variable neither corresponds nor is re-
or relation could be considered “not found,” but this mapping level
is not explicitly shown when looking at the dbGaP variable. Rather,
variables that do not have a mapping level simply are not displayed.
A dbGaP variable can be mapped to multiple PhenX variables
and/or measures. For example, the dbGaP variable phv00111936
(smok_evr: smoked more than 100 cigarettes or five packs in a
lifetime) is mapped to four PhenX variables (as comparable to one
different PhenX Measures (see Fig. 1).
Once the mapping criteria had been agreed on, the remaining
studies were mapped. Mapping was performed by at least two inde-
pendent curators. Results were compared and, as before, discrepan-
cies were resolved by consensus after discussion. Any new mapping
criteria that were developed during this process were added to the
guidelines for future use.
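The comparison of independently curated mappings can be sketched in a few lines of Python. This is a hypothetical illustration, not the tooling the curators used: the function name and all variable identifiers are invented. Each curator's work is represented as a dictionary from a (dbGaP variable, PhenX variable) pair to a mapping level, and any pair on which the two curators differ, including a pair that only one of them mapped, is flagged for discussion.

```python
from typing import Dict, Set, Tuple

# (dbGaP variable, PhenX variable); all identifiers below are hypothetical.
MappingKey = Tuple[str, str]


def find_discrepancies(
    curator_a: Dict[MappingKey, str],
    curator_b: Dict[MappingKey, str],
) -> Tuple[Dict[MappingKey, str], Set[MappingKey]]:
    """Split two curators' mappings into agreed and disputed sets."""
    agreed: Dict[MappingKey, str] = {}
    disputed: Set[MappingKey] = set()
    for key in curator_a.keys() | curator_b.keys():
        level_a = curator_a.get(key)  # None if curator A did not map the pair
        level_b = curator_b.get(key)
        if level_a == level_b:
            agreed[key] = level_a
        else:
            disputed.add(key)
    return agreed, disputed


curator_a = {("phv_A", "PX_1"): "comparable", ("phv_B", "PX_2"): "related"}
curator_b = {("phv_A", "PX_1"): "comparable", ("phv_B", "PX_2"): "comparable",
             ("phv_C", "PX_3"): "related"}
agreed, disputed = find_discrepancies(curator_a, curator_b)
```

In this example one pair is agreed and two are disputed; the disputed pairs are the ones that would go to consensus discussion.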
HUMAN MUTATION, Vol. 33, No. 5, 849–857, 2012
Figure 1. Screen shot of the report page for dbGaP variable phv00111936. In the column “Mapping,” a green circle indicates that the mapping
result is “comparable”; a half-filled yellow circle indicates that the mapping result is “related.” The links below “PhenX Variable” take the user to
a complete list of dbGaP variables with mapping to the PhenX variable. The “Measure” links take the user to the PhenX website’s measures page.
In this report, we show the results of mapping 13 Gene Envi-
ronment Association Studies (GENEVA) consortium studies and
three electronic Medical Records and Genomics (eMERGE) network
studies to PhenX. The GENEVA consortium (https://www.genevastudy.org)
consists of 16 GWAS that aim to accelerate the
and disease on a collection of mostly traditional epidemiologic co-
horts [Cornelis et al., 2010]. The eMERGE network (https://www
is a national consortium formed to develop, disseminate, and ap-
ply approaches to research that combine DNA biorepositories with
electronic medical record systems for large-scale, high-throughput
genetic research [McCarty et al., 2011; Kho et al., 2011]. We used
variable descriptions from GENEVA and eMERGE studies released
their dbGaP accession numbers, the total number of variables from
each study, and the number of variables that mapped to a PhenX
variable or measure. The percentage of variables mapped for a par-
contain variables that are not phenotype data, and therefore cannot
be expected to have an analog in PhenX. These types of variables
include administrative data such as IDs (e.g., for subjects, subjects’
parents, locations of data collection, etc.), consent status, or infor-
mation about instrumentation (e.g., sequencing platforms). Aside
from administrative variables, there are some dbGaP variables that
do not map to PhenX. In general, these variables reflect concepts
that are study specific (e.g., “Are your ear lobes creased?” or “What
is your US shoe size?”).
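The effect of administrative variables on the percentage mapped can be shown with simple arithmetic. The Python sketch below uses the counts reported for the GENEVA early onset stroke study (phs000188: 36 variables, 16 mapped); the function name and the count of six administrative variables are hypothetical and serve only to illustrate the point.

```python
def percent_mapped(total: int, mapped: int, administrative: int = 0) -> int:
    """Percentage of variables mapped, optionally ignoring administrative ones."""
    eligible = total - administrative
    if eligible <= 0:
        raise ValueError("no eligible variables")
    return round(100 * mapped / eligible)


# GENEVA early onset stroke (phs000188): 36 variables, 16 mapped.
overall = percent_mapped(36, 16)

# Hypothetical: suppose 6 of the 36 were administrative (IDs, consent status).
phenotype_only = percent_mapped(36, 16, administrative=6)
```

Under that assumption the mapped share rises from 44% of all variables to 53% of phenotype variables, which is why the raw percentage understates coverage.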
Results of mapping the dbGaP studies to PhenX are summa-
rized in the PhenX–dbGaP variables cross-reference table in Supp.
Table S1. For these 16 dbGaP studies, the cross-reference ta-
ble lists a total of 2,041 mappings, with 604 dbGaP variables
.phenxtoolkit.org). Examples of these mappings are illustrated
in Table 3, in which individual PhenX variables are mapped to
many variables from multiple studies, highlighting opportunities
for cross-study analysis at the investigator’s discretion. Note that
(a condition). Although this mapping may be at first disconcerting,
it is actually a good example of how mapping can identify data
that are comparable or related, even though the reasons for collecting
the data were different. “Lipid_total_cholesterol” is a variable
associated with the PhenX lipid profile measure, and the data col-
lected can be used to derive the condition “dyslipidemia.” On the
dbGaP website, mapping information for a variable is shown on
that variable’s report page. Figure 1 is a screenshot of the report
Table 2. List of dbGaP Studies Mapped to PhenX
Study name                  dbGaP accession   # Total   # Mapped   % Mapped
GENEVA—early onset stroke   phs000188         36        16         44
aNames of studies have been shortened to reflect the title of the initiative they are part
of, as well as the major research area to save space. Full names can be found at the
dbGaP page for each study.
bTo find the most recent version of these studies in dbGaP, use the following base URL
and add the dbGaP accession to the end:
dbGaP, database of Genotypes and Phenotypes; GENEVA, Gene Environment
Association Studies; eMERGE, electronic Medical Records and Genomics.
page for the dbGaP variable phv00111936 (smok_evr). The “Terms
Linked to this Variable” section lists the PhenX variables mapped to
smok_evr. The left column shows the level of mapping; a full green
circle indicates comparable, and a half-filled yellow circle indicates
related. The second column lists the name of the PhenX variable
or measure that has been mapped to; these names are linked to a
search page that displays all of the dbGaP variables that map to
that PhenX variable (see Supp. Fig. S1). The third column gives a
column lists the PhenX measure associated with the variable. The
Figure 2 shows the number of dbGaP mappings to PhenX as a
function of the PhenX measure. Although dbGaP variables map
to more than 100 different PhenX measures, only the 25 PhenX
measures with the most mappings are shown here. When looking
at this plot, you should keep in mind the following points:
• A single dbGaP variable can map multiple times onto a
PhenX measure because there are multiple variables that are
either comparable or related. For example, dbGaP variable
phv00142512 (hgla: family history of glaucoma in first-degree
history of eye disease and treatments six times, each time rep-
for mother, father, sister, brother, daughter, and son).
• A single dbGaP variable can map to multiple PhenX measures
because there are variables in each measure that are compa-
rable or related. This was described earlier for the variable
phv00111936 (smok_evr), for which a single dbGaP variable
mapped into three different measures.
a single basic concept (e.g., gender, age, height, or weight), the
same piece of information can be collected in other PhenX
measures. This is a consequence of PhenX absorbing entire
instruments to retain their coherence rather than just cherry
picking particular questions from an instrument to include in
PhenX. For example, the concept of “gender” is represented in
PhenX as its own measure, gender. It is also present in the fol-
lowing PhenX measures: cancer–personal and family history,
Table 3. Examples of Four PhenX Variables Mapped to Variables from 16 Mapped dbGaP Studies
PhenX variable name   dbGaP variables   Level
aGENEVA—birth weight. bGENEVA—blood clotting. cGENEVA—dental caries. dGENEVA—diabetes. eGENEVA—early onset stroke. fGENEVA—glaucoma. gGENEVA—prostate cancer. hGENEVA—addiction. iGENEVA—venous thrombosis. jeMERGE—cataract. keMERGE—peripheral arterial disease. leMERGE—electrocardiogram QRS.
dbGaP, database of Genotypes and Phenotypes; GENEVA, Gene Environment Association Studies; eMERGE, electronic Medical Records and Genomics.
Figure 2. A plot that shows the number of dbGaP mappings to PhenX as a function of the PhenX measure (horizontal axis: # dbGaP variables, 0 to 140; vertical axis: PhenX measure).
sleep apnea, spirometry, schizophrenia screener, migraine,
cardiorespiratory fitness–exercise test estimate, integrated fit-
ness, and international travel history among others.
The final point explains why the measures that have the most
dbGaP mappings to PhenX are those like “sleep apnea” and “mi-
apnea” or “migraine” contains an extensive protocol that collects
a large number of discrete variables including age, gender, height,
and weight in addition to the more specific data suggested by its
name. Therefore, these measures will have many dbGaP variables
mapped to them from a single study, whereas the measure “gender”
may have only one variable mapped to it from each study.
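The counting behind this observation can be reproduced with a short Python sketch. The mapping rows below are invented for illustration and do not come from the cross-reference table; only the measure names echo those mentioned in the text. The sketch tallies mappings per PhenX measure and measures per dbGaP variable, showing why a multi-concept measure accumulates more mappings than a single-concept measure.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical rows: (dbGaP variable, PhenX variable, PhenX measure).
mappings = [
    ("phv_gender", "gender", "Gender"),
    ("phv_gender", "sleep_apnea_gender", "Sleep apnea"),
    ("phv_gender", "migraine_gender", "Migraine"),
    ("phv_age", "current_age", "Current age"),
    ("phv_age", "sleep_apnea_age", "Sleep apnea"),
    ("phv_age", "migraine_age", "Migraine"),
    ("phv_height", "sleep_apnea_height", "Sleep apnea"),
]

# Number of mappings that land on each PhenX measure.
mappings_per_measure = Counter(measure for _, _, measure in mappings)

# Set of PhenX measures that each dbGaP variable reaches.
measures_per_variable: Dict[str, Set[str]] = {}
for dbgap_var, _, measure in mappings:
    measures_per_variable.setdefault(dbgap_var, set()).add(measure)
```

Here the hypothetical "Sleep apnea" measure collects three mappings from a single study, whereas "Gender" collects one, mirroring the pattern described above.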
For the pilot study described, only a handful of dbGaP studies
it was relatively easy to identify all variables related to a given con-
somewhat laborious, resulted in thoughtful, consistent mappings.
That said, scaling up will present challenges, and using natural lan-
guage processing (NLP) algorithms to identify similarities and dif-
ferences among the variables may be helpful in this regard. For
example, NLP has been used successfully to identify cataract cases
from electronic health records [Peissig et al., 2012]. Perhaps in the
future, NLP can be used to augment and extend the described ap-
Logical Observation Identifiers Names and Codes
LOINC® (http://loinc.org/) is a vocabulary standard for identifying
laboratory tests, clinical measurements and reports, survey
instruments, and other kinds of clinical observations. By providing
universal identifiers for a wide range of measurements and
observations, LOINC® enables exchange and aggregation of electronic
health data from independent systems for many purposes
[Vreeman et al., 2010b; McDonald et al., 2003]. LOINC® has been
widely adopted in the private and public sectors, both within the
United States and by users in more than 140 countries worldwide.
mittee of the Federal Office of the National Coordinator for Health
Information Technology recently adopted LOINC® as the coding
system for transmitting results of laboratory and other tests, assessment
instruments, and many other clinical variables [Health IT
Standards Committee, 2011]. LOINC® has now incorporated all
of the PhenX content, enabling results of PhenX measures from in-
dependent systems to be shared using the same exchange, storage,
and processing infrastructure that health information systems use
An example of accessory (image) content for a PhenX variable as represented in LOINC®.
advantages to this linkage, and some of the lessons learned.
Each term in LOINC® provides a “fully specified” name using
an established model that contains six main axes (Supp. Table S2)
[McDonald et al., 2011]. The model produces names that are de-
a collection, PhenX contains many kinds of measurements, from
laboratory tests to anthropometric measures and validated ques-
standardized assessment instruments, recognizing that they have
psychometric properties that are essential for interpreting meaning
[Vreeman et al., 2010a]. Thus, in addition to the structured name,
LOINC® stores many other attributes about the individual variable,
including the exact question text and source, example units
of measure (for quantitative variables) and full answer lists (for
categorical variables), references, descriptions, and external copyright
information when applicable. LOINC® also creates terms for
named collections of variables (called “panels” in LOINC®) and
enumerates the child elements contained in that set into an explicit
the entire set of PhenX measures into LOINC®, either by creating
new LOINC® terms or by linking the PhenX variables to existing
LOINC® terms. We extracted content from the PhenX Toolkit for
every variable in each measure and domain, starting with a small
set of PhenX content that was first represented in LOINC® version
2.29 as a proof of concept. Some variables, such as head circumference
and gestational age, were already present in LOINC®, but the
majority of them were not. We modeled variables new to LOINC®
text, we extracted and stored the key accessory attributes (e.g., units
of measure or the allowable answer choices). Many PhenX variables
are defined or illustrated by graphics (e.g., line drawings or pho-
tographs) to show exactly how a measurement should be taken or
how to answer that particular question. The LOINC® team created
a mechanism for storing and displaying these graphics in the free
Assistant (http://loinc.org/relma) and the online LOINC® search
cessory content for a PhenX variable is represented in LOINC®,
including the structured answer list, exact question text, and a
reference image. To capture the hierarchical arrangement of variables
into collections, we created LOINC® panel terms at the level
include all of the corresponding PhenX child elements in a formal
hierarchy linked to that panel. Over time, the LOINC® team added
now completed modeling of all PhenX variables from 295 measures
in 21 research domains; 138 existing LOINC® terms were mapped
to PhenX variables, and approximately 4,500 new LOINC® terms
were added based on the PhenX content.
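The structure described in this section, a term named along six axes plus a panel that enumerates its child elements, can be sketched as follows. This Python example is a simplified illustration and not the LOINC® data model: the class and field names are assumptions, the six axes are given by their usual LOINC® names (component, property, time, system, scale, method), and the example term, its axis values, and its placeholder code `XXXXX-X` are illustrative. Only the panel code 62391-8 ("Lipid profile proto") is taken from Table 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoincTerm:
    code: str
    component: str    # what is measured or observed
    prop: str         # kind of property (e.g., mass, concentration)
    time: str         # point in time versus interval
    system: str       # specimen or subject of the observation
    scale: str        # quantitative, ordinal, nominal, and so on
    method: str = ""  # optional: how the observation was made

    def fully_specified_name(self) -> str:
        """Join the axes with colons, omitting the method when it is empty."""
        parts = [self.component, self.prop, self.time, self.system, self.scale]
        if self.method:
            parts.append(self.method)
        return ":".join(parts)


@dataclass
class LoincPanel:
    code: str
    name: str
    children: List[LoincTerm] = field(default_factory=list)


# Illustrative child term; the code is a placeholder, not a real identifier.
weight = LoincTerm("XXXXX-X", "Body weight", "Mass", "Pt", "^Patient", "Qn")

# Panel code and name from Table 4; the child assignment is hypothetical.
panel = LoincPanel("62391-8", "Lipid profile proto", children=[weight])
name = weight.fully_specified_name()
```

Keeping the child terms inside the panel object mirrors the explicit hierarchy described above, in which each panel enumerates the variables it contains.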
Incorporating the PhenX content into LOINC® has many advantages.
Adding the PhenX measures to LOINC® enables the
results to be shared using the same HIT infrastructure and standards
that are now becoming widely adopted. In addition, the
LOINC® model provides the same uniform computable representation
of the PhenX content as other standard assessments and
data sets contained in LOINC®, including many mandated by the
Centers for Medicare & Medicaid Services (CMS) and provided by
other National Institutes of Health institutes, such as the Patient
Reported Outcomes Measurement Information System (PROMIS;
and Quality of Life Outcomes in Neurological Disorders (Neuro-
QOL; http://www.neuroqol.org/). Having such a common repre-
sentation that promotes sharing will accelerate genomic and other
clinical research. Moreover, because of LOINC®’s broad adoption
worldwide, representing the PhenX measures in LOINC® will
widen the audience for PhenX measures.
The process of integrating the PhenX content into LOINC® elucidated
several important lessons. Many of the PhenX measures
selected instruments and protocols that were initially conceptualized
as paper data collection forms. As the LOINC® team defined
its terms and parsed this content into its data model, it revealed
many of the same challenges that were encountered with coding
other widely used survey instruments [Vreeman et al., 2010b]. For
example, some protocols did not specify all of the variables needed
to collect the data, or lacked sufficient detail to precisely define the
observation. In other cases, the information model of the protocols
differed substantially from the typical information model used to
saging standards like Health Level Seven International (HL7). The
LOINC® team always found solutions to these problems through
discussions with the PhenX team. One strategy was to turn a long
list of “check all that apply (yes or no)” questions into a single
variable with an answer choice list that could be repeated as many
times as necessary. For example, a protocol requiring answers of
yes or no to a long list of potential diseases could be transformed
into an active diseases variable whose answer values could be the
ables and was consistent with the prevailing health data exchange
formality required for computer representation of instruments in
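The "check all that apply" strategy described above can be illustrated with a minimal Python sketch. The function names and the disease labels are hypothetical; the sketch only shows the general idea of collapsing many yes/no items into one variable that holds a repeatable list of answers, and of rebuilding the original form from that list.

```python
from typing import Dict, List


def collapse_checklist(responses: Dict[str, bool]) -> List[str]:
    """Turn yes/no checklist items into one repeatable multi-answer variable."""
    return [item for item, checked in responses.items() if checked]


def expand_answers(answers: List[str], answer_list: List[str]) -> Dict[str, bool]:
    """Inverse operation: rebuild the yes/no form from the selected answers."""
    selected = set(answers)
    return {item: item in selected for item in answer_list}


# Hypothetical paper-form responses: one yes/no item per disease.
responses = {"Asthma": True, "Diabetes": False, "Hypertension": True}
active_diseases = collapse_checklist(responses)
round_trip = expand_answers(active_diseases, list(responses))
```

The round trip recovers the original responses only because the full answer list is supplied, which is one reason the answer list has to be stored alongside the variable.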
LOINC® was chosen as the vocabulary standard for several reasons.
The goal was to represent PhenX content in a widely adopted
vocabulary standard that would enable data aggregation using prevailing
conventions (e.g., HL7 messaging). The value of LOINC®
in this context is that it provides a set of universal identifiers and a
uniform model of that instrument across any context. LOINC® is
well suited for clinical observations and formal surveys and questionnaires,
and it is the standard adopted by the HIT Standards
Committee for laboratory and non-laboratory measurements and
observations. When this pilot study was initiated, LOINC® already
contained many similar complete packages of standardized
assessments and data sets, including the CMS-required Minimum
Data Set (https://www.cms.gov/MinimumDataSets20/), Outcome
and Assessment Information Set (https://www.cms.gov/OASIS/),
the new Continuity Assessment Record and Evaluation instru-
ment, Patient Health Questionnaire, PROMIS [Gershon et al.,
2010], and Neuro-QOL. Making PhenX content available in the
same model and format will facilitate data interoperability and data
Common Data Elements in caDSR
gic Planning Workspace, 2007; Kakazu et al., 2004]. It is an open-
source, open-access information network designed to enable se-
caDSR includes a catalog of common data elements (CDEs). Each
the question metadata, and a value domain, which represents the
answer metadata. One or more CDEs are either assigned or created
multiple protocols). There are 353 PhenX protocols mapped with
379 CDEs; 343 of these CDEs were newly created for PhenX.
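The pairing of question metadata with answer metadata can be sketched as a small Python structure. This is an illustration under stated assumptions, not the caDSR schema: the class and field names are invented, the name `data_element_concept` follows ISO/IEC 11179 usage and is an assumption because the text above names only the value domain explicitly, and the permissible values shown are hypothetical. The CDE name and public ID ("Gender code", 2179640) are taken from Table 4, and the counts come from the paragraph above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ValueDomain:
    """Answer metadata: data type and, if enumerated, the allowed values."""
    datatype: str
    permissible_values: Tuple[str, ...] = ()


@dataclass(frozen=True)
class CommonDataElement:
    public_id: int
    long_name: str
    data_element_concept: str  # question metadata (assumed field name)
    value_domain: ValueDomain

    def accepts(self, value: str) -> bool:
        """Accept any value when the domain is not enumerated."""
        allowed = self.value_domain.permissible_values
        return not allowed or value in allowed


# Public ID and name from Table 4; the permissible values are hypothetical.
gender = CommonDataElement(
    public_id=2179640,
    long_name="Gender code",
    data_element_concept="Person gender",
    value_domain=ValueDomain("CHARACTER", ("Female", "Male", "Unknown")),
)

total_cdes, newly_created = 379, 343
reused_cdes = total_cdes - newly_created
```

The subtraction shows that 36 of the 379 CDEs mapped to PhenX protocols were existing CDEs that PhenX reused.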
PhenX has reused existing CDEs when available. The need to
create so many new CDEs is not surprising; the CDEs previously
as gender, race, and age, or on specific concepts related to cancer,
whereas the focus of PhenX is much broader. PhenX represents 21
research domains, most of which are outside the traditional cancer
research domain; such domains include the psychiatric, psychoso-
were created for the protocols of the measure assay for chlamy-
dia/gonorrhea: immunology Chlamydia trachomatis assay labora-
tory finding result (3151324) and immunology gonorrhea assay
laboratory finding result (3153202). At the request of the caDSR
administrator, the PhenX CDEs’ workflow status was changed from
“draft new” to “released” so that they would be available for reuse;
they have already been used by other studies. In the caDSR, PhenX
the CDE Browser (https://cdebrowser.nci.nih.gov/CDEBrowser/),
as shown in Figure 4.
Table 4 shows an excerpt from the cross-reference table that includes
LOINC® codes and caDSR CDEs that are associated with
vided, in Supp. Table S4, is available on the Toolkit website and will
serve as a valuable resource to investigators as well as bioinformati-
Recognition and use of the PhenX Toolkit continues to increase
as investigators begin to realize the importance of collecting data
with standard instruments or tools. As of the end of January 2012,
there were 259,077 visitors to the Toolkit website. Most Toolkit visi-
tors are from the United States, the United Kingdom, and Australia,
but there have also been visitors from 143 other countries. There
are currently 637 registered users. Registered users have access to
additional features such as the “My Toolkit” for collecting and sav-
for users of the network to contact each other. The idea is that the
network can be used to facilitate collaboration at the study-design
Common Data Elements (CDEs) for PhenX domains listed in CDE Browser.
Table 4. Representation of Five PhenX Protocols in LOINC® and caDSR CDE
PhenX protocol name   PhenX protocol ID   LOINC® identifier   caDSR CDE (public ID)
Personal history of type 1
and type 2 diabetes
Current age proto (62293-6)
Tobac smoke status adoles proto (62553-3)
Lipid profile proto (62391-8)
Pers hx type 1 and 2 diabetes proto
Derived person age value (2423393)
Gender code (2179640)
Adolescent tobacco smoking history indicator (2923486)
Person high cholesterol indicator (2936262)
Person diabetes personal medical history assessment
description text (3070673)
Each row of the table represents one PhenX protocol. Supp. Table S2 contains the complete list of LOINC® codes and CDEs that are equivalent to PhenX measures and protocols.
LOINC®, Logical Observation Identifiers Names and Codes; caDSR, Cancer Data Standards Registry and Repository; CDE, common data element.
PhenX measures include PhenX RISING (Real world, Implementation,
SharIng) project (https://www.phenx.org/Default.aspx?tabid=748),
the National Eye Institute Glaucoma Human Genetics
Collaboration consortium, and the Gulf Long-Term Follow-up
Study (http://nihgulfstudy.org/). Additional information about
early adopters is available on the Toolkit website.
dbGaP and PhenX will continue to collaborate and extend the
relationship between the two resources. As noted previously, when
new studies submit their data to dbGaP and identify their variables
as PhenX, this information will be stored in the database. Other
areas of development in dbGaP include adding the ability to fil-
ter search results to return variables submitted as, or mapped to,
PhenX; mapping additional retrospective studies; and adding other
languages/ontologies beneath the “Terms Linked to this Variable”
heading on the variable report page (e.g., International Classifica-
tion of Diseases-9 codes or Medical Subject Headings terms). These
developments will expand the ability of investigators to identify
variables of interest across dbGaP. This information can be used
prospectively, at the study design stage, or retrospectively, to iden-
tify opportunities for cross-study analysis with or without the need
up to date when the Toolkit is expanded or updated, either by link-
ing to concepts already extant in those resources or by creating new
concepts within them (as described earlier). By maintaining collab-
orations and close connections to these resources (and potentially
adding resources), PhenX will be able to expand and update the
cross-reference table accordingly. The results presented here extend
LOINC®, and caDSR.
The goal of associating PhenX measures with existing standards
is to make it easier for investigators to share data and to compare
and combine study results. Integrating PhenX measures into existing
standards (LOINC®, CDE) and mapping PhenX variables to
dbGaP study variables extend the utility of PhenX measures and
reveal new opportunities for cross-study analysis. The primary lim-
itation of data sharing is that study-specific measures are needed
to support scientific inquiry. That is, deciding what measures are
needed to effectively address a specific research question is inher-
ent to study design. Striking a balance between the inclusion of
study-specific measures and the inclusion of standard measures is
necessary; both types of measures will affect the overall scientific
impact of the study results. The work presented here will facilitate
and PhenX resources will help promote data sharing and thus will
have a significant positive impact on biomedical research.
Disclosure Statement: The authors declare no conflict of interest.
Hansel NN, Heiss G, Heit JA, Kang JH, Kittner SJ, and others. 2011. Phenotype
harmonization and cross-study collaboration in GWAS consortia: the GENEVA
experience. Genet Epidemiol 35:159–173.
Burton PR, Hansell AL, Fortier I, Manolio TA, Khoury MJ, Little J, Elliott P. 2009. Size
matters: just how big is BIG? Quantifying realistic sample size requirements for
human genome epidemiology. Int J Epidemiol 38:263–273.
caBIG Strategic Planning Workspace. 2007. The Cancer Biomedical Informatics Grid
(caBIG): infrastructure and applications for a worldwide research community.
Stud Health Technol Inform 129:330–334.
Cornelis MC, Agrawal A, Cole JW, Hansel NN, Barnes KC, Beaty TH, Bennett SN,
Bierut LJ, Boerwinkle E, Doheny KF, Feenstra B, Feingold E, and others. 2010.
the knowledge obtained from GWAS by collaboration across studies of multiple
conditions. Genet Epidemiol 34:364–372.
and harmony: the DataSHaPER approach to integrating data across bioclinical
studies. Int J Epidemiol 39:1383–1393.
Garc´ ıa-Closas M, Lubin JH. 1999. Power and sample size calculations in case-control
J Epidemiol 149:689–692.
Gershon RC, Rothrock N, Hanrahan R, Bass M, Cella D. 2010. The use of PROMIS
and assessment center to deliver patient-reported outcome measures in clinical
research. J Appl Meas 11:304–314.
Hamilton, CM, Strader LC, Pratt JG, Maiese D, Hendershot T, Kwok RK, Hammond
JA, Huggins W, Jackman D, Pan H, Nettles DS, Beaty TH, and others. 2011. The
PhenX Toolkit: get the most from your measures. Am J Epidemiol 174:253–260.
Health IT Standards Committee. 2011. Recommendations to the Office of the Na-
tional CoordinatorforHealth Information Technology(ONC) on the assignment
of code sets to clinical concepts [data elements] for use in quality measures.
[Letter]. Accessed at: http://healthit.hhs.gov/portal/server.pt/gateway/PTARGS
Hendershot TP, Pan H, Haines J, Harlan WR, Junkins HA, Ramos EM, Hamilton CM.
2011. Use the PhenX Toolkit to add standard measures to your study. Curr Protoc
Hum Genet 71:1.21.1–1.21.18.
2009. Potential etiologic and functional implications of genome-wide association
loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367.
Kakazu KK, Cheung LW, Lynne W. 2004. The Cancer Biomedical Informatics Grid
(caBIG): pioneering an expansive network of information and tools for collabo-
rative cancer research. Hawaii Med J 63:273–275.
C, Bielinski S, Kullo I, Li R, Manolio T, Chisholm R, Denny J. 2011. Electronic
medical records for genetic research: results of the eMERGE Consortium. Sci
Transl Med 3:79re1.
Khoury MJ, Bertram L, Boffetta P, Butterworth AS, Chanock SJ, Dolan SM, Fortier
I, Garcia-Closas M, Gwinn M, Higgins JP, Janssens AC, Ostell J, and others.
2009. Genome-wide association studies, field synopses, and the development of
the knowledge base on genetic variation and human diseases. Am J Epidemiol
Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A,
of genotypes and phenotypes. Nat Genet 39:1181–1186.
Manolio TA. 2009. Collaborative genome-wide association studies of diverse diseases:
programs of the NHGRI’s office of population genomics. Pharmacogenomics
McCarty C, Chisholm R, Chute C, Kullo I, Jarvik G, Larson E, Li R, Masys D, Ritchie
M, Roden D, Struewing J, Wolf W, eMERGE Team. 2011. The eMERGE network:
a consortium of biorepositories linked to electronic medical records data for
conducting genomic studies. BMC Med Genomics 4:13.
McDonald CJ, Huff SM, Mercer K, Hernandez J, Vreeman DJ, editors. 2011. Logical
Observation Identifiers Names and Codes (LOINCR ?) users’ guide. Indianapolis,
Indiana: Regenstrief Institute.
McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, Forrey A, Mercer K,
DeMoor G, Hook J, Williams W, Case J, Maloney P. 2003. LOINC, a universal
standard for identifying laboratory observations: a 5-year update. Clin Chem
Peissig PL, Rasmussen LV, Berg RL, Linneman JG, McCarty CA, Waudby C, Chen L,
Denny JC, Wilke RA, Pathak J, Carrell D, Kho AN, Starren JB. 2012. Importance
of multi-modal approaches to effectively identify cataract cases from electronic
health records. J Am Med Inform Assoc 19:225–234.
Riley WT, Pilkonis P, Cella D. 2011. Application of the National Institutes of Health
tal health research. J Ment Health Policy Econ 14:201–208.
Thorisson GA, Muilu J, Brookes AJ. 2009. Genotype–phenotype databases: challenges
and solutions for the post-genomic era. Nat Rev Genet 10:9–18.
Vreeman DJ, McDonald CJ, Huff SM. 2010a. LOINCR ?: a universal catalogue of indi-
Int J Funct Inf Pers Med 3:273–291.
Vreeman DJ, McDonald CJ, Huff SM. 2010b. Representing patient assessments in
LOINCR ?. AMIA Annu Symp Proc 2010:832–836.
HUMAN MUTATION, Vol. 33, No. 5, 849–857, 2012