
Linking clinotypes to phenotypes and genotypes from laboratory test results in comprehensive physical exams

Nguyen et al. BMC Med Inform Decis Mak 2021, 21(Suppl 3):51
https://doi.org/10.1186/s12911-021-01387-z
RESEARCH
Thanh Nguyen1, Tongbin Zhang2,3, Geoffrey Fox4, Sisi Zeng2†, Ni Cao2†, Chuandi Pan2,3 and Jake Y. Chen1*
From 16th MCBIOS Birmingham, AL, USA. 28-30 March 2019
Abstract
Background: In this work, we aimed to demonstrate how to use lab test results and other clinical information to support precision medicine research and clinical decisions on complex diseases, with the support of electronic medical record facilities. We defined “clinotypes” as clinical information that could be observed and measured objectively using biomedical instruments. From well-known ‘omic’ problem definitions, we defined problems using clinotype information, including stratifying patients (identifying sub-cohorts of interest for future studies), mining significant associations between clinotypes and specific disease phenotypes, and discovering potential linkages between clinotype and genomic information. We solved these problems by integrating public omic databases and applying advanced machine learning and visual analytic techniques to two-year health exam records from a large population of healthy southern Chinese individuals (n = 91,354). When developing the solution, we carefully addressed the missing information, data imbalance and non-uniform data annotation issues.
Results: We organized the techniques and solutions that address the problems and issues above into the CPA (Clinotype Prediction and Association-finding) framework. At the data preprocessing step, we handled the missing value issue with a prediction accuracy of 0.760. We curated 12,635 clinotype-gene associations. We found 147 associations between chronic disease phenotypes and clinotypes, which improved the disease predictive performance to an average AUC of 0.967. We mined 182 significant clinotype-clinotype associations among 69 clinotypes.
Conclusions: Our results showed strong potential connectivity between omics information and clinical lab test information. The results further emphasized the need to use and integrate clinical information, especially lab test results, in future PheWAS and omic studies. Furthermore, they showed that clinotype information could initiate an alternative research direction and serve as an independent field of data to support the well-known ‘phenome’ and ‘genome’ research.
Keywords: Clinotype, Lab test result, Electronic medical record, Machine learning
© The Author(s) 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
*Correspondence: jakechen@uab.edu
†Sisi Zeng and Ni Cao contributed equally to the work
1 Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
Full list of author information is available at the end of the article
Background
As electronic health records (EHR) have increasingly supported biomedical and healthcare service research, using clinical information, especially clinical test information, to strengthen precision medicine remains an open challenge [1]. There have been many EHR applications for improving precision medicine and quality of care, including identifying disease risk factors [2] and molecular biomarkers [3]; identifying high-risk/special-treatment cohorts [4, 5]; identifying comorbidities [6, 7]; detecting adverse drug events and side effects [8]; repurposing drugs [9]; and predicting early hospitalizations [10]. However, it is still unclear to what extent these findings are associated with specific clinical test results, which are among the most practical information for care providers [11]. In addition, whether these associations imply that the test results are risk factors or just a reflection of the phenotype is still ambiguous. For example, the monocyte count, a common blood test, is a result of the inflammatory response in chronic obstructive pulmonary disease and could also be a risk factor leading to cardiovascular diseases [12].
On the other hand, electronic medical data systems and analytical methods, which are the essential facilities for tackling the challenge above, have gradually matured. On the data system side, elements of EHR data, including medical test information, the Unified Medical Language System [13], and data integration [14], have been standardized [15–17] and are well supported for EHR extraction and refinement. In addition, natural language processing tools [18], manual curation and crowd-sourcing efforts have produced many data sources [19–21] that potentially allow linking clinical test results, phenotypic/clinical outcomes, and genotype information. On the analytical side, custom statistical data mining and machine learning techniques have been applied to EHR data to cope with the challenges of understanding biomedical and healthcare big data. To determine disease risks, one can use a popular statistical analysis technique, disproportionality analysis [22]. To predict patient survival and track disease progression using clinical biomarkers [23, 24], one can perform temporal data analysis such as regression in time series analysis [25] and the Cox regression model [26]. To perform classification based on multivariate models [27], one can build statistical learning models such as decision trees [28], artificial neural networks [29], hidden Markov models, and support vector machines [30, 31]. In addition, set-based statistical analysis methods, such as the chi-square test and Fisher’s exact test, are useful in evaluating the significance of the findings [32]. There have been several examples of informatics systems that allow the use of medical test and other clinical information, such as eMERGE [33] and I2B2 [34], where the integration of test results and genotype information helps specify the cohorts of interest and customized algorithms are developed for disease-specific problems.
Given these better facilities, why have EHR and their rich clinical test information not been able to play a more active role in precision medicine? Among many limitations, [35] highlights the data quality issues: “interoperability, poor quality, and accuracy of the collected information”. In other words, EHR data have three specific challenging issues to address. First, EHR data contain missing values [36] because of human error or non-responding subjects [37]. Second, EHR data are naturally imbalanced, in terms of both class imbalance (for example, the small percentage of ‘abnormality’ events) and patient demographic imbalance. Third, EHR data lack thorough and uniform annotation. Usually, the annotation needs to be made patient-specific.
This work presents a pioneering framework for better using EHR, especially their rich clinical test results, to enhance precision medicine, defining new problems in biomedicine that involve these data and providing solutions to them. We proposed the concept of “clinotype” in response to the call for clinical information modeling, especially for querying and analytics over clinical content and decision support over clinical content [38]. We define “clinotypes” as clinical information, excluding treatment, that can be observed and measured objectively using biomedical instruments. Most clinotypes are hospital lab tests. However, we argue that the “clinotype” concept and the “hospital lab test” are not entirely the same, for two reasons. First, with the development of mobile devices, patients can perform some measurements themselves outside the hospital laboratory; therefore, the term “hospital lab test” may not apply well in this case. Second, hospital lab tests include drug testing (treatment-related); therefore, this type of lab test is excluded from the “clinotype” definition. In addition, unlike the “phenotypes” commonly used in biomedicine, which are associated with disease morphology as developed by healthcare professionals [39], clinotypes are qualitative or quantitative measurements that are neutral to expert judgment. We tackled the data quality issues with both data quality control and machine learning support. We defined three broad problems of ‘clinotype’ data analytics: clinotype-clinotype association discovery, clinotype-phenotype association discovery and clinotype-genotype relationship discovery. We named the framework CPA (Clinotype Prediction and Association-finding). The dataset used in this study, provided by the First Affiliated Hospital of Wenzhou Medical University, China (acronym: 1AH), contains values for a total of 400 clinotypes, with no restriction to particular cohorts or diseases of interest. This dataset was collected between 2012 and 2014 from 91,354 patients and represents the Southern Chinese population well, mostly from the south of Fujian province and the entire Zhejiang province, a region with more than 20 million residents.
Materials and methods
CPA is an integrative machine learning framework that includes data preprocessing and clinotype analysis, as presented in Fig. 1. From the original data (P0), which consists of 9,283,306 clinotype results from 91,354 patients and 400 clinotypes, we filtered out insignificant clinotypes and patients and normalized the data. In data preprocessing, due to technical limitations in Chinese natural language processing, we were unable to include the non-numerical clinotype results. After preprocessing, we used the P2 data subset and the available diagnosis information to solve the clinotype problems: discovering clinotype-phenotype (disease) associations and stratifying the patients’ clinotype data to identify cohorts of interest. We curated the existing ‘omic’ data sources for clinotype-genotype information.
Acquire and preprocess data
We acquired, preprocessed and organized the dataset according to the workflow in Fig. 1 in three steps, which create five data subsets: P0, P1, P2, Pr and Pt. P0 is the original dataset after removing patients’ identifiable information. P1 is the subset of data related to numerical clinotypes. P2 is the normalized dataset derived from P1. Pr and Pt are the training set and the test set, respectively, for machine learning. The data preprocessing tackles the non-uniform annotation issue and supports machine learning as follows.
Fig. 1 Flowchart for the CPA framework. The rectangular boxes represent clinotype data subsets from P0 to Pr/Pt. The dashed rectangular boxes represent clinotype problems and main results. The rounded rectangular boxes represent external (non-clinotype) data and techniques that help solve the clinotype problems
The original P0 subset, acquired directly from the health checkup department (an independent department at 1AH), contains records of 400 health clinotype values for 91,354 patients between September 2011 and May 2014. Among the 91,354 patients, 712 (0.7%) are under 18 years old. More information about the selected cohort can be found in Table 1. Since this work focuses on health clinotypes, we manually translated the clinotype names from Chinese to English. To improve the quality of our translation, we queried our translated English names in popular medical terminology resources, MedLinePlus (http://www.nlm.nih.gov/medlineplus/), Lab Tests Online (https://labtestsonline.org/) and PubMed (http://www.ncbi.nlm.nih.gov/pubmed/, searching titles and abstracts), and adjusted our translation according to the closest matching terms in these resources. Importantly, for each personal clinotype result in P0, 1AH provided the normal reference range, which follows Chinese medical guidance and is a standard requirement for any 1AH medical record. The reference ranges depend on the individual. For example, the Hematocrit test in P0 has two reference ranges: 35–45% for female individuals and 40–50% for male individuals. The normal reference ranges allow annotating every clinotype result as ‘high’, ‘normal’ or ‘low’. Therefore, in this work, we tackled the annotation issue by applying the domain knowledge and data standard from the care provider.
The P1 subset is derived from P0 by filtering out low-confidence patient and clinotype information. Among the 400 clinotypes, 97 are numerical. In this work, due to technical limitations in Chinese natural language processing, we did not include the non-numerical test results, which often include free text. Three clinotypes, Yeast Culture, Creatinine (Enzymatic) and Thyroid Globulin Antibody (ECLIA), are rare (taken by fewer than 1000 patients, or about 1% of the population size) and were excluded from the study to reduce noise in the statistical machine learning methods. Thus, 94 clinotypes remained for further preprocessing and analysis. We also removed patients having no numerical clinotypes and 213 pediatric patients (< 0.1%) due to their low count. P1 contains 4,122,917 health clinotype entries from 68,419 patients.
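The reference-range annotation described for P0 can be illustrated with a minimal sketch. The hematocrit ranges are the ones quoted in the text; the lookup table `REFERENCE_RANGES`, the function name `annotate`, the sex codes and the treatment of boundary values as ‘normal’ are assumptions made for illustration, not details given by the authors.

```python
from typing import Optional

# Hypothetical reference-range table keyed by (clinotype, sex).
# Only the hematocrit ranges come from the text; the structure is assumed.
REFERENCE_RANGES = {
    ("Hematocrit", "F"): (35.0, 45.0),
    ("Hematocrit", "M"): (40.0, 50.0),
}


def annotate(clinotype: str, sex: str, value: float) -> Optional[str]:
    """Label a numerical result as 'low', 'normal' or 'high'.

    Returns None when no reference range is known for the pair.
    Boundary values are treated as 'normal' (an assumption).
    """
    bounds = REFERENCE_RANGES.get((clinotype, sex))
    if bounds is None:
        return None
    low, high = bounds
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"
```

The same hematocrit value of 46% is labeled ‘high’ for a female patient and ‘normal’ for a male patient, which is why the annotation has to be patient-specific.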
The P2 subset is derived from P1 by normalizing clinotype
results with the z-score formula (1), in which i is the clinotype index, n is the patient index, $\bar{x}_i$ is the mean of clinotype i, $\sigma_i$ is the standard deviation of clinotype i and $\hat{x}_{i,n}$ is the normalized value of patient n on clinotype i. The mean and standard deviation were calculated only from the training set. We chose z-score normalization because it places all clinotypes on a common scale, removing the differences in offset and spread among clinotypes that would otherwise bias machine learning. In addition, z-score normalization is a linear transformation, which makes it suitable for interpreting and validating the results of the linear regression used later. We scaled the normal reference range of each individual clinotype result using the same mean and standard deviation as in (1).
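As a rough illustration of this normalization, the sketch below fits the mean and standard deviation on training values only and reuses them for test values and for the reference range. The function names, the toy numbers and the choice of the population standard deviation are assumptions; the text does not say which estimator was used.

```python
import statistics


def fit_zscore(train_values):
    """Return (mean, std) of one clinotype, computed from training values only."""
    mean = statistics.fmean(train_values)
    # Population SD is an assumption; the paper does not specify the estimator.
    std = statistics.pstdev(train_values)
    return mean, std


def zscore(value, mean, std):
    """Apply the z-score transformation with previously fitted parameters."""
    return (value - mean) / std


# Toy hematocrit-like values (illustrative only).
train = [38.0, 40.0, 42.0, 44.0]
mean, std = fit_zscore(train)

# Test values are normalized with the training parameters, never refitted.
normalized_test = [zscore(v, mean, std) for v in [41.0, 47.0]]

# The reference range is mapped with the same parameters, so the
# 'low'/'normal'/'high' annotation is unchanged by the normalization.
normalized_range = tuple(zscore(b, mean, std) for b in (35.0, 45.0))
```

Because the transformation is monotone and shared between the values and their reference range, a result that was ‘high’ before normalization is still above the normalized upper bound afterwards.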
We set up the training subset Pr and the test subset Pt for downstream machine learning analysis and validation. We selected June 30, 2013 as the date to separate the dataset. This date divides the P2 set into a training set and a test set following the conventional 3:1 ratio (Fig. 1). Pt and Pr allow tackling the missing value issue using machine learning, which we describe later. We replaced missing values in Pt and Pr with the corresponding predicted values computed from the missing value models. The P2, Pt and Pr subsets allow defining and solving the clinotype-related problems shown in the Fig. 1 pipeline.
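A date-based split of this kind can be sketched as follows. The cutoff date is the one stated in the text; the record layout, the function name `split_by_date` and the decision to place exams taken on the cutoff date itself in the training set are assumptions.

```python
from datetime import date

CUTOFF = date(2013, 6, 30)  # split date stated in the text


def split_by_date(records, cutoff=CUTOFF):
    """Split (patient_id, exam_date, values) records into (train, test).

    Exams on or before the cutoff go to training (assumed convention);
    later exams go to the test set.
    """
    train = [r for r in records if r[1] <= cutoff]
    test = [r for r in records if r[1] > cutoff]
    return train, test


# Hypothetical records for illustration.
records = [
    ("p1", date(2012, 3, 14), {"Glucose": 5.1}),
    ("p2", date(2013, 6, 30), {"Glucose": 6.4}),
    ("p3", date(2013, 7, 1), {"Glucose": 4.9}),
    ("p4", date(2014, 5, 2), {"Glucose": 7.2}),
]
train_set, test_set = split_by_date(records)
```

Splitting by time rather than at random keeps every test exam later than every training exam, so the evaluation resembles predicting for future visits.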
In addition to the P0 dataset, the outpatient department at 1AH provided the diagnostic history, coded with the Chinese version of ICD-10. More information about the disease-specific cohorts can be found in Additional file 1: Table S1.
$$\hat{x}_{i,n} = \frac{x_{i,n} - \bar{x}_i}{\sigma_i} \tag{1}$$

Table 1 Statistics about the demographic information in the selected cohort
Age group | Gender | No. patients (%)
Young (18–39) | Male | 14,594 (21.33)
Young (18–39) | Female | 12,596 (18.41)
Middle (40–59) | Male | 18,717 (27.36)
Middle (40–59) | Female | 14,137 (20.66)
Old (60 and above) | Male | 5207 (7.61)
Old (60 and above) | Female | 3168 (4.63)

Handle the missing value and data imbalance
Technical solution
Built upon machine learning techniques, the CPA framework handles the missing value issue and, partially, the data imbalance issue in one step. We selected support vector linear regression (SVLR) to build the models that predict missing values. Compared to other techniques for handling missing data [43, 44], we preferred SVLR not only because of its higher sparsity [45, 46] but also because its models can be directly applied to discover clinotype-clinotype associations. For each clinotype y, SVLR estimates the missing value using the linear model $\hat{y}_n = w^{T}x_n + b$ if the clinotype value of patient n is missing. Here, $\hat{y}_n$ denotes the estimate of the missing value, $x_n$ is the vector of the other (non-missing) clinotype values for patient n, and w denotes the coefficients for these non-missing clinotypes. SVLR uses the non-missing y in the Pr subset to train the model. Briefly, SVLR finds the solution by minimizing the objective in (2). Here, $y_n$ denotes the non-missing value of y in training, $\varepsilon \ge 0$ is the ‘tolerance’, or expected error between the predicted and the real $y_n$ in regression, and $\xi_n$ is the slack variable as defined in [45, 46]. The parameters C and ε decide the trade-off between the smoothness of the regression function and how far the predicted clinotype value may deviate from the true clinotype value. We decided to use C = 1 and ε = 0.001 after testing multiple choices of C (0.001, 0.01, 0.1, 1, 100, 1000) and ε (0.001, 0.01, 0.1, 1). We used ILOG CPLEX Optimizer [47] to solve problem (2).
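The imputation idea can be sketched with an off-the-shelf solver. This is not the authors' implementation: they solved their own formulation with ILOG CPLEX, whereas the sketch below swaps in scikit-learn's `SVR` with a linear kernel, which minimizes the standard epsilon-insensitive objective with a squared-norm regularizer. The synthetic data, the variable names and the 20 "missing" entries are assumptions; only C = 1 and ε = 0.001 come from the text.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for z-scored data: 200 patients, 3 observed clinotypes.
X = rng.normal(size=(200, 3))
true_w = np.array([0.8, -0.5, 0.0])
y = X @ true_w + 0.1 + rng.normal(scale=0.01, size=200)

# Pretend the target clinotype is missing for the first 20 patients.
observed = np.ones(200, dtype=bool)
observed[:20] = False

# Linear epsilon-insensitive regression with the C and epsilon from the text.
model = SVR(kernel="linear", C=1.0, epsilon=0.001)
model.fit(X[observed], y[observed])

# Estimates for the missing values, plus the fitted linear model (w, b).
y_hat = model.predict(X[~observed])
w = model.coef_.ravel()
b = float(model.intercept_[0])
```

The fitted coefficients `w` are what makes this family of models convenient for association discovery: a large coefficient for one clinotype when predicting another suggests a clinotype-clinotype link, while coefficients near zero (the third feature here) suggest none.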
To partially tackle the data imbalance issue, in our implementation we applied the under-sampling method in [48] to select a balanced subset in the training phase. By balancing, we mean that for each prediction-target clinotype y in (2), the ratio among the ‘normal’, ‘high’ and ‘low’ $y_n$ selected for training is approximately 1:1:1. For each clinotype prediction, we ran resampling, learning and predicting 50 times and reported the averages of the coefficients and predicted values.
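A simple way to obtain a 1:1:1 training subset is plain random under-sampling, sketched below. The specific method of [48] may differ; the function name `balanced_undersample`, the toy label counts and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def balanced_undersample(labels, rng):
    """Return indices holding the same number of samples for every label.

    Each class is randomly reduced to the size of the smallest class.
    """
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    n = min(len(indices) for indices in by_label.values())
    picked = []
    for lab in sorted(by_label):
        picked.extend(rng.sample(by_label[lab], n))
    return picked


# Toy annotation counts: abnormal results are rare, as in health exam data.
labels = ["normal"] * 90 + ["high"] * 7 + ["low"] * 3

# Repeat the resampling 50 times, as in the text; a model would be trained
# on each subset and the coefficients and predictions averaged.
rng = random.Random(0)
runs = [balanced_undersample(labels, rng) for _ in range(50)]
```

Repeating the draw compensates for the information discarded in any single balanced subset, since different ‘normal’ results are picked in each run.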
Performance metric and validation
We used the models (2) built upon the Pr subset to estimate the non-missing clinotype values in the Pt set. Since each non-missing clinotype value has a reference range, both the real and the estimated clinotype value can be annotated as ‘high’, ‘normal’ or ‘low’. Therefore, we have 9 possible outcomes, as shown in Table 2. With the emphasis on predicting abnormality, we defined the accuracy (ACC) and positive predictive value (PPV) metrics as in (3).
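The evaluation can be sketched by mapping each (real, estimated) annotation pair to an outcome according to Table 2 and then computing the ACC and PPV ratios. The function name `acc_ppv` and the six example pairs are hypothetical; the outcome mapping is copied from Table 2.

```python
# Outcome for each (real, estimated) annotation pair, copied from Table 2.
OUTCOME = {
    ("high", "high"): "TP", ("high", "normal"): "FN", ("high", "low"): "FP",
    ("normal", "high"): "FP", ("normal", "normal"): "TN", ("normal", "low"): "FP",
    ("low", "high"): "FP", ("low", "normal"): "FN", ("low", "low"): "TP",
}


def acc_ppv(real, estimated):
    """Compute (ACC, PPV) from paired 'high'/'normal'/'low' annotations."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for r, e in zip(real, estimated):
        counts[OUTCOME[(r, e)]] += 1
    total = sum(counts.values())
    acc = (counts["TP"] + counts["TN"]) / total
    predicted_positive = counts["TP"] + counts["FP"]
    # PPV is undefined when nothing is predicted abnormal.
    ppv = counts["TP"] / predicted_positive if predicted_positive else float("nan")
    return acc, ppv


# Hypothetical annotations for six test results.
real = ["high", "normal", "normal", "low", "high", "normal"]
estimated = ["high", "normal", "high", "low", "normal", "normal"]
acc, ppv = acc_ppv(real, estimated)
```

Note that under this mapping an abnormal result predicted in the wrong direction (real ‘high’, estimated ‘low’) counts as a false positive, so it lowers PPV as well as ACC.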
$$\frac{1}{2}\lVert w\rVert + C\sum_{n=1}^{N}\xi_n \quad \text{subject to} \quad w^{T}x_n + b \ge y_n - \varepsilon - \xi_n,\quad w^{T}x_n + b \le y_n + \varepsilon + \xi_n,\quad \xi_n \ge 0\ \ \forall n \tag{2}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP} \tag{3}$$

Curate the clinotype-genotype association
Since we did not have genetic test information for the study cohort, we used the public databases PAGER [49, 50] and REACTOME [51, 52] (pathway and metabolism only) to find genes associated with the clinotypes. PAGER is a gene-set database that integrates the most popular gene-set-level databases known today (including MSigDB) and collections of phenotype-related genes from popular manually curated databases, including OMIM [53, 54], MSigDB and GeneSigDB [55]. REACTOME is one of the best-known curated biological pathway databases. We removed non-biological words from each clinotype name, such as absolute value, percentage, ratio and volume, and converted all names to singular form before querying. For example, for the clinotypes “Basophils Percentage” and “Monocytes Absolute value”, we queried “Basophil” and “Monocyte”. After acquiring each clinotype’s related gene set, we used the DAVID Gene ID conversion tool [56, 57] to map the names retrieved from REACTOME and PAGER to UniProt IDs, in order to remove potential alias names and ensure that the genes found were reviewed. After querying and filtering, we obtained 12,635 connections between 6145 genes and only 61 clinotypes, as shown in Additional file 2: Table S2.
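The name clean-up before querying can be sketched as below. The qualifier list contains only the four examples named in the text (the authors' full list is not given), and the singularization rule is a deliberately naive assumption that would need refinement for real clinotype names.

```python
import re

# Non-biological qualifiers named in the text; the complete list used by
# the authors is not given, so this list is an assumption.
QUALIFIERS = ["absolute value", "percentage", "ratio", "volume"]


def query_term(clinotype_name: str) -> str:
    """Strip non-biological qualifiers and singularize a clinotype name."""
    name = clinotype_name
    for qualifier in QUALIFIERS:
        pattern = rf"\b{re.escape(qualifier)}\b"
        name = re.sub(pattern, "", name, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed words.
    name = " ".join(name.split())
    # Naive singularization: drop one trailing "s" (assumed rule).
    lowered = name.lower()
    if lowered.endswith("s") and not lowered.endswith("ss"):
        name = name[:-1]
    return name
```

With this sketch, `query_term("Basophils Percentage")` gives `"Basophil"` and `query_term("Monocytes Absolute value")` gives `"Monocyte"`, matching the two examples in the text.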
Find disease-phenotype and clinotype associations
Technical solution
Using the diagnostic information for the cohort covered by the P1 subset, we found the disease-phenotype and clinotype associations with the help of Student’s t-test [58] as follows. In P1, we selected patients having less than 5% abnormal clinotype values and no diagnostic history as the control set. For each disease, we used the ICD-10 diagnostic code to select the ‘disease’ set. Comparing the disease and control sets with the t-test, we computed the p-value for each clinotype. Clinotypes with a significant p-value (less than 0.05) were considered to have significant associations with the underlying disease.
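The per-clinotype comparison can be sketched with SciPy's two-sample t-test. The function name `associated_clinotypes`, the dictionary layout and the toy values are assumptions; `scipy.stats.ttest_ind` with its default equal-variance setting corresponds to Student's t-test, and the 0.05 threshold is the one stated in the text.

```python
from scipy import stats


def associated_clinotypes(disease, control, alpha=0.05):
    """Return {clinotype: p-value} for clinotypes with p < alpha.

    `disease` and `control` map clinotype names to lists of values
    observed in the disease set and the control set respectively.
    """
    hits = {}
    for name, disease_values in disease.items():
        if name not in control:
            continue
        # Default equal_var=True gives the classic Student's t-test.
        result = stats.ttest_ind(disease_values, control[name])
        if result.pvalue < alpha:
            hits[name] = float(result.pvalue)
    return hits


# Toy values for illustration only.
disease_set = {
    "Glucose": [7.9, 8.4, 9.1, 8.8, 7.5],
    "Albumin": [44.0, 45.0, 43.0, 46.0, 44.0],
}
control_set = {
    "Glucose": [5.0, 5.2, 4.9, 5.4, 5.1],
    "Albumin": [44.0, 45.0, 43.0, 46.0, 45.0],
}
hits = associated_clinotypes(disease_set, control_set)
```

In this toy case only Glucose is returned, since the Albumin values are nearly identical in the two sets.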
Table 2 Confusion matrix between the estimated and real clinotype value annotation
Real value annotation | Estimated: High | Estimated: Normal | Estimated: Low
High | TP | FN | FP
Normal | FP | TN | FP
Low | FP | FN | TP
TP: true positive, TN: true negative, FP: false positive, FN: false negative

Performance metric and validation
To validate these associations, we compared the disease-versus-control classification performance of two types of model. For the first type, denoted ASS (abbreviation of association), we used only the disease’s associated tests as features for classification. For the second type, denoted NON (abbreviation of non-association), we used only the non-associated tests as features for classification. We trained the classification models on the Pr set and measured their performance on the Pt set, as described in the section above. We expect the classification metrics, area under the curve (AUC) and accuracy [59], of the ASS models to be higher than those of the NON models. For training the classification models, we applied Random Forest [60] as implemented in Weka version 3.8 [61], an algorithm that was notably successful in Google’s and Mt. Sinai’s DeepPatient [62].
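The ASS-versus-NON comparison can be sketched as follows. The authors used the Weka implementation of Random Forest; this sketch swaps in scikit-learn's `RandomForestClassifier`, and the synthetic data, the helper name `holdout_auc` and the split sizes are assumptions. The "non-associated" columns here are pure noise, which exaggerates the gap compared with real data, where non-associated clinotypes may still carry some signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_test = 400, 200

# Synthetic disease-versus-control labels and four clinotype features.
y = rng.integers(0, 2, size=n_train + n_test)
X = rng.normal(size=(n_train + n_test, 4))
# Columns 0-1 carry signal ("associated"); columns 2-3 stay pure noise.
X[:, :2] += 1.5 * y[:, None]


def holdout_auc(cols):
    """Train on the first n_train rows using `cols`, return AUC on the rest."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[:n_train][:, cols], y[:n_train])
    scores = clf.predict_proba(X[n_train:][:, cols])[:, 1]
    return roc_auc_score(y[n_train:], scores)


auc_ass = holdout_auc([0, 1])  # ASS-style model: associated features only
auc_non = holdout_auc([2, 3])  # NON-style model: non-associated features only
```

If the associations found by the t-test are meaningful, `auc_ass` should exceed `auc_non` on held-out data, which is the check the validation relies on.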
Identify subcohorts of interest by patient stratification
We used the Plotviz tool [63, 64], built upon the high-performance computing platform at Indiana University, to cluster the P2 subset patients. Using the Deterministic Annealing Pairwise Clustering (DAPWC) algorithm [65], which focuses on highlighting the differences among data points in high-dimensional data, Plotviz significantly reduced the computational time, performed dimensionality reduction and visualized the results in 3D. To determine the number of clusters (k) in Plotviz, we applied the Silhouette index [66] (Si). Si close to 1 implies an appropriate clustering structure, whereas Si close to -1 implies an inappropriate clustering structure, including too few or too many clusters. From multiple experiments, we chose k = 5 (Si = 0.793).
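Selecting k with the Silhouette index can be sketched as below. DAPWC and Plotviz are not used here; the sketch swaps in scikit-learn's `KMeans` purely to produce cluster labels, and the synthetic three-cluster data and the candidate range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data with three well-separated groups (illustrative only).
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Cluster with each candidate k and record the mean Silhouette index.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k whose clustering has the highest Silhouette index.
best_k = max(scores, key=scores.get)
```

Both too few clusters (distinct groups merged) and too many (a natural group split) lower the index, so the maximum is a reasonable choice for k.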
We proposed two options to annotate the clusters. First, we found the clinotypes that are significantly expressed in each cluster using the ANOVA test. Clinotypes returning a significant average p-value (less than 0.05) can be used to annotate the clusters. Second, we found which clusters c over-represent a specific disease D using the hypergeometric distribution p-value computed as in (4), where N is the number of patients in the P2 subset, K is the number of patients diagnosed with disease D, η is the size of cluster c and κ is the number of patients having disease D in cluster c. A p-value below 0.05 implies that cluster c is significantly enriched for disease D.
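The enrichment p-value can be computed directly from the binomial-coefficient form of the hypergeometric tail. The function name `enrichment_p` and the small example numbers are assumptions; the arguments follow the symbols in the text (N patients, K with disease D, cluster size η, κ diseased patients in the cluster).

```python
from math import comb


def enrichment_p(N, K, eta, kappa):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `kappa` diseased patients in a cluster
    of size `eta`, when `K` of the `N` patients overall have the disease.
    """
    upper = min(K, eta)
    total = comb(N, eta)
    favourable = sum(
        comb(K, tau) * comb(N - K, eta - tau) for tau in range(kappa, upper + 1)
    )
    return favourable / total


# Small worked example: 10 patients, 4 diseased, cluster of 3 with 2 diseased.
p_small = enrichment_p(N=10, K=4, eta=3, kappa=2)
```

For the small example the sum is (6·6 + 4·1)/120 = 1/3, so such a cluster would not count as enriched at the 0.05 level. Exact integer arithmetic avoids overflow in the factorials, although for cohort-sized inputs a library routine such as a hypergeometric survival function may be faster.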
$$p = \sum_{\tau=\kappa}^{\min(K,\eta)} \frac{\dfrac{K!}{(K-\tau)!\,\tau!}\cdot\dfrac{(N-K)!}{(\eta-\tau)!\,\big((N-K)-(\eta-\tau)\big)!}}{\dfrac{N!}{(N-\eta)!\,\eta!}} \tag{4}$$

Results
In this work, we use the following acronyms:
SVLR: support vector linear regression
PPV: positive predictive value
NPV: negative predictive value
ACC: accuracy
AUC: area under the receiver-operating characteristic curve
Robust missing value prediction models
In tackling the missing value issue, SVLR achieves desirable prediction performance for a number of numerical clinotypes. Overall, the weighted prediction accuracy over all measurements is 0.760, the weighted average PPV is 0.488, and the weighted average NPV is 0.829. This performance is substantially higher than random prediction, for which, given the metrics defined in the methods section, the expected ACC/PPV/NPV would be 0.33. Additional file 3: Table S3 shows the prediction performance metrics for all clinotypes. There are three scenarios for the performance of SVLR in predicting missing clinotypes. First, Blood Platelet Hematocrit, Average Erythrocyte Volume, and Lymph Absolute Value show both high (above 0.7) PPV and accuracy. Second, Albumin, RBC Volume Distributed SD Value and Neutrophils Absolute Value show average PPV (from 0.5 to 0.7) and high accuracy. Third, lipid-related measurements, such as LDL-Cholesterol, Apolipoprotein B and Triglycerides, achieve moderate PPV but moderate or low accuracy (below 0.7), except for LDL-Cholesterol. Most of the clinotype NPVs are high, except for those of the lipid-related measurements.
SVLR may not be very accurate in modeling clinotypes for older people. In Fig. 2, the accuracy, PPV and NPV of models trained on the young-age and middle-age groups are higher than those of models trained on the old-age group. Furthermore, the average NPV and accuracy of models trained on the old-age group are lower than the average NPV and accuracy obtained using the entire dataset.
The significant disease-phenotype-clinotype associations could potentially improve disease identification
Here, we focused on the phenotype-clinotype associations of five common chronic diseases: chronic gastritis, coronary, cataract, hyperlipidemia, and diabetes. We found 147 significant phenotype-clinotype associations (Additional file 4: Table S4). The top 10 significant clinotype-phenotype associations, sorted by p-value, are listed in Table 3. Figure 3 shows that the classification models built upon these associations (acronym: ASS models) consistently outperform the models built without these associations (non-association, acronym: NON models). Briefly, the ASS models use only the clinotypes that have strong associations with the disease, while the NON models do not use these clinotypes. The details of constructing these models, from finding clinotype-phenotype associations to the classification algorithm (random forest), can be found in the methods section. For all diseases, the ASS models achieve higher AUC and PPV. On average, the ASS models achieve an AUC of 0.967 and a PPV of 0.923, whereas the NON models only achieve an AUC of 0.942 and a PPV of 0.886.
Cohort identified by stratification of patients’ clinotype reveals potential chronic comorbidities
For the 5 subcohorts identified by Plotviz clustering, the ANOVA tests return 67 significant clinotypes (Additional file 5: Table S5), which can be used to annotate each cluster. Information on selecting the number of clusters can be found in Additional file 5. Interestingly, the unbiased and domain-knowledge-free clustering method (Plotviz) yields patient subgroups that have potentially similar disease phenotypes. The top 5 significant clinotypes are Blood Platelet Distributed Width (p-value 1.79 × 10^-169), Postprandial 2h Blood Sugar (p-value 3.58 × 10^-133), Glucose (p-value 9.69 × 10^-104), Saccharification Blood Protein (p-value 6.01 × 10^-73) and Crystallization (p-value 7.92 × 10^-49). These top 5 clinotypes annotate two clusters. Blood Platelet Distributed Width and Crystallization are higher in cluster 3, which contains 101 patients (Figs. 4, 5). Postprandial 2h Blood Sugar, Glucose and Saccharification Blood-red Protein characterize cluster 1, which contains 843 patients. Additional file 6: Table S6 summarizes the disease-phenotype annotation for each cluster. These annotations can be visualized using the Plotviz visualization tool (http://salsahpc.indiana.edu/plotviz/) and the data files in Additional file 7.
Discussions
In this work, CPA’s machine learning technique could successfully predict the missing health clinotype values. Accurate missing-value prediction provides qualified information for supporting diagnosis and a better understanding of the patient at the individual level. In addition, the Plotviz clustering technique could reveal patient subgroups who potentially share similar health issues. Validation via curation suggests potential explanations for significant clinotype-clinotype associations at the gene level. This result could be used to suggest new biological research topics on clinotype-genotype associations.
We also want to clarify the difference between the “clinical modeling” concept, which our CPA framework aims for, and the “clinical information models” (CIM) defined by Moreno-Conde’s group [40]. In [40], CIM is a broad concept for structural and semantic artifacts providing multiple functionalities: organizing, storing, querying, visualizing, exchanging and analyzing data. In the CPA framework, missing value prediction and clinotype-clinotype association discovery can be regarded as data-analysis functionalities. In addition, the results from patient clustering and linking clinotypes to genomic databases could lead to new clinical trials and research. Therefore, CPA could extend the CIM concept by adding a recommendation functionality, which could be very helpful for doctors and researchers.
Fig. 2 Performance of SVLR models for predicting missing values: average ACC, PPV and NPV comparison between different groups of patients
(defined in the method sections)
Table 3 Top 10 significant clinotype-phenotype associations found in the P2 dataset
Clinotype               Disease-phenotype    p-value
Blood crystallization   Diabetes             3.36 × 10^-18
Blood crystallization   Coronary             1.48 × 10^-17
Rheumatoid factor       Hypertension         1.78 × 10^-16
Blood crystallization   Hyperlipidemia       1.47 × 10^-13
Rheumatoid factor       Chronic gastritis    4.77 × 10^-12
Glucose                 Diabetes             1.71 × 10^-11
Crystallization         Cataract             4.22 × 10^-11
Rheumatoid factor       Hyperlipidemia       6.47 × 10^-9
Blood platelet          Hyperlipidemia       6.24 × 10^-7
Triglycerides           Hyperlipidemia       6.61 × 10^-7
Nguyenetal. BMC Med Inform Decis Mak 2021, 21(Suppl 3):51
Fig. 3 AUC/PPV comparison between two types of the disease-specific classification model: using (ASS) and not using (NON) only
disease-phenotype-clinotype association
Fig. 4 Top 5 clinotypes annotating identified subcohorts. x axis stands for the cluster index. y axis stands for the normalized clinotype values
There are three main limitations of this research work.
The first limitation is that the linear prediction models do
not work well with patients from old-age groups. Therefore,
nonlinear methods are recommended for learning the
clinotype-clinotype associations in follow-up
analyses of the old-age-group data. The second limitation
lies in constructing the semantic structure among health
clinotype names. Thus, we could not use standard
annotation codes for diseases, symptoms and other
phenotypes, such as ICD10 and MeSH terms, to acquire better
curation as in [41].
In addition, to complete the triangle among clinotype,
phenotype and genotype, the CPA framework should
include the following problems. First, mining
clinotype-clinotype associations would complete the
clinotype-clinotype edge, which has not been addressed. Machine
learning techniques could be reapplied to this problem.
Second, linking the clinotype-clinotype and
clinotype-genotype associations to the gene level would provide
insights explaining the associations above. Here, integrating
PheWAS with better clinotype-phenotype associations
(from curation and natural language processing) would
be a promising solution. We plan to address these problems
in future work.
In addition, PPV leaves two issues open for discussion
in this work. First, the weak anti-correlation
between prediction accuracy and PPV raises an issue
in sampling the training set. It is expected that when
we use totally random balanced sampling in the training
set, the distribution of predicted labels in the test set
may contain fewer ‘normal’ labels, which may increase PPV.
However, ‘normal’ is the major label; therefore, increasing
PPV may decrease accuracy. We do not have a clear
answer as to whether the more advanced data sampling
approaches in [42] could be a better solution, because of
the missing values. Second, although the average PPV
achieved in this work is moderate (PPV), we argue that
it is a reportable outcome. In this study, the ‘positive’
class stands for an abnormal measurement value (either
high or low), which is often the minor class in health
data. In addition, our definition of a true positive (see
the method section on setting up metrics for prediction
performance) requires the predicted label and the true
label to be the same, either ‘high’ or ‘low’. In other words,
if the predicted label is ‘low’ but the true label is ‘high’,
or vice versa, we still consider this case a false positive
although neither the predicted label nor the true label is
‘normal’. With this definition, the expected random PPV is
0.33, much less than the average PPV we achieved. Our
plausible results in clinotype-clinotype association
discovery and patient clustering, which directly use
clinotype missing value prediction, show that the discovery
is still solid with the PPV above. However, we believe
that the discovery could be improved if we apply other
techniques with higher PPV.
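To make the true-positive definition above concrete, the following Python sketch computes PPV under that rule: a prediction counts as a true positive only when the predicted and true labels are the same abnormal label (‘high’ or ‘low’). The function name and the toy labels are hypothetical. The sketch also enumerates the case behind the 0.33 baseline under the assumption (ours, not stated in the text) that the true label is equally likely to be ‘low’, ‘normal’ or ‘high’.

```python
from fractions import Fraction

ABNORMAL = ("low", "high")
LABELS = ("low", "normal", "high")

def strict_ppv(true_labels, predicted_labels):
    """PPV where a true positive needs the same abnormal label on both sides."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    predicted_positive = 0
    true_positive = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred in ABNORMAL:
            predicted_positive += 1
            # 'low' predicted for a 'high' truth (or vice versa) is NOT a true positive.
            if pred == truth:
                true_positive += 1
    if predicted_positive == 0:
        return None  # PPV is undefined when nothing is predicted abnormal
    return Fraction(true_positive, predicted_positive)

# Toy example with hypothetical labels: one exact abnormal match out of four abnormal predictions.
truth = ["high", "low", "normal", "high", "normal"]
pred = ["high", "high", "low", "low", "normal"]
toy_ppv = strict_ppv(truth, pred)

# Baseline: pair each abnormal prediction with every equally likely true label.
all_truth = [t for p in ABNORMAL for t in LABELS]
all_pred = [p for p in ABNORMAL for t in LABELS]
random_ppv = strict_ppv(all_truth, all_pred)
```

Under the uniform-label assumption the enumeration gives a PPV of one third, consistent with the 0.33 quoted above. In real data ‘normal’ is the major label, so the chance-level PPV for a given clinotype would depend on its actual label distribution.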
Conclusions
With the CPA framework, we showed how utilizing clinical
test result information (clinotypes) could further support
precision medicine. The proposed problems and
solutions with clinotypes demonstrate that clinotype
research could potentially be an independent area that is
nonetheless associated with the well-known
genotype–phenotype association studies. Machine learning
techniques play a key role in this pioneering work. It could
lay out the general ideas from which future techniques
could improve the solution for each problem proposed in this work.
Supplementary Information
The online version contains supplementary material available at
https://doi.org/10.1186/s12911-021-01387-z.
Additional file 1. Count of patients diagnosed with each disease (identified
by Chinese ICD10).
Additional file 2. Curated associations between clinotype and genotype.
Additional file 3. Missing value models performance in all clinotypes.
Additional file 4. List of significant clinotype-phenotype (disease)
associations, with p-value < 0.05.
Additional file 5. ANOVA test result for each clinotype when used to
annotate the disease cohorts.
Additional file 6. Hypergeometric enrichment test result when annotating
patient clusters by phenotype.
Additional file 7. Five-cluster visualization using PlotViz software.
Fig. 5 Clustering heatmap with top 5 measurements: Patients
are represented by rows. The order of columns is Blood platelet
Distributed Width, Crystallization, Postprandial 2h Blood sugar,
Glucose, and Saccharification Blood-red Protein
Acknowledgements
The authors thank the IT staff from Department of Computer Technology and
Information Management, The First Affiliated Hospital of Wenzhou Medical
University, Zhejiang, China for helpful guidance in preprocessing the data.
About this supplement
This article has been published as part of BMC Medical Informatics and
Decision Making, Volume 21 Supplement 3 2021: Proceedings of the 16th
Annual MCBIOS Conference: medical informatics and decision making. The full
contents of the supplement are available at
https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-21-supplement-3.
Authors’ contributions
TN designed the data structures, preprocessed the data, built the machine
learning techniques, designed the validation metric, and primarily prepared
the manuscript. SZ translated the medical terminologies (including ICD10
diagnosis and clinotype/lab test name) from Chinese to English and curated
the clinotype-genotype association. NC built the database supporting
the analysis and executed the clinotype-clinotype association validation.
TZ executed the clinotype-phenotype association mining. GF performed
patient stratification using Plotviz technique. CP collected the original data.
JC conceived the project ideas, developed the clinotype conceptual framework,
oversaw the team’s development of computational approaches, provided
guidance, and participated in the development of the manuscript from
drafting to revision. All authors participated in preparing the manuscript, including
writing, commenting and revising. All authors read and approved the final
manuscript.
Funding
This work was supported partially by Wenzhou Department of Science and
Technology Development (Wenzhou Municipal Science and Technology
Bureau), under project number ZG2017020 granted to Chuandi Pan (titled
“Research and Development of Disease Prevention and Prediction System
Based on Cloud Computing and Medical Big Data”); the National Institutes
of Health-funded Center for Clinical and Translational Science grant award
(3UL1TR003096-02) to the University of Alabama at Birmingham (UAB)
(Co-I: Jake Chen), the American Heart Association institutional data science
fellowship award to the Informatics Institute of UAB (Co-I: Jake Chen), the
National Cancer Institute Grant Award (U01CA223976) (Co-PI: Jake Chen) and
the ‘startup budget’ granted to Jake Chen from the University of Alabama at
Birmingham. The publication cost is funded by Chuandi Pan’s research grant
as shown above.
Availability of data and materials
The original datasets are not included in this work. Researchers interested in
using the dataset should contact Chuandi Pan or Jake Chen for further details
and permission.
Ethics approval and consent to participate
The research protocol in this work was approved by Wenzhou Municipal
Science and Technology Bureau and The First Affiliated Hospital, Wenzhou
Medical University, Wenzhou, Zhejiang, China. This is in accordance with the scientific
description in Project Number ZG2017020, titled “Research and Development
of Disease Prevention and Prediction System Based on Cloud Computing and
Medical Big Data”. Since the protocol used a large number of individuals’
medical records, it was practically impossible to obtain all participants’ consent.
Therefore, the consent requirement was waived. All authors have completed
the training required by the Institutional Review Board in this project.
Consent for publication
This work does not include any identifiable details related to
individuals.
Competing interests
The authors declare that this work has no competing interest.
Author details
1 Informatics Institute, School of Medicine, The University of Alabama at
Birmingham, Birmingham, AL, USA. 2 School of First Clinical Medical Sciences -
School of Information and Engineering, Wenzhou Medical University, Zhejiang,
China. 3 Department of Computer Technology and Information Management,
The First Affiliated Hospital of Wenzhou Medical University, Zhejiang, China.
4 School of Informatics, Computing, and Engineering, Indiana University,
Bloomington, IN, USA.
Received: 11 November 2020 Accepted: 6 January 2021
Published: 24 February 2021
References
1. Manrai AK, Patel CJ, Ioannidis JPA. In the era of precision medicine and
big data, who is normal? JAMA. 2018;319(19):1981–2.
2. Liu S, Hou J, Zhang H, Wu Y, Hu M, Zhang L, Xu J, Na R, Jiang H, Ding Q.
The evaluation of the risk factors for non-muscle invasive bladder cancer
(NMIBC) recurrence after transurethral resection (TURBt) in Chinese
population. PLoS ONE. 2015;10(4):e0123617.
3. Goldstein BA, Assimes T, Winkelmayer WC, Hastie T. Detecting clinically
meaningful biomarkers with repeated measurements: an illustration with
electronic health records. Biometrics. 2015;71:478–86.
4. Hillestad R, Bigelow J, Bower A, Girosi F, Meili R, Scoville R, Taylor R. Can
electronic medical record systems transform health care? Potential health
benefits, savings, and costs. Health Aff (Millwood). 2005;24(5):1103–17.
5. Martirosyan L, Arah OA, Haaijer-Ruskamp FM, Braspenning J, Denig P.
Methods to identify the target population: implications for prescribing
quality indicators. BMC health services research. 2010;10:137.
6. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, Lazarus R. Extract-
ing principal diagnosis, co-morbidity and smoking status for asthma
research: evaluation of a natural language processing system. BMC Med
Inform Decis Mak. 2006;6:30.
7. Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, Hansen
T, Soeby K, Bredkjaer S, Juul A, Werge T, et al. Using electronic patient
records to discover disease correlations and stratify patient cohorts. PLoS
Comput Biol. 2011;7(8):e1002141.
8. Harpaz R, Chase HS, Friedman C. Mining multi-item drug adverse
effect associations in spontaneous reporting systems. BMC Bioinform.
2010;11(Suppl 9):S7.
9. Xu H, Aldrich MC, Chen Q, Liu H, Peterson NB, Dai Q, Levy M, Shah A,
Han X, Ruan X, et al. Validating drug repurposing signals using electronic
health records: a case study of metformin associated with reduced
cancer mortality. J Am Med Inform Assoc. 2015;22(1):179–91.
10. Roberts MH, Mapel DW, Von Worley A, Beene J. Clinical factors, including
All Patient Refined Diagnosis Related Group severity, as predictors of early
rehospitalization after COPD exacerbation. Drugs Context. 2015;4:212278.
11. Wians FH. Clinical laboratory tests: which, why, and what do the results
mean? Lab Med. 2009;40(2):105–13.
12. Kim JH, Lim S, Park KS, Jang HC, Choi SH. Total and differential WBC
counts are related with coronary artery atherosclerosis and increase the
risk for cardiovascular disease in Koreans. PLoS ONE. 2017;12(7):e0180332.
13. Adamusiak T, Shimoyama N, Shimoyama M. Next generation pheno-
typing using the unified medical language system. JMIR Med Inform.
2014;2(1):e5.
14. Lenz R, Beyer M, Kuhn KA. Semantic integration in healthcare networks.
Int J Med Inform. 2007;76(2–3):201–7.
15. Kush RD, Helton E, Rockhold FW, Hardison CD. Electronic health
records, medical research, and the Tower of Babel. N Engl J Med.
2008;358(16):1738–40.
16. Kabachinski J. What is health level 7? Biomed Instrum Technol Assoc Adv
Med Instrum. 2006;40(5):375–9.
17. Kalra D, Beale T, Heard S. The openEHR foundation. Stud Health Technol
Inform. 2005;115:153–73.
18. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC,
Chute CG. Mayo clinical Text Analysis and Knowledge Extraction System
(cTAKES): architecture, component evaluation and applications. J Am
Med Inform Assoc. 2010;17(5):507–13.
19. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.
org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of
human genes and genetic disorders. Nucleic Acids Res. 2015;43(Database
issue):D789–98.
20. Ramos EM, Hoffman D, Junkins HA, Maglott D, Phan L, Sherry ST, Feolo
M, Hindorff LA. Phenotype-Genotype Integrator (PheGenI): synthesizing
genome-wide association study (GWAS) data with existing genomic
resources. Eur J Hum Genet. 2014;22(1):144–7.
21. Greshake B, Bayer PE, Rausch H, Reda J. openSNP–a crowdsourced web
resource for personal genomics. PLoS ONE. 2014;9(3):e89204.
22. Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized
pharmacovigilance using natural language processing, statistics, and
electronic health records: a feasibility study. J Am Med Inform Assoc
JAMIA. 2009;16(3):328–37.
23. Oztekin A, Delen D, Kong ZJ. Predicting the graft survival for heart-lung
transplantation patients: an integrated data mining methodology. Int J
Med Inform. 2009;78(12):e84-96.
24. Delen D, Oztekin A, Kong ZJ. A machine learning-based approach
to prognostic analysis of thoracic transplantations. Artif Intell Med.
2010;49(1):33–42.
25. Gibbons RD, Amatya AK, Brown CH, Hur K, Marcus SM, Bhaumik DK,
Mann JJ. Post-approval drug safety surveillance. Annu Rev Public Health.
2010;31:419–37.
26. Cox DR. Regression models and life-tables. In: Breakthroughs in statistics.
Springer; 1992. p. 527–541.
27. Delen D, Walker G, Kadam A. Predicting breast cancer survivabil-
ity: a comparison of three data mining methods. Artif Intell Med.
2005;34(2):113–27.
28. Mathias JS, Agrawal A, Feinglass J, Cooper AJ, Baker DW, Choudhary A.
Development of a 5 year life expectancy index in older adults using pre-
dictive mining of electronic health record data. J Am Med Inform Assoc.
2013;20(e1):e118-124.
29. Shadmi E, Flaks-Manov N, Hoshen M, Goldman O, Bitterman H, Balicer
RD. Predicting 30-day readmissions with preadmission electronic health
record data. Med Care. 2015;53(3):283–9.
30. Rochefort CM, Verma AD, Eguale T, Lee TC, Buckeridge DL. A novel
method of adverse event detection can accurately identify venous
thromboembolisms (VTEs) from narrative electronic health record data. J
Am Med Inform Assoc. 2015;22(1):155–65.
31. Boxwala AA, Kim J, Grillo JM, Ohno-Machado L. Using statistical and
machine learning to help institutions detect suspicious access to elec-
tronic health records. J Am Med Inform Assoc. 2011;18(4):498–505.
32. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry
K, Wang D, Masys DR, Roden DM, Crawford DC. PheWAS: demonstrating
the feasibility of a phenome-wide scan to discover gene-disease associa-
tions. Bioinformatics. 2010;26(9):1205–10.
33. Herr TM, Peterson JF, Rasmussen LV, Caraballo PJ, Peissig PL, Starren JB.
Corrigendum to: Pharmacogenomic clinical decision support design and
multi-site process outcomes analysis in the eMERGE Network. J Am Med
Inform Assoc. 2019;26(5):490.
34. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, Kohane
I. Serving the enterprise and beyond with informatics for integrating biol-
ogy and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30.
35. Joyner MJ, Paneth N, Ioannidis JP. What Happens When Underperforming
Big Ideas in Research Become Entrenched? JAMA. 2016;316(13):1355–6.
36. Denny JC. Mining electronic health records in the genomics era. PLoS
Comput Biol. 2012;8(12):e1002823.
37. Raghunathan TE. What do we do with missing data? Some options for
analysis of incomplete data. Annu Rev Public Health. 2004;25:99–117.
38. Moreno-Conde A, Jodar-Sanchez F, Kalra D. Requirements for clinical
information modelling tools. Int J Med Inform. 2015;84:524–36.
39. Boland MR, Hripcsak G, Shen Y, Chung WK, Weng C. Defining a com-
prehensive verotype using electronic health records for personalized
medicine. J Am Med Inform Assoc. 2013;20(e2):e232-238.
40. Moreno-Conde A, Moner D, Cruz WD, Santos MR, Maldonado JA, Robles
M, Kalra D. Clinical information modeling processes for semantic inter-
operability of electronic health records: systematic review and inductive
analysis. J Am Med Inform Assoc. 2015;22:925–34.
41. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human
disease network. Proc Natl Acad Sci USA. 2007;104(21):8685–90.
42. Japkowicz N, Stephen S. The class imbalance problem: a systematic study.
Intell Data Anal. 2002;6(5):429–49.
43. Wang G, Deng Z, Choi KS. Tackling missing data in community health
studies using additive LS-SVM classifier. IEEE J Biomed Health Inform.
2018;22(2):579–87.
44. Little RJ, Rubin DB. Statistical analysis with missing data, vol. 793. Hobo-
ken: Wiley; 2019.
45. Smola AJ, Scholkopf B. A tutorial on support vector regression, Berlin,
Germany. NeuroCOLT2 Technical Report Series; 1998.
46. Salazar DA, Vélez JI, Salazar JC. Comparison between SVM and logistic
regression: which one is better to discriminate? Rev Colomb Estad.
2012;35(2):223–37.
47. Ibm I. CPLEX optimizer. 2010.
48. Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning
from imbalanced data sets. Comput Intell. 2004;20(1):18–36.
49. Yue Z, Zheng Q, Neylon MT, Yoo M, Shin J, Zhao Z, Tan AC, Chen JY.
PAGER 2.0: an update to the pathway, annotated-list and gene-signature
electronic repository for Human Network Biology. Nucleic Acids Res.
2018;46(D1):D668–76.
50. Yue Z, Kshirsagar MM, Nguyen T, Suphavilai C, Neylon MT, Zhu L, Ratliff
T, Chen JY. PAGER: constructing PAGs and new PAG-PAG relationships for
network biology. Bioinformatics. 2015;31(12):i250-257.
51. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M,
Garapati P, Gopinath G, Jassal B, et al. Reactome: a database of reactions,
pathways and biological processes. Nucleic Acids Res. 2011;39(Database
issue):D691–7.
52. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P,
Haw R, Jassal B, Korninger F, May B, et al. The reactome pathway knowl-
edgebase. Nucleic Acids Res. 2018;46(D1):D649–55.
53. Baxevanis AD. Searching Online Mendelian Inheritance in Man (OMIM) for
information on genetic loci involved in human disease. Curr Protoc Hum
Genet. 2012;Chapter 9:Unit 9.13.1–10.
54. Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging
knowledge across phenotype-gene relationships. Nucleic Acids Res.
2019;47(D1):D1038–43.
55. Culhane AC, Schroder MS, Sultana R, Picard SC, Martinelli EN, Kelly C,
Haibe-Kains B, Kapushesky M, St Pierre AA, Flahive W, et al. GeneSigDB: a
manually curated database and resource for analysis of gene expression
signatures. Nucleic Acids Res. 2012;40(Database issue):D1060–6.
56. da Huang W, Sherman BT, Lempicki RA. Systematic and integrative analy-
sis of large gene lists using DAVID bioinformatics resources. Nat Protoc.
2009;4(1):44–57.
57. da Huang W, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R,
Baseler MW, Lane HC, et al. DAVID Bioinformatics Resources: expanded
annotation database and novel algorithms to better extract biology from
large gene lists. Nucleic Acids Res. 2007;35(Web Server issue):W169–75.
58. Peck R, Olsen C, Devore JL. Introduction to statistics and data analysis.
Boston: Cengage Learning; 2015.
59. Zaki MJ, Meira W Jr. Data mining and analysis: fundamental concepts and
algorithms. 1st ed. Cambridge: Cambridge University Press; 2014.
60. Liaw A, Wiener M. Classification and regression by randomForest. R News.
2002;2(3):18–22.
61. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The
WEKA data mining software: an update. ACM SIGKDD Explor Newslett.
2009;11(1):10–8.
62. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised rep-
resentation to predict the future of patients from the electronic health
records. Sci Rep. 2016;6:26094.
63. Choi JY, Bae S-H, Qiu X, Fox G. High performance dimension reduction
and visualization for large high-dimensional data analysis. In: Proceedings
of the 2010 10th IEEE/ACM international conference on cluster, cloud and
grid computing. IEEE Computer Society. 2010; 331–340.
64. Fox G. Robust scalable visualized clustering in vector and non vector
semi-metric spaces. Parallel Process Lett. 2013;23(02):1340006.
65. Hofmann T, Buhmann JM. Pairwise data clustering by deterministic
annealing. IEEE Trans Pattern Anal Mach Intell. 1997;19(1):1–14.
66. Rousseeuw P. Silhouettes: a graphical aid to the interpretation and valida-
tion of cluster analysis. Comput Appl Math. 1987;20:53–65.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in pub-
lished maps and institutional affiliations.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
... Systematic software platforms to organize large metadata and clinical data [also called "clinotype" (Nguyen et al., 2021)] is essential in biomedical research (Burgun and Bodenreider, 2008;Ohmann and Kuchinke, 2009). These software platforms, such as (Ta et al., 2018;Kim et al., 2019;Hume et al., 2020), have two key objectives. ...
Article
Full-text available
Unsupervised learning techniques, such as clustering and embedding, have been increasingly popular to cluster biomedical samples from high-dimensional biomedical data. Extracting clinical data or sample meta-data shared in common among biomedical samples of a given biological condition remains a major challenge. Here, we describe a powerful analytical method called Statistical Enrichment Analysis of Samples (SEAS) for interpreting clustered or embedded sample data from omics studies. The method derives its power by focusing on sample sets, i.e., groups of biological samples that were constructed for various purposes, e.g., manual curation of samples sharing specific characteristics or automated clusters generated by embedding sample omic profiles from multi-dimensional omics space. The samples in the sample set share common clinical measurements, which we refer to as “clinotypes,” such as age group, gender, treatment status, or survival days. We demonstrate how SEAS yields insights into biological data sets using glioblastoma (GBM) samples. Notably, when analyzing the combined The Cancer Genome Atlas (TCGA)—patient-derived xenograft (PDX) data, SEAS allows approximating the different clinical outcomes of radiotherapy-treated PDX samples, which has not been solved by other tools. The result shows that SEAS may support the clinical decision. The SEAS tool is publicly available as a freely available software package at https://aimed-lab.shinyapps.io/SEAS/ .
Article
Full-text available
For over 50 years Mendelian Inheritance in Man has chronicled the collective knowledge of the field of medical genetics. It initially cataloged the known X-linked, autosomal recessive and autosomal dominant inherited disorders, but grew to be the primary repository of curated information on both genes and genetic phenotypes and the relationships between them. Each phenotype and gene is given a separate entry assigned a stable, unique identifier. The entries contain structured summaries of new and important information based on expert review of the biomedical literature. OMIM.org provides interactive access to the knowledge repository, including genomic coordinate searches of the gene map, views of genetic heterogeneity of phenotypes in Phenotypic Series, and side-by-side comparisons of clinical synopses. OMIM.org also supports computational queries via a robust API. All entries have extensive targeted links to other genomic resources and additional references. Updates to OMIM can be found on the update list or followed through the MIMmatch service. Updated user guides and tutorials are available on the website. As of September 2018, OMIM had over 24,600 entries, and the OMIM Morbid Map Scorecard had 6,259 molecularized phenotypes connected to 3,961 genes.
Article
Full-text available
The Reactome Knowledgebase (https://reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism, and other cellular processes as an ordered network of molecular transformations-an extended version of a classic metabolic map, in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression profiles or somatic mutation catalogues from tumor cells. To support the continued brisk growth in the size and complexity of Reactome, we have implemented a graph database, improved performance of data analysis tools, and designed new data structures and strategies to boost diagram viewer performance. To make our website more accessible to human users, we have improved pathway display and navigation by implementing interactive Enhanced High Level Diagrams (EHLDs) with an associated icon library, and subpathway highlighting and zooming, in a simplified and reorganized web site with adaptive design. To encourage re-use of our content, we have enabled export of pathway diagrams as 'PowerPoint' files.
Article
Full-text available
Integrative Gene-set, Network and Pathway Analysis (GNPA) is a powerful data analysis approach developed to help interpret high-throughput omics data. In PAGER 1.0, we demonstrated that researchers can gain unbiased and reproducible biological insights with the introduction of PAGs (Pathways, Annotated-lists and Gene-signatures) as the basic data representation elements. In PAGER 2.0, we improve the utility of integrative GNPA by significantly expanding the coverage of PAGs and PAG-to-PAG relationships in the database, defining a new metric to quantify PAG data qualities, and developing new software features to simplify online integra-tive GNPA. Specifically, we included 84 282 PAGs spanning 24 different data sources that cover human diseases, published gene-expression signatures , drug-gene, miRNA-gene interactions, pathways and tissue-specific gene expressions. We introduced a new normalized Cohesion Coefficient (nCoCo) score to assess the biological relevance of genes inside a PAG, and RP-score to rank genes and assign gene-specific weights inside a PAG. The companion web interface contains numerous features to help users query and navigate the database content. The database content can be freely downloaded and is compatible with third-party Gene Set Enrichment Analysis tools. We expect PAGER 2.0 to become a major resource in integrative GNPA. PAGER 2.0 is available at http://discovery.informatics.uab.edu/ PAGER/.
Objective: Inflammation is a key mechanism of atherosclerosis. White blood cells (WBCs) play a pivotal role in the inflammatory process. We investigated the relationships between total and differential WBC counts and multi-detector cardiac computed tomography (MDCT) findings, as well as the risk of cardiovascular disease in asymptomatic patients in Korea. Materials and methods: We recruited asymptomatic men (n = 7274) and women (n = 5478) aged ≥30 years who were free of known coronary heart disease. All patients underwent MDCT during a routine health check-up in the Seoul National University Bundang Hospital between 2006 and 2007, and were followed up for 5.6 years. We reviewed medical records for cardiovascular diseases (CVDs) and covariates. Results: In covariate-adjusted logistic regression models for MDCT findings, subjects within the third tertile of all WBC subtypes had a higher risk for significant stenosis and noncalcified plaques compared with the first tertile of each subtype. In Cox proportional hazard regression models for the risk of CVDs, subjects within the third tertiles of lymphocytes and monocytes were at an increased risk of CVDs (total WBC, HR = 1.22 [1.02-1.44]; lymphocyte, HR = 1.47 [1.25-1.74]; monocytes, HR = 1.26 [1.02-1.35]) even after further adjustment for covariates and coronary artery stenosis. Conclusions: Total WBC counts were related to the severity of coronary artery disease, and higher WBC counts increased the risk of CVDs in asymptomatic Koreans mainly by virtue of monocytes.
To better understand the real-world effects of pharmacogenomic (PGx) alerts, this study aimed to characterize alert design within the eMERGE Network, and to establish a method for sharing PGx alert response data for aggregate analysis. Seven eMERGE sites submitted design details and established an alert logging data dictionary. Six sites participated in a pilot study, sharing alert response data from their electronic health record systems. PGx alert design varied, with some consensus around the use of active, post-test alerts to convey Clinical Pharmacogenetics Implementation Consortium recommendations. Sites successfully shared response data, with wide variation in acceptance and follow rates. Results reflect the lack of standardization in PGx alert design. Standards and/or larger studies will be necessary to fully understand PGx impact. This study demonstrated a method for sharing PGx alert response data and established that variation in system design is a significant barrier for multi-site analyses.
The definition of “normal” values for common laboratory tests often governs the diagnosis, treatment, and overall management of tested individuals. Some test results may depend on demographic traits of the tested population including age, race, and sex. Ideally, laboratory test results should be interpreted in reference to a population of “similar” “healthy” individuals. In many settings, however, it is unclear exactly who these individuals are. How much population stratification and what criteria for healthy individuals are optimal? In particular, with the evolution of medicine into fully personalized or “precision” medicine and the availability of large-scale data sets, there may be interest in trying to match each person to an increasingly granular normal reference population. Is this precision feasible to obtain in reliable ways and will it improve practice?
Missing data is a common issue in community health and epidemiological studies. Direct removal of samples with missing data can lead to reduced sample size and information bias, which weakens the significance of the results. While data imputation methods are available to deal with missing data, they are limited in performance and can introduce noise into the dataset. Instead of data imputation, a novel method based on additive least square support vector machine (LS-SVM) is proposed in this paper for predictive modeling when the input features of the model contain missing data. The method also simultaneously determines the influence of the features with missing values on the classification accuracy using the fast leave-one-out cross-validation strategy. The performance of the method is evaluated by applying it to predict the quality of life (QOL) of elderly people using health data collected in the community. The dataset involves demographics, socioeconomic status, health history and the outcomes of health assessments of 444 community-dwelling elderly people, with 5% to 60% of data missing in some of the input features. The QOL is measured using a standard questionnaire of the World Health Organization. Results show that the proposed method outperforms four conventional methods for handling missing data (case deletion, feature deletion, mean imputation and K-nearest neighbor imputation), with the average QOL prediction accuracy reaching 0.7418. It is potentially a promising technique for tackling missing data in community health research and other applications.
For several decades now the biomedical research community has pursued a narrative positing that a combination of ever-deeper knowledge of subcellular biology, especially genetics, coupled with information technology will lead to transformative improvements in health care and human health. In this Viewpoint, we provide evidence for the extraordinary dominance of this narrative in biomedical funding and journal publications; discuss several prominent themes embedded in the narrative to show that this approach has largely failed; and propose a wholesale reevaluation of the way forward in biomedical research.
The Reactome Knowledgebase (www.reactome.org) provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations - an extended version of a classic metabolic map, in a single consistent data model. Reactome functions both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data such as gene expression pattern surveys or somatic mutation catalogues from tumour cells. Over the last two years we redeveloped major components of the Reactome web interface to improve usability, responsiveness and data visualization. A new pathway diagram viewer provides a faster, clearer interface and smooth zooming from the entire reaction network to the details of individual reactions. Tool performance for analysis of user datasets has been substantially improved, now generating detailed results for genome-wide expression datasets within seconds. The analysis module can now be accessed through a RESTFul interface, facilitating its inclusion in third party applications. A new overview module allows the visualization of analysis results on a genome-wide Reactome pathway hierarchy using a single screen page. The search interface now provides auto-completion as well as a faceted search to narrow result lists efficiently.