Considerations for Using Research Data to Verify Clinical Data Accuracy
Daniel Fort, MPH1, Chunhua Weng, PhD1, Suzanne Bakken, RN, PhD1,2,
Adam B. Wilcox, PhD3
1Department of Biomedical Informatics, 2School of Nursing, Columbia University, New
York City; 3Intermountain Healthcare, Salt Lake City, UT
Abstract
Collected to support clinical decisions and processes, clinical data may be subject to validity issues when used for
research. The objective of this study is to examine methods and issues in summarizing and evaluating the accuracy
of clinical data as compared to primary research data. We hypothesized that research survey data on a patient
cohort could serve as a reference standard for uncovering potential biases in clinical data. We compared the
summary statistics between the clinical and research datasets. Seven clinical variables, i.e., height, weight, gender,
ethnicity, systolic and diastolic blood pressure, and diabetes status, were included in the study. Our results show
that the clinical data and research data had similar summary statistical profiles, but there were detectable differences
in definitions and measurements for individual variables such as height, diastolic blood pressure, and diabetes
status. We discuss the implications of these results and identify important considerations for using research data
to verify clinical data accuracy.
Introduction
Computational reuse of clinical data from the electronic health record (EHR) has been frequently recommended for
improving efficiency and reducing cost in comparative effectiveness research[1]. This goal faces significant
barriers because clinical data are collected to aid individual clinicians in diagnosis, treatment, and monitoring of
health-related conditions rather than for research uses[2]. A risk to reuse is potential hidden bias in clinical data.
While specific studies have demonstrated positive value in clinical data research, there are concerns about whether
such data are generally usable. Opaque data capture processes and the idiosyncratic documentation behaviors of
clinicians from multiple disciplines may lead to data biases. Differences between the population who seek medical
care and the general residential population may introduce selection bias when clinical data are used to estimate
population statistics.
Comparison of EHR data with a gold standard is by far the most frequently used method for assessing accuracy[3].
Recent efforts have taken a more implicit approach to validating clinical data in the form of study result replication.
Groups such as HMORN, OMOP, and DARTNet assessed the accuracy of clinical data by comparing research
results derived from clinical data with those derived from randomized controlled trials[4-6]. This approach reflects
a focus on making a new system work rather than a lack of recognition of potential problems.
The Washington Heights/Inwood Informatics Infrastructure for Community-Centered Comparative Effectiveness
Research (WICER) Project (http://www.wicer.org) has been conducting community-based research and collecting
patient self-reported health information. We assume research data are of better quality than clinical data given their
rigorous data collection processes. For patients with information in both the survey and electronic health records, an
analysis of the differences between data collected through survey and data collected in clinical settings may help us
understand the potential biases in clinical data. This study compares WICER Community Survey results to data for
the same variables, from the same people, collected within our EHR; it also attempts to replicate the WICER
research sample using only clinical data. We discuss the implications of these results and three potential categories
of accuracy of clinical data.
Methods
Our conceptual framework for using research data to verify clinical data includes four consecutive steps: (1) cohort
selection; (2) variable selection; (3) data point selection; and (4) measurement selection.
Step 1: Cohort Selection
We selected the patients who had data in both data sources: the WICER community population health survey and
our institutional clinical data warehouse. The WICER Community Survey collected data from residents of
Washington Heights, an area of New York City with a population of approximately 300,000 people, through cluster
and snowball sampling methodologies. Surveys were administered to individuals over the age of 18 who spoke
either English or Spanish. Survey data were collected and processed from March 2012 through September 2013. A
total of 5,269 individuals took the WICER Community Survey in either the Household or Clinic setting.

The Columbia University Medical Center's Clinical Data Warehouse (CDW) integrates patient information collected
from assorted EHR systems for about 4 million patients over more than 20 years. The initial effort to replicate the
WICER research sample restricted the CDW to adult patients who had an address within one of the same five zip
codes and one recorded visit during the WICER data collection time period, resulting in a cohort of 78,418 patients.

The WICER data set includes a higher proportion of women and Hispanic individuals than either the CDW sample
or what was expected based on census data for the same zip codes. New clinical data samples were created to match
the proportions of women and Hispanic ethnicity found in the WICER data set, as well as new samples for both
which match the census distributions for age and gender. A total of 1,279 individuals were identified from the
intersection of the two datasets, allowing comparison of clinical data in the CDW and research data in WICER
without a sampling bias.
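As an illustration of the resampling step described above, the following sketch draws a clinical sample whose joint gender-by-ethnicity proportions match survey-derived targets. It is a minimal sketch, not the study's actual code: the pandas column names (gender, hispanic) and the target proportions are hypothetical.

```python
import pandas as pd

def resample_to_match(clinical: pd.DataFrame, target_props: dict,
                      n: int, seed: int = 0) -> pd.DataFrame:
    """Draw a clinical sample whose gender-by-ethnicity proportions
    match target proportions (e.g., those observed in the survey)."""
    strata = []
    for (gender, hispanic), prop in target_props.items():
        pool = clinical[(clinical["gender"] == gender)
                        & (clinical["hispanic"] == hispanic)]
        k = min(round(n * prop), len(pool))  # cannot draw more than exist
        strata.append(pool.sample(n=k, random_state=seed))
    return pd.concat(strata, ignore_index=True)

# Hypothetical targets loosely reflecting the survey marginals
# (71% female, 96% Hispanic); proportions must sum to 1.
target = {("F", True): 0.68, ("F", False): 0.03,
          ("M", True): 0.28, ("M", False): 0.01}
# matched_style_sample = resample_to_match(cdw_cohort, target, n=10_000)
```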
Step 2: Variable Selection
Because the WICER study included variables related to hypertension, the American Heart Association (AHA) /
American College of Cardiology (ACC) original guidelines for cardiac risk were chosen to guide the variable
selection process[7]. The content overlap between the wide range of information collected for the WICER
Community Survey and that available in the CDW is limited to some basic demographic and baseline health
information. Of the factors in the AHA/ACC Guidelines, Age, Race, Ethnicity, Gender, the components of BMI
(height and weight), Smoking Status, and Blood Pressure (systolic and diastolic) were available as structured data
in both data sources. See Table 1 for concept definitions.
A simple clinical phenotyping method, consistent with the eMERGE diabetes phenotype[8] but excluding
medication orders, was developed for type 2 diabetes in the CDW using ICD-9 codes, HbA1c test values, and
glucose test values. Under the strictest criteria, a patient is identified as having diabetes only if there are at least
two ICD-9 codes for diabetes, at least one HbA1c test value >6.5%, and at least two high glucose test values. A
glucose test value is coded as high if it is >126 mg/dL for a fasting glucose test or >200 mg/dL otherwise. The
effectiveness of labeling with each of these components was also explored.
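The phenotype logic can be expressed compactly in Python, the language used for this study's analyses. The sketch below is illustrative only: the assumption that diabetes ICD-9 codes share the 250 prefix, and the input structures, are ours rather than the study's.

```python
def diabetes_status(icd9_codes, hba1c_values, glucose_tests):
    """Apply the simplified type 2 diabetes phenotype described above.

    icd9_codes:    list of ICD-9 code strings recorded for the patient
    hba1c_values:  list of HbA1c results (%)
    glucose_tests: list of (value_mg_dl, is_fasting) tuples
    """
    # Criterion 1: at least two diabetes ICD-9 codes
    # (assumed here to be the 250.xx family).
    has_codes = sum(code.startswith("250") for code in icd9_codes) >= 2
    # Criterion 2: at least one HbA1c value above 6.5%.
    has_hba1c = any(v > 6.5 for v in hba1c_values)
    # Criterion 3: at least two high glucose values
    # (>126 mg/dL if fasting, >200 mg/dL otherwise).
    high = [v for v, fasting in glucose_tests
            if (fasting and v > 126) or (not fasting and v > 200)]
    has_glucose = len(high) >= 2

    criteria = (has_codes, has_hba1c, has_glucose)
    return {"ALL": all(criteria), "ANY": any(criteria)}
```

Under the strictest (ALL) setting a patient must satisfy every criterion; ANY corresponds to the union of criteria examined later in Table 5.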
Step 3: Data Point Selection
Each clinical variable could have many data points from multiple points of measurement across time, which
necessitated careful data point selection to ensure that summary data points were both representative of all data
points and comparable across data sources without introducing data sampling biases. This includes the issue of
temporal bias, where some data variables, such as weight, might naturally be expected to change over time. To make
a comparable cross-section to the Survey dataset, and to ensure the resulting data reflect not only the same sample
but also the same sample at the same time, we selected from the CDW only data points recorded during the
18-month WICER study period. In this way, assuming the survey participants are measured at random throughout
an 18-month period, so too is the clinical data population.
In the matched sample we had an opportunity to more finely tune the data comparison. The most direct approach is
to simply select the clinical data point closest in time to the survey measurement of any given participant.
Table 1: Concepts and Definitions for sample summary [table body not preserved in this extraction]
Alternatives include the closest prior or subsequent data point, as well as using a single randomly selected point
rather than the average of all clinical data points. While alternate data point selection options were explored, to best
keep the results comparable the reported values for the matched sample were derived in the same fashion as for the
sample at large.
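The data point selection options can be sketched as follows, assuming a hypothetical per-patient table with date and value columns; this is an illustration of the strategies described, not the study's code.

```python
import pandas as pd

def select_point(measurements: pd.DataFrame, survey_date: pd.Timestamp,
                 how: str = "closest"):
    """Pick one clinical value relative to a patient's survey date.

    measurements: DataFrame with 'date' and 'value' columns for a single
    patient and variable (hypothetical schema). Returns None if no
    qualifying point exists.
    """
    m = measurements.sort_values("date")
    if m.empty:
        return None
    if how == "prior":
        m = m[m["date"] <= survey_date]
        return m["value"].iloc[-1] if not m.empty else None
    if how == "subsequent":
        m = m[m["date"] >= survey_date]
        return m["value"].iloc[0] if not m.empty else None
    if how == "random":
        return m["value"].sample(n=1).iloc[0]
    # Default: the point closest in time, in either direction.
    idx = (m["date"] - survey_date).abs().idxmin()
    return m.loc[idx, "value"]
```

Averaging all of a patient's points within the study window, as was done for the sample at large, is the remaining option and requires no selection logic.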
Step 4: Data Measure Selection for Comparing the Two Data Sets
With a representative patient sample, meaningful variables, and representative data points, the next important step
in designing an unbiased verification study was to select a meaningful data measure; this is perhaps the most
subjective step, as there is no standard guidance. For this step, we considered two measures: (a) population-level
average summary statistics; and (b) patient-level average summary statistics.
Option (a): Population-Level Average Summary Statistics
Multiple data values available during the study period were averaged in order to minimize any temporal effects
while also allowing the use of the largest number of patients. Continuous variables within each set were averaged,
with one exception, and compared via t-test. The median BMI value was used for comparison, as the mean summary
value for the calculation of BMI is more susceptible to outliers. The choice of other "best matching" clinical data
values, such as the closest prior and subsequent values in time as well as simple random choice, was also explored.
Proportions of interest for the categorical variables, which include % female, % smoking, and % Hispanic, were
reported and compared with a chi-square test. For some proportions there is a possibility that negative or healthy
status might not be recorded and would therefore be represented by missing data. Therefore, for smoking and
diabetes a second value is reported: the proportion among labeled status, which excludes any patient with missing
data rather than assuming missing data denote known negative status.
For the purpose of the primary analysis, only the strictest (ALL) criteria for diabetes diagnosis are reported,
consistent with the eMERGE criteria. However, each component of the diabetes diagnosis was examined for
sensitivity, specificity, and positive predictive value against the patient's self-reported diabetes status. All summary
and statistical comparisons were performed in Python, using the SciPy scientific computing package for statistical
comparisons.
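As a minimal illustration of these comparisons, the sketch below applies SciPy's two-sample t-test to continuous values and a chi-square test to a contingency table; all numbers are made-up placeholders, not study data.

```python
from scipy import stats

# Continuous variable: per-patient average heights (cm) from each
# source; real inputs would be the averaged values described above.
clinical_heights = [158.1, 160.4, 162.0, 157.5, 161.2]
survey_heights = [161.0, 162.3, 160.8, 163.1, 161.9]
t_stat, p_cont = stats.ttest_ind(clinical_heights, survey_heights)

# Categorical variable: chi-square test on a 2x2 contingency table
# (rows = data source, columns = Hispanic vs. not; counts invented).
table = [[640, 640],    # clinical: Hispanic, non-Hispanic
         [1200, 79]]    # survey:   Hispanic, non-Hispanic
chi2, p_cat, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={p_cont:.3f}, chi-square p={p_cat:.3g}")
```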
Option (b): Patient-Level Average Summary Statistics
When there is sufficient clinical data, it is possible to create a distribution of expected values for a given patient and
compare the survey value to that distribution. At its simplest, the comparison is simply whether the survey value is
within one standard deviation of the mean of the available clinical values. This process was performed for patients
with at least five data points for the same variable recorded during the study period.
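A minimal sketch of this patient-level check, using only Python's standard library; the five-point minimum mirrors the threshold stated above, while the example values are invented.

```python
import statistics

def survey_within_one_sd(clinical_values, survey_value, min_points=5):
    """Return True if the survey value falls within one standard
    deviation of the mean of the patient's clinical values.

    Requires at least `min_points` clinical data points, matching the
    threshold used in the study; returns None otherwise.
    """
    if len(clinical_values) < min_points:
        return None  # not enough data to build a distribution
    mean = statistics.mean(clinical_values)
    sd = statistics.stdev(clinical_values)
    return abs(survey_value - mean) <= sd

# Example: systolic BPs recorded during the study period vs. the
# survey value (numbers are illustrative).
print(survey_within_one_sd([128, 131, 126, 135, 129], 141))  # False
```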
Results
Summary values for the WICER Survey population, the raw clinical sample, the resampled clinical data targeted to
match the survey proportions of women and Hispanic participants, and the census distribution weighted samples are
presented in Table 2. Analysis was performed across all samples with no significant variation in results. The
original, total clinical dataset was chosen for representative purposes because it is the only clinical sample to
contain all members of the matched set.

Table 2: Summary Results from alternate sampling methods [table body not preserved in this extraction]
Following the population summary approach, values and statistics for each data point are presented in Table 3. The
Survey dataset tends to be slightly older and to contain more women. Survey participants almost entirely identified
as Hispanic. Sixteen percent of the survey participants self-identified as having diabetes. Measuring the Matched
dataset via clinical data and via the primary survey collection processes broadly records the same values. There are
statistically significant measurement discrepancies in Hispanic ethnicity labeling, height measurement, diastolic
blood pressure, and diabetes status determination. Where the Clinical and Survey datasets differ (in age, proportion
of women, and prevalence of smoking), the differences are evidence of statistically significant differences in sample
composition.

In exploring patient-level summary statistics, the number of patients with sufficient data to construct a distribution
of expected blood pressures was 866. Of these, 491 (57%) and 479 (55%) had a survey systolic or diastolic blood
pressure, respectively, greater than one standard deviation away from their clinical mean. Table 4 shows an example
result of alternate data point selections for systolic BP. While values are statistically significantly different from one
another in this and other examples, they would not change the conclusions drawn from Table 3.
The sensitivity, specificity, and positive predictive value of various strategies to identify diabetes status using
clinical data are presented in Table 5. In this simple phenotype, ALL is the intersection of the three criteria and
ANY is the union. The three criteria are having at least two ICD-9 codes for diabetes, one high HbA1c value, and at
least two high glucose values. The rationale for requiring two values in some categories is to limit potentially
spurious results. In the case of diagnostic codes, for example, a diabetes ICD-9 code might be recorded for a
negative diabetes evaluation.
Table 3: Clinical, Survey, and Matched Set data comparison. Bonferroni-corrected p-value = 1e-4. The middle two
columns form the matched set (discrepancy in measurement); the outer columns are the full samples (discrepancy
in selection). The p-value compares Matched Clinical vs. Matched Survey.

Variable | Clinical | Matched Clinical | Matched Survey | Survey | p-value (Matched vs. Matched)
N | 78,418 | 1,279 | 1,279 | 5,269 | -
Age | 47.55 | 52.33 | 51.12 | 50.12 | 0.072
Proportion Female | 0.62 | 0.79 | 0.78 | 0.71 | 0.963
Proportion Hispanic | 0.50 | 0.56 | 0.94 | 0.96 | << .0001
Weight (kg) | 75.69 | 77.16 | 76.99 | 75.42 | 0.851
Height (cm) | 160.34 | 158.23 | 161.31 | 161.25 | << .0001
BMI | 28.10 | 29.70 | 28.90 | 28.20 | 0.207
Prevalence of Smoking | 0.09 | 0.08 | 0.08 | 0.06 | 0.944
Prevalence of Smoking, labeled status only | 0.12 | 0.09 | 0.08 | 0.06 | 0.283
Systolic BP | 127.23 | 128.48 | 127.50 | 127.68 | 0.204
Diastolic BP | 73.07 | 74.34 | 79.24 | 80.95 | << .0001
Prevalence of Diabetes, Strict Criteria | 0.04 | 0.09 | 0.22 | 0.16 | << .0001
Prevalence of Diabetes, labeled status only, Strict Criteria | 0.28 | 0.35 | 0.23 | 0.16 | 0.001

Table 4: Systolic blood pressure summary values and patient cohort size for various data point selection
methodologies.

Systolic BP | Survey | Closest Prior | Closest Subsequent | Random Point | Mean
N | 1290 | 1107 | 962 | 1185 | 1185
Mean | 127.8 | 127.9 | 130.3 | 129.3 | 128.5

Table 5: Sensitivity, Specificity, F-measure, and Positive Predictive Value of components of a diabetes diagnosis
[table body not preserved in this extraction]
The removal of these restrictions was also considered. The ALL criteria have the highest positive predictive value,
but the lowest sensitivity. Both the ICD-9 and HbA1c-based criteria have high specificities, and the ICD-9-based
criteria alone have the highest F-measure for sensitivity and specificity. Proportions of patients retrieved under each
qualifying criterion are consistent with published results[9].
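For reference, the measures reported in Table 5 can be computed from a confusion matrix against self-reported status as sketched below, assuming the conventional F-measure (the harmonic mean of sensitivity and positive predictive value); the counts are illustrative, since the table body is not preserved here.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Evaluation measures for a phenotype criterion, computed against
    self-reported diabetes status as the reference."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f_measure": f_measure}

# Illustrative counts only; Table 5's underlying counts are not
# reproduced in this extraction.
print(diagnostic_metrics(tp=90, fp=25, fn=115, tn=1049))
```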
Discussion
Our study shows discrepancies between clinical and research data, both in sampling and in measurement. Clinical
measurement of some data, such as gender and BMI, accurately reproduces the research measurement, while for
others, such as diabetes, it does not. While the raw results may be interesting, because of the limited overlap of data
between the sets and the comparisons that could be made, they may have little value outside of this case study. If
these discrepancies can be considered representative of classes of clinical data, we can abstract some idea of the
generalizable accuracy of clinical data as compared to primary research data. We introduce three categories of
accuracy.
The first category is "completely accurate" information, such as sex, birthdate, and therefore age. These data might
be considered Personally Identifiable Information (PII), or information that on its own could be used to identify an
individual. This classification suggests that address, social security number, and phone number would also be
accurate between datasets. While there will be instances of coding error, misreporting, or other errors, by and large
these data are consistent across datasets. It should be noted that birthdate was one of the criteria by which
individuals were identified for the Matched set, and therefore errors in the recording of birthdate would be excluded
from this analysis. Also, while PII should be accurate across datasets, this does not suggest that all demographic
information, such as ethnicity, will be accurate.
The second category is 'simple measurement' information, which is the result of a clear concept or measurement
process. Height, weight, systolic and diastolic blood pressure, smoking status, and ethnicity are included in this
category. Here, the simplicity of the measurement or concept leads to agreement in the value between sources, and
differences in the value are the result of a difference in either the concept definition or the measurement process. For
example, measured heights in the Matched group differ by approximately 2.5cm or 1in, suggesting that the concept
and measurement of height in the Survey sample includes shoes. Likewise, diastolic blood pressure is consistently
measured 5 points higher in the Survey sample, suggesting a difference in measurement. Ethnicity, which is self-
reported in the survey, is labeled by hospital staff during admission to the hospital, resulting in approximately one
third of Hispanic individuals being labeled as 'Unknown' ethnicity in the Clinical sample.
The final category of accuracy is 'inferred' information, where a complex concept, such as diabetes, is inferred from
multiple variables. When compared with self-reported Survey values, no single prediction or combination of
variables can be considered accurate for an entire cohort. However, some results may be useful enough for a specific
purpose. For example, requiring ALL criteria has a high positive predictive value and may provide a high level of
accuracy within a given cohort. Conversely, using just HbA1c measurements has a high sensitivity and may be most
valuable when a larger quantity of data is required for statistical power.
At least in this case study, discrepancies in the 'simple measurement' category are stable across multiple sampling
methodologies. Discrepancies are also stable when samples are broken down into categories such as age by decade,
obesity classification, and hypertension risk category. This stability is what would be expected if the discrepancies
were the result of simple measurement error, and it suggests these discrepancies represent systematic bias in the
clinical data. It is possible that the reported discrepancies are the result of data retrieval and processing. However,
the presence of pairs of measurements such as weight/height and systolic/diastolic blood pressure, retrieved and
processed in an identical manner, where one is accurate and one is not, suggests the discrepancies are truly present
in the data source. Due to the limitations of this case study, it is unclear how generalizable this finding may be.
The choice of exact data points may also influence study results, so care must be taken in accurately summarizing
patient data. In this study, the biggest apparent difference was between the closest prior and closest subsequent data
points. The reason may be that the closest prior data point represents the end of a series of blood pressures which
began with a hospitalization and is, therefore, the nearest to "normal". The closest subsequent data point, however,
would represent the initial data collection of a hospitalization and would likely reflect a health crisis. Furthermore,
restricting allowable data points in time reduces the number of patients who qualify for comparison. Using the
average value for each patient smooths out these temporal effects and allows the use of the maximum number of
patients for comparison.
Recommendations
When a research cohort is defined as having clinical data, that clinical data may be a usable substitute for primarily
acquired research data, depending on the needs of the research. PII should have a high degree of accuracy, and
aspects of the patient record that are conceptually simple or have a clear measurement process may be accurate or
include relatively small discrepancies. More complex concepts, such as a diabetes phenotype, are not accurate for
summary purposes, but components may be useful depending on the exact nature of the requirement. However, to
avoid discrepancies due to clinical sampling, the research cohort must be defined as already having clinical data.
Results of this study demonstrate a significant difference in sampling processes between clinical data and research
survey cohorts. Clinical data used as a convenience sample to substitute for primary research data will not
accurately describe the target population. Discrepancies in the simple measurement category may be due to
differences in either the concept definition or the measurement process. If a dictionary of concept definitions or
measurement procedures were provided, either as a standalone document or as metadata tied to each value (such as
whether a measurement of height requires shoes to be taken off), then the comparability of specific variables might
be predictable. Additionally, while aspects of clinical data collection are not in the researcher's control, the exact
choice of data value for research may be. Different choices, such as the average per patient or the value nearest in
time, can result in statistically significant differences in values.
Limitations
This study is limited in scope and setting. First, the overlap between the population survey and clinical data was
limited to a small set of variables. Second, the population survey targeted a largely Hispanic, urban population and
the institution is a large academic medical center. These findings may not be generalizable to other institutions and
populations. This work should be replicated in other settings.
Conclusions
We compared research population survey results to clinical data for the same target population to verify the accuracy
of clinical data elements. Clinical data elements may be classified into three categories of accuracy: completely
accurate, simple measurement, and inferred information, depending in part on the complexity of the concept being
measured and the process of that measurement. Additionally, we report recommendations and considerations for
using clinical data for cohort selection and research.
Acknowledgments
The authors are supported by grant R01HS019853 (PI: Bakken) from the Agency for Healthcare Research and
Quality, grants 5T15LM007079 (PI: Hripcsak) and R01LM009886 (PI: Weng) from the National Library of
Medicine, and grant UL1 TR000040 (PI: Ginsberg) from the National Center for Advancing Translational Sciences.
References
1. Academy Health, A First Look at the Volume and Cost of Comparative Effectiveness Research in the United
States. 2009.
2. Hersh, W.R., M.G. Weiner, P.J. Embi, J.R. Logan, P.R. Payne, E.V. Bernstam, H.P. Lehmann, G.
Hripcsak, T.H. Hartzog, J.J. Cimino, and J.H. Saltz, Caveats for the use of operational electronic health
record data in comparative effectiveness research. Med Care, 2013. 51(8 Suppl 3): p. S30-7.
3. Weiskopf, N.G. and C. Weng, Methods and dimensions of electronic health record data quality
assessment: enabling reuse for clinical research. J Am Med Inform Assoc, 2013. 20(1): p. 144-51.
4. OMOP Design and Validation. 2013; Available from: http://omop.fnih.org.
5. Tannen, R.L., M.G. Weiner, and D. Xie, Use of primary care electronic medical record database in drug
efficacy research on cardiovascular outcomes: comparison of database and randomised controlled trial
findings. BMJ, 2009. 338: p. b81.
6. Libby, A.M., W. Pace, C. Bryan, H.O. Anderson, S.L. Ellis, R.R. Allen, E. Brandt, A.G. Huebschmann, D.
West, and R.J. Valuck, Comparative effectiveness research in DARTNet primary care practices: point of
care data collection on hypoglycemia and over-the-counter and herbal use among patients diagnosed with
diabetes. Med Care, 2010. 48(6 Suppl): p. S39-44.
7. Grundy, S.M., R. Pasternak, P. Greenland, S. Smith, Jr., and V. Fuster, AHA/ACC scientific statement:
Assessment of cardiovascular risk by use of multiple-risk-factor assessment equations: a statement for
healthcare professionals from the American Heart Association and the American College of Cardiology. J
Am Coll Cardiol, 1999. 34(4): p. 1348-59.
8. Pacheco, J. and W. Thompson, Type 2 Diabetes Mellitus. 2012; Available from:
http://phenotype.mc.vanderbilt.edu/phenotype/type-2-diabetes-mellitus.
9. Richesson, R.L., S.A. Rusincovitch, D. Wixted, B.C. Batch, M.N. Feinglos, M.L. Miranda, W.E.
Hammond, R.M. Califf, and S.E. Spratt, A comparison of phenotype definitions for diabetes mellitus. J Am
Med Inform Assoc, 2013. 20(e2): p. e319-26.