Clinical Predictive Modeling Development and Deployment through FHIR
Web Services
Mohammed Khalilia, Myung Choi, Amelia Henderson, Sneha Iyengar, Mark Braunstein,
Jimeng Sun
Georgia Institute of Technology, Atlanta, Georgia
Clinical predictive modeling involves two challenging tasks: model development and model deployment. In this
paper we demonstrate a software architecture for developing and deploying clinical predictive models using web
services via the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. The services
enable model development using electronic health records (EHRs) stored in OMOP CDM databases and model
deployment for scoring individual patients through FHIR resources. The MIMIC2 ICU dataset and a synthetic
outpatient dataset were transformed into OMOP CDM databases for predictive model development. The resulting
predictive models are deployed as FHIR resources, which receive requests of patient information, perform
prediction against the deployed predictive model and respond with prediction scores. To assess the practicality of
this approach we evaluated the response and prediction time of the FHIR modeling web services. We found the
system to be reasonably fast with one second total response time per patient prediction.
Clinical predictive modeling research has grown with the increasing adoption of electronic health records1-3. Nevertheless, the dissemination and translation of predictive modeling research findings into healthcare delivery is often challenging. Reasons for this include political, social, economic and organizational factors4-6. Other barriers
include the lack of computer programming skills by the target end users (i.e. physicians) and difficulty of integration
with the highly fragmented existing health informatics infrastructure7. Additionally, in many cases the evaluation of
the feasibility of predictive modeling marks the end of the project with no attempt to deploy those models into real
practice8. To achieve real impact, researchers should be concerned about the deployment and dissemination of their
algorithms and tools into day-to-day decision support and some researchers have developed approaches to doing
this. For example, Soto et al. developed EPOCH and ePRISM7, a unified web portal and associated services for
deploying clinical risk models and decision support tools. ePRISM is a general regression model framework for
prediction and encompasses various prognostic models9. However, ePRISM does not provide an interface allowing
for integration with existing EHR data. It requires users to input model parameters, which can be time consuming
and a particular challenge for researchers unfamiliar with the nuances of clinical terminology and the underlying
algorithms. A suite of decision support web services for chronic obstructive pulmonary disease detection and
diagnosis was developed by Velickovski et al.10, where the integration into providers’ workflow is supported
through the use of a service-oriented architecture. However, despite these few efforts and many calls for researchers
to be more involved in the practical dissemination of their systems, little has been done and much less has been
accomplished to utilize predictive modeling algorithms at the point-of-care.
A key missing element that slows bringing research into practice is the lack of simple, yet powerful standards
that could facilitate integration with the existing healthcare infrastructure. Currently, one major impediment to the
use of existing standards is their complexity11. The emerging Health Level 7 (HL7) Fast Healthcare Interoperability
Resources (FHIR) standard provides a simplified data model represented as some 100-150 JSON or XML objects
(the FHIR Resources). Each resource consists of a number of logically related data elements that will be 80%
defined through the HL7 specification and 20% through customized extensions12. Additionally, FHIR supports other
web standards such as XML, HTTP and OAuth. Furthermore, since FHIR supports a RESTful architecture for
information and message exchange it becomes suitable for use in a variety of settings such as mobile applications
and cloud computing. Recently, all four major health enterprise software vendors (Cerner, Epic, McKesson and
MEDITECH) along with major providers including Intermountain Healthcare, Mayo Clinic and Partners Healthcare
have joined the Argonaut Project to further extend FHIR to encompass clinical documents constructed from FHIR
resources13. Epic, the largest enterprise healthcare software vendor, has a publicly available FHIR server for testing
that supports a subset of FHIR resource types including Patient, Adverse Reaction, Medication Prescription,
Condition and Observation14 and support of these resources is reportedly included in their June 30, 2015 release of
Version 15 of their software. SMART on FHIR has been developed by Harvard Boston Children’s Hospital as a
universal app platform to seamlessly integrate medical applications into diverse EHR systems at the point-of-care.
Cerner and four other EHR systems demonstrated their ability to run the same third party developed FHIR app at
HIMSS 2014. Of particular importance to our work is the demonstrated ability to provide SMART on FHIR app
services within the context and workflow of Cerner’s PowerChart EHR15.
As a result of these efforts, FHIR can both facilitate integration with existing EHRs and form a common
communication protocol using RESTful web services between healthcare organizations. This provides a clear path
for the widespread dissemination and deployment of research findings such as predictive modeling in clinical
practice. However, despite its popularity, FHIR currently is not suitable to directly support predictive model
development where a large volume of EHR data needs to be processed in order to train an accurate model. To
streamline predictive model development it is important to adopt a common data model (CDM) for storing EHR
data. The Observational Medical Outcomes Partnership (OMOP) was developed to transform data in disparate
databases into a common format and to map EHR data into a standardized vocabulary16. The OMOP CDM has been
used in various settings including drug safety and surveillance17, stroke prediction18, and prediction of adverse drug
events19. We utilize a database in the OMOP CDM to support predictive model development. The resulting
predictive model is then deployed as FHIR resources for scoring individual patients.
Based on these considerations, in this paper we propose to develop and deploy:
- A predictive modeling development platform using the OMOP CDM for data storage and standardization
- A suite of predictive modeling algorithms operating against data stored in the OMOP CDM
- FHIR web services that use the resulting trained predictive models to perform prediction on new patients
- A pilot test using MIMIC2 ICU and ExactData chronic disease outpatient datasets for mortality prediction.
Overview and system architecture
Figure 1 shows how our architecture supports providing predictive modeling services to clinicians via their EHR. On
the server side, the model development platform trains and compares multiple predictive models using EHR data
stored in the OMOP CDM. After training, the best predictive models are deployed to a dedicated FHIR server as
executable predictive models. On the client side, users can use existing systems such as desktop or mobile
applications within their current workflows to query the predictive model specified in the FHIR resource. Such
integrations are done by using FHIR web services. Client applications use FHIR resource to package patient health
information and transport it using the FHIR RESTful Application Programming Interface (API). Once the FHIR
server receives the information, it passes it on to the deployed predictive model for the risk assessment. The returned
result from the predictive model will be sent to the client and also stored into a resource database that can be
accessed by the client to read or search the Risk Assessment resources for later use.
Figure 1. System Architecture
The common data model
Developing a reliable and reusable predictive model requires a common data model into which diverse EHR data
sets are transformed and stored. For the proposed system we used the OMOP CDM designed to facilitate research
using some important design principles. First, data in the OMOP CDM is organized in a way that is optimal for data
analysis and predictive modeling, rather than for the operational needs of healthcare providers and other
administrative and financial personnel. Second, OMOP provides a data standard using existing vocabularies such as
the Systematized Nomenclature of Medicine (SNOMED), RxNORM, the National Drug Code (NDC) and the
Logical Observation Identifiers Names and Codes (LOINC). As a result predictive models built using data in an
OMOP CDM identify standardized features, assuming the mapping to OMOP is reliable. The OMOP CDM is also
technology neutral and can be implemented in any relational database such as Oracle, PostgreSQL, MySQL or MS
SQL Server. Third, our system can directly benefit the existing OMOP CDM community to foster collaborations.
The OMOP CDM consists of 37 data tables divided into standardized clinical data, vocabulary, health system data
and health economics. We only focused on a few of the CDM clinical tables including: person, condition
occurrence, observations and drug exposure. As we enhance our Extract, Transform, Load (ETL) process and the
predictive model, we can incorporate additional data sources as needed. Figure 2 shows a high level overview of the
ETL process in which multiple raw EHR data are mapped to their corresponding CDM instances. In the
transformation process, EHR source values such as lab names and results, diagnoses codes and medication names
are mapped to OMOP concept identifiers. The standardized data can then be accessed to train the predictive models.
Figure 2. EHR data to OMOP CDM
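The core of the transformation step described above is mapping EHR source values to OMOP concept identifiers. The following minimal Python sketch illustrates that lookup; the vocabulary is an in-memory dict here, whereas a real OMOP deployment would query the CONCEPT and SOURCE_TO_CONCEPT_MAP tables, and all codes and concept identifiers below are illustrative, not authoritative mappings.

```python
# Minimal sketch of the source-code-to-concept mapping step in the ETL.
# A real implementation would query the OMOP vocabulary tables; here the
# lookup is an in-memory dict. All codes/IDs below are illustrative only.

def map_to_concept_id(source_code, source_vocabulary, vocab_lookup):
    """Return the OMOP concept_id for a source code, or 0 if unmapped.

    OMOP conventionally uses concept_id 0 for unmapped source values.
    """
    return vocab_lookup.get((source_vocabulary, source_code), 0)

# Hypothetical lookup: (vocabulary, source code) -> OMOP concept_id
vocab_lookup = {
    ("ICD9CM", "428.0"): 319835,      # heart failure (illustrative ID)
    ("NDC", "00071015523"): 1545958,  # a drug product (illustrative ID)
}

print(map_to_concept_id("428.0", "ICD9CM", vocab_lookup))  # mapped code
print(map_to_concept_id("999.9", "ICD9CM", vocab_lookup))  # unmapped -> 0
```

Unmapped source values defaulting to concept 0 keeps the ETL robust to codes missing from the vocabulary, at the cost of losing those values as features.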
Predictive model development
The CDM provides the foundation for predictive modeling. As various datasets are transformed and loaded into the OMOP CDM, predictive model training is simplified because all OMOP CDM instances share the same structure. For instance, one can equally easily train a model for predicting mortality, future diseases or readmission
using different datasets. Figure 3 provides an overview of predictive model training.
Figure 3. Training predictive models based on OMOP CDM
The predictive models are trained offline and can be re-trained as additional records are added to the database.
Training consists of three modules: 1) Cohort Construction: this is the first step in the training phase. At this stage
the user specifies the OMOP CDM instance, prediction target (i.e. mortality) and the cohort definition. Based on the
specified configuration the module will generate the patient cohort. 2) Feature Construction: at this stage the user
specifies which data sources (e.g. drugs, condition occurrence and observations) to include when constructing the
features. Additional configurations can also be provided for each data source. The user can specify the observation
window (e.g. the prior year, to utilize only patient data recorded in the past 365 days). Other data source
configurations include the condition type concept identifier to specify which types of conditions to include (i.e.
primary, secondary), observation type concept identifier and drug type concept identifier. The final configuration is
the feature value aggregation function. For example, lab result values can be computed using one of five aggregation
functions: sum, max, mean, min, and most recent value. 3) Model Training: This module takes the feature vectors
constructed for the cohort and trains multiple models using algorithms such as Random Forest, K-Nearest Neighbor
(KNN) and Support Vector Machine (SVM). The parameters for each of these three algorithms are tuned using cross
validation in order to select the best performing model. The best model will be deployed as a FHIR resource. In the
next section, we will describe how to deploy such a predictive model for scoring future patients in real time.
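The Model Training module described above can be sketched with scikit-learn, the package used in this work: tune each candidate algorithm by cross-validation and keep the one with the best AUC. The feature matrix and labels are synthetic stand-ins for the cohort's feature vectors, and the parameter grids are illustrative, not the ones actually used in the paper.

```python
# Sketch of the model-training module: cross-validate each algorithm's
# parameter grid and keep the model with the highest AUC. X and y stand in
# for the cohort feature vectors; grids are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 10)                                   # synthetic features
y = (X[:, 0] + 0.1 * rng.rand(200) > 0.5).astype(int)   # synthetic labels

candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]}),
    "svm": (SVC(probability=True, random_state=0), {"C": [0.1, 1.0]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5]}),
}

best_name, best_auc, best_model = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    if search.best_score_ > best_auc:
        best_name = name
        best_auc = search.best_score_
        best_model = search.best_estimator_

print(best_name, round(best_auc, 3))
```

In the deployed system the selected `best_model` for each algorithm would be serialized and registered with the FHIR server so clients can pick it by name.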
FHIR web services for model deployment
Our approach is focused on API based predictive modeling services that can be easily implemented in thin client
applications, especially in the mobile environment. FHIR defines resources represented as JSON or XML objects
that can contain health concepts along with reference and searchable parameters. FHIR further defines RESTful API
URL patterns for create, read, update, and delete (CRUD) operations. In this paper, we propose to use the
RiskAssessment resource defined in the FHIR Draft Standard for Trial Use 2 (DSTU2) for our predictive analysis.
Readers should note that this particular FHIR resource is still in the draft stage. The current approved DSTU1
version of FHIR does not have a RiskAssessment resource. However, our development version of FHIR is currently
being balloted on by HL7 members and should be approved soon20. Detailed information about the FHIR
development version for RiskAssessment can be obtained from Ref. 20.
A prediction request to the FHIR server starts with the CREATE operation, which requests scoring of specific
patients in real time using the deployed predictive model. This creates RiskAssessment resources at
the server. Client applications then receive a status response with a resource identifier that refers to the newly
created resource. Clients can use this resource identifier to read or search the resource database via a FHIR RESTful
API. For this paper, we used a SEARCH operation as we need to retrieve more than one result. FHIR specifies a
bundle format for returning groups of resources in a single response. As all performed analyses are stored in the
resource database, additional query types can be implemented in the future. The operation process is depicted in
Figure 4.
Figure 4. Operation Process
During the process, clients and server need to put appropriate information into elements available in the
RiskAssessment resource. However, most of the data elements in the RiskAssessment resource are optional for
predictive modeling which gives us flexibility in choosing the model output. For the CREATE operation, the
RiskAssessment resource is constructed with subject, basis, and method elements as shown in Figure 5(a).
{ "resourceType" : "RiskAssessment",
  "basis" : [
    { "reference" : "Patient/patient 1 identifier" },
    { "reference" : "Patient/patient 2 identifier" }
  ],
  "method" : [ { "text" : "predictive model identifier" } ],
  "subject" : [ { "reference" : "Group/group identifier" } ]
}

{ "resource" : "RiskAssessment",
  "id" : "patient 1 identifier",
  "prediction" : { "probabilityDecimal" : patient 1 score },
  "method" : "predictive model identifier",
  "subject" : { "reference" : "Group/group identifier" }
},
{ "resource" : "RiskAssessment",
  "id" : "patient 2 identifier",
  "prediction" : { "probabilityDecimal" : patient 2 score },
  "method" : "predictive model identifier",
  "subject" : { "reference" : "Group/group identifier" }
}
(a) Request
(b) Response
Figure 5. JSON RiskAssessment FHIR request and the corresponding response
Subject is used to define a group identifier, which we also refer to as the resource identifier. Basis can contain
information used in assessment, which are the patient identifiers. All patients included in basis are bound to the
group identifier specified in the subject element. Therefore, the results of assessment for patients will be grouped in
the resource database by the group identifier at the FHIR server. Subject or group identifier provides a simple
mechanism for users to specify a group of patients, whose prediction scores can be retrieved from the resource database.
Predictive model scoring creates feature vectors (one for each patient), which are derived from the patient’s health
information. The FHIR Patient resource can be referenced by either the patient identifier or by a collection of
resources such as MedicationPrescriptions, Conditions, Observations, etc. In our initial implementation we used
patient identifier. Using the patient identifier, the predictive model constructs feature vectors by pulling the
appropriate information from various resources that contain each patient’s clinical data occurring within the
specified observation window. In our implementation the clinical data is stored in the OMOP database. This all
happens during the predictive analysis period in Figure 4. In the future, we will construct features directly from EHR
databases through querying other FHIR resources in real-time.
Results from the predictive analysis are sent back to the FHIR server in JSON format and the FHIR server stores the
information in a resource database. If the CREATE operation is successful, the server will send a 201-status created
message. In case of any errors, an appropriate error status message will be sent back to the client with the
OperationOutcome resource, if required12.
Clients can query the prediction results using the resource identifier returned in the CREATE response (i.e. group
identifier under subject in this case). This resource identifier can be used to query the stored prediction results using
the SEARCH operation. Recall that the resource identifier is bounded to a list of patient identifiers. Once the
SEARCH operation request is received with the resource identifier, the FHIR server constructs a response FHIR
resource for each patient. In the RiskAssessment resource, the prediction element contains the risk score for a
patient, as shown in Figure 5(b). When the SEARCH operation is completed, the RiskAssessment resources are
packaged into a bundle and sent to the client. For example, the resource contains an identifier that represents the
patient for whom the predictive analysis is performed. In our experiments, mortality prediction was performed and
risk scores are returned. The mortality scores are populated in the probabilityDecimal sub element of the prediction
element, as shown in Figure 5(b).
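The request and response shapes in Figure 5 can be handled with two small Python helpers: one builds the CREATE request body, the other pulls per-patient scores out of the SEARCH results. The field layout follows the draft (DSTU2-era) RiskAssessment structure used in this paper and may differ in later FHIR releases; the model identifier and group identifier values are hypothetical.

```python
# Helpers for the RiskAssessment exchange in Figure 5: build the CREATE
# request body and extract per-patient scores from the SEARCH results.
# Follows the draft RiskAssessment layout used in this paper; identifiers
# ("mimic2-knn", "g42", patient ids) are hypothetical.
import json

def build_create_request(patient_ids, model_id, group_id):
    """Assemble the CREATE body: patients in basis, model in method."""
    return {
        "resourceType": "RiskAssessment",
        "basis": [{"reference": "Patient/%s" % pid} for pid in patient_ids],
        "method": [{"text": model_id}],
        "subject": [{"reference": "Group/%s" % group_id}],
    }

def extract_scores(resources):
    """Map patient id -> probabilityDecimal from RiskAssessment resources."""
    return {r["id"]: r["prediction"]["probabilityDecimal"] for r in resources}

req = build_create_request(["p1", "p2"], "mimic2-knn", "g42")
print(json.dumps(req, indent=2))

resources = [
    {"resource": "RiskAssessment", "id": "p1",
     "prediction": {"probabilityDecimal": 0.71},
     "method": "mimic2-knn", "subject": {"reference": "Group/g42"}},
]
print(extract_scores(resources))  # {'p1': 0.71}
```

A client would POST the serialized request to the FHIR server's RiskAssessment endpoint and later pass the returned group identifier to the SEARCH operation to obtain the bundle parsed here.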
We tested our implementation using two datasets: Multiparameter Intelligent Monitoring in Intensive Care
(MIMIC2) and a dataset licensed from ExactData. MIMIC2 contains comprehensive clinical data for thousands of
ICU patients collected between 2001 and 2008. ExactData is a custom, large realistic synthetic EHR dataset for
chronic disease patients. Using ExactData eliminates the cost, risk and legality of real EHR data. Table 1 presents
key statistics about the two datasets.
Table 1. Key ExactData and MIMIC2 dataset statistics: number of patients, condition occurrences, drug exposures, observations, and deceased patients.
Two ETL processes were implemented to move the raw MIMIC2 and ExactData datasets into two OMOP CDM
instances. From each instance a cohort was generated for mortality prediction. The cohort for ExactData is limited to
the 53 patients with death records matched with 53 control patients to keep it balanced. The MIMIC2 cohort is
somewhat larger with 500 case patients and 500 control patients. To generate the control group we performed a
one-to-one matching with the case group based on age, gender and race. As illustrated in Figure 3, the feature vectors for
each cohort are generated using these MIMIC2 and ExactData cohorts. The features include condition occurrence,
observation and drug exposures. The observation window was set to 365 days for both cohorts and the feature values
for observations were aggregated by taking the mean of the values.
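The one-to-one control matching described above can be sketched as a greedy exact match: each case patient is paired with a not-yet-used control sharing its age, gender and race. Patient records are plain dicts here; in the real pipeline they would be rows from the OMOP person table, and the matching strategy shown is an illustrative assumption, not necessarily the exact procedure used.

```python
# Sketch of one-to-one control matching on age, gender and race.
# Greedy exact matching over dict records; illustrative only.
def match_controls(cases, controls):
    used, pairs = set(), []
    for case in cases:
        key = (case["age"], case["gender"], case["race"])
        for i, ctrl in enumerate(controls):
            if i in used:
                continue
            if (ctrl["age"], ctrl["gender"], ctrl["race"]) == key:
                used.add(i)                       # each control used once
                pairs.append((case["id"], ctrl["id"]))
                break
    return pairs

cases = [{"id": "c1", "age": 64, "gender": "F", "race": "white"}]
controls = [
    {"id": "k1", "age": 70, "gender": "M", "race": "white"},
    {"id": "k2", "age": 64, "gender": "F", "race": "white"},
]
print(match_controls(cases, controls))  # [('c1', 'k2')]
```

Exact matching keeps the cohort balanced on the chosen covariates; with scarcer controls, matching on binned age or propensity scores would be a natural relaxation.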
For each cohort three mortality predictive models were trained offline using Random Forest, SVM and KNN. After
performing cross validation and parameter tuning for each algorithm, the model with the highest Area Under the
Receiver Operating Characteristic curve (AUC) was deployed. This results in three final models, one for each
algorithm, allowing clients to specify which algorithm to use to predict new patients. The predictive models were
trained using Python scikit-learn machine learning package21. The final model training runtime, AUC, accuracy and
F1 score are reported in Table 2.
Table 2. Model evaluation using ExactData and MIMIC2 cohorts: training runtime (seconds), AUC, accuracy and F1 score for the Random Forest, SVM and KNN models.
Our entire software platform was evaluated and tested on a small Linux server with 130GB hard drive, 16GB
memory, and two Intel Xeon 2.4GHz processors with two cores each. FHIR web services were built on the Tomcat
8 application server using the HL7 API (HAPI) RESTful server library22. The FHIR web service expects a request
from the client to perform patient scoring (in our case mortality prediction) with patient data or patient identifier
contained in the request body. We evaluated the performance of the web service in two ways. The first evaluation
method measures the response time from the client’s request to the web service until a response is received back
from the server. This is an important measure since a very slow response time makes the system unusable for
practical applications including at the point-of-care where busy clinicians demand a prompt response. A single API
request to score a patient is a composite of two different requests: 1) CREATE request made to the FHIR server that
sends the data in the JSON format. The data comprises a resource type, a set of patient identifiers, the classification
algorithm to be employed (e.g. KNN, SVM or Random Forest) and the OMOP CDM instance to be used (e.g.
MIMIC2 or ExactData). The response time of this request increases with the number of patients (as can be observed
from Figures 6 and 7). 2) SEARCH request made by passing the patient group identifier as a parameter to obtain the
prediction for the patients. Similar to the CREATE request, a larger number of patients increases the response time
of the SEARCH request.
Figure 6 shows the FHIR web service response time using KNN and MIMIC2 datasets as we vary the number of
patients to score in the request. For a single patient, a typical point-of-care scenario, the response time is around one
second. The CREATE request consumes most of the time and the response time increases with the number of
patients. This could become an issue in an application such as population health to identify those chronic disease
patients most likely to be readmitted or screening an entire ICU to identify patients most in need of immediate
medical attention. The SEARCH request has a relatively constant response time and does not change significantly as
the number of patients increases.
Figure 6. Response time for CREATE and SEARCH requests using KNN and MIMIC2 data vs. the number of patients
The actual scoring or prediction of the patient is performed in the CREATE request. There are two major tasks that
take place in this CREATE request. First is the feature construction that is performed for each patient included in the
request. This requires querying the OMOP CDM database, extracting and aggregating feature values, which can be
an expensive operation when scoring a large number of patients. The second task is the actual prediction operation
which takes the least amount of time. Figure 7 shows the response time (after subtracting the feature construction
and prediction time), feature construction time, and prediction time as we increase the number of patients. As the
number of patients increases the prediction time does not increase much compared to the response time or the
feature construction time. For instance, a request for scoring 200 patients takes about 23 seconds, out of which 8
seconds were spent constructing features and only 16 milliseconds were spent on the actual predictions.
Figure 7. Response time, feature construction and prediction time vs. number of patients
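The timing breakdown reported above can be measured by wall-clocking the two phases of the CREATE handler separately. The sketch below uses `time.perf_counter` around stub functions that stand in for the real OMOP feature queries and model scoring; both stubs and their delays are hypothetical.

```python
# Sketch of instrumenting the CREATE handler: time feature construction
# and prediction separately. The two stubs stand in for the real OMOP
# queries and model scoring; delays are hypothetical.
import time

def construct_features(patient_ids):
    time.sleep(0.01)  # stand-in for OMOP queries + feature aggregation
    return [[0.0] * 5 for _ in patient_ids]

def predict(features):
    return [0.5 for _ in features]  # stand-in for model scoring

patients = ["p%d" % i for i in range(10)]

t0 = time.perf_counter()
feats = construct_features(patients)
t1 = time.perf_counter()
scores = predict(feats)
t2 = time.perf_counter()

print("feature construction: %.3fs, prediction: %.3fs" % (t1 - t0, t2 - t1))
```

Logging the two phases separately is what makes it possible to attribute most of the 23-second, 200-patient request to database access rather than to the model itself.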
The second web service performance evaluation measures the amount of load that the service can handle within
some defined time interval. As this service handles multiple clients it is important to guarantee that it is available
and able to respond in a timely manner to requests as the number of clients increases. However, keep in mind that
our evaluation is only a proof of concept done on a moderate server. Figure 8 shows the average response time when
1000 clients send CREATE requests (one patient in each request) to the FHIR server within a minute. Figure 8 is
generated by the Simple Cloud-based Load Testing tool23 and has three axes: response time on y-axis, number of
clients on secondary y-axis and time on x-axis, where 00:05 indicates the fifth second from the start of the test. The
green curve (upper curve) shows the number of simultaneous clients sending the requests at some point in time. For
instance, on the fifth second, about 50 clients were sending simultaneous requests. The blue line shows the average
response time for those clients (on the fifth second, the average response time for 50 clients was about 1800
milliseconds). Overall, the average response time for the 1000 clients is 1506 milliseconds (about 1.5 seconds). A 1.5
second response time is acceptable when the task is performed at the point-of-care.
Figure 8. Average response time as 1,000 clients send CREATE request within one minute
Conclusion and Discussion
In this paper we presented a real-time predictive modeling development and deployment platform using the evolving
HL7 FHIR standard and demonstrated the system using MIMIC2 ICU and ExactData chronic disease datasets for
mortality prediction. The system consists of three core components: 1) The OMOP CDM, which is tailored more for
predictive modeling than for healthcare operational needs and stores EHR data using standardized vocabularies. 2)
Predictive model development, which consists of cohort construction, feature extraction, model training and model
evaluation. This training phase is streamlined, meaning that it will work for any type of data stored in an OMOP
CDM. 3) FHIR web services for predictive model deployment that we use to deliver the predictive modeling service
to potential clients. The web service takes a prediction request containing patient identifiers whose features can be
extracted from the OMOP database. The prediction or the scoring is not pre-computed but is performed in real-time.
Our future work on the FHIR web services will enhance the feature extraction by accessing EHR data over FHIR
Search operations. In this case, our predictive platform will become a client to the FHIR-enabled EHR. The recent FHIR
ballot for DSTU2 includes a Subscription resource24. This resource, if included in the standard, can be utilized in our
platform to subscribe to the patient's feature-related data from their EHR. The Subscription resource uses a Search string
for its criteria, so our future work will comply with this new resource with only a few modifications. Even if this
resource is not included, our future platform will use a pull mechanism to retrieve EHR data for feature construction.
A total of six predictive models were trained on MIMIC2 ICU and ExactData chronic disease datasets from which
we generated cohorts based on patients with a death event and a matched set of control patients. The FHIR web
service routes the incoming requests to the desired predictive model, which is usually a client specified parameter.
For a practical, real-time web service it is important to achieve a fast response time. The observed response time for
scoring one patient was around one second, of which the actual prediction took only a few milliseconds. The total
response time increases with the number of patients in a single request, but the actual prediction time remains very
small, reaching 16 milliseconds when scoring 200 patients. However, scoring many patients in one request is not
always desired or needed. For instance, in many cases providers are only interested in querying one patient at a time
from the point-of-care. In such a direct care delivery scenario only a single request containing one patient’s
information would be sent to the server with an expected response time of one second. Additionally, this response
time can be easily reduced by expanding the server hardware and optimizing the implementation.
We present a prototype with much work and many needed improvements yet to be done. First, additional predictive
algorithms should be added. We demonstrated Random Forest, SVM and KNN, but others such as Logistic
Regression and Decision Trees could be added. Second, the system should allow for updating the predictive model
as new patients get scored. This requires using the patient EHR data passed into the web service for retraining the
predictive model. For this to take place, the web service should allow the provider to give feedback on the patient
scoring, which can then be used to improve future predictions. Third, the system needs to be scalable and should
handle larger datasets to train the predictive models. This can be done by utilizing big data technologies such as
Hadoop25 or Apache Spark26. Fourth, the response time needs to be improved especially for requests that contain
large numbers of patients, which can be done by allocating more resources and improving the FHIR web service
software. Additionally, the response time can be decreased if the client includes patient EHR data in the request,
thus avoiding expensive database querying. Fifth, the web service must have privacy and security protocols
implemented such as OAuth2, which has been already implemented in the SMART on FHIR app platform.
The proposed approach in this paper assumes that the client that is deploying our predictive model stores the patient
data in an OMOP CDM model and the CREATE request contains the identifier for the patient for which the analysis
is to be performed. Any ETL process for moving data from EHR to OMOP model is outside the scope of this paper.
However, FHIR resources such as MedicationPrescriptions, Conditions, Observations can be included in the body
of the CREATE request instead of the patient identifier. This allows for performing analysis for patients not
included in the database. Additionally, two processes can be performed on the FHIR resources, if passed in the
CREATE request: 1) perform prediction using the patient data (resources) in the request and 2) subscribe to the
patient's feature-related data from the EHR to the predictive platform.
In conclusion, we have demonstrated the ease of developing and deploying a real-time predictive modeling web service built on open source technologies and open standards such as OMOP and FHIR. With this web service we can bring research into clinical practice more easily and cost effectively by allowing clients to tap into the service from their existing EHRs or mobile applications.
This work was supported by the National Science Foundation, award #1418511, Children's Healthcare of Atlanta,
CDC i-SMILE project, Google Faculty Award, AWS Research Award, and Microsoft Azure Research Award.
1. Brownson, R. C., Colditz, G. A. & Proctor, E. K. Dissemination and implementation research in health:
translating science to practice. (Oxford University Press, 2012).
2. Dobbins, M., Ciliska, D., Cockerill, R., Barnsley, J. & DiCenso, A. A Framework for the Dissemination and Utilization of Research for Health-Care Policy and Practice. Worldviews Evid.-Based Nurs. Presents Arch. Online J. Knowl. Synth. Nurs. E9, 149–160 (2002).
3. Sun, J. et al. Predicting changes in hypertension control using electronic health records from a chronic disease
management program. J. Am. Med. Inform. Assoc. JAMIA (2013). doi:10.1136/amiajnl-2013-002033
4. Glasgow, R. E. & Emmons, K. M. How Can We Increase Translation of Research into Practice? Types of Evidence Needed. Annu. Rev. Public Health 28, 413–433 (2007).
5. Wehling, M. Translational medicine: science or wishful thinking? J. Transl. Med. 6, 31 (2008).
6. Hörig, H., Marincola, E. & Marincola, F. M. Obstacles and opportunities in translational research. Nat. Med. 11, 705–708 (2005).
7. Soto, G. E. & Spertus, J. A. EPOCH and ePRISM: A web-based translational framework for bridging outcomes
research and clinical practice. in Computers in Cardiology, 2007 205–208 (2007).
8. Bellazzi, R. & Zupan, B. Predictive data mining in clinical medicine: Current issues and guidelines. Int. J. Med. Inf. 77, 81–97 (2008).
9. Soto, G. E., Jones, P. & Spertus, J. A. PRISM™: a Web-based framework for deploying predictive clinical models. in Computers in Cardiology, 2004 193–196 (2004). doi:10.1109/CIC.2004.1442905
10. Velickovski, F. et al. Clinical Decision Support Systems (CDSS) for preventive management of COPD
patients. J. Transl. Med. 12, S9 (2014).
11. Barry, S. The Rise and Fall of HL7. (2011).
12. Bender, D. & Sartipi, K. HL7 FHIR: An Agile and RESTful approach to healthcare information exchange. in 2013 IEEE 26th International Symposium on Computer-Based Medical Systems (CBMS) 326–331 (2013).
13. Miliard, M. Epic, Cerner, others join HL7 project. Healthcare IT News (2014).
14. Open Epic.
15. Raths, D. SMART on FHIR a Smoking Hot Topic at AMIA Meeting. Healthcare Informatics (2014).
16. Ryan, P. B., Griffin, D. & Reich, C. OMOP common data model (CDM) Specifications. (2009).
17. Zhou, X. et al. An Evaluation of the THIN Database in the OMOP Common Data Model for Active Drug Safety Surveillance. Drug Saf. 36, 119–134 (2013).
18. Letham, B., Rudin, C., McCormick, T. H. & Madigan, D. An Interpretable Stroke Prediction Model using Rules and Bayesian Analysis. (2013).
19. Duke, J., Zhang, Z. & Li, X. Characterizing an optimal predictive modeling framework for prediction of adverse drug events. Stakehold. Symp. (2014).
20. HL7 FHIR Development Version.
21. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res 12, 2825–2830 (2011).
22. HL7 Application Programming Interface.
23. Simple Cloud-based Load Testing.
24. FHIR Resource Subscription.
25. Shvachko, K., Kuang, H., Radia, S. & Chansler, R. The Hadoop Distributed File System. in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) 1–10 (2010).
26. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: cluster computing with working sets. in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing 10–10 (2010).