
Clinical Predictive Modeling Development and Deployment through FHIR Web Services


Abstract

Clinical predictive modeling involves two challenging tasks: model development and model deployment. In this paper we demonstrate a software architecture for developing and deploying clinical predictive models using web services via the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. The services enable model development using electronic health records (EHRs) stored in OMOP CDM databases and model deployment for scoring individual patients through FHIR resources. The MIMIC2 ICU dataset and a synthetic outpatient dataset were transformed into OMOP CDM databases for predictive model development. The resulting predictive models are deployed as FHIR resources, which receive requests containing patient information, run the deployed predictive model and respond with prediction scores. To assess the practicality of this approach we evaluated the response and prediction time of the FHIR modeling web services. We found the system to be reasonably fast, with a total response time of about one second per patient prediction.
Mohammed Khalilia, Myung Choi, Amelia Henderson, Sneha Iyengar, Mark Braunstein,
Jimeng Sun
Georgia Institute of Technology, Atlanta, Georgia
Clinical predictive modeling research has grown with the increasing adoption of electronic health records [1-3].
Nevertheless, the dissemination and translation of predictive modeling research findings into healthcare delivery is
often challenging. Reasons for this include political, social, economic and organizational factors [4-6]. Other barriers
include the lack of computer programming skills among the target end users (i.e. physicians) and the difficulty of
integration with the highly fragmented existing health informatics infrastructure [7]. Additionally, in many cases the
evaluation of the feasibility of predictive modeling marks the end of the project, with no attempt to deploy those
models into real practice [8]. To achieve real impact, researchers should be concerned about the deployment and
dissemination of their algorithms and tools into day-to-day decision support, and some researchers have developed
approaches to doing this. For example, Soto et al. developed EPOCH and ePRISM [7], a unified web portal and
associated services for deploying clinical risk models and decision support tools. ePRISM is a general regression
model framework for prediction and encompasses various prognostic models [9]. However, ePRISM does not provide an
interface allowing for integration with existing EHR data. It requires users to input model parameters, which can be
time consuming and a particular challenge for researchers unfamiliar with the nuances of clinical terminology and the
underlying algorithms. Velickovski et al. [10] developed a suite of decision support web services for chronic
obstructive pulmonary disease detection and diagnosis, in which integration into providers’ workflow is supported
through a service-oriented architecture. However, despite these few efforts and many calls for researchers to be more
involved in the practical dissemination of their systems, little has been done, and even less accomplished, to bring
predictive modeling algorithms to the point of care.
An important factor that hinders bringing research into practice is the lack of simple yet powerful standards
that could facilitate integration with the existing healthcare infrastructure. Currently, one major impediment to the
use of existing standards is their complexity [11]. The emerging Health Level 7 (HL7) Fast Healthcare Interoperability
Resources (FHIR) standard provides a simplified data model represented as some 100-150 JSON or XML objects
(the FHIR Resources). Each resource consists of a number of logically related data elements that will be defined 80%
through the HL7 specification and 20% through customized extensions [12]. Additionally, FHIR supports other
web standards such as XML, HTTP and OAuth. Furthermore, since FHIR supports a RESTful architecture for
information and message exchange, it is suitable for use in a variety of settings such as mobile applications
and cloud computing. Recently, all four major health enterprise software vendors (Cerner, Epic, McKesson and
MEDITECH), along with major providers including Intermountain Healthcare, Mayo Clinic and Partners Healthcare,
have joined the Argonaut Project to further extend FHIR to encompass clinical documents constructed from FHIR
resources [13]. Epic, the largest enterprise healthcare software vendor, has a publicly available FHIR server for testing
that supports a subset of FHIR resource types including Patient, Adverse Reaction, Medication Prescription,
Condition and Observation [14], and support of these resources is reportedly included in the June 30, 2015 release of
Version 15 of its software. SMART on FHIR has been developed by Harvard Boston Children’s Hospital as a
universal app platform to seamlessly integrate medical applications into diverse EHR systems at the point of care.
Cerner and four other EHR systems demonstrated their ability to run the same third-party FHIR app at
HIMSS 2014. Of particular importance to our work is the demonstrated ability to provide SMART on FHIR app
services within the context and workflow of Cerner’s PowerChart EHR [15].
As a result of these efforts, FHIR can both facilitate integration with existing EHRs and form a common
communication protocol using RESTful web services between healthcare organizations. This provides a clear path
for the widespread dissemination and deployment of research findings such as predictive modeling in clinical
practice. However, despite its popularity, FHIR is currently not suited to directly support predictive model
development, where a large volume of EHR data needs to be processed in order to train an accurate model. To
streamline predictive model development it is important to adopt a common data model (CDM) for storing EHR
data. The Observational Medical Outcomes Partnership (OMOP) CDM was developed to transform data in disparate
databases into a common format and to map EHR data into a standardized vocabulary [16]. The OMOP CDM has been
used in various settings including drug safety and surveillance [17], stroke prediction [18], and prediction of adverse
drug events [19]. We utilize a database in the OMOP CDM to support predictive model development. The resulting
predictive model is then deployed as FHIR resources for scoring individual patients.
Based on these considerations, in this paper we propose to develop and deploy:
• A predictive modeling development platform using the OMOP CDM for data storage and standardization
• A suite of predictive modeling algorithms operating against data stored in the OMOP CDM
• FHIR web services that use the resulting trained predictive models to perform prediction on new patients
• A pilot test using MIMIC2 ICU and ExactData chronic disease outpatient datasets for mortality prediction
Overview and system architecture
Figure 1 shows how our architecture supports providing predictive modeling services to clinicians via their EHR. On
the server side, the model development platform trains and compares multiple predictive models using EHR data
stored in the OMOP CDM. After training, the best predictive models are deployed to a dedicated FHIR server as
executable predictive models. On the client side, users can use existing systems such as desktop or mobile
applications within their current workflows to query the predictive model specified in the FHIR resource. Such
integrations are done through FHIR web services. Client applications use a FHIR resource to package patient health
information and transport it using the FHIR RESTful Application Programming Interface (API). Once the FHIR
server receives the information, it passes it on to the deployed predictive model for risk assessment. The result
returned by the predictive model is sent to the client and also stored in a resource database, which the client can
access to read or search the RiskAssessment resources for later use.
Figure 1. System Architecture
The common data model
Developing a reliable and reusable predictive model requires a common data model into which diverse EHR data
sets are transformed and stored. For the proposed system we used the OMOP CDM, chosen for three reasons.
First, data in the OMOP CDM is organized in a way that is optimal for data
analysis and predictive modeling, rather than for the operational needs of healthcare providers and other
administrative and financial personnel. Second, OMOP provides a data standard using existing vocabularies such as
the Systematized Nomenclature of Medicine (SNOMED), RxNORM, the National Drug Code (NDC) and the
Logical Observation Identifiers Names and Codes (LOINC). As a result, predictive models built using data in an
OMOP CDM identify standardized features, assuming the mapping to OMOP is reliable. The OMOP CDM is also
technology neutral and can be implemented in any relational database such as Oracle, PostgreSQL, MySQL or MS
SQL Server. Third, our system can directly benefit the existing OMOP CDM community and foster collaborations.
The OMOP CDM consists of 37 data tables divided into standardized clinical data, vocabulary, health system data
and health economics. We focused on only a few of the CDM clinical tables: person, condition
occurrence, observation and drug exposure. As we enhance our Extract, Transform, Load (ETL) process and the
predictive model, we can incorporate additional data sources as needed. Figure 2 shows a high level overview of the
ETL process, in which multiple raw EHR datasets are mapped to their corresponding CDM instances. In the
transformation process, EHR source values such as lab names and results, diagnosis codes and medication names
are mapped to OMOP concept identifiers. The standardized data can then be accessed to train the predictive models.
Figure 2. EHR data to OMOP CDM
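The transformation step described above can be illustrated with a minimal sketch. It assumes a small in-memory lookup from (source vocabulary, source code) to a concept identifier; the lookup contents, the numeric identifiers, the source codes, the `CdmRow` type and the `to_cdm_row` helper are all hypothetical placeholders introduced for illustration and are not taken from our ETL implementation or from the OMOP vocabulary tables.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical lookup: (source vocabulary, source code) -> concept identifier.
# The numeric identifiers are made-up placeholders, not real OMOP concept ids.
SOURCE_TO_CONCEPT: Dict[Tuple[str, str], int] = {
    ("ICD9CM", "428.0"): 1001,
    ("LOINC", "2345-7"): 2001,
    ("NDC", "0071-0155-23"): 3001,
}

# Which CDM clinical table each source vocabulary feeds in this sketch.
TABLE_FOR_VOCABULARY: Dict[str, str] = {
    "ICD9CM": "condition_occurrence",
    "LOINC": "observation",
    "NDC": "drug_exposure",
}


@dataclass
class CdmRow:
    person_id: int
    table: str
    concept_id: int
    value: Optional[float] = None


def to_cdm_row(person_id: int, vocabulary: str, code: str,
               value: Optional[float] = None) -> Optional[CdmRow]:
    """Map one raw EHR value to a CDM row, or None when no mapping exists."""
    concept_id = SOURCE_TO_CONCEPT.get((vocabulary, code))
    if concept_id is None:
        return None  # in this sketch, unmapped source values are simply skipped
    return CdmRow(person_id, TABLE_FOR_VOCABULARY[vocabulary], concept_id, value)


lab_row = to_cdm_row(42, "LOINC", "2345-7", 98.0)   # lab result with a value
unmapped = to_cdm_row(42, "ICD9CM", "000.0")         # no entry in the lookup
```

A production ETL would read the mapping from the CDM vocabulary tables rather than from a hard-coded dictionary; the dictionary only keeps the sketch self-contained.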
Predictive model development
The CDM provides the foundation for predictive modeling. As various datasets are transformed and loaded into the
OMOP CDM, predictive model training is simplified because all OMOP CDM instances have the same
structure. For instance, the same pipeline can train a model for predicting mortality, future diseases or readmission
using different datasets. Figure 3 provides an overview of predictive model training.
Figure 3. Training predictive models based on OMOP CDM
The predictive models are trained offline and can be re-trained as additional records are added to the database.
Training consists of three modules:
1) Cohort Construction: this is the first step in the training phase. At this stage the user specifies the OMOP CDM
instance, the prediction target (e.g. mortality) and the cohort definition. Based on the specified configuration the
module generates the patient cohort.
2) Feature Construction: at this stage the user specifies which data sources (e.g. drugs, condition occurrence and
observations) to include when constructing the features. Additional configurations can also be provided for each
data source. The user can set the observation window (e.g. the prior year, to utilize only patient data recorded in
the past 365 days). Other data source configurations include the condition type concept identifier, which specifies
which types of conditions to include (e.g. primary, secondary), the observation type concept identifier and the drug
type concept identifier. The final configuration is the feature value aggregation function. For example, lab result
values can be computed using one of five aggregation functions: sum, max, mean, min, and most recent value.
3) Model Training: this module takes the feature vectors constructed for the cohort and trains multiple models using
algorithms such as Random Forest, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). The parameters
for each of these three algorithms are tuned using cross validation in order to select the best performing model.
The best model is deployed as a FHIR resource. In the next section, we describe how to deploy such a predictive
model for scoring future patients in real time.
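The Feature Construction module can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: records are assumed to be `(concept_id, record_date, value)` tuples already extracted from the CDM, and the names `AGGREGATORS` and `build_features` are hypothetical. It applies the observation window and one of the five aggregation functions named above.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Callable, Dict, List, Tuple

Observation = Tuple[date, float]          # (record date, value)
Record = Tuple[int, date, float]          # (concept id, record date, value)

# The five aggregation functions named in the text.
AGGREGATORS: Dict[str, Callable[[List[Observation]], float]] = {
    "sum": lambda obs: sum(v for _, v in obs),
    "max": lambda obs: max(v for _, v in obs),
    "min": lambda obs: min(v for _, v in obs),
    "mean": lambda obs: mean([v for _, v in obs]),
    "most_recent": lambda obs: max(obs, key=lambda o: o[0])[1],
}


def build_features(records: List[Record], index_date: date,
                   window_days: int = 365,
                   aggregate: str = "mean") -> Dict[int, float]:
    """Aggregate one patient's values per concept within the observation window."""
    window_start = index_date - timedelta(days=window_days)
    grouped: Dict[int, List[Observation]] = {}
    for concept_id, record_date, value in records:
        # Keep only records that fall inside the observation window.
        if window_start <= record_date <= index_date:
            grouped.setdefault(concept_id, []).append((record_date, value))
    aggregator = AGGREGATORS[aggregate]
    return {concept_id: aggregator(obs) for concept_id, obs in grouped.items()}


records = [
    (2001, date(2015, 1, 10), 90.0),
    (2001, date(2015, 3, 1), 110.0),
    (2001, date(2013, 1, 1), 500.0),   # outside the 365-day window, ignored
]
mean_features = build_features(records, date(2015, 6, 1))
recent_features = build_features(records, date(2015, 6, 1), aggregate="most_recent")
```

Keeping the aggregation functions in a dictionary makes the aggregation a configuration value, which mirrors how the module is described: the user picks the function per data source rather than changing code.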
FHIR web services for model deployment
Our approach is focused on API-based predictive modeling services that can be easily implemented in thin client
applications, especially in the mobile environment. FHIR defines resources represented as JSON or XML objects
that can contain health concepts along with reference and searchable parameters. FHIR further defines RESTful API
URL patterns for create, read, update, and delete (CRUD) operations. In this paper, we propose to use the
RiskAssessment resource defined in the FHIR Draft Standard for Trial Use 2 (DSTU2) for our predictive analysis.
Readers should note that this particular FHIR resource is still in the draft stage. The currently approved DSTU1
version of FHIR does not have a RiskAssessment resource. However, the development version of FHIR is currently
being balloted on by HL7 members and should be approved soon [20]. Detailed information about the FHIR
development version of RiskAssessment can be obtained from Ref. 20.
A prediction request to the FHIR server starts with a CREATE operation, which requests scoring of
specific patients in real time using the deployed predictive model. This creates RiskAssessment resources at
the server. Client applications then receive a status response with a resource identifier that refers to the newly
created resource. Clients can use this resource identifier to read or search the resource database via a FHIR RESTful
API. For this paper, we used a SEARCH operation as we need to retrieve more than one result. A group of FHIR
resources is returned as a bundle, which has a specified format. As all performed analyses are stored in the
resource database, additional query types can be implemented in the future. The operation process is depicted in
Figure 4.
Figure 4. Operation Process
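The CREATE then SEARCH flow can be made concrete with a small in-memory sketch. It is an illustration only: the class `RiskAssessmentService`, its method names and the returned dictionaries are hypothetical stand-ins for the FHIR server and its resource database, the request layout follows Figure 5(a), and the bundle layout (`entry` items each wrapping a `resource`) is an assumption about the response format.

```python
from typing import Callable, Dict, List


class RiskAssessmentService:
    """In-memory stand-in for the FHIR server plus its resource database."""

    def __init__(self, score_patient: Callable[[str, str], float]) -> None:
        self._score_patient = score_patient          # (model id, patient ref) -> score
        self._database: Dict[str, List[dict]] = {}   # group reference -> stored results

    def create(self, request: dict) -> dict:
        """CREATE: score every patient listed in `basis` and store the results."""
        group_ref = request["subject"][0]["reference"]
        model_id = request["method"][0]["text"]
        results = []
        for item in request["basis"]:
            patient_ref = item["reference"]
            score = self._score_patient(model_id, patient_ref)
            results.append({
                "resourceType": "RiskAssessment",
                "id": patient_ref,
                "prediction": {"probabilityDecimal": score},
                "method": model_id,
                "subject": {"reference": group_ref},
            })
        self._database[group_ref] = results
        # Status plus the identifier the client later uses for SEARCH.
        return {"status": 201, "id": group_ref}

    def search(self, group_ref: str) -> dict:
        """SEARCH: return all stored results for a group, packaged as a bundle."""
        entries = [{"resource": r} for r in self._database.get(group_ref, [])]
        return {"resourceType": "Bundle", "entry": entries}


# Dummy scorer standing in for the deployed predictive model.
service = RiskAssessmentService(lambda model_id, patient_ref: 0.5)
request = {
    "resourceType": "RiskAssessment",
    "basis": [{"reference": "Patient/1"}, {"reference": "Patient/2"}],
    "method": [{"text": "knn-mimic2"}],
    "subject": [{"reference": "Group/icu-round-1"}],
}
created = service.create(request)
bundle = service.search(created["id"])
```

The point of the sketch is the separation of concerns: scoring happens once, during CREATE, and SEARCH only reads stored results, which is why additional query types can be added later without re-running the model.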
During the process, clients and server need to put appropriate information into elements available in the
RiskAssessment resource. However, most of the data elements in the RiskAssessment resource are optional for
predictive modeling which gives us flexibility in choosing the model output. For the CREATE operation, the
RiskAssessment resource is constructed with subject, basis, and method elements as shown in Figure 5(a).
(a) Request
{
  "resourceType" : "RiskAssessment",
  "basis" : [
    { "reference" : "Patient/patient 1 identifier" },
    { "reference" : "Patient/patient 2 identifier" }
  ],
  "method" : [ { "text" : "predictive model identifier" } ],
  "subject" : [ { "reference" : "Group/group identifier" } ]
}
(b) Response
{
  "resourceType" : "RiskAssessment",
  "id" : "patient 1 identifier",
  "prediction" : { "probabilityDecimal" : patient 1 score },
  "method" : "predictive model identifier",
  "subject" : { "reference" : "Group/group identifier" }
}
{
  "resourceType" : "RiskAssessment",
  "id" : "patient 2 identifier",
  "prediction" : { "probabilityDecimal" : patient 2 score },
  "method" : "predictive model identifier",
  "subject" : { "reference" : "Group/group identifier" }
}
Figure 5. JSON RiskAssessment FHIR request and the corresponding response
Subject is used to define a group identifier, which we also refer to as the resource identifier. Basis can contain
the information used in the assessment, which here is the list of patient identifiers. All patients included in basis are
bound to the group identifier specified in the subject element. Therefore, the assessment results for these patients are
grouped in the resource database by the group identifier at the FHIR server. The subject, or group identifier, provides
a simple mechanism for users to specify a group of patients whose prediction scores can be retrieved from the resource
database.
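A client could assemble the CREATE request body of Figure 5(a) as in the sketch below. The function name `build_create_request` and the example identifiers are hypothetical; the field layout follows the figure. Sending the body to a server is deliberately left out, since the server address is deployment specific.

```python
import json
from typing import List


def build_create_request(patient_ids: List[str], model_id: str, group_id: str) -> dict:
    """Build a RiskAssessment CREATE body with subject, basis and method elements."""
    return {
        "resourceType": "RiskAssessment",
        # One reference per patient to be scored.
        "basis": [{"reference": f"Patient/{pid}"} for pid in patient_ids],
        # Identifier of the deployed predictive model to use.
        "method": [{"text": model_id}],
        # Group identifier that binds all patients in this request together.
        "subject": [{"reference": f"Group/{group_id}"}],
    }


request_body = build_create_request(["101", "102"], "knn-mimic2", "icu-round-1")
request_json = json.dumps(request_body, indent=2)
```

The resulting `request_json` string is what would be placed in the body of the HTTP request for the CREATE operation.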
Predictive model scoring creates feature vectors (one for each patient), which are derived from the patient’s health
information. The patient can be specified either by the patient identifier or by a collection of
resources such as MedicationPrescriptions, Conditions, Observations, etc. In our initial implementation we used
the patient identifier. Using the patient identifier, the predictive model constructs feature vectors by pulling the
appropriate information from the various sources that contain each patient’s clinical data recorded within the
specified observation window. In our implementation the clinical data is stored in the OMOP database. This all
happens during the predictive analysis period in Figure 4. In the future, we will construct features directly from EHR
databases by querying other FHIR resources in real time.
Results from the predictive analysis are sent back to the FHIR server in JSON format and the FHIR server stores the
information in a resource database. If the CREATE operation is successful, the server sends a 201 (Created) status
message. In case of any errors, an appropriate error status message is sent back to the client, together with an
OperationOutcome resource if required [12].
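On the client side, the two outcomes could be handled as in the following sketch. The helper `interpret_create_response` is hypothetical, and the assumption that each OperationOutcome issue carries a human-readable `diagnostics` string is ours; the exact fields should be checked against the FHIR version actually deployed.

```python
def interpret_create_response(status_code: int, body: dict) -> str:
    """Summarize the server's answer to a CREATE request as a short message."""
    if status_code == 201:
        return "created"
    if body.get("resourceType") == "OperationOutcome":
        # Collect whatever issue descriptions the server supplied (assumed field name).
        messages = [issue.get("diagnostics", "unspecified issue")
                    for issue in body.get("issue", [])]
        if messages:
            return "error: " + "; ".join(messages)
    # Fall back to the bare HTTP status when no OperationOutcome details exist.
    return f"error: HTTP {status_code}"


ok_message = interpret_create_response(201, {})
error_message = interpret_create_response(
    400,
    {"resourceType": "OperationOutcome",
     "issue": [{"severity": "error", "diagnostics": "unknown predictive model"}]},
)
bare_message = interpret_create_response(500, {})
```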
Clients can query the prediction results using the resource identifier returned in the CREATE response (in this case,
the group identifier under subject). This resource identifier can be used to query the stored prediction results using
the SEARCH operation. Recall that the resource identifier is bound to a list of patient identifiers. Once the
SEARCH operation request is received with the resource identifier, the FHIR server constructs a response FHIR
resource for each patient. In the RiskAssessment resource, the prediction element contains the risk score for a
patient, as shown in Figure 5(b). When the SEARCH operation is completed, the RiskAssessment resources are
packaged into a bundle and sent to the client. Each resource in the bundle contains an identifier that represents the
patient for whom the predictive analysis was performed. In our experiments, mortality prediction was performed and
risk scores were returned. The mortality scores are populated in the probabilityDecimal sub-element of the prediction
element, as shown in Figure 5(b).
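Reading the scores out of the returned bundle could look like the sketch below. The function `extract_scores` is hypothetical, each RiskAssessment is assumed to have the shape shown in Figure 5(b), and the bundle is assumed to list its resources under `entry`, each wrapped in a `resource` key.

```python
from typing import Dict


def extract_scores(bundle: dict) -> Dict[str, float]:
    """Map each patient identifier in the bundle to its predicted risk score."""
    scores: Dict[str, float] = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        prediction = resource.get("prediction")
        if prediction is None:
            continue  # skip anything that carries no prediction element
        scores[resource["id"]] = prediction["probabilityDecimal"]
    return scores


example_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "RiskAssessment", "id": "101",
                      "prediction": {"probabilityDecimal": 0.82},
                      "method": "knn-mimic2",
                      "subject": {"reference": "Group/icu-round-1"}}},
        {"resource": {"resourceType": "RiskAssessment", "id": "102",
                      "prediction": {"probabilityDecimal": 0.10},
                      "method": "knn-mimic2",
                      "subject": {"reference": "Group/icu-round-1"}}},
    ],
}
scores = extract_scores(example_bundle)
highest_risk = max(scores, key=scores.get)
```

The scores in `example_bundle` are invented for the example and are not results from our experiments.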
We tested our implementation using two datasets: Multiparameter Intelligent Monitoring in Intensive Care
(MIMIC2) and a dataset licensed from ExactData. MIMIC2 contains comprehensive clinical data for thousands of
ICU patients collected between 2001 and 2008. ExactData is a custom, large, realistic synthetic EHR dataset for
chronic disease patients. Using ExactData avoids the cost, risk and legal constraints of working with real EHR data.
Table 1 presents key statistics about the two datasets.
Table 1. Key ExactData and MIMIC2 Dataset Statistics (rows: number of patients, number of condition occurrences, number of drug exposures, number of observations, number of deceased patients)
Two ETL processes were implemented to move the raw MIMIC2 and ExactData datasets into two OMOP CDM
instances. From each instance a cohort was generated for mortality prediction. The ExactData cohort is limited to
the 53 patients with death records, matched with 53 control patients to keep it balanced. The MIMIC2 cohort is
somewhat larger, with 500 case patients and 500 control patients. To generate each control group we performed
one-to-one matching with the case group based on age, gender and race. As illustrated in Figure 3, feature vectors
are then generated for the patients in each cohort. The features include condition occurrences,
observations and drug exposures. The observation window was set to 365 days for both cohorts, and the feature values
for observations were aggregated by taking the mean of the values.
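One way to build such a balanced cohort is sketched below. It assumes exact matching on the three attributes and a first-come assignment of controls; the text above only states that matching was one-to-one on age, gender and race, so the exact-match rule, the function name `match_controls` and the record layout are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Patient = Dict[str, object]


def match_controls(cases: List[Patient],
                   candidates: List[Patient]) -> List[Tuple[Patient, Patient]]:
    """Pair each case with at most one unused control of equal age, gender and race."""
    pool: Dict[tuple, List[Patient]] = {}
    for candidate in candidates:
        key = (candidate["age"], candidate["gender"], candidate["race"])
        pool.setdefault(key, []).append(candidate)

    pairs: List[Tuple[Patient, Patient]] = []
    for case in cases:
        bucket = pool.get((case["age"], case["gender"], case["race"]), [])
        if bucket:
            # pop(0) removes the control so it cannot be matched twice.
            pairs.append((case, bucket.pop(0)))
    return pairs


cases = [
    {"id": "c1", "age": 70, "gender": "F", "race": "white"},
    {"id": "c2", "age": 55, "gender": "M", "race": "black"},
]
candidates = [
    {"id": "k1", "age": 70, "gender": "F", "race": "white"},
    {"id": "k2", "age": 70, "gender": "F", "race": "white"},
    {"id": "k3", "age": 60, "gender": "M", "race": "asian"},
]
pairs = match_controls(cases, candidates)
```

In this toy input only `c1` finds a control; a case without an exact match is left unpaired, which in practice would call for a looser rule such as an age band.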
For each cohort, three mortality predictive models were trained offline using Random Forest, SVM and KNN. After
performing cross validation and parameter tuning for each algorithm, the configuration with the highest Area Under the
Receiver Operating Characteristic curve (AUC) was deployed. This results in three final models per cohort, one for each
algorithm, allowing clients to specify which algorithm to use to score new patients. The predictive models were
trained using the Python scikit-learn machine learning package [21]. The final model training runtime, AUC, accuracy and
F1 score are reported in Table 2.
Table 2. Model evaluation using ExactData and MIMIC2 cohorts (reported per algorithm: runtime in seconds, AUC, accuracy and F1 score)
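The selection criterion, choosing the configuration with the highest AUC, can be illustrated without any machine learning library. The sketch below computes AUC directly from its pairwise definition (the fraction of case/control pairs in which the case receives the higher score, counting ties as one half) and picks the best of several candidate configurations. The labels, scores and configuration names are invented for illustration and are not our experimental results; in practice a library routine would be used on held-out folds.

```python
from typing import Dict, List


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random case outranks a random control."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


def select_best(labels: List[int], candidates: Dict[str, List[float]]) -> str:
    """Return the name of the candidate configuration with the highest AUC."""
    return max(candidates, key=lambda name: auc(labels, candidates[name]))


labels = [1, 1, 0, 0]                     # 1 = deceased (case), 0 = control
candidates = {
    "knn_k3": [0.9, 0.4, 0.5, 0.1],
    "knn_k5": [0.9, 0.8, 0.2, 0.1],
    "knn_k7": [0.6, 0.7, 0.65, 0.1],
}
best_name = select_best(labels, candidates)
```

The quadratic pairwise loop is fine for a sketch; rank-based formulations compute the same quantity more efficiently on large cohorts.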
Our entire software platform was evaluated and tested on a small Linux server with a 130GB hard drive, 16GB
of memory, and two Intel Xeon 2.4GHz processors with two cores each. FHIR web services were built on the Tomcat
8 application server using the HL7 API (HAPI) RESTful server library [22]. The FHIR web service expects a request
from the client to perform patient scoring (in our case mortality prediction) with patient data or a patient identifier
contained in the request body. We evaluated the performance of the web service in two ways. The first evaluation
method measures the response time from the client’s request to the web service until a response is received back
from the server. This is an important measure since a very slow response time makes the system unusable for
practical applications, including at the point of care where busy clinicians demand a prompt response. A single API
request to score a patient is a composite of two different requests:
1) A CREATE request is made to the FHIR server, sending the data in JSON format. The data comprises the resource
type, a set of patient identifiers, the classification algorithm to be employed (e.g. KNN, SVM or Random Forest) and
the OMOP CDM instance to be used (e.g. MIMIC2 or ExactData). The response time of this request increases with the
number of patients (as can be observed from Figures 6 and 7).
2) A SEARCH request is made by passing the patient group identifier as a parameter to obtain the
predictions for the patients. Similar to the CREATE request, a larger number of patients increases the response time
of the SEARCH request.
Figure 6 shows the FHIR web service response time using KNN and the MIMIC2 dataset as we vary the number of
patients to score in the request. For a single patient, a typical point-of-care scenario, the response time is around one
second. The CREATE request consumes most of the time and its response time increases with the number of
patients. This could become an issue in applications such as population health, where the goal is to identify the chronic
disease patients most likely to be readmitted, or screening an entire ICU to identify patients most in need of immediate
medical attention. The SEARCH request has a relatively constant response time and does not change significantly as
the number of patients increases.
Figure 6. Response time for CREATE and SEARCH requests using KNN and MIMIC2 data vs. the number of patients
The actual scoring or prediction of the patient is performed in the CREATE request. There are two major tasks that
take place in this CREATE request. First is the feature construction that is performed for each patient included in the
request. This requires querying the OMOP CDM database, extracting and aggregating feature values, which can be
an expensive operation when scoring a large number of patients. The second task is the actual prediction operation
which takes the least amount of time. Figure 7 shows the response time (after subtracting the feature construction
and prediction time), feature construction time, and prediction time as we increase the number of patients. As the
number of patients increases the prediction time does not increase much compared to the response time or the
feature construction time. For instance, a request for scoring 200 patients takes about 23 seconds, out of which 8
seconds were spent constructing features and only 16 milliseconds were spent on the actual predictions.
Figure 7. Response time, feature construction and prediction time vs. number of patients
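The 200-patient example can be broken down with simple arithmetic. The three input values below are the approximate figures quoted in the text; the variable names are ours.

```python
# Approximate timings quoted above for one request that scores 200 patients.
total_s = 23.0          # total response time, in seconds
feature_s = 8.0         # feature construction time, in seconds
prediction_s = 0.016    # prediction time (16 milliseconds), in seconds

# Remaining time: request handling, data transfer and other overhead.
overhead_s = total_s - feature_s - prediction_s

# Share of the total response time taken by each component.
shares = {
    "overhead": overhead_s / total_s,
    "feature_construction": feature_s / total_s,
    "prediction": prediction_s / total_s,
}
```

Under these figures roughly 15 seconds (about 65% of the total) is spent outside feature construction and prediction, about 35% on feature construction, and well under 0.1% on the prediction itself.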
The second web service performance evaluation measures the amount of load that the service can handle within
a defined time interval. As this service handles multiple clients, it is important to guarantee that it remains available
and able to respond in a timely manner as the number of clients increases. However, keep in mind that
our evaluation is only a proof of concept done on a modest server. Figure 8 shows the average response time when
1000 clients send CREATE requests (one patient in each request) to the FHIR server within a minute. Figure 8 was
generated by the Simple Cloud-based Load Testing tool [23] and has three axes: response time on the y-axis, number of
clients on the secondary y-axis and time on the x-axis, where 00:05 indicates the fifth second from the start of the test.
The green curve (upper curve) shows the number of simultaneous clients sending requests at a given point in time. For
instance, at the fifth second, about 50 clients were sending simultaneous requests. The blue line shows the average
response time for those clients (at the fifth second, the average response time for 50 clients was about 1800
milliseconds). Overall, the average response time for the 1000 clients is 1506 milliseconds (about 1.5 seconds). A 1.5
second response time is acceptable when the task is performed at the point of care.
Figure 8. Average response time as 1,000 clients send CREATE request within one minute
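A much simplified version of such a load test can be written with a thread pool. This sketch is not the tool used for Figure 8: it does not ramp clients up over a minute, and `fake_create` is a hypothetical stub that sleeps briefly in place of a real CREATE request to the server. The function names and parameters are ours.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict


def timed_call(send_request: Callable[[int], None], client_id: int) -> float:
    """Return the elapsed time of one request in milliseconds."""
    start = time.perf_counter()
    send_request(client_id)
    return (time.perf_counter() - start) * 1000.0


def run_load_test(send_request: Callable[[int], None], n_clients: int,
                  max_concurrent: int = 50) -> Dict[str, float]:
    """Issue one request per client, at most `max_concurrent` at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        latencies = list(pool.map(lambda i: timed_call(send_request, i),
                                  range(n_clients)))
    return {
        "clients": float(n_clients),
        "mean_ms": mean(latencies),
        "max_ms": max(latencies),
    }


def fake_create(client_id: int) -> None:
    """Stub standing in for a CREATE request that scores one patient."""
    time.sleep(0.001)


summary = run_load_test(fake_create, n_clients=20, max_concurrent=5)
```

Replacing `fake_create` with a function that performs the real HTTP request would turn this into a basic client-side measurement of mean and maximum response time.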
Conclusion and Discussion
In this paper we presented a real-time predictive modeling development and deployment platform using the evolving
HL7 FHIR standard and demonstrated the system using MIMIC2 ICU and ExactData chronic disease datasets for
mortality prediction. The system consists of three core components: 1) The OMOP CDM which is more tailored for
predictive modeling than healthcare operational needs and stores EHR data using standardized vocabularies. 2)
Predictive model development, which consists of cohort construction, feature extraction, model training and model
evaluation. This training phase is streamlined, meaning that it will work for any type of data stored in an OMOP
CDM. 3) FHIR web services for predictive model deployment that we use to deliver the predictive modeling service
to potential clients. The web service takes a prediction request containing patient identifiers whose features can be
extracted from the OMOP database. The prediction or the scoring is not pre-computed but is performed in real-time.
Our future work on the FHIR web services will enhance the feature extraction by accessing EHR data through FHIR
Search operations. In this case, our predictive platform will become a client of the FHIR-enabled EHR. The recent FHIR
ballot for DSTU2 includes a Subscription resource [24]. This resource, if included in the standard, can be used in our
platform to subscribe to the patient’s feature-related data from their EHR. The Subscription resource uses a Search
string for its criteria, so our future work can adopt this new resource with only a few modifications. Even if this
resource couldn’t be included, our future platform will use a pull mechanism to retrieve EHR data for feature
A total of six predictive models were trained on MIMIC2 ICU and ExactData chronic disease datasets from which
we generated cohorts based on patients with a death event and a matched set of control patients. The FHIR web
service routes the incoming requests to the desired predictive model, which is usually a client specified parameter.
For a practical, real-time web service it is important to achieve a fast response time. The observed response time for
scoring one patient was around one second, of which the actual prediction took only a few milliseconds. The total
response time increases with the number of patients in a single request, but the actual prediction time remains very
small, reaching 16 milliseconds when scoring 200 patients. However, scoring many patients in one request is not
always desired or needed. For instance, in many cases providers are only interested in querying one patient at a time
from the point of care. In such a direct care delivery scenario only a single request containing one patient’s
information would be sent to the server, with an expected response time of about one second. Additionally, we expect
that this response time could be reduced by expanding the server hardware and optimizing the implementation.
We present a prototype, and many improvements remain to be made. First, additional predictive
algorithms should be added. We demonstrated Random Forest, SVM and KNN, but others such as Logistic
Regression and Decision Trees could be added. Second, the system should allow for updating the predictive model
as new patients get scored. This requires using the patient EHR data passed into the web service for retraining the
predictive model. For this to take place, the web service should allow the provider to give feedback on the patient
scoring, which can then be used to improve future predictions. Third, the system needs to be scalable and should
handle larger datasets to train the predictive models. This can be done by utilizing big data technologies such as
Hadoop [25] or Apache Spark [26]. Fourth, the response time needs to be improved, especially for requests that contain
large numbers of patients, which can be done by allocating more resources and improving the FHIR web service
software. Additionally, the response time can be decreased if the client includes patient EHR data in the request,
thus avoiding expensive database querying. Fifth, the web service must implement privacy and security protocols
such as OAuth2, which has already been implemented in the SMART on FHIR app platform.
The approach proposed in this paper assumes that the client deploying our predictive model stores the patient
data in an OMOP CDM database and that the CREATE request contains the identifier of the patient for whom the
analysis is to be performed. Any ETL process for moving data from the EHR to the OMOP CDM is outside the scope of
this paper. However, FHIR resources such as MedicationPrescriptions, Conditions and Observations can be included in
the body of the CREATE request instead of the patient identifier. This allows for performing analysis for patients not
included in the database. Additionally, two processes can be performed on the FHIR resources, if passed in the
CREATE request: 1) perform prediction using the patient data (resources) in the request and 2) subscribe to the
patient’s feature-related data from the EHR on behalf of the predictive platform.
In conclusion, we have demonstrated the ease of developing and deploying a real-time predictive modeling web
service that uses open source technologies and standards such as OMOP and FHIR. With this web service we can
more easily and cost effectively bring research into clinical practice by allowing clients to tap into the web service
using their existing EHRs or mobile applications.
This work was supported by the National Science Foundation, award #1418511, Children's Healthcare of Atlanta,
CDC i-SMILE project, Google Faculty Award, AWS Research Award, and Microsoft Azure Research Award.
1. Brownson, R. C., Colditz, G. A. & Proctor, E. K. Dissemination and implementation research in health:
translating science to practice. (Oxford University Press, 2012).
2. Dobbins, M., Ciliska, D., Cockerill, R., Barnsley, J. & DiCenso, A. A Framework for the Dissemination and
Utilization of Research for Health-Care Policy and Practice. Worldviews Evid.-Based Nurs. Presents Arch.
Online J. Knowl. Synth. Nurs. E9, 149-160 (2002).
3. Sun, J. et al. Predicting changes in hypertension control using electronic health records from a chronic disease
management program. J. Am. Med. Inform. Assoc. JAMIA (2013). doi:10.1136/amiajnl-2013-002033
4. Glasgow, R. E. & Emmons, K. M. How Can We Increase Translation of Research into Practice? Types of
Evidence Needed. Annu. Rev. Public Health 28, 413-433 (2007).
5. Wehling, M. Translational medicine: science or wishful thinking? J. Transl. Med. 6, 31 (2008).
6. Hörig, H., Marincola, E. & Marincola, F. M. Obstacles and opportunities in translational research. Nat. Med.
11, 705-708 (2005).
7. Soto, G. E. & Spertus, J. A. EPOCH and ePRISM: A web-based translational framework for bridging outcomes
research and clinical practice. in Computers in Cardiology, 2007 205-208 (2007).
8. Bellazzi, R. & Zupan, B. Predictive data mining in clinical medicine: Current issues and guidelines. Int. J. Med.
Inf. 77, 81-97 (2008).
9. Soto, G. E., Jones, P. & Spertus, J. A. PRISM™: a Web-based framework for deploying predictive clinical
models. in Computers in Cardiology, 2004 193-196 (2004). doi:10.1109/CIC.2004.1442905
10. Velickovski, F. et al. Clinical Decision Support Systems (CDSS) for preventive management of COPD
patients. J. Transl. Med. 12, S9 (2014).
11. Barry, S. The Rise and Fall of HL7. (2011). at <
12. Bender, D. & Sartipi, K. HL7 FHIR: An Agile and RESTful approach to healthcare information exchange. in
2013 IEEE 26th International Symposium on Computer-Based Medical Systems (CBMS) 326-331 (2013).
13. Miliard, M. Epic, Cerner, others join HL7 project. Healthcare IT News (2014). at
14. Open Epic. at <>
15. Raths, D. SMART on FHIR a Smoking Hot Topic at AMIA Meeting. Healthcare Informatics (2014). at
16. Ryan, P. B., Griffin, D. & Reich, C. OMOP common data model (CDM) Specifications. (2009).
17. Zhou, X. et al. An Evaluation of the THIN Database in the OMOP Common Data Model for Active Drug
Safety Surveillance. Drug Saf. 36, 119-134 (2013).
18. Letham, B., Rudin, C., McCormick, T. H. & Madigan, D. An Interpretable Stroke Prediction Model using
Rules and Bayesian Analysis. (2013). at <>
19. Duke, J., Zhang, Z. & Li, X. Characterizing an optimal predictive modeling framework for prediction of
adverse drug events. Stakehold. Symp. (2014). at
20. HL7 FHIR Development Version. at <>
21. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res 12, 2825-2830 (2011).
22. HL7 Application Programming Interface. at <>
23. Simple Cloud-based Load Testing. at <>
24. FHIR Resource Subscription. at <>
25. Shvachko, K., Kuang, H., Radia, S. & Chansler, R. The Hadoop Distributed File System. in 2010 IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST) 1-10 (2010).
26. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: cluster computing with working
sets. in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing 10-10 (2010). at
... The integration of many sources of electronic health records via the wide scale adoption of the Fast Healthcare Interoperability Resources (FHIR) standardized format (schema) was intended to enable the development of reusable machine learning-based models [6]- [9]. But while reusability was the original goal of the FHIR schema, the inevitable variance in the amount and quality of available data requires a framework that can adapt to missing tables and data items. ...
... FHIR has been adopted across many technologies in the health industry, including mobile applications, prediction software, and health management systems. FHIR has been used to communicate patient information in addition to reporting prediction results [6]; for scalable development of deep learning models [7]; and to provide streamlined access to data for the development of mortality decision support applications [8]. Finally, FHIR standards are being integrated with several health management systems [9]. ...
... Next, we set up an one-hour quiz. We created our study datasets based on the MIMIC-III Demo data 6 . The sampled data contains the same tables as the complete dataset, but the number of patients is reduced to 100. ...
An estimated 180 papers focusing on deep learning and EHR were published between 2010 and 2018. Despite the common workflow structure appearing in these publications, no trusted and verified software framework exists, forcing researchers to arduously repeat previous work. In this paper, we propose Cardea, an extensible open-source automated machine learning framework encapsulating common prediction problems in the health domain and allows users to build predictive models with their own data. This system relies on two components: Fast Healthcare Interoperability Resources (FHIR) -- a standardized data structure for electronic health systems -- and several AUTOML frameworks for automated feature engineering, model selection, and tuning. We augment these components with an adaptive data assembler and comprehensive data- and model- auditing capabilities. We demonstrate our framework via 5 prediction tasks on MIMIC-III and Kaggle datasets, which highlight Cardea's human competitiveness, flexibility in problem definition, extensive feature generation capability, adaptable automatic data assembler, and its usability.
... The data can be stored into OMOP using ETL (Extract, Transform and Load) process, which is our next step of this research project. Storing genomic data in OMOP CDM enables analysis studies using machine learning methods that can be used in early prediction and diagnosis and improvement of personalized cancer care [4,16]. ...
Full-text available
High throughput sequencing technologies have facilitated an outburst in biological knowledge over the past decades and thus enables improvements in personalized medicine. In order to support (international) medical research with the combination of genomic and clinical patient data, a standardization and harmonization of these data sources is highly desirable. To support this increasing importance of genomic data, we have created semantic mapping from raw genomic data to both FHIR (Fast Healthcare Interoperability Resources) and OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model) and analyzed the data coverage of both models. For this, we calculated the mapping score for different data categories and the relative data coverage in both FHIR and OMOP CDM. Our results show, that the patients genomic data can be mapped to OMOP CDM directly from VCF (Variant Call Format) file with a coverage of slightly over 50%. However, using FHIR as intermediate representation does not lead to further information loss as the already stored data in FHIR can be further transformed into OMOP CDM format with almost 100% success. Our findings are in favor of extending OMOP CDM with patient genomic data using ETL to enable the researchers to apply different analysis methods including machine learning algorithms on genomic data.
... Condition Management and Treatment (n=15) [12,27,28,30,31,35,36,42,44,45,50,53,54,56,58] CDS Infrastructure (n=7) [15,33,40,41,52,60,63] Lab Result Interpretation --Genomics (n=7) [29,32,46,48,49,58,59] Risk Factor Analysis (n=6) [14,[23][24][25][26]39] Health Metric Monitoring (n=5) [34,37,43,51,61] Lab Result Interpretation --Other (n=3) [38,57,62] HIV Screening (n=1) [55] 1 (14) 1 (14) 1 (17) 2 (40) 2 (67) ...
Full-text available
Objectives: To review the current state of research on designing and implementing clinical decision support (CDS) using four current interoperability standards: Fast Healthcare Interoperability Resources (FHIR); Substitutable Medical Applications and Reusable Technologies (SMART); Clinical Quality Language (CQL); and CDS Hooks. Methods: We conducted a review of original studies describing development of specific CDS tools or infrastructures using one of the four targeted standards, regardless of implementation stage. Citations published any time before the literature search was executed on October 21, 2020 were retrieved from PubMed. Two reviewers independently screened articles and abstracted data according to a protocol designed by team consensus. Results: Of 290 articles identified via PubMed search, 44 were included in this study. More than three quarters were published since 2018. Forty-three (98%) used FHIR; 22 (50%) used SMART; two (5%) used CQL; and eight (18%) used CDS Hooks. Twenty-four (55%) were in the design stage, 15 (34%) in the piloting stage, and five (11%) were deployed in a real-world setting. Only 12 (27%) of the articles reported an evaluation of the technology under development. Three of the four articles describing a deployed technology reported an evaluation. Only two evaluations with randomized study components were identified. Conclusion: The diversity of topics and approaches identified in the literature highlights the utility of these standards. The infrequency of reported evaluations, as well as the high number of studies in the design or piloting stage, indicate that these technologies are still early in their life cycles. Informaticists will require a stronger evidence base to understand the implications of using these standards in CDS design and implementation.
... Such an evaluation requires setting up an appropriate RCT. This was not touched upon during this thesis, but MORPHER Web provides a set of APIs that could be extended to support, e.g., the FHIR RiskAssessment profile, which, in turn, could be directly integrated into a Hospital Information System (HIS) to facilitate clinical trials [184]. ...
An ever-increasing number of prediction models is published every year in different medical specialties. Prognostic or diagnostic in nature, these models support medical decision making by utilizing one or more items of patient data to predict outcomes of interest, such as mortality or disease progression. While different computer tools exist that support clinical predictive modeling, I observed that the state of the art is lacking in the extent to which the needs of research clinicians are addressed. When it comes to model development, current support tools either 1) target specialist data engineers, requiring advanced coding skills, or 2) cater to a general-purpose audience, therefore not addressing the specific needs of clinical researchers. Furthermore, barriers to data access across institutional silos, cumbersome model reproducibility and extended experiment-to-result times significantly hampers validation of existing models. Similarly, without access to interpretable explanations, which allow a given model to be fully scrutinized, acceptance of machine learning approaches will remain limited. Adequate tool support, i.e., a software artifact more targeted at the needs of clinical modeling, can help mitigate the challenges identified with respect to model development, validation and interpretation. To this end, I conducted interviews with modeling practitioners in health care to better understand the modeling process itself and ascertain in what aspects adequate tool support could advance the state of the art. The functional and non-functional requirements identified served as the foundation for a software artifact that can be used for modeling outcome and risk prediction in health research. To establish the appropriateness of this approach, I implemented a use case study in the Nephrology domain for acute kidney injury, which was validated in two different hospitals. 
Furthermore, I conducted user evaluation to ascertain whether such an approach provides benefits compared to the state of the art and the extent to which clinical practitioners could benefit from it. Finally, when updating models for external validation, practitioners need to apply feature selection approaches to pinpoint the most relevant features, since electronic health records tend to contain several candidate predictors. Building upon interpretability methods, I developed an explanation-driven recursive feature elimination approach. This method was comprehensively evaluated against state-of-the art feature selection methods. Therefore, this thesis' main contributions are three-fold, namely, 1) designing and developing a software artifact tailored to the specific needs of the clinical modeling domain, 2) demonstrating its application in a concrete case in the Nephrology context and 3) development and evaluation of a new feature selection approach applicable in a validation context that builds upon interpretability methods. In conclusion, I argue that appropriate tooling, which relies on standardization and parametrization, can support rapid model prototyping and collaboration between clinicians and data scientists in clinical predictive modeling.
... Although we are not aware of any other study that has reported successful implementation of a comparable informatics infrastructure in psychiatric clinical routine, several preliminary reports should be taken into account. Complementary to the work presented in this study, Khalilia et al [94] described a Fast Healthcare Interoperability Resources (FHIR) web modeling service that was tested on a pilot intensive care unit dataset. A multi-source approach was used. ...
Full-text available
Background Empirically driven personalized diagnostic applications and treatment stratification is widely perceived as a major hallmark in psychiatry. However, databased personalized decision making requires standardized data acquisition and data access, which are currently absent in psychiatric clinical routine. Objective Here, we describe the informatics infrastructure implemented at the psychiatric Münster University Hospital, which allows standardized acquisition, transfer, storage, and export of clinical data for future real-time predictive modelling in psychiatric routine. Methods We designed and implemented a technical architecture that includes an extension of the electronic health record (EHR) via scalable standardized data collection and data transfer between EHRs and research databases, thus allowing the pooling of EHRs and research data in a unified database and technical solutions for the visual presentation of collected data and analyses results in the EHR. The Single-source Metadata ARchitecture Transformation (SMA:T) was used as the software architecture. SMA:T is an extension of the EHR system and uses module-driven engineering to generate standardized applications and interfaces. The operational data model was used as the standard. Standardized data were entered on iPads via the Mobile Patient Survey (MoPat) and the web application Mopat@home, and the standardized transmission, processing, display, and export of data were realized via SMA:T. Results The technical feasibility of the informatics infrastructure was demonstrated in the course of this study. We created 19 standardized documentation forms with 241 items. For 317 patients, 6451 instances were automatically transferred to the EHR system without errors. Moreover, 96,323 instances were automatically transferred from the EHR system to the research database for further analyses. 
Conclusions In this study, we present the successful implementation of the informatics infrastructure enabling standardized data acquisition and data access for future real-time predictive modelling in clinical routine in psychiatry. The technical solution presented here might guide similar initiatives at other sites and thus help to pave the way toward future application of predictive models in psychiatric clinical routine.
... Complementary to the work presented in the present study, Khalilia et. al. described a Fast Healthcare Interoperability Resources (FHIR) web modeling service that was tested on a pilot ICU dataset [62]. A multi-source approach was used. ...
Full-text available
Background Empirically driven personalized diagnostic and treatment is widely perceived as a major hallmark in psychiatry. However, databased personalized decision making requires standardized data acquisition and data access, which is currently absent in psychiatric clinical routine. Objective Here we describe the informatics infrastructure implemented at the psychiatric university hospital Münster allowing for standardized acquisition, transfer, storage and export of clinical data for future real-time predictive modelling in psychiatric routine. Methods We designed and implemented a technical architecture that includes an extension of the EHR via scalable standardized data collection, data transfer between EHR and research databases thus allowing to pool EHR and research data in a unified database and technical solutions for the visual presentation of collected data and analyses results in the EHR. The Single-source Metadata ARchitecture Transformation (SMA:T) was used as the software architecture. SMA:T is an extension of the EHR system and uses Module Driven Software Development to generate standardized applications and interfaces. The Operational Data Model (ODM) was used as the standard. Standardized data was entered on iPads via the Mobile Patient Survey (MoPat) and the web application Mopat@home, the standardized transmission, processing, display and export of data was realized via SMA:T. Results The technical feasibility was demonstrated in the course of the study. 19 standardized documentation forms with 241 items were created. In 317 patients, 6,451 instances were automatically transferred to the EHR system without errors. 96,323 instances were automatically transferred from the EHR system to the research database for further analyses. 
Conclusions With the present study, we present the successful implementation of the informatics infrastructure enabling standardized data acquisition, and data access for future real-time predictive modelling in clinical routine in psychiatry. The technical solution presented here might guide similar initiatives at other sites and thus help to pave the way towards future application of predictive models in psychiatric clinical routine.
... The dataset was transformed into the OMOP CDM schema using the extract transform load (ETL) tool provided by the OHDSI community, 44 and transformed into the FHIR format using the OMOP on FHIR tool. 45 We ran the HF phenotype definition using CQL on OMOP against an OHDSI instance containing the SynPUF 1 k dataset and generated the resulting cohort of patients. We then ran the same HF CQL using the CQL reference implementation against a FHIR server containing the same SynPUF 1 k data and compared the resulting patient cohorts. ...
Full-text available
Introduction Electronic health record (EHR)‐driven phenotyping is a critical first step in generating biomedical knowledge from EHR data. Despite recent progress, current phenotyping approaches are manual, time‐consuming, error‐prone, and platform‐specific. This results in duplication of effort and highly variable results across systems and institutions, and is not scalable or portable. In this work, we investigate how the nascent Clinical Quality Language (CQL) can address these issues and enable high‐throughput, cross‐platform phenotyping. Methods We selected a clinically validated heart failure (HF) phenotype definition and translated it into CQL, then developed a CQL execution engine to integrate with the Observational Health Data Sciences and Informatics (OHDSI) platform. We executed the phenotype definition at two large academic medical centers, Northwestern Medicine and Weill Cornell Medicine, and conducted results verification (n = 100) to determine precision and recall. We additionally executed the same phenotype definition against two different data platforms, OHDSI and Fast Healthcare Interoperability Resources (FHIR), using the same underlying dataset and compared the results. Results CQL is expressive enough to represent the HF phenotype definition, including Boolean and aggregate operators, and temporal relationships between data elements. The language design also enabled the implementation of a custom execution engine with relative ease, and results verification at both sites revealed that precision and recall were both 100%. Cross‐platform execution resulted in identical patient cohorts generated by both data platforms. Conclusions CQL supports the representation of arbitrarily complex phenotype definitions, and our execution engine implementation demonstrated cross‐platform execution against two widely used clinical data platforms. 
The language thus has the potential to help address current limitations with portability in EHR‐driven phenotyping and scale in learning health systems.
... To achieve real clinical impact, researchers should thus also be concerned about the deployment and dissemination of their algorithms and tools into day-to-day clinical decision support. 4 This typically challenges the developers to integrate their model into proprietary commercial electronic health record (EHR) products. ...
Background The increasing availability of molecular and clinical data of cancer patients combined with novel machine learning techniques has the potential to enhance clinical decision support, example, for assessing a patient's relapse risk. While these prediction models often produce promising results, a deployment in clinical settings is rarely pursued. Objectives In this study, we demonstrate how prediction tools can be integrated generically into a clinical setting and provide an exemplary use case for predicting relapse risk in melanoma patients. Methods To make the decision support architecture independent of the electronic health record (EHR) and transferable to different hospital environments, it was based on the widely used Observational Medical Outcomes Partnership (OMOP) common data model (CDM) rather than on a proprietary EHR data structure. The usability of our exemplary implementation was evaluated by means of conducting user interviews including the thinking-aloud protocol and the system usability scale (SUS) questionnaire. Results An extract-transform-load process was developed to extract relevant clinical and molecular data from their original sources and map them to OMOP. Further, the OMOP WebAPI was adapted to retrieve all data for a single patient and transfer them into the decision support Web application for enabling physicians to easily consult the prediction service including monitoring of transferred data. The evaluation of the application resulted in a SUS score of 86.7. Conclusion This work proposes an EHR-independent means of integrating prediction models for deployment in clinical settings, utilizing the OMOP CDM. The usability evaluation revealed that the application is generally suitable for routine use while also illustrating small aspects for improvement.
This study aims to propose a framework for developing a sharable predictive model of diabetic nephropathy (DN) to improve the clinical efficiency of automatic DN detection in data intensive clinical scenario. Different classifiers have been developed for early detection, while the heterogeneity of data makes meaningful use of such developed models difficult. Decision tree (DT) and random forest (RF) were adopted as training classifiers in de-identified electronic medical record dataset from 6,745 patients with diabetes. After model construction, the obtained classification rules from classifier were coded in a standard PMML file. A total of 39 clinical features from 2159 labeled patients were included as risk factors in DN prediction after data preprocessing. The mean testing accuracy of the DT classifier was 0.8, which was consistent to that of the RF classifier (0.823). The DT classifier was choose to recode as a set of operable rules in PMML file that could be transferred and shared, which indicates the proposed framework of constructing a sharable prediction model via PMML is feasible and will promote the interoperability of trained classifiers among different institutions, thus achieving meaningful use of clinical decision making. This study will be applied to multiple sites to further verify feasibility.
Full-text available
The use of information and communication technologies to manage chronic diseases allows the application of integrated care pathways, and the optimization and standardization of care processes. Decision support tools can assist in the adherence to best-practice medicine in critical decision points during the execution of a care pathway. The objectives are to design, develop, and assess a clinical decision support system (CDSS) offering a suite of services for the early detection and assessment of chronic obstructive pulmonary disease (COPD), which can be easily integrated into a healthcare providers' work-flow. The software architecture model for the CDSS, interoperable clinical-knowledge representation, and inference engine were designed and implemented to form a base CDSS framework. The CDSS functionalities were iteratively developed through requirement-adjustment/development/validation cycles using enterprise-grade software-engineering methodologies and technologies. Within each cycle, clinical-knowledge acquisition was performed by a health-informatics engineer and a clinical-expert team. A suite of decision-support web services for (i) COPD early detection and diagnosis, (ii) spirometry quality-control support, (iii) patient stratification, was deployed in a secured environment on-line. The CDSS diagnostic performance was assessed using a validation set of 323 cases with 90% specificity, and 96% sensitivity. Web services were integrated in existing health information system platforms. Specialized decision support can be offered as a complementary service to existing policies of integrated care for chronic-disease management. The CDSS was able to issue recommendations that have a high degree of accuracy to support COPD case-finding. Integration into healthcare providers' work-flow can be achieved seamlessly through the use of a modular design and service-oriented architecture that connect to existing health information systems.
Conference Paper
Full-text available
This research examines the potential for new Health Level 7 (HL7) standard Fast Healthcare Interoperability Resources (FHIR, pronounced “fire”) standard to help achieve healthcare systems interoperability. HL7 messaging standards are widely implemented by the healthcare industry and have been deployed internationally for decades. HL7 Version 2 (“v2”) health information exchange standards are a popular choice of local hospital communities for the exchange of healthcare information, including electronic medical record information. In development for 15 years, HL7 Version 3 (“v3”) was designed to be the successor to Version 2, addressing Version 2's shortcomings. HL7 v3 has been heavily criticized by the industry for being internally inconsistent even in it's own documentation, too complex and expensive to implement in real world systems and has been accused of contributing towards many failed and stalled systems implementations. HL7 is now experimenting with a new approach to the development of standards with FHIR. This research provides a chronicle of the evolution of the HL7 messaging standards, an introduction to HL7 FHIR and a comparative analysis between HL7 FHIR and previous HL7 messaging standards.
Full-text available
Common chronic diseases such as hypertension are costly and difficult to manage. Our ultimate goal is to use data from electronic health records to predict the risk and timing of deterioration in hypertension control. Towards this goal, this work predicts the transition points at which hypertension is brought into, as well as pushed out of, control. In a cohort of 1294 patients with hypertension enrolled in a chronic disease management program at the Vanderbilt University Medical Center, patients are modeled as an array of features derived from the clinical domain over time, which are distilled into a core set using an information gain criteria regarding their predictive performance. A model for transition point prediction was then computed using a random forest classifier. The most predictive features for transitions in hypertension control status included hypertension assessment patterns, comorbid diagnoses, procedures and medication history. The final random forest model achieved a c-statistic of 0.836 (95% CI 0.830 to 0.842) and an accuracy of 0.773 (95% CI 0.766 to 0.780). This study achieved accurate prediction of transition points of hypertension control status, an important first step in the long-term goal of developing personalized hypertension management plans.
Conference Paper
Full-text available
In an era where novel clinical, biochemical, genetic and imaging determinants impacting patient outcomes are being continuously discovered, the field of outcomes research offers clinicians and patients the potential to make better informed health care decisions through the use of sophisticated risk-adjustment models that incorporate patientspsila unique clinical characteristics. EPOCH<sup>reg</sup> and ePRISM<sup>reg</sup> comprise a novel application suite for delivering complex risk-adjustment models to the bedside and are currently in use at multiple medical centers. An intuitive visual interface for building models allows outcomes researchers to rapidly translate prediction models into web-based decision support and reporting tools. A flexible XML web services architecture facilitates integration with existing clinical information systems, allows for remote access to the modeling engine, and allows models to be exported across systems via the Predictive Model Markup Language standard.
Purpose (1) The purpose of this paper is to construct a comprehensive framework of research dissemination and utilization that is useful for both health policy and clinical decision-making. Organizing Construct (2) The framework illustrates that the process of the adoption of research evidence into health-care decision-making is influenced by a variety of characteristics related to the individual, organization, environment and innovation. The framework also demonstrates the complex interrelationships among these characteristics as progression through the five stages of innovation-namely, knowledge, persuasion, decision, implementation and confirmation-occurs. Finally, the framework integrates the concepts of research dissemination, evidence-based decision-making and research utilization within the diffusion of innovations theory. Methods (3) During the discussion of each stage of the innovation adoption process, relevant literature from the management field (i.e., diffusion of innovations, organizational management and decision-making) and health-care sector (i.e., research dissemination and utilization and evidence-based practice) is summarized. Studies providing empirical data contributing to the development of the framework were assessed for methodological quality. Conclusions (4) The process of research dissemination and utilization is complex and determined by numerous intervening variables related to the innovation (research evidence), organization, environment and individual.
We aim to produce predictive models that are not only accurate, but are also interpretable to human experts. We introduce a generative model called the Bayesian List Machine for fitting decision lists, a type of interpretable classifier, to data. We use the model to predict stroke in atrial fibrillation patients, and produce predictive models that are simple enough to be understood by patients yet significantly outperform the medical scoring systems currently in use.
Fifteen to twenty years is how long it takes for the billions of dollars of university-based research to translate into evidence-based policies and programs suitable for public use. Over the past decade, an exciting science has emerged that seeks to narrow the gap between the discovery of new knowledge and its application in public health, mental health, and health care settings. Dissemination and implementation (D&I) research seeks to understand how to best apply scientific advances in the real world, by focusing on pushing the evidence-based knowledge base out into routine use. To help propel this crucial field forward, this book aims to address a number of key issues, including: how to evaluate the evidence base on effective interventions; which strategies will produce the greatest impact; how to design an appropriate study; and how to track a set of essential outcomes. D&I studies must also take into account the barriers to uptake of evidence-based interventions in the communities where people live their lives and the social service agencies, hospitals, and clinics where they receive care. The challenges of moving research to practice and policy are universal, and future progress calls for collaborative partnerships and cross-country research. The fundamental tenet of D&I research, taking what we know about improving health and putting it into practice, must be the highest priority.
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Background: There has been increased interest in using multiple observational databases to understand the safety profile of medical products during the postmarketing period. However, it is challenging to perform analyses across these heterogeneous data sources. The Observational Medical Outcomes Partnership (OMOP) provides a Common Data Model (CDM) for organizing and standardizing databases. OMOP's work with the CDM has primarily focused on US databases. As a participant in the OMOP Extended Consortium, we implemented the OMOP CDM on the UK Electronic Healthcare Record database, The Health Improvement Network (THIN). Objective: The aim of the study was to evaluate the implementation of the THIN database in the OMOP CDM and explore its use for active drug safety surveillance. Methods: Following the OMOP CDM specification, the raw THIN database was mapped into a CDM THIN database. Ten Drugs of Interest (DOI) and nine Health Outcomes of Interest (HOI), defined and focused by the OMOP, were created using the CDM THIN database. Quantitative comparison of raw THIN to CDM THIN was performed by execution and analysis of OMOP standardized reports and additional analyses. The practical value of CDM THIN for drug safety and pharmacoepidemiological research was assessed by implementing three analysis methods: Proportional Reporting Ratio (PRR), Univariate Self-Case Control Series (USCCS) and High-Dimensional Propensity Score (HDPS). A published study using raw THIN data was selected to examine the external validity of CDM THIN. Results: Overall demographic characteristics were the same in both databases. Mapping medical and drug codes into the OMOP terminology dictionary was incomplete: 25 % of medical codes and 55 % of drug codes in raw THIN were not listed in the OMOP terminology dictionary, representing 6 % of condition occurrence counts, 4 % of procedure occurrence counts and 7 % of drug exposure counts in raw THIN. Seven DOIs had <0.3 % and three DOIs had 1 % of unmapped drug exposure counts; each HOI had at least one definition with no or minimal (≤0.2 %) issues with unmapped condition occurrence counts, except for the upper gastrointestinal (UGI) ulcer hospitalization cohort. The application of PRR, USCCS and HDPS found, respectively, a sensitivity of 67, 78 and 50 %, and a specificity of 68, 59 and 76 %, suggesting that safety issues defined as known by the OMOP could be identified in CDM THIN, with imperfect performance. Similar PRR scores were produced using both CDM THIN and raw THIN, while execution was twice as fast on CDM THIN. There was close replication of demographic distribution, death rate and prescription pattern and trend in the published study population and the cohort of CDM THIN. Conclusions: This research demonstrated that information loss due to incomplete mapping of medical and drug codes as well as data structure in the current CDM THIN limits its use for all possible epidemiological evaluation studies. Current HOIs and DOIs predefined by the OMOP were constructed with minimal loss of information and can be used for active surveillance methodological research. The OMOP CDM THIN can be a valuable tool for multiple aspects of pharmacoepidemiological research when the unique features of UK Electronic Health Records are incorporated in the OMOP library.
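For readers unfamiliar with the metrics named in this abstract, the sketch below shows the standard textbook definitions of the Proportional Reporting Ratio (computed from a 2x2 table of drug and event counts) and of sensitivity and specificity. It is a minimal illustration in Python, not the OMOP implementation; the function names and all counts in the usage example are hypothetical and are not data from the THIN study.

```python
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """PRR from a 2x2 table.

    a: records with the drug of interest and the event
    b: records with the drug of interest, without the event
    c: records with other drugs and the event
    d: records with other drugs, without the event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(proportional_reporting_ratio(a=20, b=80, c=10, d=190))
    print(sensitivity_specificity(tp=8, fn=2, tn=15, fp=5))
```

In a surveillance setting, a PRR well above 1 flags a drug-event pair as reported disproportionately often, and sensitivity and specificity then summarize how well such flags agree with a reference set of known safety issues.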
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.