TECHNICAL ADVANCE Open Access
Dynamic-ETL: a hybrid approach for health
data extraction, transformation and loading
Toan C. Ong1*, Michael G. Kahn1,4, Bethany M. Kwan2, Traci Yamashita3, Elias Brandt5, Patrick Hosokawa2, Chris Uhrich6 and Lisa M. Schilling3
Abstract
Background: Electronic health records (EHRs) contain detailed clinical data stored in proprietary formats with
non-standard codes and structures. Participating in multi-site clinical research networks requires EHR data to be
restructured and transformed into a common format and standard terminologies, and optimally linked to other
data sources. The expertise and scalable solutions needed to transform data to conform to network requirements
are beyond the scope of many health care organizations and there is a need for practical tools that lower the
barriers of data contribution to clinical research networks.
Methods: We designed and implemented a health data transformation and loading approach, which we refer to as
Dynamic ETL (Extraction, Transformation and Loading) (D-ETL), that automates part of the process through use of
scalable, reusable and customizable code, while retaining manual aspects of the process that require knowledge of
complex coding syntax. This approach provides the flexibility required for the ETL of heterogeneous data, variations
in semantic expertise, and transparency of transformation logic that are essential to implement ETL conventions
across clinical research sharing networks. Processing workflows are directed by the ETL specifications guideline,
developed by ETL designers with extensive knowledge of the structure and semantics of health data (i.e., "health data domain experts") and the target common data model.
Results: D-ETL was implemented to perform ETL operations that load data from various sources with different
database schema structures into the Observational Medical Outcomes Partnership (OMOP) common data model. The
results showed that ETL rule composition methods and the D-ETL engine offer a scalable solution for health data
transformation via automatic query generation to harmonize source datasets.
Conclusions: D-ETL supports a flexible and transparent process to transform and load health data into a target
data model. This approach offers a solution that lowers technical barriers that prevent data partners from
participating in research data networks, and therefore, promotes the advancement of comparative effectiveness
research using secondary electronic health data.
Keywords: Electronic health records, Extraction, Transformation and loading, Distributed research networks, Data
harmonization, Rule-based ETL
* Correspondence: Toan.Ong@ucdenver.edu
1Departments of Pediatrics, University of Colorado Anschutz Medical Campus, School of Medicine, Building AO1 Room L15-1414, 12631 East 17th Avenue, Mail Stop F563, Aurora, CO 80045, USA
Full list of author information is available at the end of the article
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Ong et al. BMC Medical Informatics and Decision Making (2017) 17:134
DOI 10.1186/s12911-017-0532-3
Background
Clinical data, such as from electronic health records (EHR), have become key data sources (i.e., as secondary
data) for comparative effectiveness research (CER) [1, 2].
The clinical research community has long envisioned
using data generated during routine clinical care to
explore meaningful health care questions and health
policy issues that cannot be addressed by traditional
randomized clinical trials [3–7]. Recent developments in
observational CER, patient-centered outcomes research
(PCOR) study methods, and analytic techniques have
improved the ability to infer valid associations from
non-randomized observational studies [8–14]. A current
objective of multiple major U.S. health data initiatives is
to create large CER-supportive data networks by integrat-
ing EHR data from multiple sources (i.e., multiple EHRs
from multiple health care organizations) and enriching
these data with claims data [6, 13–19]. To harmonize data
from multiple sources, health data networks transform
data from source EHR systems to a common data
model (CDM), such as those of the Observational
Medical Outcomes Partnership (OMOP), Informatics
for Integrating Biology and the Bedside (i2b2), Mini-
Sentinel (MS) and the Patient Centered Outcome
Research Network (PCORnet) [16, 20–24].
Data harmonization processes are known to consume
significant resources, and much prior work has been
done to simplify data mappings, shorten data querying
time, and improve data quality [25–28]. Common tech-
nical challenges of an ETL process are compatibility of
the source and target data, scalability of the ETL
process, and quality of source data [29–33]. Compatibil-
ity challenges occur because local EHR systems often
have different data models, vocabularies, terms for data
elements, and levels of data granularity. Incompatibility
issues may lead to information loss due to the inability
of the target data model to translate and store the syntax
and semantics of the source data accurately [34]. Scal-
ability is a challenge due to the volume of health data,
the need for frequent data refreshes, operational changes
in source data systems, and ongoing revisions to target
schema definitions and scope. Finally, ensuring data
quality as an outcome of the ETL processes is challenging
due to the varying quality of source EHR data which is
dependent on the source organizations' EHR implementa-
tion and end-user interaction with the system [35]. An-
other data transformation challenge involves providing
solutions for conflicting and duplicate records. Conflicting
records are defined as two or more records about the
same object (e.g. patient, visit) that share the same identi-
fication (e.g. same encounter number) but assert different
values for a given fact or observation. On the other hand,
duplicate records refer to two records that have identical
values in all columns except the primary key record
identifier. Conflicting and duplicate records are common
data problems that can significantly affect the efficiency of
an ETL process and output data quality. Current ap-
proaches to data transformation are often not flexible or
scalable for large initiatives with multiple heteroge-
neous data sources and highly specified relationships
between data elements in the source and target data
models [30, 36, 37].
The ETL (Extraction-Transformation-Load) process is a
series of operations that allows source data to be syntactic-
ally and semantically harmonized to the structure and ter-
minology of the target CDM [38]. The ETL process to
support data harmonization typically comprises two se-
quential phases, each of which is performed by skilled
personnel with different expertise. In phase 1, subject mat-
ter experts in the source data (e.g. EHR, claims data) iden-
tify the appropriate data elements required to populate
the target database for extraction and specify the map-
pings between the source data and target data elements.
This step requires knowledge about the structure and se-
mantics of both the source and target data, such as expert-
ise in the local EHR implementation and use, and local
terminologies. In phase 2, database programmers imple-
ment methods of data transformation and the schema
mappings for loading data into the harmonized schema.
Transformation is a complex process of data cleaning
(e.g., data de-duplication, conflict resolution) and
standardization (e.g. local terminology mapping) to con-
form to the target schema format and codes so they can
be loaded into the target CDM-specific database. This
phase requires manually programming using database-
programming languages such as structured query lan-
guage (SQL). In many cases, these steps are iterated until
the transformed data are accepted as complete and cor-
rect. These two phases (schema mapping; database pro-
gramming) must be done separately for each data source,
and rarely does a single person have both the source data
expertise and database programming skills to perform the
tasks in both phases for even a single data source, and es-
pecially not for multiple data sources.
The ETL process can be supported by a data integra-
tion tool with a graphical user interface (GUI), such as
Talend1 and Pentaho,2 which helps reduce the manual
burden of the ETL design process. However, GUI-based
tools are often not flexible enough to address compli-
cated requirements of transformation operations such as
specific conventions to perform data de-duplication or
to perform incremental data load. Also, GUI-based tools often lack transparency of the underlying SQL commands performing the transformation, making it difficult
to investigate transformation errors.
In this paper, we describe a data transformation
approach, referred to as dynamic ETL (D-ETL), that au-
tomates part of the process by using scalable, reusable
and customizable code, while retaining manual aspects
of the process that require complex coding syntax. The
contributions of this work include 1) providing a scal-
able, practical solution for data harmonization in a clin-
ical data research network with heterogeneous source
data and 2) lowering the technical barriers for health
data domain experts to play the main role in ETL operations by simplifying the data transformation process.
Methods
Setting and context
SAFTINet (Scalable Architecture for Federated Transla-
tional Inquiries Network) is one of three national distrib-
uted research networks funded by the Agency for
Healthcare Research and Quality (AHRQ) to support
broad-scale comparative effectiveness research [21]. In 2010
SAFTINet selected the OMOP version 4 Common Data
Model (OMOP v4 CDM) and Terminology as its approach
for harmonizing and sharing data across all data partners
[32, 39]. Each data-sharing institution that participates in
SAFTINet must create and maintain a database that con-
tains their EHR data restructured into a HIPAA-compliant
(HIPAA = The Health Insurance Portability and Account-
ability Act), limited data set that conforms to the OMOP
CDM. Clinical data were also integrated with claims data provided by payers for two safety net organizations, and with patient-reported outcomes data collected at the point of care, to create the SAFTINet common database.
To ensure compliance with the HIPAA Privacy and
Security Rules, SAFTINet restricts protected health in-
formation (PHI) data elements to those allowable under
the regulatory definition of a limited data set (LDS),
which removes direct identifiers, such as name, address
and social security number; but includes dates, city/
town, state and 3-digit zip codes [40–44]. The D-ETL
rules must therefore enforce these HIPAA restrictions as
part of the data transformation process.
D-ETL approach
Figure 1 illustrates the workflows in a D-ETL approach
to integrate two source datasets. The D-ETL approach is
based on four key components:
1. Comprehensive ETL specifications, which are the
master plan for the entire ETL process, outlining in
narrative text and diagrams the scope of data to be
extracted, the target data model, and the format of
the input and output data files.
2. D-ETL rules composed in plain text format, which
ensures that rules are human readable and therefore
easily scrutinized, maintained, shared and reused.
3. An efficient ETL rules engine that generates full
SQL statements from ETL rules to transform,
conform, and load the data into target tables.
4. Auto-generated SQL statements that are accessible to the ETL designers to execute, test and debug the rules, thereby supporting an iterative process of validation and debugging.
ETL specifications and design
An ETL specifications guideline (created in a standard word processing application) contains information about
the source and target schemas, terminology mappings be-
tween data elements and values in the source and target
schemas, and definitions and conventions for data in the
target schema. The ETL specifications document is cre-
ated by one or more health data domain experts who have
extensive knowledge of the source and target schemas.
Data extraction and validation
In the data extraction step, required data elements from
the source system are extracted to a temporary data stor-
age from which they are transformed and loaded into the
target database. The D-ETL approach uses comma-separated values (CSV) text files for data exchange due to their wide use and acceptability [45].
Fig. 1 Workflows of the D-ETL approach to integrate two source datasets
Extracted data then go through data validation processes, including checks for missing data in required fields and checks for orphan foreign key values (i.e., values which are present in a foreign key column but not in a primary key column). In addition, data transformation processes usually have specific assumptions about the value and structure of input data that require validation. Figure 2 shows a list of example validation rules.
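For illustration, once the extracted CSV files have been imported into a staging schema, checks of this kind can be expressed as simple SQL queries. The following is a minimal sketch only, assuming hypothetical staging table and column names (staging.demographic, staging.visit, person_id), not the exact checks implemented in the system:
-- Sketch: missing data in a required field of a staged demographic file
SELECT COUNT(*) AS missing_person_id
FROM staging.demographic
WHERE person_id IS NULL;
-- Sketch: orphan foreign key values, i.e., visit rows whose person_id
-- never appears as a primary key value in the demographic file
SELECT v.person_id
FROM staging.visit v
LEFT JOIN staging.demographic d ON v.person_id = d.person_id
WHERE d.person_id IS NULL;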
D-ETL rules
With D-ETL, validated input data are transformed via a
set of D-ETL rules. The structure of a D-ETL rule re-
quires basic information about the source and target
database (i.e. database, schema, table, field) as well as
the data transformation formulae. The structure of the
D-ETL rules allows target data to be generated by com-
bining data from multiple related source tables. A spe-
cific example of a data transformation problem that can
be addressed by the ETL rules is the transfer of source
data from the source Demographic table and Race field
to the target Person table and Race field in the OMOP
CDM. Assume that Race values in the source EHR data are coded using the standard Health Level 7 (HL7)3 coding system. Since the standard coding system for Race values in OMOP is the Systematized Nomenclature of Medicine (SNOMED),4 there must be a terminology mapping operation as part of the ETL process. To address this problem, the D-ETL rule that transforms data in the source Demographic table must reference at least two source tables: the Demographic table and the Source_to_Concept_Map table. The Source_to_Concept_Map table provides the mapping from HL7 value codes for race to SNOMED value codes for race.
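A minimal sketch of this terminology lookup is shown below, assuming hypothetical column names (person_source_id, race_code) and a hypothetical vocabulary label; in D-ETL the same join is expressed declaratively in the rule rather than as hand-written SQL:
-- Sketch: translate HL7-coded race values to the SNOMED concepts expected by OMOP
SELECT d.person_source_id,
       d.race_code         AS hl7_race_code,
       m.target_concept_id AS snomed_race_concept_id
FROM demographic d
LEFT JOIN source_to_concept_map m
       ON m.source_code = d.race_code
      AND m.source_vocabulary_id = 'HL7 Race';  -- hypothetical vocabulary identifier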
A D-ETL rule is a data structure that has 12 attributes
and as many rows as needed for complete rule specifica-
tion. Each rule results in the SQL code that is responsible
for the transforming and loading of one or more fields in
a single target table. Table 1 contains a list of the rule
attributes and their descriptions. D-ETL rules are usually
composed by the health data domain experts based on the
ETL specifications document. D-ETL rules implement the
schema mappings prescribed in the ETL specifications
document.
Given its structure, a D-ETL rule can be stored in a
CSV formatted file with one column for each attribute. Al-
though D-ETL rules in CSV format can be edited using
most text editors available with all major operating sys-
tems, experience shows that D-ETL rules can be best
composed and maintained in a spreadsheet application.
D-ETL rules can be shared easily among ETL teams who
have the same source data and target data structure. If
multiple data sources are being loaded into a single target
dataset, each data source has its own set of D-ETL rules.
Table 2 is an example of a D-ETL rule. For simplification,
some of the attributes were omitted in the example.
Using SQL, the transformation specified in this example D-ETL rule can be done using a pair of SELECT and INSERT statements,5 following the syntax below:
INSERT INTO tableName <fieldList>
SELECT <Transformed fieldList>
FROM <tableList>
The INSERT and SELECT statements above are generated
automatically by the D-ETL engine from D-ETL rules. Each
component of the rules corresponds to a specific oper-
ation of the query. The D-ETL rule engine directly sup-
ports the following SQL operations: INSERT, SELECT,
SELECT DISTINCT, JOINS (inner join, left outer join,
right outer join, full outer join) and WHERE. The D-ETL
rule structure takes advantage of both the simplicity of the
CSV format and the flexibility of full SQL statements. The
D-ETL designer can compose a rule without having extensive knowledge of the formal syntax of SQL, and only requires input from a technical expert for special circumstances (e.g., complex rule debugging). All D-ETL rules used to load a single dataset are contained in one CSV file.
Fig. 2 Examples of input data validation rules with loose and strict validation criteria
D-ETL rule attributes can be categorized into three
functional components: rule specification, output
destination and source data. Rule specification attributes
include: rule order, rule descriptions and data source ID, a
field used to identify specific datasets in case multiple
datasets with different rule sets are processed at the same
time. A composite key uniquely identifying a rule is formed
based on the combination of these three fields. Rule order
is the unique identifier of a rule. However, because each
rule comprises multiple rows representing that rule's components, all these rows have the same rule order. Therefore, along with the rule order, each row within a rule is further identified by the map order column. It is important to note that rule order is unique within one rule set; however, it might not be unique across different rule sets.
The output destination attributes contain information
about the target destination (e.g. target database, target
schema, target table, target column). A rule can only
populate one target table. Not all attributes of the target
table must be included in the rule. However, a NULL value is usually used as the source value for non-populated
columns.
The source data attributes include the source data
information (e.g. source database, source schema, source
table, source value). Source data columns not only
contain information about the location of the source
data but also the data transformation formula.
The example rule in Table 2 is used to populate the
target table: Care_site. The rule will be used to generate
one SQL statement. The row with PRIMARY map-type
identifies the main source table from which the data will
be queried, in this example the Medical_claims table.
The primary table is a table that has at least one field
that is used to populate the primary key of the target
table. The map_type column of the first row is always
set to PRIMARY, indicative of the primary table from
which the source data resides. Additional source tables
can be joined with the primary table by the join opera-
tors with map_type = {JOIN, LEFT JOIN, RIGHT JOIN,
FULL JOIN} and the join condition specified in the source_value column. In the example, the source_value
column of the PRIMARY row consists of a composite
primary key used to populate the primary key of the
target table. The primary key includes 3 fields: billing_provider_id and place_of_service_code from the medical claims table and provider_organization_type from the
provider table. The provider table is joined with the
medical_claims table in a JOIN row with the JOIN con-
dition specified in the source_value of that same row.
An optional row with the WHERE clause can be used
with the WHERE condition in the source_value column.
For example, the WHERE row defines the WHERE
condition that only provider_organization_type of '1' or '2' will be populated to the target table. Note that the 'in' operator was used because it is a standard
operator that PostgreSQL supports. The next rows,
which have VALUE as map_type, contain direct map-
pings from the source values to the target columns.
NULL values must be clearly indicated where target
columns cannot be populated. Although all rows in
one rule have the same rule_order and rule descrip-
tion, they will have different map_orders. Finally, each
value listed in the source_value column must include
the source table and source field.
Table 1 Attribute Type of a D-ETL rule
Attribute | Description | Group
Rule order | Rule identification number. All rows of a rule should have the same rule order | Identification
Rule description | Short description with maximum length of 255 characters to describe the purpose of the rule. | Identification
Target database | Name of target database | Target
Target schema | Name of target schema | Target
Target table | Name of target table | Target
Target column | Name of target column | Target
Map type | Type of row. Possible values: PRIMARY, JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, WHERE, VALUE, CUSTOM. | Source
Map order | Identification of row within a rule | Identification
Source database | Name of source database | Source
Source schema | Name of source schema | Source
Source table | Name of source table | Source
Source value | VALUE row: the value used to populate target column; JOIN row: join condition; WHERE row: where condition | Source
Table 2 Example of a D-ETL rule that loads data into the Care_site table in OMOP from a claims-based source CSV file
Rule Order | Rule Description | Target Table | Target Column | Map Type | Map Order | Source Table | Source Value
1 | Medical_claims to Care_site | Care_site | | PRIMARY | 1 | Medical_claims | medical_claims.billing_provider_id, medical_claims.place_of_service_code, provider.provider_organization_type
1 | Medical_claims to Care_site | Care_site | | JOIN | 2 | provider | medical_claims.billing_provider_id = provider.provider_id
1 | Medical_claims to Care_site | Care_site | | WHERE | 3 | | provider.provider_organization_type in (1,2)
1 | Medical_claims to Care_site | Care_site | care_site_source_value | VALUE | 4 | | medical_claims.billing_provider_id || '-' || medical_claims.place_of_service_code || '-' || provider.provider_organization_type
1 | Medical_claims to Care_site | Care_site | organization_source_value | VALUE | 5 | | NULL
1 | Medical_claims to Care_site | Care_site | place_of_service_source_value | VALUE | 6 | | medical_claims.place_of_service_code
1 | Medical_claims to Care_site | Care_site | care_site_address_1 | VALUE | 7 | | provider.provider_address_first_line
1 | Medical_claims to Care_site | Care_site | care_site_address_2 | VALUE | 8 | | provider.provider_street
1 | Medical_claims to Care_site | Care_site | care_site_city | VALUE | 9 | | provider.provider_city
1 | Medical_claims to Care_site | Care_site | care_site_state | VALUE | 10 | | provider.provider_state
1 | Medical_claims to Care_site | Care_site | care_site_zip | VALUE | 11 | | provider.provider_zip
1 | Medical_claims to Care_site | Care_site | care_site_county | VALUE | 12 | | NULL
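For illustration, the statement that the D-ETL engine generates from the rule in Table 2 would be roughly of the following form. This is a simplified sketch; the statement actually produced by the engine also wraps the source tables in de-duplication subqueries and other optimizations described in the next section:
INSERT INTO care_site (care_site_source_value, organization_source_value,
                       place_of_service_source_value, care_site_address_1,
                       care_site_address_2, care_site_city, care_site_state,
                       care_site_zip, care_site_county)
SELECT medical_claims.billing_provider_id || '-' ||
       medical_claims.place_of_service_code || '-' ||
       provider.provider_organization_type,      -- care_site_source_value (composite key)
       NULL,                                     -- organization_source_value
       medical_claims.place_of_service_code,     -- place_of_service_source_value
       provider.provider_address_first_line,     -- care_site_address_1
       provider.provider_street,                 -- care_site_address_2
       provider.provider_city,                   -- care_site_city
       provider.provider_state,                  -- care_site_state
       provider.provider_zip,                    -- care_site_zip
       NULL                                      -- care_site_county
FROM medical_claims
JOIN provider ON medical_claims.billing_provider_id = provider.provider_id
WHERE provider.provider_organization_type IN (1, 2);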
A D-ETL rule can include mappings that have different complexity levels, varying from one-to-one mappings to many-to-one mappings. The output of the rule can be filtered by a WHERE row, with the filter condition specified in the source_value column. A WHERE row in a D-ETL rule is not a re-
quired component, but when present, each rule can only
have one WHERE clause. The source_value column may
also contain expressions, which can be formulated using the native dialect of the DBMS to transform data from the source to the target. For example, if PostgreSQL is the
DBMS, all PostgreSQL operators and functions are
supported. This approach extends the flexibility of the
hybrid approach by allowing D-ETL designers to take
advantage of all functions supported by the target
DBMS. A drawback of this approach is that code transla-
tion is needed when the rules are being used in a differ-
ent DBMS. Therefore, it is good practice to use widely
used SQL functions when possible.
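For instance, a VALUE row's source_value might hold expressions such as the following (hypothetical examples using PostgreSQL operators and functions; any function supported by the local DBMS can appear here):
COALESCE(provider.provider_city, 'UNKNOWN')            -- substitute a default for missing values
CAST(medical_claims.service_from_date AS date)         -- convert a text column to a date
medical_claims.billing_provider_id || '-' || medical_claims.place_of_service_code  -- string concatenation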
D-ETL engine
The D-ETL rules are composed in a human-readable
format, allowing personnel with limited database pro-
gramming expertise to compose, read, verify, and main-
tain them. The D-ETL engine automatically translates
these D-ETL rules into complex executable SQL state-
ments to transform and load the data into target tables.
During this process, multiple query performance en-
hancements and data cleaning operations such as com-
mon table expressions (CTEs) and data de-duplication
are automatically incorporated into the SQL statements.
The D-ETL engine comprises five sub-processes that
deal with data integration, transformation, de-duplication
and loading. In Fig. 3, the numbers in red ovals represent
the process used to carry out each D-ETL engine sub-
process. In the diagram, variables are enclosed in pointing
angle brackets (<variable>) and a query is enclosed in
square brackets ([query]). D-ETL rule attributes are
identified by the format: <map_type:column_name>. For
example, the WHERE condition of a rule can be identified
by <where:source_value>. Even though the order of execu-
tion is different, for readability, the processes will be
enumerated from top to bottom.
In process 1, the D-ETL engine starts by creating the
INSERT statement using the values in the target column.
In process 2, the data are de-duplicated using sub-
queries and CTEs. In process 3, the de-duplicated source
data sets are then joined and filtered before being
inserted into the target table in process 1. An example
solution for conflicting records is to pick the record with the latest ETL timestamp out of a group of records that share common primary key values. Custom queries are handled
in process 5. See Additional file 1 for a description of CUSTOM rules and Additional file 2 for a detailed description of the individual D-ETL engine processes.
Fig. 3 Architecture of the ETL engine
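As an illustration of the de-duplication and conflict-resolution step, the engine might wrap a source table in a common table expression of roughly the following form (a sketch only; the table and column names are hypothetical, and the engine builds the equivalent automatically from the rule):
WITH demographic_dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY person_id        -- records sharing the same primary key are conflicting records
           ORDER BY etl_timestamp DESC   -- prefer the record with the latest ETL timestamp
         ) AS row_rank
  FROM staging.demographic
)
SELECT *
FROM demographic_dedup
WHERE row_rank = 1;  -- exact duplicates and older conflicting records are dropped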
Testing and debugging
The D-ETL approach facilitates code-level testing and de-
bugging. SQL statements generated by the D-ETL engine
are stored separately in the database from individual rules.
D-ETL designers can test and debug these statements and
review query results directly themselves in an internal test-
ing process. Any syntactic and semantic errors in the statements can be traced back to the components of the
rule. This mechanism allows D-ETL designers to under-
stand the error at the SQL statement level and take advan-
tage of the error messaging system of the DBMS, instead
of having to go through the indirect error messages pro-
vided by a GUI or error log file after a complete ETL
process.
In addition, executing the SQL statements directly
allows D-ETL designers to iteratively perform trial and
error adjustments on the query until the desired trans-
formation is created instead of continuously changing
the D-ETL rule. Consequently, only the final change
need be recorded in the D-ETL specifications. Being able
to review the SQL statements directly allows D-ETL de-
signers to understand the relationship between rule de-
sign and the query results, hence, improving the rule
design process. A drawback of this testing and debug-
ging process is that it requires direct access to the back-
end database where the D-ETL statements are stored
and might require an advanced knowledge of SQL.
Table 3 summarizes the challenges of the ETL process
and the solutions for these challenges enabled by D-ETL.
Results
Internal testing and validation
SAFTINet partners successfully implemented D-ETL via a
system called ROSITA (Reusable OMOP and SAFTINet
Interface Adaptor). The ROSITA software system is oper-
ated via a web-based user interface powered by a backend
D-ETL engine. The efficiency of D-ETL using ROSITA to
process health data is very promising, even in situations
where duplicate and overlapping data are present in the
source dataset. Table 4 shows the runtime of some D-ETL
rules within ROSITA from an internal testing process,
loading health datasets on a CentOS 6 virtual machine with 2 processing cores, 32 GB of RAM and a 300 GB hard drive.
Practical, scalable and feasible implementation of D-ETL
in ROSITA
Using ROSITA, SAFTINet partners in all three states
were able to load clinical data and transform the data
into the OMOP v4 CDM and in two of these states,
where claims data was available, partners were able to
load and link clinical and claims data, prior to
completing the data transformation to OMOP. Two
partners with well-staffed, sophisticated informatics
departments implemented ROSITA within their own
environment, mapping directly from their own EHR or
electronic data warehouse (EDW) to OMOP. The other
ten partners used intermediaries who were health data do-
main experts but not advanced SQL database program-
mers to transform their EHR data to an intermediary data
model; they then applied D-ETL within ROSITA to trans-
form data from the intermediary model to OMOP. The
four resulting SAFTINet-affiliated ROSITA instances cur-
rently contain records for 1,616,868 patient lives. These
ROSITA systems have also been used to capture results of
over 8000 patient-reported outcomes measures, which are
being used in SAFTINet CER studies [46, 47].
A site-specific ETL specifications guideline is used by
each data partner to both guide and document their
extracted source data choices and intended target loca-
tions. Source data is extracted and transferred using
CSV files. In practice, the CSV format is a ubiquitous
and flexible temporary storage for D-ETL. Extracting
data to CSV files allowed the data extraction process to
be separated from the more difficult data transforming
and loading processes. Separating data extraction from
data transformation eliminated the need for an active
network connection to the source data every time a new
transformation task was performed. In addition, extract-
ing the source data into a temporary storage system,
without directly connecting to the source database,
enabled controlled access to the exported data created
by the data owners.
Table 3 Challenges and solutions
Challenges | Solution by D-ETL approach
Heterogeneity in source data sets | ETL specifications; rule-based D-ETL engine; native SQL code acceptance; custom rule mechanism
Data extraction interferes with source EHR | CSV file format
Efficiency | Integrated D-ETL engine; query optimization
Duplicate and overlapping data | Automated data de-duplication and incremental data loading
Data quality | Input data: extracted data validation. Output data: data profiling and visualization
Human expertise | Explicit rule structure; effective rule testing and debugging mechanism
Resumption (ability to continue from a point where an error previously occurred) | Modular ETL process
Table 4 D-ETL engine performance in ROSITA
Rule number | Number of source tables | Number of records (in all source tables) | Run-time (in seconds)
1 | 1 | 21,565 | 1.1
2 | 2 | 851,706 | 30.3
3 | 2 | 1,910,513 | 12.0
4 | 2 | 1,324,860 | 13.1
5 | 3 | 1,987,582 | 15.3
6 | 3 | 2,007,661 | 30.1
Data quality validations on the extracted dataset were
important to ensure the success of the subsequent steps.
Data validation usually occurred immediately after the
data extraction step for quick identification of any
problems with the extracted data and expedited re-
extraction. Experience showed that it is very difficult to
produce perfectly extracted data sets that would be
accepted by the ETL process on the first try. For that
reason, it was important to learn from the errors and
incorporate them back into the data extraction conven-
tions document for future extractions. Data in CSV files
can be validated directly or imported as-is into a DBMS,
ideally the same DBMS where the transformation will
occur. Table 5 includes a list of data validation rules per-
formed to ensure the quality of the data is appropriate
for ETL processes.
In Table 5, if the type of error is "Error", the data fail
the validation rule and have to be extracted again after
the problem is fixed. The data element that causes the
error and its exact location in the dataset must be
provided. To protect sensitive data from being exposed,
only the line number of the data record in the data file
is displayed. If the type of error is "Warning", the data
will not fail the validation rule. Instead, there will be a
warning message that provides information about the
data. The decision to deal with the issue is optional. The
list of validation rules is based on the anticipation of the
requirements of the data transformation process into the
target data model. The type of error can be classified
based on the impact of the error and the expected
amount of information loss.
In SAFTINet, a health data domain expert, who had
limited expertise in SQL or the DBMS, created a separate D-ETL rule set for each partner's OMOP CDM
instantiation. Occasionally, the domain expert required
assistance from technical personnel for complex DBMS
functions. Technical assistance was also needed in case
of non-obvious data output discrepancies. Over time, the number of technical assistance requests might diminish as the experience of the domain expert increases. In many cases, the
health data domain expert was able to compose, load
and debug the rules. Via the operations of ROSITA, we
found that D-ETL is very effective in rule testing and
debugging. The health data domain expert was able to
effectively track down the source of the error by execut-
ing individual rules.
Extending beyond SAFTINet, ROSITA has been used
by the DARTNet Institute (DARTNet) to successfully
transform data for more than 7 million patients into the
OMOP CDM for various CER, quality improvement and
interventional research activities. DARTNet uses ROSITA in
a different way than SAFTINet partners; data contributors
send fully identified data to DARTNet personnel (under
HIPAA BAAs) who then perform transformations
centrally (i.e., in a non-distributed fashion).
Discussion
In this project, we designed and implemented a novel
hybrid rule-based ETL approach called Dynamic-ETL
(D-ETL). Implementation in practice shows that D-ETL
and its implementation in ROSITA is a viable and successful approach to the structural and semantic harmonization of
health data in large health data sharing networks contain-
ing heterogeneous data partners, such as SAFTINet and
DARTNet. D-ETL allows health data domain experts with
limited SQL expertise to be involved in all phases, namely
ETL specifications, rule design, test and debugging rules,
and only requires expert technical assistance in special
cases. D-ETL promotes a rule structure that accommo-
dates both straightforward and complex ETL operations
and supports ETL transparency and encourages D-ETL
rule-sharing. The ETL rule engine also incorporates the
mechanisms that deal with conflicting and duplicate data.
Using readily available hardware, the implemented D-ETL
system shows acceptable performance results loading real
health data.
The accuracy and reliability of the D-ETL rules and
the D-ETL engine rely on the accuracy and reliability
of the content of the D-ETL rules. Additional technical
performance metrics to be examined in a larger scale
implementation would include aspects of improved
data quality, including accuracy (e.g., fewer errors in
mapping) and completeness (e.g., fewer missing data
points) [48]. Other key factors in adoption and use of
D-ETL include perceived usability and usefulness and
resources and time required to implement the system,
Table 5 Examples of extracted data validations
Validation rule | Type of error | Description
Data in a date or timestamp column cannot be parsed as a date | Error | Invalid date data will fail date operators and functions
Data is missing in a field defined as required by the schema | Error | Missing data in required fields will violate database constraints of target schema
A column in the schema has a missing length, precision or scale | Warning | Default length, precision or scale can be used
Data in a numeric or decimal column is not a number | Error | Invalid numeric data will fail numeric operators and functions
Data is too long for text or varchar field | Error | Data loss will occur if long text values are truncated to meet length requirement
which can be assessed using surveys and in-depth in-
terviews with users [49]. In this initial development
and small-scale implementation work, clinical partner users reported that the domain experts following the D-
ETL approach experienced a learning curve and in-
creased efficiency following the first use. A better un-
derstanding of the efforts related to D-ETL rule
composition and debugging, and level of involvement
of database technicians in comparison with other
tools or approaches will further validate the effective-
ness and relative efficiency of D-ETL in different con-
texts. It is important to note that the performance of an ETL operation with healthcare data depends on many interrelated factors, such as familiarity with the source data model and the complexity of the terminology mapping process, which are external to the operations
of the D-ETL engine.
Despite its advantages, D-ETL has several limitations.
First, although expertise in query writing is not required,
certain SQL coding skill is needed for the health data
domain expert who is involved in the ETL process. Know-
ledge about the operators and functions of the DBMS is
needed for the rule creation. Second, since the ETL rules
are composed in third party software such as Excel, no
real time syntactical error checking is available. The rule
composer will not know about syntactical errors (e.g., an incorrect column name) until the SQL statement is gener-
ated. Third, the testing and debugging process requires
direct access to the rule database and the extracted dataset,
which might not be available to the rule composer due to
database security access limitations.
Future directions of D-ETL focus on addressing some
of the limitations and improving the efficiency of the
rule designing process. First, the emergence of semi-
automatic schema mapping methods supports the
automation of D-ETL rules composition [50]. The in-
volvement of the health data domain expert can then
focus more on correcting the mapping results and
ensuring data quality. Second, an automatic rule valid-
ation mechanism that checks for basic syntactical errors
would improve the efficiency of the ETL rule creation
process. To be feasible, a user-friendly rule editor with
intuitive user interface has to be developed. Third, the
expressions in the ETL rules must be in the language of
the local DBMS. For rules to be used and reused across
different DBMSs, a rule conversion tool that automatic-
ally translates operators and functions from one SQL
dialect into another is needed. Open source tools, such
as SQL Renderer from the OHDSI community,6 could
be a potential solution to this problem. Finally, even
though rules are composed in plain text format, a graph-
ical presentation of the structure of the rule will improve
ETL rule maintenance and help ETL team members
understand complex rules created by others.
Conclusion
Data harmonization is an important step towards data
interoperability which supports the progression of
comparative effectiveness research. Data harmonization
can be accomplished by incorporating data standards, the
knowledge of domain experts and effective and efficient
ETL processes and tools. In this work, we propose a
dynamic data ETL approach to lower the technical
barriers encountered during execution of ETL processes.
Our approach has been implemented and deployed to
load clinical and claims data from source electronic health
record systems into the OMOP common data model. This
is an important step forward toward making high-quality data
available for clinical quality improvement and biomedical
research.
Endnotes
1. https://www.talend.com/
2. http://www.pentaho.com/
3. http://www.hl7.org/
4. https://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html
5. http://www.w3schools.com/sql/sql_insert.asp
6. https://github.com/OHDSI/SqlRender
Additional files
Additional file 1: Description of the custom D-ETL rules. (DOCX 13 kb)
Additional file 2: Description of the D-ETL engine. (DOCX 21 kb)
Abbreviations
AHRQ: Agency for healthcare research and quality; BAA: Business associate
agreement; CDM: Common data model; CER: Comparative effectiveness
research; CSV: comma-separated values; DARTNet: Distributed ambulatory
research in therapeutics network; DBMS: Database management system; D-
ETL: Dynamic extraction, transformation and load; EDW: Electronic data
warehouse; EHRs: Electronic health records; ETL: Extraction, transformation
and load; GUI: Graphical user interface; HIPAA: Health insurance portability
and accountability act; HL7: Health level 7; I2b2: Informatics for integrating
biology and the bedside; ICD-9CM: International classification of diseases,
ninth revision, clinical modification; LDS: Limited data set; MS: Mini-Sentinel;
OHDSI: Observational health data sciences and informatics;
OMOP: Observational medical outcomes partnership; PCOR: Patient-centered
outcomes research; PCORnet: Patient centered outcome research network;
PHI: Protected health information; ROSITA: Reusable OMOP and SAFTINet
Interface Adaptor; SAFTINet: Scalable architecture for federated translational
inquiries network; SNOMED: Systematized nomenclature of medicine;
SQL: Structured query language
Acknowledgements
Mr. David Newton provided programming support for the graphical user
interface of the system. We thank the team members at Recombinant by
Deloitte for supporting the development of the ROSITA system.
Funding
Funding was provided by AHRQ 1R01HS019908 (Scalable Architecture for
Federated Translational Inquiries Network) and AHRQ R01 HS022956
(SAFTINet: Optimizing Value and Achieving Sustainability). Contents are the authors' sole responsibility and do not necessarily represent official views of AHRQ or NIH.
Availability of data and materials
There was no specific dataset discussed in this paper. The datasets used to
measure the loading performance of the ROSITA system are real data and
cannot be made publicly available due to the HIPAA law and other federal,
state and institutional regulations.
Link to ROSITA source code: https://github.com/SAFTINet/rosita2-1
Data extraction tool implementation
The data extraction tool was implemented by SAFTINet data owners' sites.
The SAFTINet project received approval from the Institutional Review Board for
its implementation infrastructure and had HIPAA business associate
agreement for testing with real data.
Authors' contributions
All authors were involved in the conception and design of this study. LS and
MK oversaw the requirements and design of the D-ETL approach. TO and CU
implemented and optimized the D-ETL engine. PH and TY composed the D-
ETL rules and tested D-ETL engine. EB performed terminology mappings and
test the quality of the terminology mappings. BK coordinated the development
process of ROSITA and provided important inputs on layout of the manuscript.
TO, LS, BK and MK drafted the manuscript. All authors participated in
manuscript revision and approved the final version.
Ethics approval and consent to participate
The work described in this paper describes ETL infrastructure development,
testing, and implementation, but does not describe any research using this
infrastructure. The ETL tool is implemented behind data owners' firewalls, or
behind firewalls of those with whom data owners have HIPAA Business
Associate Agreements (BAAs). Ethical and legal approvals for this work are as
follows: 1. The IRB approved an infrastructure protocol, indicating that the
infrastructure was an approved mechanism for producing and transmitting
HIPAA limited data sets for future research. 2. Any data released for research
would require separate IRB approval and HIPAA data use agreements with
data owners. 3. Data owners signed a legal document (Master Consortium
Agreement) developed in collaboration with university legal counsel and
each organization's own legal counsel, specifying conditions under which the
ETL tool is implemented, maintained, and used to produce research data
sets within the research network. 4. For development and testing, several
data owners provided fully identified patient data to the software developers,
under HIPAA BAAs between the data owners and the university (also covering
contractors). These data were not used for research. These data governance
processes were discussed in great detail and established as described with legal
and regulatory officials from the university and data owners.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Author details
1Departments of Pediatrics, University of Colorado Anschutz Medical Campus, School of Medicine, Building AO1 Room L15-1414, 12631 East 17th Avenue, Mail Stop F563, Aurora, CO 80045, USA. 2Departments of Family Medicine, University of Colorado Anschutz Medical Campus, School of Medicine, Aurora, CO, USA. 3Departments of Medicine, University of Colorado Anschutz Medical Campus, School of Medicine, Aurora, CO, USA. 4Colorado Clinical and Translational Sciences Institute, University of Colorado Anschutz Medical Campus, School of Medicine, Aurora, CO, USA. 5DARTNet Institute, Aurora, CO, USA. 6OSR Data Corporation, Lincoln, MA, USA.
Received: 26 May 2016 Accepted: 31 August 2017
References
1. Sox HC, Greenfield S. Comparative effectiveness research: a report from the
institute of medicine. Ann Intern Med. 2009;151:2035. doi:10.7326/0003-
4819-151-3-200908040-00125.
2. Danaei G, Rodríguez LAG, Cantero OF, et al. Observational data for
comparative effectiveness research: an emulation of randomised trials of
statins and primary prevention of coronary heart disease. Stat Methods Med
Res. 2013;22:7096. doi:10.1177/0962280211403603.
3. Institute of Medicine (U.S.) Committee on Improving the Patient Record. In:
Dick RS, Steen EB, editors. The computer-based patient record: an essential
technology for health care, revised edition. Washington, D.C.: National
Academy Press; 1997.
4. Grossmann C, Institute of Medicine (U.S.). Roundtable on Value & Science-
Driven Health Care. Clinical data as the basic staple of health learning :
creating and protecting a public good : workshop summary. Washington, D.
C.: National Academies Press; 2010. http://www.ncbi.nlm.nih.gov/bookshelf/
br.fcgi?book=nap12212 http://www.nap.edu/catalog.php?record_id=12212
5. Selby JV, Lipstein SH. PCORI at 3 yearsprogress, lessons, and plans. N Engl J
Med. 2014;370:5925. doi:10.1056/NEJMp1313061.
6. Fleurence RL, Curtis LH, Califf RM, et al. Launching PCORnet, a national
patient-centered clinical research network. J Am Med Inform Assoc JAMIA.
2014;21:57882. doi:10.1136/amiajnl-2014-002747.
7. Holve E, Segal C, Lopez MH, et al. The Electronic Data Methods (EDM)
Forum for Comparative Effectiveness Research (CER). Med Care. 2012;
50(Suppl):S710. doi:10.1097/MLR.0b013e318257a66b.
8. Rassen JA, Schneeweiss S. Using high-dimensional propensity scores to
automate confounding control in a distributed medical product safety
surveillance system. Pharmacoepidemiol Drug Saf. 2012;21(Suppl 1):419.
doi:10.1002/pds.2328.
9. Garbe E, Kloss S, Suling M, et al. High-dimensional versus conventional
propensity scores in a comparative effectiveness study of coxibs and
reduced upper gastrointestinal complications. Eur J Clin Pharmacol
Published Online First: 5 July 2012. doi:10.1007/s00228-012-1334-2.
10. Fleischer NL, Fernald LC, Hubbard AE. Estimating the potential impacts
of intervention from observational data: methods for estimating causal
attributable risk in a cross-sectional analysis of depressive symptoms in
Latin America. J Epidemiol Community Health. 2010;64:1621.
doi:10.1136/jech.2008.085985.
11. Polsky D, Eremina D, Hess G, et al. The importance of clinical variables in
comparative analyses using propensity-score matching: the case of ESA costs
for the treatment of chemotherapy-induced anaemia. PharmacoEconomics.
2009;27:75565. doi:10.2165/11313860-000000000-00000.
12. McCandless LC, Gustafson P, Austin PC. Bayesian propensity score analysis
for observational data. Stat Med. 2009;28:94112. 10.1002/sim.3460.
13. Jansen RG, Wiertz LF, Meyer ES, et al. Reliability analysis of observational
data: problems, solutions, and software implementation. Behav Res Methods
Instrum Comput. 2003;35:3919.
14. PCORI Methodology Committee. PCORI Methodology Standards. 2012.
http://www.pcori.org/assets/PCORI-Methodology-Standards.pdf
15. Collins FS, Hudson KL, Briggs JP, et al. PCORnet: turning a dream into reality. J
Am Med Inform Assoc JAMIA. 2014;21:5767. doi:10.1136/amiajnl-2014-002864.
16. Forrest CB, Margolis PA, Bailey LC, et al. PEDSnet: a National Pediatric
Learning Health System. J Am Med Inform Assoc JAMIA Published Online
First: 12 May 2014. doi: 10.1136/amiajnl-2014-002743.
17. Behrman RE, Benner JS, Brown JS, et al. Developing the sentinel system: a national resource for evidence development. N Engl J Med. 2011;364:4989.
doi:10.1056/NEJMp1014427.
18. Selby JV, Platt R, Fleurence R, et al. PCORnet: achievements and plans on the road to transforming health research | PCORI. http://www.pcori.org/blog/pcornet-achievements-and-plans-road-transforming-health-research. Accessed 16 May 2016.
19. Gini R, Schuemie M, Brown J, et al. Data extraction and management in
networks of observational health care databases for scientific research: a
comparison among EU-ADR, OMOP, mini-sentinel and MATRICE strategies.
EGEMS. 2016;4:1189. doi:10.13063/2327-9214.1189.
20. Kahn MG, Batson D, Schilling LM. Data model considerations for
clinical effectiveness researchers. Med Care. 2012;50(Suppl):S607.
doi:10.1097/MLR.0b013e318259bff4.
21. Schilling LM, Kwan B, Drolshagen C, et al. Scalable Architecture for
Federated Translational Inquires Network (SAFTINet): technology
infrastructure for a Distributed Data Network. EGEMs Gener. Evid. Methods
Improve Patient Outcomes. 2013. in press.
22. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered
Scalable National Network for Effectiveness Research. J Am Med Inform
Assoc JAMIA. 2014;21:6216. doi:10.1136/amiajnl-2014-002751.
23. Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond
with informatics for integrating biology and the bedside (i2b2). J Am Med
Inform Assoc JAMIA. 2010;17:12430. doi:10.1136/jamia.2009.000893.
24. Brown JS, Lane K, Moore K, et al. Defining and evaluating possible database
models to implement the FDA Sentinel initiative. 2009.https://www.
brookings.edu/wp-content/uploads/2012/04/03_Brown.pdf. Accessed 15
Feb 2014.
25. Patil PS. Improved extraction mechanism in ETL process for building of a
data warehouse. INFLIBNET Published Online First: 22 February 2013. http://
ir.inflibnet.ac.in:8080/jspui/handle/10603/7023. Accessed 13 Nov 2014.
26. Devine E, Capurro D, van Eaton E, et al. Preparing electronic clinical data for
quality improvement and comparative effectiveness research: the SCOAP
CERTAIN automation and validation project. EGEMS. 2013;1:1025.
doi:10.13063/2327-9214.1025.
27. Suresh S, Gautam JP, Pancha G, et al. Method and architecture for automated
optimization of ETL throughput in data warehousing applications. 2001. http://
google.com/patents/US6208990. Accessed 13 Nov 2014.
28. Kushanoor A, Murali Krishna S, Vidya Sagar Reddy T. ETL process modeling
in DWH using enhanced quality techniques. Int J Database Theory Appl.
2013;6:17998.
29. Peek N, Holmes JH, Sun J. Technical challenges for big data in biomedicine
and health: data sources, infrastructure, and analytics. Yearb Med Inform.
2014;9:427. doi:10.15265/IY-2014-0018
30. Sandhu E, Weinstein S, McKethan A, et al. Secondary uses of electronic health
record data: Benefits and barriers. Jt Comm J Qual Patient Saf. 2012;38:3440. 1
31. Schilling LM, Kwan BM, Drolshagen CT, et al. Scalable architecture for
federated translational inquiries network (SAFTINet) technology
infrastructure for a distributed data network. EGEMS. 2013;1:1027.
doi:10.13063/2327-9214.1027.
32. Observational Medical Outcomes Partnership. CDM and Standard Vocabularies
(Version 4). 2012. http://omop.org/VocabV4.5. Accessed 1 Apr 2013.
33. Schema Matching and Mapping. http://www.springer.com/computer/
database+management+%26+information+retrieval/book/978-3-642-16517-
7. Accessed 25 May 2014.
34. Fagin R. Inverting Schema Mappings. In: Proceedings of the twenty-fifth
ACM SIGMOD-SIGACT-SIGART symposium on principles of database
systems. New York: ACM; 2006. p. 509. doi:10.1145/1142351.1142359.
35. Häyrinen K, Saranto K, Nykänen P. Definition, structure, content, use and
impacts of electronic health records: a review of the research literature. Int J
Med Inf. 2008;77:291304. doi:10.1016/j.ijmedinf.2007.09.001.
36. Madhavan J, Jeffery SR, Cohen S, et al. Web-scale data integration: you can only afford to pay as you go. In: Proc. of CIDR-07; 2007.
37. Davidson SB, Overton C, Buneman P. Challenges in integrating biological data sources. J Comput Biol. 1995;2:557–72. doi:10.1089/cmb.1995.2.557.
38. Taniar D, Chen L, editors. Integrations of Data Warehousing, Data Mining
and Database Technologies: Innovative Approaches. IGI Global 2011. http://
www.igi-global.com/chapter/survey-extract-transform-load-technology/
53076. Accessed 24 Sep 2014.
39. Stang PE, Ryan PB, Racoosin JA, et al. Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Ann Intern Med. 2010;153:600–6. doi:10.7326/0003-4819-153-9-201011020-00010.
40. Rothstein MA. Currents in contemporary ethics. Research privacy under HIPAA and the common rule. J Law Med Ethics. 2005;33:154–9.
41. Nass SJ, Levit LA, Gostin LO, et al. Beyond the HIPAA privacy rule:
enhancing privacy, improving health through research. Washington, D.C.:
National Academies Press; 2009. https://www.nap.edu/read/12458/chapter/1
42. Ness RB. Influence of the HIPAA privacy rule on health research. JAMA. 2007;298:2164–70. doi:10.1001/jama.298.18.2164.
43. Herdman R, Moses HL, United States, et al. Effect of the HIPAA privacy rule
on health research: Proceedings of a workshop presented to the National
Cancer Policy Forum. Washington D.C.: National Academies Press; 2006.
44. Selker HP, Pienta KJ. The importance of proposed changes in the “Common Rule” for clinical and translational researchers. Clin Transl Sci. 2011;4:312–3. doi:10.1111/j.1752-8062.2011.00352.x.
45. Wyatt L, Caufield B, Pol D. Principles for an ETL Benchmark. In: Nambiar R, Poess M, eds. Performance Evaluation and Benchmarking. Berlin, Heidelberg: Springer; 2009. p. 183–98. https://link.springer.com/book/10.1007%2F978-3-642-10424-4. Accessed 1 Dec 2014.
46. Sills MR, Kwan BM, Yawn BP, et al. Medical home characteristics and asthma
control: a prospective, observational cohort study protocol. EGEMS Wash
DC. 2013;1:1032. doi:10.13063/2327-9214.1032.
47. Kwan BM, Sills MR, Graham D, et al. Stakeholder Engagement in a Patient-Reported Outcomes (PRO) Measure Implementation: A Report from the SAFTINet Practice-based Research Network (PBRN). J Am Board Fam Med. 2016;29:102–15. doi:10.3122/jabfm.2016.01.150141.
48. Kahn MG, Raebel MA, Glanz JM, et al. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50(Suppl):S21–9. doi:10.1097/MLR.0b013e318257dd67.
49. Legris P, Ingham J, Collerette P. Why do people use information technology? A critical review of the technology acceptance model. Inf Manag. 2003;40:191–204. doi:10.1016/S0378-7206(01)00143-4.
50. Fagin R, Haas LM, Hernández M, et al. Clio: Schema Mapping Creation and Data Exchange. In: Borgida AT, Chaudhri VK, Giorgini P, et al., eds. Conceptual Modeling: Foundations and Applications. Berlin, Heidelberg: Springer; 2009. p. 198–236. http://link.springer.com/chapter/10.1007/978-3-642-02463-4_12. Accessed 11 Jun 2014.
... Likewise, the Dynamic-ETL (D-ETL) project builds a platform for data extraction, transformation and loading that uses common formats and standard terminologies (Ong et al., 2017). To this end, the authors designed a methodology that automates part of the process through scalable, reusable and flexible code, while retaining the manual aspects of the process that require knowledge of complex coding syntax. ...
... On the other hand, the studies by Hong Sun et al. (Sun et al., 2015) and by Anil Pacaci et al. (Pacaci et al., 2018) propose methodologies based on the semantic web, representing data with the Resource Description Framework (RDF) and expressing conversions through Notation 3 (N3) rules. Finally, Dynamic-ETL builds a process composed of (Ong et al., 2017): (1) (Brat et al., 2020; Abbas et al., 2021; Instituto i+12, 2022; OHDSI, 2022a; TriNetX, 2022). These projects correspond to secondary-use data models of different designs and purposes, described earlier, such as clinical repositories, case report forms and aggregated datasets. ...
... This explains why earlier EHR reuse methodologies have focused on transforming the EHR into standardized repository models such as the OMOP CDM (Sun et al., 2015; Ong et al., 2017; Pacaci et al., 2018), rather than into models with complex constraints such as CRFs and aggregated datasets. Likewise, the set of data operations identified is only an initial specification that can be extended once the methodology is applied to new use cases. ...
Thesis
Full-text available
The collection of health data for research and other secondary purposes must evolve from a paradigm based on the manual recording in specific information systems for a purpose of exploitation, towards an effective reuse of the data recorded during the care process in the Electronic Health Record, known as real-world data (RWD). However, to achieve this ideal scenario it is necessary that the data are recorded, extracted, and transformed from the information systems with full meaning, and through formal and transparent processes that make them understandable, auditable, and reproducible. This thesis aims to propose a methodology for the recording, management and reuse of health data based on dual architecture models, also known as Detailed Clinical Models (DCM). Thus, the contributions of the thesis are: (1) The study of the paradigm of Detailed Clinical Models in the EHR, specifically the UNE-EN ISO 13606 standard, for the management and governance of concept and data models in the different typologies of information systems that compose the EHR; (2) The analysis of standard terminologies, such as SNOMED CT and LOINC, to represent the meaning of the concepts of the health domain formalized through clinical archetypes; (3) The proposal of a formal, transparent and automated process of extraction, selection and transformation of health data for reuse in any purpose and application scenario; (4) The evaluation of the validity, utility and acceptability of the methodology in its application to different use cases. These contributions have been applied to different health data projects developed at the Hospital Universitario 12 de Octubre in the COVID-19 pandemic. These projects specified data models of different typologies: standardized repositories, case report forms and aggregated data sets. As these projects were developed in a critical situation such as the COVID-19 pandemic, the data were required in an agile and flexible manner and without additional effort for health professionals, thus providing an ideal scenario to apply and evaluate the proposed health data reuse methodology. The conclusion of this PhD Thesis is that the proposed methodology has made it possible to obtain valid and useful data for research projects, using a process that is accepted by the consumers of the data. This is a first step towards changing the paradigm of data collection for research, going from ad-hoc processes of manual collection for a single purpose, to a process that is efficient, as it takes advantage of what is already recorded in the Electronic Health Record; flexible, as it is applicable for multiple purposes and for any organization that demands the data; and transparent, as it can be analyzed in technical or functional auditing processes.
... The use of this common framework makes the operations, and their constraints, understandable by any organization wishing to incorporate them into its EHR reuse process, at whatever point in the process it deems necessary. [47][48][49][50][51] However, a future step of this work will be to formalize not only the inputs of the operations but also the outputs produced by their application. At this point in the development of the methodology, loading data into research databases is treated as a manual process once the data have been obtained in accordance with the requirements of the output model. ...
... An improvement to this process will be a graphical interface tool that spares the user from having to edit the XML file directly in a text editor, making it more accessible to non-technical personnel, as previous works have done. 47,48,51 Conclusions: This study has provided a novel solution to the difficulty of making the ETL processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Thus, a transparent and flexible methodology was designed based on open standards and technologies, applicable to any clinical condition and health care organization, and even to EHR reuse processes already in place. ...
Article
Full-text available
Background: During the COVID-19 pandemic, several methodologies were designed for obtaining electronic health record (EHR)-derived datasets for research. These processes are often based on black boxes, in which clinical researchers are unaware of how the data were recorded, extracted, and transformed. To solve this, it is essential that extract, transform, and load (ETL) processes are based on transparent, homogeneous, and formal methodologies, making them understandable, reproducible, and auditable. Objectives: This study aims to design and implement a methodology, in accordance with the FAIR principles, for building ETL processes (focused on data extraction, selection, and transformation) for EHR reuse in a transparent and flexible manner, applicable to any clinical condition and health care organization. Methods: The proposed methodology comprises four stages: (1) analysis of secondary use models and identification of data operations, based on internationally used clinical repositories, case report forms, and aggregated datasets; (2) modeling and formalization of data operations, through the paradigm of the Detailed Clinical Models; (3) agnostic development of data operations, selecting SQL and R as programming languages; and (4) automation of the ETL instantiation, building a formal configuration file with XML. Results: First, four international projects were analyzed to identify 17 operations necessary to obtain datasets from the EHR according to the specifications of these projects. Each of the data operations was then formalized using the ISO 13606 reference model, specifying the valid data types as arguments, the inputs and outputs, and their cardinality. Next, an agnostic catalog of data operations was developed in the previously selected data-oriented programming languages. Finally, an automated ETL instantiation process was built from a formally defined ETL configuration file. Conclusions: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Moreover, the abstraction carried out in this study means that any previous EHR reuse methodology can incorporate these results.
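To make the configuration-driven pattern described in this abstract concrete, the sketch below shows one possible way a catalog of data operations could be driven by a formal configuration file. It is an illustration only, not code from the cited study: the operation names (select_fields, map_values), the XML layout, and the sample records are invented for the example, and Python stands in for the SQL/R operation catalog the abstract describes.

# Illustrative sketch only: a configuration-driven ETL step runner in the spirit of the
# methodology summarized above. Operation names, the XML layout, and the sample data
# are hypothetical, not taken from the cited work.
import xml.etree.ElementTree as ET

# A tiny "catalog" of data operations keyed by name.
def select_fields(rows, fields):
    return [{f: r.get(f) for f in fields} for r in rows]

def map_values(rows, field, mapping):
    return [{**r, field: mapping.get(r.get(field), r.get(field))} for r in rows]

CATALOG = {"select_fields": select_fields, "map_values": map_values}

# A hypothetical XML configuration naming the operations to apply, in order.
CONFIG = """
<etl>
  <operation name="select_fields" fields="patient_id,sex"/>
  <operation name="map_values" field="sex" mapping="M:male,F:female"/>
</etl>
"""

def run(rows, config_xml):
    # Walk the configuration and dispatch each step to the catalog.
    for op in ET.fromstring(config_xml).findall("operation"):
        name = op.get("name")
        if name == "select_fields":
            rows = CATALOG[name](rows, op.get("fields").split(","))
        elif name == "map_values":
            mapping = dict(pair.split(":") for pair in op.get("mapping").split(","))
            rows = CATALOG[name](rows, op.get("field"), mapping)
    return rows

if __name__ == "__main__":
    source = [{"patient_id": "1", "sex": "M", "zip": "80045"},
              {"patient_id": "2", "sex": "F", "zip": "80011"}]
    print(run(source, CONFIG))

The appeal of this general pattern is that adding a new transformation means registering one more catalog entry and referencing it from the configuration file, rather than editing pipeline code.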
... Quiroz et al. (20) developed an SQL-based ETL framework for the conversion of health databases to the OMOP CDM (20,21). Ong et al. (22) developed a GUI-based ETL system for the conversion of data to the OMOP CDM (22). This paper proposes an integrated solution to the problem of clinical data reuse that has been implemented in the context of the MODELHealth project. ...
Article
Full-text available
Introduction: Electronic Health Records (EHRs) are essential data structures, enabling the sharing of valuable medical care information for a diverse patient population and being reused as input to predictive models for clinical research. However, issues such as the heterogeneity of EHR data and the potential compromise of patient privacy inhibit the secondary use of EHR data in clinical research. Objectives: This study aims to present the main elements of the MODELHealth project implementation and the evaluation method that was followed to assess the efficiency of its mechanism. Methods: The MODELHealth project was implemented as an Extract-Transform-Load system that collects data from the hospital databases, performs harmonization to the HL7 FHIR standard and anonymization using the k-anonymity method, before loading the transformed data to a central repository. The integrity of the anonymization process was validated by developing a database query tool. The information loss occurring due to the anonymization was estimated with the metrics of generalized information loss, discernibility and average equivalence class size for various values of k. Results: The average values of generalized information loss, discernibility and average equivalence class size obtained across all tested datasets and k values were 0.008473 ± 0.006216252886, 115,145,464.3 ± 79,724,196.11 and 12.1346 ± 6.76096647, respectively. The values of those metrics appear correlated with factors such as the k value and the dataset characteristics, as expected. Conclusion: The experimental results of the study demonstrate that it is feasible to perform effective harmonization and anonymization on EHR data while preserving essential patient information.
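The information-loss metrics named in this abstract are aggregate statistics over the equivalence classes of a k-anonymized table. The sketch below is a minimal illustration of two of them using common textbook definitions (discernibility as the sum of squared class sizes; normalized average equivalence class size as (N / number of classes) / k); it is not the MODELHealth implementation, and the toy records and quasi-identifiers are invented.

# Minimal sketch, not the MODELHealth code: computing two of the information-loss metrics
# mentioned above for a toy k-anonymized table.
from collections import Counter

def equivalence_classes(rows, quasi_identifiers):
    """Group records that share the same (generalized) quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)

def discernibility(class_sizes):
    """Common form of the discernibility metric: sum of squared class sizes."""
    return sum(size ** 2 for size in class_sizes)

def avg_class_size(class_sizes, k):
    """Normalized average equivalence class size: (N / number of classes) / k."""
    n = sum(class_sizes)
    return (n / len(class_sizes)) / k

if __name__ == "__main__":
    k = 2
    # Ages and ZIP codes already generalized into ranges/prefixes.
    anonymized = [
        {"age": "30-39", "zip": "800**"}, {"age": "30-39", "zip": "800**"},
        {"age": "40-49", "zip": "801**"}, {"age": "40-49", "zip": "801**"},
        {"age": "40-49", "zip": "801**"},
    ]
    sizes = list(equivalence_classes(anonymized, ["age", "zip"]).values())
    print("discernibility:", discernibility(sizes))     # 2^2 + 3^2 = 13
    print("avg class size:", avg_class_size(sizes, k))  # (5 / 2) / 2 = 1.25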
... Once the new information has been placed into the structure and formats defined for the data warehouse, growth is achieved by loading data into the warehouse at defined intervals (the ETL, i.e., extract, transform, load process). Ultimately, the shortcomings of stores that accumulate new, unordered records and of rigid data warehouses gave rise to a hybrid solution, the "data lakehouse" model, which combines the advantages of both and applies an intermediate storage layer, thereby preserving the integrity of the original data [8][9][10]. ...
Article
Full-text available
Fragmentation of health data and biomedical research data is a major obstacle for precision medicine based on data-driven decisions. The development of personalized medicine requires the efficient exploitation of health data resources that are extraordinary in size and complexity, but highly fragmented, as well as technologies that enable data sharing across institutions and even borders. Biobanks are both sample archives and data integration centers. The analysis of large biobank data warehouses in federated datasets promises to yield conclusions with higher statistical power. A prerequisite for data sharing is harmonization, i.e., the mapping of the unique clinical and molecular characteristics of samples into a unified data model and standard codes. These databases, which are aligned to a common schema, then make healthcare information available for privacy-preserving federated data sharing and learning. The re-evaluation of sensitive health data is inconceivable without the protection of privacy, the legal and conceptual framework for which is set out in the GDPR (General Data Protection Regulation) and the FAIR (findable, accessible, interoperable, reusable) principles. For biobanks in Europe, the BBMRI-ERIC (Biobanking and Biomolecular Research Infrastructure - European Research Infrastructure Consortium) research infrastructure develops common guidelines, which the Hungarian BBMRI Node joined in 2021. As the first step, a federation of biobanks can connect fragmented datasets, providing high-quality data sets motivated by multiple research goals. Extending the approach to real-word data could also allow for higher level evaluation of data generated in the real world of patient care, and thus take the evidence generated in clinical trials within a rigorous framework to a new level. In this publication, we present the potential of federated data sharing in the context of the Semmelweis University Biobanks joint project. Orv Hetil. 2023; 164(21): 811-819.
... The ethics of neuroscience using big data have been developed mainly through researchers associated with the EU's Human Brain Project (Kaye et al., 2015;Ong et al., 2017). The philosophy and ethics of AI, or the exploration of the impact of AI on humans and society using methods from the humanities, have been gradually established (Benke and Benke, 2018;Johnson, 2020). ...
Article
Neuroethics is the study of how neuroscience impacts humans and society. About 15 years have passed since neuroethics was introduced to Japan, yet the field of neuroethics still seeks developed methodologies and an established academic identity. In light of progress in neuroscience and neurotechnology, the challenges for Japanese neuroethics in the 2020 s can be categorized into five topics. (1) The need for further research into the importance of informed consent in psychiatric research and the promotion of public-patient engagement. (2) The need for a framework that constructs a global environment for neuroscience research that utilizes reliable samples and data. (3) The need for ethical support within a Japanese context regarding the construction of brain banks and the research surrounding their use. It is also important to reconsider the moral value of the human neural system and make comparisons with non-human primates. (4) An urgent need to study neuromodulation technologies that intervene in emotions. (5) The need to reconsider neuroscience and neurotechnology from social points of view. Rules for neuroenhancements and do-it-yourself neurotechnologies are urgently needed, while from a broader perspective, it is essential to study the points of contact between neuroscience and public health.
... These results highlight the value of the EHRs as a useful and valid source for research, in a scenario where multiple projects propose automated upload processes from healthcare information systems to databases and repositories for research [18,19]. ...
Chapter
Full-text available
One approach to verifying the quality of research data obtained from EHRs is auditing how complete and correct the data are in comparison with those collected by manual and controlled methods. This study analyzed data quality of an EHR-derived dataset for COVID-19 research, obtained during the pandemic at Hospital Universitario 12 de Octubre. Data were extracted from EHRs and a manually collected research database, and then transformed into the ISARIC-WHO COVID-19 CRF model. Subsequently, a data analysis was performed, comparing both sources through this convergence model. More concepts and records were obtained from EHRs, and PPV (95% CI) was above 85% in most sections. In future studies, a more detailed analysis of data quality will be carried out.
Chapter
Conducting epidemiologic research usually requires a large amount of data to establish the natural history of a disease and to achieve a meaningful study design and interpretation of findings. This is, however, a huge task because the healthcare domain is composed of a complex corpus and concepts that make data difficult to use and store. Additionally, data accessibility should be considered because sensitive data from patients should be carefully protected and shared responsibly. With the COVID-19 pandemic, the need for sharing data and having an integrated view of the data was reaffirmed to identify the best approaches and signals to improve not only treatments and diagnoses but also social answers to the epidemiological scenario. This paper addresses a data integration scenario for dealing with COVID-19 and cardiovascular diseases, covering the main challenges related to integrating data in a common data repository storing data from several hospitals. A conceptual architecture is presented to deal with such approaches and integrate data from a Portuguese hospital into the common repository used to explore data in a standardized way. Keywords: Healthcare data; Data integration; ETL; COVID-19
Article
Purpose of review: Despite several efforts to enhance guideline adherence in cancer management, the rate of adherence remains often dissatisfactory in clinical routine. Clinical decision-support systems (CDSS) have been developed to support the management of cancer patients by providing evidence-based recommendations. In this review, we focus on both current evidence supporting the beneficial effects of CDSS on guideline adherence as well as technical and structural requirements for CDSS implementation in clinical routine. Recent findings: Some studies have demonstrated a significant improvement of guideline adherence by CDSSs in oncologic diseases such as breast cancer, colon cancer, cervical cancer, prostate cancer, and hepatocellular carcinoma as well as in the management of cancer pain. However, most of these studies were rather small and designs rather simple. One reason for this limited evidence might be that CDSSs are only occasionally implemented in clinical routine. The main limitations for a broader implementation might lie in the currently existing clinical data infrastructures that do not sufficiently allow CDSS interoperability as well as in some CDSS tools themselves, if handling is hampered by poor usability. Summary: In principle, CDSSs improve guideline adherence in clinical cancer management. However, there are some technical und structural obstacles to overcome to fully implement CDSSs in clinical routine.
Chapter
The power of data lies in the insights derived from it. We trace this journey as Data-Information-Knowledge-Wisdom and then come to the Insight (Rowley in J Inf Sci 33:163–180, 2007). A data warehouse is meant for storing and processing enormous amounts of data, gathered and transformed from various data sources (Yessad and Labiod in 2016 International conference on system reliability and science, ICSRS 2016—proceedings, pp 95–99, 2017). Data security is a major concern in the data warehouse domain, along with the privacy and confidentiality factors. This paper discusses the various measures and actions to be taken to protect the data in a data warehouse. We consider healthcare industry use cases in this study, and the proposed measures are discussed in the context of healthcare data. The healthcare industry is governed by various strict guidelines and regulatory requirements with respect to data storage, processing and transfer. Our proposed methods concentrate on the privacy and confidentiality of the healthcare data warehouse and consist of de-identification and user privilege-based access controls. Keywords: Data mart; Data warehouse; Healthcare data warehouse; Business intelligence; Data warehouse security; Information security
Article
Full-text available
Introduction: We see increased use of existing observational data in order to achieve fast and transparent production of empirical evidence in health care research. Multiple databases are often used to increase power, to assess rare exposures or outcomes, or to study diverse populations. For privacy and sociological reasons, original data on individual subjects can't be shared, requiring a distributed network approach where data processing is performed prior to data sharing. Case descriptions and variation among sites: We created a conceptual framework distinguishing three steps in local data processing: (1) data reorganization into a data structure common across the network; (2) derivation of study variables not present in original data; and (3) application of study design to transform longitudinal data into aggregated data sets for statistical analysis. We applied this framework to four case studies to identify similarities and differences in the United States and Europe: Exploring and Understanding Adverse Drug Reactions by Integrative Mining of Clinical Records and Biomedical Knowledge (EU-ADR), Observational Medical Outcomes Partnership (OMOP), the Food and Drug Administration's (FDA's) Mini-Sentinel, and the Italian network, the Integration of Content Management Information on the Territory of Patients with Complex Diseases or with Chronic Conditions (MATRICE). Findings: National networks (OMOP, Mini-Sentinel, MATRICE) all adopted shared procedures for local data reorganization. The multinational EU-ADR network needed locally defined procedures to reorganize its heterogeneous data into a common structure. Derivation of new data elements was centrally defined in all networks but the procedure was not shared in EU-ADR. Application of study design was a common and shared procedure in all the case studies. Computer procedures were embodied in different programming languages, including SAS, R, SQL, Java, and C++. Conclusion: Using our conceptual framework we found several areas that would benefit from research to identify optimal standards for production of empirical knowledge from existing databases.
Article
Full-text available
Purpose: Patient-reported outcome (PRO) measures offer value for clinicians and researchers, although priorities and value propositions can conflict. PRO implementation in clinical practice may benefit from stakeholder engagement methods to align research and clinical practice stakeholder perspectives. The objective is to demonstrate the use of stakeholder engagement in PRO implementation. Method: Engaged stakeholders represented researchers and clinical practice representatives from the SAFTINet practice-based research network (PBRN). A stakeholder engagement process involving iterative analysis, deliberation, and decision making guided implementation of a medication adherence PRO measure (the Medication Adherence Survey [MAS]) for patients with hypertension and/or hyperlipidemia. Results: Over 9 months, 40 of 45 practices (89%) implemented the MAS, collecting 3,247 surveys (mean = 72, median = 30, range: 0–416). Facilitators included: an electronic health record (EHR) with readily modifiable templates; existing staff, tools and workflows in which the MAS could be integrated (e.g., health risk appraisals, hypertension-specific visits, care coordinators); and engaged leadership and quality improvement teams. Conclusion: Stakeholder engagement appeared useful for promoting PRO measure implementation in clinical practice, in a way that met the needs of both researchers and clinical practice stakeholders. Limitations of this approach and opportunities for improving the PRO data collection infrastructure in PBRNs are discussed. (J Am Board Fam Med 2016;29:102–115.)
Article
Full-text available
The field of clinical research informatics includes creation of clinical data repositories (CDRs) used to conduct quality improvement (QI) activities and comparative effectiveness research (CER). Ideally, CDR data are accurately and directly abstracted from disparate electronic health records (EHRs), across diverse health systems. Investigators from Washington State's Surgical Care Outcomes and Assessment Program (SCOAP) Comparative Effectiveness Research Translation Network (CERTAIN) are creating such a CDR. This manuscript describes the automation and validation methods used to create this digital infrastructure. SCOAP is a QI benchmarking initiative. Data are manually abstracted from EHRs and entered into a data management system. CERTAIN investigators are now deploying Caradigm's Amalga™ tool to facilitate automated abstraction of data from multiple, disparate EHRs. Concordance is calculated to compare automatically abstracted data to manually abstracted data. Performance measures are calculated between Amalga and each parent EHR. Validation takes place in repeated loops, with improvements made over time. When automated abstraction reaches the current benchmark for abstraction accuracy (95%), it will 'go live' at each site. A technical analysis was completed at 14 sites. Five sites are contributing; the remaining sites prioritized meeting Meaningful Use criteria. Participating sites are contributing 15-18 unique data feeds, totaling 13 surgical registry use cases. Common feeds are registration, laboratory, transcription/dictation, radiology, and medications. Approximately 50% of 1,320 designated data elements are being automatically abstracted: 25% from structured data and 25% from text mining. In semi-automating data abstraction and conducting a rigorous validation, CERTAIN investigators will semi-automate data collection to conduct QI and CER, while advancing the Learning Healthcare System.
Article
Full-text available
This paper describes the methods for an observational comparative effectiveness research study designed to test the association between practice-level medical home characteristics and asthma control in children and adults receiving care in safety-net primary care practices. This is a prospective, longitudinal cohort study, utilizing survey methodologies and secondary analysis of existing structured clinical, administrative, and claims data. The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) is a safety net-oriented, primary care practice-based research network, with federated databases containing electronic health record (EHR) and Medicaid claims data. Data from approximately 20,000 patients from 50 practices in four healthcare organizations will be included. Practice-level medical home characteristics will be correlated with patient-level asthma outcomes, controlling for potential confounding variables, using a clustered design. Linear and non-linear mixed models will be used for analysis. Study inception was July 1, 2012. A causal graph theory approach was used to guide covariate selection to control for bias and confounding. Strengths of this design include a priori specification of hypotheses and methods, a large sample of patients with asthma cared for in safety-net practices, the study of real-world variations in the implementation of the medical home concept, and the innovative use of a combination of claims data, patient-reported data, clinical data from EHRs, and practice-level surveys. We address limitations in causal inference using theory, design and analysis.
Article
Full-text available
Objectives: To review technical and methodological challenges for big data research in biomedicine and health. Methods: We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data. Results: The life and biomedical sciences are massively contributing to the big data revolution through secondary use of data that were collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets. Conclusions: The major challenge for the near future is to transform analytical methods that are used in the biomedical and health domain, to fit the distributed storage and processing model that is required to handle big data, while ensuring confidentiality of the data being analyzed.
Article
Full-text available
Many modern physician specialists like to think of their work as grounded in strong science. Yet 5 years ago, a group of cardiologists published their findings on the science underlying over 2700 practice recommendations issued by their specialty societies.1 Only 314 (or 11%) were based on ‘level A’ evidence, that is, evidence based on multiple well-done randomized trials. Nearly half of the recommendations were based solely on ‘expert opinion.’ Even more disconcerting is the fact that despite the activities of many researchers, the vast majority of ongoing clinical trials are too small to provide evidence relevant to patients and clinicians.2 Robust trials that can support recommendations grounded in solid science are few and far between, in part because they have become too expensive and complicated to run. Our biomedical enterprise is conducting many clinical trials, yet we may not be getting all that much for what we spend.3 No wonder that too often, patients and caregivers seeking information on how best to improve their health or the health of their loved ones find that biomedicine does not have answers for questions they ask. Too often, clinicians cannot tell patients which therapies are likely to work best for individuals like them. Too often risks, benefits, and impact on quality of life are uncertain. Providing accurate answers based on the highest levels of scientific evidence for the majority of unresolved clinical questions is a revolutionary dream shared by patients, providers, payers, health plans, researchers, and policy makers alike. PCORnet, the National Patient-Centered Clinical Research …
Article
Full-text available
The Patient-Centered Outcomes Research Institute (PCORI) has launched PCORnet, a major initiative to support an effective, sustainable national research infrastructure that will advance the use of electronic health data in comparative effectiveness research (CER) and other types of research. In December 2013, PCORI's board of governors funded 11 clinical data research networks (CDRNs) and 18 patient-powered research networks (PPRNs) for a period of 18 months. CDRNs are based on the electronic health records and other electronic sources of very large populations receiving healthcare within integrated or networked delivery systems. PPRNs are built primarily by communities of motivated patients, forming partnerships with researchers. These patients intend to participate in clinical research, by generating questions, sharing data, volunteering for interventional trials, and interpreting and disseminating results. Rapidly building a new national resource to facilitate a large-scale, patient-centered CER is associated with a number of technical, regulatory, and organizational challenges, which are described here.
Article
Full-text available
A learning health system (LHS) integrates research done in routine care settings, structured data capture during every encounter, and quality improvement processes to rapidly implement advances in new knowledge, all with active and meaningful patient participation. While disease-specific pediatric LHSs have shown tremendous impact on improved clinical outcomes, a national digital architecture to rapidly implement LHSs across multiple pediatric conditions does not exist. PEDSnet is a clinical data research network that provides the infrastructure to support a national pediatric LHS. A consortium consisting of PEDSnet, which includes eight academic medical centers, two existing disease-specific pediatric networks, and two national data partners form the initial partners in the National Pediatric Learning Health System (NPLHS). PEDSnet is implementing a flexible dual data architecture that incorporates two widely used data models and national terminology standards to support multi-institutional data integration, cohort discovery, and advanced analytics that enable rapid learning.