Graph databases for openEHR clinical repositories

Int. J. Computational Science and Engineering, Vol. 20, No. 3, 2019 281
Copyright © 2019 Inderscience Enterprises Ltd.
Samar El Helou*
Department of Social Informatics,
Graduate School of Informatics,
Kyoto University, Japan
*Corresponding author
Shinji Kobayashi
Department of Electronic Health Record,
Graduate School of Medicine,
Kyoto University, Japan
Goshiro Yamamoto
Division of Medical IT and Administration Planning,
Kyoto University Hospital, Japan
Naoto Kume
Department of Electronic Health Record,
Graduate School of Medicine,
Kyoto University, Japan
Eiji Kondoh
Department of Gynecology and Obstetrics,
Graduate School of Medicine,
Kyoto University, Japan
Shusuke Hiragi and Kazuya Okamoto
Division of Medical IT and Administration Planning,
Kyoto University Hospital, Japan
Hiroshi Tamura
Center for Innovative Research and Education in Data Science,
Kyoto University, Japan
Tomohiro Kuroda
Division of Medical IT and Administration Planning,
Kyoto University Hospital, Japan
Abstract: The archetype-based approach has now been adopted by major EHR interoperability
standards. Soon, due to an increase in EHR adoption, more health data will be created and
frequently accessed. Previous research shows that conventional persistence mechanisms such as
relational and XML databases have scalability issues when storing and querying archetype-based
datasets. Accordingly, we need to explore and evaluate new persistence strategies for
archetype-based EHR repositories. To address the performance issues expected to occur with the
increase of data, we proposed an approach using labelled property graph databases for
implementing openEHR clinical repositories. We implemented the proposed approach using
Neo4j and compared it to an object relational mapping (ORM) approach using Microsoft SQL
server. We evaluated both approaches over a simulation of a pregnancy home-monitoring
application in terms of required storage space and query response time. The results show that the
proposed approach provides a better overall performance for clinical querying.
Keywords: openEHR; graph database; electronic health records; EHR; database; performance;
archetypes; reference model; EHR repository; archetype-based storage; query response time;
clinical repository.
Reference to this paper should be made as follows: El Helou, S., Kobayashi, S., Yamamoto, G.,
Kume, N., Kondoh, E., Hiragi, S., Okamoto, K., Tamura, H. and Kuroda, T. (2019)
‘Graph databases for openEHR clinical repositories’, Int. J. Computational Science and
Engineering, Vol. 20, No. 3, pp.281–298.
Biographical notes: Samar El Helou is a PhD student in the Department of Social Informatics at
the Kyoto University. Her research interests include EHRs, health data models, patient-centred
care and ubiquitous healthcare.
Shinji Kobayashi is currently a Senior Lecturer in the Department of Electronic Health Record in
the Graduate School of Medicine at the Kyoto University. He received his MD and PhD degree
from the Kyushu University. He has been leading the Medical Open Source Software Council in
Japan since 2003. His research area is open source software in medicine and the Ruby
implementation of the openEHR standards.
Goshiro Yamamoto is currently a Senior Lecturer in the Division of Medical IT and
Administration Planning of the Kyoto University Hospital. He received his BE, ME and PhD in
Engineering from the Osaka University. His major interests are human-computer interaction and
medical informatics.
Naoto Kume is currently an Associate Professor in the Department of Electronic Health Record
in the Graduate School of Medicine at the Kyoto University. He received his PhD in informatics
from the Kyoto University. His research interests include EHRs, clinical studies and mobile
Eiji Kondoh is an Associate Professor in the Department of Gynecology and Obstetrics in the
Graduate School of Medicine at the Kyoto University. He specialises in maternal-fetal medicine.
His research interests include high-risk obstetrics, especially postpartum hemorrhage, placenta
accreta and preeclampsia.
Shusuke Hiragi is an Assistant Professor in the Division of Medical IT and Administration
Planning of the Kyoto University Hospital. His research interests include medical research with
hospital information systems and insurance claim databases.
Kazuya Okamoto is a Senior Lecturer in the Division of Medical IT and Administration Planning
of the Kyoto University Hospital. He received his BS, MS and PhD in informatics from the
Kyoto University. His current research interests include medical informatics, artificial
intelligence in medicine and rehabilitation engineering.
Hiroshi Tamura is a Program-Specific Professor in the Center for Innovative Research and
Education in Data Science of the Kyoto University Institute for Liberal Arts and Sciences. His
research interests include hospital management, healthcare policy and management,
ophthalmology and visual sciences.
Tomohiro Kuroda is a Professor in the Division of Medical IT and Administration Planning of
the Kyoto University Hospital. He received his PhD in Information Science from the Nara
Institute of Science and Technology. His research interests include human interfaces,
virtual/augmented reality, wearable computing and medical & assistive informatics. He is a
member of IEEE, ISVR, HISJ, JAMI and others.
This paper is a revised and expanded version of a paper entitled ‘Exploring graph databases with
openEHR in antenatal care settings’ presented at BASE 2015 Symposium, Aizu, Japan, 7–9
December 2015.
1 Introduction
The widespread use of electronic health records (EHR)
could lead to higher service efficiency and effectiveness and
thus lower healthcare costs (Shen et al., 2015; King et al.,
2014; Agrawal, 2002). Governments aiming to improve the
performance of healthcare provision are implementing
policies to increase EHR adoption. Initiatives for EHR
development and dissemination can be seen in Australia,
Canada, the European Union, the USA and Japan
(Cornwall, 2002; IT Strategic Headquarters, 2009). These
governmental efforts increased EHR adoption rates and
exposed some of their adoption barriers. Some of the
commonly cited EHR adoption barriers are a lack of
interoperability, low levels of usability and high
maintainability costs (Hamid and Cline, 2013; Vishwanath
and Scamurra, 2007; Ash and Bates, 2005).
The lack of interoperability of EHR systems is
considered a major technical barrier since it can hinder the
access to and the sharing of data between EHRs as well as
their consequent processing by computers (Kalra and
Blobel, 2007; Librelotto et al., 2015). Traditionally, EHR
vendors developed or implemented EHRs based on internal
and proprietary standards, which complicated the
information sharing with other vendors' EHRs. Currently,
the use of non-proprietary standards for building EHR
systems is recognised as a requirement to address the
interoperability issue and allow the exchange of information
between EHR services. Multiple EHR interoperability
standards have been developed over the last decade. The
most commonly used ones such as HL7, CEN ISO 13606
and openEHR adopted a two-level modelling approach, i.e.,
an archetype-based modelling approach, separating the
physical representation of data from the clinical domain
concepts (Begoyan, 2007; Schloeffel et al., 2006). When
building EHR systems following the archetype-based
modelling approach, the data repository stores health
concepts as instances of an information reference model
(RM) (Garde et al., 2007).
In this study, we consider the case of openEHR.
openEHR is a technology-independent specification for structuring EHR data. It
defines an information RM but does not commit its
implementers to any particular implementation approach
(openEHR, 2008). Consequently, the developers have to
make a decision regarding the persistence technology and
approach when building EHR data repositories following
the openEHR specification. This decision could affect the
system’s overall level of usability and maintainability costs.
In the following sections, we describe how a persistence
approach could contribute to the EHR system’s usability
and maintainability costs.
The usability and efficiency of EHR systems are
common concerns expressed by the clinical staff (Belden
et al., 2009). An important effectiveness metric and
usability factor is a system’s response time (Nielsen, 1993;
Li et al., 2002; Li and Bao, 2017; Liu and Xiao, 2016) or the
time a transaction needs to be executed when using a
system. The system’s response time depends on many
variables such as the CPU, the network and the database
used (Li et al., 2002). For an EHR system, most of the tasks
require browsing the databases. Thus, it can be assumed that
the database query response times significantly affect the
system’s overall performance and usability (Balsamo et al.,
2004). When using EHR systems in clinical settings,
healthcare providers usually execute create, read, update
and destroy (CRUD) operations to generate, retrieve and
update data from individual patients’ health records.
Minimising the execution time of such CRUD operations
could improve the usability of EHRs in clinical settings.
In addition to usability, the cost of maintaining EHR
systems is a common concern expressed by organisations
considering their implementation. If EHRs are adopted, a
rapid growth in data quantity is expected to occur. This
growth will create the need for greater storage capacities,
addressed by buying larger storage, i.e., scaling up, or
distributing the data over multiple servers, i.e., scaling out
(Frost, 2015; Cottle et al., 2013). Therefore, reducing the
required storage space for clinical data could reduce the
maintainability costs of EHR systems.
Considering that the clinical query execution time is
crucial to the overall performance of the EHR system and
that the storage efficiency affects the future maintainability
costs, the aim of this work is to provide an openEHR
repository implementation strategy that improves the
execution time of clinical queries and reduces the data
storage space requirements.
When building EHR systems following the openEHR
archetype-based modelling approach, the data repository has
to store instances of the openEHR information RM. The
RM contains multiple classes in a deep tree hierarchy. Since
the openEHR RM has a tree structure, an openEHR EHR
would have the structure of a directed rooted tree, a graph-
like data structure. When clinical queries regarding a
specific patient are executed, the tree structure is queried
starting from the top node containing the unique EHR ID.
Multiple openEHR implementation approaches have
been previously explored and implemented most often using
relational and XML databases (Frade et al., 2013).
However, previous research and discussions suggest that
these approaches are less than optimal for storing and
querying archetype-based datasets (Freire et al., 2012,
2016). Using a relational database implies that multiple
JOIN operations need to be executed when querying the tree
structure, leading the system’s performance to deteriorate
with the increase of data. Moreover, due to the complex
structure of the RM and the impedance mismatch between
the RM and the relational model, the schema can be hard to
model. As for XML databases, they do not perform as well
as relational databases (Green, 2008) and they were found
to require larger memory and storage space to process and
store the information (Megginson, 2004). Proposing and
evaluating new implementation approaches could be of
value for openEHR implementers.
Recently, graph databases have been developed as a
possible replacement for relational databases when dealing
with graph-like data structures (Angles, 2012). Graph
databases are optimised for storing and querying graph-like
structures. Since the RM has a graph-like structure,
mapping it to a graph model and consequently storing it in a
graph database would be straightforward. Moreover, instead
of joining multiple tables to query the tree, a graph database
starts by locating the initial node and subsequently executes
traversals. Since the cost of traversals is not affected by the
number of records in the database (Robinson et al., 2015),
graph databases should theoretically scale better than their
relational counterparts in the case of openEHR repositories.
This work proposes an openEHR repository
implementation approach that allows fast clinical querying
and efficient storage. The proposed approach uses a labelled
property graph database and directly maps the openEHR
RM structure to a graph structure. We evaluate the proposed
approach by comparing it to an object relational mapping
(ORM) approach because most of the existing openEHR
database implementations follow the relational approach
(Frade et al., 2013). The evaluation explores likely querying
scenarios over artificially generated pregnancy home-monitoring
data repositories of different sizes. The
performance comparison considers two main criteria: the
query response time and the required storage space.
2 openEHR structure and database models
2.1 openEHR
openEHR is a set of open-source specifications for a
complete EHR architecture. It is based on 15 years of
research and real-world implementations (Schloeffel et al.,
2006). openEHR specifies how to create, store, maintain
and query EHRs following a two-level modelling approach.
This approach separates the domain knowledge, i.e., health
concepts from the software and its database. In the two-
level modelling approach, the first level consists of the
information RM and the second level of the domain concept
models (DCM) or ‘archetypes’ (Beale, 2002). The RM
specifies a set of classes covering the possible types of
information meant to be stored in an EHR. Archetypes are
coded in terms of constraints over the RM classes. For
example, the blood pressure archetype is modelled by
constraining the OBSERVATION class. When developing
an EHR system, only the RM is implemented and clinical
data is stored in the database as instances of the RM classes
(openEHR, 2007; 2008). Therefore, following the two-level
modelling approach, the database design depends solely on
the RM and is not affected by continuous changes in
medical knowledge, resulting in highly flexible and
adaptable EHRs. openEHR is a technology independent
specification and does not recommend or specify the use of
any particular database technology.
Figure 1 High level structure of the openEHR EHR
Figure 2 UML diagram of the openEHR reference model
The top-level structure of the openEHR EHR is shown in
Figure 1. The structure starts with an EHR object identified
by a globally unique identifier called ‘EHR id.’ The access
control settings of the EHR are contained in the ‘EHR
access’ object and the status information is contained in the
‘EHR status’ object. All the clinical and administrative data
in the EHR are contained in ‘composition’ objects.
The model of the data in ‘composition’ objects follows
the logical structure of the openEHR RM, a tree structure
with a deep inheritance hierarchy, as shown in Figure 2.
Accordingly, an openEHR repository is implemented to
store the tree structure with deep hierarchy described above.
2.2 openEHR data repositories
Being a technology-independent specification, openEHR
does not recommend any specific database technology or
approach for implementing openEHR data repositories. The
choice of technology and implementation approach is left to
the developers. However, openEHR attracts a wide range of
individuals and organisations including developers, medical
specialists, researchers, small organisations, large
organisations and governments. Each of these parties
employs openEHR in a different context in which specific
database technologies and modelling approaches might
prove fitter. For these reasons, multiple approaches for
implementing openEHR repositories exist and some of them
were described and evaluated in the literature.
A survey of existing openEHR repository
implementations published in 2013 (Frade et al., 2013)
showed that the relational storage solutions were most often
used. In conventional cases, using a relational database
management system (RDBMS) for an openEHR repository
would require ORM and various strategies to handle the
model impedance mismatch, making the schema hard to
model. The effort required to model the database schema is
a common database adoption barrier (Jagadish et al., 2015).
Moreover, an ORM approach maps each class in the RM to
a table in the relational database, resulting in a large number
of tables. The numerous tables and deep hierarchy of the
openEHR data structures would require the execution of
multiple JOIN operations when querying the data, which
would theoretically result in poor query response times for
large datasets. Another relational mapping method called
archetype relational mapping (ARM) was proposed (Wang
et al., 2015). Instead of mapping the RM classes to tables,
the ARM maps archetypes to tables. The method was
evaluated and appeared to be promising for population-wide
On the openEHR website, a relatively direct key-value
strategy called ‘Node + Path’ is proposed and explained
(openEHR, 2008c). Following the ‘Node + Path’ approach,
openEHR data is stored in a key-value store which is one
big relational table with two columns: a key column and a
value column. The key column contains node paths and the
value column contains blobs corresponding to the serialised
nodes at those paths. Even though this approach is fairly
simple to implement, it performed poorly in terms of query
response time (Wang et al., 2015).
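The two-column layout described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions, not the openEHR reference design: the table name `node_store`, the helpers `put_node` and `get_node`, the JSON serialisation and the example key are all hypothetical.

```python
import json
import sqlite3

# In-memory database standing in for the single key-value table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_store (path TEXT PRIMARY KEY, node BLOB NOT NULL)")


def put_node(path, node):
    """Serialise a node (here as JSON, an assumption) and store it under its path."""
    blob = json.dumps(node).encode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO node_store (path, node) VALUES (?, ?)", (path, blob)
    )


def get_node(path):
    """Fetch and deserialise the node stored at `path`, or None if it is absent."""
    row = conn.execute(
        "SELECT node FROM node_store WHERE path = ?", (path,)
    ).fetchone()
    return json.loads(row[0].decode("utf-8")) if row else None


# Hypothetical key and value, for illustration only.
example_key = "ehr-001/[openEHR-EHR-OBSERVATION.body_weight.v1]/data[at0002]"
put_node(example_key, {"rm_class": "HISTORY", "archetype_node_id": "at0002"})
stored = get_node(example_key)
```

In this sketch, any query on a value inside a node has to fetch the blob and deserialise it before the value can be inspected.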
In addition to relational databases, openEHR
implementers often use native XML databases. However, a
study found that XML databases, without major
optimisations, performed poorly for population-wide and
ad-hoc querying for large openEHR datasets (Freire et al.,
2012). A later study found that even after indexing and
query optimisations were applied, XML databases did not
perform as well as relational databases for population-wide
querying (Freire et al., 2016).
2.3 Graph databases
Graph databases were invented to counteract some
limitations of the relational databases regarding highly
interconnected data and continuously evolving data models.
In a graph data model, information is represented using
nodes and edges (Hunger et al., 2016). Nodes represent the
entities and the relationships between those entities are
manifested by the edges that connect them (Robinson et al., 2015).
Graph databases can be split into two categories:
native graph storage and processing
non-native graph storage and processing.
In a native graph storage technology, the underlying
structure of the database is optimised to store graph-like
data, ensuring that nodes and relationships are written close
to each other. Non-native graph databases store the graph
data, i.e., node data and relationship data, in other database
technologies, e.g., relational tables, which can lead to slow
querying as these models are not optimised for graph-like
data (Robinson et al., 2015).
In a native graph processing technology, the database
does not rely on global indexes to gather the data. Rather,
index-free adjacency is used. Index-free adjacency means
that each node references its adjacent nodes, so instead of
using global indexes, the nodes act as indexes for their
nearby nodes. Theoretically, the complexity of executing
graph traversals is O(1) in a graph database using index-free
adjacency (Robinson et al., 2015), in comparison to an
average of O(log(n)) for a binary search to locate an index
entry in a relational database.
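The contrast between the two lookup strategies can be illustrated with a toy Python model. It does not reproduce the storage engine of any graph or relational database: the `Node` class, the `follow` helper and the sorted-list index are hypothetical names introduced for this sketch. A traversal hop is a single access on the current node, whereas the number of comparisons in the index search grows with the logarithm of the index size.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node holding direct references to its neighbours."""
    key: int
    neighbours: dict = field(default_factory=dict)  # relationship type -> Node


def follow(node, rel_type):
    """One traversal hop: a single lookup on the node itself."""
    return node.neighbours[rel_type]


def index_lookup(sorted_keys, key):
    """Binary search in a global sorted index; returns (position, comparisons)."""
    low, high, comparisons = 0, len(sorted_keys), 0
    while low < high:
        middle = (low + high) // 2
        comparisons += 1
        if sorted_keys[middle] < key:
            low = middle + 1
        else:
            high = middle
    return low, comparisons


# A two-node graph: the hop cost does not depend on how many other nodes exist.
ehr = Node(0)
composition = Node(1)
ehr.neighbours["CONTAINS"] = composition

# The same search needs more comparisons as the index grows.
position, small_cost = index_lookup(list(range(1_000)), 999)
_, large_cost = index_lookup(list(range(100_000)), 99_999)
```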
Figure 3 The labelled property graph model
One commonly used and well-documented graph database
is Neo4j (
database/#property-graph/). Neo4j uses native graph storage
and processing and employs the labelled property graph
model. In the labelled property graph model, nodes and
edges can have properties associated with them and nodes
can be tagged with labels representing their different roles
(Robinson et al., 2015). An example is shown in Figure 3,
where A, B and C are nodes. A, B and C are labelled
‘EHR’, ‘composition,’ and ‘person’, respectively. A has ‘id’
as a node attribute, while B and C have a ‘name’ attribute.
A is connected to B via a relationship of type ‘CONTAINS’
and to C via a relationship of type ‘BELONGS TO’. B is
connected to C via a relationship of type ‘ADDED BY’
with ‘in’ as a relationship property.
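The example of Figure 3 can be written down as a small in-memory structure. The following Python sketch is only a reading aid: the classes `GraphNode` and `Relationship` and the helper `outgoing` are hypothetical, and all property values are placeholders, since the text names only the property keys.

```python
from dataclasses import dataclass, field


@dataclass
class GraphNode:
    """A node with one or more labels and a set of key-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed edge that can carry its own properties."""
    start: GraphNode
    rel_type: str
    end: GraphNode
    properties: dict = field(default_factory=dict)


# Nodes A, B and C; the property values are placeholders.
a = GraphNode({"EHR"}, {"id": "placeholder-id"})
b = GraphNode({"composition"}, {"name": "placeholder composition"})
c = GraphNode({"person"}, {"name": "placeholder person"})

relationships = [
    Relationship(a, "CONTAINS", b),
    Relationship(a, "BELONGS TO", c),
    Relationship(b, "ADDED BY", c, {"in": "placeholder"}),
]


def outgoing(node, rels):
    """Return the relationships that start at `node`."""
    return [r for r in rels if r.start is node]
```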
3 Methods and materials
To implement the openEHR archetype-based data
repository, a graph model representing the openEHR RM
was created. Neo4j, a labelled property graph database
technology, was employed to store openEHR archetyped
data as instances of the openEHR RM. To evaluate the
proposed implementation approach, a performance
evaluation was conducted. The proposed approach was
implemented using Neo4j and was compared in terms of
query response time and required storage space to an ORM
approach implemented using Microsoft SQL server (2016).
To conduct the performance evaluation, datasets simulating
a pregnancy home-monitoring data repository were
artificially generated and imported into Neo4j and Microsoft
SQL Server. To compare the query response times, a set of
application-specific queries differing in complexity were
identified. Finally, the queries were written in Cypher
(Neo4j Inc.,
language/), a declarative query language for Neo4j and in
SQL to be executed over both repository implementations.
3.1 Graph model of the openEHR RM
Following the openEHR specification, clinical information
is represented using openEHR archetypes, which are
modelled as constraints over the openEHR RM classes. To
query the archetypes’ structure, openEHR includes a path
mechanism specifying the path to reach archetype nodes
starting from the root node of the archetype structure in
XPath-compatible syntax. Each path identifies an archetype
node using openEHR RM class attributes as attribute names
and ‘archetype_id’ or ‘archetype_node_id’ as predicates.
To create the graph model of the openEHR RM,
openEHR RM classes were mapped into graph nodes. The
relationships were modelled following the openEHR RM
class hierarchy and named in accordance with the class
attributes employed in the openEHR path mechanism, as
shown in Figure 4. Accordingly, a set of mapping rules was
designed for storing archetype structures in a labelled
property graph database:
1 Each archetype is mapped to a subgraph.
2 Each archetype node path is mapped to a branch in the subgraph.
3 Each archetype node is mapped to a node in the subgraph.
4 Each archetype node corresponds to a class in the RM. The RM class names are mapped to node labels in the subgraph.
5 ‘archetype_id’ and ‘archetype_node_id’ attributes are mapped to node properties in the subgraph.
6 Class attributes are mapped to relationship types.
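As an illustration of how these rules could be applied mechanically, the Python sketch below turns one archetype node path into a Cypher CREATE statement. It is not the authors' implementation. The function `path_to_create` is hypothetical, the example path is reconstructed from the body weight example used in this section, and the RM class of each node below the root is passed in explicitly because the path itself does not carry class names.

```python
import re

# One path segment: a class attribute followed by an archetype node id, e.g. data[at0002].
SEGMENT = re.compile(r"(\w+)\[(at\d+(?:\.\d+)*)\]")


def path_to_create(path, rm_classes):
    """Render one archetype node path as a Cypher CREATE statement.

    `path` looks like '[<archetype_id>]/attr[atNNNN]/...'; `rm_classes` lists the
    RM class of every node below the root, in path order (an assumed input).
    """
    root, _, rest = path.partition("]/")
    archetype_id = root.lstrip("[")
    # Rule 4: the root class is read from the archetype id,
    # e.g. 'openEHR-EHR-OBSERVATION.body_weight.v1' -> 'OBSERVATION'.
    root_class = archetype_id.split("-", 2)[2].split(".", 1)[0]
    segments = SEGMENT.findall(rest)
    if len(segments) != len(rm_classes):
        raise ValueError("one RM class is required per path segment")
    parts = [f"(:{root_class} {{archetype_id: '{archetype_id}'}})"]
    for (attribute, node_id), rm_class in zip(segments, rm_classes):
        # Rule 6: class attribute -> relationship type; rule 5: node id -> property.
        parts.append(
            f"-[:{attribute}]->(:{rm_class} {{archetype_node_id: '{node_id}'}})"
        )
    return "CREATE " + "".join(parts)


example_path = (
    "[openEHR-EHR-OBSERVATION.body_weight.v1]"
    "/data[at0002]/events[at0003]/state[at0008]/items[at0009]"
)
statement = path_to_create(
    example_path, ["HISTORY", "POINT_EVENT", "ITEM_TREE", "ELEMENT"]
)
```

The sketch writes RM class names with Cypher's `:Label` syntax so that they are stored as node labels, in line with rule 4. It builds the statement by string interpolation for readability only; a production system would more likely pass values as query parameters.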
Figure 4 Graph model of the openEHR reference model
(see online version for colours)
3.2 Storing and retrieving archetyped data with Cypher
Our method employs Cypher queries (Neo4j Inc.) to
store and retrieve openEHR data. Cypher is a declarative
graph query language for Neo4j graphs. To store archetyped
data, leaf node paths were extracted from the archetypes’
definitions, as shown in Table 1.
Table 1 Example of an extracted leaf node path
Leaf node
weight.v1]/data[at0002]/ events[at0003]/
Table 2 Cypher statement to store a graph branch
CREATE (:OBSERVATION {archetype_id: 'openEHR-EHR-OBSERVATION.body_weight.v1'})
  -[:data]->(:HISTORY {archetype_node_id: 'at0002'})
  -[:events]->(:POINT_EVENT {archetype_node_id: 'at0003'})
  -[:state]->(:ITEM_TREE {archetype_node_id: 'at0008'})
  -[:items]->(:ELEMENT {archetype_node_id: 'at0009'})
After the mapping rules were applied, Cypher CREATE
statements were formulated to store the corresponding graph
branches. Table 2 shows the Cypher CREATE statement
needed to store the extracted node path previously shown in
Table 1.
To retrieve archetyped data, Cypher queries traversing
the graph were built. The queries indicate the class
attributes as relationship types and the node predicates, i.e.,
‘archetype_id’ and ‘archetype_node_id’ as node attributes.
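A retrieval query of this kind could be assembled as in the following Python sketch. The helper `path_to_match` and the alias `leaf` are hypothetical, and the sketch matches only one archetype branch; how the branch is anchored to a specific EHR is not shown here.

```python
def path_to_match(root_class, archetype_id, steps):
    """Build a Cypher MATCH query that walks from an archetype root to a leaf.

    `steps` is a list of (class_attribute, archetype_node_id) pairs in path order.
    Class attributes become relationship types and node ids become node predicates.
    """
    if not steps:
        raise ValueError("at least one step is needed to reach a leaf node")
    pattern = f"(:{root_class} {{archetype_id: '{archetype_id}'}})"
    last = len(steps) - 1
    for position, (attribute, node_id) in enumerate(steps):
        # Only the final node gets an alias, because it is the one returned.
        alias = "leaf " if position == last else ""
        pattern += f"-[:{attribute}]->({alias}{{archetype_node_id: '{node_id}'}})"
    return f"MATCH {pattern} RETURN leaf"


query = path_to_match(
    "OBSERVATION",
    "openEHR-EHR-OBSERVATION.body_weight.v1",
    [("data", "at0002"), ("events", "at0003"), ("state", "at0008"), ("items", "at0009")],
)
```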
3.3 Test datasets generation
To evaluate the repository implementation approach, we
needed datasets containing a large number of records that
complied with the openEHR data models. However,
structured EHR data are difficult to obtain and usually
governed by strict privacy laws when available. To ensure
our ability to share the dataset used in this evaluation in the
future and thus guarantee the reproducibility of this
experiment, we decided to artificially generate the datasets.
By doing so, we sacrificed some realism in favour of reproducibility.
We artificially generated datasets simulating a
pregnancy home-monitoring data repository. The simulated
repository corresponds to an application that would allow
pregnant women to view information relating to their
pregnancy and to report pregnancy related symptoms. The
contents of the datasets corresponded to clinical concepts
and realistic data entries identified through discussions and
interviews with antenatal care experts. The structure of the
data was dictated by the openEHR RM and the definitions
of the archetypes. The dataset generation process is shown
in Figure 5.
Figure 5 openEHR dataset generation process
We started by reviewing the Japanese obstetrical guidelines
to gain an initial understanding of the antenatal care
concepts and processes (Minakami et al., 2011). Following
the review, we conducted two semi-structured interviews
with an obstetrician and a midwife. During the interviews,
we took notes describing the information flow during the
care process. We also identified the clinical information that
they considered relevant to report when using a pregnancy
home-monitoring application. During the interview with the
obstetrician, we used a checklist to determine the possible
symptoms that pregnant women may experience and the
information the pregnant women need to provide when
reporting such symptoms. The interviews allowed us to
identify the clinical concepts that could be involved in a
pregnancy home-monitoring application and a list of
realistic data entries that we could use to populate the datasets.
Next, we mapped the clinical concepts to openEHR
archetypes available in the openEHR clinical knowledge
manager (CKM) (openEHR, 2016) and created data value
sets corresponding to the possible data entries. The high-
level structure of the simulated records along with the
employed archetypes is shown in Figure 6.
Figure 6 The high-level structure of a simulated record
The simulated EHR records were modelled to contain the
date of birth, the obstetric history, the current pregnancy
summary and reports of pregnancy-related symptoms. In
total, 11 archetypes found on the CKM were used without
modification and four templates representing the different
types of compositions were created using the Ocean
Informatics Template Designer (Ocean Informatics). Each
of the generated EHR records includes five compositions
with a total of 42 nodes, out of which 19 nodes are leaf
nodes containing the data entries.
We then applied an ORM approach to design a
relational schema allowing the persistence of the required
archetypes over classes from the openEHR RM. The
relational schema over which the data generation plans were
executed is shown in Figure 7. The plans were executed
using Microsoft Visual Studio 2010 to populate a Microsoft
SQL server database.
Figure 7 Object relational mapping schema
Figure 8 Generation of the health summary composition (see online version for colours)
Figure 9 Generation of the obstetric history composition (see online version for colours)
Figure 10 Generation of the pregnancy history composition (see online version for colours)
Figure 11 Generation of the self-monitoring composition (see online version for colours)
Each generated record contained one date of birth entry, one
obstetric history entry, one pregnancy summary entry and
two reports of pregnancy symptoms. To generate the
records, we created 24 data generation plans using
Microsoft Visual Studio 2010. First, we generated the EHR
compositions with unique IDs. Then each generation plan
was applied to create different branches of the EHR record
as shown in Figures 8, 9, 10 and 11. The generation plans
contained possible value lists that were randomly assigned
in a uniform way across all the generated instances. In
certain cases, we had to create rules to make sure the data
made sense. For example, when generating the estimated
dates of birth and the dates of conception, we made sure that
the estimated date of birth would be nine months after the
date of conception.
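The combination of uniformly assigned values and consistency rules can be sketched in Python as follows. This is not the Visual Studio generation plan used in the study: the function names, the value list and the date range are placeholders, and `add_months` clamps the day of the month to 28 as a simplification to keep every generated date valid.

```python
import random
from datetime import date

# Placeholder value list; the actual lists came from the expert interviews.
SYMPTOM_VALUES = ["symptom A", "symptom B", "symptom C"]


def add_months(day, months):
    """Shift a date by whole calendar months, clamping the day to 28."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, min(day.day, 28))


def generate_record(rng, ehr_id):
    """Create one simulated record with uniformly drawn values."""
    conception = date(2015, rng.randint(1, 12), rng.randint(1, 28))
    return {
        "ehr_id": ehr_id,
        "date_of_conception": conception,
        # Consistency rule: the estimated date of birth follows conception by nine months.
        "estimated_date_of_birth": add_months(conception, 9),
        "symptom": rng.choice(SYMPTOM_VALUES),  # uniform choice from the value list
    }


rng = random.Random(0)  # fixed seed so that the generated set is reproducible
records = [generate_record(rng, ehr_id) for ehr_id in range(100)]
```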
In total, five datasets of different sizes were generated to
evaluate the effect of dataset size on the performance of
queries and required storage space. Sets named S1k, S5k,
S10k, S50k and S100k contained 1,000, 5,000, 10,000,
50,000 and 100,000 records, respectively.
In the prefecture where this research was conducted,
approximately 20,000 births take place every year. In the
institution where the EHR application is being developed, it
is estimated that 15 antenatal visits occur daily and up to
300 women receive antenatal care in a year. The institution
is a major university hospital; therefore, we expect that
other institutions would provide care for a smaller number
of women per year. Taking into consideration the previous
estimations, the size of the datasets aims to simulate the
following situations:
S1k, containing 1,000 records, simulates a situation in
which the application is used in one institution (200
pregnancies/year) over 5 years.
S5k, containing 5,000 records, simulates a situation in
which the application is used in three major institutions
(1,000 pregnancies/year) over 5 years.
S10k, containing 10,000 records, simulates a situation
in which the application is used by 10% of the pregnant
women (2,000 births/year) over 5 years.
S50k, containing 50,000 records, simulates a situation
in which the application is used by 50% of the pregnant
women (10,000 births/year) over 5 years.
S100k, containing 100,000 records, simulates a
situation in which the application is used on a
prefectural level (20,000 births/year) over 5 years.
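The dataset sizes follow directly from the yearly figures and the five-year horizon, as the short Python check below shows. The variable names are illustrative; the numbers are those stated above.

```python
YEARS = 5
PREFECTURE_BIRTHS_PER_YEAR = 20_000

# Pregnancies or births covered per year in each simulated situation.
per_year = {
    "S1k": 200,                                       # one institution
    "S5k": 1_000,                                     # three major institutions
    "S10k": PREFECTURE_BIRTHS_PER_YEAR * 10 // 100,   # 10% of pregnant women
    "S50k": PREFECTURE_BIRTHS_PER_YEAR * 50 // 100,   # 50% of pregnant women
    "S100k": PREFECTURE_BIRTHS_PER_YEAR,              # whole prefecture
}

# Number of records accumulated over the five-year period.
dataset_sizes = {name: count * YEARS for name, count in per_year.items()}
```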
After the datasets were created in Visual Studio 2010, they
were exported as comma-separated values (CSV) files. The
CSV data was imported into Neo4j and merged into a graph
structure aligning with the proposed graph model of the
openEHR RM. The S10k dataset can be accessed and
downloaded as a Neo4j database via http://openehr-test-
3.4 Evaluation setup
To create the query set used in the query response time
evaluation, we identified usage scenarios expected to occur
in clinical and home monitoring settings when using the
home-monitoring application. Each usage scenario was
mapped to a query. The usage scenarios are shown in
Table 3.
Table 3 Home monitoring application use-case scenarios
Scenario / Use of the home-monitoring app
Scenario 1, during the care visit
    Create a new EHR
    Add the DOB, the pregnancy summary and the pregnancy history to the EHR
    View the EHR record
    View the symptoms list
    View the added symptoms since the previous
Scenario 2, using the
    View EHR record
    Insert a new symptoms list
    View the symptoms list
    Update a symptom
Accordingly, a set of seven queries was identified:
Q1 find the health information present in one health record
Q2 add a new symptom entry to one health record
Q3 update a symptom entry in one health record
Q4 find the symptoms list in one health record
Q5 create a health record
Q6 add the date of birth, pregnancy summary and
pregnancy history to one health record
Q7 find the symptoms reported since period X in one
health record.
Equivalent SQL and Cypher queries were written to
represent the seven identified queries.
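As an illustration only, the sketch below shows what a parameterised Cypher query for Q4 might look like. The node labels and the id and archetype_id properties are those named in this paper; the CONTAINS relationship type, the function name and the parameter names are hypothetical placeholders and are not taken from the study's query set. The $name parameter syntax is that of recent Neo4j releases; older releases wrote parameters as {name}.

```python
from typing import Dict, Tuple

# Hypothetical Cypher for Q4 (find the symptoms list in one health record).
# Assumption: nodes are linked by a relationship type called CONTAINS; the
# actual relationship types of the proposed graph model may differ.
Q4_CYPHER = (
    "MATCH (e:EHR {id: $ehr_id})-[:CONTAINS]->"
    "(c:COMPOSITION {archetype_id: $composition_archetype_id})\n"
    "MATCH (c)-[:CONTAINS*]->(el:ELEMENT)\n"
    "RETURN el"
)


def build_q4(ehr_id: str, composition_archetype_id: str) -> Tuple[str, Dict[str, str]]:
    """Return the query text and its parameters, kept separate from the text."""
    params = {
        "ehr_id": ehr_id,
        "composition_archetype_id": composition_archetype_id,
    }
    return Q4_CYPHER, params


if __name__ == "__main__":
    # Both argument values are made-up examples.
    query, params = build_q4("ehr-0001", "openEHR-EHR-COMPOSITION.example.v1")
    print(query)
    print(params)
```

Keeping the values in a parameter map rather than in the query text means the same query string is submitted for every EHR, which is the usual way to let a database reuse a previously planned query.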
As mentioned earlier, in the institution where the EHR
application is being developed, it is estimated that 15
antenatal visits occur daily and that up to 300 women receive
antenatal care in a year. Based on these estimates, the
query requests are expected to occur at different
frequencies:
frequency of Q1, Q2, Q3, Q4: 250 times/day
frequency of Q5, Q6, Q7: 30 times/day.
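To make the resulting workload mix concrete, the following Python sketch totals the estimated daily requests and derives each query's share. It assumes that the stated rate applies to each listed query individually, which is one reading of the estimates above; the variable names are illustrative.

```python
# Estimated daily frequency per query (assumption: the stated rate is per query).
DAILY_FREQUENCY = {
    "Q1": 250, "Q2": 250, "Q3": 250, "Q4": 250,
    "Q5": 30, "Q6": 30, "Q7": 30,
}

TOTAL_PER_DAY = sum(DAILY_FREQUENCY.values())

# Fraction of the daily workload contributed by each query.
WORKLOAD_SHARE = {q: n / TOTAL_PER_DAY for q, n in DAILY_FREQUENCY.items()}

if __name__ == "__main__":
    print(f"total requests/day: {TOTAL_PER_DAY}")
    for q, share in WORKLOAD_SHARE.items():
        print(f"{q}: {share:.1%}")
```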
Similar to Freire et al. (2016), the evaluation criteria were
the storage space required by the database and the query
response time. However, in this study we considered clinical
queries, since they are the type of query required in the
application’s usage scenarios. Clinical queries return
requested data values existing in a specific EHR.
The performance evaluation was conducted using:
Intel(R) Xeon(R) CPU E5-3620 v3 @ 2.40 GHz with 32 GB of
memory, running the 64-bit Windows 10 Enterprise operating
system, version 1607
Neo4j community version 3.0.1
Microsoft SQL server 2016.
The queries were executed using Neo4j browser and
Microsoft SQL server management Studio 2016. If not
configured otherwise, Neo4j assumes that the entire RAM
on the machine is available to run the Neo4j server.
Similarly, Microsoft SQL server dynamically changes its
memory requirements based on the available system
resources. To ensure a fair comparison, the maximum server
memory for Microsoft SQL server was set at 32 GB. Each
query was executed 15 times over the five different datasets
in both database technologies.
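The repeated-execution protocol can be sketched as a small timing harness. In the study the queries were run from the Neo4j browser and the SQL server management tool, so the Python functions below are hypothetical stand-ins rather than the tooling actually used; run_query is a placeholder for whatever call submits one query to a database.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repetitions: int = 15) -> List[float]:
    """Execute run_query `repetitions` times and return response times in ms."""
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarise(timings_ms: List[float]) -> Dict[str, float]:
    """Report the first (cold) run separately from the median of all runs."""
    return {
        "first_ms": timings_ms[0],
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


if __name__ == "__main__":
    # Placeholder workload standing in for a real database call.
    timings = time_query(lambda: sum(range(10_000)))
    print(summarise(timings))
```

Reporting the first run separately is a design choice of this sketch: it keeps a cold-start response time visible instead of folding it into an average.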
3.5 Labelling and indexing
Both Microsoft SQL server and Neo4j have an indexing
mechanism to accelerate query execution. In Microsoft
SQL server, the queries perform JOIN operations over the
tables using the ‘id’ property. The ‘archetype_id’ and
‘archetype_node_id’ attributes are used as conditions in the
WHERE clauses of the SQL queries. To optimise the
performance of Microsoft SQL server, indexes are applied
over the ‘id’, ‘archetype_id’ and ‘archetype_node_id’
columns in all tables. Indexes that could improve
the query response time and were indicated as missing by
Microsoft SQL server management studio were also
created.
through the creation of node labels. In the labelled property
graph model, nodes can have any number of labels assigned
to them, indicating the role of the node in the domain.
Labels can be used in queries to identify the starting nodes
for a traversal, thus allowing for more efficient node
lookups. If the nodes are labelled, schema indexes can be
created for each label and property combination. In Neo4j,
schema indexes are helpful to locate the start node of each
query. Once the start node is located, Neo4j executes
traversals over the queried path. Two indexing strategies
were applied in Neo4j. The first strategy is similar to the
indexing strategy applied with SQL server: indexes
were created for the following (label, property)
combinations: (EHR, id), (COMPOSITION, archetype_id),
(EVALUATION, archetype_id), (OBSERVATION,
archetype_id), (ITEM_TREE, archetype_node_id),
(HISTORY, archetype_node_id), (POINT_EVENT,
archetype_node_id), (EVENT_CONTEXT,
archetype_node_id), (CLUSTER, archetype_id), (ELEMENT,
archetype_node_id). In the second strategy, we only
indexed the (EHR, id) combination, since all of the queries
deal with individual EHR records, implying that the starting
node is located by searching for a specific EHR id. When
comparing the performance of both implementation
approaches, the first Neo4j indexing strategy was used since
it resulted in faster query response times.
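The two Neo4j indexing strategies can be written down as lists of (label, property) pairs and turned into schema index statements. The sketch below uses the CREATE INDEX ON :Label(property) form of Neo4j 3.x; later Neo4j versions use a different CREATE INDEX syntax. The Python names are illustrative, and the statements are only generated as text here, not sent to a database.

```python
from typing import List, Tuple

# First strategy: one schema index per (label, property) pair listed above.
FULL_STRATEGY: List[Tuple[str, str]] = [
    ("EHR", "id"),
    ("COMPOSITION", "archetype_id"),
    ("EVALUATION", "archetype_id"),
    ("OBSERVATION", "archetype_id"),
    ("ITEM_TREE", "archetype_node_id"),
    ("HISTORY", "archetype_node_id"),
    ("POINT_EVENT", "archetype_node_id"),
    ("EVENT_CONTEXT", "archetype_node_id"),
    ("CLUSTER", "archetype_id"),
    ("ELEMENT", "archetype_node_id"),
]

# Second strategy: index only the property used to locate the start node.
EHR_ONLY_STRATEGY: List[Tuple[str, str]] = [("EHR", "id")]


def index_statement(label: str, prop: str) -> str:
    """Build a Neo4j 3.x schema index statement for one (label, property) pair."""
    return f"CREATE INDEX ON :{label}({prop})"


def index_statements(strategy: List[Tuple[str, str]]) -> List[str]:
    return [index_statement(label, prop) for label, prop in strategy]


if __name__ == "__main__":
    for statement in index_statements(FULL_STRATEGY):
        print(statement)
```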
4 Results
We compared our approach using Neo4j with an ORM
approach using Microsoft SQL server in terms of required
storage space and query response time. We first show the
required storage space in both approaches after the indexing
was applied. Then, we show how query response times
compared using both approaches.
4.1 Storage space requirement
Figure 12 shows the required storage space for each of the
databases after the indexing was performed. The Microsoft
SQL server database required less storage space for the
S1K, S5K and S10K datasets. Neo4j required less storage
space for the S50K and S100K datasets.
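The crossover described above can be checked directly against the values reported in Figure 12. In the sketch below, the assignment of the two value rows to SQL server and Neo4j is inferred from the statement that SQL server needed less space for the three smaller datasets; the unit of the values is not stated in the extracted figure, so only relative sizes are computed.

```python
DATASETS = ["S1K", "S5K", "S10K", "S50K", "S100K"]

# Storage after indexing, as read from Figure 12 (row assignment inferred
# from the text; unit not stated).
SQL_SERVER = [24, 95, 185, 904, 1760]
NEO4J = [33, 109, 212, 701, 1520]


def smaller_store(sql_size: float, neo4j_size: float) -> str:
    """Name the implementation that needs less space for one dataset."""
    return "SQL Server" if sql_size < neo4j_size else "Neo4j"


WINNERS = [smaller_store(s, n) for s, n in zip(SQL_SERVER, NEO4J)]
# Ratio above 1 means Neo4j needs more space than SQL Server for that dataset.
RATIOS = [n / s for s, n in zip(SQL_SERVER, NEO4J)]

if __name__ == "__main__":
    for name, winner, ratio in zip(DATASETS, WINNERS, RATIOS):
        print(f"{name}: smaller = {winner}, Neo4j/SQL Server = {ratio:.2f}")
```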
4.2 Query response times
The dataset size and the type of query are the two main
factors affecting the query response times in both
implementation approaches. We first show the effect of
the dataset size and then the effect of the query type.
Figure 12 Required storage space by dataset

Dataset                    1k    5k    10k   50k   100k
SQL server with indexes    24    95    185   904   1760
Neo4j with indexes         33    109   212   701   1520
Figure 13 Effect of dataset size on the response time (see online version for colours). Axes: dataset size (1K, 5K, 10K, 50K, 100K) against query response time (ms).
Figure 14 Effect of the query type on the response time (see online version for colours)
Figure 15 Effect of indexing on storage space (see online version for colours)
Figure 13 shows how both implementation approaches
performed for datasets of different sizes. The queries are
grouped together to simulate a complete usage scenario of
the home-monitoring application. Neo4j performed better
than Microsoft SQL server for all of the dataset sizes.
However, Neo4j had a large number of outliers, while
Microsoft SQL server maintained a more stable
performance. The outliers were mainly the result of
submitting a query to the server for the first time, meaning
that these outliers would not occur in a system in which the
server has been warmed up.
Figure 14 shows how both implementation approaches
performed for the different types of queries. The response
times for each query over the different dataset sizes are
grouped together. Neo4j performed better than Microsoft
SQL server for all the query types. The results also show
that the type of query affects both implementation
approaches in almost the same way: Q2, Q3 and Q7 have
longer response times with both approaches. More detailed
results about the query response times in each dataset are
provided in the Appendix.
5 Discussion
We proposed an implementation approach of openEHR
repositories using a labelled property graph database. We
compared a Neo4j implementation of the proposed approach
with a Microsoft SQL server implementation of the
commonly used ORM approach. The results confirm that
the ORM approach is not optimal for storing and querying
openEHR data and that the graph model could provide a
better overall performance. On the other hand, we can see
that Neo4j had a larger number of extreme outliers. These
outliers were mainly the response times that corresponded
to the first time a certain query was submitted to the server.
We can conclude that Neo4j has limited performance with
ad-hoc queries. However, ad-hoc queries could be avoided
in clinical settings. In the institution in which this research
was conducted, about 90% of the queries are cached
beforehand. Therefore, the limited performance of Neo4j for
ad-hoc queries would not be a practical concern for the
performance of clinical queries.
In terms of required storage space, the Microsoft SQL
server implementation required less space for the smaller
datasets while the Neo4j implementation required less space
for the larger datasets. One way to explain this is by looking
at the effect of indexing on the required storage space in
both database technologies, shown in Figure 15. Without
indexes, the Microsoft SQL server implementation of the
ORM approach requires the least storage space. However,
we see a threefold increase in the required storage space
after the indexes were added. For Neo4j, adding the indexes
increased the required storage by a maximum of 10%. With
relational databases, indexing cannot be practically avoided
because it greatly reduces the query response times when
JOIN operations over large tables are executed. Thus, these
results suggest that for larger datasets, Neo4j would be more
space efficient.
In addition to a promising overall performance, the
proposed approach using Neo4j was more straightforward
and easier to implement. For example, during this study,
Cypher queries required fewer than half the logical
lines of code (LLOC) required for the SQL
queries. The ease of implementation was due to the
semantic alignment between the openEHR RM and the
labelled property graph model, the schema-less nature of
graph databases, the declarative nature of Cypher and the
degree of similarity between the Cypher and AQL queries
(openEHR, 2008a). Furthermore, using Neo4j’s browser,
we were able to directly visualise the EHR as a semantic
graph. A survey of openEHR learning approaches (Sundvall
et al., 2016) proposed the use of interactive graphical
representations to browse and manipulate EHR instance
data to learn openEHR, a process usually described as
difficult and time consuming. Our experience suggests that
the ease of implementation and visualisation allowed by
Neo4j could be of value for beginners approaching
openEHR.
Limitations of this study include the nature of the
datasets and the limited number of explored clinical
use-cases. The first limitation resulted from a lack of access
to real datasets, a common issue faced by different
groups testing the performance of clinical repository
implementations. To conduct the performance evaluation,
we used artificially generated datasets instead of real EHR
data. The EHRs in the generated datasets contain five
compositions each, a number likely to be surpassed in a real
production scenario. However, since we could generate
different size datasets, we consider our method sufficient to
highlight the difference in performance between the two
compared implementations when the dataset size grows.
On the other hand, the simulated datasets used for the
evaluation do not include image or video files. In reality,
EHR data is heterogeneous and includes variable data types.
To handle a variety of data types, a polyglot-persistent
systems approach for storing and querying EHR data was
previously proposed (Kaur and Rani, 2015a, 2015b).
Furthermore, EHR implementation tutorials recommend a
hybrid approach if it improves the querying performance
(Gutiérrez et al., 2015). Further research is required to
determine which database technology and implementation
approach fits each type of openEHR data and how these
technologies can be integrated smoothly in a
polyglot-persistent systems schema.
Finally, we note that the use-cases explored in this study
are limited in number and do not represent a full usage
scenario, nor do they consider concurrent transactions,
which are essential when evaluating the performance of
clinical querying over an EHR repository.
6 Conclusions
Aligning with the need for scalable archetype-based
repository implementations, we proposed, tested and
evaluated an approach for storing and querying openEHR
archetype-based data using a labelled property graph
database. We compared the proposed approach
implemented over Neo4j with the ORM approach
implemented over Microsoft SQL server in terms of query
response time and required storage space. The evaluation
was performed using different size datasets and a query set
simulating a pregnancy home-monitoring application. The
experimental results showed that the proposed
implementation is more space efficient for larger datasets
and results in lower query response times for clinical
queries. This work encourages further research on graph
databases as a possible alternative to conventional database
technologies for clinical repositories.
Acknowledgements
This research was partly funded by JP16686872
Grant-in-Aid for Exploratory Research and JP15611340
Grant-in-Aid for Scientific Research (C).
References
Agrawal, A. (2002) ‘Return on investment analysis for a
computer-based patient record in the outpatient clinic setting’,
Journal of the Association for Academic Minority Physicians:
the Official Publication of the Association for Academic
Minority Physicians, Vol. 13, No. 3, pp.61–65.
Angles, R. (2012) ‘A comparison of current graph database
models’, presented at the IEEE 28th International Conference
on Data Engineering Workshops (ICDEW), pp.171–177.
Ash, J.S. and Bates, D.W. (2005) ‘Factors and forces affecting
EHR system adoption: report of a 2004 ACMI discussion’,
Journal of the American Medical Informatics Association,
Vol. 12, No. 1, pp.8–12.
Balsamo, S., Marzolla, M., Di Marco, A. and Inverardi, P. (2004)
‘Experimenting different software architectures performance
techniques: a case study’, presented at the ACM SIGSOFT
Software Engineering Notes, Vol. 29, pp.115–119.
Beale, T. (2002) ‘Archetypes: constraint-based domain models for
future-proof information systems’, OOPSLA 2002 Workshop
on Behavioural Semantics.
Begoyan, A. (2007) ‘An overview of interoperability standards for
electronic health records’, Integrated Design and Process
Technology, Society for Design and Process Science, USA.
Belden, J.L., Grayson, R. and Barnes, J. (2009) ‘Defining and
testing EMR usability: principles and proposed methods of
EMR usability evaluation and rating’, Healthcare Information
and Management Systems Society (HIMSS).
Cornwall, A. (2002) ‘Electronic health records: an international
perspective’, Health Issues, Vol. 73, pp.19–23.
Cottle, M., Hoover, W., Kanwal, S., Kohn, M., Strome, T.
and Treister, N. (2013) ‘Transforming health care
through big data: strategies for leveraging big data in the
health care industry’, Institute for Health Technology
Transformation [online] http://c4fd63cb482ce6861463-
HT2_BigData_2013.pdf (accessed 3 November 2017).
Frade, S., Freire, M.S. and Sundvall, E. (2013) ‘Survey of
openEHR storage implementations’, IEEE 26th International
Symposium on Computer-Based Medical Systems (CBMS),
IEEE, pp.303–307.
Freire, S.M., Sundvall, E. and Karlsson, D. (2012) ‘Performance of
XML databases for epidemiological queries in archetype-
based EHRs’, presented at the Scandinavian Conference on
Health Informatics.
Freire, S.M., Teodoro, D., Wei-Kleiner, F., Sundvall, E., Karlsson,
D. and Lambrix, P. (2016) ‘Comparing the performance of
NoSQL approaches for managing archetype-based electronic
health record data’, PloS One, Vol. 11, No. 3, p.e0150069.
Frost, S. (2015) Drowning in Big Data? Reducing Information
Technology Complexities and Costs for Healthcare
Garde, S., Knaup, P., Hovenga, E. and Heard, S. (2007) ‘Towards
semantic interoperability for electronic health records’,
Methods of Information in Medicine, Vol. 46, No. 3,
Green, J. (2008) Comparison of the Relative Performance of XML
and SQL Databases in the Context of the Grid-SAFE Project,
University of Edinburgh.
Gutiérrez, P.P., Atalag, K., Marco-Ruiz, L., Sundvall, E. and
Freire, S.M. (2015) ‘Design and implementation of clinical
databases using openEHR’, MEDINFO 2015 Conference, Sao
Paulo, Brazil [online]
openehr (accessed 3 November 2017).
Hamid, F. and Cline, T. (2013) ‘Providers’ acceptance factors and
their perceived barriers to electronic health record (EHR)
adoption’, Online Journal of Nursing Informatics, Vol. 17,
No. 3 [online]
Hunger, M., Boyd, R. and Lyon, W. (2016) ‘The definitive guide
to graph databases for the RDBMS developer’, Neo
Technology [online]
Developer.pdf (accessed 01 August 2017).
IT Strategic Headquarters (2009) I-Japan Strategy 2015, Striving
to Create a Citizen-Driven, Reassuring & Vibrant Digital
Society (accessed 3 November 2017).
Jagadish, H.V., Qian, L. and Nandi, A. (2015) ‘Organic databases’,
International Journal of Computational Science and
Engineering, Vol. 11, No. 3, pp.270–283.
Kalra, D. and Blobel, B. (2007) ‘Semantic interoperability of EHR
systems’, Studies in Health Technology and Informatics,
Vol. 127, p.231.
Kaur, K. and Rani, R. (2015a) ‘A smart polyglot solution for big
data in healthcare’, IT Professional, Vol. 17, No. 6, pp.48–55.
Kaur, K. and Rani, R. (2015b) ‘Managing data in healthcare
information systems: many models, one solution’, Computer,
Vol. 48, No. 3, pp.52–59.
King, J., Patel, V., Jamoom, E.W. and Furukawa, W.F. (2014)
‘Clinical benefits of electronic health record use: national
findings’, Health Services Research, Vol. 49, No. 1pt2,
Li, W-S., Hsiung, W-P., Po, O., Candan, K.S. and
Agrawal, D. (2002) ‘Evaluations of architectural designs and
implementation for database-driven web sites’, Data &
Knowledge Engineering, Vol. 43, No. 2, pp.151–177.
Li, X. and Bao, Z. (2017) ‘A comprehensive performance study of
HTML5-enabled WebApps’, International Journal of
Embedded Systems, Vol. 9, No. 2, pp.119–129.
Librelotto, G.R. et al. (2015) ‘OntoHealth: a system to process
ontologies applied to health pervasive environment’,
International Journal of Computational Science and
Engineering, Vol. 10, No. 4, pp.359–367.
Liu, D. and Xiao, P. (2016) ‘An energy-efficient adaptive resource
provision framework for cloud platforms’, International
Journal of Computational Science and Engineering, Vol. 13,
No. 4, pp.346–354.
Megginson, D. (2004) Imperfect XML: Rants, Raves, Tips, and
Tricks from an Insider, Addison-Wesley Professional,
Boston, USA.
Microsoft (2016) Microsoft SQL Server 2016 [online]
(accessed 01 August 2017).
Minakami, H. et al. (2011) ‘Guidelines for obstetrical practice in
Japan: Japan society of obstetrics and gynecology (JSOG) and
Japan association of obstetricians and gynecologists (JAOG)
2011 edition’, Journal of Obstetrics and Gynaecology
Research, Vol. 37, No. 9, pp.1174–1197.
Neo4j, The Property Graph [online]
developer/graph-database/#property-graph/ (accessed 19 July
Neo4j Inc., Intro to Cypher [online]
cypher-query-language/ (accessed 3 November 2017).
Nielsen, J. (1993) Usability Engineering, Morgan Kaufmann
Publishers Inc. San Francisco, CA, USA.
Ocean Informatics, Ocean Template Designer [online] (accessed
01 August 2017).
openEHR (2007) Archetype Meta-Architecture [online]
view/Output/design_principles.html (accessed 19 July 2016).
openEHR (2008a) Archetype Query Language (AQL) [online]
QL.html (accessed 3 November 2017).
openEHR (2008b) Architecture Overview [online]
pdf (accessed 3 November 2017).
openEHR (2008c) Node+Path Persistence [online]
pageId=6553626 (accessed 01 August 2017).
openEHR (2016) Clinical Knowledge Manager [online] (accessed 19 July 2016).
openEHR, openEHR Foundation [online]
(accessed 13 August 2016).
Robinson, I., Webber, J. and Eifrem, E. (2015) Graph Databases:
New Opportunities for Connected Data, O’Reilly Media, Inc,
California, USA.
Schloeffel, P., Beale, T., Hayworth, G., Heard, S. and Leslie, H.
(2006) The Relationship Between CEN 13606, HL7, and
Shen, J.J., Cochran, C.R., Neish, S., Moseley, C.B. and Mukalian,
R. (2015) ‘Level of EHR adoption and quality and cost of
care-evidence from vascular conditions and procedures’,
International Journal of Healthcare Technology and
Management, Vol. 15, No. 1, pp.4–21.
Sundvall, E., Siivonen, D. and Örman, H. (2016) ‘Approaches to
learning openEHR: a qualitative survey, observations, and
suggestions’, presented at the Proceedings from The 14th
Scandinavian Conference on Health Informatics, Gothenburg,
Sweden, 6–7 April, pp.29–36.
Vishwanath, A. and Scamurra, S.D. (2007) ‘Barriers to the
adoption of electronic health records: using concept mapping
to develop a comprehensive empirical model’, Health
Informatics Journal, Vol. 13, No. 2, pp.119–134.
Wang, L., Min, L., Wang, R., Lu, X. and Duan, H. (2015)
‘Archetype relational mapping-a practical openEHR
persistence solution’, BMC Medical Informatics and Decision
Making, Vol. 15, No. 1, p.88.
Appendix
Figures 16 to 20 plot the query response time (ms) for queries Q1 to Q7 in each dataset.
Figure 16 Query response time over the 1K dataset (see online version for colours)
Figure 17 Query response time over the 5K dataset (see online version for colours)
Figure 18 Query response time over the 10K dataset (see online version for colours)
Figure 19 Query response time over the 50K dataset (see online version for colours)
Figure 20 Query response time over the 100K dataset (see online version for colours)
... For such clinical practice, healthcare providers typically perform create, read, update, and destroy (CRUD) operations to retrieve and modify a relatively small number of several EHR extracts easily. Minimizing the response time of these CRUD operations may enhance EHRs' usability and functionality [31]. A fundamental principle in medical systems is that clinical data cannot be overwritten. ...
... Graph databases have been recently introduced as a potential alternative to relational databases for handling graph-like data structures [73,74]. A graph-based implementation method [31] was suggested and evaluated for an archetype-oriented repository utilizing a labeled property graph database. This method was used as an alternative to traditional relational database architecture for clinical data storage. ...
... As the RM includes several classes in a deep tree hierarchy, it has a graph-like architecture. As a result, mapping it to a graph model and storing it in a graph database would be easy [31]. ...
With the extensive adoption of electronic health records (EHRs) by several healthcare organizations, more efforts are needed to manage and utilize such massive, various, and complex healthcare data. Databases' performance and suitability to health care tasks are dramatically affected by how their data storage model and query capabilities are well-adapted to the use case scenario. On the other hand, standardized healthcare data modeling is one of the most favorable paths for achieving semantic interoperability, facilitating patient data integration from different healthcare systems. This paper compares the state-of-the-art of the most crucial database management systems used for storing standardized EHRs data. It discusses different database models' appropriateness for meeting different EHRs functions with different database specifications and workload scenarios. Insights into relevant literature show how flexible NoSQL databases (document, column, and graph) effectively deal with standardized EHRs data's distinctive features, especially in the distributed healthcare system, leading to better EHR.
... There are several criteria, such as data model, performance, data persistence, and CAP support [19], which must be considered when choosing which NoSQL store to be used. Various data modeling approaches [3,[20][21][22][23][24][25][26][27] have been introduced for medical data persistence according to use case scenarios. These works investigate not only the type of NoSQL store that has to be chosen but which NoSQL products in that type will be used [19]. ...
... OpenEHR-related research is gradually becoming one of the most discussed semantic interoperability-related research topics. Such research involves archetype modeling [17][18][19][20][21][22][23], data persistence [24][25][26], language design [27], model mapping [28], model retrieval [29,30], and reuse [19]. ...
Background The semantic interoperability of health care information has been a critical challenge in medical informatics and has influenced the integration, sharing, analysis, and use of medical big data. International standard organizations have developed standards, approaches, and models to improve and implement semantic interoperability. The openEHR approach—one of the standout semantic interoperability approaches—has been implemented worldwide to improve semantic interoperability based on reused archetypes. Objective This study aimed to verify the feasibility of implementing semantic interoperability in different countries by comparing the openEHR-based information models of 2 acute coronary syndrome (ACS) registries from China and New Zealand. Methods A semantic archetype comparison method was proposed to determine the semantics reuse degree of reused archetypes in 2 ACS-related clinical registries from 2 countries. This method involved (1) determining the scope of reused archetypes; (2) identifying corresponding data items within corresponding archetypes; (3) comparing the semantics of corresponding data items; and (4) calculating the number of mappings in corresponding data items and analyzing results. Results Among the related archetypes in the two ACS-related, openEHR-based clinical registries from China and New Zealand, there were 8 pairs of reusable archetypes, which included 89 pairs of corresponding data items and 120 noncorresponding data items. Of the 89 corresponding data item pairs, 87 pairs (98%) were mappable and therefore supported semantic interoperability, and 71 pairs (80%) were labeled as “direct mapping” data items. Of the 120 noncorresponding data items, 114 (95%) data items were generated via archetype evolution, and 6 (5%) data items were generated via archetype localization. 
Conclusions The results of the semantic comparison between the two ACS-related clinical registries prove the feasibility of establishing the semantic interoperability of health care data from different countries based on the openEHR approach. Archetype reuse provides data on the degree to which semantic interoperability exists when using the openEHR approach. Although the openEHR community has effectively promoted archetype reuse and semantic interoperability by providing archetype modeling methods, tools, model repositories, and archetype design patterns, the uncontrolled evolution of archetypes and inconsistent localization have resulted in major challenges for achieving higher levels of semantic interoperability.
... To ensure the success of our approach, the systems needs to be initially built in a way that makes their adaptation simple. To do so, the designers can follow existing frameworks that allow them to develop adaptable software architectures [23]- [25]. After these systems are implemented and used, the designers can follow our approach to understand which features to redesign. ...
Full-text available
Electronic Medical Record (EMR) systems are the computers used inside healthcare clinics. EMR systems have multiple stakeholders whose needs continuously evolve. The traditional EMR design approach focuses on designing systems that perfectly fit the requirements of some stakeholders as they are understood in the initial design stages. This results in EMR systems that do not answer all the stakeholder's needs and quickly become outdated. To address the limitations of the traditional EMR design approach, we propose a utilitarian redesign approach for EMR systems. By "utilitarian redesign", we mean that the designers continuously redesign the EMR system with the aim of maximizing the satisfaction of all the stakeholders. Our approach allows the designers to (i) identify the features to redesign and (ii) to know which features would bring the largest good to the largest number of stakeholders. We showcase the approach using a case study of redesigning an EMR system in Japanese antenatal care settings. We also evaluate our approach with 21 participants split over 7 workshops. Our results showed that the approach provides useful information to help the designers make utilitarian redesign choices. Even though our approach was applied to EMR systems, it may also be applied to redesign other complex socio-technical systems and potentially maximize the good for the largest number of stakeholders.
Full-text available
Graph representation learning is a method for introducing how to effectively construct and learn patient embeddings using electronic medical records. Adapting the integration will support and advance the previous methods to predict the prognosis of patients in network models. This study aims to address the challenge of implementing complex and highly heterogeneous dataset, including the following: (1) demonstrating how to build a multi-attributed and multi-relational graph model (2) and applying a downstream disease prediction task of patient’s prognosis using HinSAGE algorithm. We present a bipartite graph schema and a graph database construction in detail. The first constructed graph database illustrates a query of a predictive network which provides analytical insights using graph representation of a patient’s journey. Moreover, we demonstrate an alternative bipartite model where we apply the model to the HinSAGE to perform the link prediction task for predicting the event occurrence. Consequently, the performance evaluation indicated that our heterogeneous graph model successfully predicted as baseline models. Overall, our graph database successfully demonstrated efficient real-time query performance and showed HinSAGE implementation to predict cardiovascular diseases event outcomes on supervised link prediction learning.
Background The widespread adoption of electronic health records (EHRs) has facilitated the secondary use of EHR data for clinical research. However, screening eligible patients from EHRs is a challenging task. The concepts in eligibility criteria are not completely matched with EHRs, especially derived concepts. The lack of high-level expression of Structured Query Language (SQL) makes it difficult and time consuming to express them. The openEHR Expression Language (EL) as a domain-specific language based on clinical information models shows promise to represent complex eligibility criteria. Objective The study aims to develop a patient-screening tool based on EHRs for clinical research using openEHR to solve concept mismatch and improve query performance. Methods A patient-screening tool based on EHRs using openEHR was proposed. It uses the advantages of information models and EL in openEHR to provide high-level expressions and improve query performance. First, openEHR archetypes and templates were chosen to define concepts called simple concepts directly from EHRs. Second, openEHR EL was used to generate derived concepts by combining simple concepts and constraints. Third, a hierarchical index corresponding to archetypes in Elasticsearch (ES) was generated to improve query performance for subqueries and join queries related to the derived concepts. Finally, we realized a patient-screening tool for clinical research. Results In total, 500 sentences randomly selected from 4691 eligibility criteria in 389 clinical trials on stroke from the Chinese Clinical Trial Registry (ChiCTR) were evaluated. An openEHR-based clinical data repository (CDR) in a grade A tertiary hospital in China was considered as an experimental environment. Based on these, 589 medical concepts were found in the 500 sentences. Of them, 513 (87.1%) concepts could be represented, while the others could not be, because of a lack of information models and coarse-grained requirements. 
In addition, our case study on 6 queries demonstrated that our tool shows better query performance among 4 cases (66.67%). Conclusions We developed a patient-screening tool using openEHR. It not only helps solve concept mismatch but also improves query performance to reduce the burden on researchers. In addition, we demonstrated a promising solution for secondary use of EHR data using openEHR, which can be referenced by other researchers.
Conference Paper
Full-text available
There are very few published studies regarding the performance of persistence mechanisms for systems that use the openEHR multi level modelling approach. This paper addresses the performance and size of XML databases that store openE\HR compliant documents. Database size and response times to epidemiological queries are described. An anonymized relational epidemiology database and associated epidemiological queries were used to generate openEHR XML documents that were stored and queried in four open-source XML databases. The XML databases were considerably slower and required much more space than the relational database. For population-wide epidemiological queries the response times scaled in order of magnitude at the same rate as the number of records (total database size) but were orders of magnitude slower than the original relational database. For individual focused clinical queries where patient ID was specified the response times were acceptable. This study suggests that the tested XML database configurations without further optimizations are not suitable as persistence mechanisms for openEHR-based systems in production if population-wide ad hoc querying is needed.
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. 
Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.
One of the primary obstacles to the widespread adoption of openEHR methodology is the lack of practical persistence solutions for future-proof electronic health record (EHR) systems as described by the openEHR specifications. This paper presents an archetype relational mapping (ARM) persistence solution for the archetype-based EHR systems to support healthcare delivery in the clinical environment. First, the data requirements of the EHR systems are analysed and organized into archetype-friendly concepts. The Clinical Knowledge Manager (CKM) is queried for matching archetypes; when necessary, new archetypes are developed to reflect concepts that are not encompassed by existing archetypes. Next, a template is designed for each archetype to apply constraints related to the local EHR context. Finally, a set of rules is designed to map the archetypes to data tables and provide data persistence based on the relational database. A comparison study was conducted to investigate the differences among the conventional database of an EHR system from a tertiary Class A hospital in China, the generated ARM database, and the Node + Path database. Five data-retrieving tests were designed based on clinical workflow to retrieve exams and laboratory tests. Additionally, two patient-searching tests were designed to identify patients who satisfy certain criteria. The ARM database achieved better performance than the conventional database in three of the five data-retrieving tests, but was less efficient in the remaining two tests. The time difference of query executions conducted by the ARM database and the conventional database is less than 130 %. The ARM database was approximately 6–50 times more efficient than the conventional database in the patient-searching tests, while the Node + Path database requires far more time than the other two databases to execute both the data-retrieving and the patient-searching tests. 
The ARM approach is capable of generating relational databases using archetypes and templates for archetype-based EHR systems, thus successfully adapting to changes in data requirements. ARM performance is similar to that of conventionally-designed EHR systems, and can be applied in a practical clinical environment. System components such as ARM can greatly facilitate the adoption of openEHR architecture within EHR systems.
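The general idea of mapping an archetype to a relational table can be sketched as follows. The two rules used here (table name derived from the archetype identifier, one column per leaf node) are assumptions made for illustration and are not the mapping rules of the ARM paper; `SQL_TYPES`, `table_name`, `create_table_sql`, and the leaf list are hypothetical. SQLite is used only so that the sketch is self-contained.

```python
import sqlite3
from typing import Dict, List, Tuple

# Assumed mapping from openEHR data value types to SQLite column types.
SQL_TYPES: Dict[str, str] = {
    "DV_QUANTITY": "REAL",
    "DV_COUNT": "INTEGER",
    "DV_TEXT": "TEXT",
    "DV_DATE_TIME": "TEXT",
}


def table_name(archetype_id: str) -> str:
    """Hypothetical rule: 'openEHR-EHR-OBSERVATION.blood_pressure.v1'
    becomes 'observation_blood_pressure_v1'."""
    rm_part, concept, version = archetype_id.split(".")
    rm_class = rm_part.split("-")[-1].lower()
    return f"{rm_class}_{concept}_{version}"


def create_table_sql(archetype_id: str, leaves: List[Tuple[str, str]]) -> str:
    """Hypothetical rule: one column per leaf node, plus a key and an EHR id."""
    columns = ["id INTEGER PRIMARY KEY", "ehr_id TEXT NOT NULL"]
    columns += [f"{name} {SQL_TYPES[rm_type]}" for name, rm_type in leaves]
    return f"CREATE TABLE {table_name(archetype_id)} ({', '.join(columns)})"


ARCHETYPE = "openEHR-EHR-OBSERVATION.blood_pressure.v1"
# Simplified leaf nodes chosen for illustration; a real archetype has more.
LEAVES = [
    ("systolic", "DV_QUANTITY"),
    ("diastolic", "DV_QUANTITY"),
    ("recorded_at", "DV_DATE_TIME"),
]

conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(ARCHETYPE, LEAVES))
conn.execute(
    "INSERT INTO observation_blood_pressure_v1 "
    "(ehr_id, systolic, diastolic, recorded_at) VALUES (?, ?, ?, ?)",
    ("ehr-001", 150.0, 95.0, "2019-01-01T09:00:00"),
)
# A patient-searching query becomes a plain column filter on the generated table.
rows = conn.execute(
    "SELECT ehr_id FROM observation_blood_pressure_v1 WHERE systolic >= 140"
).fetchall()
print(rows)
```

With one table per archetype, a criterion on a single clinical concept needs no path lookups, which is one plausible reason a table-per-archetype layout could be faster than a generic node-and-path layout for patient-searching queries.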
This study examined relationships of electronic health record (EHR) adoption to both the cost of care and quality outcomes in the acute care hospital setting. Data were mainly obtained from the 2009 National Inpatient Sample and the 2009 American Hospital Association EHR implementation survey. Two sets of dependent variables were identified. The first set included quality indicators of five cardiovascular and three cerebrovascular conditions and procedures. The second set included cost of care for the eight quality indicators. The independent variables were levels of EHR adoption. The results did not identify many differences in quality indicators across levels of EHR adoption, but consistently showed that patients in hospitals with EHR systems incurred lower costs than patients in hospitals without a comprehensive or basic EHR system. It was concluded that EHR adoption is more likely to be associated with the cost of patient care than improving quality indicators and clinical outcomes.
Web applications (WebApps) built on top of HTML standards have the advantage of cross-platform portability. However, the user experience in terms of both functionality and performance provided by current generation WebApps is not comparable to native apps running on iOS or Android. Despite significant functional extensions in the new HTML5 standard, the poor performance of WebApps has not been addressed. To promote the adoption of WebApps, it is important to study their performance comprehensively. In this paper, we take Google Chrome as the target browsing engine, and evaluate its performance with a set of popular web pages and WebApps. We make in-depth analysis on the major performance-contributing aspects in a browser engine. Our study exposes a number of interesting observations, from which we make reasonings and provide suggestions for the optimisation of browser engines, as well as guidelines for developing efficient web-based applications.
In cloud computing, resource provision service plays an important role for operating large-scale datacentres. Conventional resource provision policies or services mainly concentrate on optimising costs and application execution performance. In this paper, we present an integrated and adaptive resource provision framework, which is based on our previous work on performance monitor in cloud environments. In the proposed framework, several novel mechanisms are implemented, aiming at improving the energy-efficiency as well as the execution performance for cloud systems. Extensive experiments are conducted to evaluate the performance of the proposed framework in terms of different metrics. The experimental results show that the proposed framework can significantly improve the energy-efficiency metric, especially when a cloud system is in presence of intensive hybrid workloads.
With the 2014 governmental deadline for nationwide implementation of the electronic health records (EHR) approaching, healthcare systems need to ensure successful EHR adoption among their providers. Recent reports indicate that only 55 percent of physicians nationwide have adopted the EHR (Jamoom, Beatty, Bercovitz, Woodwell, Palso, & Rechtsteiner, 2012). This study explored acceptance factors and barriers associated with providers' intention to adopt EHR by provider types (physicians and advanced practice providers). Physicians (n=24) and advanced practice providers (n=20) employed in acute care settings at a community healthcare system participated in the study. The participants in this study indicated that perceived management support, provider involvement, and adequate training were facilitators. Perceived lack of usefulness and provider autonomy were barriers (p<0.01). Advanced practice providers found EHR marginally easier to use, but were less inclined to accept EHR in clinical practice compared to physicians. Increasing age was negatively correlated with EHR adoption for physicians only (r=-.476, p<.05).
Healthcare information systems (HISs) are multifarious in nature. They are thus best implemented using multiple data stores because one database won't fulfill all the storage requirements of such complex applications. The amalgamation of different databases within an application is known as polyglot persistence. To achieve a polyglot-persistent solution, different database types must be available. As late as 2005, relational databases ruled as the de facto databases, but their reign has been challenged by nonrelational databases, known as NoSQL data stores, making polyglot persistence possible. Applications' persistence needs are progressing from mostly relational to a mixture of data stores. For example, various HIS modules should use different data stores to model data closer to its semantic usage. The authors' PolyglotHIS solution lets managers easily venture into polyglot-persistent software because all data stores are free and open source. Although polyglot persistence generates overhead from dealing with multiple data stores, the benefits are worth every penny. With PolyglotHIS, overhead in terms of latency caused by the presence of multiple layers is so small that it becomes negligible as the dataset increases.
Because healthcare data comes from multiple, vastly different sources, databases must adopt a range of models to process and store it. A polyglot-persistent framework combines relational, graph, and document data models to accommodate information variety.