The 10th Conference for Informatics and Information Technology (CIIT 2013)
©2013 Faculty of Computer Science and Engineering
LINKED OPEN DRUG DATA FROM THE HEALTH INSURANCE FUND OF MACEDONIA
Milos Jovanovik, Bojan Najdenov, Dimitar Trajanov
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University
Skopje, Republic of Macedonia
ABSTRACT
Information dissemination has always been a focus of
the computer science research community. New ways of
information and data representation, storage, querying and
visualization are being constantly developed and upgraded.
Linked Open Data represents a concept which offers a
comprehensive solution for information and data
dissemination. It accomplishes this by aiming towards two
things: to represent data in an open, machine-readable
format, and to interlink data from heterogeneous
repositories in a way which allows a large variety of usage
scenarios for both humans and machines. On the other hand,
health also represents a domain of high interest in our
research community. In order to provide use-case scenarios
for publishing and using healthcare data in Macedonia, we
generated a dataset of five-star Linked Open Data, based on
the data provided and published by the Health Insurance
Fund (HIF) of the Republic of Macedonia. In this paper, we
describe the process of transforming the data available at
the HIF website into data published in an open format and
interlinked with data from the DrugBank domain.
I. INTRODUCTION
The basic idea behind the Open Data concept is that data
which can be considered public should be available in a raw
and machine-readable format, for the purposes of use,
reuse, republishing and redistribution, with few or no
restrictions. When public datasets are published in an open
format, they can be used for building useful applications
which leverage their value and offer different use-cases for
the interested parties [1]. Furthermore, these datasets can
contribute to the overall development of the society, by both
boosting the ICT business sector with new business value and
providing the stakeholders with new functionalities [2].
On the other hand, the concept of Linked Data provides
mechanisms for interlinking data from different repositories
distributed on the Web, in order to provide better data
usage and querying scenarios [3]. Linked Data is, in a way,
synonymous with the Semantic Web, since its main
goal is to interlink data on the Web by their meaning. The
Linked Data techniques rely on identifying resources with
URIs, providing data about these resources and connecting
them to other resources on the Web, by using standards
such as the Resource Description Framework (RDF) [4].
The Linked Open Data Cloud diagram [5] (Fig. 1) shows the
datasets which have been published in Linked Data format,
along with their interconnections. The figure depicts the
state as of September 2011. The datasets are grouped and
shown in different colours based on their domains: media,
geographic data, publications, user-generated content,
government, cross-domain, and health and life sciences.
Figure 1. The LOD Cloud.
Health is a research area which, when it comes to data and
information, is one of the main topics of interest for
computer scientists. Over the years, many different
approaches for representation, storage, querying and
visualization of health data have been developed. The new
techniques employed by the Linked Open Data community
offer new ways for covering these areas of interest for health
data. This is one of the main reasons we decided to work
with data from the Health Insurance Fund of the Republic of
Macedonia.
A. Linked Open Data Rating System
In order to encourage the governments, institutions and
people in general to provide and use Linked Open Data, a
star rating system for data has been developed [4].
Figure 2. The Star Rating System.
According to the rating system, any information
published online, regardless of the format, which is made
public under an open licence, can be considered Open Data,
and gets one star. This can be an image, a scan, a PDF file,
etc.
If the data is publicly available on the Web as machine-
readable structured data, then it gets two stars. This can be
an Excel spreadsheet instead of a scanned image. Three stars
are given to data published as structured and machine-
readable, but in a non-proprietary format, such as CSV
instead of Excel.
If the data complies with all of the above rules, and
additionally uses Semantic Web standards (RDF, OWL,
SPARQL) to identify things, so that people can point to it on
the Web, it gets four stars. Five stars are awarded to data
which comply with all of the above rules, and additionally
link to other people’s data, to provide context.
Therefore, when publishing Open Data on the Web, the
most desirable format is the five-star Linked Open Data
format, and it is a goal all publishers should aim for.
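The cumulative nature of the rating can be illustrated with a minimal Python sketch. The `Dataset` class and its field names are hypothetical and only restate the five criteria described above; each star is awarded only if all previous criteria are also met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a published dataset."""
    open_licence: bool         # published on the Web under an open licence
    machine_readable: bool     # structured data, e.g. a spreadsheet
    non_proprietary: bool      # open format, e.g. CSV instead of Excel
    uses_rdf_standards: bool   # things are identified using RDF, SPARQL, etc.
    links_to_other_data: bool  # links to other people's data for context


def star_rating(d: Dataset) -> int:
    """Return 0 to 5 stars; each level requires all of the previous ones."""
    criteria = [
        d.open_licence,
        d.machine_readable,
        d.non_proprietary,
        d.uses_rdf_standards,
        d.links_to_other_data,
    ]
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break  # a failed criterion caps the rating at the current level
        stars += 1
    return stars
```

Under these assumptions, a scanned PDF with an open licence rates one star and an Excel file rates two.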
II. RELATED WORK
Numerous efforts have been made worldwide so far for
transforming healthcare data into Linked Data. The most
notable are the Linking Open Drug Data (LODD) project,
LinkedCT, Open Biological and Biomedical Ontologies (OBO),
and the Semantic Web Health Care and Life Sciences Interest
Group at W3C.
The LODD project¹ is focused on interlinking data about
drugs already existing on the Web [6]. The data ranges from
impact of the drugs on gene expression, through to results of
clinical trials. The aim of the project is to enable answering of
interesting scientific and business questions by interlinking
previously separated data about drugs and healthcare. As
part of their work, they have collected datasets with over 8
million RDF triples, interlinked with more than 370,000 RDF
links (Fig. 3).
Figure 3. Part of the LODD Cloud.
One of the datasets, which is a part of the LODD cloud, is
DrugBank². It provides RDF data about drugs, such as
chemical, pharmacological and pharmaceutical information,
taken from an existing database³ of drug data on the web. The
DrugBank RDF dataset contains over 766,000 RDF triples for
4,800 drugs. Because of its size, we decided to use this
dataset as a reference point for the drug data from the
Health Insurance Fund which we describe, publish and
interlink.
1 http://www.w3.org/wiki/HCLSIG/LODD
2 http://wifo5-03.informatik.uni-mannheim.de/drugbank/
3 http://drugbank.ca
LinkedCT⁴ is a project aimed at publishing clinical trials
data in a Linked Data format [7]. They transform existing
clinical trials data into RDF, discover semantic links between
the records themselves, and link to other data sources such
as PubMed⁵, as well. The datasets from LinkedCT are also
part of the LODD project (Fig. 3).
OBO Foundry⁶, the Open Biological and Biomedical
Ontologies project, is a collaborative effort involving biology
researchers and ontology developers who work together to
develop a set of design principles for ontology development
in the biomedical domain. As a result of the project, eight
ontologies have been developed, and a large number of
others hold the status of candidate ontologies. The domains
of the ontologies are mainly biological processes, anatomy,
biochemistry, proteins, etc.
The Semantic Web Health Care and Life Sciences⁷ is an
interest group at the World Wide Web Consortium (W3C),
comprised of experts from around 30 W3C member
organizations: research centers, universities, companies,
health institutions, etc. Its mission is to develop and support
the use of the technologies of the Semantic Web in the fields
of healthcare, life sciences, clinical research and translational
medicine [8]. It is comprised of various subgroups, which are
focused on making the biomedical data available in RDF,
developing and maintaining biomedical ontologies, etc.
In Macedonia, the Open Data and the Open Government
Data initiatives started around 2011. The Macedonian
Government officially joined the global Open Government
Partnership (OGP) initiative⁸ in April 2012, and took part in
the annual OGP meeting in Brazil, the same month. After a
few months of gathering data from the ministries and the
government institutions, the Macedonian Government
published the official Macedonian Open Government Data
portal⁹. The portal currently holds open data from twelve
ministries, six government institutions and three
independent institutions. The data on this portal are mostly
published as one-, two-, or three-star data, or represent links
to already existing and specialized applications which the
different institutions have developed for their needs in the
previous years. From the healthcare sector, the portal holds
data from the Ministry of Health, which is mainly data about
public health institutions and their licenses, published as
three-star data, in CSV format.
4 http://linkedct.org/
5 http://www.ncbi.nlm.nih.gov/pubmed
6 http://obofoundry.org/
7 http://www.w3.org/blog/hcls/
8 http://www.opengovpartnership.org/
9 http://opendata.gov.mk/
Besides the official Government activities, there have been
other Open Data and Linked Data activities in Macedonia,
mainly from academia, which include the development
of a Crime Map for the Republic of Macedonia [9], based on
the public bulletins from the Ministry of Internal Affairs, as
well as the opening and linking of data from the Universities
in Macedonia [10]. Our Faculty had an industry project from
which two Open Data mobile and web applications were
developed based on the data from the Health Insurance
Fund of Macedonia. The data we used for the project were
three-star data, in CSV and XML format, acquired by
transforming the publicly available data from the Fund which
they publish on their website.
Apart from these, there have not been other Open Data
and Linked Data activities involving healthcare data from
Macedonia.
III. LINKED OPEN DATA FROM THE HEALTH INSURANCE FUND
The Health Insurance Fund of the Republic of Macedonia is
an institution which is responsible for regulating and
managing the public services for primary healthcare,
specialist healthcare, and hospital healthcare. Additionally,
the Fund along with other government institutions regulates
the list of drugs which are covered by the health insurance,
and defines the referent (nominal) prices for certain drugs.
Given this position in society, we believe that the data
which the Fund works with is of high importance, and that there
would be a great benefit in opening their public data in RDF,
and interlinking it with other datasets from the LOD and
LODD clouds.
A. Public Data from the Fund
The Health Insurance Fund of the Republic of Macedonia
has been publishing their public data on a regular basis on
their website¹⁰. These data contain information about
healthcare services and their prices, statistics about the rate
of usage of hospital beds, reports from the inspections in the
public and private healthcare institutions, financial data
about the Fund, insurance information, referent drug prices,
private and public healthcare institutions which the Fund
works with, etc. The Fund has not yet published its data on
the official Macedonian Open Government Data portal.
10 http://www.fzo.org.mk/
Although the data from the Fund’s website can be
technically considered as Open Data, they are mainly
published in PDF and Excel formats, making them only one-
star and two-star data. In order to leverage the usability of
the public data from the Fund, we decided to transform
them into five-star Linked Open Data: to first transform them
into RDF, and then interlink them with data from other
publicly available datasets from the LOD and LODD clouds.
As a starting point, we chose the drug datasets from the
Fund, which contain pharmacological and pharmaceutical
information, along with the referent price for different drugs.
B. Ontology
The Fund has published their public drug data in various
datasets, which contain different sets of information. These
datasets hold information about the brand name, the generic
name, the manufacturer, the referent price, the packaging,
the strength, and the dosage form for drugs. Additionally,
each drug is identified by an ID generated by the Fund, as
well as a globally identifiable ATC code, used for
classification of drugs and controlled by the World Health
Organization.
In order to transform and represent the drug data in RDF,
we needed an ontology. Following the best practices for
ontology development, we decided to re-use already existing
drug ontologies. In the process of choosing an ontology for
re-use, we had to bear in mind the interlinking part of the
process, which meant that we needed an ontology used by a
drug dataset which we would connect our data to, later in
the process. With this in consideration, we decided to use
the DrugBank RDF repository and its ontology.
Table 1. The properties from the DrugBank ontology which we use.
DrugBank property   Description
-----------------   --------------------------------
atcCode             The global ATC code of the drug.
genericName         The generic name of the drug.
brandName           The brand name of the drug.
The DrugBank ontology contains the class drugs, which
represents the drug entities. It also contains relations for the
ATC code, the generic name and the brand name. We used
the ‘drugs’ class along with the ‘atcCode’, ‘genericName’ and
‘brandName’ properties (Table 1, Fig. 4).
Figure 4. The HIFM Ontology.
Table 2. The properties in the HIFM ontology.
HIFM property     Description
---------------   -----------------------------------------------------------
id                The ID of the drug, as defined by the Health Insurance Fund.
manufacturer      The name of the manufacturer of the drug.
refPriceNoVAT     The referent (nominal) price in Macedonian denars (MKD), without the VAT tax.
refPriceWithVAT   The referent (nominal) price in Macedonian denars (MKD), with the VAT tax.
packaging         The type of packaging of the drug.
dosageForm        The dosage form of the drug.
strength          The strength of the active substance in the drug.
similarTo         This property points to other drugs which have the same active substance and indications, but may come in different strengths and from different manufacturers.
However, we still needed properties for describing the
other drug information, not covered by the DrugBank
ontology. Therefore, we developed our own ontology: the
HIFM ontology (Fig. 4). The HIFM ontology contains its own
class for drug type entities, ‘Drug’, seven datatype properties
and ‘similarTo’ as an object property (Table 2, Fig. 4).
Along with the properties taken from DrugBank, and the
properties defined in our HIFM ontology, we use the
‘rdfs:label’ and ‘owl:seeAlso’ properties. The ‘rdfs:label’
property is used to point to the generic name of the drug,
whereas ‘owl:seeAlso’ is used to link the drugs from our
HIFM graph with drugs from DrugBank. This will be
elaborated in more detail later in the paper.
C. Mapping the Data from CSV to RDF
The next step was to map and transform the public data
from Excel to RDF. For this, we decided to use the Virtuoso
Universal Server¹¹, which provides mechanisms for data
transformation and management, for various types of data,
including the Semantic Web standard representation format,
RDF. It serves as a Linked Data server, as well, and allows
local and remote data querying with the Semantic Web
query language, SPARQL.
The mapping process consisted of two steps. First, we
imported the CSV files (generated from the Excel files
available on the website of the Health Insurance Fund) into
relational databases in Virtuoso. Then, with the use of
R2RML¹², the W3C mapping language for transforming RDB data
into RDF data, which Virtuoso also supports, we created
RDF Views over the relational databases. These RDF Views
allow data management with the use of the technologies of
the Semantic Web, such as querying with SPARQL, over data
which resides in standard relational databases.
The R2RML mapping was done with the use of mapping
files, which contain information about the transformation of
the RDB tables, columns and cell values into RDF triples, with
a subject, a predicate, and an object. In this step we used our
HIFM ontology, as well as parts of the DrugBank ontology
which were previously discussed. As an identifier of the
drugs we chose the ID value, assigned to the drugs by the
Fund. Each drug was typed as both
‘drugbank:drugs’ and ‘hifm:drug’, and the values
for the ATC code, the generic and brand name, the dosage
form, the strength, etc., were described using the DrugBank
and HIFM properties (Fig. 4, Table 1, Table 2).
11 http://virtuoso.openlinksw.com/
12 http://www.w3.org/TR/r2rml/
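The row-to-triples idea behind this mapping can be sketched in plain Python. This is not R2RML and not the mapping files used in Virtuoso; it only illustrates how one table row becomes a set of triples. The column names are hypothetical, while the two namespace IRIs and the use of the Fund's ID as the drug identifier follow the paper.

```python
HIFM = "http://www.fzo.org.mk/ontology/hifm#"
DRUGBANK = ("http://wifo5-04.informatik.uni-mannheim.de"
            "/drugbank/resource/drugbank/")
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical CSV column name -> property IRI (Table 1 and Table 2).
COLUMN_TO_PROPERTY = {
    "atc_code": DRUGBANK + "atcCode",
    "generic_name": DRUGBANK + "genericName",
    "brand_name": DRUGBANK + "brandName",
    "manufacturer": HIFM + "manufacturer",
    "strength": HIFM + "strength",
}


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row: dict[str, str]) -> list[str]:
    """Turn one table row into N-Triples lines, keyed by the Fund's ID."""
    drug_id = row["id"]
    subject = f"<{HIFM}{drug_id}>"
    lines = [
        # Each drug gets both RDF types, as described in the text.
        f"{subject} <{RDF_TYPE}> <{DRUGBANK}drugs> .",
        f"{subject} <{RDF_TYPE}> <{HIFM}drug> .",
        f'{subject} <{HIFM}id> "{escape_literal(drug_id)}" .',
    ]
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value:  # skip missing or empty cells
            lines.append(f'{subject} <{prop}> "{escape_literal(value)}" .')
    return lines
```

Empty cells produce no triple, which matches the situation where different source files carry different subsets of the drug data.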
Since the different Excel files from HIF’s website contained
different subsets of drug data, the process resulted in several
different graphs, with different sets of information about the
drugs. In order to create one single graph with all of the
information, we used the SPARQL endpoint from Virtuoso
and with the use of the SPARQL query language, we
matched, combined and inserted the data from the other
graphs into one single graph. The matching of the drugs was
done by their ID, assigned by the Fund, which was present in
all of the Excel files.
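The merge step was carried out with SPARQL in Virtuoso; the following in-memory Python analogue only illustrates the matching logic. It assumes each partial graph is represented as a dictionary from the Fund's drug ID to a dictionary of properties, which is a simplification introduced here.

```python
from collections import defaultdict


def merge_by_id(*partial_graphs: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Combine several partial descriptions of drugs into one, matched by ID."""
    merged: dict[str, dict[str, str]] = defaultdict(dict)
    for graph in partial_graphs:
        for drug_id, properties in graph.items():
            for name, value in properties.items():
                # Keep the first value seen; later graphs fill in the gaps.
                merged[drug_id].setdefault(name, value)
    return dict(merged)
```

Because the ID is present in every source file, it works as the join key across all partial graphs.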
D. Transforming the RDF Data into Linked Open Data
Once we had all of the drug data in an RDF graph in
Virtuoso, we proceeded with interlinking the drugs among
themselves and with other drugs available in the LOD and
LODD clouds.
For the purpose of interlinking the drugs from the Fund
between themselves, we created a property in the HIFM
ontology, called ‘similarTo’ (Fig. 4, Table 2). This property
links Drug A to Drug B (and vice-versa), if the
first seven characters of their ATC codes match. Even though
ATC codes should have seven characters, the ATC codes which
the Fund assigns to the drugs in Macedonia contain ten
characters. These additional three characters are used for marking a
difference between drugs which have the same active
substance, but come in different strengths, packages and can
be from different manufacturers. So, in order to support a
use-case scenario in which a user would be interested in
drugs similar to the one he or she is looking for, we decided
to create a ‘similarTo’ relation between each two drugs from
our dataset which have the same first seven characters in the ATC
code. The relation is defined as both transitive and
symmetric in the ontology, which allows more flexibility in
the process of querying the data.
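The grouping rule can be sketched in Python as follows. The function name and the input representation (a dictionary from drug ID to the Fund's ten-character ATC code) are assumptions made for illustration; the three-character suffixes used in any example data would be hypothetical as well.

```python
from collections import defaultdict
from itertools import permutations


def similar_to_pairs(atc_by_id: dict[str, str], prefix_len: int = 7) -> set[tuple[str, str]]:
    """Return (drug, other_drug) pairs that share the same ATC prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for drug_id, atc in atc_by_id.items():
        groups[atc[:prefix_len]].append(drug_id)

    pairs: set[tuple[str, str]] = set()
    for members in groups.values():
        # permutations() yields both (a, b) and (b, a), which materializes
        # the symmetric relation; a drug alone in its group yields nothing.
        pairs.update(permutations(members, 2))
    return pairs
```

Since every drug in a group is paired with every other drug in it, the result is also closed under transitivity within each group.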
In order to transform our drug data into five-star Linked
Open Data, we needed relations in the RDF graph towards
outside entities. For this purpose, we decided to use the
DrugBank dataset, which is the largest and the most detailed
drug dataset on the Web. As in the process of
interlinking the drugs internally, we used the ATC codes to
detect the similarity between the drugs from our dataset and
the drugs from DrugBank. For this purpose, we matched the
first seven characters of the ten-character ATC code in our dataset
with the seven-character ATC code in the DrugBank dataset. Once
the drugs were matched, we added new triples within our
graph, denoting that the drug defined in our dataset had an
‘owl:seeAlso’ relation to the drug defined in the DrugBank
dataset. This relation provides new possibilities for data
querying, since we can now move from our local drug
dataset and get information which is not present locally, but
somewhere on the Web, in the LOD and LODD clouds.
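The cross-dataset matching can be sketched as a simple join in Python. Both inputs are assumed to be dictionaries from a drug identifier to its ATC code, and the DrugBank identifiers used with it would be placeholders; the actual matching was performed over the RDF data.

```python
def see_also_links(hifm_atc: dict[str, str], drugbank_atc: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each HIFM drug with the DrugBank drugs sharing its ATC prefix."""
    # Index the DrugBank drugs by their seven-character ATC code.
    by_code: dict[str, list[str]] = {}
    for db_id, code in drugbank_atc.items():
        by_code.setdefault(code, []).append(db_id)

    links = []
    for drug_id, code in sorted(hifm_atc.items()):
        # Compare only the first seven characters of the Fund's code.
        for db_id in by_code.get(code[:7], []):
            links.append((drug_id, db_id))
    return links
```

Each returned pair corresponds to one ‘owl:seeAlso’ triple from the local drug to the remote one; drugs without a matching code simply receive no link.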
An example RDF representation of a particular drug in our
graph, denoted with all of its properties and relations to
other drugs, both from the same graph and from DrugBank,
is shown in Fig. 5.
Figure 5. An example Drug from the HIFM Graph.
We chose the ‘owl:seeAlso’ relation from the commonly
used OWL namespace, over the ‘owl:sameAs’ relation,
because we cannot guarantee that the two drug descriptions
refer to the same real-world entity. For instance, a drug in
our dataset contains information about a manufacturer,
dosage form, strength and price, i.e. a drug as a product. On
the other hand, a drug in the DrugBank dataset contains
information about the chemical formula, molecular weight,
affected organisms, interactions, etc., i.e. information about
drugs as active substances, which have the same effect and
indications, but can be marketed and sold in various
packages, forms and strengths, by different manufacturers.
E. Publishing the Linked Open Data
Once we had a graph of Linked Open Data from the Health
Insurance Fund of Macedonia, the next step was to publish
the data on the Web. For this purpose, we created a public
instance¹³ of Virtuoso at the Faculty of Computer Science
and Engineering, in Skopje. This Virtuoso instance holds the
Linked Drug Data from the HIFM graph, and provides a public
interface via its SPARQL endpoint¹⁴. The endpoint can be
used for querying the drug data from the graph, either by
using the SPARQL editor available at the endpoint, or by
using the endpoint as a web service from a mobile, web or
desktop application. The endpoint can be used as a web
service by adding the SPARQL query into a query string,
appended to the URL of the endpoint.
Additionally, we made dumps of the HIFM graph data into
RDF files, represented in RDF/XML, Turtle, N3, RDF/JSON and
JSON-LD semantic data formats. These RDF dumps are
published on a public CKAN instance¹⁵ at the Faculty of
Computer Science and Engineering, in Skopje. This instance
represents a CKAN catalogue of Open Data maintained and
published by the Faculty. The users can freely access and
download the data from the catalogue, and use it in their
own applications.
IV. USE-CASES
The purpose of using Linked Open Data is to increase the
value and usability of the data in various use-cases.
Once we have the local HIFM drug data interlinked
with data from the LODD cloud, we can start querying the
local data and continue moving through the links to
information published elsewhere on the Web. This ability
broadens the usage possibilities of the data, and allows
development of new types of applications over the data.
A. Using Information from HIFM
One such use-case would be to use the ‘hifm:similarTo’
relation to retrieve information about drugs which have the
same active substance as the drug we are interested in, but
may have a different brand name, different price, may be
manufactured by a different company, and may have a
different package form and strength.
13 http://linkeddata.finki.ukim.mk/
14 http://linkeddata.finki.ukim.mk/sparql
15 http://data.finki.ukim.mk/
For instance, if we are looking at information about the
drug “NIFADIL, film coated tablets, 50 x 10mg” from the
HIFM graph, and we want to find out the drugs which are
similar to it, we can use the following SPARQL query:
PREFIX drugbank: <http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugbank/>
PREFIX hifm: <http://www.fzo.org.mk/ontology/hifm#>
SELECT ?bn ?p ?m
WHERE
{
  hifm:79588 hifm:similarTo ?dbd .
  ?dbd drugbank:brandName ?bn ;
       hifm:refPriceWithVAT ?p ;
       hifm:manufacturer ?m .
}
ORDER BY ASC(?bn)
The query first makes a lookup for RDF triples in the HIFM
graph where the subject is the drug we are currently
interested in, and it is in a hifm:similarTo relation with
another drug from the HIFM graph. The drugs similar to the
drug with ID = 79588 are placed in the ?dbd variable. Then,
we look up the details for these drugs, and select their brand
name, the price and the manufacturer (Table 3).
Table 3. Results from the SPARQL query.
Brand Name                     Price   Manufacturer
----------------------------   -----   -------------
CORDIPIN R, 30 x 20mg          14,00   KRKA
CORDIPIN XL, 20 x 40mg         19,00   KRKA
KORINCARE NEO, 20 x 40mg       19,00   TCHAIKAPHARMA
KORINCARE, 20 x 20mg            9,00   TCHAIKAPHARMA
NIFADIL RETARD, 30 x 20mg      14,00   ALKALOID
NIFEDIPIN RETARD, 30 x 20mg    14,00   REPLEKFARM
NIFEDIPIN, 50 x 10mg           35,00   JAKA 80
NIFELAT RETARD, 30 x 20mg      14,00   ZDRAVLJE
This query can be written and executed directly in the
SPARQL editor at our Virtuoso SPARQL endpoint, or can be
sent as a query string from an application, and used as a web
service. The web service calls have the following format:
http://linkeddata.finki.ukim.mk/sparql?default-graph-uri=DEFAULTGRAPH&query=SPARQLQUERY&format=FORMAT
Here, DEFAULTGRAPH represents the graph URI of the
default graph for the query, i.e. the graph the query should
be executed over, SPARQLQUERY represents the SPARQL
query, such as the one shown above, and FORMAT represents the
format of the response. The formats supported by
the Virtuoso SPARQL endpoint include HTML, XML, JSON,
Javascript, CSV, Spreadsheet, RDF/XML, N3, Turtle, etc.
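A web service call of this form can be assembled with the Python standard library, which also takes care of percent-encoding the query text. The helper name is hypothetical, the graph URI in any call is a placeholder (the actual graph URI is not given here), and the exact values the endpoint accepts for the format parameter are an assumption.

```python
from urllib.parse import urlencode

ENDPOINT = "http://linkeddata.finki.ukim.mk/sparql"


def build_query_url(query: str, default_graph: str, fmt: str) -> str:
    """Build the GET URL for a SPARQL query against the endpoint."""
    params = {
        "default-graph-uri": default_graph,  # graph the query runs over
        "query": query,                      # the SPARQL query text
        "format": fmt,                       # requested response format
    }
    # urlencode() percent-encodes reserved characters and joins with '&'.
    return f"{ENDPOINT}?{urlencode(params)}"
```

The resulting string can be passed to any HTTP client; encoding matters because SPARQL queries contain characters such as `?`, `{`, `#` and spaces that are not valid unescaped in a query string.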
With this, a developer of a mobile application over the
Linked Open Data from the Fund could easily develop a
functionality which, based on the current drug the user is
browsing, could offer them alternative drugs which may be
more accessible, easier to find, and even cheaper. This would
provide the end-user of the application with a better insight
into their options as a patient when buying drugs.
B. Using Information from DrugBank, LOD and LODD
Now that the HIFM graph contains links to another dataset
on the Web, we can use them to traverse the remote graph.
This way, by using the ‘owl:seeAlso’ relation, we can retrieve
information from the DrugBank dataset which is not
present in the local HIFM drug data.
For instance, if we want to get information about the food
interactions of the drug “DILACOR, tablets, 20 x 0,25mg”, we
can use the following SPARQL query:
PREFIX hifm: <http://www.fzo.org.mk/ontology/hifm#>
PREFIX drugbank: <http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugbank/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?fi
WHERE
{
  hifm:32964 owl:seeAlso ?dbd .
  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/drugbank/sparql>
  {
    ?dbd drugbank:foodInteraction ?fi .
  }
}
ORDER BY ASC(?fi)
This SPARQL query starts from the HIFM graph, looking for
all of the triples which state that the drug with ID = 32964 is
in an ‘owl:seeAlso’ relation with another drug. The drugs from
the matched triples are bound to the ?dbd SPARQL variable,
which is then used in the SERVICE block and sent to the SPARQL
endpoint at DrugBank. This block asks for triples which will tell
us the food interaction for the drug(s) represented by the
?dbd variable. The resulting food interactions will be
returned in the ?fi variable, which is then displayed as a
result (Table 4).
Table 4. Results from the SPARQL query.
Food Interactions
-----------------
Avoid avocado.
Avoid bran and high fiber foods within 2 hours of taking this medication.
Avoid excess salt/sodium unless otherwise instructed by your physician.
Avoid milk, calcium containing dairy products, iron, antacids, or aluminium salts 2 hours before or 6 hours after using antacids while on this medication.
Avoid salt substitutes containing potassium.
Limit garlic, ginger, gingko, and horse chestnut.
This information was not stored in our local HIFM dataset,
but because of the Linked Open Data principles and the links
we provided to drugs published at DrugBank, we were able
to retrieve additional information for the given drug. We can
use these types of queries for retrieving any other
information which DrugBank provides, for our drugs, defined
in our HIFM graph.
Note that if the drug with ID = 32964 from the local HIFM
graph has more than one ‘owl:seeAlso’ relation with drugs
defined in the DrugBank dataset, the result of the query
would be a union of all of the food interactions from the
different drugs it is similar to. This can happen when a drug
from the HIFM graph has more than one active substance,
and they are defined as different drugs in the DrugBank
dataset.
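The union behaviour can be illustrated with an in-memory Python sketch in which both the local links and the remote data are plain dictionaries; these structures and the function name are assumptions made for illustration. Removing duplicates is a choice made in this sketch, since the query above does not use DISTINCT.

```python
def food_interactions(drug_id: str,
                      see_also: dict[str, list[str]],
                      remote_interactions: dict[str, list[str]]) -> list[str]:
    """Collect the interactions of every remote drug linked to drug_id."""
    results: set[str] = set()
    # Step 1: follow the local links to the remote drugs.
    for remote_drug in see_also.get(drug_id, []):
        # Step 2: look up the interactions stored for each remote drug.
        results.update(remote_interactions.get(remote_drug, []))
    return sorted(results)  # mirrors ORDER BY ASC(?fi)
```

A local drug linked to two remote drugs therefore returns the interactions of both, merged into a single sorted list.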
This use-case can provide a developer of a medical mobile,
web-based or desktop application, with a functionality which
would give its end-users additional and vital information
about the drugs they are browsing.
The DrugBank dataset contains links to other datasets as
well (Fig. 1, Fig. 3), and we can use them for accessing other
LOD and LODD cloud datasets. For instance, the DrugBank
data contain ‘owl:sameAs’ relations to drugs which are
described as part of clinical trials in the LinkedCT dataset. In
the same manner as we leap from our HIFM graph to the
DrugBank graph, we can continue over to the LinkedCT
graph, and gather the needed information from there.
V. CONCLUSION AND FUTURE WORK
The Open Data concept has gained momentum in recent
years. Governments and government institutions from
the world’s leading countries are proactively providing their
public data in an open, machine-readable format, free for use
and re-use by citizens, developers and business entities.
In order for the utility of data in open format to be
maximized, they need to be transformed into five-star Linked
Open Data. This means that they first need to be described,
published on the Web, and used, with the W3C Semantic
Web standards, such as RDF, OWL and SPARQL. Next, they
need to have links towards data published on other locations
on the Web, in order to provide context and broaden the
data description. This way, once we start going through data
from one RDF graph, we can easily traverse into another
graph and use and retrieve its information. This is not limited
to only one step; regardless of the starting point, we can
traverse any number of interlinked RDF graphs, regardless of
their location. This is the main idea behind the LOD and
LODD clouds of datasets.
In this paper, we gave an overview of the process of
transforming the two-star data which the Health Insurance
Fund of Macedonia publishes on their website, into five-star
Linked Open Data, connected to the DrugBank dataset, and
from there, indirectly with the entire LOD and LODD clouds.
The result of the transformation was a HIFM graph, which
contains over 21,000 RDF triples for 1,020 drugs from the
Fund. These drugs are interconnected with 9,946
‘hifm:similarTo’ relations with each other, and with 1,015
‘owl:seeAlso’ relations to drugs from the DrugBank dataset.
The HIFM graph is available for use at the Virtuoso instance
at our Faculty, and as an RDF dump on the CKAN instance at
our Faculty.
We also provided use-cases which give examples of how
the data from the Health Insurance Fund and DrugBank can
be used, in order to provide application developers with
mechanisms and ideas for retrieving distributed data in
various formats.
Future work on the project would include other datasets
from the Health Insurance Fund, as well as interlinking the
already existing HIFM graph with data from other LOD and
LODD member datasets. The Fund also has data about the
private and public healthcare institutions, and the services
they offer. Potentially interesting use-cases arise from
providing five-star Linked Open Data about these
institutions, such as the services they provide, the price of
the services, the location of the healthcare institutions,
contact information, etc. This data could then be used by a
mobile application which would suggest the nearest
pharmacy, dentist or laboratory to a user, based on their
needs, preferences and location. Because the data is Linked
Open Data, such an application would also have access to
other valuable information available elsewhere in the LOD
and LODD clouds, which would increase the usefulness of
both the data and the application.
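As a rough illustration of the location-based suggestion described
above, the following sketch picks the nearest institution of a given
type by great-circle (haversine) distance. It is only a sketch under
stated assumptions: the institution records, their field names and
coordinates are made up, and a real application would retrieve them
from the published data rather than from a hard-coded list.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical records; names, types and coordinates are invented.
INSTITUTIONS = [
    {"name": "Pharmacy A", "type": "pharmacy", "lat": 41.9981, "lon": 21.4254},
    {"name": "Pharmacy B", "type": "pharmacy", "lat": 41.0300, "lon": 21.3400},
    {"name": "Dentist C", "type": "dentist", "lat": 41.9990, "lon": 21.4300},
]


def nearest(institutions, kind, lat, lon):
    """Return the closest institution of the requested type, or None."""
    candidates = [i for i in institutions if i["type"] == kind]
    if not candidates:
        return None
    return min(candidates,
               key=lambda i: haversine_km(lat, lon, i["lat"], i["lon"]))


user_lat, user_lon = 41.9965, 21.4314  # assumed user position
choice = nearest(INSTITUTIONS, "pharmacy", user_lat, user_lon)
print(choice["name"])
```

User preferences, such as opening hours or service prices, could be
added as further filters on the candidate list before the distance
comparison.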
REFERENCES
[1] T. Berners-Lee, N. Shadbolt, “There’s gold to be mined from all our
data”, The Times, 2012.
[2] V. Kundra, “Digital Fuel of the 21st Century: Innovation through Open
Data and the Network Effect”, Joan Shorenstein Center on the Press,
Politics and Public Policy, Harvard College, 2012.
[3] C. Bizer, T. Heath and T. Berners-Lee, “Linked Data - The Story So Far”,
International Journal on Semantic Web and Information Systems
(IJSWIS), 2009, pp. 1-22.
[4] T. Berners-Lee, Linked Data - Design Issues.
http://www.w3.org/designissues/linkeddata.html.
[5] R. Cyganiak and A. Jentzsch. Linking Open Data cloud diagram.
http://lod-cloud.net/.
[6] A. Jentzsch, J. Zhao, O. Hassanzadeh, K. H. Cheung, M. Samwald and B.
Andersson, “Linking Open Drug Data”, Triplification Challenge of the
International Conference on Semantic Systems, 2009.
[7] O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller and M. Wang,
“LinkedCT: A Linked Data Space for Clinical Trials”, arXiv:0908.0567,
2009.
[8] K. H. Cheung, E. Prud’hommeaux, Y. Wang and S. Stephens, “Semantic
Web for Health Care and Life Sciences: a review of the state of the art”,
Briefings in Bioinformatics 10, 2009, no. 2, pp. 111-113.
[9] M. Mitrevski, M. Jovanovik, R. Stojanov, D. Trajanov, “Open University
Data”, in Proceedings of the 9th Conference for Informatics and
Information Technology, 2012.
[10] D. Temelkovski, M. Jovanovik, I. Mishkovski, D. Trajanov, “Towards
Open Data in Macedonia: Crime Map based on Ministry of Internal
Affairs’ Bulletins”, in Proceedings of the 9th Conference for Informatics
and Information Technology, 2012.