Semantic Data Integration of Big Biomedical Data for
Supporting of Personalised Medicine
Maria-Esther Vidal1,2, Kemele M. Endris2,1,
Samaneh Jozashoori2,1, Farah Karim2,1, and Guillermo Palma2,1
1TIB Leibniz Information Centre for Science and Technology, Germany
2L3S Institute, Leibniz University of Hannover, Germany
maria.vidal@tib.eu
{endris|jozashoori|karim|palma}@l3s.de
Abstract. Big biomedical data has grown exponentially during the last decades
and a similar growth rate is expected in the next years. Likewise, semantic web
technologies have also advanced during the last years, and a great variety of tools,
e.g., ontologies and query languages, have been developed by different scientific
communities and practitioners. Although a rich variety of tools and big data col-
lections are available, many challenges need to be addressed in order to discover
insights from which decisions can be taken. For instance, different interoperabil-
ity conflicts can exist among data collections, data may be incomplete, and en-
tities may be dispersed across different datasets. These issues hinder knowledge
exploration and discovery, thus requiring data integration to unveil
meaningful outcomes. In this chapter, we address these challenges and devise a
knowledge-driven framework that relies on semantic web technologies to enable
knowledge exploration and discovery. The framework receives big data sources
and integrates them into a knowledge graph. Semantic data integration methods
are utilized for identifying equivalent entities, i.e., entities that correspond to the
same real-world elements. Fusion policies enable the merging of equivalent en-
tities inside the knowledge graph, as well as with entities in other knowledge
graphs, e.g., DBpedia and Bio2RDF. Knowledge discovery allows for the ex-
ploration of knowledge graphs in order to uncover novel patterns and relations.
As proof of concept, we report on the results of applying the knowledge-driven
framework in the EU-funded project iASiS³ in order to transform big data into
actionable knowledge, thus paving the way for personalised medicine.
1 Introduction
Big data plays a relevant role in promoting both manufacturing and scientific devel-
opment through industrial digitization and emerging interdisciplinary research. Specif-
ically, big data-driven studies have provided the basis for noteworthy contributions in
biomedicine with the aim of supporting personalised medicine [46]. Some exemplar
contributions include the discovery of associations between the use of proton-pump
inhibitors and the likelihood of incurring a heart attack [48], and intra-brain vascular
3http://project-iasis.eu/
Fig. 1: Relevance of Big Biomedical Data. Trend analysis provided by Google for "big
biomedical data", "personalised medicine", and "semantic data integration". The three
terms are trending and have similar patterns of relative popularity. These results suggest
that these terms are widely searched and relevant for different communities.
dysregulation, i.e., a change in the brain blood flow, and early pathological events of
Alzheimer’s progression [52]. Semantic web technologies have also experienced great
progress, and scientific communities and practitioners have contributed with ontologi-
cal models, controlled vocabularies, linked datasets, and query languages. Additionally,
ontology-based tools are available, e.g., query engines [1,47], semantic data integration
tools [8,9,31], as well as linked data applications [7,17,19,29]. Moreover, according to
Google 4, "big biomedical data", "personalised medicine", and "semantic data integra-
tion" are trending terms. Figure 1 presents the results of the trend analysis provided by
4https://trends.google.com/trends/?geo=US
Google; as observed, they have similar patterns of search and popularity. These results
evidence the attention that these topics have received in different communities.
Despite the significant impact of big data and semantic web technologies, we are en-
tering a new era where domains like genomics are projected to grow very rapidly
in the next decade, reaching more than one zettabyte of heterogeneous data per year
by 2025 [50]. In this new era, transforming big data into actionable big knowledge de-
mands novel and scalable tools that enable not only big data ingestion and curation,
but also efficient large-scale semantic data integration, exploration, and discovery.
Particularly, big biomedical data suffers from different interoperability conflicts, e.g.,
structuredness, schematic, or granularity. Further, it may be incomplete or values can be
incorrect; more importantly, knowledge required to discover relevant outcomes may be
dispersed across different datasets. All these issues interfere with the process of knowl-
edge exploration and discovery required to support decision making tasks and person-
alised medicine. We tackle these challenges and present a knowledge-driven framework
devised with the aim of semantically integrating big data into a knowledge graph. The
framework relies on the assumption that mining techniques are utilized to extract and
structure knowledge encoded in unstructured big data and describe extracted knowledge
with ontologies. Structured data annotations provide for the resolution of interoperabil-
ity conflicts and data integration into the knowledge graph; a unified schema describes
the annotated data in the knowledge graph. Finally, knowledge discovery methods ex-
plore and analyze the knowledge graph. Thus, by exploiting knowledge at all the steps
of big data processing, the proposed knowledge-driven framework facilitates the trans-
formation of big data into actionable knowledge at scale.
In this chapter, the semantic data integration techniques implemented in the knowledge-
driven framework are defined. Additionally, the results of applying these techniques
in the European Union Horizon 2020 funded project iASiS are presented. In iASiS,
the proposed knowledge-driven framework is being utilized to integrate big biomedical
data, e.g., drugs, genes, mutations, side effects, with clinical records, medical images,
and genomic data. As a result, a knowledge graph represented using the Resource De-
scription Framework (RDF) is created. The current version has more than 230 million
RDF triples and is accessible through a federation of SPARQL endpoints 5. Albeit ini-
tial, this knowledge graph enables the exploration of associations hidden in the raw
data. Associations include mutations that impact the effectiveness of a drug, side-
effects of a drug, and drug-target interactions. Thus, this knowledge graph corresponds
to a building block for determining relatedness between entities, link prediction, and
pattern discovery. Finally, because access to data collections integrated in the knowl-
edge graph may be regulated by different policies and licenses, the knowledge-driven
framework is empowered to enforce data privacy and access control. The performance
of these knowledge-driven techniques has been empirically studied on state-of-the-art
benchmarks. Observed results suggest that exploiting knowledge during all the steps
of big data processing enables scaling up to the challenges imposed by the very nature of biomedical data.
The contributions of the work are summarized as follows:
5Web services that enable the execution of SPARQL queries following the SPARQL protocol.
– A knowledge-driven framework able to integrate big biomedical data into a knowl-
edge graph. Integrated data is structured and semantically described, enabling the
exploration and discovery of novel patterns and associations.
– A characterization of interoperability conflicts among concepts in big biomedical
data, and semantic data integration methods tailored for their resolution.
– A case study illustrating the benefits of applying the proposed knowledge-driven
framework to big biomedical data collected in the context of the iASiS project.
The remainder of the chapter is structured as follows: Section 2 presents the back-
ground knowledge required to understand the terminology used in this chapter. Related
approaches are summarized in Section 3; they include big data frameworks, seman-
tic data integration techniques, federated query engines, and approaches for enforcing
data privacy and access control regulations. Section 4 presents the main components
and features of the knowledge-driven framework. Section 5 describes the application
of the proposed knowledge-driven framework in iASiS. Finally, we sum up the lessons
learned and outline future research directions in Section 6.
2 Preliminaries
2.1 The 5Vs Model for Biomedical Data
In a general sense, big data is defined as data whose volume, acquisition speed, data
representation, veracity, and potential value overcome the capacity of traditional data
management systems [6]. Big data is characterized by a 5Vs model: Volume denotes
that generation and collection of data are produced at increasingly big scales. Velocity
represents that data is rapidly and timely generated and collected. Variety indicates
heterogeneity in data types, formats, structuredness, and data generation scale. Veracity
refers to noise and quality issues in the data. Finally, value denotes the benefit and
usefulness that can be obtained from processing and mining big data; Figure 2 depicts
a summary of the very nature of the biomedical data sources. According to this 5Vs
model, biomedical data is characterized as follows.
Volume: biomedical data sources and particularly, genomics, make available large vol-
umes of data. Public websites from scientific organizations like UK Biobank, European
Genome-Phenome Archive (EGA), EMBL-EBI, and the Centre for Genomic Regula-
tion (CRG) are making available controlled clinical data from more than 500,000 par-
ticipants, different liquid samples and their corresponding genetic analysis, and health
records. Furthermore, there are over three billion base pairs (sites) on a human genome,
and sequencing a whole genome generates more than 100 gigabytes of data. Despite
the size of current biomedical data sources, the genomic data is growing at an unprece-
dented rate; in fact, biomedical data is projected to grow very rapidly in the next decade,
reaching more than one zettabyte per year by 2025. Thus, scaling up to volume re-
quires efficient management of very large datasets.
Fig. 2: Big Biomedical Data. The 5Vs model is utilized to characterize the very nature
of big biomedical data. As observed, the dominant big data dimensions, i.e., volume,
velocity, variety, veracity, and value, are present in existing biomedical datasets.
Variety: biomedical data is collected in a wide variety of ways, and using different
devices and protocols, e.g., medical images, and genetic or molecular tests. Further-
more, electronic health records describing patients with different characteristics are
composed of unstructured notes. Clinical notes encode relevant knowledge about the
conditions and treatments; however, irregularity in the visits generates heterogeneity
in the granularity of the entries in collections of clinical records. More importantly,
treatments, interventions, and outcomes are heterogeneous, and there are no standard
schema or protocols for reporting them in an electronic health record. Thus, novel data
processing techniques are demanded for scaling up to the variety of biomedical data.
Velocity: clinical data is composed of data generated from different devices and as the
results of medical tests regularly performed on patients. Furthermore, patient vital signs
can be registered in real-time, as well as the evolution of a tumor in reaction to a
particular treatment. Consequently, processing and analyzing data in motion is required
for addressing the velocity dimension of biomedical data.
Veracity: because of the unique conditions of a patient at a given instance of time, col-
lected clinical data cannot be reproduced. Moreover, clinical data is affected in many
cases by uncertainty generated by missing observations, errors in the interpretation of
the conditions of a patient, and incorrect values due to the inaccuracy of existing interven-
tions and procedures. In consequence, data quality methods are demanded for dealing
with and ensuring the veracity of biomedical data.
Value: the potential value of biomedical data to improve healthcare has been shown
in diverse scenarios. Big data frameworks are supporting the delivery of personalised
medicine by providing semi-automatic interpretation and mining of medical images,
and analyses of large populations. Nevertheless, biomedical data alone does not have
any value if it cannot be analysed in a way that actionable insights can be discovered.
Hence, methods for adding value to biomedical data are needed.
We present a knowledge-driven framework that enables the integration of big biomed-
ical data into a knowledge graph. The knowledge-driven framework resorts to knowl-
edge extraction, ontologies, and knowledge discovery, to tackle the challenges imposed
by the very nature of biomedical data.
2.2 Knowledge Modeling and Ontologies
Knowledge modeling is a design process where entities in a universe of discourse are
represented using a knowledge representation model. Such models range from expressive for-
malisms like ontologies to less expressive models like the relational model. Knowledge
representation provides the basis for the definition of the main properties of a real-
world entity, as well as relations between entities. Accordingly, ontologies enable a
formal specification of a domain of knowledge, develop a common understanding of
the domain, and enable knowledge management and discovery.
The Resource Description Framework (RDF) is a knowledge representation model
developed by the W3C consortium for describing resources in terms of RDF triples.
Three different types of arguments are distinguished in RDF: a) Uniform Resource
Identifier (URI) is a string of characters for denoting entities; it acts as an identifier of
equivalent entities; b) Literals are strings which denote entities; c) Blank nodes repre-
sent a resource, but without a specific identifier; they represent existential variables. An
RDF triple relates three elements: i) Subject - a described resource, represented by a
URI reference or a blank node. ii) Predicate - the property of a resource, represented
by a URI reference. iii) Object - property value, represented by a URI reference, blank
node or a literal. An RDF dataset is represented as a directed graph, where nodes cor-
respond to resources, literals or blank nodes, and a directed edge between two nodes
represents an RDF triple; edges are annotated with predicates. Resources can have in-
coming and outgoing edges, i.e., they can be either a subject or an object of an RDF
triple; in contrast, nodes representing literals can only have incoming edges. RDF graphs
allow for the understanding of the relations among resources and their properties.
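To make the RDF triple model concrete, the following Python sketch (using the rdflib library) builds a handful of triples in the spirit of the iASiS vocabulary that appears later in Listing 1.1; the resource names and literal values are illustrative assumptions rather than actual entries of the iASiS knowledge graph.

# Minimal sketch of RDF triples in Python with rdflib; names are illustrative assumptions.
from rdflib import Graph, Namespace, Literal, RDF

IASIS = Namespace("http://project-iasis.eu/vocab/")  # prefix also used in Listing 1.1

g = Graph()
g.bind("iasis", IASIS)

mutation = IASIS["mutation_0001"]        # hypothetical resource URIs
transcript = IASIS["transcript_0001"]
drug = IASIS["drug_docetaxel"]

# Each add() call asserts one RDF triple: (subject, predicate, object)
g.add((mutation, RDF.type, IASIS.Mutation))
g.add((mutation, IASIS.mutation_somatic_status, Literal("Confirmed somatic variant")))
g.add((mutation, IASIS.mutation_isLocatedIn_transcript, transcript))
g.add((drug, IASIS.label, Literal("docetaxel")))

print(g.serialize(format="turtle"))      # inspect the resulting RDF graph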
SPARQL is a W3C standard query language used to define and manipulate RDF
graphs. SPARQL queries comprise triple patterns including conjunctions and disjunc-
tions. The main query clauses supported by SPARQL are SELECT, CONSTRUCT,
ASK, and DESCRIBE. An evaluation of a SPARQL query Q over an RDF graph G
corresponds to the set of instantiations of the variables in the SELECT clause of Q against
RDF triples in G. A SPARQL query can include different operators, e.g., JOIN, UNION,
and OPTIONAL. Moreover, the FILTER modifier can be used to restrict the
output to the instantiations of the variables of the SELECT clause of Q that meet a certain
condition. The basic building block in the WHERE clause of a SPARQL query is the
triple pattern, or a triple with variables. A Basic Graph Pattern (BGP) is the conjunc-
tion of several triple patterns, where a conjunction corresponds to the JOIN operator.
Finally, BGPs can be connected with the JOIN, UNION, or OPTIONAL operators.
The SPARQL query in Listing 1.1 expresses the query “Mutations of the type Confirmed
somatic variant located in transcripts which are translated as proteins that interact with
the drug Docetaxel”. This query is composed of 12 triple patterns; each triple, e.g.,
“?mutation rdf:type iasis:Mutation”, corresponds to a triple pattern, where “?mutation”
corresponds to a variable, “rdf:type” to a predicate, and “iasis:Mutation” to the object, an
RDF class. Triple patterns are connected using the “.” operator, which corresponds to a JOIN.
The 12 triple patterns in the WHERE clause of the query compose one BGP.
Listing 1.1: SPARQL Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iasis: <http://project-iasis.eu/vocab/>
SELECT DISTINCT ?mutation
WHERE {
?mutation rdf:type iasis:Mutation .
?mutation iasis:mutation_chromosome ?chromosome .
?mutation iasis:mutation_start ?start.
?mutation iasis:mutation_NucleotideSeq ?nucleotideSeq.
?mutation iasis:mutation_isClassifiedAs_mutationType ?type.
?mutation iasis:mutation_somatic_status ’Confirmed somatic variant’.
?mutation iasis:mutation_cds ?cds.
?mutation iasis:mutation_aa ?aa.
?mutation iasis:mutation_isLocatedIn_transcript ?transcript .
?transcript iasis:translates_as ?protein .
?drug iasis:drug_interactsWith_protein ?protein .
?drug iasis:label ’docetaxel’ }
2.3 Ontologies in the Biomedical Domain
In the last few years, biomedical ontologies have become extremely popular in the com-
putational biology community due to their central role in providing formal description
of biomedical knowledge, classification of entities, and common concepts of the do-
main. A large number of ontologies have been defined in the biomedical domain. For
example, there are more than 719 biomedical ontologies accessible just in BioPortal 6.
The most commonly accessed biomedical ontologies include:
SNOMED CT7- Comprehensive concept system for healthcare.
6https://bioportal.bioontology.org/
7https://www.snomed.org/snomed-ct
UMLS (Unified Medical Language System) 8- terminology integration system in
which all the mentioned ontologies are integrated.
HPO (Human Phenotype Ontology) 9- standardised vocabulary for representing
phenotypic abnormalities existing in human disease.
FMA (Foundational Model of Anatomy) 10 - Ontology of structural human anatomy.
MeSH (Medical Subject Headings) 11 - controlled vocabulary for the indexing and
retrieval of the biomedical literature.
RxNorm 12 - controlled vocabulary of normalized names and codes of drugs.
NCIt (The National Cancer Institute Thesaurus) 13 - public domain terminology
that provides broad coverage of the cancer domain.
Biomedical ontologies are commonly utilized to provide a unique representation of
concepts extracted from unstructured or structured datasets. Specifically in the case
study reported in this chapter, knowledge extraction methods rely on SNOMED-CT
and UMLS for annotating the concepts extracted from clinical notes and scientific pub-
lications. Furthermore, side effects from drugs are annotated with terms from HPO.
2.4 The RDF Mapping Language (RML)
Big data is usually presented in different formats, e.g., images, unstructured text, or tab-
ular data, requiring the definition of mapping rules in order to transform data in these
diverse formats into a unified schema. The RDF Mapping Language (RML) is one of
the existing mapping languages [13]; it expresses mappings to transform sources repre-
sented in tabular or nested formats, e.g., CSV, relational, JSON, or XML, into RDF. Each
mapping rule in RML is represented as a triples map which consists of the following
parts: i) A Logical Source refers to a data source from where data is collected. ii) A
Subject Map defines the subject of the generated RDF triples. iii) Predicate-Object
Map combines a predicate map expressing the predicate of an RDF triple with an object
map expressing the object of the RDF triple. A referencing object map indicates the ref-
erence to another triples map. In the proposed knowledge-driven framework, mapping
rules are utilized to transform and integrate biomedical data into the knowledge graph.
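The following Python sketch illustrates, under simplifying assumptions, what an RML triples map specifies: a logical source, a subject map given as a URI template, and predicate-object maps. The CSV fields, the URI template, and the vocabulary terms are hypothetical, and an actual RML processor would interpret declarative mapping rules instead of this hand-written function.

# Sketch of the structure of an RML triples map and its application to CSV rows.
# Field names, the URI template, and the vocabulary URIs are illustrative assumptions.
import csv
from io import StringIO

triples_map = {
    "logical_source": "drugs.csv",                                        # Logical Source
    "subject_map": "http://project-iasis.eu/vocab/drug_{drugbank_id}",    # Subject Map
    "predicate_object_maps": [                                            # Predicate-Object Maps
        ("http://project-iasis.eu/vocab/label", "name"),
        ("http://project-iasis.eu/vocab/drug_interactsWith_protein", "target"),
    ],
}

sample_csv = "drugbank_id,name,target\nDB01248,docetaxel,P07437\n"        # hypothetical rows

def apply_triples_map(tmap, rows):
    """Generate (subject, predicate, object) triples, mimicking one RML rule."""
    for row in rows:
        subject = tmap["subject_map"].format(**row)
        for predicate, column in tmap["predicate_object_maps"]:
            yield (subject, predicate, row[column])

for triple in apply_triples_map(triples_map, csv.DictReader(StringIO(sample_csv))):
    print(triple)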
2.5 Federated Query Processing
In order to scale up to volume and variety, datasets can be partitioned and distributed
in a federation of data sources. A federation of SPARQL endpoints enables access
to distributed RDF datasets via SPARQL endpoints. A SPARQL endpoint is a Web ser-
vice that provides a Web interface to query RDF data following the SPARQL protocol.
8https://www.nlm.nih.gov/research/umls/
9https://hpo.jax.org/app/
10http://si.washington.edu/projects/fma
11https://meshb.nlm.nih.gov/search
12https://www.nlm.nih.gov/research/umls/rxnorm/
13https://ncit.nci.nih.gov/ncitbrowser/
RDF datasets comprise sets of RDF triples; predicates of these triples can be from more
than one Linked Open Vocabulary, e.g., FOAF or the DBpedia ontology. Additionally,
proprietary vocabularies can be used to describe the RDF resources of these triples, and
controlled vocabularies such as VoID can be used to describe the properties of the RDF data
accessible through a given SPARQL endpoint. Queries against federations of SPARQL
endpoints are posed through federated SPARQL query engines. A generic architecture
of a federated SPARQL query engine is based on the mediator and wrapper architecture
[51,53]. Wrappers translate SPARQL subqueries into calls to the SPARQL endpoints as
well as convert endpoint answers into the query engine internal structures. The mediator
rewrites original queries into subqueries that can be executed by the data sources of the
federation. Moreover, the mediator collects the answers of evaluating the subqueries
over the selected endpoints, merges the results, and produces the answer of a federated
query; mainly, it is composed of three components: i) Source Selection and Query
Decomposition breaks down queries into subqueries, and selects the endpoints capable
of executing each subquery. Simple subqueries comprise a list of triple patterns that can
be evaluated against at least one endpoint. ii) Query Optimizer identifies execution
plans comprising subqueries and physical operators implemented by the query engine.
iii) Query Engine implements physical operators to combine tuples from endpoints.
Physical operators implement logical SPARQL operators like JOIN, UNION, or OPTIONAL. In
the proposed knowledge-driven framework, a federated query engine enables interoper-
ability across different knowledge graphs.
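The following Python sketch illustrates the role of a wrapper in this architecture: it poses a SPARQL (sub)query to one endpoint of a federation and converts the endpoint answers into internal structures. The endpoint URL is a placeholder and SPARQLWrapper is only one possible client library; the sketch is not taken from any of the federated engines discussed in this chapter.

# Sketch of a wrapper call against a single SPARQL endpoint; the endpoint URL is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/iasis/sparql"   # hypothetical endpoint of the federation

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("""
    PREFIX iasis: <http://project-iasis.eu/vocab/>
    SELECT DISTINCT ?drug WHERE {
        ?drug iasis:drug_interactsWith_protein ?protein .
        ?drug iasis:label 'docetaxel'
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()             # endpoint answers as a JSON document
for binding in results["results"]["bindings"]:
    print(binding["drug"]["value"])            # internal structure: variable -> value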
3 Related Work
3.1 Big Data
Data complexity challenges reflected in the Vs of big data, i.e., volume, variety, ve-
racity, velocity, and value, have a negative impact on the effectiveness and scalability
of techniques across all the steps of big data processing [5]. To address the challenges
of data complexity, novel paradigms and technologies have been proposed in the last
years. In order to address variety, flexible data representations like semi-structured data
models or graph databases, have emerged as alternatives for scaling up to divergent data
sources characterized by schematic conflicts. Linked Data technologies have focused on
managing data that is semantically heterogeneous. Despite all these advancements, re-
search and technical challenges still abound in the big data era. The extensive literature
analysis on big data methods provided by [49] indicates that the majority of the state-of-
the-art solutions constitute silos that focus on specific dimensions of data complexity.
However, isolated solutions are not sufficient to meet the concurrent demands imposed
by the different Vs of big data to successfully generate actionable knowledge [26].
The knowledge-driven framework implements a data-driven pipeline able to address all
challenges of data complexity. Volume is managed by the federated query engine im-
plemented in the knowledge-driven framework, which decomposes and executes an input
query over the remote endpoints containing the knowledge graph. Non-blocking oper-
ators implemented in the federated query engine tackle data velocity. RML mapping
rules defined according to a unified schema to generate the knowledge graph address the
variety dimension of data complexity. Semantic data integration and data fusion policies im-
plemented in the knowledge-driven framework deal with veracity. To extract value from
big data, the knowledge-driven framework implements knowledge discovery methods
for uncovering patterns and hidden relations; it enables the profiling of patients, interactions be-
tween drugs, or the side effects of a treatment. Further, an ontology-based component enables
the definition of data access policies and rules to perform reasoning over the guidelines
that regulate the operations allowed over the data integrated in the knowledge graph.
3.2 Semantic Data Integration
Semantic integration of big data tackles big data variety by enabling the resolution of
several interoperability conflicts, e.g., structuredness, schematic, representation, com-
pleteness, domain, granularity, and entity matching conflicts. These conflicts arise be-
cause data sources may have different data models or none at all, follow various schemes
for data representation, or contain complementary information. Furthermore, a real-world
entity may be represented using diverse properties or at various levels of detail. Thus,
data integration techniques able to solve all the interoperability issues while addressing
data complexity challenges imposed by Big data characteristics are demanded.
In order to efficiently integrate big data sources and to address interoperability con-
flicts, several integration approaches have been devised to collect domain independent
data, whereas others integrate data particularly from the biomedical domain. KARMA [31],
MINTE [8], SILK [25], SJoin [18], LDIF [2], Sieve [34], LIMES [36], and RapidMiner
LOD Extension [44] are generic approaches for semantic data integration. KARMA is
a semi-automatic approach capable of resolving interoperability conflicts among struc-
tured sources. KARMA builds source models and mapping rules for structured data
sources by mapping them to ontologies. The RDF-based semantic integration approach,
MINTE, resolves integration conflicts among RDF data sources; it exploits knowledge
expressed in RDF vocabularies and semantic similarity measures to integrate the se-
mantically equivalent RDF graphs. SILK, a linked discovery framework, integrates dif-
ferent linked data sources by identifying links between corresponding entities. SILK
allows for the specification of rules to define the link types to be discovered among the
data sources, as well as conditions to be fulfilled by the data entities to be integrated.
SJoin, a semantic join operator, performs semantic integration of syntactically different
heterogeneous RDF graphs. SJoin identifies semantically related heterogeneous RDF
graphs in blocking mode for batch processing, as well as in non-blocking mode to pro-
duce results incrementally. LDIF integrates disparate linked data sources represented
using different ontologies into a local targeted ontology. Sieve resorts to mapping rules
for performing data fusion and conflict resolution; it solves data quality issues, e.g., in-
consistencies and missing values, during data fusion. LIMES is a tool using supervised
and unsupervised techniques for integrating different linked data sources by identifying
links among instances. LIMES exploits metric spaces to filter out all instance pairs that
do not meet the mapping criteria. RapidMiner LOD Extension discovers relevant linked
data sources by following links and integrates overlapping data found in different data
sources. In the biomedical domain, [24] and [23] implement ontology matching to in-
tegrate data sources by mapping different entities and relationships, and [4] reports a
second release of Bio2RDF for improved syntactic and semantic interoperability among
datasets. Further, Hu et al. [23] perform various link analysis methods, e.g., data
link analysis, entity link analysis, and term link analysis; the results of link analysis are
exploited for solving interoperability conflicts and for facilitating data integration.
Table 1: Semantic Data Integration. Comparison of existing approaches. Mapping-
based: data integration is guided by mapping rules; Similarity-based: entity matching
resorts to similarity measures; Linked Discovery: data integration is guided by links
between matched entities; Ontology Matching: ontology alignments are used for entity
matching; Fusion Criteria: fusion policies guide the integration of matched entities;
and Variety: data integration scales up to various formats.
Data Integration Approach | Mapping-based | Similarity-based | Linked Discovery | Ontology Matching | Fusion Criteria | Variety
KARMA [31] X X X
MINTE [8] X X X
SILK [25] X X
SJoin [18] X X
LDIF [2] X X
Sieve [34] X X X
LIMES [36] X X
RapidMiner [44] X
Knowledge-driven framework X X X X X
The knowledge-driven framework receives structured data annotated with terms
from controlled vocabularies or ontologies. Knowledge extraction techniques such as
natural language processing (NLP) or visual analytics are performed to resolve struc-
turedness conflicts. Moreover, mapping rules defined conforming to a unified schema
facilitate the translation of annotated data into a knowledge graph to solve interoperabil-
ity conflicts. Furthermore, mapping rules enable the transformation of the annotated
data into RDF, and semantic similarity measures are utilized to determine when two
resources match, i.e., they correspond to the same real-world entity. Finally, diverse
data fusion policies can be adopted to integrate related entities in the generated knowl-
edge graph. Thus, variety can be managed, and knowledge extraction, mapping rules,
similarity measures, and fusion policies provide the basis for solving interoperability
conflicts and data integration. Table 1 depicts the main properties of these approaches;
as observed, existing approaches are able to solve data integration by taking advantage
of diverse techniques, e.g., links, mappings, and ontologies. Nevertheless, the proposed
knowledge-driven framework is also able to scale up to various types of data, e.g., un-
structured notes and images, and structured and semi-structured data. These features
are crucial for enabling scalability of biomedical data management and analytics.
3.3 Knowledge Management and Query Processing
According to a survey recently conducted by Sahu et al. [45], modelling and process-
ing big data using graph based management tools is becoming increasingly common
in both research and industry. Nonetheless, this study also reveals that there are still
open issues that impede a prevalent usage of graph-based frameworks over more tra-
ditional technologies like relational databases. Scalable graph management infrastruc-
tures, and query languages and formal models for representing and querying graphs are
actually some of the challenges to be addressed. Moreover, Hartig et al. [21] just fo-
cus on federations of data sources represented using RDF, and highlight that ensuring
efficient and effective query processing while enforcing data access and privacy poli-
cies are the main challenges to be faced. In order to address these issues, the semantic
web community has actively proposed federated SPARQL query engines able to exe-
cute queries over a federation of SPARQL endpoints. FedX [47], ANAPSID [1], and
MULDER [15] are exemplar contributions. FedX implements source selection tech-
niques able to contact the SPARQL endpoints on the fly to decide the subqueries of the
original query that can be executed over the endpoints of the federation. Thus, FedX
relies on zero knowledge about the content of the SPARQL endpoints to perform the
tasks of source selection and decomposition. ANAPSID exploits information about the
predicates of the RDF datasets accessible via the SPARQL endpoints of the federation
to select relevant sources, decompose the original queries, and find efficient execution
plans. Moreover, ANAPSID implements physical operators able to adjust schedulers
of query executions to the current conditions of SPARQL endpoints, i.e., if one of the
SPARQL endpoint is delayed or blocked, ANAPSID is able to adapt the query plans
in order to keep producing results in an incremental fashion. Finally, MULDER is a
federated SPARQL engine that relies on the description of the properties and links of
the classes in the RDF graphs accessible from SPARQL endpoints, to decompose the
original queries into the minimal number of subqueries required to evaluate the original
query against the relevant SPARQL endpoints. MULDER utilizes the RDF Molecule Tem-
plates (RDF-MTs) to describe classes and links in an RDF graph. It also exploits the
physical operators implemented in ANAPSID to provide efficient query executions of
SPARQL queries. Thus, MULDER provides source selection and query decomposition,
and query optimizer components which effectively exploit the ANAPSID query engine.
BioSearch [24] is a semantic search engine for linked biomedical data. It resorts to on-
tology matching for efficient browsing; it integrates data from different data sources by
matching classes and properties in the Semanticscience Integrated Ontology (SIO)14.
The knowledge-driven framework resorts to the federated query engine called On-
tario, to execute queries against a federation of knowledge graphs. Similarly to MUL-
DER, Ontario relies on RDF Molecule Templates (RDF-MTs) for describing the RDF
classes included in a federation of knowledge graphs; RDF-MTs correspond to an ab-
stract representation of the RDF classes in an RDF dataset and all the properties that the
instances of the class can have. Additionally, Ontario maintains in the RDF-MTs meta-
data describing the data privacy and access control regulations imposed by the provider
of the data used to populate the RDF classes of the knowledge graph. Moreover, On-
14https://code.google.com/archive/p/semanticscience/wikis/SIO.wiki
Table 2: Knowledge Management and Query Processing. Related approaches are de-
scribed in terms of various characteristics. Source Semantic Description: query pro-
cessing resorts to data source descriptions; Adaptive Engine: query processing schedules
are adjusted to the source conditions; Ontology-based: ontologies are exploited during
query processing; and Variety: data management scales up to various formats.
Approach | Source Semantic Description | Adaptive Engine | Ontology-based | Variety
FedX [47] X
ANAPSID [1] X X
MULDER [15] X X
BioSearch [24] X
Knowledge-driven framework X X X X
tario relies on adaptive physical operators to be able to adjust query execution plans to
the condition of the SPARQL endpoints that make a federation of knowledge graphs
accessible. More importantly, contrary to existing federated SPARQL query engines, On-
tario is able to execute SPARQL queries over data sources that are not integrated in the
knowledge graph and are stored in raw formats, e.g., CSV or JSON. This feature of
Ontario allows for executing queries over both RDF graphs and against data collections
that are not physically integrated into the knowledge graph, providing thus a virtual and
scalable integration of data sources. Table 2 summarizes the main properties of existing
knowledge management and query processing approaches. Albeit efficient, existing fed-
erated query engines are not able to scale up to the variety of biomedical data during query
processing, i.e., queries cannot be executed over heterogeneous sources described in
different formats, e.g., CSV or JSON, or accessible using various database engines,
e.g., relational or graph database engines.
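As an illustration of the metadata just described, the following Python sketch shows the kind of information an RDF Molecule Template could capture, i.e., an RDF class, the properties its instances can have, the endpoints serving the class, and access-control metadata, together with a naive source-selection step. The dictionary layout and the endpoint URLs are assumptions for illustration and do not reflect Ontario's actual data structures.

# Sketch of RDF Molecule Templates (RDF-MTs) as Python dictionaries; layout and URLs are assumptions.
from typing import Dict, List

rdf_mts: List[Dict] = [
    {
        "class": "http://project-iasis.eu/vocab/Mutation",
        "properties": [
            "http://project-iasis.eu/vocab/mutation_somatic_status",
            "http://project-iasis.eu/vocab/mutation_isLocatedIn_transcript",
        ],
        "endpoints": ["http://example.org/endpoint-genomics/sparql"],   # placeholder
        "access_policy": {"operations": ["Read"]},                      # privacy metadata
    },
    {
        "class": "http://project-iasis.eu/vocab/Drug",
        "properties": ["http://project-iasis.eu/vocab/drug_interactsWith_protein"],
        "endpoints": ["http://example.org/endpoint-drugs/sparql"],      # placeholder
        "access_policy": {"operations": ["Read", "Merge"]},
    },
]

def select_sources(predicate: str) -> List[str]:
    """Naive source selection: return endpoints whose RDF-MTs mention the predicate."""
    return [ep for mt in rdf_mts if predicate in mt["properties"] for ep in mt["endpoints"]]

print(select_sources("http://project-iasis.eu/vocab/drug_interactsWith_protein"))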
3.4 Data Privacy
Preserving privacy and enforcing data access policies is a challenging task, particularly,
whenever privacy aware access control features from heterogeneous big data sources
are integrated or reasoning processes are required to enforce potentially contradicting
access regulations [10]. Kirrane et al. [30] survey various access control models, policy
representations, and standards for access policy representations using RDF. As shown
by Kirrane et al., several ontology-based approaches have been proposed. Exemplar ap-
proaches include Kamateri et al. [27] and Grando et al. [20]. Kamateri et al. present the
Linked Medical Data Access Control (LiMDAC) framework with the aim of enabling
access control over medical data aggregated in multi-dimensional data cubes. LiM-
DAC exploits data cube metadata to restrict access to cubes, and access policies can
be defined over specific datasets and access spaces to which a number of users belong.
Grando et al. propose a hybrid approach where an ontology and a set of access control
rules allow for reasoning about access permissions. As a proof of concept, Grando et
al. [20] apply the proposed formalism to biomedical data, where rules take the form
of a consent statement signed by a patient and led by a researcher, and a consent has
a number of consent rules performed over an operation against different information
objects. Finally, Zeng et al. [54] devise a query evaluation scheme that supports access
control in a federated database system where different collaborative parties are sharing
and exchanging data described using the relational model. Albeit expressive, these ap-
proaches do not exploit the semantics encoded in privacy-aware formalisms to execute
efficient plans against knowledge graphs.
The knowledge-driven framework also implements an ontology-based approach to
describe data access policies and a set of rules to reason about the privacy and access
control policies to apply when these sources are accessed [14]. However, in contrast
to the above ontology-based approaches, this formalism is included in the federated
query engine in order to ensure that every operation executed over the data sources, e.g.,
Read (R) or Merge (M), respects the access policies of the data sources.
4 A Knowledge-Driven Framework for Big Data
The knowledge-driven framework receives big data sources in different formats, e.g.,
clinical notes, images, scientific publications, and structured data, and generates a knowl-
edge graph from which unknown patterns and relationships can be discovered; Figure
3 depicts the architecture. The framework comprises four main components: i) Knowl-
edge Extraction; ii) Knowledge Graph Creation; iii) Knowledge Management & Dis-
covery; and iv) Data Access Control & Privacy. As observed, diverse data sources can
be integrated and described into a knowledge graph, and management and discovery are
performed on top of the knowledge graph. These components are described as follows.
Fig. 3: A Knowledge-Driven Framework. Heterogeneous data sources are received as
input, and a knowledge graph and unknown patterns are output. The knowledge graph is
linked to existing knowledge graphs; federated query processing and knowledge discov-
ery techniques enable knowledge exploration and discovery at large scale. Data privacy
and access regulations are enforced in all the steps of big data processing.
Knowledge Extraction: This component exploits mining and data analytics tech-
niques in order to transform unstructured data sources like clinical notes, images, and
scientific publications, into structured datasets; ontologies are utilized to express the
meaning of the concepts extracted by the mining processes and to standardize terms
across heterogeneous data sources.
Knowledge Graph Creation: This component receives annotated datasets produced
during knowledge extraction and generates a knowledge graph; the evaluation of map-
ping rules expressed in RML enables the transformation of annotated data into RDF
triples in the knowledge graph. A knowledge graph is created by semantically describ-
ing entities using a unified schema. Annotations are exploited by semantic similarity
measures [43] with the aim of determining relatedness between the entities included
in the knowledge graph, as well as for duplicate and inconsistency detection. Related
entities are integrated into the knowledge graph following different fusion policies [8];
ontological axioms of the dataset annotations are fired for resolving conflicts during
the evaluation of the fusion policies. Moreover, entity linking techniques are used to
connect these entities to equivalent entities in other knowledge graphs.
Knowledge Management & Discovery: This component enables the exploration of
the knowledge graph, as well as the discovery of new relations or patterns between
entities, e.g., drugs, side-effects, or targets. Once the knowledge graph is created, it can
be explored and consulted by using Ontario. Results of executing a federated query can
be used as input of the tasks of data analytics or knowledge discovery. Thus, patterns
among entities on a knowledge graph, as well as relationships between these entities
can be uncovered. Discoveries include profiles of lung cancer patients, and networks of
drug-target interactions, drug and side-effects, and drug-drug interactions.
Data Access Control and Data Privacy Enforcement: This component allows for the
description of the access policies that indicate the operations, e.g., Read (R) or Merge
(M), that can be executed over the data integrated in the knowledge graph.
In the next sections, we will illustrate the features of the knowledge-driven frame-
work in the context of the European Union Horizon 2020 project iASiS.
5 Applications of the Knowledge-Driven Framework in iASiS
iASiS is a 36-month H2020-RIA project with the vision of turning clinical and phar-
macogenomics big data into actionable knowledge for personalised medicine and de-
cision making. iASiS aims at integrating heterogeneous big data sources into the iA-
SiS knowledge graph. Data sources include clinical notes, medical images, genomics,
medications, and scientific publications. In order to create the knowledge graph, iASiS
offers a unified schema able to represent knowledge encoded into the heterogeneous
big data sources. Furthermore, to overcome heterogeneity conflicts across the hetero-
geneous sources, the iASiS infrastructure makes use of diverse data analytics meth-
ods. For example, natural language processing and text-mining techniques are used to
convert clinical notes into usable data [33], state-of-the-art machine learning methods
are utilized for image analysis [39], and genomic analysis tools [32] for link predic-
tion. Moreover, the iASiS infrastructure relies on ontologies to semantically describe
real-world entities, e.g., drugs, treatments, publications, genes, and mutations; these
annotations provide the basis for the semantic integration of these entities. The iASiS
knowledge graph is linked to existing knowledge graphs, e.g., DBpedia and Bio2RDF,
and query processing and knowledge discovery techniques are implemented in order to
explore and uncover patterns in the knowledge graphs. Data from two different diseases
are integrated: lung cancer and Alzheimer’s disease.
5.1 Big Biomedical Data Sources
The very nature of the biomedical data sources, and in particular, variety, generates
interoperability conflicts across the data sources that need to be addressed before inte-
grating them in the knowledge graph. These conflicts are summarized as follows:
Structuredness (C1): occurs whenever data sources are described at different levels
of structuredness, e.g., structured, semi-structured, and unstructured. Structured data
sources are represented using the schema of a particular data/knowledge model, e.g., the
relational data model; all the represented entities are described in terms of a fixed schema
and attributes. Semi-structured data sources are also described using a data/knowledge
model, e.g., RDF or XML; however, in contrast to structured data, each modeled entity
can be represented using different attributes, and a predefined and fixed schema is not
required to describe an entity. Finally, unstructured data sources represent data without
following any structure or using a data model; typically, data is presented in various
formats, e.g., textual, numerical, images, or FASTA files.
Schematic (C2): exists among data sources that are modeled with various schemas.
Conflicts include: i) different attributes representing the same concept across sources;
ii) the same concept modeled using different structures, e.g., attributes versus classes;
iii) different types are used to represent the same concept, e.g., string versus integer;
iv) the same concept is described at different levels of specialization/generalization;
v) different names are used to model the same concept; and vi) different ontologies are
used to annotate the same entity, e.g., UMLS, SNOMED-CT.
Domain (C3): occurs when various interpretations of the same domain are represented.
Different interpretations include: i) homonym: the same name is used to represent con-
cepts with different meaning; ii) synonym: distinct names are used to model the same
concept; iii) acronym: different abbreviations for the same concept; iv) semantic con-
straint: different integrity constraints are used to model the characteristics of a concept.
Representation (C4): occurs when different representations are used to model the same
concept. Representation conflicts include: i) different scales or units; ii) various values
of precision; iii) incorrect spellings; iv) different criteria for identifiers; and v) various
methods for encoding values or representing the encoding.
Language (C5): occurs whenever different languages are used to represent the data or
metadata (i.e., schema).
Granularity (C6): refers to the level of granularity used to collect and represent the
data. Examples of granularity conflicts include: i) samples of the same measurement
observed at different time frequency; ii) various criteria of aggregation; and iii) data
modeled at various levels of detail.
5.2 Techniques for Extracting Knowledge from Big Biomedical Data Sources
Knowledge extraction methods capture knowledge encoded in unstructured data sources,
and represent the extracted knowledge using biomedical ontologies or controlled vocab-
ularies. Thus, the interoperability conflicts C1, C2, and C4 existing across the biomed-
ical data sources are solved during knowledge extraction.
Electronic Health Record (EHR) Text Analysis: Semi-automatic data curation tech-
niques are utilized for data quality assurance, e.g., removing duplicates, solving am-
biguities, and completing missing attributes. Natural language processing techniques
developed by Menasalvas et al. [33] are applied to extract relevant entities from un-
structured fields, i.e., clinical notes or lab test results. NLP techniques rely on medical
vocabularies, e.g., UMLS or HPO, and NLP corpora and tools, e.g., lemmatization or
named entity recognition, to annotate concepts with terms from medical vocabularies.
Genomic Analysis: Data mining tools, e.g., catRapid [32], are used to identify protein-
RNA associations with high accuracy. Publicly available datasets, e.g., data from GTEx,
GEO, and ArrayExpress, are used for the integration with transcriptomic data. Finally,
this component relies on the Gene Ontology to determine key genes for lung cancer
and interactions between these genes. Furthermore, genes are annotated with identifiers
from different databases, e.g., HUGO or Uniprot/SwissProt, as well as with HPO.
Image Analysis: Machine learning algorithms developed by Ortiz et al. [39] are applied
to learn predictive models able to classify medical images and detect areas of interest,
e.g., lung cancer tumors or imaging biomarkers. Further, image annotation methods
semantically describe these areas of interest using ontologies [11,41].
Open Data Analysis: NLP and network analysis methods enable the semantic annota-
tion of entities from biomedical data sources using biomedical ontologies and medical
vocabularies, e.g., UMLS or HPO. Data sources include PubMed15, COSMIC16, Drug-
Bank17, and STITCH18 . Annotated datasets comprise entities like mutations, genes,
15https://www.ncbi.nlm.nih.gov/pubmed/
16https://cancer.sanger.ac.uk/cosmic
17https://www.drugbank.ca/
18http://stitch.embl.de/
scientific publications, biomarkers, side effects, transcripts, proteins, and drugs, as well
as relations between these entities. Further, entity linking tools like DBpedia Spotlight
[12] and TagMe [16], solve the tasks of entity extraction, disambiguation, and linking.
They are used for annotating unstructured attributes of the data sources, e.g., names of
drugs, genes, or mutations with permanent web links, e.g., in DBpedia or Wikipedia.
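As an illustration of this entity linking step, the following Python sketch annotates a short text with the public DBpedia Spotlight REST service; the input sentence, confidence threshold, and response handling are assumptions made for demonstration only.

# Sketch of entity linking with the public DBpedia Spotlight REST API; parameters are assumptions.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

response = requests.get(
    SPOTLIGHT_URL,
    params={"text": "Docetaxel is used to treat non-small cell lung cancer.",
            "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Each resource links a surface form in the text to a DBpedia URI
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])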
5.3 The iASiS Unified Schema
The iASiS unified schema models main biomedical concepts, as well as their properties
and relations; it is used in the knowledge graph to describe the meaning of the anno-
tated datasets created during the knowledge extraction process. Table 3 describes the
represented concepts; a detailed description and visualization can be found in an instance
of VoCol19. Furthermore, VoCol provides ontology management features that enable
the visualization and exploration of the ontology; also, VoCol provides an interface for
specifying queries against the iASiS unified schema. The current version of the unified
schema includes 129 nodes and 174 edges which correspond to 49 classes, 56 object
properties, and 74 data type properties.
5.4 The Knowledge Graph Creation
The knowledge graph creation process relies on RML mapping rules to transform the
annotated data generated during the knowledge extraction process into RDF triples in
the knowledge graph; it is composed of four main steps:
Alignment of Concept Identifiers Data sources are pre-processed in order to identify
mappings between identifiers in different ontologies or vocabularies. For example, the
name of a drug is posed to the REST APIs of KEGG20 and STITCH21 to download the
identifiers of the drug and the targets that interact with that drug; furthermore, an instance
of the DrugBank database is utilized to find the DrugBank identifier. UMLS terms of
side effects are downloaded from SIDER22, while the HPO terms are downloaded from
the HPO database. Conflicts C2 and C5 are solved in this step.
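The following Python sketch illustrates, under simplifying assumptions, how a drug name could be aligned with candidate KEGG identifiers through KEGG's public REST API; the parsing of the tab-separated response and the selection of candidates are illustrative and are not taken from the actual iASiS pipeline.

# Sketch of aligning a drug name with KEGG DRUG identifiers via the public KEGG REST API.
import requests

def kegg_drug_ids(drug_name):
    """Return candidate KEGG DRUG identifiers matching a drug name (assumed parsing)."""
    url = "https://rest.kegg.jp/find/drug/" + drug_name
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    ids = []
    for line in response.text.strip().splitlines():
        if line:
            entry, _, _ = line.partition("\t")   # e.g., "dr:D07866\tDocetaxel ..."
            ids.append(entry)
    return ids

print(kegg_drug_ids("docetaxel"))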
Semantic Enrichment transforms annotated data into RDF; it relies on rules in a map-
ping language, e.g., RML, to generate the RDF triples that correspond to the semantic
description of the input data. The iASiS unified schema and properties from existing
RDF vocabularies like RDFS and OWL are utilized as predicates and classes. Anno-
tations in the input data are also represented as RDF triples. The RDF representation
of these annotations is linked to the corresponding entities in the knowledge graph,
e.g., the resource of the UMLS annotation C00031149 is associated with the resource
19https://vocol.iais.fraunhofer.de/iasis/
20https://www.kegg.jp/kegg/rest/keggapi.html
21http://stitch1.embl.de/
22http://sideeffects.embl.de/
Table 3: The Unified Schema. Represented biomedical concepts.
Concept Description
Patient Person suffering from a disease and receiving medical treatment in a medical center or hospital.
Test Medical procedure performed to detect, diagnose, monitor disease processes or susceptibility,
and determine a course of treatment.
Clinical Record Variety of documents containing information of a patient’s medical history that has been created
or gathered by health care professionals.
Diagnosis The identification of the nature of an illness or other problem by examination of the symptoms.
Medical Images Visual representation of the interior of an organ.
Liquid Biopsy A blood test able to report cancer cells from a tumor that are circulating in the blood or for
pieces of DNA from tumor cells that are in the blood.
Ecog performance status Patient’s level of functioning in terms of her/his ability to care for her/himself, daily activities
and physical abilities.
Observations Statements based on features of interest and treatments of patients
Gene DNA or RNA sequence.
Tumor Abnormal mass of tissue that results when cells divide more than they should or do not die
when they should.
Protein Large and complex molecules composed of one or more long chains of amino acids.
Mutation A permanent alteration in the DNA sequence that makes up a gene, such that the sequence
differs from what is found in most people.
Variation The diversity of differences in genomes and their complex relationship with health and disease.
Single nucleotide polymorphisms (SNP) and copy number variants (CNVs) are two forms of
genetic variants that can be studied.
Gene expression The phenotypic manifestation of a gene or genes by the processes of genetic transcription and
genetic translation.
CpG island CGIs are regions of the genome that contain a large number of CpG dinucleotide repeats.
Although most CGIs linked to promoters are non-methylated, the majority of CGIs may be com-
pletely methylated in normal cells which makes the study of methylation of CGIs important in
cancer studies.
Transcript Single-stranded RNA product synthesized by transcription of DNA.
eQTL A locus that explains a fraction of the genetic variance of a gene expression phenotype.
Drug Substances used for the treatment, diagnosis, cure, or prevention of a disease.
Side effect An often harmful effect of a drug or chemical that occurs along with the desired effect.
Enzyme Macromolecular biological catalysts that accelerate chemical reactions.
Publication Scientific publications.
Annotation Controlled vocabulary terms used to describe tumours, texts, images, genes, treatments, pro-
teins, and biomarkers, among others.
Genotype-Tissue Expressions Correlations between genotype and gene expression within tissues and individuals.
Measurement units Standards used to make measurements.
BioMarker Any substance, structure, or process that can be measured in the body as indicator of a disease.
Treatment Application of medical care to a patient in an attempt to cure or mitigate a disease or injury.
of the PubMed publication 28381756. Moreover, equivalences and semantic relations
between annotations are represented in the knowledge graph. These relationships allow
for detecting entities annotated with equivalent annotations and that may correspond to
the same real-world entities, i.e., they are duplicates; thus, equivalent annotations rep-
resent the input to the tasks of knowledge integration. While mapping rules are tools to
convert the format of data, they are also utilized for data curation. In order to prevent
the creation of duplicated instances of a class from different sources, such as
the same drug, a unique URI structure is defined for each concept. Therefore, the URI
identification is source-independent. Furthermore, the Semantic Enrichment component
is able to detect data quality issues in the input data collections; it has been empowered
with data curation capabilities that allow for detecting missing values, and malformed
names and identifiers. Consequently, during semantic enrichment the interoperability
conflicts C2,C3,C4, and C5 are solved. Moreover, given the number of rules and the
size of the data sources, optimization techniques have been implemented with the aim
of scaling up. Empirically, scalability has been evaluated, and the Semantic Enrichment
component is able to generate knowledge graphs in the order of the Terabytes.
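To make the URI and curation steps concrete, the following Python sketch shows, under simplifying assumptions, how a source-independent URI can be derived for a drug record and how simple curation checks (missing values, malformed names) can be applied before RDF triples are generated; the namespace, property names, and helper functions are illustrative and do not correspond to the exact iASiS vocabulary or mapping rules.

import re
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical namespace; the actual iASiS vocabulary may differ.
IASIS = Namespace("http://project-iasis.eu/vocab/")

def curate_drug_record(record):
    """Basic curation: discard records with missing names and
    normalize malformed names (extra spaces, mixed case)."""
    name = (record.get("drug_name") or "").strip()
    if not name:
        return None                      # missing value: record is discarded
    record["drug_name"] = name.lower()   # normalize the name
    return record

def drug_uri(record):
    """Source-independent URI: built only from the normalized drug name,
    so the same drug coming from different sources gets the same URI."""
    slug = re.sub(r"[^a-z0-9]+", "_", record["drug_name"])
    return URIRef(IASIS["Drug/" + slug])

def to_rdf(records):
    g = Graph()
    for raw in records:
        record = curate_drug_record(raw)
        if record is None:
            continue
        subject = drug_uri(record)
        g.add((subject, RDF.type, IASIS.Drug))
        g.add((subject, IASIS.label, Literal(record["drug_name"])))
    return g

# The same drug reported by two sources yields a single URI.
graph = to_rdf([{"drug_name": " Docetaxel "}, {"drug_name": "DOCETAXEL"}])
print(len(graph))  # 2 triples: one rdf:type and one label for the shared URI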
Knowledge Curation and Integration receives an initial version of the iASiS knowledge graph that may include duplicates, and it outputs a new version of the knowledge graph from which duplicates have been removed. In order to detect whether two entities correspond to the same real-world entity, i.e., they are duplicates, similarity measures are utilized, e.g., GADES [43] or Jaccard; all the entities in an RDF class of the knowledge graph are compared pairwise. Then, a 1-1 perfect weighted matching algorithm is performed in order to identify duplicates in the class. Thus, if two entities are matched, they are considered equivalent entities and are merged in the knowledge graph. Fusion policies are followed to decide how equivalent entities are merged in a knowledge graph; the fusion policies include: i) Union creates a new entity with the union of the properties of the matched entities. ii) Semantic-based Union also creates a new entity with the union of the properties of the matched entities, but only the most general property is kept when properties are related by the subproperty relationship; furthermore, if two properties are equivalent, only one of them is kept in the resulting entity. iii) Authoritative Merge outputs an entity with the data provided by an authoritative source.
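A minimal sketch of this matching and fusion step is shown below; it assumes a plain Jaccard similarity over property-value pairs (instead of GADES), uses networkx's maximum weight matching as a stand-in for the 1-1 perfect weighted matching, and applies the Union fusion policy to the matched pairs; the entity identifiers and property values are illustrative.

import networkx as nx

def jaccard(props_a, props_b):
    """Jaccard similarity between the property-value sets of two entities."""
    a, b = set(props_a.items()), set(props_b.items())
    return len(a & b) / len(a | b) if a | b else 0.0

def match_entities(source_a, source_b, threshold=0.5):
    """Pairwise comparison of the entities of one class from two sources,
    followed by a 1-1 weighted matching over the similarity scores."""
    g = nx.Graph()
    for ida, ea in source_a.items():
        for idb, eb in source_b.items():
            score = jaccard(ea, eb)
            if score >= threshold:
                g.add_edge(("A", ida), ("B", idb), weight=score)
    return nx.max_weight_matching(g, maxcardinality=True)

def union_fusion(ea, eb):
    """Union policy: the merged entity keeps the properties of both entities."""
    merged = dict(eb)
    merged.update(ea)
    return merged

# Toy example with two descriptions of the same drug.
a = {"d1": {"label": "docetaxel", "type": "Drug"}}
b = {"x9": {"label": "docetaxel", "atcCode": "L01CD02"}}
for (sa, ida), (sb, idb) in match_entities(a, b, threshold=0.2):
    pair = (a[ida], b[idb]) if sa == "A" else (a[idb], b[ida])
    print(union_fusion(*pair))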
To illustrate the knowledge graph creation process, suppose input data describing a drug is received in a tabular format, e.g., as a CSV file. Then, an RDF graph describing the drugs in the file is created, as can be seen in Figure 4. This RDF graph is called a simple RDF molecule, i.e., a group of RDF triples that share the same subject. RML mapping rules are defined and executed to transform raw data into the RDF triples that comprise the resulting RDF molecules. Further, these mapping rules indicate the format of the URIs of the resources that appear as subjects or objects of the RDF molecules created during their execution. In this case, three URIs are created, i.e., for the drug, the publication, and the variation. The same process is repeated for all the RML mappings that define the RDF classes in the iASiS knowledge graph in terms of the available data sources.

Fig. 4: Example of Knowledge Graph Creation. An RDF molecule is created from a CSV file. The meaning of each entry in the file is described using a unified schema.
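The sketch below mimics, in a very simplified form, the effect of executing such a mapping rule: each CSV row describing a drug is transformed into a simple RDF molecule whose subject URI follows a fixed, source-independent pattern; the column names, namespaces, and properties are illustrative assumptions rather than the actual RML mappings of the project.

import csv
import io
from rdflib import Graph, Literal, Namespace, RDF, URIRef

IASIS = Namespace("http://project-iasis.eu/vocab/")      # hypothetical vocabulary
RES = Namespace("http://project-iasis.eu/resource/")     # hypothetical URI pattern

CSV_DATA = """drug,publication,variation
docetaxel,28381756,rs1045642
"""

def row_to_molecule(row, graph):
    """Creates one simple RDF molecule: all triples share the drug subject."""
    drug = URIRef(RES["Drug/" + row["drug"]])
    publication = URIRef(RES["Publication/" + row["publication"]])
    variation = URIRef(RES["Variation/" + row["variation"]])
    graph.add((drug, RDF.type, IASIS.Drug))
    graph.add((drug, IASIS.mentionedIn, publication))     # assumed property
    graph.add((drug, IASIS.relatedVariation, variation))  # assumed property
    return drug

g = Graph()
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    row_to_molecule(row, g)
print(g.serialize(format="turtle"))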
Interlinking receives the iASiS knowledge graph and a list of existing knowledge graphs, e.g., DBpedia or Bio2RDF, and outputs a new version of the iASiS knowledge graph where entities are linked to equivalent entities in the input knowledge graphs.
Entity Linking tools like DBpedia Spotlight [12] are used for linking resources in the
iASiS knowledge graph to equivalent resources in DBpedia. Additionally, link traversal
techniques are performed to further identify links with other knowledge graphs. In case
several simple RDF molecules are defined for the same real-world entity, e.g., the drug
docetaxel, the process of knowledge integration is executed. This process determines
RDF molecules that represent equivalent entities of a class, according to the available
fusion policies. Thus, simple RDF molecules are merged or integrated into a complex RDF molecule that represents all the properties of the real-world entity captured in the different simple RDF molecules. Finally, entity linking techniques allow for discovering links between entities in the iASiS knowledge graph and equivalent entities in existing knowledge graphs, e.g., DBpedia. Figure 5 illustrates how the resource representing the drug docetaxel in the iASiS knowledge graph is linked to the resource that represents the same drug in DBpedia; the owl:sameAs property is utilized to represent this type of link. Linking the iASiS knowledge graph with other knowledge
graphs not only allows for exploring properties that are not represented in the original
knowledge graph (e.g., dbo:atcPrefix), but also enables the identification of data quality
issues like missing values or duplicates. The current version of the iASiS knowledge graph has 236,512,819 RDF triples and 26 RDF classes, with an average of 6.98 properties per entity and 86,934 entities per class. RDF-MTs of the iASiS knowledge graph and connected knowledge graphs are used to describe the main
characteristics of the integrated data and their connections. To conduct this analysis,
the RDF-MTs that describe the iASiS knowledge graph and the connected RDF-MTs
in DBpedia and Bio2RDF are computed. The algorithm proposed by Endris et al. [15]
computes the RDF-MTs from the RDF classes in the iASiS knowledge graph, DBpedia,
and Bio2RDF. Furthermore, an undirected graph with the computed RDF-MTs is built.
Fig. 5: Example of Knowledge Integration. Several RDF molecules are integrated into one RDF molecule. Resources representing the drug docetaxel in two knowledge graphs are linked using the predicate owl:sameAs.

Figure 6 shows this graph; RDF-MTs correspond to 35 nodes in the graph, while 58 edges represent links among RDF-MTs. It can be observed that all the RDF-MTs are connected to at least one RDF-MT, i.e., there are no isolated classes in the iASiS knowledge graph. Moreover, using network analysis, several graph measures are computed;
Figure 6a reports on the results of these measures. The clustering coefficient measures the tendency of nodes that share the same connections in a graph to become connected. If the neighborhood is fully connected, the clustering coefficient is 1.0, while a value close to 0.0 means that there is no connection in the neighborhood. Transitivity measures whether RDF-MTs are transitively connected; values close to 1.0 indicate that almost all the RDF-MTs are related, while low values indicate that many RDF-MTs are not related. Each RDF-MT is connected to almost three RDF-MTs on average, thus indicating that biomedical concepts are integrated and related in the knowledge graph. Nevertheless,
clustering coefficient and transitivity are both relatively low, i.e., 0.224 and 0.23, re-
spectively. Given the relationships existing between the biomedical concepts modeled
in the unified schema, these two values suggest that there are still more connections to
be discovered and included in future versions of the iASiS knowledge graph.
RDF-MT Graph Property: Value
Number of RDF-MTs (nodes): 35
Number of connections (edges): 58
Clustering coefficient: 0.224
Transitivity: 0.230
Avg. number of neighbors: 2.629

Fig. 6: Connectivity of IASIS-KG. (a) Graph analysis of the RDF-MTs of the iASiS knowledge graph. (b) Graph representing the connectivity of the RDF classes in IASIS-KG, DBpedia, and Bio2RDF; all the RDF classes are connected.
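These measures can be reproduced with standard network analysis tooling; the sketch below uses networkx over a small toy edge list that stands in for the actual graph of 35 RDF-MTs and 58 links, and computes the same statistics reported in Figure 6a (clustering coefficient, transitivity, and average number of neighbors).

import networkx as nx

# Toy stand-in for the RDF-MT graph; the real graph has 35 nodes and 58 edges.
edges = [
    ("Drug", "Protein"), ("Drug", "SideEffect"), ("Protein", "Gene"),
    ("Gene", "Mutation"), ("Mutation", "Transcript"), ("Transcript", "Protein"),
    ("Patient", "Drug"), ("Patient", "Tumor"), ("Tumor", "Gene"),
]
g = nx.Graph(edges)

metrics = {
    "Number of RDF-MTs (nodes)": g.number_of_nodes(),
    "Number of connections (edges)": g.number_of_edges(),
    "Clustering coefficient": round(nx.average_clustering(g), 3),
    "Transitivity": round(nx.transitivity(g), 3),
    "Avg. number of neighbors": round(
        sum(dict(g.degree()).values()) / g.number_of_nodes(), 3),
}
for name, value in metrics.items():
    print(f"{name}: {value}")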
5.5 Exploring and Querying a Knowledge Graph
Ontario is a federated query engine that enables the exploration of the iASiS knowledge
graph and the connected knowledge graphs, e.g., DBpedia and Bio2RDF. Queries can
be written in SPARQL, and Ontario decides the subqueries that need to be executed
over each knowledge graph to collect the data required to answer the query. Additionally, Ontario executes physical operators, e.g., the symmetric join [18] and gjoin [1], and is able to relate, during query execution, RDF triples stored in different knowledge graphs. To illustrate this feature, consider the following query: "Mutations of the type confirmed somatic variant located in transcripts which are translated as proteins that are transporters of the drug docetaxel", which is represented by the SPARQL query in Listing 1.2.
Listing 1.2: SPARQL Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iasis: <http://project-iasis.eu/vocab/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?mutation
WHERE {
?mutation rdf:type iasis:Mutation .
?mutation iasis:mutation_somatic_status ’Confirmed somatic variant’.
?mutation iasis:mutation_isLocatedIn_transcript ?transcript .
?transcript iasis:translates_as ?protein .
?drug iasis:drug_interactsWith_protein ?protein .
?protein iasis:label ?proteinName .
?drug iasis:label ’docetaxel’ .
?drug owl:sameAs ?drug1 .
?drug1 drugbank:transporter ?transporter .
?transporter drugbank:gene-name ?proteinName .}
Data from the iASiS knowledge graph and Bio2RDF is collected and linked. The SPARQL queries in Listing 1.3 and Listing 1.4 are generated by Ontario. The query in Listing 1.3 is executed against the iASiS knowledge graph; it retrieves the names of the proteins translated from the transcripts where the mutations of type confirmed somatic variant are located; also, the URI of docetaxel in Bio2RDF is projected out.
Listing 1.3: SPARQL Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iasis: <http://project-iasis.eu/vocab/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?mutation ?proteinName ?drug1
WHERE {
?mutation rdf:type iasis:Mutation .
?mutation iasis:mutation_somatic_status ’Confirmed somatic variant’.
?mutation iasis:mutation_isLocatedIn_transcript ?transcript .
?transcript iasis:translates_as ?protein .
?drug iasis:drug_interactsWith_protein ?protein .
?protein iasis:label ?proteinName .
?drug iasis:label ’docetaxel’ .
?drug owl:sameAs ?drug1 .}
The query in Listing 1.4 is evaluated over Bio2RDF; it produces the names of the proteins that are transporters of drugs, as well as the URI of docetaxel in Bio2RDF.
Listing 1.4: SPARQL Query
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?proteinName ?drug1
WHERE {
?drug1 drugbank:transporter ?transporter .
?transporter drugbank:gene-name ?proteinName .
}
Ontario collects the results and merges them in order to project out the mutations. As a result, 24 mutations of the protein ABCB1 and 11 mutations of the protein ABCG2 are identified. These mutations are associated with proteins whose names are equal to the names collected from Bio2RDF. A join operator is executed to perform this merging. It is important to highlight that without the integration of COSMIC data into the iASiS knowledge graph and the linking of the corresponding entities with Bio2RDF, this query could not be answered. Thus, these results evidence not only the features of Ontario as a federated query engine, but also the benefits of semantically describing and integrating heterogeneous data into a knowledge graph.
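The merging step can be pictured as a join over the shared variables of the two subqueries. The sketch below joins illustrative bindings on ?drug1 and ?proteinName with an in-memory hash join; the binding values are placeholders, and the actual Ontario operators, e.g., the symmetric join, operate incrementally and in a non-blocking fashion.

from collections import defaultdict

# Illustrative bindings; real bindings come from the iASiS KG and Bio2RDF.
iasis_bindings = [
    {"mutation": "COSM12345", "proteinName": "ABCB1", "drug1": "db:DB01248"},
    {"mutation": "COSM67890", "proteinName": "ABCG2", "drug1": "db:DB01248"},
]
bio2rdf_bindings = [
    {"proteinName": "ABCB1", "drug1": "db:DB01248"},
    {"proteinName": "ABCG2", "drug1": "db:DB01248"},
]

def hash_join(left, right, keys):
    """Join two lists of variable bindings on the given shared variables."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    for row in left:
        for match in index[tuple(row[k] for k in keys)]:
            yield {**match, **row}

for result in hash_join(iasis_bindings, bio2rdf_bindings,
                        keys=("drug1", "proteinName")):
    print(result["mutation"])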
In order to illustrate the performance of Ontario, the results of executing ten queries of the LSLOD [22] benchmark are reported; state-of-the-art engines are included in the study. LSLOD [22] is a benchmark composed of ten knowledge graphs from the life sciences domain, which together comprise 133,873,127 RDF triples. They include: ChEBI (the Chemical Entities of Biological Interest),
KEGG (Kyoto Encyclopedia of Genes and Genomes), DrugBank, TCGA-A (subset of
The Cancer Genome Atlas), LinkedCT (Linked Clinical Trials), Sider (Side Effects
Resource), Affymetrix, Diseasome, DailyMed, and Medicare. Queries to be executed
against this federation of knowledge graphs are also part of the benchmark. Figure 7
reports on a heat map with the normalized values of total execution time, cardinality,
and time for the first answer. Cardinality corresponds to the ratio between the number
of answers returned by a federated engine during the evaluation of a query and the total number of answers of that query; it is a higher-is-better metric. First result time reports on the elapsed time between the submission of a query and the output of the first answer, whilst total execution time represents the elapsed time between the submission of a query to an engine and the delivery of all the answers. Both values are normalized by the highest values observed among the studied engines; they are lower-is-better metrics. Additionally, the average of these normalized values is depicted in the heat map. As observed, Ontario outputs answers faster than FedX and ANAPSID; further, the answers produced by Ontario are complete. These results suggest that the knowledge-driven framework is able to scale up to large datasets and outperform existing federated engines.
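The normalization behind the heat map can be sketched as follows: execution time and first-result time are divided by the highest value observed among the engines for a given query, and cardinality is the fraction of expected answers returned. The Python sketch below illustrates this computation on purely hypothetical measurements; the numbers are placeholders and not the values of the reported study.

# Hypothetical raw measurements for one query (seconds / number of answers).
raw = {
    "ONTARIO": {"exec_time": 4.0, "first_result": 0.5, "answers": 100},
    "FedX":    {"exec_time": 9.0, "first_result": 2.0, "answers": 80},
    "ANAPSID": {"exec_time": 6.0, "first_result": 1.0, "answers": 100},
}
expected_answers = 100

max_exec = max(m["exec_time"] for m in raw.values())
max_first = max(m["first_result"] for m in raw.values())

for engine, m in raw.items():
    normalized = {
        "Execution Time": m["exec_time"] / max_exec,        # lower is better
        "First Result Time": m["first_result"] / max_first,  # lower is better
        "Cardinality": m["answers"] / expected_answers,      # higher is better
    }
    print(engine, {k: round(v, 2) for k, v in normalized.items()})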
Fig. 7: Query Processing Performance. A heat map describing the average of the normalized values of cardinality (higher is better), first result time (lower is better), and total execution time (lower is better); state-of-the-art federated query engines are compared. Ontario scales up to large knowledge graphs better than ANAPSID and FedX.

5.6 Knowledge Discovery over a Knowledge Graph

Knowledge discovery allows for uncovering patterns and relations between entities in the knowledge graph. Discoveries include groups or samples of patients with unique characteristics, as well as novel interactions between drugs and side effects. In order to identify the groups of entities from which patterns or new relations can be revealed, the knowledge-driven framework resorts to community detection algorithms, e.g., semEP
[40] and METIS [28]; they are empowered with semantics encoded in the knowledge
graph in order to produce accurate discoveries. Figure 8 reports on the results of per-
forming the knowledge discovery techniques over the entities of the knowledge graph
that correspond to lung cancer patients. The main properties of these entities involve mutations of non-small-cell lung cancer related genes, e.g., EGFR and ALK, as well as demographic attributes, smoking habits, treatments, and tumor stages. The studied population is composed of 534 patient entities. The goal of the study is to identify samples of these patients whose characteristics differ from those of the whole population. Figure 8 reports on samples of patients and the percentage of them that have the same age range, sex, EGFR mutation, smoking habits, and tumor stage. The three samples that differ the most from the whole population are included in the heat map. As observed, the reported values are uncommon across samples, and they provide the basis for profiling the patients in a sample.
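A minimal sketch of this kind of analysis is given below: a similarity graph over patient entities is partitioned with a generic community detection algorithm (networkx's greedy modularity communities, used here as a stand-in for semEP or METIS), and each detected sample is profiled by the percentage of patients sharing a property value; the patient records are synthetic placeholders.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Synthetic patient entities; real entities come from the iASiS knowledge graph.
patients = {
    "p1": {"sex": "Female", "egfr": "EGFR", "smoker": "Non Smoker", "stage": "IV"},
    "p2": {"sex": "Female", "egfr": "EGFR", "smoker": "Non Smoker", "stage": "IV"},
    "p3": {"sex": "Male", "egfr": "Non EGFR", "smoker": "Smoker", "stage": "III"},
    "p4": {"sex": "Male", "egfr": "Non EGFR", "smoker": "Smoker", "stage": "III"},
    "p5": {"sex": "Male", "egfr": "Non EGFR", "smoker": "Smoker", "stage": "IV"},
}

def similarity(a, b):
    """Fraction of shared property values between two patients."""
    return sum(a[k] == b[k] for k in a) / len(a)

g = nx.Graph()
ids = list(patients)
for i, pa in enumerate(ids):
    for pb in ids[i + 1:]:
        w = similarity(patients[pa], patients[pb])
        if w > 0.5:                      # keep only strongly similar pairs
            g.add_edge(pa, pb, weight=w)

# Each community is a candidate sample; profile it by the share of EGFR mutations.
for sample in greedy_modularity_communities(g, weight="weight"):
    egfr_share = sum(patients[p]["egfr"] == "EGFR" for p in sample) / len(sample)
    print(sorted(sample), f"EGFR mutation: {100 * egfr_share:.0f}%")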
Fig. 8: Profiling Entities in a Knowledge Graph. Patterns of property values of lung cancer patients. Patients in samples differ from the patients in the whole population in terms of the reported values. Patterns enable the profiling of patients.

6 Conclusions and Future Work

Big biomedical data is analyzed in terms of dominant dimensions: volume, velocity, variety, veracity, and value. In order to scale up to the challenges imposed by the very nature of biomedical data, data management techniques able to semantically integrate, explore, and mine these data are demanded. In this chapter, we presented a knowledge-driven framework for transforming big data into a knowledge graph; it comprises components that enable knowledge extraction, knowledge graph creation, and knowledge management and discovery. The proposed knowledge-driven framework is able to receive data
sources in various formats and, by exploiting diverse mining techniques and semantic enrichment processes, integrate them into a knowledge graph; diverse fusion policies
enable the integration of equivalent entities. Thus, the knowledge graph materializes
the result of the semantic description, integration, and curation of big biomedical data.
More importantly, the knowledge graph is the building block for detecting relatedness
between knowledge graph entities, as well as for the tasks of knowledge exploration
and discovery. Specifically, the iASiS knowledge graph is the outcome of the transfor-
mation of big biomedical data into knowledge and facilitates the uncovering of hidden
patterns and relations among patients.
The main features of the proposed knowledge-driven framework are illustrated in
the context of the iASiS project with the aim of supporting personalized medicine. As a result, a knowledge graph of more than 230 million RDF triples has been created. A federated query engine is integrated as part of the framework. It enables the ex-
ploration and integration of data across several knowledge graphs; results of evaluating
federated queries reveal relations between concepts in the knowledge graph. Moreover,
knowledge discovery techniques for uncovering patterns and relations are included in
the framework. The performance of the framework is illustrated with the results of two
empirical studies. Initial results suggest that the framework is able to scale up to large knowledge graphs and to the varied nature of biomedical data. More importantly, these outcomes provide evidence that the knowledge encoded in the knowledge graph can be exploited to uncover patterns that pave the way for profiling lung cancer patients.
In the future, more clinical data from both lung cancer and Alzheimer's disease patients will be integrated; these clinical data will include notes and images. Furthermore, annotations from biomedical
ontologies will be also used to discover new connections among entities in the knowl-
edge graph, and the knowledge exploration and discovery components will be empow-
ered with new semantic similarity measures capable of benefiting from the main char-
acteristics of the knowledge graph entities, e.g., ontology annotations and links. Similar
to the approaches proposed by Traverso and Vidal [42] and Morales et al. [35], machine
learning methods will be utilized to learn the best combination of these characteristics in
the similarity measure. Moreover, latent representations, e.g., translating embeddings [3] and holographic embeddings [37], will also be considered as part of the knowledge graph, and as the basis for machine learning based approaches for knowledge completion, e.g., us-
ing tensor factorization [38]. Finally, exhaustive evaluations will be conducted in order
to demonstrate generality and reproducibility of these initial insights; experts in lung
cancer and the Alzheimer’s disease will be included as part of the evaluations.
References
1. M. Acosta, M. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. ANAPSID: an adaptive query
processing engine for SPARQL endpoints. In Proceedings of the 10th International Confer-
ence on The Semantic Web ISWC, Bonn, Germany, October 23-27, pages 18–34, 2011.
2. A. Schultz, A. Matteini, R. Isele, P. N. Mendes, C. Becker, and C. Bizer. LDIF - a framework for large-scale linked data integration. In Proceedings of the 21st International World Wide Web Conference WWW, Developers Track, Lyon, France, April 16-20, 2012.
3. A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko. Translating em-
beddings for modeling multi-relational data. In Advances in Neural Information Processing
Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Pro-
ceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages
2787–2795, 2013.
4. A. Callahan, J. Cruz-Toledo, P. Ansell, and M. Dumontier. Bio2rdf release 2: improved
coverage, interoperability and provenance of life science linked data. In Extended Semantic
Web Conference, pages 200–212. Springer, 2013.
5. L. Cao. Data science: challenges and directions. Commun. ACM, 60(8):59–68, 2017.
6. M. Chen, S. Mao, and Y. Liu. Big data: A survey. MONET, 19(2):171–209, 2014.
7. D. Collarana, M. Galkin, C. Lange, S. Scerri, S. Auer, and M.-E. Vidal. Synthesizing knowl-
edge graphs from web sources with the MINTE+framework. In Accepted for publication at
ISWC 2018.
8. D. Collarana, M. Galkin, I. T. Ribón, M. Vidal, C. Lange, and S. Auer. MINTE: semanti-
cally integrating RDF graphs. In Proceedings of the 7th International Conference on Web
Intelligence, Mining and Semantics, WIMS 2017, pages 22:1–22:11, 2017.
9. D. Collarana, C. Lange, and S. Auer. FuhSen: A platform for federated, RDF-based hybrid
search. In Proceedings of the 25th International Conference on World Wide Web, pages
171–174, 2016.
10. P. Colombo and E. Ferrari. Privacy aware access control for big data: A research roadmap.
Big Data Research, 2(4):145–154, 2015.
11. A. L. Cruz, A. Baranya, and M. Vidal. Medical image rendering and description driven by
semantic annotations. In Resource Discovery - 5th International Workshop, RED 2012, Co-
located with the 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Greece,
May 27, 2012, Revised Selected Papers, pages 123–149, 2012.
12. J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in
multilingual entity extraction. In I-SEMANTICS 2013 - 9th International Conference on
Semantic Systems, ISEM ’13, Graz, Austria, September 4-6, 2013, pages 121–124, 2013.
13. A. Dimou, M. V. Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. V. de Walle. RML:
A generic language for integrated RDF mappings of heterogeneous data. In Proceedings of
the Workshop on Linked Data on the Web co-located with the 23rd International World Wide
Web Conference (WWW 2014), 2014.
14. K. M. Endris, Z. Almhithawi, I. Lytra, M. Vidal, and S. Auer. BOUNCER: privacy-aware
query processing over federations of RDF datasets. In Database and Expert Systems Appli-
cations - 29th International Conference, DEXA 2018, Regensburg, Germany, September 3-6,
2018, Proceedings, Part I, pages 69–84, 2018.
15. K. M. Endris, M. Galkin, I. Lytra, M. N. Mami, M. Vidal, and S. Auer. MULDER: query-
ing the linked data web by bridging RDF molecule templates. In Database and Expert
Systems Applications - 28th International Conference, DEXA 2017, Lyon, France, August
28-31, 2017, Proceedings, Part I, pages 3–18, 2017.
16. P. Ferragina and U. Scaiella. TAGME: on-the-fly annotation of short text fragments (by
wikipedia entities). In Proceedings of the 19th ACM Conference on Information and Knowl-
edge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, pages
1625–1628, 2010.
17. I. Fundulaki and S. Auer. Linked open data - introduction to the special theme. ERCIM
News, 2014(96), 2014.
18. M. Galkin, D. Collarana, I. T. Ribón, M. Vidal, and S. Auer. Sjoin: A semantic join operator
to integrate heterogeneous RDF graphs. In Database and Expert Systems Applications - 28th
International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part
I, pages 206–221, 2017.
19. G. Gawriljuk, A. Harth, C. A. Knoblock, and P. A. Szekely. A scalable approach to in-
crementally building knowledge graphs. In Research and Advanced Technology for Digital
Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL
2016, Hannover, Germany, September 5-9, 2016, Proceedings, pages 188–199, 2016.
20. M. A. Grando and R. Schwab. Building and evaluating an ontology-based tool for reasoning
about consent permission. In AMIA 2013, American Medical Informatics Association Annual
Symposium, Washington, DC, USA, November 16-20, 2013, 2013.
21. O. Hartig, M. Vidal, and J. Freytag. Federated semantic data management (dagstuhl seminar
17262). Dagstuhl Reports, 7(6):135–167, 2017.
22. A. Hasnain, Q. Mehmood, S. Sana e Zainab, M. Saleem, C. Warren, D. Zehra, S. Decker,
and D. Rebholz-Schuhmann. Biofed: federated query processing over life sciences linked
open data. Journal of Biomedical Semantics, 8(1):13, Mar 2017.
23. W. Hu, H. Qiu, and M. Dumontier. Link analysis of life science linked data. In International
Semantic Web Conference, pages 446–462. Springer, 2015.
24. W. Hu, H. Qiu, J. Huang, and M. Dumontier. Biosearch: a semantic search engine for bio2rdf.
Database, 2017:bax059, 2017.
25. R. Isele and C. Bizer. Active learning of expressive linkage rules using genetic programming.
Journal of Web Semantics, 23:2–15, 2013.
26. H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan,
and C. Shahabi. Big data and its technical challenges. Commun. ACM, 57(7):86–94, 2014.
27. E. Kamateri, E. Kalampokis, E. Tambouris, and K. A. Tarabanis. The linked medical data
access control framework. Journal of Biomedical Informatics, 50:213–225, 2014.
28. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular
graphs. SIAM Journal on scientific Computing, 20(1), 1998.
29. M. Kejriwal, P. A. Szekely, and C. A. Knoblock. Investigative knowledge discovery for
combating illicit activities. IEEE Intelligent Systems, 33(1):53–63, 2018.
30. S. Kirrane, S. Villata, and M. d’Aquin. Privacy, security and policies: A review of problems
and solutions with semantic web technologies. Semantic Web, 9(2):153–161, 2018.
31. C. A. Knoblock, P. A. Szekely, J. L. Ambite, A. Goel, S. Gupta, K. Lerman, M. Muslea,
M. Taheriyan, and P. Mallick. Semi-automatically mapping structured sources into the se-
mantic web. In Proceedings of the 9th Extended Semantic Web Conference ESWC, May
27-31, Heraklion, Crete, Greece, pages 375–390, 2012.
32. C. M. Livi, P. Klus, R. Delli Ponti, and G. G. Tartaglia. catrapid signature: identification of
ribonucleoproteins and rna-binding regions. Bioinformatics, 32(5), 2016.
33. E. Menasalvas, A. R. González, R. Costumero, H. Ambit, and C. Gonzalo. Clinical narrative
analytics challenges. In Rough Sets - International Joint Conference, IJCRS 2016, Santiago
de Chile, Chile, October 7-11, 2016, Proceedings, pages 23–32, 2016.
34. P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: linked data quality assessment and fusion.
In Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany, March 30, 2012,
pages 116–123, 2012.
35. C. Morales, D. Collarana, M. Vidal, and S. Auer. Matetee: A semantic similarity metric based
on translation embeddings for knowledge graphs. In Web Engineering - 17th International
Conference, ICWE 2017, Rome, Italy, June 5-8, 2017, Proceedings, pages 246–263, 2017.
36. A.-C. N. Ngomo and S. Auer. Limes-a time-efficient approach for large-scale link discovery
on the web of data. In IJCAI, pages 2312–2317, 2011.
37. M. Nickel, L. Rosasco, and T. A. Poggio. Holographic embeddings of knowledge graphs.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17,
2016, Phoenix, Arizona, USA., pages 1955–1961, 2016.
38. M. Nickel and V. Tresp. Tensor factorization for multi-relational learning. In Machine
Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD
2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III, pages 617–
621, 2013.
39. C. A. Ortiz, C. Gonzalo-Martín, A. Garcia-Pedrero, and E. M. Ruiz. Supervoxels-based
histon as a new alzheimer’s disease imaging biomarker. Sensors, 18(6):1752, 2018.
40. G. Palma, M. Vidal, and L. Raschid. Drug-target interaction prediction using semantic sim-
ilarity and edge partitioning. In ISWC, 2014.
41. W. Perez, A. Tello, V. Saquicela, M. Vidal, and A. L. Cruz. An automatic method for the
enrichment of DICOM metadata using biomedical ontologies. In 37th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2015, Milan,
Italy, August 25-29, 2015, pages 2551–2554, 2015.
42. I. T. Ribón and M. Vidal. GARUM: A semantic similarity measure based on machine learn-
ing and entity characteristics. In Database and Expert Systems Applications - 29th Interna-
tional Conference, DEXA 2018, Regensburg, Germany, September 3-6, 2018, Proceedings,
Part I, pages 169–183, 2018.
43. I. T. Ribón, M. Vidal, B. Kämpgen, and Y. Sure-Vetter. GADES: A graph-based semantic
similarity measure. In Proceedings of the 12th International Conference on Semantic Sys-
tems, SEMANTICS 2016, Leipzig, Germany, September 12-15, 2016, pages 101–104, 2016.
44. P. Ristoski, C. Bizer, and H. Paulheim. Mining the web of linked data with rapidminer. Web
Semantics: Science, Services and Agents on the World Wide Web, 35:142–151, 2015.
45. S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. The ubiquity of large graphs and
surprising challenges of graph processing. PVLDB, 11(4):420–431, 2017.
46. T. J. Schmidlen, L. Wawak, R. Kasper, J. F. García-España, M. F. Christman, and E. S.
Gordon. Personalized genomic results: Analysis of informational needs. Journal of Genetic
Counseling, 23(4), 2014.
47. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. Fedx: Optimization tech-
niques for federated query processing on linked data. In Proceedings of the 10th Inter-
national Conference on The Semantic Web ISWC, Bonn, Germany, October 23-27, pages
601–616, 2011.
48. N. H. Shah, P. LePendu, A. Bauer-Mehren, Y. T. Ghebremariam, S. V. Iyer, J. Marcus, K. T. Nead, J. P. Cooke, and N. J. Leeper. Proton pump inhibitor usage and the risk of myocardial infarction in the general population. PLoS ONE, 10(7), 2015.
49. U. Sivarajah, M. M. Kamal, Z. Irani, and V. Weerakkody. Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70:263–286, 2017.
50. Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson. Big data: Astronomical or genomical? PLoS Biology, 13(7), 2015.
51. G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer,
25(3):38–49, 1992.
52. Y. Iturria-Medina, R. C. Sotero, P. J. Toussaint, et al. Early role of vascular dysregulation on late-onset Alzheimer's disease based on multifactorial data-driven analysis. Nature Communications, 7(11934), 2016.
53. V. Zadorozhny, L. Raschid, M. Vidal, T. Urhan, and L. Bright. Efficient evaluation of queries
in a mediator for websources. In Proceedings of the 2002 ACM SIGMOD International
Conference on Management of Data, Madison, Wisconsin, USA, June 3-6, 2002, pages 85–
96, 2002.
54. Q. Zeng, M. Zhao, P. Liu, P. Yadav, S. B. Calo, and J. Lobo. Enforcement of autonomous
authorizations in collaborative distributed query evaluation. IEEE Trans. Knowl. Data Eng.,
27(4):979–992, 2015.