Putting Semantics
in Systems Biology
The last decade has seen the emergence of Systems Biology: an integrative
approach to understand how biological systems are built and operate. In
parallel, the Semantic Web has started to offer new technologies that make
data on the web comprehensible for computers. The merger of the two, in
Semantic Systems Biology, offers new opportunities for data integration,
sharing and analysis through computational querying and reasoning.
With the sequence of a genome, in principle the full potential for all biomolecules made by that organism is known (Gisler et al., 2010). This provided the basic information for the development of high-throughput detection technologies for virtually every biomolecule type, which explains the massive amounts of data that are produced today. This has fuelled an urgent need for efficient data analysis and knowledge management.
This need has proven to be a great driver for the development of advanced computer algorithms and applications. Although computational data analysis had been embraced by molecular biology scientists well before the start of the Human Genome Project (Ouzounis and Valencia, 2003), the era of genome sequencing marked an increase in the influence of computer technology in the life sciences, especially to harness the enormous amounts of genome sequence data. Bioinformatics gained importance and became an independent scientific discipline in its own right.
The research field of molecular biology as we know it today began with the discovery of the structure of deoxyribonucleic acid (DNA). In the 1950s, Francis Crick and James D. Watson developed a model that accurately explained the structure of DNA. This advancement was followed in the early 1970s by the discovery and exploitation of the enzymatic toolbox of bacterial endonucleases (restriction enzymes) and other nucleic acid modification enzymes. Together with vast advances in the technology for determining the nucleotide sequence of DNA, especially in the last decade of the previous century, this resulted in the determination of virtually the complete sequence of the human genome, only 50 years after the initial discovery of the structure of DNA. The successful completion in 2001 of this so-called Human Genome Project marked a sharp increase in the amount of biological data that was generated, because advancements in sequencing technology made the determination of the full DNA sequence of any organism almost trivial and very affordable.
Martin Kuiper, PhD
PI of the Semantic
Systems Biology group
Department of Biology, NTNU
Aravind Venkatesan
PhD Fellow
Department of Biology, NTNU
Erick Antezana, PhD
Affiliated Scientist
Department of Biology, NTNU
Steven Vercruysse, PhD
Postdoctoral fellow
Department of Biology, NTNU
Vladimir Mironov, PhD
Senior Scientist
Department of Biology, NTNU
Ward Blonde, PhD
Postdoctoral fellow
Institute for Medical Informatics,
Statistics and Documentation,
Medical University of Graz
Figure 1. Systems Biology cycle
Systems Biology is an integrative biology that requires a multidisciplinary effort, incorporating experimental design and data generation ('wet' lab) with knowledge extraction, data analysis, mathematical modeling and simulations ('dry' lab). These efforts are performed in a consecutive and iterative way, with each cycle improving the quality of a system model.
From the above it will be clear that today's molecular biology has turned into a very data-intensive domain. Scientists in this domain are therefore facing the same challenges as in many other disciplines dealing with highly distributed, heterogeneous and voluminous data sources. However, these problems are especially acute because of the way the life science community works. Science is a global effort, with many domain-specific research societies that may work in relative isolation from other domains. Together with the complexity of biological systems, and therefore of the data obtained from those systems, this created the ingredients for the production of massive, uncoordinated, distributed, fragmented and idiosyncratic data repositories. The number of databases in the life sciences outnumbers that of other data-intensive disciplines, and these databases are typically autonomous and disconnected. The extreme heterogeneity of data models (often of marginal quality) and formats made interoperability and data integration hardly feasible, and to make matters worse the widespread incidence of synonymy, homonymy and polysemy negatively affects both the precision and recall of queries over biological data.
Institutes such as the European Molecular Biology Laboratory in Heidelberg, Germany, formed the very first departments exclusively devoted to bioinformatics. In particular, bioinformatics focused on the design and development of dedicated biomolecular databases. These databases host a broad variety of different data types, such as sequence databases (e.g. GenBank, the EMBL data library), transcriptome databases (e.g. ArrayExpress, GEO), and protein databases.
The emergence of powerful so-called genome-wide data production technologies both allowed and made necessary new approaches to analyse and integrate these data, which now constitute the field of Systems Biology. Systems Biology aims for a holistic understanding of a biological system, in contrast to the more traditional reductionist approach that focuses on individual components of the system (Kitano, 2000). One of the foundations of Systems Biology is the use of mathematical and computational modelling: simulation of the behaviour of complex biological systems through mathematical and computational models. These models serve to integrate all relevant biological data about a process or system; a comparison of the process and the model's behaviour allows an assessment of the model's accuracy. The simulations ultimately help biologists predict the behaviour of a system under new conditions, which paves the way to new hypotheses that can be validated experimentally (Figure 1).
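To make the modelling step concrete, here is a deliberately minimal sketch of such a simulation: a one-variable kinetic model of protein abundance integrated with the forward Euler method. The equation and rate constants are illustrative placeholders, not values from any model discussed in the text.

```python
# Minimal sketch of the modelling step in the Systems Biology cycle:
# a one-variable kinetic model of protein abundance, dP/dt = k_syn - k_deg * P,
# integrated with the forward Euler method. The rate constants are
# hypothetical placeholders, not values from any real system.

def simulate_protein(k_syn=2.0, k_deg=0.1, p0=0.0, dt=0.01, steps=10000):
    """Integrate dP/dt = k_syn - k_deg * P and return the final abundance."""
    p = p0
    for _ in range(steps):
        p += dt * (k_syn - k_deg * p)
    return p

# After enough time the simulation approaches the analytical steady state
# P* = k_syn / k_deg, which a modeller would compare against measured data.
steady_state = simulate_protein()
print(round(steady_state, 2))
```

Comparing the simulated steady state against a measured abundance is exactly the model-versus-process comparison step of the cycle in Figure 1, albeit on a toy scale.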
The Semantic Web turns the current World Wide Web, which consists of hyperlinked pages, into a Web of Knowledge that is machine comprehensible. The SW depends on a set of web technologies specifically designed to facilitate automated machine interoperability. It promises to meet the challenge of knowledge management of highly diverse and distributed resources. Systems based on the SW would provide sophisticated frameworks to manage and retrieve information (knowledge). This is achieved by the use of well-established web technologies (such as HTTP) with a number of essential components added on top.
1) SW uses Uniform Resource Identifiers (URIs) to identify documents and concepts of the real world (e.g. 'http://www.semantic-' is a URI representing the human protein CREB1);
2) The SW knowledge representation formalisms include the Resource Description Framework (RDF), RDF Schema (RDFS), and the Web Ontology Language (OWL).
The GO terminology and identifiers are incorporated by the genome database providers, providing unique and cross-domain common entry points in the description of the gene products. This aids the users (life scientists) in further investigation of a gene of interest, enriching the knowledge related to that gene. The success of GO gave rise to the establishment of the Open Biomedical Ontologies (OBO) consortium, which among other things provides a set of foundational principles to structure the further co-ordinated development of bio-ontologies (e.g. ontology orthogonality: different ontologies in the foundry should not overlap). The OBO Foundry now constitutes a set of 53 domain-specific candidate ontologies, which are becoming widely accepted as a reference by the life science community.
The emergence of the Semantic Web is starting to have a significant impact on the querying, sharing and integration of knowledge produced by the Life Sciences. Conceptualised by Tim Berners-Lee in 2001, the Semantic Web (SW) intends to convert the current Web of hyperlinked pages into a machine-comprehensible Web of Knowledge (for a review of its emerging role in biological knowledge management, see Antezana et al., 2009a).
The heterogeneity and fragmentation of data sources has resulted in a significant gap between the wealth of data and the modicum of knowledge being extracted from these data. To get more information out of them, life scientists realised that seamless data integration was required, supporting complex, detailed and targeted queries over multiple distributed resources, and thereby facilitating data analysis, hypothesis generation and experimental design.
Advancements in data archiving, querying and analysis have in turn had a direct impact on the way research is carried out in the life sciences. The rapid development in data production first of all spawned an awareness that structured collection and quality control should be adopted by the scientific community, a challenge that was taken up by grassroots movements like MIAME (Minimum Information About a Microarray Experiment; Brazma et al., 2001), which put together recommendations for experiment design and data recording in the area of transcriptome analysis. This initiative is carried by the global research community, including NTNU (Beisvåg et al., 2011), and has now been taken up in virtually every area of data production (Taylor et al., 2008).
Another important trend is the increasing adoption of ontologies as a means to provide standardized semantics for concepts, or 'things', in the Life Sciences. Bio-ontologies capture the entities and their interrelationships within the life science domain, thereby reducing conceptual ambiguity and increasing re-usability and the efficiency of computational approaches toward mining and knowledge discovery. One of the main ontologies is the Gene Ontology (GO), which facilitated the unambiguous annotation of biomolecules with terms specifying molecular functions, cellular components and biological processes.
Figure 2. Semantic Systems Biology cycle
The cycle starts with gathering and integrating biological knowledge into a semantic knowledge base. (A) Data is then checked for consistency, and made available for querying and automated reasoning. (B) This yields hypotheses about particular functions of biological components that may be used to design experiments. (C) The experiments generate new data and might also verify hypotheses. (D) The new findings are integrated into the knowledge base, thereby enhancing its quality and allowing a new cycle of hypothesis building and testing. The workflow of the Semantic Systems Biology approach is largely similar to Kitano's original depiction of Systems Biology shown in Figure 1.
In order to assemble resources in the SSB platform we have built a series of tools, focusing on the various tasks involved: downloading knowledge from various sources; converting it to the necessary formats to integrate it into a SW-compliant knowledge base; performing specific pre-computing tasks to enhance the information content for querying; and visualising results. The SSB initiative and associated projects ( are hosted and developed at the Norwegian University of Science and Technology (NTNU). In the subsequent sections we provide a brief description of the various project parts and future initiatives.
The Semantic Systems Biology components
The KBs developed under the SSB initiative are hosted on Open Virtuoso, a database system (triple store) that can be used to query RDF graphs. This means that we needed an efficient and flexible data transformation tool to convert data into RDF graphs. As ontologies provide the necessary scaffold for a KB on which complex queries may be executed, we have developed ONTO-PERL (Antezana et al., 2008), essentially an API to facilitate the handling of bio-ontologies. ONTO-PERL plays a pivotal role in the creation of our KBs. This software suite comprises an extensible set of object-oriented Perl modules to facilitate the manipulation of OBO-formatted ontologies and provides conversion utilities to various SW formats such as RDF and OWL. ONTO-PERL is used to build an automated pipeline for all data transformations for our RDF stores.
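As a rough illustration of the kind of transformation such a pipeline performs (sketched here in Python rather than Perl, and not ONTO-PERL's actual code), OBO [Term] stanzas can be read and re-emitted as subject-predicate-object triples. The two-term ontology below is a hypothetical toy example.

```python
# Illustrative sketch of an OBO-to-triples conversion: read [Term] stanzas
# from the OBO flat-file format and emit them as (subject, predicate, object)
# triples. The toy ontology below is hypothetical.

OBO_TEXT = """\
[Term]
id: GO:0005634
name: nucleus
is_a: GO:0043226

[Term]
id: GO:0005730
name: nucleolus
relationship: part_of GO:0005634
"""

def obo_to_triples(text):
    triples, term_id = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            term_id = None                 # a new stanza starts
        elif line.startswith("id: "):
            term_id = line[4:]
        elif term_id and line.startswith("name: "):
            triples.append((term_id, "rdfs:label", line[6:]))
        elif term_id and line.startswith("is_a: "):
            triples.append((term_id, "is_a", line[6:]))
        elif term_id and line.startswith("relationship: "):
            rel, target = line[len("relationship: "):].split()
            triples.append((term_id, rel, target))
    return triples

for t in obo_to_triples(OBO_TEXT):
    print(t)
```

A real converter also has to handle synonyms, cross-references, obsolete terms and proper URI minting, which is exactly the bookkeeping an API like ONTO-PERL takes off the developer's hands.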
Cell Cycle Ontology
The Cell Cycle Ontology (CCO) (Antezana et al., 2009b) was the first knowledge management system that we built with an ONTO-PERL based pipeline. CCO was conceived as an application ontology to service the domain of cell cycle research. Application ontologies are built by combining parts of domain ontologies, for instance the cell cycle branch of the Gene Ontology, which can be further extended according to the needs of the application in question. Application ontologies make use of the formalisation of domain knowledge, thereby facilitating the integration of different types of information, and are embedded in a knowledge base to facilitate data mining and hypothesis generation. CCO is a protein-centric ontology providing information pertaining to virtually all cell cycle proteins, such as the cellular location, molecular function, or biological process they are involved in, the protein-protein interactions observed for them, and orthology relations with proteins in other species.
CCO is available in various formats, most importantly the RDF format, which enables versatile queries about the cell cycle via a SPARQL endpoint. Metarel (see below) and SPARUL are used to pre-compute inferences supported by the CCO. The CCO is also available in the OWL-DL format to support more advanced forms of automated reasoning. OWL has been designed from its inception to enable computer-assisted processing, and reasoning could be applied to consistency checking (e.g. identifying cell cycle proteins that were documented to occur at cellular locations that seemed incompatible with a function in cell cycle control). However, we and others observed that reasoning over large OWL ontologies is often extremely slow and/or prohibited by every minor inconsistency in the logics. We therefore built most of our resources in RDF.
RDF makes it relatively easy for domain experts to represent and integrate large amounts of knowledge, but it was not originally designed to support reasoning tasks. We have made semi-automated reasoning in RDF possible by developing the Metarel vocabulary (Blonde et al., 2011). Metarel is an RDF vocabulary that utilises some of the language constructs offered by OWL Full, and it provides logical semantics to relations between classes.
The mathematical model underlying these languages is the graph: SW graphs can be viewed as sets of statements of the form 'Subject-Predicate-Object' (known as triples), and they are typically stored in specialised database management systems known as RDF/triple stores. 3) The information in triple stores can be retrieved with SPARQL. 4) Inference: the standardised knowledge base (KB) is used to infer unasserted facts through a reasoner. 5) Metadata: the data resources are annotated with metadata, providing for example data provenance.
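To make the triple model concrete, here is a minimal sketch of a triple 'store' and a single-pattern query, mimicking what a SPARQL engine does for one basic graph pattern. The identifiers are shortened, hypothetical stand-ins for real URIs.

```python
# A triple store reduced to its essence: a set of (subject, predicate, object)
# statements plus a basic pattern matcher, mimicking what a SPARQL engine
# does for a single triple pattern. All identifiers are hypothetical,
# shortened stand-ins for real URIs.

TRIPLES = {
    ("protein:CREB1", "located_in", "GO:nucleus"),
    ("protein:CREB1", "type", "Protein"),
    ("protein:TP53",  "located_in", "GO:nucleus"),
}

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Analogous to: SELECT ?x WHERE { ?x located_in GO:nucleus }
for subj, _, _ in match(TRIPLES, p="located_in", o="GO:nucleus"):
    print(subj)
```

A real SPARQL engine joins many such patterns and scales to billions of triples, but the variable-binding idea is the same.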
These components are standards set by the World Wide Web Consortium (W3C), and aid in efficiently representing and querying data. The incorporation of the SW in the Life Sciences is overseen by the Semantic Web Health Care and Life Sciences Interest Group (HCLS-IG) (, which operates under the umbrella of the W3C. HCLS-IG provides guidelines and carries out SW projects in the Life Sciences. Some examples are the DERI Health Care and Life Science Knowledge Base, AlzSWAN, and Bio2RDF.
The concept of Semantic Systems Biology
We were among the first Life Science researchers to recognise the potential of the SW for biological data integration, and have introduced the Semantic Systems Biology (SSB) platform, consisting of a series of resources and tools designed to support the domain of Systems Biology with semantic web technologies. SSB exploits SW technologies in service of automated biological data handling and integration (Figure 2). Semantic Systems Biology intends to pave the way for the use of advanced querying and reasoning approaches to assemble biological system models.
The use of ONTO-PERL as the single source of translations to RDF makes BioGateway much more accessible and well integrated compared to an arbitrary upload of RDF resources from different origins. However, even with an optimised integration of RDF, productive query efforts require both an in-depth knowledge of the ontologies in BioGateway and SPARQL querying skills. For this reason the store is made accessible through a library of example queries in SPARQL that are easy to parameterise (Figure 3), which provides an element of quality control and user-friendliness.
the NCBI taxonomy; and high-quality protein information from the SWISS-PROT database. Because new information is added to these original sources almost on a daily basis, the RDF store is rebuilt from scratch at regular intervals, in a fully automated way, using the functionality offered by ONTO-PERL. In this process, the rule-based closures are computed with the use of Metarel and added to the store. The use of Metarel allows the execution of queries like "Give me all the mammalian proteins located in the nucleus or any of its sub-parts", which are not possible (with a single query) in relational databases.
One of the examples of the use of Metarel in automated reasoning has been in making implicit knowledge explicit. Much of the knowledge encapsulated in ontologies is implicit and can be arrived at through the internal logic of the ontological hierarchy: for instance, the cellular location branch of GO states that 'a nucleolus is a part of a nucleus', and 'a nucleus is part of a cell'. The implicit knowledge in this case is that a nucleolus, being part of a nucleus, is also part of a cell. Updating RDF stores with the SPARQL/Update language, in combination with the semantic power to efficiently handle relationship classes in Metarel, implements a reasoning system that makes the implicit knowledge explicit and augments the store with inferred knowledge that would otherwise not be available to a user. This is achieved through the use of five rules (closures), including Reflexivity, Transitivity and Priority over Subsumption. Together, these rules allow inferencing of knowledge present implicitly in the OBO ontologies and any RDF store using those ontologies. The application of these rules on Metarel-empowered RDF effectively allows reasoning in a tiny logic-based RDF language that reasons only through direct relations between classes with an all-some semantics. Metarel essentially augments RDF to approach the expressiveness of OWL and sets a new paradigm: all the inferences are pre-computed before querying the KB, instead of making inferences during the execution of the query as Description Logics (DL) reasoners attempt to do.
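The pre-computed closure idea can be sketched in a few lines: repeatedly apply a transitivity rule to part_of triples until no new statements appear, so the implicit fact from the nucleolus example above becomes an explicit, queryable triple. This is an illustration of the principle, not Metarel's actual implementation.

```python
# Sketch of the pre-computed closure idea behind Metarel: apply a
# transitivity rule to part_of triples until a fixpoint is reached, so the
# implicit fact "nucleolus part_of cell" becomes an explicit triple.

def transitive_closure(triples, rel="part_of"):
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(closed):
            for (b2, r2, c) in list(closed):
                if r1 == r2 == rel and b == b2:
                    inferred = (a, rel, c)
                    if inferred not in closed:
                        closed.add(inferred)   # materialise the inference
                        changed = True
    return closed

asserted = {("nucleolus", "part_of", "nucleus"),
            ("nucleus",   "part_of", "cell")}
closure = transitive_closure(asserted)
print(("nucleolus", "part_of", "cell") in closure)  # True
```

Because the inferred triples are stored alongside the asserted ones, a later query for everything located "in the nucleus or any of its sub-parts" reduces to a plain lookup, with no reasoning at query time.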
The BioGateway KB (Antezana et al., 2009c) is the major repository of all the RDF graphs engineered with ONTO-PERL and augmented with Metarel. Currently, BioGateway contains about 2.2 billion RDF statements and integrates the entire set of OBO Foundry ontologies; the annotations from the complete set (13 model organisms) of Gene Ontology Annotation files (GOA);
Figure 3. Screenshot of the query pane of BioGateway, on the SSB website.
It is important to have tools with which one can visualise or browse ontologies intuitively, as ontology graphs can be complex, with many interlinked terms and relationships. For example, an obvious analysis step in ontology exploration is to view an ontology term and its 'local neighbourhood' of connected terms, and then to 'expand' one of the interesting neighbour terms by loading its respective neighbouring terms into the display. Existing ontology visualisers often feel rigid and confusing, as the addition of new terms may result in largely rearranged term displays. We created an ontology visualiser that substantially improves this exploration of bio-ontologies:
OLSVis (Vercruysse et al., 2012). OLSVis uses real-time animation to visualise a repelling-and-attracting, force-based re-layout of the visible term graph. By simulating terms as electrically repelling nodes and connections as mechanical springs, graph reorganisations occur smoothly, terms are moved minimally, and users can follow what moves where. In addition, OLSVis provides user interaction for tweaking the graph's visual representation (Figure 4). OLSVis serves the scientific community with a versatile and user-friendly tool to explore ontologies and to find precise ontology terms. The development of this new web application ( was based on our earlier pilot project WordVis (, which allows for visualising WordNet, a lexical database of the English language. OLSVis is in line with our longer-term goal of developing a system for the manual creation of digital, semantics-based summaries of research papers in the life sciences. In this setting, the visualiser can be deployed for semantic term search during input, as well as for the exploration of collected facts. In this respect, our aim is also to apply the software to other semantically formalised knowledge resources.
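The repel-and-spring scheme can be sketched in a few lines. This is not OLSVis code: the force constants are arbitrary, and the node names (borrowed from Figure 4) are just labels for a toy graph.

```python
# Toy sketch of a force-based layout step as used in visualisers like OLSVis:
# all node pairs repel (inverse-square force), while edges pull connected
# nodes together like springs. Repeating small steps settles the graph
# smoothly. Constants are arbitrary, chosen only for this illustration.

import math

def layout_step(pos, edges, repulsion=1.0, spring=0.05, dt=0.1):
    forces = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    for i, a in enumerate(nodes):             # pairwise repulsion
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = repulsion / (d * d)
            forces[a][0] += f * dx / d; forces[a][1] += f * dy / d
            forces[b][0] -= f * dx / d; forces[b][1] -= f * dy / d
    for a, b in edges:                        # spring attraction along edges
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        forces[a][0] += spring * dx; forces[a][1] += spring * dy
        forces[b][0] -= spring * dx; forces[b][1] -= spring * dy
    return {n: (pos[n][0] + dt * fx, pos[n][1] + dt * fy)
            for n, (fx, fy) in forces.items()}

pos = {"CDC23": (0.0, 0.0), "APC": (0.3, 0.0), "nucleus": (2.0, 1.0)}
for _ in range(200):
    pos = layout_step(pos, [("CDC23", "APC"), ("CDC23", "nucleus")])
print({n: (round(x, 1), round(y, 1)) for n, (x, y) in pos.items()})
```

Each animation frame in such a visualiser is essentially one of these steps, which is why newly added terms nudge the graph gently instead of triggering a wholesale rearrangement.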
Current focus
Integrating SSB into the wet-lab
Although the above demonstrates that the core components of SSB are in place, the platform will only have the desired impact when broadly used by biologists. BioGateway and its components are typical products of a technology push, offering potential users access to the powers of a new technology. It will be evident from, for instance, the SPARQL interface (Figure 3) that the technology may be intimidating to the lay person. Therefore we are now pursuing various joint initiatives with the intended target audience to mobilise 'user-pull'.
Much of the knowledge that our knowledge bases offer is obtained by 'borrowing' from other species or domains. If a component in a specific species has been extensively studied, the knowledge obtained for that component (usually a protein) may also be relevant to similar components found in other species. These similar components are found by assessing the degree of sequence similarity, or conservation, between genes or proteins that have diverged because they evolved in different species. The computation of this information, however, proved to be extremely time-consuming. Therefore we have recently developed the C++ library OrthAgogue, which reduces computational time by two orders of magnitude relative to the commonly used software OrthoMCL. The OrthAgogue tool will soon be made available through the Notur portal.
As described above, bio-ontologies are highly standardised vocabularies that describe concepts or things in biology. They constitute the foundation for the annotation, or semantic description, of research findings because they provide a shared, well-defined vocabulary.
Figure 4. Screenshot of OLSVis. The main display shows the human protein CDC23 (part of the CCO knowledge), with the different functional or physical relationships it has with other proteins or functions. The pane on the left shows the specifics of the display; the toolbar on top allows additional modifications of the display.
The tool uses experimental and literature-based evidence to rank a given hypothesis, after which a biologist may consider top-ranking hypotheses for validation experiments.
The Bigger Picture
Our SW knowledge bases are examples of knowledge warehouses, where all the information is first centralised and reformatted, and then made available for querying. The essence of the Semantic Web, however, is to offer seamless access to data residing at different locations on the Web, but the full potential of the SW in offering semantically enriched knowledge for querying, hypothesis generation and automated reasoning has not yet been realized. Many of the SW resources world-wide are built according to a warehouse-like set-up (with one or multiple SPARQL endpoints that can be queried), with all the classical shortcomings, such as a large up-front time investment required for data integration, technical challenges with respect to the infrastructure, and data provenance and maintenance issues. The recently developed SPARQL 1.1 specification may offer a new perspective to query across multiple distributed endpoints through query federation. However, this is still in its nascent stage; issues relating to the differences between SW resources that stem from divergent usage of RDF, as well as query optimisation, need to be addressed. SPARQL query optimisation within a single RDF store is still a challenge; naturally, handling the large amounts of results that get exchanged over the globe will only be more cumbersome. To establish the SW as a robust technology that facilitates the answering of complex biological questions, the above problems have to be addressed at the community level with active participation of SW developers and biological data providers. Such interactions would encourage orthogonality of the content among the triple stores and promote best practices for the use of RDF (Venkatesan et al., 2011).
To provide proof of concept, the "Gastrin response modelling" project has been chosen as a use case. The Gastrin Systems Biology group at the Medical Faculty of NTNU/St Olavs Hospital has for several years taken a systems biology approach to study gastrointestinal tumour biology, with a special focus on the role of gastrin in regulating potential target genes. Our direct objectives ( are to understand and extend a model capturing current experimental results through new hypotheses generated from the knowledge present in GeXKB. For example, local neighbourhood analysis of a transcription factor via SPARQL queries retrieves data about that transcription factor integrated from the various data sources stored in GeXKB, providing us with a lead towards generating hypotheses based on the types of relations linking the protein to other components in the KB.
Hypothesis management
In molecular biology, the results from high-throughput data production technologies, e.g. gene expression studies, can be overwhelming to a biologist. The lack of efficient hypothesis generation and management protocols makes it difficult to handle and comprehend these extensive data sets and to consider all possible hypotheses that may be derived from them. Moreover, as the experimental validation of any hypothesis costs time, effort and money, we seek to support these tasks with an automated pipeline that generates hypotheses and ranks them by the likelihood that they are valid. This ranking should be done considering all available sources of knowledge accessible through the Semantic Web.
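A deliberately simplified sketch of such evidence-based ranking is shown below; the evidence types and weights are invented for illustration and do not reproduce any existing tool's actual scoring model.

```python
# Hypothetical sketch of evidence-weighted hypothesis ranking: each
# hypothesis collects evidence items, each evidence type carries a weight,
# and hypotheses are sorted by their total score. The evidence types and
# weights below are invented for illustration only.

EVIDENCE_WEIGHTS = {"direct_experiment": 1.0, "literature": 0.6, "inferred": 0.3}

def rank_hypotheses(hypotheses):
    """hypotheses: dict mapping a hypothesis label to its evidence-type list."""
    scored = [(sum(EVIDENCE_WEIGHTS.get(e, 0.0) for e in ev), h)
              for h, ev in hypotheses.items()]
    return [h for score, h in sorted(scored, reverse=True)]

ranked = rank_hypotheses({
    "gene_A regulated by TF_X": ["direct_experiment", "literature"],
    "gene_B regulated by TF_X": ["inferred"],
    "gene_C regulated by TF_X": ["literature", "inferred"],
})
print(ranked[0])  # the best-supported hypothesis comes first
```

In a real pipeline the evidence items would be triples retrieved from SW resources, and the weighting rules would themselves be curated, but the sort-by-support principle is the same.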
We follow the example set by HyQue, a tool that uses a rule-based model to prioritise scientific hypotheses based on distributed knowledge sources covering molecular events in the context of the galactose gene network in Saccharomyces cerevisiae (Baker's yeast).
HyQue has been an essential source of platform design ideas. Our current efforts include collaborations with the Gastrin Biology group at the Department of Cancer Research and Molecular Medicine and the Lipid Signalling group (PLA2) at the Department of Biology, both at NTNU. The definitive aim is to develop real-world use-cases that steer the further integration of specific domain knowledge and the development of a user-friendly interface. This has prompted us to focus on the area of transcription control: the analysis of components and mechanisms involved in the production of messenger RNAs from genes.
Capturing gene expression knowledge
Transcriptional gene expression and its regulation depend on a large variety of cellular processes that control the timing and level of transcription of an individual gene. Gene expression falls into two main phases, i.e. transcription and translation. During transcription, proteins called transcription factors bind to specific DNA sequence motifs (binding sites) of a gene, playing a key role in producing pre-mRNAs; these are subsequently processed, after which the mature mRNAs are transported from the nucleus to the cytoplasm, where the mRNA is translated into a protein. Regulatory processes of gene expression occur at different levels, enabling the cell to adapt to different conditions by controlling its structure and function. Abnormalities in the regulation of gene expression can cause diseases, such as malignant cell proliferation. The knowledge required to decipher the various processes involved in gene expression continues to grow. However, for a systems-wide understanding of gene regulation, there is a need to efficiently capture knowledge of this domain in its entirety and to further facilitate efficient querying of these data. Therefore we have developed a system that captures knowledge related to gene expression and regulation, the Gene Expression Knowledge Base (GeXKB).
Figure 5. Semantic Web Stack
Semantic Web quick guide:
The architecture of the Semantic Web involves a hierarchical assembly (the Semantic Web Stack, Figure 5) of various formats and technologies, where each layer exploits the capabilities of the layer below, providing a formal description of concepts and relationships within a given domain. The bottom layers of the Semantic Web Stack consist of technologies that are widely used in the current Web; the Semantic Web is built on the basis of these technologies, and the middle layers (RDF onwards) consist of technologies that have been standardised specifically to support semantic web applications.
Unicode: This is a standard for the consistent representation of text expressed in the world's various writing systems. The usage of Unicode aids semantic web applications in bridging documents expressed in different human language systems.
Uniform Resource Identifier (URI): A URI is a series of characters used to identify an abstract or a physical entity. URIs are used in Semantic Web based systems to describe a resource and its components, enabling interactions over the Web using standard protocols such as HTTP.
Extensible Markup Language (XML): XML is a markup language that provides an elementary structure for describing data. The restrictive syntax rules of XML are highly suited to the Semantic Web and provide a scaffold for data representation; however, XML itself adds no semantics to the data.
Resource Description Framework (RDF): RDF is a data modelling language that provides a framework to describe a resource and its relationships with other resources in a form called triples, i.e. Subject-Predicate-Object. A set of such triples forms an RDF graph, which is made machine-readable using the XML serialisation form. Furthermore, RDF graphs can be stored in repositories called triple stores or RDF stores; these function much like relational database systems, in that one can store and query the graphs via a query language, SPARQL (such stores are also exposed as RDF/SPARQL endpoints). In recent times, a number of efficient, scalable SPARQL-enabled triple stores have been developed, e.g. OpenLink Virtuoso (, Apache Jena ( and 4store (
RDF Schema (RDFS): The graph-based data model of RDF makes it a compelling choice to represent concepts and integrate data from multiple sources. However, RDF lacks descriptions of the relationships between predicates and other resources. RDFS was developed to improve on this: it is an ontology language that facilitates simple inference through the hierarchical specification of classes and properties.
Web Ontology Language (OWL): OWL goes beyond RDF and RDFS in its expressiveness by providing logical constructs for the description of classes. W3C provided three specifications for OWL (also known as OWL 1): OWL-Lite, OWL-DL and OWL-Full (in increasing order of expressivity). OWL-DL, in particular, offers an interesting trade-off between expressiveness and computability: it was developed based on a well-understood fragment of Description Logics (DL), which guarantees computability. As a consequence, a number of DL reasoners, developed by the artificial intelligence community, were made available for deployment in the Semantic Web for tasks such as a) consistency checking and b) inferencing.
Antezana E, Kuiper M, Mironov V: Biological knowledge management: the emerging role of the Semantic
Web technologies. Brief Bioinform 2009a; 10:392-407.
Antezana E, Blonde W, Egana M, Rutherford A, Stevens R, De Baets B, Mironov V, Kuiper M: BioGateway:
a semantic systems biology tool for the life sciences. BMC Bioinformatics 2009b; Suppl 10:S11.
Antezana E, Egana M, Blonde W, Illarramendi A, Bilbao I, De Baets B, Stevens R, Mironov V, Kuiper M:
The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell
cycle process. Genome Biol 2009c; 10:R58.
Berners-Lee T, Hendler J. Publishing on the Semantic Web. Nature 2001; 410:1023-1024.
Beisvåg V, Kauffmann A, et al. Contributions of the EMERALD project to assessing and improving
microarray data quality. BioTechniques 2011; 50:27-31.
Blonde W, Mironov V, Venkatesan A, Antezana E, De Baets B, and Kuiper M, Reasoning with bio-
ontologies: using relational closure rules to enable practical querying. Bioinformatics 2011;
Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment
(MIAME)—toward standards for microarray data. Nature Genetics 2001; 29:365-371.
Gisler M, Sornette D, Woodard R. Exuberant Innovation: The Human Genome Project. Swiss Finance
Institute Research 2010; Paper No. 10-12. Available at SSRN:
Kitano H. Systems biology: a brief overview. Science 2001; 295:1662–1664
Ouzounis CA, Valencia A. Early bioinformatics: The birth of a discipline - A personal view. Bioinformatics
2003; 19:2176–2190.
Taylor CF, Field D, Sansone SA, et al. Promoting coherent minimum reporting guidelines for biological
and biomedical investigations: the MIBBI project. Nat Biotechnol 2008; 26:889-896.
Venkatesan A, Mironov V, Kuiper M. Towards an integrated knowledge system for capturing gene
expression events. In Proceedings of ICBO 2012, Graz, Austria.
Venkatesan A, Blonde W, Antezana E, et al. The RDF Foundry: Call for an initiative to build enhanced RDF
resources for biological data integration. In: Proceedings of WIMS’11 2011, Sogndal, Norway.
Vercruysse S, Venkatesan A, Kuiper M. OLSVis: an animated, interactive visual browser for bio-
ontologies. BMC Bioinformatics 2012; 13:116.
c) classifi cation and d) querying. FaCT++,
Pellet and RACERPro are among the most
commonly used DL reasoners. Although,
research on DL has kept the hope alive for
realizing fully automated reasoning on bio-
ontologies. The application of reasoners on
fully edged large, integrated bio-ontolo-
gies failed due to scalability issues. How-
ever, with the recent release of OWL (OWL
2) signifi cant work has gone into making
reasoning more tangible, offering sever-
al DL-based sublanguages such as OWL 2
DL, OWL 2 EL and OWL 2 RL (http://www. les/).
SPARQL Protocol and RDF Query Language (SPARQL): SPARQL is a query language for RDF. It offers developers and end users a way to retrieve and manipulate data stored in RDF format. It is considered one of the key technologies of the Semantic Web, as it allows users to write unambiguous queries consisting of triple patterns, conjunctions, disjunctions and optional patterns. Furthermore, SPARQL/Update (SPARUL) has been developed as an extension to SPARQL. It is a declarative data manipulation language that offers the ability to insert, delete and update RDF data held within a triple store. The latest version of the language, SPARQL 1.1, adds further features; for instance, it provides users with the capability of query federation, i.e. the ability to launch SPARQL queries against different RDF stores.
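A real deployment would use an RDF library and a SPARQL engine; the following is only a conceptual sketch in Python, with invented ex: terms, of how triple patterns with variables select statements from a store:

```python
# A toy in-memory "triple store": every statement is a
# (subject, predicate, object) tuple, as in RDF.
# The ex: terms below are invented purely for illustration.
triples = {
    ("ex:TP53", "ex:classifiedAs", "ex:TumourSuppressor"),
    ("ex:TP53", "ex:participatesIn", "ex:CellCycleArrest"),
    ("ex:MDM2", "ex:regulates", "ex:TP53"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; a None argument acts as
    a wildcard, playing the role of a variable (?x) in a SPARQL
    triple pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly analogous to the SPARQL query:
#   SELECT ?p ?o WHERE { ex:TP53 ?p ?o . }
for _, p, o in match(s="ex:TP53"):
    print(p, o)
```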
We wish to thank NOTUR II, the Norwegian Metacenter for Computational Science, for providing the computing resources; the High-Performance Computing team at NTNU for their help in setting up the SSB servers; the end-users for their feedback; and the OBO, Life Science and Semantic Web communities for interesting and motivating discussions.