A Performance Study of NoSQL Stores for Biomedical Data
Chaimaa Messaoudi, Mouna Amrou Mhand and Rachida Fissoune
LabTIC laboratory, National School of Applied Sciences, ENSA,
Abdelmalek Essaadi University, Tangier, 90 000, Morocco
messaoudi.chaimaa@gmail.com
amroumounae@gmail.com
ensat.fissoune@gmail.com
Summary. NoSQL data stores can serve as an alternative to traditional relational database systems, particularly for handling big data biomedical applications. Applications that model their data using two or more simple NoSQL models are known as applications with polyglot persistence. Recently, a new family of multi-model data stores was introduced, integrating simple NoSQL data models into a single system. In this paper, we evaluate the performance of the integration of proteomics data sources, comparing a polyglot persistence approach that combines two NoSQL stores, a graph-oriented database (Neo4j) and a document-oriented database (MongoDB), against a multi-model (OrientDB) approach. To perform the comparison, we used datasets from two species: Homo sapiens as a large dataset and Lactobacillus rhamnosus as a small dataset. Storage, deletion and query efficiency are used as the comparison criteria.
1 Introduction
While next-generation sequencing technologies have advanced rapidly, the informatics infrastructure used to manage the data they generate has not kept pace. Extracting knowledge and useful information from biological big data is one of the main endeavors of the bioinformatics community. Moreover, biological data sources are distributed and heterogeneous: each source has its own data format and structure, and it is common that the scientific terms used to describe the data differ from one source to another. These challenges need to be addressed because current relational database technologies lack the resources to handle them and, more generally, the 4 V's of big data (Atzeni et al., 2013).
New non-relational data store systems have emerged under the name of NoSQL systems. These systems support different types of data models that scale and distribute efficiently. We distinguish four NoSQL categories, each with its own specificities and suited to managing a particular kind of data: key-value stores (DynamoDB), column-family databases (Cassandra, HBase), document-based stores (MongoDB, OrientDB) and graph databases (AllegroGraph, OrientDB, Neo4j) (Moniruzzaman and Hossain, 2013). The choice of a NoSQL store depends mainly on the application context and the data model (e.g. graph). Some applications require more than one NoSQL store. For example, in a proteomics application, a protein-protein interaction dataset should be modeled as a graph, while a protein information dataset is better stored in a document database. Applications that simultaneously use different models and data stores are called applications with polyglot persistence.
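This division of labor can be sketched with two in-memory stand-ins, one per store. This is a hypothetical illustration only; the protein names and field names below are invented and not taken from the datasets used in this paper.

```python
# Hypothetical sketch of polyglot persistence for proteomics data:
# protein annotations fit a document model, protein-protein
# interactions fit a graph model. All field names are invented.

# Document store side (e.g. MongoDB): one self-contained record per protein.
protein_documents = {
    "P04637": {"uniprot_id": "P04637", "name": "p53", "length": 393},
    "P38398": {"uniprot_id": "P38398", "name": "BRCA1", "length": 1863},
}

# Graph store side (e.g. Neo4j): proteins are nodes, interactions are edges.
interaction_edges = [("P04637", "P38398", {"score": 0.92})]

def neighbors(protein_id):
    """Return the interaction partners of a protein from the edge list."""
    out = set()
    for a, b, _props in interaction_edges:
        if a == protein_id:
            out.add(b)
        elif b == protein_id:
            out.add(a)
    return out

# A polyglot query touches both stores: document lookup, then graph traversal.
partners = [protein_documents[p]["name"] for p in neighbors("P04637")]
print(partners)  # ['BRCA1']
```

Each store handles only the representation it is good at; the application layer glues the two results together.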
The polyglot persistence approach requires understanding more than one query language and user interface, in addition to managing the communication between the different data stores used in the application. There have been efforts to provide a single NoSQL system that supports multiple data models. These systems are called multi-model NoSQL stores; they simplify application development because only one store is used, but they may decrease application performance (Oliveira and del Val Cura, 2016).
Several research studies have been conducted to evaluate the performance of NoSQL stores such as MongoDB, Cassandra, ArangoDB, CouchDB and OrientDB for the management of large biomedical data sets (Shao and Conrad, 2015; Guimaraes et al., 2015; Wang et al., 2014). Oliveira and del Val Cura (2016) presented a performance evaluation of multi-model data stores in polyglot persistence applications, implementing a synthetic data generator to create hybrid datasets. However, the relative advantage of these approaches when applied to biomedical data sets has not been characterized.
This paper presents a performance study of NoSQL stores for the integration of proteomics data. We compare the performance of a NoSQL multi-model store (OrientDB) to a polyglot persistence approach. OrientDB manages both the document and graph data models and was chosen because it is an open-source multi-model store. The polyglot persistence approach combines a document database (MongoDB) with a graph database (Neo4j). The comparisons cover insertion, deletion, importation and query performance.
This paper is structured as follows. Section 2 gives a brief introduction to NoSQL databases and the main features of polyglot persistence, and discusses related work that evaluates NoSQL data store performance. Section 3 presents the evaluation study and the datasets used, and discusses the practical results obtained. Section 4 concludes and suggests future work.
2 NoSQL databases: An Overview
NoSQL databases have appeared as a solution for storage scalability, the management of large volumes of unstructured data, and parallelism. This section provides a brief overview of NoSQL store models as well as polyglot systems, in particular polyglot persistence and multi-model systems, and discusses related work on NoSQL stores for the integration of biomedical data.
2.1 NoSQL Stores Models
NoSQL data store systems differ from relational databases by offering different data models, which can be classified into four main categories:
Key/Value: similar to maps or dictionaries, where each value is associated with a unique key. Values are isolated and independent from each other, so the system can access any item at runtime without conflicting with other stored data.
Document: designed to manage and store documents, encoded in a standard data exchange format such as XML, JSON (JavaScript Object Notation) or BSON (Binary JSON).
Graph: this model has three basic components: nodes, relationships, and properties of nodes and relationships. The graph is directed, with nodes connected by edges. This model is well suited to applications whose queries traverse several levels of relationships between data.
Column: stores data tables as columns rather than rows, offering more precise access to data, especially in very large data sets.
2.2 Polyglot database architectures
Polyglot database architectures are classified into three main types: 1) the Lambda architecture, 2) polyglot persistence and 3) multi-model databases (Wiese, 2015). In this paper we focus on polyglot persistence and multi-model systems.
2.2.1 Polyglot Persistence
The term polyglot persistence refers to using different data stores in different circumstances (Sadalage and Fowler, 2012), instead of choosing a single database management system to store all the data. Different kinds of data are best handled by different data stores, and polyglot persistence makes it possible to choose as many databases as needed, since each is built for a different purpose. Figure 1 shows the polyglot persistence concept.
FIG. 1: Polyglot Persistence Concept
The main advantage of polyglot persistence is that users can tailor their system to the application requirements. However, the lack of uniform access and the logical redundancy it introduces are among its disadvantages.
2.2.2 Multi-Model systems
Multi-model systems provide a database system that stores data in a single store but accesses the data through different APIs according to different data models. This avoids relying on several storage backends, which would increase the overall complexity of the system and raise concerns such as inter-database consistency, inter-database transactions and interoperability, as well as version compatibility and security. Multi-model systems either support different data models directly inside the storage engine or offer layers for additional data models on top of a single-model engine; see Figure 2.
FIG. 2: Multi-Model Concept
OrientDB and ArangoDB are two open-source multi-model databases. OrientDB presents a document API, an object API and a graph API, and offers extensions of the SQL standard to interact with all three APIs. Such multi-model stores bring several advantages, for instance reduced database administration, improved consistency and easier application development. Lioni et al. (2010) present SeqWare Query Engine, created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Its backend implementation was built on the highly scalable NoSQL HBase database from the Hadoop project. The software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net).
Messina (2015) presents an integrated database structured as a NoSQL graph database based on OrientDB, which allows the integration of different types of data sources (Entrez Gene, miRBase, mirCancer), facilitating bioinformatics analyses within a single system. Bonnici et al. (2014) presented ncRNA-DB, a NoSQL database based on the OrientDB platform that brings together many biological resources dealing with several classes of non-coding RNA (ncRNA), such as miRNA, long non-coding RNA (lncRNA) and circular RNA (circRNA), and their interactions with genes and diseases. More recently, Bio4j (Pareja-Tobes et al., 2015) and BioGraphDB (Fiannaca et al., 2016b,a) have been developed. Bio4j is a Java library that allows building an integrated cloud-based data platform upon a graph structure, focused on the analysis of proteomic data; it integrates data about protein sequences and annotations, GO terms and enzymes.
Another application of NoSQL stores in bioinformatics is BigQ, developed by Gabetta et al. (2015) as an extension of the i2b2 framework, which integrates patient clinical phenotypes with genomic variant profiles generated by next-generation sequencing. The i2b2 web service is composed of an efficient and scalable document-based database, built on CouchDB, that manages annotations of genomic variants, and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data.
Manyam et al. (2013) developed TargetHub, a CouchDB-based database for storing miRNA-gene interactions for integration into high-throughput genomic analyses; it allows users to systematically integrate data from multiple miRNA repositories. In addition, CouchDB has been used to build three other bioinformatics resources (Manyam et al., 2012): GeneSmash, a database that collects data from various bioinformatics resources and provides automated gene-centric annotations used in large-scale projects such as The Cancer Genome Atlas (TCGA); drugBase, a database of drug-target interactions; and HapMapCN, which provides an interface to query the copy number variations identified using the HapMap datasets.
3 Evaluation Study
3.1 Datasets and materials
We conducted an experimental study to compare the latencies of MongoDB combined with Neo4j against OrientDB when storing the protein interactions of two organisms. The graph data are available from the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database at https://string-db.org/cgi/download. For the document datasets, we use protein sequence and functional information from UniProt (Universal Protein Resource). We use the following two data files:
Homo sapiens as a large dataset of 159743 proteins with 11.5 million interactions.
Lactobacillus rhamnosus as a small dataset of 11707 proteins with 1.8 million interactions.
The experiments were performed on a server running the CentOS 7 operating system; Table 1 gives the configuration details. The system versions used in the experiments are OrientDB 2.2.20, MongoDB 3.4.1 and Neo4j 3.2.1 (community edition). The load and query operations used the web interfaces provided by the Neo4j and OrientDB data stores. The collections and documents created in MongoDB can be viewed from a command prompt; since MongoDB does not offer a complete web interface, we used the Robomongo software, which provides a user interface to access, view, create, edit and delete collections and documents.
Features
  Processor: Intel Xeon CPU E5-1650 3.20GHz
  Memory:    16GB
  Storage:   500GB
TAB. 1: Server Configuration
3.2 Results and Discussion
The public datasets listed in the previous section provide a huge amount of information that we have to integrate in a harmonious and consistent way; evaluating the loading, deletion and querying of these data is our goal. The datasets are available for download in several formats, such as tab-delimited plain text, structured XML and FASTA. The latest release of OrientDB provides a powerful tool to move data into and out of a database by executing an Extract-Transform-Load (ETL) process described by a JSON configuration file. The datasets were transformed into comma-separated values (CSV) files. In the graph model, each biological entity (protein) and its properties are mapped to a vertex and its attributes, and each relationship between two biological entities (proteins) is mapped to an edge; if a relationship has properties, they are saved as edge attributes. Vertices and edges are grouped into classes according to the nature of the entities. For example, all the proteins imported from UniProt become instances of the protein vertex class.
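As a hedged sketch of this transformation, the mapping from records to vertex and edge CSV files might look as follows. The column names and example values are invented for illustration; the real datasets use STRING and UniProt fields.

```python
import csv
import io

# Hypothetical sketch of the record-to-CSV transformation described above:
# protein records become vertex rows and interactions become edge rows,
# in files that an ETL tool (OrientDB's ETL process or Neo4j's LOAD CSV)
# could then consume. Column names are invented.

proteins = [
    {"id": "P1", "name": "proteinA"},
    {"id": "P2", "name": "proteinB"},
]
interactions = [{"source": "P1", "target": "P2", "score": 850}]

def to_csv(rows, fieldnames):
    """Serialize a list of dicts into CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

vertex_csv = to_csv(proteins, ["id", "name"])                   # one row per vertex
edge_csv = to_csv(interactions, ["source", "target", "score"])  # one row per edge

print(vertex_csv.splitlines()[0])  # id,name
```

Grouping vertices and edges into separate files mirrors the vertex and edge classes described above: each file loads into one class.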
The performance study compares the NoSQL stores in terms of i) data storing and deletion and ii) query latencies, using two real datasets. OrientDB, MongoDB and Neo4j are considered for this study. The number of seconds taken to complete each operation is measured 30 times, and the average is used to compare the different stores; smaller average times indicate better performance.
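The measurement protocol can be sketched as follows. This is a minimal illustration; the measured `operation` here is a trivial stand-in for the real load, delete and query calls against the stores.

```python
import time

# Sketch of the measurement protocol described above: each operation is
# executed 30 times and the mean wall-clock time in seconds is reported.

def average_latency(operation, repetitions=30):
    """Run `operation` several times and return the mean elapsed seconds."""
    elapsed = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Trivial stand-in for a real store operation.
mean_seconds = average_latency(lambda: sum(range(1000)))
print(mean_seconds >= 0.0)  # True
```

Averaging over repeated runs smooths out caching and scheduling noise, which is why a single measurement would not be comparable across stores.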
3.2.1 Data Storing and Deletion
Data storing involves two operations: i) the importation of a whole dataset into the NoSQL stores and ii) the insertion of a single data record. Deletion consists of removing the whole dataset from the stores. These operations are applied to both the small and large datasets. Figure 3 shows the importation and deletion performance for the document stores (MongoDB and OrientDB) using the small dataset, while Figure 4 displays the results for the large dataset. The importation results reveal that MongoDB performs better than OrientDB for both the small and the large dataset, and the same conclusion holds for the deletion operation. There is no significant additional performance gain for MongoDB when the large dataset is used.
Figure 6 shows the importation and deletion performance for the graph stores (Neo4j and OrientDB) using the small dataset, while Figure 7 displays the results for the large dataset. The importation results reveal that Neo4j performs better than OrientDB for both the small and the large dataset, and the same conclusion holds for the insertion and deletion operations. There is a significant performance gain for Neo4j when the importation is conducted on the large dataset: the larger the network, the more efficient Neo4j is.
Neo4j includes a 'LOAD CSV' Cypher clause for data import, which is a powerful ETL tool. It can load a CSV file from the local filesystem or from a remote URI (e.g. Dropbox, GitHub) and can be combined with USING PERIODIC COMMIT to group the operations on multiple rows into transactions when loading large amounts of data. This can explain the superior performance of Neo4j.
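A hedged sketch of this bulk-load pattern is shown below, assembling the Cypher statement as a string. The file URL, node label and property name are hypothetical; USING PERIODIC COMMIT (the syntax in the Neo4j 3.x line used here) commits every batch of rows, which keeps transactions small during large imports.

```python
# Sketch of the Neo4j bulk-load pattern described above. Only the Cypher
# text is built here; file URL, label and property names are invented.

def load_csv_statement(url, label, key, batch_size=10000):
    """Build a LOAD CSV import statement with periodic commits."""
    return (
        f"USING PERIODIC COMMIT {batch_size}\n"
        f"LOAD CSV WITH HEADERS FROM '{url}' AS row\n"
        f"CREATE (:{label} {{{key}: row.{key}}})"
    )

stmt = load_csv_statement("file:///proteins.csv", "Protein", "id")
print("USING PERIODIC COMMIT 10000" in stmt)  # True
```

The batch size trades memory for commit overhead: larger batches mean fewer transactions but more rows held per transaction.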
In Figure 5, we present the performance results for the insertion of a single record in the two data models: graph (Neo4j and OrientDB) and document (MongoDB and OrientDB).
FIG. 3: Document operations for the small dataset
FIG. 4: Document operations for the large dataset
FIG. 5: Insertion
FIG. 6: Graph operations for the small dataset
FIG. 7: Graph operations for the large dataset
OrientDB (document and graph) shows lower performance than both MongoDB and Neo4j.
FIG. 8: Graph query with depth level 1
FIG. 9: Graph query with depth level 2
FIG. 10: Graph query with depth level 3
FIG. 11: Graph query with depth level 4
We choose to present the performance for only one dataset because there is no difference between the small and large datasets in terms of insertion.
3.2.2 Query performance
We evaluate the performance of the multi-model and polyglot persistence approaches using a query that retrieves a document and its network. A document is randomly selected from the document-oriented database, and the network of the selected document is then extracted from the graph-oriented database by a traversal through the graph up to a fixed depth level, from 1 to 4. With the polyglot persistence data stores, each query runs in two steps: first, a key is randomly selected and the matching Uniprot-ID is retrieved from MongoDB; second, the set of nodes connected with the selected Uniprot-ID is retrieved from Neo4j. The total elapsed time of the query is computed as the sum of the MongoDB and Neo4j elapsed query times. In the multi-model data store, each query returns the matched documents and their connected documents in the graph.
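The two-step polyglot query can be sketched with in-memory dictionaries standing in for MongoDB and Neo4j; this is an illustration of the measurement only, and all identifiers below are invented.

```python
import random
import time

# Sketch of the two-step polyglot query described above: a document lookup
# ("MongoDB") followed by a depth-1 graph fetch ("Neo4j"), with the total
# latency computed as the sum of both steps. All data are invented.

documents = {"k1": "UPID-1", "k2": "UPID-2"}          # key -> Uniprot-ID
graph = {"UPID-1": ["UPID-2"], "UPID-2": ["UPID-1"]}  # adjacency list

def polyglot_query(rng):
    # Step 1: pick a random key and resolve its Uniprot-ID (document store).
    t0 = time.perf_counter()
    uniprot_id = documents[rng.choice(sorted(documents))]
    t1 = time.perf_counter()
    # Step 2: fetch the nodes connected to that Uniprot-ID (graph store).
    connected = graph.get(uniprot_id, [])
    t2 = time.perf_counter()
    total_elapsed = (t1 - t0) + (t2 - t1)  # sum of both elapsed query times
    return uniprot_id, connected, total_elapsed

uid, nbrs, secs = polyglot_query(random.Random(0))
print(uid in {"UPID-1", "UPID-2"} and secs >= 0.0)  # True
```

Deeper traversals would repeat step 2 per level, which is why the polyglot variant pays an increasing cross-store coordination cost as the depth grows.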
Figures 8 and 9 show the performance results for querying the small and large datasets at depth levels 1 and 2. The results show that combining Neo4j and MongoDB gives the best performance for queries with graph traversals up to depth level 2. Figure 10 shows that the performance of polyglot persistence decreases, with OrientDB reaching the best performance for queries that require graph traversals of 3 depth levels, and Figure 11 shows that OrientDB remains the best-performing multi-model data store at depth level 4. We conclude that when an application requires deeper levels of graph traversal, the best performance is reached by OrientDB. The same conclusions hold when querying the large dataset, although with a significant difference in performance: for the large dataset and a query with a graph traversal of depth 2 (Figure 9), the polyglot persistence approach shows a much better average time (18.92 s) than the multi-model store (145.09 s).
4 Conclusion
In this paper, a performance study is provided to evaluate the time needed for storing, deleting and querying data using a polyglot persistence approach and a multi-model system. We found that both the depth level of the graph traversals in queries and the size of the graph influence the performance of polyglot persistence and multi-model data stores. We conclude that for importing, inserting and deleting the biomedical data used in this paper, MongoDB is faster than OrientDB on document-oriented datasets. For graph-oriented datasets, Neo4j shows better performance than OrientDB, using the 'PERIODIC COMMIT' technique provided by Neo4j. Regarding query performance, we found that when the application requires deeper levels of graph traversal, the best performance is reached by OrientDB.
References
Atzeni, P., C. S. Jensen, G. Orsi, S. Ram, L. Tanca, and R. Torlone (2013). The relational model is dead, SQL is dead, and I don't feel so good myself. ACM SIGMOD Record 42(2), 64–68.
Bonnici, V., F. Russo, N. Bombieri, A. Pulvirenti, and R. Giugno (2014). Comprehensive reconstruction and visualization of non-coding regulatory networks in human. Frontiers in Bioengineering and Biotechnology 2, 69.
Fiannaca, A., L. La Paglia, M. La Rosa, A. Messina, P. Storniolo, and A. Urso (2016a). Integrated DB for bioinformatics: a case study on analysis of functional effect of miRNA SNPs in cancer. In International Conference on Information Technology in Bio- and Medical Informatics, pp. 214–222. Springer.
Fiannaca, A., M. La Rosa, L. La Paglia, A. Messina, and A. Urso (2016b). BioGraphDB: a new GraphDB collecting heterogeneous data for bioinformatics analysis. Proceedings of BIOTECHNO.
Gabetta, M., I. Limongelli, E. Rizzo, A. Riva, D. Segagni, and R. Bellazzi (2015). BigQ: a NoSQL-based framework to handle genomic variants in i2b2. BMC Bioinformatics 16(1), 415.
Guimaraes, V., F. Hondo, R. Almeida, H. Vera, M. Holanda, A. Araujo, M. E. Walter, and S. Lifschitz (2015). A study of genomic data provenance in NoSQL document-oriented database systems. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pp. 1525–1531. IEEE.
Lioni, A., C. Sauwens, G. Theraulaz, and J.-L. Deneubourg (2010). SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics 11, S2.
Manyam, G., C. Ivan, G. A. Calin, and K. R. Coombes (2013). TargetHub: a programmable interface for miRNA-gene interactions. Bioinformatics 29(20), 2657–2658.
Manyam, G., M. A. Payton, J. A. Roth, L. V. Abruzzo, and K. R. Coombes (2012). Relax with CouchDB into the non-relational DBMS era of bioinformatics. Genomics 100(1), 1–7.
Messina, A. (2015). ETLs for importing NCBI Entrez Gene, miRBase, mirCancer and microRNA into a bioinformatics graph database.
Moniruzzaman, A. and S. A. Hossain (2013). NoSQL database: new era of databases for big data analytics - classification, characteristics and comparison. arXiv preprint arXiv:1307.0191.
Oliveira, F. R. and L. del Val Cura (2016). Performance evaluation of NoSQL multi-model data stores in polyglot persistence applications. In Proceedings of the 20th International Database Engineering & Applications Symposium, pp. 230–235. ACM.
Pareja-Tobes, P., R. Tobes, M. Manrique, E. Pareja, and E. Pareja-Tobes (2015). Bio4j: a high-performance cloud-enabled graph-based data platform. bioRxiv, 016758.
Robomongo. A user interface for MongoDB. Retrieved January 23, 2017, from https://robomongo.org/.
Sadalage, P. J. and M. Fowler (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education.
Shao, B. and T. Conrad (2015). Are NoSQL data stores useful for bioinformatics researchers? International Journal on Recent and Innovation Trends in Computing and Communication 3(3), 1704–1708.
Wang, S., I. Pandis, C. Wu, S. He, D. Johnson, I. Emam, F. Guitton, and Y. Guo (2014). High dimensional biological data retrieval optimization with NoSQL technology. BMC Genomics 15(8), S3.
Wiese, L. (2015). Polyglot database architectures = polyglot challenges. In LWA, pp. 422–426.
Résumé
NoSQL stores have recently been introduced as an alternative to traditional relational database systems for managing big data applications. Applications that use two or more simple NoSQL data models are known as applications with polyglot persistence. Recently, a new family of multi-model data stores was introduced, integrating simple NoSQL data models into a single system. In this paper, we evaluate the performance of NoSQL stores for the integration of proteomics data sources. Two systems were evaluated: the first is a polyglot persistence approach combining two NoSQL stores, a graph-oriented database (Neo4j) and a document-oriented database (MongoDB); the second uses the multi-model database OrientDB. We used two datasets: Homo sapiens as a large dataset and Lactobacillus rhamnosus as a small dataset. Storage time, deletion time and query efficiency are used as the comparison criteria.
... We chose OrientDB, as it is currently one of the most popular and advanced multi-model database [7], [8], whereas MongoDB and Neo4j are suitable representatives of document [9] and graph [10] databases. As for the comparison metric, we use the execution time of queries as it is a standard metric for comparison also in other (non-cluster) benchmarks [11], [12], [13]. ...
... There are many comparisons of multi-model databases with different representatives of its single-model variants. The significant number of these comparisons used OrientDB and compared it with Neo4j and MongoDB [12], [14]. However, none of them used a cluster setup in their comparison. ...
... Messaoudi et al. [12] also used Neo4j, MongoDB, and OrientDB in their comparisons. Their research was focused on biomedical data. ...
Conference Paper
Full-text available
Digitalization is currently the key factor for progress, with a rising need for storing, collecting, and processing large amounts of data. In this context, NoSQL databases have become a popular storage solution, each specialized on a specific type of data. Next to that, the multi-model approach is designed to combine benefits from different types of databases, supporting several models for data. Despite its versatility, a multi-model database might not always be the best option, due to the risk of worse performance comparing to the single-model variants. It is hence crucial for software engineers to have access to benchmarks comparing the performance of multi-model and single-model variants. Moreover, in the current Big Data era, it is important to have cluster infrastructure considered within the benchmarks. In this paper, we aim to examine how the multi-model approach performs compared to its single-model variants. To this end, we compare the OrientDB multi-model database with the Neo4j graph database and the MongoDB document store. We do so in the cluster setup, to enhance state of the art in database benchmarks, which is not yet giving much insight into cluster-operating database performance.
... As a part of NoSQL benchmarks, the multi-model database benchmark is listed separately due to the particularity of its data model. According to Messaoudi et al. (2017Messaoudi et al. ( , 2018, in biomedical big data, the authors selected a single multi-model database OrientDB and a polyglot persistence instance composed of MongoDB and Neo4j to carry out performance evaluation with multiple workloads, such as insertion, deletion, and search operations. The results showed that MongoDB performed better than OrientDB in processing document data, and OrientDB performed better than Neo4j in querying graph data when the depth of the graph reached three layers. ...
Article
Full-text available
As the need for handling data from various sources becomes crucial for making optimal decisions, managing multi-model data has become a key area of research. Currently, it is challenging to strike a balance between two methods: polyglot persistence and multi-model databases. Moreover, existing studies suggest that current benchmarks are not completely suitable for comparing these two methods, whether in terms of test datasets, workloads, or metrics. To address this issue, the authors introduce MDBench, an end-to-end benchmark tool. Based on the multi-model dataset and proposed workloads, the experiments reveal that ArangoDB is superior at insertion operations of graph data, while the polyglot persistence instance is better at handling the deletion operations of document data. When it comes to multi-thread and associated queries to multiple tables, the polyglot persistence outperforms ArangoDB in both execution time and resource usage. However, ArangoDB has the edge over MongoDB and Neo4j regarding reliability and availability.
... Messaoudi [23] evaluated the performance time needed for storing, deleting and querying biomedical data of two species: Homo sapiens as a large dataset and Lactobacillus Rhamnosus as a small dataset, using Neo4J and OrientDB Graph databases. They found that Neo4J showed a better performance than OrientDB using 'PERIODIC COMMIT' technique for importing, inserting and deleting. ...
Chapter
Full-text available
Abstract. In recent years, the increase in the amount of data gener- ated in basic social practices and specifically in all fields of research has boosted the rise of new database models, many of which have been em- ployed in the field of Molecular Biology. NoSQL Graph databases have been used in many types of research with biological data, especially in cases where data integration is a determining factor. For the most part, they are used to represent relationships between data along two main lines: (i) to infer knowledge from existing relationships; (ii) to represent relationships from a previous data knowledge. In this work, a short his- tory in a timeline of events introduces the mutual evolution of databases and Molecular Biology. We present how Graph databases have been used in Molecular Biology research using High Throughput Sequencing data, and discuss their role and the open field of research in this area.
Conference Paper
Full-text available
Current bioinformatics databases provide huge amounts of different biological entities such as genes, proteins, diseases, microRNA, annotations, literature references. In many case studies, a bioinformatician often needs more than one type of resource in order to fully analyse his data. In this paper, we introduce BioGraphDB, a bioinformatics database that allows the integration of different types of data sources, so that it is possible to perform bioinformatics analysis using only a comprehensive system. Our integrated database is structured as a NoSQL graph database, based on the OrientDB platform. This way we exploit the advantages of that technology in terms of scalability and efficiency with regards to traditional SQL database. At the moment, we integrated ten different resources, storing and linking data about genes, proteins, microRNAs, molecular pathways, functional annotations, literature references and associations between microRNA and cancer diseases. Moreover, we illustrate some typical bioinformatics scenarios for which the user just needs to query the BioGraphDB to solve them.
Technical Report
Full-text available
This work is the first of a series of technical report documenting the performed activities to build a big bioinformatics database. Current available bioinformatics databases provide huge amounts of different biological entities such as genes, proteins, diseases, microRNA, annotations, literature references. But in many case studies, a bioinformatician often needs more than one type of resource in order to full analyze his data. The bioinformatics database object of this work will allow the integration of different types of data sources, so that it is possible to perform bioinformatics analysis using only one comprehensive system. The integrated database will be structured as a NoSQL graph database, based on the OrientDB platform, exploiting this way the advantages of that technology in terms of scalability and efficiency with regards to traditional SQL database.
Article
Background: Precision medicine requires the tight integration of clinical and molecular data. To this end, it is mandatory to define proper technological solutions able to manage the overwhelming amount of high-throughput genomic data needed to test associations between genomic signatures and human phenotypes. The i2b2 Center (Informatics for Integrating Biology and the Bedside) has developed a widely adopted international framework to use existing clinical data for discovery research that, when coupled with genetic data, can help define precision medicine interventions. i2b2 can be significantly advanced by designing efficient management solutions for Next Generation Sequencing data. Results: We developed BigQ, an extension of the i2b2 framework, which integrates patient clinical phenotypes with genomic variant profiles generated by Next Generation Sequencing. A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. We report an evaluation of the query performance of our system on more than 11 million variants, showing that the implemented solution scales linearly in query time and disk space with the number of variants. Conclusions: In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants, and a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore allows managing the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0861-0) contains supplementary material, which is available to authorized users.
Article
Background. Next Generation Sequencing and other high-throughput technologies have brought a revolution to the bioinformatics landscape, offering vast amounts of data about previously inaccessible domains in a cheap and scalable way. However, fast, reproducible, and cost-effective data analysis at such scale remains elusive. A key need for achieving it is being able to access and query the vast amount of publicly available data, especially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sorts of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented. Methods and Results. Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models. We have modeled and integrated most publicly available data linked with proteins into a set of interdependent graphs. Data querying is possible through a data-model-aware Domain Specific Language implemented in Java, letting the user write typed graph traversals over the integrated data. A ready-to-use cloud-based data distribution, based on the Titan graph database engine, is provided; generic data import code can also be used for in-house deployment. Conclusion. Bio4j represents a unique resource for the bioinformatician, providing at once a solution for several key problems: data integration; expressive, high-performance data access; and a cost-effective, scalable cloud deployment model.
Article
Background: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries over hundreds of different patient gene expression records are slow in relational databases. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results: In this paper we introduce a new data model better suited to high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented it using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase over MongoDB. Conclusions: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that used in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
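The key-value layout this abstract describes can be sketched with an in-memory dictionary standing in for HBase; the point is the row-key design (patient and probe concatenated into one composite key, queried by prefix scan, as HBase row keys are). All names and expression values below are invented for illustration:

```python
# Plain dict standing in for an HBase table; keys mimic a composite
# row key "patient|probe" so that one patient's rows are contiguous.
store = {}

def put(patient, probe, value):
    store[f"{patient}|{probe}"] = value

def scan_patient(patient):
    """Prefix scan: the access pattern composite row keys optimize for."""
    prefix = f"{patient}|"
    return {k.split("|", 1)[1]: v
            for k, v in store.items() if k.startswith(prefix)}

put("patient42", "probe_0001", 7.31)
put("patient42", "probe_0002", 5.02)
put("patient99", "probe_0001", 6.88)

print(scan_patient("patient42"))
# {'probe_0001': 7.31, 'probe_0002': 5.02}
```

In HBase proper the prefix scan is a range scan over lexicographically sorted row keys rather than a filter over all keys, which is what makes per-patient retrieval fast at scale; the sketch only illustrates the key design, not the performance characteristics.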
Conference Paper
NoSQL data store systems have recently been introduced as alternatives to traditional relational database management systems. These data store systems implement simpler and more scalable data models that increase the performance and efficiency of a new kind of emerging complex database application. Applications that model their data using two or more simple NoSQL models are known as applications with polyglot persistence. Their implementations are usually complex because they must manage and store data in several data store systems simultaneously. Recently, a new family of multi-model data stores was introduced, integrating simple NoSQL data models into a single system. This paper presents a performance evaluation of multi-model data stores used by an application with polyglot persistence. In this research, multi-model datasets were synthesized in order to simulate such an application. We evaluate benchmarks based on a set of basic database operations on single-model and multi-model data store systems. Experimental results show that in some scenarios multi-model data stores have similar or better performance than single-model data stores.
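Polyglot persistence as described above can be sketched by routing each write through the application layer to two stand-in stores, one document-shaped and one graph-shaped. Plain Python containers replace the MongoDB and Neo4j clients the paper's setup uses, and all identifiers are illustrative:

```python
# Stand-ins for the two stores an application with polyglot
# persistence writes to: a document store for full records and a
# graph store for relationships. Real clients would replace these.
document_store = {}   # id -> full document (MongoDB-like)
graph_store = []      # (src, rel, dst) triples (Neo4j-like)

def save_protein(doc):
    """Application-layer routing: one logical write, two stores."""
    document_store[doc["id"]] = doc                  # document model
    for gene in doc.get("encoded_by", []):           # graph model
        graph_store.append((gene, "encodes", doc["id"]))

save_protein({"id": "P04637",
              "name": "Cellular tumor antigen p53",
              "encoded_by": ["TP53"]})

print(document_store["P04637"]["name"])  # Cellular tumor antigen p53
print(graph_store)                       # [('TP53', 'encodes', 'P04637')]
```

The complexity the abstract points at lives exactly in this routing layer: the application must keep both stores consistent itself, which is the burden a multi-model store such as OrientDB removes by accepting the same write once.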
Conference Paper
The era of “big data” gave rise to the need for computational tools in support of biological tasks. Many types of bioinformatics tools have been developed for different biological tasks such as target, pathway, and gene set analysis, but integrated resources that incorporate a single web interface and can manage a biological scenario involving many different data sources are still lacking. Many bioinformatics approaches require several data processing and evaluation steps to reach the final results. In this work, we tackle a biological case study by exploiting the capabilities of an integrated multi-component resource database able to deal with complex biological scenarios. As an example of our problem-solving approach, we provide a case study on the analysis of the functional effect of miRNA single nucleotide polymorphisms (SNPs) in cancer disease.
Article
Research attention has turned to understanding the functional roles of non-coding RNAs (ncRNAs). Many studies have demonstrated their deregulation in cancer and other human disorders. ncRNAs are also present in extracellular human body fluids such as serum and plasma, giving them great potential as non-invasive biomarkers. However, non-coding RNAs have been discovered relatively recently, and a comprehensive database including all of them is still missing. Reconstructing and visualizing the network of ncRNA interactions are important steps toward understanding their regulatory mechanisms in complex systems. This work presents ncRNA-DB, a NoSQL database that integrates ncRNA interaction data from a large number of well-established online repositories. The interactions involve RNA, DNA, proteins, and diseases. ncRNA-DB is available at http://ncrnadb.scienze.univr.it/ncrnadb/. It is equipped with three interfaces: web-based, command line, and a Cytoscape app called ncINetView. By accessing only one resource, users can search for ncRNAs and their interactions, build a network annotated with all known ncRNAs and associated diseases, and use all visual and mining features available in Cytoscape.