A Performance Study of NoSQL Stores for Biomedical Data
Chaimaa Messaoudi, Mouna Amrou Mhand and Rachida Fissoune
LabTIC laboratory, National School of Applied Sciences, ENSA,
Abdelmalek Essaadi University, Tangier, 90 000, Morocco
messaoudi.chaimaa@gmail.com
amroumounae@gmail.com
ensat.fissoune@gmail.com
Summary. NoSQL data stores can serve as an alternative to traditional relational
database systems, particularly for handling big data in biomedical applications.
Applications that model their data using two or more simple NoSQL models
are known as applications with polyglot persistence. Recently, a new family of
multi-model data stores was introduced, integrating several simple NoSQL data
models into a single system. In this paper, we evaluate the performance of
integrating proteomics data sources using a polyglot persistence approach that
combines two NoSQL stores, a graph-oriented database (Neo4j) and a
document-oriented database (MongoDB), against a multi-model (OrientDB)
approach. For the comparison study, we used datasets from two species:
Homo sapiens as a large dataset and Lactobacillus rhamnosus as a small dataset.
Storage, deletion and query efficiency are used as comparison criteria.
1 Introduction
While next-generation sequencing technologies have advanced rapidly, the informatics
infrastructure used to manage the data they generate has not kept pace. Extracting
knowledge and useful information from biological big data is one of the main endeavors
of the bioinformatics community. Moreover, biological data sources are distributed and
heterogeneous: each source has its own data format and structure, and the scientific
terms used to describe the data often differ from one source to another. These challenges
need to be addressed, because current relational database technologies lack the resources
to handle them and, more generally, the four V's of big data (Atzeni et al., 2013).
New non-relational data store systems have emerged under the name of NoSQL systems.
These systems support different types of data models that scale and distribute efficiently.
We distinguish four NoSQL categories, each with its own specificities and suited to
managing a particular kind of data: key-value stores (DynamoDB), column-family
databases (Cassandra, HBase), document-based stores (MongoDB, OrientDB) and graph
databases (AllegroGraph, OrientDB, Neo4j) (Moniruzzaman and Hossain, 2013). The
choice of a NoSQL store depends mainly on the application context and the data model
(e.g. graph). Some applications require more than one NoSQL store. For example, in a
proteomics application, a protein-protein interaction dataset is naturally modeled as a
graph, while a protein information dataset is more appropriately stored in a document
database. Applications that simultaneously use different models and data stores are
called applications with polyglot persistence.
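The division of labor described above can be sketched in a few lines. The snippet below is an illustrative stand-in only: plain Python dictionaries play the roles of the document store (protein records) and the graph store (interactions); all identifiers and sample values are hypothetical.

```python
# Document store stand-in: one record per protein, keyed by a UniProt-style ID.
document_store = {
    "P04637": {"name": "TP53", "organism": "Homo sapiens", "length": 393},
    "P38398": {"name": "BRCA1", "organism": "Homo sapiens", "length": 1863},
}

# Graph store stand-in: protein-protein interactions as an adjacency list.
graph_store = {
    "P04637": ["P38398"],
    "P38398": ["P04637"],
}

def protein_with_partners(uniprot_id):
    """Combine both stores: fetch a protein document and its interaction partners."""
    doc = document_store[uniprot_id]
    partners = graph_store.get(uniprot_id, [])
    return {"protein": doc, "interacts_with": partners}
```

In a real polyglot deployment each dictionary would be replaced by a client for the corresponding database, and the application code would carry the burden of keeping the two stores consistent.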
The polyglot persistence approach requires understanding more than one query language
and user interface, in addition to managing the communication between the different data
stores used in the application. Efforts have therefore been made to provide a single
NoSQL system that supports multiple data models. These systems, called multi-model
NoSQL systems, simplify application development because they use only one store, but
they may decrease application performance (Oliveira and del Val Cura, 2016).
Several research studies have evaluated the performance of NoSQL stores such as
MongoDB, Cassandra, ArangoDB, CouchDB and OrientDB for the management of large
biomedical datasets (Shao and Conrad, 2015; Guimaraes et al., 2015; Wang et al., 2014).
Oliveira and del Val Cura (2016) presented a performance evaluation of multi-model
data stores against polyglot persistence, using a synthetic data generator to create hybrid
datasets. However, the relative advantage of these approaches when applied to biomedical
datasets has not been characterized.
This paper presents a performance study of NoSQL stores for the integration of proteomics
data. We compare the performance of a multi-model NoSQL store (OrientDB) to a polyglot
persistence approach. OrientDB manages both document and graph data models and was
chosen because it is an open-source multi-model store. The polyglot persistence approach
combines a document database (MongoDB) with a graph database (Neo4j). The comparison
covers insertion, deletion, importation and query performance.
This paper is structured as follows. Section 2 gives a brief introduction to NoSQL
databases and the main features of polyglot persistence, and discusses related work that
evaluates NoSQL data store performance. Section 3 presents the evaluation study and the
datasets used, and discusses the practical results obtained. Section 4 concludes and
suggests future work.
2 NoSQL databases: An Overview
NoSQL databases have appeared as a solution for storage scalability, management of large
volumes of unstructured data and parallelism. We provide a brief overview of NoSQL store
models as well as polyglot systems, particularly polyglot persistence and multi-model
systems, and then discuss related work on NoSQL stores for the integration of biomedical
data.
2.1 NoSQL Stores Models
NoSQL data store systems differ from relational databases by offering different data mod-
els, which could be classified into four main models:
— Key/Value: similar to maps or dictionaries, where each value is associated with a
unique key. Values are opaque, isolated and independent of each other, so operations
on one key never conflict with any other stored data.
— Document: designed to manage and store documents, encoded in a standard data
exchange format such as XML, JSON (JavaScript Object Notation) or BSON (Binary
JSON).
— Graph: this model has three basic components: nodes, relationships, and properties of
nodes and relationships. The graph is directed, with nodes connected by edges. This
model is well suited to applications whose queries traverse several levels of
relationships between data.
— Column: stores data tables as columns rather than rows, offering more precise access
to data, especially in very large datasets.
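To make the contrast between the four models concrete, the sketch below expresses the same hypothetical protein fact in each of them, using plain Python structures as stand-ins for the actual storage engines. All identifiers and values are illustrative assumptions, not data from the study.

```python
# Key/value: an opaque value behind a unique key.
kv = {"protein:P04637": '{"name": "TP53"}'}

# Document: a self-describing, nested JSON-like record.
doc = {"_id": "P04637", "name": "TP53", "go_terms": ["GO:0006915"]}

# Graph: nodes plus explicit, typed relationships.
nodes = {"P04637": {"name": "TP53"}, "P38398": {"name": "BRCA1"}}
edges = [("P04637", "INTERACTS_WITH", "P38398")]

# Column family: values grouped by column rather than by row,
# so scanning a single attribute over many entities is cheap.
columns = {
    "name": {"P04637": "TP53"},
    "organism": {"P04637": "Homo sapiens"},
}
```

Note how the graph representation is the only one in which the relationship itself is a first-class object, which is why interaction networks map naturally onto it.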
2.2 Polyglot Database Architectures
Polyglot database architectures are classified into three main types: 1) lambda
architecture, 2) polyglot persistence and 3) multi-model databases (Wiese, 2015). In this
paper, we focus on polyglot persistence and multi-model systems.
2.2.1 Polyglot Persistence
The term polyglot persistence refers to using different data stores in different
circumstances (Sadalage and Fowler, 2012), instead of choosing a single database
management system to store all the data. Different kinds of data are best handled by
different data stores, and polyglot persistence makes it possible to choose as many
databases as needed, since each is built for a different purpose. Figure 1 shows the
polyglot persistence concept.
FIG. 1: Polyglot Persistence Concept
The main advantage of polyglot persistence is that it lets users tailor their system to
the application requirements. However, the lack of uniform access and the logical
redundancy it introduces are notable disadvantages.
2.2.2 Multi-Model systems
Multi-model systems store data in a single store but access it through different APIs
corresponding to different data models. They either support the different data models
directly inside the storage engine or offer layers for additional data models on top of a
single-model engine, see Figure 2. Because multi-model databases may rely on different
storage backends, they increase the overall complexity of the system and raise concerns
such as inter-database consistency, inter-database transactions and interoperability, as
well as version compatibility and security.
FIG. 2: Multi-Model Concept
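The multi-model idea of one store behind several APIs can be sketched as follows. This is a minimal illustrative class, not a model of OrientDB's actual engine: records are stored once, and a document-style API and a graph-style API both read from them.

```python
class MultiModelStore:
    """Toy single store exposing a document API and a graph API (hypothetical)."""

    def __init__(self):
        self.records = {}   # id -> document, stored exactly once
        self.edges = []     # (from_id, to_id) pairs over the same records

    # --- Document API ---
    def insert(self, rid, document):
        self.records[rid] = document

    def find(self, rid):
        return self.records.get(rid)

    # --- Graph API over the same records ---
    def link(self, a, b):
        self.edges.append((a, b))

    def neighbours(self, rid):
        return [b for (a, b) in self.edges if a == rid]


# Usage: data inserted once is visible through both APIs.
store = MultiModelStore()
store.insert("P04637", {"name": "TP53"})
store.insert("P38398", {"name": "BRCA1"})
store.link("P04637", "P38398")
```

Because both APIs share one copy of the data, the inter-store synchronization that polyglot persistence requires disappears, at the cost of a more complex engine.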
OrientDB and ArangoDB are two open-source multi-model databases. OrientDB provides a
document API, an object API and a graph API, and offers extensions of the SQL standard
to interact with all three. Such multi-model stores bring several advantages, for instance
reduced database administration, improved consistency and easier application development.
Lioni et al. (2010) present SeqWare Query Engine, which was built with modern cloud
computing technologies and designed to support databasing information from thousands of
genomes. Its backend implementation uses the highly scalable NoSQL HBase database from
the Hadoop project. The software is open source and freely available from the SeqWare
project (http://seqware.sourceforge.net).
Messina (2015) presents an integrated database structured as a NoSQL graph database
based on OrientDB, which allows the integration of different types of data sources (Gene,
miRBase, mirCancer), making it possible to perform bioinformatics analyses using a
single system. The authors in (Bonnici et al., 2014) presented ncRNA-DB, a NoSQL
database based on the OrientDB platform that brings together many biological resources
dealing with several classes of non-coding RNA (ncRNA), such as miRNA, long
non-coding RNA (lncRNA) and circular RNA (circRNA), and their interactions with genes
and diseases. More recently, Bio4j (Pareja-Tobes et al., 2015) and BioGraphDB
(Fiannaca et al., 2016b,a) have been developed. Bio4j is based on a Java library that
builds an integrated cloud-based data platform on a graph structure, focused on the
analysis of proteomic data; it integrates data on protein sequences and annotations, GO
terms and enzymes.
Another application of NoSQL stores in bioinformatics is BigQ, developed by Gabetta
et al. (2015) as an extension of the i2b2 framework, which integrates patient clinical
phenotypes with genomic variant profiles generated by next-generation sequencing. The
i2b2 web service is composed of an efficient and scalable document-based database that
manages annotations of genomic variants, and of a visual programming plug-in designed
to dynamically perform queries on clinical and genetic data. The system is based on
CouchDB.
Manyam et al. (2013) developed TargetHub, a CouchDB-based database for storing
miRNA-gene interactions, which allows users to systematically integrate data from
multiple miRNA repositories into high-throughput genomic analyses. In addition, CouchDB
has been used to build three other bioinformatics resources (Manyam et al., 2012):
GeneSmash, a database that collects data from various bioinformatics resources and
provides the automated gene-centric annotations used in large-scale projects such as The
Cancer Genome Atlas (TCGA); the drugBase database, used to store drug-target
interactions; and the HapMapCN database, which provides an interface to query the copy
number variations identified using the HapMap datasets.
3 Evaluation Study
3.1 Datasets and Materials
We conducted an experimental study to compare the latencies of MongoDB combined with
Neo4j against OrientDB, when storing the interactions of two organisms. For the graph
data, we used the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins)
database, available at https://string-db.org/cgi/download. For the document datasets, we
retrieved protein sequence and functional information from UniProt (Universal Protein
Resource). We used the following two data files:
— Homo sapiens as a large dataset of 159,743 proteins with 11.5 million interactions.
— Lactobacillus rhamnosus as a small dataset of 11,707 proteins with 1.8 million
interactions.
The experiments were performed on a server running the CentOS 7 operating system;
Table 1 gives the configuration details. The system versions used in the experiments are
OrientDB 2.2.20, MongoDB 3.4.1 and Neo4j 3.2.1 (community edition). The load and
query operations used the web interfaces provided by the Neo4j and OrientDB data stores.
The collections and documents created in MongoDB can be viewed from a command
prompt; since MongoDB does not offer a complete web interface, we used the Robomongo
software, which provides a user interface to access, view, create, add, edit and delete
collections and documents.
Features
Processor   Intel Xeon CPU E5-1650 @ 3.20 GHz
Memory      16 GB
Storage     500 GB

TAB. 1: Server Configuration
3.2 Results and Discussion
The publicly available datasets listed in the previous section provide a huge amount of
information that must be integrated in a harmonious and consistent way; our goal is to
evaluate the loading, deletion and querying of these data. The datasets can be downloaded
in several formats, such as tab-delimited plain text, structured XML and FASTA. The
latest release of OrientDB offers a powerful tool to move data into and out of a database
by executing an Extract-Transform-Load (ETL) process, described by a JSON configuration
file. The dataset was transformed into comma-separated values files. In the graph model,
each biological entity (protein) and its properties were mapped to a vertex and its
attributes, and each relationship between two biological entities (proteins) was mapped to
an edge. If a relationship has properties, they are also saved as edge attributes. Vertices
and edges are grouped into classes according to the nature of the entities; for example,
all the proteins imported from UniProt become instances of the protein vertex class.
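The transformation step described above can be sketched as follows. This is a hedged illustration, not the exact pipeline used in the study: a small STRING-like tab-delimited interaction record (column names `protein1`, `protein2`, `score` are assumptions) is split into vertices and scored edges ready for a graph loader.

```python
import csv
import io

# Hypothetical excerpt of a STRING-style tab-delimited interaction file.
raw = "protein1\tprotein2\tscore\nP04637\tP38398\t980\n"

vertices = set()
edges = []
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    # Each biological entity becomes a vertex (duplicates collapse in the set).
    vertices.add(row["protein1"])
    vertices.add(row["protein2"])
    # Each relationship becomes an edge; its confidence score is kept
    # as an edge attribute.
    edges.append((row["protein1"], row["protein2"], int(row["score"])))
```

In the real pipeline the input would be the downloaded STRING file and the output would be the CSV files consumed by the OrientDB ETL or Neo4j LOAD CSV loaders.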
The performance study compares the NoSQL stores in terms of i) data storage and
deletion and ii) query latency. Two real datasets are used, and OrientDB, MongoDB and
Neo4j are considered. The number of seconds taken to complete each operation is
measured 30 times, and the average is used to compare the different stores; smaller
average times indicate better performance.
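The measurement protocol above can be sketched as a small timing harness: run an operation a fixed number of times and report the mean elapsed time in seconds. The function name and default are illustrative, not the study's actual benchmarking code.

```python
import time

def mean_latency(operation, runs=30):
    """Run `operation` `runs` times and return the mean elapsed time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return sum(timings) / runs

# Usage: time any callable, e.g. a store's insert or delete wrapped in a lambda.
avg = mean_latency(lambda: sum(range(1000)))
```

Averaging over 30 runs smooths out caching and scheduling noise, which matters when the compared stores differ by fractions of a second.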
3.2.1 Data Storage and Deletion
Data storage involves two operations: i) importation of a dataset into the NoSQL stores
and ii) insertion of a single data record. Importation loads the whole dataset into the
store, while insertion adds a single record. Deletion consists of removing the whole
dataset from the store. These operations are applied to both the small and the large
dataset. Figure 3 shows the importation and deletion performance of the document stores
(MongoDB and OrientDB) on the small dataset, while Figure 4 shows the results for the
large dataset. The importation results reveal that MongoDB outperforms OrientDB in both
cases, small and large, and the same conclusion holds for the deletion operation. There
is no significant additional performance gain for MongoDB on the large dataset.
Figure 6 shows the importation and deletion performance of the graph stores (Neo4j and
OrientDB) on the small dataset, while Figure 7 shows the results for the large dataset.
The importation results reveal that Neo4j outperforms OrientDB in both cases, small and
large, and the same conclusion holds for the insertion and deletion operations. There is
a significant performance gain for Neo4j when importing the large dataset: the larger the
network, the more efficient Neo4j becomes.
Neo4j includes a 'LOAD CSV' Cypher clause for data import, which acts as a powerful
ETL tool. It can load a CSV file from the local filesystem or from a remote URI (e.g.
Dropbox, GitHub) and can be combined with USING PERIODIC COMMIT to group the
operations on multiple rows into transactions when loading large amounts of data. This
can explain the superior performance of Neo4j.
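An import statement of the kind described above might look as follows. The snippet builds the Cypher text as a Python string; the file name, label, and property names are hypothetical, not the exact statement used in the study.

```python
# Assumed CSV columns: protein1, protein2, score (as in the sketch earlier).
batch_size = 1000  # rows per committed transaction

load_statement = f"""
USING PERIODIC COMMIT {batch_size}
LOAD CSV WITH HEADERS FROM 'file:///interactions.csv' AS row
MERGE (a:Protein {{id: row.protein1}})
MERGE (b:Protein {{id: row.protein2}})
MERGE (a)-[:INTERACTS_WITH {{score: toInteger(row.score)}}]->(b)
"""
```

Batching commits this way keeps each transaction small enough to fit in memory, which is what makes importing millions of interaction rows tractable.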
In Figure 5, we present the performance results for the insertion of a single record in
two data models: graph (Neo4j and OrientDB) and document (MongoDB and OrientDB).
FIG. 3: Document operations for the small dataset
FIG. 4: Document operations for the large dataset
FIG. 5: Insertion
FIG. 6: Graph operations for the small dataset
FIG. 7: Graph operations for the large dataset
OrientDB (document and graph) shows lower performance than MongoDB and Neo4j.
FIG. 8: Graph query with depth level 1
FIG. 9: Graph query with depth level 2
FIG. 10: Graph query with depth level 3
FIG. 11: Graph query with depth level 4
We present the insertion performance for only one dataset, because there is no difference
between the small and large datasets in terms of insertion.
3.2.2 Query Performance
We evaluate the performance of the multi-model and the polyglot persistence approaches
using a query that retrieves a document and its network. The document is randomly selected
from the document-oriented database, then the network of the selected document is extracted
from the graph-oriented database with a traversal through the graph up to a fixed depth level
from 1 to 4. For example, using polyglot persistence data stores, each query was run in two
steps. In the first step, a key is randomly selected and the matched Uniprot-ID is retrieved
from MongoDB. In the second step, the set of nodes connected with the selected Uniprot-ID
is retrieved from Neo4j. The total elapsed time of the query is computed as a sum of both the
Neo4j and MongoDB elapsed query times. In multi-model data stores, each query returns the
matched documents and their connected documents in the graph.
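The two-step polyglot query above can be sketched with in-memory stand-ins: a document lookup (MongoDB's role) followed by a breadth-first traversal of the interaction graph up to a fixed depth (Neo4j's role). All identifiers and sample data are hypothetical.

```python
from collections import deque

# Document store stand-in (MongoDB's role in the study).
documents = {
    "P04637": {"name": "TP53"},
    "P38398": {"name": "BRCA1"},
    "Q09472": {"name": "EP300"},
}
# Graph store stand-in (Neo4j's role): directed interaction edges.
graph = {"P04637": ["P38398"], "P38398": ["Q09472"], "Q09472": []}

def network_up_to_depth(start, depth):
    """Return all proteins reachable from `start` within `depth` hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the fixed depth level
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, d + 1))
    return seen

def query(start, depth):
    # Step 1: document lookup; step 2: graph traversal; then join the results.
    return {pid: documents[pid] for pid in network_up_to_depth(start, depth)}
```

In the polyglot setup these two steps run against separate systems, so the reported query time is the sum of both; in the multi-model setup a single traversal query returns the same result.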
Figures 8 and 9 show the performance results for querying the small and large datasets at
depth levels 1 and 2. The results show that combining Neo4j and MongoDB gives the best
performance for queries with graph traversal up to a depth level of 2. Figure 10 shows
that the performance of polyglot persistence degrades at depth level 3, where OrientDB
reaches the best performance, and Figure 11 shows that OrientDB remains the
best-performing store for graph traversals of depth level 4. We conclude that when an
application requires deeper levels of graph traversal, the best performance is reached by
OrientDB. The same conclusions hold when querying the large dataset, although the
performance gap widens: for the large dataset and a query with graph traversal of depth 2
(Figure 9), the polyglot persistence approach shows a much better average time (18.92 s)
than the multi-model approach (145.09 s).
4 Conclusion
In this paper, a performance study evaluated the time needed to store, delete and query
data using a polyglot persistence approach and a multi-model system. We found that both
the depth of graph traversal in queries and the size of the graph influence the
performance of polyglot persistence and multi-model data stores. For importing, inserting
and deleting the biomedical data considered in this paper, MongoDB is faster than
OrientDB on document-oriented datasets, and Neo4j shows better performance than
OrientDB on graph-oriented datasets, thanks in part to Neo4j's PERIODIC COMMIT
import technique. For query performance, we found that when the application requires
deeper levels of graph traversal, the best performance is reached by OrientDB.
References
Atzeni, P., C. S. Jensen, G. Orsi, S. Ram, L. Tanca, and R. Torlone (2013). The relational
model is dead, SQL is dead, and I don't feel so good myself. ACM SIGMOD Record 42(2),
64–68.
Bonnici, V., F. Russo, N. Bombieri, A. Pulvirenti, and R. Giugno (2014). Comprehensive
reconstruction and visualization of non-coding regulatory networks in human. Frontiers in
Bioengineering and Biotechnology 2, 69.
Fiannaca, A., L. La Paglia, M. La Rosa, A. Messina, P. Storniolo, and A. Urso (2016a).
Integrated DB for bioinformatics: A case study on analysis of functional effect of miRNA
SNPs in cancer. In International Conference on Information Technology in Bio- and
Medical Informatics, pp. 214–222. Springer.
Fiannaca, A., M. La Rosa, L. La Paglia, A. Messina, and A. Urso (2016b). BioGraphDB:
a new GraphDB collecting heterogeneous data for bioinformatics analysis. Proceedings of
BIOTECHNO.
Gabetta, M., I. Limongelli, E. Rizzo, A. Riva, D. Segagni, and R. Bellazzi (2015). BigQ: a
NoSQL-based framework to handle genomic variants in i2b2. BMC Bioinformatics 16(1), 415.
Guimaraes, V., F. Hondo, R. Almeida, H. Vera, M. Holanda, A. Araujo, M. E. Walter, and
S. Lifschitz (2015). A study of genomic data provenance in NoSQL document-oriented
database systems. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International
Conference on, pp. 1525–1531. IEEE.
Lioni, A., C. Sauwens, G. Theraulaz, and J.-L. Deneubourg (2010). SeqWare Query Engine:
storing and searching sequence data in the cloud. BMC Bioinformatics 11, S2.
Manyam, G., C. Ivan, G. A. Calin, and K. R. Coombes (2013). targetHub: a programmable
interface for miRNA-gene interactions. Bioinformatics 29(20), 2657–2658.
Manyam, G., M. A. Payton, J. A. Roth, L. V. Abruzzo, and K. R. Coombes (2012). Relax with
CouchDB into the non-relational DBMS era of bioinformatics. Genomics 100(1), 1–7.
Messina, A. (2015). ETLs for importing NCBI Entrez Gene, miRBase, miRCancer and
microRNA into a bioinformatics graph database.
Moniruzzaman, A. and S. A. Hossain (2013). NoSQL database: New era of databases for big
data analytics: classification, characteristics and comparison. arXiv preprint
arXiv:1307.0191.
Oliveira, F. R. and L. del Val Cura (2016). Performance evaluation of NoSQL multi-model
data stores in polyglot persistence applications. In Proceedings of the 20th International
Database Engineering & Applications Symposium, pp. 230–235. ACM.
Pareja-Tobes, P., R. Tobes, M. Manrique, E. Pareja, and E. Pareja-Tobes (2015). Bio4j: a
high-performance cloud-enabled graph-based data platform. bioRxiv, 016758.
Robomongo. The web API for MongoDB. Retrieved January 23, 2017, from
https://robomongo.org/.
Sadalage, P. J. and M. Fowler (2012). NoSQL Distilled: A Brief Guide to the Emerging World
of Polyglot Persistence. Pearson Education.
Shao, B. and T. Conrad (2015). Are NoSQL data stores useful for bioinformatics researchers?
International Journal on Recent and Innovation Trends in Computing and Communication
3(3), 1704–1708.
Wang, S., I. Pandis, C. Wu, S. He, D. Johnson, I. Emam, F. Guitton, and Y. Guo (2014).
High dimensional biological data retrieval optimization with NoSQL technology. BMC
Genomics 15(8), S3.
Wiese, L. (2015). Polyglot database architectures = polyglot challenges. In LWA, pp. 422–426.
Résumé
NoSQL stores, an alternative to traditional relational database systems for managing big
data applications, have recently been introduced. Applications that use two or more
simple NoSQL data models are known as applications with polyglot persistence. Recently,
a new family of multi-model data stores was introduced, integrating simple NoSQL data
models into a single system. In this paper, we evaluate the performance of NoSQL stores
for the integration of proteomics data sources. Two systems were evaluated: the first is
a polyglot persistence approach combining two NoSQL stores, a graph-oriented database
(Neo4j) and a document-oriented database (MongoDB); the second uses the multi-model
database OrientDB. We used two datasets: Homo sapiens as a large (LARGE) dataset and
Lactobacillus rhamnosus as a small (SMALL) dataset. Storage time, deletion time and
query efficiency are used as comparison criteria.