NoSQL Evaluation
A Use Case Oriented Survey
Robin Hecht
Chair of Applied Computer Science IV
University of Bayreuth
Bayreuth, Germany
robin.hecht@uni-bayreuth.de
Stefan Jablonski
Chair of Applied Computer Science IV
University of Bayreuth
Bayreuth, Germany
stefan.jablonski@uni-bayreuth.de
Abstract—Motivated by the requirements of Web 2.0 applications, a plethora of non-relational databases has arisen in recent years. Since it is very difficult to choose a suitable database for a specific use case, this paper evaluates the underlying techniques of NoSQL databases with respect to their applicability to certain requirements. These systems are compared by their data models, query possibilities, concurrency controls, and partitioning and replication opportunities.
Keywords—NoSQL, Key Value Stores, Document Stores, Column Family Stores, Graph Databases, Evaluation, Use Cases
I. MOTIVATION
The current NoSQL trend is motivated by applications stemming mostly from the Web 2.0 domain. Some of these applications have storage requirements which exceed the capacities and capabilities of relational databases.
In the past, SQL databases were used for nearly every storage problem, even if a data model did not match the relational model well. The object-relational impedance mismatch is one example of using a data model the wrong way; the transformation of graphs into tables is another. This leads to increasing complexity through expensive mapping frameworks and complex algorithms. Even if a data model can easily be covered by the relational one, the huge feature set offered by SQL databases is unneeded overhead for simple tasks like logging. The strict relational schema can be a burden for web applications like blogs, which consist of many different kinds of attributes. Text, comments, pictures, videos, source code and other information have to be stored within multiple tables. Since such web applications are very agile, the underlying databases have to be flexible as well in order to support easy schema evolution. Adding a feature to or removing one from a blog is not possible without system unavailability if a relational database is used.
The increasing amount of data on the web is a problem which has to be considered by successful web platforms like those of Facebook, Amazon and Google. Besides dealing with tera- and petabytes of data, massive numbers of read and write requests have to be answered without any noticeable latency. In order to meet these requirements, these companies maintain clusters with thousands of commodity hardware machines. Due to their normalized data model and their full ACID support, relational databases are not suitable in this domain, because joins and locks influence performance in distributed systems negatively. In addition to high performance, high availability is a fundamental requirement of many companies. Amazon guarantees an availability of at least 99.9% per year for its services [1]. Therefore, databases must be easily replicable and have to provide an integrated failover mechanism to deal with node or datacenter failures. They also must be able to balance read requests across multiple slaves to cope with access peaks which can exceed the capacity of a single server. Since the replication techniques offered by relational databases are limited and these databases are typically geared towards consistency instead of availability, these requirements can only be met with additional effort and high expertise [1].
Due to these requirements, many companies and organizations developed their own storage systems, which are now classified as NoSQL databases. Since every store is specialized to the specific needs of its principals, there is no panacea on the market covering all of the requirements mentioned above. Therefore, it is very difficult to select the one database out of this plethora of systems that is most suitable for a certain use case.
Even though NoSQL databases have already been introduced and compared in the past [2] [3] [4] [5], no use case oriented survey is available. Since single features of certain databases change on a weekly basis, an evaluation of the individual features of specific stores is outdated the moment it is published. Therefore, it is necessary to consider the impact of the underlying techniques on specific use cases in order to provide a durable overview. This paper highlights the most important criteria for a database selection, introduces the underlying techniques and compares a wider range of open source NoSQL databases, including graph databases.
In order to evaluate the underlying techniques of these systems, the features most important for meeting the requirements mentioned above have to be classified. Since the relational data model is considered unsuitable for certain use cases, chapter two will examine the structure and flexibility of the different data models offered by NoSQL systems. Afterwards, the query possibilities of these stores and their impact on system complexity will be analyzed in chapter three. In order to deal with many parallel read and write requests, some stores loosen concurrency restrictions. The different strategies of these systems for handling concurrent requests are inspected in chapter four. Since huge amounts of data and high performance serving exceed the capacities of single machines, the partitioning methods of NoSQL databases are analyzed in chapter five. Replication techniques and their effects on availability and consistency are examined in chapter six.
II. DATA MODEL
NoSQL databases differ from relational databases mostly in their data model. These systems are classified in this evaluation into four groups.
A. Key Value Stores
Key value stores are similar to maps or dictionaries where data is addressed by a unique key. Since values are uninterpreted byte arrays which are completely opaque to the system, keys are the only way to retrieve stored data. Values are isolated and independent of each other, so relationships must be handled in the application logic. Due to this very simple data structure, key value stores are completely schema free. New values of any kind can be added at runtime without conflicting with any other stored data and without influencing system availability. The grouping of key value pairs into collections is the only offered possibility to add some kind of structure to the data model.
Key value stores are useful for simple operations which are based on key attributes only. In order to speed up a webpage rendered specifically for one user, parts of this page can be precalculated and later served quickly and easily out of the store by user ID when needed, as the sketch below illustrates. Since most key value stores hold their dataset in memory, they are oftentimes used for caching the results of more time intensive SQL queries. The key value stores considered in this paper are Project Voldemort [6], Redis [7] and Membase [8].
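As a brief illustration, the following sketch shows this put/get style of interaction using the Python client for Redis; the key names and cached fragments are illustrative assumptions, not part of the evaluated systems' documentation.

```python
# Minimal sketch of key value store usage with the Python Redis client.
# Key names and the cached fragments are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

# Cache a precalculated page fragment under a user-specific key.
r.set("user:42:sidebar", "<div>...rendered HTML...</div>")

# Later, serve the fragment directly by key; the value is an opaque
# byte string, so any interpretation happens in the application.
fragment = r.get("user:42:sidebar")

# Values can be given a time to live, which suits caching the results
# of expensive SQL queries (here: a 60 second TTL).
r.setex("query:top-posts", 60, "serialized result")
```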
B. Document Stores
Document stores encapsulate key value pairs in JSON or JSON-like documents. Within documents, keys have to be unique. Every document contains a special key, ID, which is also unique within a collection of documents and therefore identifies a document explicitly. In contrast to key value stores, values are not opaque to the system and can be queried as well. Therefore, complex data structures like nested objects can be handled more conveniently. Storing data in interpretable JSON documents has the additional advantage of supporting data types, which makes document stores very developer-friendly. Similar to key value stores, document stores do not have any schema restrictions. New documents containing any kind of attributes can be stored as easily as new attributes can be added to existing documents at runtime.
Document stores offer multi attribute lookups on records which may have completely different kinds of key value pairs, as the sketch below illustrates. Therefore, these systems are very convenient for data integration and schema migration tasks. The most popular use cases are real time analytics, logging and the storage layer of small and flexible websites like blogs. The most prominent document stores are CouchDB [9], MongoDB [10] and Riak [11]. Riak offers links, which can be used to model relationships between documents.
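A minimal sketch of this schema freedom, here using PyMongo against MongoDB; the database, collection and field names are illustrative assumptions:

```python
# Sketch of schema-free document storage with PyMongo; database,
# collection and field names are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient()["blog"]

# Two documents in the same collection may carry completely different
# attributes; no schema has to be declared up front.
db.posts.insert_one({"title": "Hello", "text": "First post",
                     "comments": [{"user": "bob", "text": "Nice!"}]})
db.posts.insert_one({"title": "Video post", "video_url": "http://...",
                     "tags": ["nosql", "survey"]})

# Values are not opaque: nested attributes can be queried directly,
# and every document carries the unique _id key.
for post in db.posts.find({"comments.user": "bob"}):
    print(post["_id"], post["title"])
```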
C. Column Family Stores
Column family stores are also known as column oriented stores, extensible record stores and wide columnar stores. All of these stores are inspired by Google's Bigtable [12], which is a “distributed storage system for managing structured data that is designed to scale to a very large size” [12]. Bigtable is used in many Google projects whose requirements vary from throughput-oriented batch processing to latency-sensitive data serving. The data model is described as a “sparse, distributed, persistent multidimensional sorted map” [12]. In this map, an arbitrary number of key value pairs can be stored within rows. Since values cannot be interpreted by the system, relationships between datasets and data types other than strings are not supported natively. Similar to key value stores, these additional features have to be implemented in the application logic. Multiple versions of a value are stored in chronological order to support versioning on the one hand and to achieve better performance and consistency on the other (chapter four). Columns can be grouped into column families, which is especially important for data organization and partitioning (chapter five). Columns and rows can be added very flexibly at runtime, but column families oftentimes have to be predefined, which leads to less flexibility than key value stores and document stores offer. HBase [13] and Hypertable [14] are open source implementations of Bigtable, whereas Cassandra [15] differs from the data model described above, since another dimension called super columns is added. These super columns can contain multiple columns and can be stored within column families. Therefore, Cassandra is more suitable for handling complex and expressive data structures.
Due to their tabular format, column family stores have a graphical representation similar to relational databases. The main difference lies in their handling of null values. Consider a use case with many different kinds of attributes: relational databases would store a null value in each column for which a dataset has no value, whereas column family stores only store a key value pair in a row if the dataset needs it (see the sketch below). This is what Google calls “sparse”, and it makes column family stores very efficient in domains with huge amounts of data with varying numbers of attributes. The convenient storage of heterogeneous datasets is also handled very well by document stores. By supporting nested data structures, links and data types, JSON documents can be more expressive than the data model of column families. Additionally, document stores are very easy to maintain and are therefore suitable for flexible web applications. The data model of column family stores is more suitable for applications dealing with huge amounts of data stored on very large clusters, because this data model can be partitioned very efficiently.
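The following conceptual sketch, in plain Python with made-up attribute names, contrasts the two storage styles:

```python
# Conceptual sketch of the "sparse" storage in column family stores:
# each row holds only the key value pairs it actually needs, instead
# of reserving a column (and a null) for every possible attribute.
# All names are illustrative.

# Relational style: every row carries every column, mostly as nulls.
relational_rows = [
    {"id": 1, "text": "post", "video": None, "code": None},
    {"id": 2, "text": None, "video": "clip.mp4", "code": None},
]

# Column family style: a map from row key to only the columns present.
column_family = {
    "row1": {"text": "post"},
    "row2": {"video": "clip.mp4"},
}
```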
Key value stores, document stores and column family stores have in common that they store denormalized data in order to gain advantages in distribution. Apart from a few exceptions like nested documents and the link feature of Riak, relationships must be handled completely in the application logic. The famous friend-of-a-friend problem and similar recommendation queries would lead to many queries, performance penalties and complex code in the application layer if one of these databases were used for such use cases.
D. Graph Databases
In contrast to relational databases and the key oriented NoSQL databases introduced so far, graph databases are specialized in the efficient management of heavily linked data. Therefore, applications based on data with many relationships are better suited to graph databases, since cost intensive operations like recursive joins can be replaced by efficient traversals.
Neo4j [16] and GraphDB [17] are based on directed, multi relational property graphs. Nodes and edges consist of objects with embedded key value pairs. The range of keys and values can be defined in a schema, whereby more complex constraints can be expressed easily. It is therefore possible to define that a specific edge is only applicable between certain types of nodes.
Property graphs are distinct from resource description framework (RDF) stores like Sesame [18] and Bigdata [19], which are specialized in querying and analyzing subject-predicate-object statements. Since the whole set of triples can be represented as a directed multi relational graph, RDF frameworks are considered a special form of graph databases in this paper, too. In contrast to property graphs, these RDF graphs do not offer the possibility of adding additional key value pairs to edges and nodes. On the other hand, by use of RDF Schema and the Web Ontology Language it is possible to define a more complex and more expressive schema than property graph databases do.
Twitter stores many relationships between people in order to provide its tweet following service. These one-way relationships are handled within Twitter's own graph database FlockDB [20], which is optimized for very large adjacency lists and fast reads and writes.
Use cases for graph databases are location based services, knowledge representation and the path finding problems arising in navigation systems, recommendation systems and all other use cases which involve complex relationships. Property graph databases are more suitable for relationships spanning many nodes, whereas RDF is used for capturing certain details in a graph. FlockDB is suitable for handling simple 1-hop-neighbor relationships with huge scaling requirements.
III. QUERY POSSIBILITIES
Since data models are tightly coupled with query possibilities, analyzing the queries that should be supported by the database is a very important step in finding a suitable data model. NoSQL databases do not only differ in their provided data model, they also differ in the richness of their offered query functionalities. The current state of the NoSQL landscape can be compared to the time before Codd's introduction of SQL. Similar to that time, a lot of technically heterogeneous databases have arisen in the last years which differ in their offered data models, query languages and APIs. Therefore, some research is being done in order to achieve a wider adoption of NoSQL systems by developing standardized query languages [21] [22] for some of these stores. Up to now, developers still have to cope with the specific characteristics of each NoSQL store.
Due to their simple data model, the APIs of key value stores provide key based put, get and delete operations only. Any query language would be unnecessary overhead for these stores. If additional query functionalities are required, they have to be implemented on the application layer, which can quickly lead to much more system complexity and to performance penalties. Therefore, key value stores should not be used if more complex queries or queries on values are required. REST interfaces are very useful in the domain of web applications: heterogeneous clients can directly interact with the store in a uniform way, while requests can be load balanced and results can be cached by proxies. Membase is the only key value store considered here which offers a REST API natively.
In contrast to key value stores, document stores offer much richer APIs. Range queries on values, secondary indexes, queries on nested documents and operations like “and”, “or” and “between” are features which can be used conveniently. Riak and MongoDB queries can be extended with regular expressions. While MongoDB supports additional operations like count and distinct, Riak offers functionalities to traverse links between documents easily. REST interfaces are supported by document stores as well. On top of these powerful interfaces, a query language offering an additional abstraction layer would further increase user-friendliness. Since none of these stores offers such a query language yet, the UnQL project [22] is working on a common query language with a SQL-like syntax for querying JSON document stores. The sketch below illustrates some of these query features.
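A hedged sketch of such queries, again using PyMongo; the field names, the index and the values are illustrative assumptions:

```python
# Sketch of the richer query features of document stores with PyMongo;
# field names and values are illustrative assumptions.
from pymongo import MongoClient

posts = MongoClient()["blog"]["posts"]
posts.create_index("views")               # secondary index on a value

# Range query ("between") combined with "or" and a regular expression.
cursor = posts.find({"$or": [
    {"views": {"$gt": 100, "$lt": 1000}},  # between 100 and 1000
    {"title": {"$regex": "^NoSQL"}},       # regular expression
]})

print(posts.count_documents({}))           # count
print(posts.distinct("author"))            # distinct
```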
Column family stores only provide range queries and some operations like “in”, “and/or” and regular expressions, and only if these are applied to row keys or indexed values. Even though every column family store offers a SQL-like query language in order to provide more convenient user interaction, again only row keys and indexed values can be used in where-clauses. Since these languages are specialized to the specific features of their stores, there is no common query language for column family stores available yet.
As clusters of document and column family stores are able to store huge amounts of structured data, queries can become very inefficient if a single machine has to process the required data. Therefore, all document and column family stores provide MapReduce frameworks, which enable parallelized calculations over very large datasets on multiple machines. However, MapReduce jobs are written at a very low abstraction level and lead to custom programs which are hard to maintain and reuse. Data analytics platforms like Pig [23] and Hive [24] try to bridge the gap between declarative query languages and procedurally programmed MapReduce jobs by offering high level query languages which can be compiled into sequences of MapReduce jobs.
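To give a flavor of this low abstraction level, the following sketch implements the map, shuffle and reduce phases of a word count in plain Python; real frameworks distribute these phases across many machines, and the input data is made up:

```python
# Low-level flavor of a MapReduce job, sketched in plain Python:
# a word count over documents. Real frameworks run map() and reduce()
# in parallel on many machines; here everything runs locally.
from collections import defaultdict

documents = ["nosql stores scale", "nosql stores differ"]

def map_phase(doc):
    # Emit a (key, value) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

print(dict(reduce_phase(k, v) for k, v in groups.items()))
# {'nosql': 2, 'stores': 2, 'scale': 1, 'differ': 1}
```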
Graph databases can be queried in two different ways. Graph pattern matching strategies try to find those parts of the original graph which match a defined graph pattern. Graph traversal, on the other hand, starts from a chosen node and traverses the graph according to a description. The traversal strategies differ in whether they explore the neighborhood level by level and thereby find the shortest path to a matching node (breadth-first) or follow one branch completely before backtracking (depth-first); a breadth-first traversal is sketched below. Most graph databases offer REST APIs, language specific interfaces and store specific query languages in order to access data using one of the two described strategies. In contrast to other NoSQL databases, there exist some query languages which are supported by more than one graph database. SPARQL [25] is a popular, declarative query language with a very simple syntax providing graph pattern matching. It is supported by most RDF stores (including Sesame and BigData) and by Neo4j. Gremlin [26] is an imperative programming language based on XPath which is used to perform graph traversals. Besides being supported by Neo4j and GraphDB, Sesame can also be queried with Gremlin after a graph transformation. FlockDB does not offer any query language, since only simple 1-hop-neighbor relationships are supported.
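As an illustration of the traversal strategy, the following sketch performs a breadth-first search over a small, made-up adjacency structure standing in for a property graph:

```python
# Sketch of a breadth-first graph traversal, the strategy that finds
# the shortest path to a matching node. The adjacency structure is an
# illustrative stand-in for a property graph.
from collections import deque

graph = {"alice": ["bob", "carol"], "bob": ["dave"],
         "carol": ["dave"], "dave": []}

def bfs(start, matches):
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if matches(node):
            return node                 # first hit = nearest match
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None

print(bfs("alice", lambda n: n == "dave"))  # dave, reached in 2 hops
```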
NoSQL databases differ strongly in their offered query functionalities. Besides considering the supported data model and how it influences queries on specific attributes, it is necessary to have a closer look at the offered interfaces in order to find a suitable database for a specific use case. If a simple, language agnostic API is required, REST interfaces can be a suitable solution, especially for web applications, whereas performance critical queries should be exchanged over language specific APIs, which are available for nearly every common programming language like Java. Query languages offer a higher abstraction level and thereby reduce complexity; their use is therefore very helpful when more complicated queries have to be handled. If calculation intensive queries over large datasets are required, MapReduce frameworks should be used.
TABLE I. QUERY POSSIBILITIES

Database                REST API   Java API   Query Language   MapReduce Support
Key Value Stores
  Voldemort                 -          +            -                  -
  Redis                     -          +            -                  -
  Membase                   +          +            -                  -
Document Stores
  Riak                      +          +            -                  +
  MongoDB                   +          +            -                  +
  CouchDB                   +          +            -                  +
Column Family Stores
  Cassandra                 -          +            +                  +
  HBase                     +          +            +                  +
  Hypertable                -          +            +                  +
Graph Databases
  Sesame                    +          +            +                  -
  BigData                   +          +            +                  -
  Neo4J                     +          +            +                  -
  GraphDB                   +          +            +                  -
  FlockDB                   -          +            -                  -
IV. CONCURRENCY CONTROL
If many users access a data source in parallel, strategies for avoiding inconsistencies caused by conflicting operations are necessary. Traditional databases use pessimistic consistency strategies with exclusive access to a dataset. These strategies are suitable if the costs of locking are low and datasets are not blocked for a long time. Since locks are very expensive in database clusters distributed over large distances, and since many web applications have to support very high read request rates, pessimistic consistency strategies can cause massive performance losses.
Multiversion concurrency control (MVCC) relaxes strict consistency in favor of performance. Concurrent access is not managed with locks but by the organization of many unmodifiable, chronologically ordered versions (see the sketch below). Since datasets are not reserved for exclusive access, read requests can be handled by offering the latest version of a value while a concurrent process accomplishes write operations on the same dataset in parallel. In order to cope with two or more conflicting write operations, every process stores, in addition to the new value, a link to the version it read before. Therefore, algorithms on the database or client side are able to resolve conflicting values with different strategies. HBase, Hypertable, Bigdata and GraphDB use the storage of different versions not only for conflict resolution but also for offering versioning. Besides consistency cutbacks, MVCC also causes higher storage space requirements, because multiple versions of one value are stored in parallel. In addition to a garbage collector that deletes versions no longer used, conflict resolving algorithms are needed to deal with inconsistencies. Therefore, MVCC causes higher system complexity.
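The following sketch illustrates the MVCC idea of immutable version chains and ancestor links in plain Python; it is a conceptual model under simplifying assumptions, not any store's actual engine:

```python
# Conceptual MVCC sketch: writers append immutable versions instead of
# locking, readers always see the latest committed version, and a write
# based on a stale version is surfaced as a conflict. Illustrative only.

versions = []  # chronologically ordered, never modified in place

def read():
    # Return the latest value plus its version number,
    # which acts as the "ancestor" link for a later write.
    return len(versions) - 1, (versions[-1] if versions else None)

def write(value, based_on):
    if based_on != len(versions) - 1:
        # Another writer committed in between: a conflict resolving
        # algorithm (or the client) has to decide what happens now.
        raise RuntimeError("conflict: write was based on a stale version")
    versions.append(value)

write("v0", based_on=-1)      # first version
v, _ = read()
write("v1", based_on=v)       # succeeds: based on the latest version
```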
TABLE II. CONCURRENCY CONTROL

Database                Locks   Optimistic Locking   MVCC
Key Value Stores
  Voldemort                 -            +             -
  Redis                     -            +             -
  Membase                   -            +             -
Document Stores
  Riak                      -            -             +
  MongoDB                   -            -             -
  CouchDB                   -            -             +
Column Family Stores
  Cassandra                 -            -             -
  HBase                     +            -             -
  Hypertable                -            -             +
Graph Databases
  Sesame                    +            -             -
  BigData                   -            -             -
  Neo4J                     +            -             -
  GraphDB                   -            -             +
  FlockDB                   -            -             -
In order to support transactions without reserving multiple datasets for exclusive access, optimistic locking is provided by many stores. Before changed data is committed, each transaction checks whether another transaction has made any conflicting modifications to the same datasets. In case of a conflict, the transaction is rolled back. This concept works well when updates are executed rarely and the chances of conflicting transactions are low. In this case, checking and rolling back is cheaper than locking datasets for exclusive access. The sketch below illustrates this check-and-retry pattern.
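The following sketch models the pattern with a made-up store offering a compare-and-set style commit; the class and method names are illustrative, not the API of any of the evaluated systems:

```python
# Sketch of optimistic locking as a check-and-retry loop: read a value
# together with its version, modify it, and commit only if nobody else
# has committed in the meantime. Store and method names are made up.

class OptimisticStore:
    def __init__(self):
        self.value, self.version = 0, 0

    def get_versioned(self):
        return self.value, self.version

    def put_if_unchanged(self, new_value, expected_version):
        if self.version != expected_version:
            return False          # conflicting write happened: roll back
        self.value, self.version = new_value, self.version + 1
        return True

store = OptimisticStore()
while True:                        # retry loop instead of a lock
    value, version = store.get_versioned()
    if store.put_if_unchanged(value + 1, version):
        break
```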
Locking and transactions are supported by Sesame and Neo4j. HBase is the only key based NoSQL system which supports a limited variant of classic locks and transactions, since these techniques can only be applied to single rows. Applications which do not have to deal with many conflicting operations can choose optimistic locking in order to get consistent multi row updates and read-update-delete operations. Optimistic locking is supported by Voldemort, Redis and Membase. Applications which need to respond to many parallel read and write requests and which only have to provide a certain level of consistency should use Riak, CouchDB, Hypertable or GraphDB. If last-update-wins strategies are sufficient, MongoDB, Cassandra and FlockDB can be used in order to handle huge request rates while avoiding the MVCC overhead at the same time.
V. PARTITIONING
Once huge amounts of data and very high read and write request rates exceed the capacity of one server, databases have to be partitioned across database clusters. Due to their normalized data model and their ACID guarantees, relational databases do not scale horizontally: doubling the number of relational database servers does not double the performance of the cluster. Due to this shortcoming, big Web 2.0 companies like Google, Facebook and Amazon developed their own so-called web-scale databases, which are designed to scale horizontally and therefore satisfy these companies' very high requirements on performance and capacity.
NoSQL databases differ in the way they distribute data across multiple machines. Since the data models of key value stores, document stores and column family stores are key oriented, the two common partition strategies are based on keys, too. The first strategy distributes datasets by the range of their keys. A routing server splits the whole keyset into blocks and allocates these blocks to different nodes. Afterwards, one node is responsible for storage and request handling of its specific key ranges. In order to find a certain key, clients have to contact the routing server to get the partition table. This strategy handles range queries very efficiently, because neighboring keys are stored on the same server with high probability. However, since the routing server is responsible for load balancing, key range allocation and partition block advices, the availability of the whole cluster depends on the failure proneness of that single server. Therefore, this server is oftentimes replicated to multiple machines. Higher availability and a much simpler cluster architecture can be achieved with the second distribution strategy, called consistent hashing [27], which is sketched below. In this shared nothing architecture there is no single point of failure. In contrast to range based partitioning, keys are distributed by using hash functions. Since every server is responsible for a certain hash region, the addresses of certain keys within the cluster can be calculated very fast. Good hash functions distribute keys evenly, so an additional load balancer is not required. Consistent hashing also scores with dynamic cluster resizing: in contrast to other approaches, the addition or removal of nodes only affects a small subset of all machines in the cluster. This simple architecture leads to performance penalties on range queries, caused by high network load, since neighboring keys are distributed randomly across the cluster.
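The following sketch shows a minimal consistent hashing ring in Python; the node names and the use of MD5 as the ring hash are illustrative assumptions (real systems typically also place several virtual nodes per server for smoother balancing):

```python
# Sketch of consistent hashing: every server owns a region of a hash
# ring, key placement is a pure calculation, and adding a node only
# remaps the keys falling into its new region. Node names are made up.
import bisect
import hashlib

def ring_position(item):
    return int(hashlib.md5(item.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node is placed on the ring at its hash position.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        # The key belongs to the first node clockwise of its position.
        positions = [p for p, _ in self.ring]
        i = bisect.bisect(positions, ring_position(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # any client computes the same answer
```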
All aforementioned key value stores and the document stores Riak and CouchDB are based on consistent hashing, whereas MongoDB documents are partitioned by the range of their ID. In contrast to key value and document stores, column family stores can also be partitioned vertically: columns of the same column family are stored on the same server in order to increase attribute range query performance. Cassandra datasets are partitioned horizontally by consistent hashing, whereas the Bigtable clones HBase and Hypertable use range based partitioning. Since column family data models can be partitioned more efficiently, these databases are more suitable for huge datasets than document stores.
In contrast to key value based NoSQL stores, where datasets can easily be partitioned, splitting a graph is not straightforward at all. Graph information is not gained by simple key lookups but by analyzing the relationships between entities. On the one hand, nodes should be distributed evenly over many servers; on the other hand, heavily linked nodes should not be distributed over large distances, since traversals would cause huge performance penalties due to heavy network load. Therefore, one has to trade off between these two limitations. Graph algorithms can help to identify hotspots of strongly connected nodes in the graph schema, which can then be stored on one machine. Since graphs can mutate rapidly, graph partitioning is not possible without domain specific knowledge and many complex algorithms. Due to these problems, Sesame, Neo4j and GraphDB do not offer any partitioning opportunities. In contrast, FlockDB is designed for horizontal scalability. Since FlockDB does not support multi-hop graph traversals, the higher network load is not a problem.
Since distributed systems increase system complexity massively, partitioning should be avoided if it is not absolutely necessary. Systems which mostly struggle with high read request rates can scale this workload more easily through replication.
TABLE III. PARTITIONING

Database                Range based   Consistent Hashing
Key Value Stores
  Voldemort                   -                +
  Redis                       -                +
  Membase                     -                +
Document Stores
  Riak                        -                +
  MongoDB                     +                -
  CouchDB                     -                +
Column Family Stores
  Cassandra                   -                +
  HBase                       +                -
  Hypertable                  +                -
Graph Databases
  Sesame                      -                -
  BigData                     -                +
  Neo4J                       -                -
  GraphDB                     -                -
  FlockDB                     -                +
VI. REPLICATION AND CONSISTENCY
In addition to better read performance through load balancing, replication also brings better availability and durability, because failing nodes can be replaced by other servers. Since distributed databases should be able to cope with temporary node and network failures, Brewer observed [28] that only full availability or full consistency can be guaranteed at any one time in a distributed system. If all replicas of a master server were updated synchronously, the system would not be available until all slaves had committed a write operation; if messages got lost due to network problems, the system would not be available for a longer period of time. For platforms which rely on high availability, this solution is not suitable, because even a few milliseconds of additional latency can have a big influence on user behavior. According to Amazon, its sales dropped by 1% for a delay of 100 milliseconds during a test [29]. Google noticed 25% less traffic when 30 instead of 10 search results were presented on the start page, raising page generation time from 400 to 900 milliseconds [30]. In order to increase availability, one has to accept consistency cutbacks. If all replicas are updated asynchronously, reads on replicas might be inconsistent for a short period of time. Such systems are classified by Brewer as BASE systems: they are basically available, have a soft state and are eventually consistent. Tweets or status updates in social networks may be inconsistent, but the system must be highly available and performant at all times. In contrast, consistency is the most important feature in finance applications; therefore, BASE systems are not suitable for banking applications. NoSQL systems are not necessarily either full ACID or full BASE systems. Many databases offer plenty of configuration opportunities to customize the consistency and availability cutbacks to one's needs. These systems offer optimistic replication, since the level of consistency can be configured by determining the number of nodes which have to commit a write operation or which have to respond with the same value to a read request, as sketched below.
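A sketch of the underlying quorum arithmetic, with illustrative numbers: with N replicas per key, a write waits for W acknowledgements and a read queries R replicas, and choosing R + W > N forces the read and write sets to overlap:

```python
# Sketch of quorum-style tunable consistency. With N replicas, a write
# waits for W acknowledgements and a read queries R replicas; choosing
# R + W > N makes read and write sets overlap, so reads see the latest
# committed write. Smaller R and W trade consistency for availability.
# The numbers below are illustrative.

N = 3              # replicas per key

def is_strongly_consistent(r, w):
    return r + w > N

print(is_strongly_consistent(r=2, w=2))  # True: overlap guaranteed
print(is_strongly_consistent(r=1, w=1))  # False: eventually consistent
```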
Systems which are eventually consistent in order to increase availability are Redis, CouchDB and Neo4j. Considering that CouchDB and Neo4j also offer master-master replication, these systems are suitable for the offline support needed, e.g., in mobile applications. Voldemort, Riak, MongoDB, Cassandra and FlockDB offer optimistic replication, which is why they can be used in almost any context. Since Membase, HBase, Hypertable, Sesame and GraphDB do not use replication for load balancing, these stores offer full consistency. BigData is the only store which supports full consistency and replication natively.
TABLE IV. REPLICATION

Database                Read from Replica   Write to Replica   Consistency
Key Value Stores
  Voldemort                     +                   -               +/-
  Redis                         +                   -                -
  Membase                       -                   -                +
Document Stores
  Riak                          +                   -               +/-
  MongoDB                       +                   -               +/-
  CouchDB                       +                   +                -
Column Family Stores
  Cassandra                     +                   -               +/-
  HBase                         -                   -                +
  Hypertable                    -                   -                +
Graph Databases
  Sesame                        -                   -                +
  BigData                       +                   -                +
  Neo4J                         +                   +                -
  GraphDB                       -                   -                +
  FlockDB                       +                   +               +/-
VII. CONCLUSION
“Use the right tool for the job” is the ideology propagated by the NoSQL community, because every NoSQL database is specialized for certain use cases. Since there is no evaluation available which answers the question “Which tool is the right tool for the job?”, the advantages and disadvantages of these stores were compared in this paper.
First of all, developers have to evaluate their data in order to identify a suitable data model and to avoid unnecessary complexity due to transformation or mapping tasks. The queries which should be supported by the database have to be considered at the same time, because these requirements massively influence the design of the data model. Since no common query language is available, every store differs in its supported query feature set. Afterwards, developers have to trade off between high performance through partitioning and load balanced replica servers, high availability supported by asynchronous replication, and strict consistency. If partitioning is required, the selection among the different partition strategies depends on the supported queries and on cluster complexity. Besides these different requirements, durability mechanisms, community support and useful features like versioning also influence the database selection. In general, key value stores should be used for very fast and simple operations, document stores offer a flexible data model with great query possibilities, column family stores are suitable for very large datasets which have to be scaled across large clusters, and graph databases should be used in domains where entities are as important as the relationships between them.
REFERENCES
[1] G. DeCandia, et al., "Dynamo: amazon's highly available key-value
store," in SOSP '07 Proceedings of twenty-first ACM SIGOPS, New
York, USA, 2007, pp. 205-220.
[2] K. Orend, "Analysis and Classification of NoSQL Databases and
Evaluation of their Ability to Replace an Object-relational Persistence
Layer," Master Thesis, Technical University of Munich, Munich, 2010.
[3] R. Cattell, "Scalable SQL and NoSQL Data Stores," ACM SIGMOD
Record, vol. 39, December 2010.
[4] C. Strauch, "NoSQL Databases," unpublished.
[5] A. Marcus, "The NoSQL Ecosystem," in The Architecture of Open Source Applications, A. Brown and G. Wilson, Eds., lulu.com, 2011, pp. 185-204.
[6] "Project Voldemort." Internet: http://project-voldemort.com,
[30.09.2011]
[7] "Redis." Internet: http://redis.io, [30.09.2011]
[8] "Membase." Internet: http://couchbase.org/membase, [30.09.2011]
[9] "CouchDB." Internet: http://couchdb.apache.org, [30.09.2011]
[10] "MongoDB." Internet: http://mongodb.org, [30.09.2011]
[11] "Riak." Internet: http://basho.com/Riak.html, [30.09.2011]
[12] F. Chang, et al., "Bigtable: A Distributed Storage System for Structured
Data," ACM Transactions on Computer Systems, vol. 26, pp. 1-26, 2008.
[13] "HBase." Internet: http://hbase.apache.org, [30.09.2011]
[14] "Hypertable." Internet: http://hypertable.org, [30.09.2011]
[15] "Cassandra." Internet: http://cassandra.apache.org, [30.09.2011]
[16] "Neo4j." Internet: http://neo4j.org, [30.09.2011]
[17] "GraphDB." Internet: http://www.sones.com, [30.09.2011]
[18] "Sesame." Internet: http://www.openrdf.org, [30.09.2011]
[19] "Bigdata." Internet: http://www.systap.com/bigdata.htm, [30.09.2011]
[20] "FlockDB." Internet: http://github.com/twitter/flockdb, [30.09.2011]
[21] E. Meijer and G. Bierman, "A Co-Relational Model of Data for Large Shared Data Banks," Communications of the ACM, vol. 54, pp. 49-58, March 2011.
[22] "UnQL." Internet: http://unqlspec.org, [30.09.2011]
[23] " Pig." Internet: http://pig.apache.org, [30.09.2011]
[24] " Hive." Internet: http://hive.apache.org, [30.09.2011]
[25] "SPARQL." Internet: http://www.w3.org/TR/rdf-sparql-query,
[30.09.2011]
[26] " Gremlin." Internet: http://github.com/tinkerpop/gremlin, [30.09.2011]
[27] D. Karger, et al., "Web caching with consistent hashing," in Proceedings
of the eighth international conference on World Wide Web, Elsevier
North-Holland, Inc. New York, 1999, pp. 11-16.
[28] E. A. Brewer, "Towards Robust Distributed Systems," in Proceedings of
the nineteenth annual ACM Symposium on Principles of Distributed
Computing (PODC), Portland, Oregon, 2000.
G. Linden. "Make Data Useful." Internet: http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazoncom, [30.09.2011]
[30] M. Mayer, "In Search of... A better, faster, stronger Web," presented at
the Velocity 09, San Jose, CA, 2009.