NoSQL Evaluation
A Use Case Oriented Survey
Robin Hecht
Chair of Applied Computer Science IV
University of Bayreuth
Bayreuth, Germany
robin.hecht@uni-bayreuth.de
Stefan Jablonski
Chair of Applied Computer Science IV
University of Bayreuth
Bayreuth, Germany
stefan.jablonski@uni-bayreuth.de
Abstract—Motivated by the requirements of Web 2.0 applications, a plethora of non-relational databases has emerged in recent years. Since it is very difficult to choose a suitable database for a specific use case, this paper evaluates the underlying techniques of NoSQL databases with respect to their applicability to certain requirements. These systems are compared by their data models, query possibilities, concurrency control, partitioning, and replication capabilities.
Keywords—NoSQL, Key Value Stores, Document Stores, Column Family Stores, Graph Databases, Evaluation, Use Cases
I. MOTIVATION
The current NoSQL trend is motivated by applications stemming mostly from the Web 2.0 domain. Some of these applications have storage requirements which exceed the capacities and possibilities of relational databases.
In the past, SQL databases were used for nearly every storage problem, even if the data model did not match the relational model well. The object-relational impedance mismatch is one example of using a data model in the wrong way, the transformation of graphs into tables another. This leads to increasing complexity through expensive mapping frameworks and complex algorithms. Even if a data model can easily be covered by the relational one, the huge feature set offered by SQL databases is unneeded overhead for simple tasks like logging. The strict relational schema can be a burden for web applications like blogs, which consist of many different kinds of attributes. Text, comments, pictures, videos, source code and other information have to be stored within multiple tables. Since such web applications are very agile, the underlying databases have to be flexible as well in order to support easy schema evolution. Adding or removing a feature from a blog is not possible without system downtime if a relational database is used.
The increasing amount of data on the web is a problem which has to be considered by successful web platforms like those of Facebook, Amazon and Google. Besides dealing with tera- and petabytes of data, massive read and write requests have to be answered without any noticeable latency. In order to deal with these requirements, these companies maintain clusters with thousands of commodity hardware machines. Due to their normalized data model and their full ACID support, relational databases are not suitable in this domain, because joins and locks influence performance in distributed systems negatively. In addition to high performance, high availability is a fundamental requirement of many companies. Amazon guarantees an availability of at least 99.9% per year for its services [1]. Therefore, databases must be easily replicable and have to provide an integrated failover mechanism to deal with node or datacenter failures. They must also be able to balance read requests across multiple slaves to cope with access peaks which can exceed the capacity of a single server. Since the replication techniques offered by relational databases are limited and these databases are typically designed for consistency rather than availability, these requirements can only be met with additional effort and high expertise [1].
Due to these requirements, many companies and organizations developed their own storage systems, which are now classified as NoSQL databases. Since every store is specialized in the specific needs of its principals, there is no panacea on the market covering all of the above mentioned requirements. Therefore, it is very difficult to select, out of this plethora of systems, the database which is most suitable for a certain use case.
Even though NoSQL databases have already been introduced and compared in the past [2] [3] [4] [5], no use case oriented survey is available. Since single features of certain databases change on a weekly basis, an evaluation of the individual features of certain stores is outdated the moment it is published. Therefore, it is necessary to consider the impact of the underlying techniques on specific use cases in order to provide a durable overview. This paper highlights the most important criteria for database selection, introduces the underlying techniques and compares a wider range of open source NoSQL databases, including graph databases.
In order to evaluate the underlying techniques of these systems, the most important features for solving the above mentioned requirements have to be classified. Since the relational data model is considered unsuitable for certain use cases, chapter two examines the structure and flexibility of the different data models offered by NoSQL systems. Afterwards, the query possibilities of these stores and their impact on system complexity are analyzed in chapter three. In order to deal with many parallel read and write requests, some stores loosen concurrency restrictions. The different strategies of these systems for handling concurrent requests are inspected in chapter four. Since huge amounts of data and high performance serving exceed the capacities of single machines, the partitioning methods of NoSQL databases are analyzed in chapter five. Replication techniques and their effects on availability and consistency are examined in chapter six.
II. DATA MODEL
NoSQL databases differ from relational databases mostly in their data model. In this evaluation, these systems are classified into four groups.
A. Key Value Stores
Key value stores are similar to maps or dictionaries where data is addressed by a unique key. Since values are uninterpreted byte arrays, which are completely opaque to the system, keys are the only way to retrieve stored data. Values are isolated and independent of each other, so relationships must be handled in the application logic. Due to this very simple data structure, key value stores are completely schema free. New values of any kind can be added at runtime without conflicting with other stored data and without affecting system availability. Grouping key value pairs into collections is the only offered possibility to add some kind of structure to the data model.
Key value stores are useful for simple operations which are based on key attributes only. In order to speed up a user specifically rendered webpage, parts of this page can be calculated in advance and served quickly out of the store by user ID when needed. Since most key value stores hold their dataset in memory, they are oftentimes used for caching the results of more time-intensive SQL queries. The key value stores considered in this paper are Project Voldemort [6], Redis [7] and Membase [8].
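A minimal sketch of the caching pattern described above: pre-rendered page fragments are stored under the user ID and served from memory instead of re-running expensive SQL queries. It assumes a local Redis instance and the redis-py client; all key names and functions are illustrative, not taken from this paper.

import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def render_profile_fragment(user_id):
    # Placeholder for an expensive SQL query plus template rendering.
    return json.dumps({"user": user_id, "html": "<div>profile...</div>"})

def get_profile_fragment(user_id):
    key = "fragment:profile:%s" % user_id
    cached = cache.get(key)           # values are opaque byte arrays
    if cached is not None:
        return cached.decode("utf-8")
    fragment = render_profile_fragment(user_id)
    cache.set(key, fragment, ex=300)  # expire after five minutes
    return fragment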
B. Document Stores
Document stores encapsulate key value pairs in JSON or JSON-like documents. Within a document, keys have to be unique. Every document contains a special ID key, which is also unique within a collection of documents and therefore identifies a document explicitly. In contrast to key value stores, values are not opaque to the system and can be queried as well. Therefore, complex data structures like nested objects can be handled more conveniently. Storing data in interpretable JSON documents has the additional advantage of supporting data types, which makes document stores very developer-friendly. Similar to key value stores, document stores do not have any schema restrictions. Storing new documents containing any kind of attributes can be done as easily as adding new attributes to existing documents at runtime.
Document stores offer multi attribute lookups on records which may have completely different kinds of key value pairs. Therefore, these systems are very convenient for data integration and schema migration tasks. The most popular use cases are real time analytics, logging and the storage layer of small and flexible websites like blogs. The most prominent document stores are CouchDB [9], MongoDB [10] and Riak [11]. Riak offers links, which can be used to model relationships between documents.
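A minimal sketch of the schema flexibility described above: documents in the same collection may carry completely different attributes, and new fields can be added at runtime without a schema change. It assumes a local MongoDB instance and the pymongo driver; collection and field names are illustrative.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["blog"]

# A text post and a video post share a collection but not a schema.
db.posts.insert_one({"_id": 1, "type": "text", "title": "Hello", "body": "First post"})
db.posts.insert_one({"_id": 2, "type": "video", "title": "Demo",
                     "url": "http://example.org/v.mp4",
                     "comments": [{"author": "alice", "text": "nice"}]})

# Nested attributes are visible to the system and can be queried directly.
for post in db.posts.find({"comments.author": "alice"}):
    print(post["title"])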
C. Column Family Stores
Column family stores are also known as column oriented stores, extensible record stores and wide columnar stores. All of these stores are inspired by Google's Bigtable [12], which is a "distributed storage system for managing structured data that is designed to scale to a very large size" [12]. Bigtable is used in many Google projects whose requirements range from high throughput to latency-sensitive data serving. The data model is described as a "sparse, distributed, persistent multidimensional sorted map" [12]. In this map, an arbitrary number of key value pairs can be stored within rows. Since values cannot be interpreted by the system, relationships between datasets and any data types other than strings are not supported natively. Similar to key value stores, these additional features have to be implemented in the application logic. Multiple versions of a value are stored in chronological order, on the one hand to support versioning and on the other hand to achieve better performance and consistency (chapter four). Columns can be grouped into column families, which is especially important for data organization and partitioning (chapter five). Columns and rows can be added very flexibly at runtime, but column families oftentimes have to be predefined, which leads to less flexibility than key value stores and document stores offer. HBase [13] and Hypertable [14] are open source implementations of Bigtable, whereas Cassandra [15] differs from the data model described above, since another dimension called super columns is added. These super columns can contain multiple columns and can be stored within column families. Therefore, Cassandra is better suited to handling complex and expressive data structures.
Due to their tabular format, column family stores have a graphical representation similar to that of relational databases. The main difference lies in their handling of null values. Considering a use case with many different kinds of attributes, a relational database would store a null value in each column for which a dataset has no value. In contrast, column family stores only store a key value pair in a row if a dataset needs it. This is what Google calls "sparse", and it makes column family stores very efficient in domains with huge amounts of data with varying numbers of attributes. The convenient storage of heterogeneous datasets is also handled very well by document stores. By supporting nested data structures, links and data types, JSON documents can be more expressive than the data model of column families. Additionally, document stores are very easy to maintain and are therefore suitable for flexible web applications. The data model of column family stores is more suitable for applications dealing with huge amounts of data stored on very large clusters, because the data model can be partitioned very efficiently.
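A minimal sketch of the "sparse" wide-column layout described above: each row stores only the columns it actually has, grouped by column family, and each value keeps a version timestamp. This is a conceptual in-memory model, not a client for any particular store; all names are illustrative.

import time

# table[row_key][column_family][column] -> list of (timestamp, value), newest first
table = {
    "user:1": {
        "profile": {"name": [(time.time(), "Alice")], "city": [(time.time(), "Bayreuth")]},
        "media":   {"avatar": [(time.time(), b"...png bytes...")]},
    },
    # This row simply omits the columns it has no values for; no NULLs are stored.
    "user:2": {
        "profile": {"name": [(time.time(), "Bob")]},
    },
}

def put(row, family, column, value):
    versions = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
    versions.insert(0, (time.time(), value))   # keep multiple chronological versions

def get_latest(row, family, column):
    versions = table.get(row, {}).get(family, {}).get(column, [])
    return versions[0][1] if versions else None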
Key value stores, document stores and column family stores have in common that they store denormalized data in order to gain advantages in distribution. Besides a few exceptions like nested documents and the link feature of Riak, relationships must be handled completely in the application logic. The famous friend-of-a-friend problem and similar recommendation queries would lead to many queries, performance penalties and complex code in the application layer if one of these databases were used for such use cases.
D. Graph Databases
In contrast to relational databases and the previously introduced key oriented NoSQL databases, graph databases are specialized in the efficient management of heavily linked data. Therefore, applications based on data with many relationships are better suited to graph databases, since cost intensive operations like recursive joins can be replaced by efficient traversals.
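A minimal sketch of the traversal idea described above: with the graph stored as adjacency lists, a friend-of-a-friend query becomes a two-hop breadth-first walk instead of a recursive self-join. The data is purely illustrative and held in memory.

from collections import deque

friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "erin"],
    "dave":  ["bob"],
    "erin":  ["carol"],
}

def friends_of_friends(start, max_hops=2):
    seen, reached = {start}, set()
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor in friends.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.add(neighbor)
                queue.append((neighbor, hops + 1))
    return reached - set(friends[start])   # exclude direct friends

print(friends_of_friends("alice"))   # {'dave', 'erin'}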
Neo4j [16] and GraphDB [17] are based on directed, multi relational property graphs. Nodes and edges consist of objects with embedded key value pairs. The range of keys and values can be defined in a schema, whereby more complex constraints can be expressed easily. Therefore, it is possible to define that a specific edge is only applicable between certain types of nodes.
Property graphs are distinct from resource description framework (RDF) stores like Sesame [18] and Bigdata [19], which are specialized in querying and analyzing subject-predicate-object statements. Since a whole set of triples can be represented as a directed multi relational graph, RDF frameworks are also considered a special form of graph database in this paper. In contrast to property graphs, RDF graphs do not offer the possibility of adding additional key value pairs to edges and nodes. On the other hand, by use of RDF Schema and the Web Ontology Language it is possible to define a more complex and more expressive schema than property graph databases do.
Twitter stores many relationships between people in order to provide its tweet following service. These one-way relationships are handled within its own graph database FlockDB [20], which is optimized for very large adjacency lists and for fast reads and writes.
Use cases for graph databases are location based services, knowledge representation and the path finding problems arising in navigation systems, recommendation systems and all other use cases which involve complex relationships. Property graph databases are more suitable for large relationship structures spanning many nodes, whereas RDF is used for certain details in a graph. FlockDB is suitable for handling simple 1-hop-neighbor relationships with huge scaling requirements.
III. QUERY POSSIBILITIES
Since data models are tightly coupled with query possibilities, analyzing the queries that should be supported by the database is a very important step in finding a suitable data model. NoSQL databases do not only differ in their provided data model, they also differ in the richness of their offered query functionality. The current state of the NoSQL landscape can be compared to the time before Codd's introduction of SQL. Similar to that time, a lot of technically heterogeneous databases have arisen in the last years which differ in their offered data models, query languages and APIs. Therefore, some research aims to achieve a wider adoption of NoSQL systems by developing standardized query languages [21] [22] for some of these stores. Up to now, developers still have to cope with the specific characteristics of each NoSQL store.
Due to their simple data model, the APIs of key value stores provide key based put, get and delete operations only. Any query language would be unnecessary overhead for these stores. If additional query functionality is required, it has to be implemented on the application layer, which can quickly lead to much more system complexity and performance penalties. Therefore, key value stores should not be used if more complex queries or queries on values are required. REST interfaces are very useful in the domain of web applications: heterogeneous clients can directly interact with the store in a uniform way, while requests can be load balanced and results can be cached by proxies. Membase is the only key value store which offers a REST API natively.
In contrast to key value stores, document stores offer much richer APIs. Range queries on values, secondary indexes, queries on nested documents and operators like "and", "or" and "between" are features which can be used conveniently. Riak and MongoDB queries can be extended with regular expressions. While MongoDB supports additional operations like count and distinct, Riak offers functionality to traverse links between documents easily. REST interfaces are supported by document stores as well. In addition to these powerful interfaces, a query language increasing user-friendliness by offering an additional abstraction layer would be useful for document stores. Since none of these stores offers a query language, the UnQL project [22] is working on a common query language offering a SQL-like syntax for querying JSON document stores.
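A minimal sketch of the richer query features described above, using the pymongo driver against an assumed local MongoDB instance: range queries, boolean operators and regular expressions on values, plus count and distinct. Collection and field names are illustrative.

from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017")["blog"]["posts"]

# Range query combined with an "or" on a secondary attribute.
recent_or_popular = posts.find({"$or": [{"year": {"$gte": 2011}},
                                        {"views": {"$gt": 1000}}]})

# Regular expression applied to a value.
howtos = posts.find({"title": {"$regex": "^How to"}})

# Aggregation-style helpers.
n_text_posts = posts.count_documents({"type": "text"})
authors = posts.distinct("author")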
Column family stores only provide range queries and some operators like "in", "and/or" and regular expressions, and only if they are applied to row keys or indexed values. Even though every column family store offers a SQL-like query language in order to provide more convenient user interaction, again only row keys and indexed values can be referenced in where-clauses. Since these languages are specialized in the specific features of their stores, there is no common query language for column family stores available yet.
As clusters of document and column family stores are able to store huge amounts of structured data, queries can become very inefficient if a single machine has to process all of the required data. Therefore, all document and column family stores provide MapReduce frameworks which enable parallelized calculations over very large datasets on multiple machines. However, MapReduce jobs are written at a very low abstraction level and lead to custom programs which are hard to maintain and reuse. Data analytics platforms like Pig [23] and Hive [24] try to bridge the gap between declarative query languages and procedurally programmed MapReduce jobs by offering high level query languages which can be compiled into sequences of MapReduce jobs.
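A minimal sketch of the MapReduce programming model mentioned above: a map function emits key/value pairs per record, pairs are grouped by key, and a reduce function folds each group. Real frameworks run these phases in parallel across a cluster; this single-process version only shows the shape of such a job, and all names are illustrative.

from collections import defaultdict

def map_phase(record):
    # Count page views per user from a web-server log line "user_id url".
    user_id, _url = record.split()
    yield user_id, 1

def reduce_phase(key, values):
    yield key, sum(values)

def run_job(records):
    groups = defaultdict(list)
    for record in records:               # map + shuffle
        for key, value in map_phase(record):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():   # reduce
        for out_key, out_value in reduce_phase(key, values):
            results[out_key] = out_value
    return results

print(run_job(["u1 /home", "u2 /blog", "u1 /blog"]))  # {'u1': 2, 'u2': 1}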
Graph databases can be queried in two different ways. Graph pattern matching strategies try to find the parts of the original graph which match a defined graph pattern. Graph traversal, on the other hand, starts from a chosen node and traverses the graph according to a description. Traversal strategies differ in whether they detect one matching node as fast as possible (depth-first) or find the shortest path (breadth-first). Most graph databases offer REST APIs, programming language specific interfaces and store specific query languages in order to access data using one of the two described strategies. In contrast to other NoSQL databases, there exist some query languages which are supported by more than one graph database. SPARQL [25] is a popular, declarative query language with a very simple syntax providing graph pattern matching. It is supported by most RDF stores (including Sesame and BigData) and by Neo4j. Gremlin [26] is an imperative programming language based on XPath that is used to perform graph traversals. Besides Neo4j and GraphDB, Sesame can also be queried with Gremlin after a graph transformation. FlockDB does not offer any query language, since only simple 1-hop-neighbor relationships are supported.
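A minimal sketch of the graph pattern matching strategy described above, expressed as a SPARQL query over a small in-memory RDF graph. The rdflib package is used here only as a convenient triple store; all URIs and data are illustrative.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.knows, EX.carol))
g.add((EX.alice, EX.livesIn, EX.bayreuth))

# Pattern: who is known by someone that Alice knows (friend of a friend)?
query = """
PREFIX ex: <http://example.org/>
SELECT ?foaf WHERE {
    ex:alice ex:knows ?friend .
    ?friend  ex:knows ?foaf .
}
"""
for row in g.query(query):
    print(row.foaf)   # http://example.org/carol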
NoSQL databases differ strongly in their offered query functionality. Besides considering the supported data model and how it influences queries on specific attributes, it is necessary to take a closer look at the offered interfaces in order to find a suitable database for a specific use case. If a simple, language agnostic API is required, REST interfaces can be a suitable solution, especially for web applications, whereas performance critical queries should be exchanged over language specific APIs, which are available for nearly every common programming language such as Java. Query languages offer a higher abstraction level and thereby reduce complexity. Therefore, their use is very helpful when more complicated queries have to be handled. If calculation intensive queries over large datasets are required, MapReduce frameworks should be used.
TABLE I. QUERY POSSIBILITIES

Database                 REST API   Java API   Query Language   MapReduce Support
Key Value Stores
  Voldemort                  -          +            -                  -
  Redis                      -          +            -                  -
  Membase                    +          +            -                  -
Document Stores
  Riak                       +          +            -                  +
  MongoDB                    +          +            -                  +
  CouchDB                    +          +            -                  +
Column Family Stores
  Cassandra                  -          +            +                  +
  HBase                      +          +            +                  +
  Hypertable                 -          +            +                  +
Graph Databases
  Sesame                     +          +            +                  -
  BigData                    +          +            +                  -
  Neo4J                      +          +            +                  -
  GraphDB                    +          +            +                  -
  FlockDB                    -          +            -                  -
IV. CONCURRENCY CONTROL
If many users access a data source in parallel, strategies for avoiding inconsistencies caused by conflicting operations are necessary. Traditional databases use pessimistic consistency strategies with exclusive access to a dataset. These strategies are suitable if the cost of locking is low and datasets are not blocked for a long time. Since locks are very expensive in database clusters which are distributed over large distances, and many web applications have to support very high read request rates, pessimistic consistency strategies can cause massive performance loss.
Multiversion concurrency control (MVCC) relaxes strict consistency in favor of performance. Concurrent access is not managed with locks but by keeping many unmodifiable, chronologically ordered versions. Since datasets are not reserved for exclusive access, read requests can be answered with the latest version of a value while a concurrent process performs write operations on the same dataset in parallel. In order to cope with two or more conflicting write operations, every process stores, in addition to the new value, a link to the version it read before. Therefore, algorithms on the database or client side have the possibility to resolve conflicting values with different strategies. HBase, Hypertable, Bigdata and GraphDB use the storage of different versions not only for conflict resolution but also to offer versioning.
Besides consistency cutbacks, MVCC also causes higher storage space requirements, because multiple versions of one value are stored in parallel. In addition to a garbage collector that deletes versions which are no longer used, conflict resolution algorithms are needed to deal with inconsistencies. Therefore, MVCC causes higher system complexity.
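A minimal sketch of the MVCC idea described above: every write produces a new immutable version that records which version it was derived from, so concurrent writes based on the same ancestor can be detected and handed to a conflict resolution strategy instead of blocking readers with locks. This is a purely illustrative in-memory model; all names are assumptions.

import itertools

_version_ids = itertools.count(1)
store = {}   # key -> list of versions, newest last

def read(key):
    versions = store.get(key, [])
    return versions[-1] if versions else None   # readers always see the latest version

def write(key, value, parent_version):
    versions = store.setdefault(key, [])
    latest = versions[-1]["version"] if versions else None
    if parent_version != latest:
        # Another writer committed since this writer read; signal a conflict
        # so a resolution strategy (e.g. last-write-wins or merge) can decide.
        raise RuntimeError("conflict: key %r was updated concurrently" % key)
    versions.append({"version": next(_version_ids), "parent": parent_version, "value": value})

write("cart:42", ["book"], parent_version=None)
current = read("cart:42")
write("cart:42", current["value"] + ["dvd"], parent_version=current["version"])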
TABLE II. CONCURRENCY CONTROL

Database                 Locks   Optimistic Locking   MVCC
Key Value Stores
  Voldemort                 -             +             -
  Redis                     -             +             -
  Membase                   -             +             -
Document Stores
  Riak                      -             -             +
  MongoDB                   -             -             -
  CouchDB                   -             -             +
Column Family Stores
  Cassandra                 -             -             -
  HBase                     +             -             -
  Hypertable                -             -             +
Graph Databases
  Sesame                    +             -             -
  BigData                   -             -             -
  Neo4J                     +             -             -
  GraphDB                   -             -             +
  FlockDB                   -             -             -
In order to support transactions without reserving multiple datasets for exclusive access, optimistic locking is provided by many stores. Before changed data is committed, each transaction checks whether another transaction has made any conflicting modifications to the same datasets. In case of a conflict, the transaction is rolled back. This concept works well when updates are executed rarely and the chance of conflicting transactions is low. In this case, checking and rolling back is cheaper than locking datasets for exclusive access.
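A minimal sketch of the optimistic locking scheme described above: each record carries a version counter, and a transaction only commits if the version it originally read is still current; otherwise it is rolled back and retried. This is a purely illustrative in-memory model; all names are assumptions.

records = {"account:7": {"version": 1, "balance": 100}}

def optimistic_update(key, update_fn, retries=3):
    for _ in range(retries):
        snapshot = dict(records[key])                    # read value and version
        new_value = update_fn(dict(snapshot))
        current = records[key]
        if current["version"] == snapshot["version"]:    # no conflicting commit occurred
            new_value["version"] = current["version"] + 1
            records[key] = new_value
            return new_value
        # Conflict detected: another transaction committed first, so retry.
    raise RuntimeError("too many conflicting transactions for %r" % key)

optimistic_update("account:7", lambda rec: {**rec, "balance": rec["balance"] - 20})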
Locking and transactions are supported by Sesame and Neo4j. HBase is the only key based NoSQL system which supports a limited variant of classic locks and transactions, since these techniques can only be applied to single rows. Applications which do not have to deal with many conflicting operations can choose optimistic locking in order to get consistent multi row update and read-update-delete operations. Optimistic locking is supported by Voldemort, Redis and Membase. Applications which need to respond to many parallel read and write requests and which only have to provide a certain level of consistency should use Riak, CouchDB, Hypertable or GraphDB. If last-update-wins strategies are sufficient, MongoDB, Cassandra and FlockDB can be used in order to handle huge request rates while avoiding MVCC overhead at the same time.
V. PARTITIONING
When huge amounts of data and very high read and write request rates exceed the capacity of one server, databases have to be partitioned across database clusters. Due to their normalized data model and their ACID guarantees, relational databases do not scale horizontally: doubling the number of relational database servers does not double the performance of the cluster. Due to that limitation, big Web 2.0 companies like Google, Facebook and Amazon developed their own so-called web-scale databases, which are designed to scale horizontally and therefore satisfy these companies' very high requirements on performance and capacity.
NoSQL databases differ in the way they distribute data across multiple machines. Since the data models of key value stores, document stores and column family stores are key oriented, the two common partitioning strategies are based on keys, too.
The first strategy distributes datasets by the range of their keys. A routing server splits the whole keyset into blocks and allocates these blocks to different nodes. Afterwards, one node is responsible for storage and request handling of its specific key ranges. In order to find a certain key, clients have to contact the routing server to get the partition table. This strategy handles range queries very efficiently, because neighboring keys are stored on the same server with high probability. Since the routing server is responsible for load balancing, key range allocation and partition block advices, the availability of the whole cluster depends on the failure proneness of that single server. Therefore, this server is oftentimes replicated to multiple machines. Higher availability and a much simpler cluster architecture can be achieved with the second distribution strategy, called consistent hashing [27]. In this shared nothing architecture, there exists no single point of failure. In contrast to range based partitioning, keys are distributed by using hash functions. Since every server is responsible for a certain hash region, the location of a certain key within the cluster can be calculated very quickly. Good hash functions distribute keys evenly, so an additional load balancer is not required. Consistent hashing also excels at dynamic cluster resizing: in contrast to other approaches, adding or removing nodes only affects a small subset of all machines in the cluster. This simple architecture leads to performance penalties on range queries caused by high network load, since neighboring keys are distributed randomly across the cluster.
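A minimal sketch of consistent hashing as described above: servers are placed on a hash ring, each key is stored on the first server clockwise from its hash, and adding or removing a server only remaps the keys in its ring segment. Virtual nodes, replication and rebalancing are omitted; all names are illustrative.

import bisect
import hashlib

def ring_position(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def node_for(self, key):
        position = ring_position(key)
        index = bisect.bisect(self.ring, (position, ""))  # first node clockwise
        return self.ring[index % len(self.ring)][1]

    def add_node(self, node):
        bisect.insort(self.ring, (ring_position(node), node))

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
ring.add_node("node-d")              # only keys in one ring segment move to node-d
print(ring.node_for("user:42"))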
All aforementioned key value stores and the document stores Riak and CouchDB are based on consistent hashing, whereas MongoDB documents are partitioned by the range of their ID. In contrast to key value and document stores, column family stores can also be partitioned vertically: columns of the same column family are stored on the same server in order to increase the performance of attribute range queries. Cassandra datasets are partitioned horizontally by consistent hashing, whereas the Bigtable clones HBase and Hypertable use range based partitioning. Since column family data models can be partitioned more efficiently, these databases are more suitable for huge datasets than document stores.
In contrast to key based NoSQL stores, where datasets can easily be partitioned, splitting a graph is not straightforward at all. Graph information is not obtained by simple key lookups but by analyzing the relationships between entities. On the one hand, nodes should be distributed evenly across many servers; on the other hand, heavily linked nodes should not be distributed over large distances, since traversals would cause a huge performance penalty due to heavy network load. Therefore, one has to trade off between these two limitations. Graph algorithms can help identify hotspots of strongly connected nodes in the graph schema, and these hotspots can then be stored on one machine. Since graphs can mutate rapidly, graph partitioning is not possible without domain specific knowledge and many complex algorithms. Due to these problems, Sesame, Neo4j and GraphDB do not offer any partitioning opportunities. In contrast, FlockDB is designed for horizontal scalability. Since FlockDB does not support multi-hop graph traversals, the higher network load is not a problem.
Since distributed systems increase system complexity massively, partitioning should be avoided if it is not absolutely necessary. Systems which mostly struggle with high read request rates can scale this workload more easily through replication.
TABLE III. PARTITIONING

Database                 Range based   Consistent Hashing
Key Value Stores
  Voldemort                    -                +
  Redis                        -                +
  Membase                      -                +
Document Stores
  Riak                         -                +
  MongoDB                      +                -
  CouchDB                      -                +
Column Family Stores
  Cassandra                    -                +
  HBase                        +                -
  Hypertable                   +                -
Graph Databases
  Sesame                       -                -
  BigData                      -                +
  Neo4J                        -                -
  GraphDB                      -                -
  FlockDB                      -                +
VI. REPLICATION AND CONSISTENCY
In addition to better read performance through load balancing, replication also brings better availability and durability, because failing nodes can be replaced by other servers. Since distributed databases should be able to cope with temporary node and network failures, Brewer observed [28] that only full availability or full consistency can be guaranteed at any one time in a distributed system. If all replicas of a master server were updated synchronously, the system would not be available until all slaves had committed a write operation. If messages got lost due to network problems, the system would not be available for a longer period of time. For platforms which rely on high availability, this solution is not suitable, because even a few milliseconds of latency can have a big influence on user behavior. According to Amazon, its sales dropped by 1% for a delay of 100 milliseconds during a test [29]. Google noticed 25% less traffic when 30 instead of 10 search results were presented on the start page, which increased page generation time from 400 to 900 milliseconds [30]. In order to increase availability, one has to accept consistency cutbacks. If all replicas are updated asynchronously, reads on replicas might be inconsistent for a short period of time. Such systems are classified by Brewer as BASE systems: they are basically available, have a soft state and they are eventually consistent. Tweets or status updates in social networks can be inconsistent, but the system must be highly available and highly performant at all times. In contrast, consistency is the most important feature in finance applications; therefore, BASE systems are not suitable for banking applications. NoSQL systems are not only full ACID or full BASE systems. Many databases offer a lot of configuration options to customize consistency and availability trade-offs to one's needs. These systems offer optimistic replication, since the level of consistency can be configured by determining the number of nodes which have to commit a write operation or which have to respond with the same value to a read request.
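A minimal sketch of the tunable quorum idea described above: with N replicas, a write must be acknowledged by W nodes and a read must collect R answers. Choosing R + W > N means every read quorum overlaps every write quorum and therefore sees the latest committed value; smaller values favor availability and latency over consistency. The configurations shown are Dynamo-style examples [1]; all names and numbers are illustrative.

def is_strongly_consistent(n_replicas, write_quorum, read_quorum):
    # Read and write sets are guaranteed to overlap only if R + W > N.
    return read_quorum + write_quorum > n_replicas

# Typical configurations for N = 3 replicas:
print(is_strongly_consistent(3, write_quorum=2, read_quorum=2))  # True: overlap guaranteed
print(is_strongly_consistent(3, write_quorum=1, read_quorum=1))  # False: eventually consistent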
Systems which are eventually consistent in order to increase availability are Redis, CouchDB and Neo4j. Considering that CouchDB and Neo4j also offer master-master replication, these systems are suitable for the offline support needed, for example, in mobile applications. Voldemort, Riak, MongoDB, Cassandra and FlockDB offer optimistic replication, wherefore they can be used in any context. Since Membase, HBase, Hypertable, Sesame and GraphDB do not use replication for load balancing, these stores offer full consistency. BigData is the only store which supports full consistency and replication natively.
TABLE IV. REPLICATION

Database                 Read from Replica   Write to Replica   Consistency
Key Value Stores
  Voldemort                      +                   -               +/-
  Redis                          +                   -                -
  Membase                        -                   -                +
Document Stores
  Riak                           +                   -               +/-
  MongoDB                        +                   -               +/-
  CouchDB                        +                   +                -
Column Family Stores
  Cassandra                      +                   -               +/-
  HBase                          -                   -                +
  Hypertable                     -                   -                +
Graph Databases
  Sesame                         -                   -                +
  BigData                        +                   -                +
  Neo4J                          +                   +                -
  GraphDB                        -                   -                +
  FlockDB                        +                   +               +/-
VII. CONCLUSION
“Use the right tool for the job” is the propagated ideology of the NoSQL community, because every NoSQL database is specialized in certain use cases. Since there is no evaluation available which answers the question “which tool is the right tool for the job?”, the advantages and disadvantages of these stores were compared in this paper.
First of all, developers have to evaluate their data in order to identify a suitable data model and to avoid unnecessary complexity due to transformation or mapping tasks. The queries which should be supported by the database have to be considered at the same time, because these requirements massively influence the design of the data model. Since no common query language is available, every store differs in its supported query feature set. Afterwards, developers have to trade off between high performance through partitioning and load balanced replica servers, high availability supported by asynchronous replication, and strict consistency. If partitioning is required, the selection of a partitioning strategy depends on the supported queries and the cluster complexity. Besides these different requirements, durability mechanisms, community support and useful features like versioning also influence the database selection. In general, key value stores should be used for very fast and simple operations, document stores offer a flexible data model with great query possibilities, column family stores are suitable for very large datasets which have to be scaled at large size, and graph databases should be used in domains where entities are as important as the relationships between them.
REFERENCES
[1] G. DeCandia, et al., "Dynamo: amazon's highly available key-value
store," in SOSP '07 Proceedings of twenty-first ACM SIGOPS, New
York, USA, 2007, pp. 205-220.
[2] K. Orend, "Analysis and Classification of NoSQL Databases and
Evaluation of their Ability to Replace an Object-relational Persistence
Layer," Master Thesis, Technical University of Munich, Munich, 2010.
[3] R. Cattell, "Scalable SQL and NoSQL Data Stores," ACM SIGMOD
Record, vol. 39, December 2010.
[4] C. Strauch, "NoSQL Databases," unpublished.
[5] M. Adam, "The NoSQL Ecosystem," in The Architecture of Open
Source Applications, A. Brown and G. Wilson, Eds., lulu.com, 2011, pp.
185-204.
[6] "Project Voldemort." Internet: http://project-voldemort.com,
[30.09.2011]
[7] "Redis." Internet: http://redis.io, [30.09.2011]
[8] "Membase." Internet: http://couchbase.org/membase, [30.09.2011]
[9] "CouchDB." Internet: http://couchdb.apache.org, [30.09.2011]
[10] "MongoDB." Internet: http://mongodb.org, [30.09.2011]
[11] "Riak." Internet: http://basho.com/Riak.html, [30.09.2011]
[12] F. Chang, et al., "Bigtable: A Distributed Storage System for Structured
Data," ACM Transactions on Computer Systems, vol. 26, pp. 1-26, 2008.
[13] "HBase." Internet: http://hbase.apache.org, [30.09.2011]
[14] "Hypertable." Internet: http://hypertable.org, [30.09.2011]
[15] "Cassandra." Internet: http://cassandra.apache.org, [30.09.2011]
[16] "Neo4j." Internet: http://neo4j.org, [30.09.2011]
[17] "GraphDB." Internet: http://www.sones.com, [30.09.2011]
[18] "Sesame." Internet: http://www.openrdf.org, [30.09.2011]
[19] "Bigdata." Internet: http://www.systap.com/bigdata.htm, [30.09.2011]
[20] "FlockDB." Internet: http://github.com/twitter/flockdb, [30.09.2011]
[21] E. Meijer and G. Bierman, "A Co-Relational Model of Data for Large
Shared Data Banks," Communications of the ACM, vol. 54, pp. 49-58,
March 2011.
[22] "UnQL." Internet: http://unqlspec.org, [30.09.2011]
[23] "Pig." Internet: http://pig.apache.org, [30.09.2011]
[24] "Hive." Internet: http://hive.apache.org, [30.09.2011]
[25] "SPARQL." Internet: http://www.w3.org/TR/rdf-sparql-query,
[30.09.2011]
[26] "Gremlin." Internet: http://github.com/tinkerpop/gremlin, [30.09.2011]
[27] D. Karger, et al., "Web caching with consistent hashing," in Proceedings
of the eighth international conference on World Wide Web, Elsevier
North-Holland, Inc. New York, 1999, pp. 11-16.
[28] E. A. Brewer, "Towards Robust Distributed Systems," in Proceedings of
the nineteenth annual ACM Symposium on Principles of Distributed
Computing (PODC), Portland, Oregon, 2000.
[29] G. Linden. "Make Data Useful." Internet: http://www.scribd.com/doc/-
4970486/Make-Data-Useful-by-Greg-Linden-Amazoncom, [30.09.2011]
[30] M. Mayer, "In Search of... A better, faster, stronger Web," presented at
the Velocity 09, San Jose, CA, 2009.