ABSTRACT: Modern Internet applications such as websites and mobile games produce a
large amount of activity data representing information associated with user
actions such as logins or online purchases. Cohort analysis, which originated
in social science, is a powerful data exploration technique for finding unusual
user behavior trends in large activity datasets using the concept of a cohort.
This paper presents the design and implementation of database support for
cohort analysis. We introduce an extended relational data model for
representing a collection of activity data as an activity relation, and define
a set of cohort operators on activity relations for composing cohort
queries. To evaluate a cohort query, we present three schemes: a SQL based
approach which translates a cohort query into a set of SQL statements for
execution, a materialized view approach which materializes birth activity
tuples to speed up SQL execution, and a new evaluation scheme designed
specifically for cohort query processing. We implement the first two
schemes on MySQL and MonetDB respectively and develop a prototype of our own
cohort query engine, COHANA, for the third scheme. An extensive experimental
evaluation shows that the proposed cohort query evaluation scheme is up to
three orders of magnitude faster than the two SQL-based schemes.
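As a minimal illustration of the cohort concept, the following sketch groups users into cohorts by the week of their birth activity and counts how many remain active in each later week. The (user, action, week) schema and the "launch" birth action are assumptions for illustration, not the paper's activity relation or operators:

```python
from collections import defaultdict

# Minimal sketch of cohort retention over activity tuples. The schema
# (user, action, week) and the "launch" birth action are illustrative
# assumptions, not the paper's activity relation or cohort operators.
def cohort_retention(activities, birth_action="launch"):
    # a user's birth week is the week of their first birth action
    birth_week = {}
    for user, action, week in sorted(activities, key=lambda t: t[2]):
        if action == birth_action and user not in birth_week:
            birth_week[user] = week
    # active[cohort_week][age] = users from that cohort active `age` weeks later
    active = defaultdict(lambda: defaultdict(set))
    for user, action, week in activities:
        if user in birth_week and week >= birth_week[user]:
            active[birth_week[user]][week - birth_week[user]].add(user)
    return {cohort: {age: len(users) for age, users in ages.items()}
            for cohort, ages in active.items()}
```

A dedicated engine such as COHANA would evaluate comparable aggregations directly over compressed activity relations rather than materializing per-user state as this sketch does.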
ABSTRACT: Due to the coarse granularity of data accesses and the heavy use of latches,
indices in the B-tree family are not efficient for in-memory databases,
especially in the context of today's multi-core architecture.
In this paper, we present PI, a Parallel in-memory skip list based Index that
lends itself naturally to the parallel and concurrent environment, particularly
with non-uniform memory access. In PI, incoming queries are collected, and
disjointly distributed among multiple threads for processing to avoid the use
of latches. For each query, PI traverses the index in a Breadth-First-Search
(BFS) manner to find the list node with the matching key, exploiting SIMD
processing to speed up the search process. In order for query processing to be
latch-free, PI employs a light-weight communication protocol that enables
threads to re-distribute the query workload among themselves such that each
list node that will be modified as a result of query processing will be
accessed by exactly one thread. We conducted extensive experiments, and the
results show that PI can be up to three times as fast as Masstree, a
state-of-the-art B-tree based index.
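PI's latch-free, batched design is well beyond a short sketch, but the skip list structure it builds on can be shown compactly. This is a plain sequential skip list, not PI itself: each node carries one forward pointer per level, and search descends from the top level, moving right while the next key is smaller than the target:

```python
import random

# Plain sequential skip list (a baseline sketch, not PI's parallel design).
class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)

    def _random_level(self):
        # each node is promoted to the next level with probability 1/2
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        # record, per level, the last node with a key smaller than `key`
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = Node(key, self._random_level())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def contains(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        return node is not None and node.key == key
```

PI departs from this baseline by batching queries, partitioning them disjointly among threads, and searching breadth-first with SIMD, so no latches are needed on the shared structure.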
ABSTRACT: Recent years have witnessed amazing outcomes from "Big Models" trained by
"Big Data". Most popular algorithms for model training are iterative. Due to
the surging volumes of data, we can usually afford to process only a fraction
of the training data in each iteration. Typically, the data are either
uniformly sampled or sequentially accessed.
In this paper, we study how the data access pattern can affect model
training. We propose an Active Sampler algorithm, where training data with more
"learning value" to the model are sampled more frequently. The goal is to focus
training effort on valuable instances near the classification boundaries,
rather than on evident cases, noisy data, or outliers. We show the correctness and
optimality of Active Sampler in theory, and then develop a light-weight
vectorized implementation. Active Sampler is orthogonal to most approaches
optimizing the efficiency of large-scale data analytics, and can be applied to
most analytics models trained by the stochastic gradient descent (SGD)
algorithm. Extensive experimental evaluations demonstrate that Active Sampler
speeds up the training of SVM, feature selection, and deep learning by
1.6-2.2x at comparable training quality.
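A toy version of the sampling idea can be sketched as follows, using per-example loss as a stand-in for the paper's "learning value". The 1-D logistic regression and the loss-proportional weights are illustrative assumptions; the paper's implementation is vectorized and far lighter-weight than recomputing every loss on each step:

```python
import math
import random

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

# Toy sketch of loss-proportional sampling (a stand-in for the paper's
# "learning value"): examples with higher current loss are drawn more often
# during SGD on a 1-D logistic regression.
def train(data, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # weight each example by its current loss (plus a small floor);
        # a practical implementation would maintain these incrementally
        weights = []
        for x, y in data:
            p = _sigmoid(w * x + b)
            weights.append(-math.log(p if y == 1 else 1.0 - p) + 1e-3)
        x, y = rng.choices(data, weights=weights)[0]
        p = _sigmoid(w * x + b)
        w -= lr * (p - y) * x
        b -= lr * (p - y)
    return w, b
```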
ABSTRACT: We study the query optimization problem in declarative crowdsourcing systems. Declarative crowdsourcing is designed to hide the complexities of crowdsourcing and relieve the user of the burden of dealing with the crowd. The user is only required to submit an SQL-like query, and the system takes the responsibility of compiling the query, generating the execution plan, and evaluating it in the crowdsourcing marketplace. A given query can have many alternative execution plans, and the difference in crowdsourcing cost between the best and the worst plans may be several orders of magnitude. Therefore, as in relational database systems, query optimization is important to crowdsourcing systems that provide declarative query interfaces. In this paper, we propose CrowdOp, a cost-based query optimization approach for declarative crowdsourcing systems. CrowdOp considers both cost and latency in its query optimization objectives and generates query plans that provide a good balance between the two. We develop efficient algorithms in CrowdOp for optimizing three types of queries: selection queries, join queries, and complex selection-join queries. We validate our approach via extensive experiments, both by simulation and with the real crowd on Amazon Mechanical Turk.
No preview · Article · Aug 2015 · IEEE Transactions on Knowledge and Data Engineering
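The cost/latency trade-off at the heart of such an optimizer can be illustrated with a deliberately simplified plan chooser. The plan encoding and cost model below are assumptions, not CrowdOp's actual enumeration or pricing:

```python
# Deliberately simplified cost/latency plan chooser (the plan encoding and
# cost model are illustrative assumptions, not CrowdOp's): among plans that
# meet the latency budget, pick the cheapest; if none do, fall back to the
# lowest-latency plan.
def pick_plan(plans, latency_budget):
    # plans: list of (name, crowd_cost_in_cents, latency_in_rounds)
    feasible = [p for p in plans if p[2] <= latency_budget]
    if feasible:
        return min(feasible, key=lambda p: p[1])
    return min(plans, key=lambda p: p[2])
```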
ABSTRACT: The Big Data problem is characterized by the so-called 3V features: volume (a huge amount of data), velocity (a high data ingestion rate), and variety (a mix of structured data, semi-structured data, and unstructured data). The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework and its open-source implementation Hadoop. Although Hadoop handles the data volume challenge successfully, it does not deal well with data variety, since its programming interfaces and associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system that tackles Big Data's data variety challenge. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions: an implementation of the data processing model best suited to the data type, plus auxiliary code that maps that processing model onto epiC's concurrent programming model. As in Hadoop, programs written in this way can be automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communication. We present the design and implementation of epiC's concurrent programming model, along with two customized data processing models built on top of it: an optimized MapReduce extension and a relational model. We show how users can leverage epiC to process heterogeneous data by linking different types of operators together. To improve the performance of complex analytic jobs, epiC supports a partition-based optimization technique in which data are streamed between operators to avoid high I/O overheads. Experiments demonstrate the effectiveness and efficiency of epiC.
No preview · Article · Jul 2015 · The VLDB Journal
ABSTRACT: Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. The mainstream solution to the problem is to learn a set of mapping functions that project data from different modalities into a common metric space in which conventional indexing schemes for high-dimensional space can be applied. Since the effectiveness of the mapping functions plays an essential role in improving search quality, in this paper, we exploit deep learning techniques to learn effective mapping functions. In particular, we first propose a general learning objective that effectively captures both intramodal and intermodal semantic relationships of data from heterogeneous sources. Given the general objective, we propose two learning algorithms to realize it: (1) an unsupervised approach that uses stacked auto-encoders and requires minimum prior knowledge on the training data and (2) a supervised approach using deep convolutional neural network and neural language model. Our training algorithms are memory efficient with respect to the data volume. Given a large training dataset, we split it into mini-batches and adjust the mapping functions continuously for each batch. Experimental results on three real datasets demonstrate that our proposed methods achieve significant improvement in search accuracy over the state-of-the-art solutions.
No preview · Article · Jul 2015 · The VLDB Journal
ABSTRACT: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some issues, such as fault tolerance and consistency, are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as the data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important memory management technologies and of key factors that need to be considered in order to achieve efficient in-memory data management and processing.
Full-text · Article · Jul 2015 · IEEE Transactions on Knowledge and Data Engineering
ABSTRACT: The increase in the capacity of main memory coupled with the decrease in cost has fueled research in and development of in-memory databases. In recent years, the emergence of new hardware has further given rise to new challenges which have attracted a lot of attention from the research community. In particular, it is widely accepted that hardware solutions can provide promising alternatives for realizing the full potential of in-memory systems. Here, we argue that naive adoption of hardware solutions does not guarantee superior performance over software solutions, and identify problems in such hardware solutions that limit their performance. We also highlight the primary challenges faced by in-memory databases, and summarize their potential solutions, from both software and hardware perspectives.
No preview · Article · Jun 2015 · ACM SIGMOD Record
ABSTRACT: With the proliferation of geo-positioning and geo-tagging techniques, spatio-textual objects that possess both a geographical location and a textual description are gaining in prevalence, and spatial keyword queries that exploit both location and textual description are gaining in prominence. However, the queries studied so far generally focus on finding individual objects that each satisfy a query rather than finding groups of objects where the objects in a group together satisfy a query. We define the problem of retrieving a group of spatio-textual objects such that the group's keywords cover the query's keywords and such that the objects are nearest to the query location and have the smallest inter-object distances. Specifically, we study three instantiations of this problem, all of which are NP-hard. We devise exact solutions as well as approximate solutions with provable approximation bounds to the problems. In addition, we solve the problems of retrieving top-k groups of three instantiations, and study a weighted version of the problem that incorporates object weights. We present empirical studies that offer insight into the efficiency of the solutions, as well as the accuracy of the approximate solutions.
No preview · Article · Jun 2015 · ACM Transactions on Database Systems
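One common greedy heuristic for this kind of covering problem can be shown as an illustrative sketch (it is not the paper's exact algorithm and carries no approximation guarantee here): repeatedly pick the object that covers the most uncovered query keywords, breaking ties by distance to the query location:

```python
import math

# Greedy covering heuristic (an illustrative sketch, not the paper's exact
# algorithm): take the object covering the most uncovered query keywords,
# breaking ties by distance to the query location, until all are covered.
def greedy_group(objects, q_loc, q_keywords):
    # objects: list of ((x, y), keyword_set)
    uncovered = set(q_keywords)
    group = []
    while uncovered:
        best = None
        for loc, kws in objects:
            gain = len(uncovered & kws)
            if gain == 0:
                continue
            rank = (-gain, math.dist(loc, q_loc))
            if best is None or rank < best[0]:
                best = (rank, (loc, kws))
        if best is None:
            return None  # the query keywords cannot be covered
        loc, kws = best[1]
        group.append((loc, kws))
        uncovered -= kws
    return group
```

A full solution would also penalize inter-object distances within the group, which this sketch ignores.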
ABSTRACT: By maintaining the data in main memory, in-memory databases dramatically
reduce the I/O cost of transaction processing. However, for recovery purposes,
those systems still need to flush the logs to disk, generating a significant
number of I/Os. A new type of logs, the command log, is being employed to
replace the traditional data log (e.g., ARIES log). A command log only tracks
the transactions being executed, thereby effectively reducing the size of the
log and improving the performance. Command logging on the other hand increases
the cost of recovery, because all the transactions in the log after the last
checkpoint must be completely redone when there is a failure. For distributed
database systems with many processing nodes, failures cannot be treated as
exceptional, and as such, the long recovery time incurred by command logging may
compromise the objective of providing efficient support for OLTP.
In this paper, we first extend command logging to a distributed system in
which all nodes can perform their recovery in parallel. We show that the
synchronisation cost caused by dependencies is the bottleneck for command
logging in a distributed system, and consequently propose an adaptive logging
approach
by combining data logging and command logging. The intuition is to use data
logging to break the dependency, while applying command logging for most
transactions to reduce I/O costs. The ratio of data logging to command logging
becomes an optimization that trades transaction-processing performance against
recovery time to suit different OLTP applications. Our experimental
study compares the performance of our proposed adaptive logging, ARIES style
data logging and command logging on top of H-Store. The results show that
adaptive logging can achieve a 10x boost for recovery and a transaction
throughput that is comparable to that of command logging.
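The intuition of breaking recovery dependencies with selective data logging can be sketched as follows. The transaction encoding and the rule of data-logging every remotely depended-on transaction are simplifying assumptions, not the paper's optimization:

```python
# Hedged sketch of the adaptive idea, under simplifying assumptions: every
# transaction is command-logged by default, but a transaction that some
# transaction on *another* node depends on is data-logged instead, so its
# after-image can be installed during recovery without waiting for a
# cross-node replay.
def choose_log_types(txns):
    # txns: {txn_id: (node, [ids of transactions it depends on])}
    log_type = {tid: "command" for tid in txns}
    for tid, (node, deps) in txns.items():
        for dep in deps:
            if txns[dep][0] != node:
                log_type[dep] = "data"  # break the cross-node sync point
    return log_type
```

The paper instead treats the data-versus-command split as a cost-based optimization, rather than applying a fixed rule like this one.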
ABSTRACT: Multicore CPUs and large memories are increasingly becoming the norm in
modern computer systems. However, current database management systems (DBMSs)
are generally ineffective in exploiting the parallelism of such systems. In
particular, contention can lead to a dramatic fall in performance. In this
paper, we propose a new concurrency control protocol called DGCC (Dependency
Graph based Concurrency Control) that separates concurrency control from
execution. DGCC builds dependency graphs for batched transactions before
executing them. Using these graphs, contentions within the same batch of
transactions are resolved before execution. As a result, the execution of the
transactions does not need to deal with contention while maintaining full
equivalence to that of serialized execution. This better exploits multicore
hardware and achieves a higher level of parallelism. To facilitate DGCC, we
also propose a system architecture that removes centralized control
components, yielding better scalability, and that supports a more efficient
recovery mechanism. Our extensive experimental study shows that DGCC achieves
up to four times the throughput of state-of-the-art concurrency control
protocols for high-contention workloads.
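The core idea of resolving contention before execution can be sketched with a conflict graph over a batch and a topological execution order. The conflict rule and scheduling below are a simplification of DGCC's actual graph construction:

```python
from collections import defaultdict, deque

# Sketch in the spirit of DGCC (a simplification of the real protocol):
# two transactions in a batch conflict if one writes a key the other
# touches; the earlier transaction must run first, so execution follows a
# topological order of the conflict graph and needs no runtime locking.
def schedule(batch):
    # batch: list of (txn_id, read_set, write_set) in arrival order
    edges = defaultdict(set)
    indeg = defaultdict(int)
    for i, (tid_i, r_i, w_i) in enumerate(batch):
        for tid_j, r_j, w_j in batch[i + 1:]:
            # earlier writes what later touches, or later overwrites a read
            if (w_i & (r_j | w_j)) or (r_i & w_j):
                if tid_j not in edges[tid_i]:
                    edges[tid_i].add(tid_j)
                    indeg[tid_j] += 1
    # Kahn's algorithm: emit transactions whose dependencies are satisfied
    ready = deque(tid for tid, _, _ in batch if indeg[tid] == 0)
    order = []
    while ready:
        tid = ready.popleft()
        order.append(tid)
        for nxt in edges[tid]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order
```

Independent transactions (those never reachable from one another in the graph) can be executed by different workers in parallel; the sketch returns only one valid serial order.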
ABSTRACT: Every few years a group of database researchers meets to discuss the state of database research,
its impact on practice, and important new directions. This report summarizes the discussion and
conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that
Big Data has now become a defining challenge of our time, and that the database research community
is uniquely positioned to address it, with enormous opportunities to make transformative impact. To
do so, the report recommends significantly more attention to five research areas: scalable big/fast data
infrastructures; coping with diversity in the data management landscape; end-to-end processing and
understanding of data; cloud services; and managing the diverse roles of people in the data life cycle.
The Beckman Report on Database Research.
Full-text · Article · Dec 2014 · ACM SIGMOD Record
ABSTRACT: Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
No preview · Article · Dec 2014 · Proceedings of the VLDB Endowment
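A deliberately simplified version of the reassignment step (ignoring the paper's computation and communication cost model) spreads the failed node's vertices over the least-loaded survivors so that recovery work proceeds in parallel:

```python
import heapq

# Toy sketch of reassigning a failed node's graph partition (the paper's
# cost-sensitive partitioner also weighs computation and communication
# costs): each lost vertex goes to the currently least-loaded survivor.
def reassign(lost_vertices, survivors, current_load):
    heap = [(current_load[n], n) for n in survivors]
    heapq.heapify(heap)
    assignment = {}
    for v in lost_vertices:
        load, node = heapq.heappop(heap)
        assignment[v] = node
        heapq.heappush(heap, (load + 1, node))
    return assignment
```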
ABSTRACT: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space, since edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
No preview · Article · Dec 2014 · IEEE Transactions on Knowledge and Data Engineering
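The metric-space pruning the B+-tree approach relies on can be shown in isolation (the tree and the greedy reference selection are omitted, and distances to the reference would normally be precomputed rather than cached on the fly as here): since edit distance is a metric, |ed(q, ref) - ed(s, ref)| > tau implies ed(q, s) > tau, so s can be pruned using only its distance to the reference string:

```python
import functools

# Triangle-inequality pruning with one reference string (the B+-tree and
# greedy reference selection are omitted from this sketch).
@functools.lru_cache(maxsize=None)
def edit_distance(a, b):
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(edit_distance(a[1:], b) + 1,          # delete from a
               edit_distance(a, b[1:]) + 1,          # insert into a
               edit_distance(a[1:], b[1:]) + (a[0] != b[0]))  # substitute

def range_query(strings, ref, query, tau):
    d_q = edit_distance(query, ref)
    results = []
    for s in strings:
        # in the real index, ed(s, ref) is precomputed and stored in the tree
        if abs(edit_distance(s, ref) - d_q) > tau:
            continue  # pruned without ever computing ed(query, s)
        if edit_distance(query, s) <= tau:
            results.append(s)
    return results
```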
ABSTRACT: Companies are increasingly moving their data processing to the cloud, for reasons of cost, scalability, and convenience, among others. However, hosting multiple applications and storage systems on the same cloud introduces resource sharing and heterogeneous data processing challenges due to the variety of resource usage patterns employed, the variety of data types stored, and the variety of query interfaces presented by those systems. Furthermore, real clouds are never perfectly symmetric - there often are differences between individual processors in their capabilities and connectivity. In this paper, we introduce a federation framework to manage such heterogeneous clouds. We then use this framework to discuss several challenges and their potential solutions.
Full-text · Article · Jul 2014 · IEEE Transactions on Knowledge and Data Engineering
ABSTRACT: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
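For context, the plain random-projection LSH baseline that DSH improves on can be sketched in a few lines (this is the data-oblivious baseline, not DSH's data-sensitive hashing): each hash bit is the sign of a dot product with a random hyperplane, so vectors at a small angle tend to land in the same bucket:

```python
import random

# Random-projection LSH baseline (data-oblivious; DSH instead learns the
# hash family from the data distribution).
def make_hash(dim, n_bits, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(vec):
        # one bit per hyperplane: which side of the plane the vector lies on
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                     for plane in planes)
    return h

def build_index(points, h):
    # bucket ids (point indices) by hash signature
    buckets = {}
    for i, vec in enumerate(points):
        buckets.setdefault(h(vec), []).append(i)
    return buckets
```

Because the hyperplanes ignore the data distribution, skewed data sets produce unbalanced buckets, which is exactly the degradation DSH targets.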