ABSTRACT: Modern Internet applications often produce a large volume of user activity records. Data analysts are interested in cohort analysis, i.e., finding unusual user behavioral trends, in these large tables of activity records. In a traditional database system, cohort analysis queries are both painful to specify and expensive to evaluate. We propose to extend database systems to support cohort analysis by extending SQL with three new operators. We devise three different evaluation schemes for cohort query processing: two adopt a non-intrusive approach, while the third employs a columnar-based evaluation scheme with optimizations specifically designed for cohort query processing. Our experimental results confirm the performance benefits of our proposed columnar database system compared against the two non-intrusive approaches that implement cohort queries on top of regular relational databases.
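The shape of a cohort computation can be sketched outside the database: users are grouped into cohorts by the month of a birth event (here, signup) and activity is tallied by cohort age. A minimal sketch in plain Python; the record layout, monthly granularity, and the `cohort_retention` helper are illustrative assumptions, not the paper's SQL operators.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity records: (user, signup_date, action_date, action)
records = [
    ("u1", date(2020, 1, 1), date(2020, 1, 1), "launch"),
    ("u1", date(2020, 1, 1), date(2020, 2, 3), "launch"),
    ("u2", date(2020, 1, 15), date(2020, 1, 20), "launch"),
    ("u3", date(2020, 2, 2), date(2020, 2, 10), "launch"),
]

def cohort_retention(records):
    """Group users into monthly signup cohorts and count, for each
    cohort, how many distinct users were active n months later."""
    table = defaultdict(set)  # (cohort_month, age_in_months) -> users
    for user, signup, acted, _ in records:
        cohort = (signup.year, signup.month)
        age = (acted.year - signup.year) * 12 + (acted.month - signup.month)
        table[(cohort, age)].add(user)
    return {k: len(v) for k, v in table.items()}
```

Expressing this in standard SQL requires a self-join between the birth events and the later activity, which is exactly the specification pain the paper targets.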
ABSTRACT: As the use of crowdsourcing increases, it is important to think about performance optimization. For this purpose, it is possible to think of each worker as an HPU (Human Processing Unit) and to draw inspiration from performance optimization on traditional computers or cloud nodes with CPUs. However, as we characterize HPUs in detail for this purpose, we find that there are important differences between CPUs and HPUs, leading to the need for completely new optimization algorithms. In this paper, we study the specific optimization problem of obtaining results fastest for a crowdsourced job with a fixed total budget. In crowdsourcing, jobs are usually broken down into sets of small tasks, which are assigned to workers one at a time. We consider three scenarios of increasing complexity: Identical Round Homogeneous tasks, Multiplex Round Homogeneous tasks, and Multiple Round Heterogeneous tasks. For each scenario, we analyze the stochastic behavior of the HPU clock-rate as a function of the remuneration offered, and then develop an optimal budget allocation strategy to minimize the latency of job completion. We validate our results through extensive simulations and experiments on Amazon Mechanical Turk.
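The flavor of the latency-versus-budget trade-off can be seen in a toy model. Here each task is assumed to be picked up after an exponentially distributed delay whose rate grows linearly with the pay offered; this is a simplifying assumption standing in for the paper's empirically characterized HPU clock-rate, and `best_split` is a hypothetical grid search over a two-round budget split, not the paper's optimal strategy.

```python
import math

def expected_makespan(n_tasks, pay_per_task, k=1.0):
    """Expected time for n parallel tasks to all finish, assuming each
    is picked up after an Exp(k * pay) delay (a modeling assumption).
    For n i.i.d. Exp(r) variables, E[max] = H_n / r."""
    rate = k * pay_per_task
    h_n = sum(1.0 / i for i in range(1, n_tasks + 1))
    return h_n / rate

def best_split(budget, n_round1, n_round2, steps=99):
    """Grid-search the budget split between two sequential rounds that
    minimizes total expected latency; returns (latency, round-1 budget)."""
    best = (float("inf"), None)
    for s in range(1, steps):
        b1 = budget * s / steps
        b2 = budget - b1
        t = (expected_makespan(n_round1, b1 / n_round1)
             + expected_makespan(n_round2, b2 / n_round2))
        best = min(best, (t, b1))
    return best
```

Under this symmetric model, two identical rounds are fastest with an even split, which matches the intuition that starving either round inflates its slowest-task tail.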
ABSTRACT: Accelerating the pace of materials discovery and development requires new approaches and means of collaborating and sharing information. To address this need, we are developing the Materials Commons, a collaboration platform and information repository for use by the structural materials community. The Materials Commons has been designed to be a continuous, seamless part of the scientific workflow process. Researchers upload the results of experiments and computations as they are performed, automatically where possible, along with the provenance information describing the experimental and computational processes. The Materials Commons website provides an easy-to-use interface for uploading and downloading data and data provenance, as well as for searching and sharing data. This paper provides an overview of the Materials Commons. Concepts are also outlined for integrating the Materials Commons with the broader Materials Information Infrastructure that is evolving to support the Materials Genome Initiative.
Article · Jul 2016 · JOM: the journal of the Minerals, Metals & Materials Society
ABSTRACT: Data variety, as one of the three Vs of Big Data, is manifested by a growing number of complex data types such as documents, sequences, trees, graphs and high-dimensional vectors. To perform similarity search on these data, existing works mainly choose to create customized indexes for different data types. Due to the diversity of customized indexes, it is hard to devise a general parallelization strategy to speed up the search. In this paper, we propose a generic inverted index on the GPU (called GENIE), which can support similarity search of multiple queries on various data types. GENIE can effectively support approximate nearest neighbor search under different similarity measures by exploiting Locality Sensitive Hashing schemes, as well as similarity search on original data such as short document data and relational data. Extensive experiments on different real-life datasets demonstrate the efficiency and effectiveness of our system.
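The count-based matching that an inverted index supports can be sketched on the CPU; the GPU parallelization and LSH transformations described above are beyond this fragment, and the object ids, tokens, and function names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(objects):
    """Map each token to the set of object ids containing it."""
    index = defaultdict(set)
    for oid, tokens in objects.items():
        for t in tokens:
            index[t].add(oid)
    return index

def match_count_search(index, query, k):
    """Score each candidate by the number of query tokens it shares
    (the kind of count-based matching GENIE accelerates on the GPU)
    and return the top-k object ids."""
    scores = defaultdict(int)
    for t in query:
        for oid in index.get(t, ()):
            scores[oid] += 1
    return sorted(scores, key=lambda o: -scores[o])[:k]

docs = {1: {"big", "data", "gpu"}, 2: {"gpu", "index"}, 3: {"tree", "index"}}
idx = build_inverted_index(docs)
```

Because scoring is one independent counter increment per (query token, posting) pair, the same structure maps naturally onto massively parallel hardware, which is the observation the paper builds on.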
ABSTRACT: Aims/hypothesis:
Diabetic peripheral neuropathy (DPN) and diabetic nephropathy (DN) are two common microvascular complications of type 1 and type 2 diabetes mellitus that are associated with a high degree of morbidity. In this study, using a variety of systems biology approaches, our aim was to identify common and distinct mechanisms underlying the pathogenesis of these two complications.
Methods: Our previously published transcriptomic datasets of peripheral nerve and kidney tissue, derived from murine models of type 1 diabetes (streptozotocin-injected mice) and type 2 diabetes (BKS-db/db mice) and their respective controls, were collected and processed using a unified analysis pipeline so that comparisons could be made. In addition to examining genes and pathways dysregulated in individual datasets, pairwise comparisons across diabetes type and tissue type were performed at both the gene and transcriptional network levels.
Results: Gene-level analysis identified exceptionally high levels of concordant gene expression in DN (94% of 2,433 genes), but not in DPN (54% of 1,558 genes), between type 1 and type 2 diabetes. These results suggest that common pathogenic mechanisms exist in DN across diabetes type, while in DPN the mechanisms are more distinct. When these dysregulated genes were examined at the transcriptional network level, we found that the Janus kinase (JAK)-signal transducer and activator of transcription (STAT) pathway was significantly dysregulated in both complications, irrespective of diabetes type.
Conclusions/interpretation: Using a systems biology approach, our findings suggest that common pathogenic mechanisms exist in DN across diabetes type, while in DPN the mechanisms are more distinct. We also found that JAK-STAT signalling is commonly dysregulated among all datasets. Further investigation is warranted to determine whether the same changes are observed in patients with diabetic complications.
ABSTRACT: Due to the coarse granularity of data accesses and the heavy use of latches,
indices in the B-tree family are not efficient for in-memory databases,
especially in the context of today's multi-core architecture.
In this paper, we present PI, a Parallel in-memory skip list based Index that
lends itself naturally to the parallel and concurrent environment, particularly
with non-uniform memory access. In PI, incoming queries are collected, and
disjointly distributed among multiple threads for processing to avoid the use
of latches. For each query, PI traverses the index in a Breadth-First-Search
(BFS) manner to find the list node with the matching key, exploiting SIMD
processing to speed up the search process. In order for query processing to be
latch-free, PI employs a light-weight communication protocol that enables
threads to re-distribute the query workload among themselves such that each
list node that will be modified as a result of query processing will be
accessed by exactly one thread. We conducted extensive experiments, and the
results show that PI can be up to three times as fast as the Masstree, a
state-of-the-art B-tree based index.
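The underlying data structure is easy to sketch sequentially; PI's actual contributions (latch-free query redistribution across threads, BFS traversal, SIMD comparisons) are omitted here. A minimal single-threaded skip list in Python, for reference:

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level  # forward pointers, one per level

class SkipList:
    """Minimal sequential skip list; probabilistic tower heights give
    expected O(log n) search without rebalancing."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL  # rightmost node < key per level
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = Node(key, self._random_level())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def search(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        return node is not None and node.key == key
```

Because an insert only rewires pointers at the node's predecessors, the structure is more amenable to partitioned, latch-free processing than a B-tree, which is the property PI exploits.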
ABSTRACT: Recent years have witnessed amazing outcomes from "Big Models" trained by
"Big Data". Most popular algorithms for model training are iterative. Due to
the surging volumes of data, we can usually afford to process only a fraction
of the training data in each iteration. Typically, the data are either
uniformly sampled or sequentially accessed.
In this paper, we study how the data access pattern can affect model
training. We propose an Active Sampler algorithm, where training data with more
"learning value" to the model are sampled more frequently. The goal is to focus
training effort on valuable instances near the classification boundaries,
rather than evident cases, noisy data or outliers. We show the correctness and
optimality of Active Sampler in theory, and then develop a light-weight
vectorized implementation. Active Sampler is orthogonal to most approaches
optimizing the efficiency of large-scale data analytics, and can be applied to
most analytics models trained by the stochastic gradient descent (SGD) algorithm.
Extensive experimental evaluations demonstrate that Active Sampler speeds up the
training of SVM, feature selection and deep learning models by 1.6-2.2x at
comparable training quality.
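The core sampling idea can be sketched in a few lines: draw training examples with probability proportional to a per-example "learning value", here crudely approximated by the current loss. The function name and the loss-proportional weighting are illustrative simplifications of the paper's criterion, not its exact scheme.

```python
import random

def active_sample(losses, batch_size, rng=random):
    """Sample example indices with probability proportional to each
    example's current loss, so high-'learning value' points near the
    decision boundary are visited more often than evident cases."""
    total = sum(losses)
    weights = [l / total for l in losses]
    return rng.choices(range(len(losses)), weights=weights, k=batch_size)
```

In an SGD loop one would periodically refresh `losses` from the current model; a practical version would also clip or smooth the weights so that noisy outliers with huge losses do not dominate, which the abstract flags as a failure mode to avoid.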
ABSTRACT: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: Target Node Type and Distinguishability. Target Node Type represents the type of node a query result intends to match, and Distinguishability is used to measure the importance of the query keywords. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it can work with any lowest common ancestor-based matching semantics (for XML data without ID references) or minimal Steiner tree-based matching semantics (for XML data with ID references) that returns tree structures as results, and is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.
ABSTRACT: Users make choices among multi-attribute objects in a data set in a variety of domains, including used car purchase, job search and hotel room booking. Individual users sometimes have strong preferences between objects, but these preferences may not be universally shared by all users. If we can cast these preferences as derived from a quantitative user-specific preference function, then we can predict user preferences by learning their preference function, even though the preference function itself is not directly observable and may be hard to express. In this paper we study the problem of preference learning with pairwise comparisons on a set of entities with multiple attributes. We formalize the problem into two subproblems, namely preference estimation and comparison selection. We propose an innovative approach to estimate the preference, and introduce a binary search strategy to adaptively select the comparisons. We introduce the concept of an orthogonal query to support this adaptive selection, as well as a novel S-tree index to enable efficient evaluation of orthogonal queries. We integrate these components into a system for inferring user preference with adaptive pairwise comparisons. Our experiments and user study demonstrate that our adaptive system significantly outperforms the naïve random selection system on both real and synthetic data, with either simulated or real user feedback. We also show that our preference learning approach is much more effective than existing approaches, and that our S-tree can be constructed efficiently and supports orthogonal queries at interactive speeds.
Article · Jul 2015 · Proceedings of the VLDB Endowment
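The binary-search flavor of adaptive comparison selection described above can be illustrated in one dimension. This toy assumes a linear two-attribute utility u(x) = w*x0 + (1-w)*x1 and constructs each probe pair so that the user's answer bisects the feasible interval for w; the paper's multi-attribute estimation, orthogonal queries and S-tree are not modeled, and all names are hypothetical.

```python
def learn_weight(prefers, iters=20):
    """Binary-search the trade-off weight w of a two-attribute linear
    utility u(x) = w*x0 + (1-w)*x1 from pairwise answers. Each probe
    pair (a, b) is built so the user is indifferent exactly when
    w == m, so every answer halves the remaining interval."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m = (lo + hi) / 2
        a, b = (1 - m, 0.0), (0.0, m)  # u(a) = w*(1-m), u(b) = (1-w)*m
        if prefers(a, b):   # user picks a  =>  w > m
            lo = m
        else:
            hi = m
    return (lo + hi) / 2

# Simulated user with hidden weight 0.3:
true_w = 0.3
def sim(a, b):
    return (true_w * a[0] + (1 - true_w) * a[1]
            > true_w * b[0] + (1 - true_w) * b[1])
```

Twenty comparisons pin w down to within about 10^-6, which is the sample-efficiency argument for adaptive selection over random pairs.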
ABSTRACT: As Big Data inexorably draws attention from every segment of society, it has also suffered from many characterizations that are incorrect. This article explores a few of the more common myths about Big Data, and exposes the underlying truths.
ABSTRACT:
Every few years a group of database researchers meets to discuss the state of database research,
its impact on practice, and important new directions. This report summarizes the discussion and
conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that
Big Data has now become a defining challenge of our time, and that the database research community
is uniquely positioned to address it, with enormous opportunities to make transformative impact. To
do so, the report recommends significantly more attention to five research areas: scalable big/fast data
infrastructures; coping with diversity in the data management landscape; end-to-end processing and
understanding of data; cloud services; and managing the diverse roles of people in the data life cycle.
Full-text available · Article · Dec 2014 · ACM SIGMOD Record
ABSTRACT: Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
Article · Dec 2014 · Proceedings of the VLDB Endowment
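The idea of spreading a lost partition over the survivors can be sketched with a greedy cost-balancing assignment. The paper's partitioner weighs both computation and communication cost; this toy collapses them into a single per-vertex cost, and all names are illustrative.

```python
def partition_lost_vertices(lost, survivors, cost):
    """Greedily reassign vertices lost in a failure: each vertex goes
    to the surviving node with the lowest accumulated cost so far
    (largest vertices placed first, as in classic list scheduling)."""
    load = {s: 0 for s in survivors}
    assignment = {}
    for v in sorted(lost, key=cost, reverse=True):
        target = min(load, key=load.get)
        assignment[v] = target
        load[target] += cost(v)
    return assignment
```

Recovering in parallel on all survivors, rather than replaying the lost partition on a single standby node, is what yields the reported speedup over plain checkpoint-based recovery.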
ABSTRACT: Natural language has been the holy grail of query interface designers, but has generally been considered too hard to work with, except in limited specific circumstances. In this paper, we describe the architecture of an interactive natural language query interface for relational databases. Through a carefully limited interaction with the user, we are able to correctly interpret complex natural language queries, in a generic manner across a range of domains. By these means, a logically complex English language sentence is correctly translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and can be evaluated against an RDBMS. We have constructed a system, NaLIR (Natural Language Interface for Relational databases), embodying these ideas. Our experimental assessment, through user studies, demonstrates that NaLIR is good enough to be usable in practice: even naive users are able to specify quite complex ad-hoc queries.
Article · Sep 2014 · Proceedings of the VLDB Endowment