ABSTRACT: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
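To make the rule-execution idea concrete, here is a minimal sketch of detecting violations of a functional dependency (e.g., zipcode -> city), with blocking on the rule's left-hand side to avoid enumerating all pairs of tuples; the function names and the rule are illustrative, not BigDansing's actual API, and a real deployment would run the same logic as distributed transformations:

```python
from itertools import combinations

def block_by(tuples, key):
    """Group tuples by the FD's left-hand side so only tuples that can
    possibly violate the rule are compared (avoids all-pairs enumeration)."""
    buckets = {}
    for t in tuples:
        buckets.setdefault(t[key], []).append(t)
    return buckets

def detect_fd_violations(tuples):
    """Emit pairs that agree on zipcode but disagree on city."""
    violations = []
    for bucket in block_by(tuples, "zipcode").values():
        for t1, t2 in combinations(bucket, 2):
            if t1["city"] != t2["city"]:
                violations.append((t1, t2))
    return violations

data = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "NYC"},       # violates zipcode -> city
    {"zipcode": "60601", "city": "Chicago"},
]
print(detect_fd_violations(data))
```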
ABSTRACT: While syntactic transformations require the application of a formula on the input values, such as unit conversions or date format conversions, semantic transformations, such as zip code to city, require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a wide range of transformations. In this demonstration, we present the user interaction with DataXFormer and show scenarios of how it can be used to transform data and to explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
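A toy sketch of the contrast the abstract draws (the zip-to-city dictionary is a stand-in for the reference data DataXFormer discovers in Web tables and forms):

```python
ZIP_TO_CITY = {"10001": "New York"}           # stand-in reference data

def liters_to_gallons(liters):
    """Syntactic transformation: a formula on the input value suffices."""
    return liters * 0.264172

def zip_to_city(zipcode):
    """Semantic transformation: requires a look-up in reference data."""
    return ZIP_TO_CITY.get(zipcode)

print(liters_to_gallons(10.0))   # 2.64172
print(zip_to_city("10001"))      # New York
```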
ABSTRACT: Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real-world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the "ideal" representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user, and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
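A minimal sketch of the caching idea under simplifying assumptions (string records, an off-the-shelf similarity from difflib, a linear cache scan; the DedupCache name and the threshold are illustrative):

```python
import difflib

def similar(a, b, threshold=0.85):
    """Stand-in record matcher; real systems use domain-tuned similarity."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

class DedupCache:
    def __init__(self):
        self.cache = []                    # representations deduplicated so far

    def process(self, incoming):
        """Deduplicate newly arriving records jointly with the cache."""
        results = []
        for rec in incoming:
            match = next((c for c in self.cache if similar(rec, c)), None)
            if match is None:
                self.cache.append(rec)     # new entity: remember it
                results.append(rec)
            else:
                results.append(match)      # duplicate: reuse cached representation
        return results

dedup = DedupCache()
print(dedup.process(["John Smith", "Jon Smith", "Jane Doe"]))
# -> ['John Smith', 'John Smith', 'Jane Doe']
```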
International Conference on Data Engineering, Seoul, South Korea; 04/2015
ABSTRACT: Data transformation is a crucial step in data integration. While some transformations, such as liters to gallons, can be easily performed by applying a formula or a program on the input values, others, such as zip code to city, require sifting through a repository containing explicit value mappings. There are already powerful systems that provide formulae and algorithms for transformations. However, the automated identification of reference datasets to support value mapping remains largely unresolved. The Web is home to millions of tables with many containing explicit value mappings. This is in addition to value mappings hidden behind Web forms. In this paper, we present DataXFormer, a transformation engine that leverages Web tables and Web forms to perform transformation tasks. In particular, we describe an inductive, filter-refine approach for identifying explicit transformations in a corpus of Web tables and an approach to dynamically retrieve and wrap Web forms. Experiments show that the combination of both resource types covers more than 80% of transformation queries formulated by real-world users.
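A minimal sketch of the filter-refine idea, assuming the corpus is a list of two-column tables represented as dicts; the coverage threshold and the majority-vote refinement are simplifications of the paper's scoring:

```python
def filter_refine(examples, query_values, corpus):
    """examples: known (input, output) pairs; query_values: inputs to transform."""
    # Filter: keep tables that cover enough of the given example pairs.
    candidates = [t for t in corpus
                  if sum(t.get(x) == y for x, y in examples) >= len(examples) * 0.5]
    # Refine: answer each query value by majority vote across candidate tables.
    answers = {}
    for v in query_values:
        votes = [t[v] for t in candidates if v in t]
        answers[v] = max(set(votes), key=votes.count) if votes else None
    return answers

corpus = [
    {"10001": "New York", "60601": "Chicago"},          # a zip->city web table
    {"10001": "New York", "94105": "San Francisco"},
    {"10001": "Manhattan"},                              # a noisy table, filtered out
]
print(filter_refine([("10001", "New York")], ["60601", "94105"], corpus))
# -> {'60601': 'Chicago', '94105': 'San Francisco'}
```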
ABSTRACT: Advances in geo-sensing technology have led to an unprecedented spread of location-aware devices. In turn, this has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently consumed by spatial query processors. For a spatial query processor to properly choose among the various query processing strategies, the cost of the spatial operators has to be estimated. In this paper, we study the problem of estimating the cost of the spatial k-nearest-neighbor (k-NN, for short) operators, namely, k-NN-Select and k-NN-Join. Given a query that has a k-NN operator, the objective is to estimate the number of blocks that are going to be scanned during the processing of this operator. Estimating the cost of a k-NN operator is challenging for several reasons. For instance, the cost of a k-NN-Select operator is directly affected by the value of k, the location of the query focal point, and the distribution of the data. Hence, a cost model that captures these factors is relatively hard to realize. This paper introduces cost estimation techniques that maintain a compact set of catalog information that can be kept in main-memory to enable fast estimation via lookups. A detailed study of the performance and accuracy trade-off of each proposed technique is presented. Experimental results using real spatial datasets from OpenStreetMap demonstrate the robustness of the proposed estimation techniques.
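A minimal sketch of lookup-based estimation, assuming an offline-built catalog of average blocks scanned per grid cell for a few sampled values of k, with linear interpolation in between (the catalog layout is illustrative, not the paper's exact structure):

```python
import bisect

class KnnCostCatalog:
    def __init__(self, sample_ks):
        self.sample_ks = sorted(sample_ks)
        self.cells = {}                        # (cell_x, cell_y) -> {k: avg_blocks}

    def load(self, cell, costs):
        """Store offline-profiled costs for one grid cell."""
        self.cells[cell] = costs

    def estimate(self, cell, k):
        """Fast main-memory lookup with linear interpolation between sampled ks."""
        costs, ks = self.cells[cell], self.sample_ks
        if k <= ks[0]:
            return costs[ks[0]]
        if k >= ks[-1]:
            return costs[ks[-1]]
        i = bisect.bisect_left(ks, k)
        k0, k1 = ks[i - 1], ks[i]
        frac = (k - k0) / (k1 - k0)
        return costs[k0] + frac * (costs[k1] - costs[k0])

catalog = KnnCostCatalog(sample_ks=[1, 10, 100])
catalog.load((3, 7), {1: 2.0, 10: 5.5, 100: 21.0})   # offline profiling result
print(catalog.estimate((3, 7), 40))                   # query-time lookup: ~10.67
```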
ABSTRACT: The SQL group-by operator plays an important role in summarizing and
aggregating large datasets in a data analytics stack. While the standard group-by
operator, which is based on equality, is useful in several applications,
allowing similarity aware grouping provides a more realistic view on real-world
data that could lead to better insights. The Similarity SQL-based Group-By
operator (SGB, for short) extends the semantics of the standard SQL Group-by by
grouping data with similar but not necessarily equal values. While existing
similarity-based grouping operators efficiently materialize this approximate
semantics, they primarily focus on one-dimensional attributes. Correlated
multidimensional attributes, such as those in spatial data, are thus processed
independently, and hence groups in the multidimensional space are not detected
properly. To address this problem, we
introduce two new SGB operators for multidimensional data. The first operator
is the clique (or distance-to-all) SGB, where all the tuples in a group are
within some distance from each other. The second operator is the
distance-to-any SGB, where a tuple belongs to a group if the tuple is within
some distance from any other tuple in the group. We implement and test the new
SGB operators and their algorithms inside PostgreSQL. The overhead introduced
by these operators proves to be minimal and the execution times are comparable
to those of the standard Group-by. The experimental study, based on TPC-H and
social check-in data, demonstrates that the proposed algorithms can achieve up
to three orders of magnitude enhancement in performance over baseline methods
developed to solve the same problem.
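The distance-to-any semantics is exactly single-linkage connectivity, so a union-find over close pairs realizes it; the sketch below is a language-agnostic illustration in Python rather than the PostgreSQL implementation (the clique variant would additionally require every pair in a group to be within the distance):

```python
from math import dist

def distance_to_any_groups(points, eps):
    """Connected components over the 'within eps of some member' relation."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                union(i, j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

pts = [(0, 0), (1, 0), (2, 0), (10, 10)]
print(distance_to_any_groups(pts, eps=1.5))
# -> [[(0,0),(1,0),(2,0)], [(10,10)]]; note (0,0) and (2,0) are 2.0 apart,
#    so a clique (distance-to-all) SGB with the same eps would split them.
```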
ABSTRACT: Similarity group-by (SGB, for short) has been proposed as a relational
database operator to match the needs of emerging database applications. Many
SGB operators that extend SQL have been proposed in the literature, e.g.,
similarity operators in the one-dimensional space. These operators have various
semantics. Depending on how these operators are implemented, some of the
implementations may lead to different groupings of the data. Hence, if SQL code
is ported from one database system to another, it is not guaranteed that the
code will produce the same results. In this paper, we investigate the various
semantics for the relational similarity group-by operators in the
multi-dimensional space. We define the class of order-independent SGB operators
that produce the same results regardless of the order in which the input data
is presented to them. Using the notion of interval graphs borrowed from graph
theory, we prove that, for certain SGB operators, there exist order-independent
implementations. For each of these operators, we provide a sample algorithm
that is order-independent. Also, we prove that for other SGB operators, there
does not exist an order-independent implementation for them, and hence these
SGB operators are ill-defined and should not be adopted in extensions to SQL to
realize similarity group-by. We further introduce an SGB operator,
namely SGB-All, for grouping multi-dimensional data using similarity. SGB-All
forms groups such that a data item, say O, belongs to a group, say G, if and
only if O is within a user-defined threshold from all other data items in G. In
other words, each group in SGB-All forms a clique of nearby data items in the
multi-dimensional space. We prove that SGB-All is order-independent, i.e., for
each of its options there is at least one algorithm that is independent of the
presentation order of the input data.
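The order-(in)dependence issue is easy to expose empirically: the sketch below runs a naive, greedy distance-to-any grouping on every permutation of a tiny input and checks whether the partition is stable. It is not, because the greedy pass never merges groups; this illustrates, but does not prove, the property the paper formalizes:

```python
from itertools import permutations
from math import dist

def greedy_any_group(points, eps=1.5):
    """Order-SENSITIVE greedy grouping: join the first group with any
    member within eps, else start a new group (groups are never merged)."""
    groups = []
    for p in points:
        for g in groups:
            if any(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def canonical(grouping):
    """Order-insensitive view of a partition."""
    return frozenset(frozenset(g) for g in grouping)

def is_order_independent(group_fn, points):
    outcomes = {canonical(group_fn(list(p))) for p in permutations(points)}
    return len(outcomes) == 1

pts = [(0, 0), (1, 0), (2, 0)]
print(is_order_independent(greedy_any_group, pts))   # False: result depends on order
```

A connected-components implementation, such as the union-find sketch above, passes this check because merging groups makes the result independent of arrival order.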
ABSTRACT: In this paper, we address the problem of managing complex dependencies among database items that involve human actions, e.g., conducting a wet-lab experiment or taking manual measurements. If y = F(x), where F is a human action that cannot be coded inside the DBMS, then whenever x changes, y remains invalid until F is externally performed and its output result is reflected into the database. Many application domains, e.g., scientific applications in biology, chemistry, and physics, contain multiple such derivations and dependencies that involve human actions. In this paper, we propose HandsOn DB, a prototype database engine for managing dependencies that involve human actions while maintaining the consistency of the derived data. HandsOn DB includes the following features: (1) semantics and syntax for interfaces through which users can register human activities and express the dependencies among the data items on these activities, (2) mechanisms for invalidating and revalidating the derived data, and (3) new operator semantics that alert users when the returned query results contain potentially invalid data, and that enable evaluating queries on either valid data only, or on both valid and potentially invalid data. Performance results demonstrate the feasibility and practicality of realizing HandsOn DB.
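A minimal sketch of the invalidate/revalidate mechanism, with illustrative names (HandsOn DB's actual interfaces are SQL-level constructs, not this Python API):

```python
class HumanDependencyStore:
    def __init__(self):
        self.values = {}    # item -> value
        self.valid = {}     # item -> bool
        self.deps = {}      # source item -> derived items

    def register(self, derived, source):
        """Declare that `derived` depends on `source` via a human action F."""
        self.deps.setdefault(source, []).append(derived)

    def set_value(self, item, value):
        self.values[item] = value
        self.valid[item] = True
        for d in self.deps.get(item, []):
            self.valid[d] = False        # derived data invalid until F re-runs

    def record_human_action(self, derived, result):
        """Reflect the externally performed action's output and revalidate."""
        self.values[derived] = result
        self.valid[derived] = True

    def query(self, item, valid_only=True):
        if valid_only and not self.valid.get(item, False):
            return None                  # alert semantics: suppress invalid data
        return self.values.get(item)

db = HumanDependencyStore()
db.set_value("sample_42.weight", 3.1)
db.register("sample_42.density", "sample_42.weight")
db.record_human_action("sample_42.density", 1.9)
db.set_value("sample_42.weight", 3.4)   # invalidates the derived density
print(db.query("sample_42.density"))    # None until re-measured by a human
```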
IEEE Transactions on Knowledge and Data Engineering 09/2014; 26(9). DOI:10.1109/TKDE.2013.117
ABSTRACT: Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
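A minimal sketch of target-to-source explanation, assuming each report row carries the ids of the source rows in its lineage (a simplistic stand-in for the paper's scalable detection-and-propagation techniques):

```python
def explain(output_rows, rule):
    """Return source row ids implicated in rows that violate the rule."""
    implicated = set()
    for row in output_rows:
        if not rule(row):                    # detect at the target level
            implicated.update(row["lineage"])  # propagate to the source level
    return implicated

# Target-level quality rule on a report: total must equal price * quantity.
rule = lambda r: r["total"] == r["price"] * r["qty"]

report = [
    {"price": 10, "qty": 3, "total": 30, "lineage": {"orders:17", "items:4"}},
    {"price": 5,  "qty": 2, "total": 12, "lineage": {"orders:18", "items:9"}},  # violation
]
print(explain(report, rule))   # -> source rows to inspect: {'orders:18', 'items:9'}
```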
ABSTRACT: We describe ionomicshub (iHUB, for short), a large-scale cyber-infrastructure to support end-to-end research that aims to improve our understanding of how plants take up, transport, and store their nutrient and toxic elements.
Proceedings of the 23rd International Conference on World Wide Web, Companion Volume; 04/2014
ABSTRACT: The continuous and dynamic nature of data streams may lead the query execution plan (QEP) of a long-running continuous query to become suboptimal during execution, and hence the plan will need to be altered. The ability to perform an efficient and flawless transition to an equivalent, yet optimal, QEP is essential for a data stream query processor. Such a transition is challenging for plans with stateful binary operators, such as joins, where the states of the QEP have to be maintained during query transition without compromising the correctness of the query output. This paper presents Just-In-Time State Completion (JISC), a new technique for query plan migration. JISC does not cause any halt to the query execution, and thus allows the query to maintain steady output. JISC is applicable to pipelined as well as eddy-based query evaluation frameworks. Probabilistic analysis of the cost and experimental studies show that JISC increases the execution throughput during the plan migration stage by up to an order of magnitude compared to existing solutions.
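A minimal sketch of the just-in-time idea: after the plan switch, the new join's state starts empty, and a missing hash-table entry is completed on demand from tuples archived before the switch, so output never pauses (class and field names are illustrative, not JISC's actual design):

```python
class JitHashJoin:
    def __init__(self, archive_r):
        self.archive_r = archive_r   # R tuples that arrived before the plan switch
        self.state = {}              # lazily completed hash table on R.key

    def probe(self, s_tuple):
        """Probe with a new S tuple; complete missing state just in time."""
        key = s_tuple["key"]
        if key not in self.state:    # state entry absent after migration
            self.state[key] = [r for r in self.archive_r if r["key"] == key]
        return [(r, s_tuple) for r in self.state[key]]

join = JitHashJoin(archive_r=[{"key": 1, "v": "a"}, {"key": 2, "v": "b"}])
print(join.probe({"key": 1, "u": "x"}))   # output continues with no halt
```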
ABSTRACT: Deep web or hidden web refers to the hidden part of the Web (usually residing in structured databases) that remains unavailable for standard Web crawlers. Obtaining content of the deep web is challenging and has been acknowledged as a significant gap ...
Information Systems 09/2013; 38(6):885–886. DOI:10.1016/j.is.2013.03.001
ABSTRACT: We present NADEEF, an extensible, generic, and easy-to-deploy data cleaning system. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify data quality rules by writing code that implements predefined classes. These classes uniformly define what is wrong with the data and (possibly) how to fix it. We will demonstrate the following features provided by NADEEF. (1) Heterogeneity: The programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs, and ETL rules. (2) Interdependency: The core algorithms can interleave multiple types of rules to detect and repair data errors. (3) Deployment and extensibility: Users can easily customize NADEEF by defining new types of rules, or by extending the core. (4) Metadata management and data custodians: We show a live data quality dashboard to effectively involve users in the data cleaning process.
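A minimal sketch of what such a predefined rule class might look like; the Rule interface and method names here are illustrative, not NADEEF's actual API:

```python
from abc import ABC, abstractmethod

class Rule(ABC):
    """Illustrative predefined class: a rule says what is wrong and how to fix it."""
    @abstractmethod
    def detect(self, tuples):
        """Return violations found in the input tuples."""
    @abstractmethod
    def repair(self, violation):
        """Return candidate fixes for one violation."""

class ZipCityFD(Rule):
    """The FD zipcode -> city expressed as user code."""
    def detect(self, tuples):
        seen, violations = {}, []
        for t in tuples:
            if t["zipcode"] in seen and seen[t["zipcode"]]["city"] != t["city"]:
                violations.append((seen[t["zipcode"]], t))
            seen.setdefault(t["zipcode"], t)
        return violations

    def repair(self, violation):
        t1, t2 = violation
        return [("set", t2, "city", t1["city"])]    # one candidate fix

rows = [{"zipcode": "10001", "city": "New York"},
        {"zipcode": "10001", "city": "NYC"}]
rule = ZipCityFD()
print(rule.repair(rule.detect(rows)[0]))
```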
ABSTRACT: Entity disambiguation is an important step in many information retrieval applications. This paper proposes new research for entity disambiguation with a focus on name disambiguation in digital libraries. In particular, pairwise similarity is first learned for publications that share the same author name string (ANS), and then a novel Hierarchical Agglomerative Clustering approach with Adaptive Stopping Criterion (HACASC) is proposed to adaptively cluster a set of publications that share the same ANS into individual clusters of publications with different author identities. The HACASC approach utilizes a mixture of kernel ridge regressions to intelligently determine the threshold in clustering. This obtains more appropriate clustering granularity than a non-adaptive stopping criterion. We conduct a large-scale empirical study with a dataset of more than 2 million publication record pairs to demonstrate the advantage of the proposed HACASC approach.
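A minimal sketch of agglomerative clustering with an adaptive cut, where a simple largest-gap heuristic stands in for the paper's learned kernel-ridge-regression predictor:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def adaptive_hac(features):
    """Cluster publication feature vectors with a data-dependent cut."""
    Z = linkage(pdist(features), method="average")
    merge_d = Z[:, 2]                      # merge distances, non-decreasing
    jumps = np.diff(merge_d)
    # Stand-in for the learned stopping criterion: cut just after the
    # largest jump in merge distance, instead of a fixed global threshold.
    cut = merge_d[np.argmax(jumps)] + 1e-9 if len(jumps) else merge_d[-1]
    return fcluster(Z, t=cut, criterion="distance")

pubs = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
print(adaptive_hac(pubs))   # two author identities, e.g. [1 1 2 2]
```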
ABSTRACT: Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general-purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized, and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs, and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed in a way that allows cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
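A minimal sketch of a core-style loop that treats user rules as black boxes and interleaves their fixes until a fixpoint; this is a drastic simplification of NADEEF's holistic repairing algorithms, with illustrative names:

```python
class NotNullRule:
    """A black-box rule: detect nulls in an attribute, propose a default."""
    def __init__(self, attr, default):
        self.attr, self.default = attr, default
    def detect(self, tuples):
        return [t for t in tuples if t.get(self.attr) is None]
    def repair(self, t):
        return [(t, self.attr, self.default)]

def holistic_clean(tuples, rules, max_rounds=10):
    for _ in range(max_rounds):
        fixes = []
        for rule in rules:                       # any rule type, uniformly
            for violation in rule.detect(tuples):
                fixes.extend(rule.repair(violation))
        if not fixes:
            break                                # fixpoint: nothing left to repair
        for tup, attr, value in fixes:           # apply the candidate fixes
            tup[attr] = value
    return tuples

data = [{"city": None}, {"city": "Doha"}]
print(holistic_clean(data, [NotNullRule("city", "UNKNOWN")]))
# -> [{'city': 'UNKNOWN'}, {'city': 'Doha'}]
```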