Table 6 - uploaded by Xiao Chen
Source publication
Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown substantially over the last two decades, parallel techniques are called upon to satisfy ER's requirements for high performance and scalability. The development of parallel ER has reached a relative...
Contexts in source publication
Context 1
... handling: Redundancy handling refers to detailed measures that reduce the total run time, which include reducing the number of record pairs to be compared and reducing the communication effort between different processors. Table 6, Table 7, Table 8 and Table 9 provide an overview and classification of the 34 considered approaches based on the above-explained efficiency criteria. Since data partitioning and load balancing are tightly related, they appear in the tables as one column. ...
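To make the pair-reduction aspect of redundancy handling concrete, the following is a minimal, self-contained sketch (with made-up records and a simple prefix blocking key, neither taken from the survey) showing how blocking shrinks the set of record pairs that must be compared:

```python
# Illustrative only: compare the number of candidate pairs with and without blocking.
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "surname": "smith", "city": "berlin"},
    {"id": 2, "surname": "smith", "city": "berlin"},
    {"id": 3, "surname": "jones", "city": "leipzig"},
    {"id": 4, "surname": "jonas", "city": "leipzig"},
]

# Without blocking: every record is compared with every other record.
all_pairs = list(combinations(records, 2))           # n*(n-1)/2 pairs

# With blocking: only records sharing a blocking key are compared.
blocks = defaultdict(list)
for r in records:
    blocks[r["surname"][:2]].append(r)                # simple prefix blocking key
blocked_pairs = [p for block in blocks.values() for p in combinations(block, 2)]

print(len(all_pairs), len(blocked_pairs))             # 6 vs 2 for this toy data
```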
Context 2
... the four tables on efficiency (Table 6, Table 7, Table 8, and Table 9), measures for redundancy handling are listed in the last column. When developers design a new ER workflow, possible optimizations can be inspired by those measures, and we will classify them into four categories. ...
Citations
... ER schemes may be evaluated from multiple perspectives [8] [9]: (1) Effectiveness or performance of clustering (for example in terms of recall and precision); (2) Efficiency, or the number of queries required per sample to achieve this performance; (3) Operation and scalability, i.e., whether the scheme is adaptive or non-adaptive, whether it runs online or in batch, and whether it is parallelizable; and, (4) Genericity, or how and whether the scheme may be applied to different scenarios. For example, the Jaccard similarity function is popular when only dealing with textual data. ...
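As a concrete illustration of the Jaccard similarity mentioned above, here is a small sketch; the whitespace tokenization into word sets is an assumption made for the example:

```python
# Token-based Jaccard similarity: |intersection| / |union| of the two token sets.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(jaccard("entity resolution survey", "parallel entity resolution"))  # 0.5
```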
We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning. This is typically an expensive and time-consuming process and therefore, we seek ways to do so efficiently. The conventional approach involves comparing each sample with (the representative of) each class to find a match. In a setting with N equally likely classes, this involves N/2 pairwise comparisons (queries per sample) on average. We consider a k-ary query scheme with samples in a query that identifies (dis)similar items in the set while effectively exploiting the associated transitive relations. We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of . In addition, we present an adaptive greedy query scheme, which achieves an average rate of queries per sample with triplet queries. For the proposed algorithms, we investigate the query rate performance analytically and with simulations. Empirical studies suggest that each triplet query takes an expert at most 50% more time compared with a pairwise query, indicating the effectiveness of the proposed k-ary query schemes. We generalize the analyses to nonuniform class distributions when possible.
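The following toy sketch illustrates why a k-ary (here triplet) query can be cheaper than pairwise queries; the `triplet_query` simulation and the label dictionary are hypothetical and do not reproduce the paper's algorithms:

```python
# Toy illustration: one triplet query partitions three samples by class,
# and transitivity of the "same class" relation then comes for free.
def triplet_query(oracle_labels, i, j, k):
    """Simulated oracle: returns the grouping of three samples by hidden class."""
    groups = {}
    for idx in (i, j, k):
        groups.setdefault(oracle_labels[idx], []).append(idx)
    return list(groups.values())

labels = {0: "A", 1: "A", 2: "B"}          # hidden ground truth
print(triplet_query(labels, 0, 1, 2))      # [[0, 1], [2]] from a single query
# Pairwise, the same information needs up to 3 queries: (0,1), (0,2), (1,2).
```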
... Individual characteristics of Big Data have been the focus of previous research work in ER. For example, there is a continuous concern for improving the scalability of ER techniques over increasing Volumes of entities using massively parallel implementations [29]. Moreover, uncertain entity descriptions due to high Veracity have been resolved using approximate matching [50,69]. ...
... Approximate instance matching is surveyed in [50], link discovering algorithms in [127], and uncertain ER in [69]. Recent efforts to enhance scalability of ER methods by leveraging distribution and parallelization techniques are surveyed in [29], while overviews of blocking and filtering techniques are presented in [132,140]. In contrast, our goal is to present an in-depth survey on all tasks required to implement complex ER workflows, including Indexing, Matching and Clustering. ...
One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when we should satisfy more than one of the Big Data characteristics simultaneously (i.e., Volume and Velocity with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by database, semantic Web, and machine learning communities in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.
... In the Hadoop distributed environment, the preprocessors lack access to a large shared memory space, which makes the standard ER blocking approach impossible [11]. ...
This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm for Apache MapReduce running on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for processing and storing large amounts of data in a distributed computing environment. The research presented here focuses on the variations observed among the results of an efficient iterative transitive closure algorithm when run against different distributed environments. The results from these comparisons were validated against the benchmark results from OYSTER, an open source Entity Resolution system. The experimental results highlighted the inconsistencies that can occur when using the same codebase with different implementations of MapReduce.
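As a rough, single-machine illustration of the iterative pattern such transitive closure computations follow (one propagation pass standing in for one MapReduce round), consider the sketch below; it is not the paper's exact algorithm:

```python
# Repeatedly propagate the smallest known component id along edges until
# nothing changes; records linked directly or transitively end up together.
def transitive_closure(edges):
    comp = {}                                    # node -> smallest reachable id
    for a, b in edges:
        comp.setdefault(a, a)
        comp.setdefault(b, b)
    changed = True
    while changed:                               # one pass ~ one iteration/round
        changed = False
        for a, b in edges:
            low = min(comp[a], comp[b])
            for n in (a, b):
                if comp[n] > low:
                    comp[n] = low
                    changed = True
    return comp

print(transitive_closure([("r1", "r2"), ("r2", "r3"), ("r4", "r5")]))
# {'r1': 'r1', 'r2': 'r1', 'r3': 'r1', 'r4': 'r4', 'r5': 'r4'}
```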
... To the best of our knowledge, no work except [21] discusses task parallelism in ER. However, according to [4], "since each step in ER needs time to process large-scale data, task parallelism is suitable for ER to reduce its entire processing time and (improve) throughput". In addition to better streamlining ER processing, introducing task parallelism further enables incremental processing of fast-changing data (the "velocity" characteristic largely ignored in Big Data integration [8]) as well as feedback loops across processing stages as part of pay-as-you-go data integration approaches [14]. ...
... Data-parallelism for ER: Existing work on parallel ER is mostly based on data-parallel solutions, as surveyed in [4]. Most of the parallel ER solutions focus on the parallelization of the blocking and pairwise comparison steps (e.g., [6,10,11]). ...
... Task-parallelism for ER: According to [4] and to the best of our knowledge, the only work targeting task parallelism in ER is [21]. It actually mixes both data- and task-parallelism, as it divides the computation across pipeline stages and each pipeline is replicated multiple times to handle data in parallel. ...
Entity resolution (ER) refers to the problem of finding which virtual representations in one or more data sources refer to the same real-world entity. A central question in ER is how to find matching entity representations (so-called duplicates) efficiently and in a scalable way. One general technique to address these issues is to leverage parallelization. In particular, almost all work on parallel ER focuses on data parallelism. This paper focuses on task parallelism for ER. This type of parallelism supports incremental ER, which computes the solution incrementally by streaming results of intermediate ER stages as soon as they are computed. This makes it possible to obtain results in a more timely fashion and can also serve in a service-oriented setting with a limited time or monetary budget. In summary, this paper presents a framework for the task-parallelization of ER, supporting in particular ER of large amounts of semi-structured and heterogeneous data. We also discuss a possible implementation of our framework.
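A minimal sketch of the task-parallel (pipeline) idea is given below: each stage runs as its own task and streams results downstream as soon as they are produced. The stage functions (`clean`, `block`) and the thread/queue plumbing are illustrative assumptions, not the paper's framework:

```python
# Two ER stages wired as a streaming pipeline; a later stage starts consuming
# before the earlier stage has processed all of its input.
import threading, queue

def run_stage(fn, inq, outq):
    while True:
        item = inq.get()
        if item is None:                 # sentinel: propagate shutdown and stop
            outq.put(None)
            break
        outq.put(fn(item))

clean = lambda rec: {**rec, "name": rec["name"].strip().lower()}   # stage 1
block = lambda rec: (rec["name"][:2], rec)                          # stage 2

q_raw, q_clean, q_blocked = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=run_stage, args=(clean, q_raw, q_clean)),
    threading.Thread(target=run_stage, args=(block, q_clean, q_blocked)),
]
for t in threads:
    t.start()
for rec in [{"name": " Anna "}, {"name": "Anne"}, {"name": "Bob"}]:
    q_raw.put(rec)
q_raw.put(None)
for t in threads:
    t.join()
print([q_blocked.get() for _ in range(4)])   # three keyed records + sentinel
```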
... At first, the input data is preprocessed if necessary, which may include data cleaning, formatting, and standardization. Afterwards, blocking is performed to omit unnecessary comparisons, i.e., obvious non-matches based on predefined blocking keys [5]. Then candidate pairs are generated based on the blocking result. ...
Entity resolution identifies records that refer to the same real-world entity. For its classification step, supervised learning can be adopted, but this faces limitations in the availability of labeled training data. Under this situation, active learning has been proposed to gather labels while reducing the human labeling effort, by selecting the most informative data as candidates for labeling. Committee-based active learning is one of the most commonly used approaches; it chooses the data with the most disagreement among the voting results of the committee, considering this the most informative data. However, the current state-of-the-art committee-based active learning approaches for entity resolution have two main drawbacks: First, the selected initial training data is usually not balanced and informative enough. Second, the committee is formed with homogeneous classifiers by compromising their accuracy to achieve diversity of the committee, i.e., the classifiers are not trained with all available training data or the best parameter setting. In this paper, we propose our committee-based active learning approach HeALER, which overcomes both drawbacks by using more effective initial training data selection approaches and a more effective heterogeneous committee. We implemented HeALER and compared it with passive learning and other state-of-the-art approaches. The experimental results show that our approach outperforms other state-of-the-art committee-based active learning approaches.
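To illustrate the committee-disagreement idea in general terms (this is not HeALER itself; the committee votes and the vote-entropy criterion are assumptions for the example):

```python
# Pick the unlabeled candidate pair on which the committee disagrees the most,
# measured here by the entropy of the vote distribution.
from collections import Counter
from math import log2

def vote_entropy(votes):
    counts = Counter(votes)
    total = len(votes)
    return -sum(c / total * log2(c / total) for c in counts.values())

# Votes of a 4-member committee ("match" / "non-match") on unlabeled pairs.
committee_votes = {
    ("r1", "r2"): ["match", "match", "match", "match"],            # full agreement
    ("r3", "r4"): ["match", "non-match", "match", "non-match"],    # full disagreement
    ("r5", "r6"): ["match", "match", "match", "non-match"],
}
most_informative = max(committee_votes, key=lambda p: vote_entropy(committee_votes[p]))
print(most_informative)   # ('r3', 'r4') would be sent to the human labeler first
```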
... Individual characteristics of Big Data have been the focus of previous research work in ER. For example, there is a continuous concern for improving the scalability of ER techniques using, e.g., massively parallel implementations [24] or approximate matching of uncertain entity descriptions [47,69]. However, traditional deduplication techniques [30,58] have been mostly conceived for processing structured data of few entity types after being adequately pre-processed in a data warehouse, and have hence been able to discover blocking keys of entities and/or mapping rules between their types. ...
... Moreover, uncertain ER has been presented in [69], approximate instance matching has been surveyed in [47], and link discovering algorithms in [131]. Recent efforts to enhance the scalability of ER techniques by leveraging distribution and parallelization techniques have been surveyed in [24]. ...
One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.
... Entity resolution: Most of the ER research is pair-based and shares a common ER process, as surveyed in [Ch12] and [EPIV07]. In recent years, along with the increase of input data, ER solutions are also required to be scalable, which encourages the use of parallel computing for ER; Chen et al. give an overview and classification of the research on parallel ER [CSG18]. Word embedding for entity resolution: As aforementioned, recent research has considered applying word embedding for ER. ...
Recently, word embedding has become a beneficial technique for diverse natural language processing tasks, especially after the successful introduction of several popular neural word embedding models, such as word2vec, GloVe, and FastText. Entity resolution, i.e., the task of identifying digital records that refer to the same real-world entity, has also been shown to benefit from word embedding. However, the use of word embeddings does not lead to a one-size-fits-all solution, because it cannot provide an accurate result for values without semantic meaning, such as numerical values. In this paper, we propose to use the combination of general word embedding with traditional hand-picked similarity measures for solving ER tasks, which aims to select the most suitable similarity measure for each attribute based on its properties. We provide guidelines on how to choose suitable similarity measures for different types of attributes and evaluate our proposed hybrid method on both synthetic and real datasets. Experiments show that a hybrid method that correctly selects the required similarity measures can outperform methods that purely adopt traditional or word-embedding-based similarity measures.
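A small sketch of the hybrid per-attribute idea follows; the attribute names, the token-based stand-in for an embedding similarity, and the averaging of attribute scores are all assumptions made for illustration:

```python
# Choose a similarity measure per attribute: numeric distance for numbers,
# a token-based measure (standing in for an embedding similarity) for text.
def numeric_sim(a: float, b: float) -> float:
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1.0)

def token_sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

SIM_BY_ATTR = {"title": token_sim, "year": numeric_sim}   # hand-picked per attribute

def record_sim(r1: dict, r2: dict) -> float:
    scores = [SIM_BY_ATTR[a](r1[a], r2[a]) for a in SIM_BY_ATTR]
    return sum(scores) / len(scores)

print(record_sim({"title": "Parallel Entity Resolution", "year": 2018},
                 {"title": "parallel entity resolution survey", "year": 2017}))
```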
... To implement parallel ER, there have been two main research directions so far: the first uses parallel DBMSs; the other employs a distributed computation framework to help with the implementation [7]. The solution of using parallel DBMSs, proposed more than two decades ago, has some shortcomings for individuals or small and medium-sized enterprises with ER tasks. ...
... Research in parallel ER has been dominated by the use of the MapReduce programming model, e.g., [16,19]. We give an overview and classification of the current state of parallel ER in our survey paper [7]. ...
... Therefore, we can see that ER is a typical case that requires big data processing frameworks, compared to other applications that do not need quadratic time. Consequently, a large majority of research so far has explored using big data frameworks to support parallel ER [3]. In recent years, parallel ER using Apache Spark has received attention from the data management community. ...
During the last decade, several big data processing frameworks have emerged that enable users to analyze large-scale data with ease. With the help of those frameworks, it is easier for users to manage distributed programming, failures, and data partitioning issues. Entity Resolution is a typical application that requires big data processing frameworks, since its time complexity increases quadratically with the input data. In recent years, Apache Spark has become popular as a big data framework providing a flexible programming model that supports in-memory computation. Spark offers three APIs: RDDs, which give users core low-level data access, and the high-level DataFrame and Dataset APIs, which are part of the Spark SQL library and undergo query optimization. Given their different features, the choice of API can be expected to influence the resulting performance of applications. However, few studies offer experimental measurements characterizing the effect of such distinctions. In this paper, we evaluate the performance impact of this choice for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.
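As a rough sketch of how the two Spark APIs differ for block-wise candidate pair generation (the schema, blocking key, and column names are assumptions, and this is not the paper's benchmark code):

```python
# Candidate pair generation within blocks, once via the DataFrame API
# (self-join on the blocking key, optimized by Catalyst) and once via the
# RDD API (explicit grouping and pairing per block).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("er-api-sketch").getOrCreate()

records = spark.createDataFrame(
    [(1, "anna", "smith"), (2, "anne", "smith"), (3, "bob", "jones")],
    ["id", "first", "last"],
)

# DataFrame API: self-join on the blocking key, keep each unordered pair once.
df = records.withColumn("bkey", F.substring("last", 1, 3))
pairs_df = (
    df.alias("a")
    .join(df.alias("b"), on="bkey")
    .where(F.col("a.id") < F.col("b.id"))
    .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"))
)

# RDD API: group records by blocking key, then pair them within each block.
def pairs_in_block(rows):
    rows = sorted(rows, key=lambda r: r[0])
    return [(a[0], b[0]) for i, a in enumerate(rows) for b in rows[i + 1:]]

pairs_rdd = (
    records.rdd.map(lambda r: (r["last"][:3], (r["id"], r["first"])))
    .groupByKey()
    .flatMap(lambda kv: pairs_in_block(list(kv[1])))
)

print(pairs_df.collect(), pairs_rdd.collect())
```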