Article

MFIBlocks: An effective blocking algorithm for entity resolution

Authors: Batya Kenig, Avigdor Gal

Abstract

Entity resolution is the process of discovering groups of tuples that correspond to the same real-world entity. Blocking algorithms separate tuples into blocks that are likely to contain matching pairs. Tuning is a major challenge in the blocking process and in particular, high expertise is needed in contemporary blocking algorithms to construct a blocking key, based on which tuples are assigned to blocks. In this work, we introduce a blocking approach that avoids selecting a blocking key altogether, relieving the user from this difficult task. The approach is based on maximal frequent itemsets selection, allowing early evaluation of block quality based on the overall commonality of its members. A unique feature of the proposed algorithm is the use of prior knowledge of the estimated size of duplicate sets in enhancing the blocking accuracy. We report on a thorough empirical analysis, using common benchmarks of both real-world and synthetic datasets to exhibit the effectiveness and efficiency of our approach.
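The following is a minimal, self-contained sketch of the idea summarized above, not the authors' implementation: each tuple is turned into a set of items, maximal frequent itemsets (MFIs) are mined under a support threshold reflecting the expected duplicate-set size, and the tuples supporting each MFI form a candidate block. The item encoding, the brute-force miner, and all names are illustrative assumptions.

```python
from itertools import combinations

def tuple_to_items(record):
    # Encode a tuple as a set of items: attribute-qualified, lower-cased tokens
    # (an illustrative encoding, not necessarily the paper's exact one).
    return {f"{attr}:{tok}" for attr, val in record.items()
            for tok in str(val).lower().split()}

def mine_mfis(transactions, minsup):
    # Brute-force maximal frequent itemset mining -- only suitable for tiny examples.
    items = sorted({item for t in transactions for item in t})
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = set(combo)
            if sum(itemset <= t for t in transactions) >= minsup:
                frequent.append(itemset)
    # Keep only maximal itemsets: those not strictly contained in another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]

def mfi_blocks(records, minsup):
    transactions = [tuple_to_items(r) for r in records]
    blocks = []
    for mfi in mine_mfis(transactions, minsup):
        members = [i for i, t in enumerate(transactions) if mfi <= t]
        if len(members) > 1:  # a block is useful only if it suggests comparisons
            blocks.append((mfi, members))
    return blocks

records = [
    {"name": "John Smith", "city": "Dublin"},
    {"name": "Jon Smith", "city": "Dublin"},
    {"name": "Mary Jones", "city": "Cork"},
]
# minsup = 2 encodes prior knowledge that duplicate sets contain about two tuples.
for mfi, members in mfi_blocks(records, minsup=2):
    print(sorted(mfi), "->", members)
```

On this toy input the only maximal frequent itemset is {city:dublin, name:smith}, so records 0 and 1 form the single candidate block.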


... MFIBlocks [75]: hash-based, redundancy-positive, proactive, static; Sorted Neighborhood (SN) [60,61,132]: sort-based, redundancy-neutral, proactive, static; Extended Sorted Neighborhood [25]: sort-based, redundancy-neutral, lazy, static; Incrementally Adaptive SN [185]: sort-based, redundancy-neutral, proactive, static; Accumulative Adaptive SN [185]: sort-based, redundancy-neutral, proactive, static ...
... A more advanced q-gram-based approach is MFIBlocks [75]. Its transformation function concatenates keys of Q-Grams Blocking into itemsets and uses a maximal frequent itemset algorithm to define new blocking keys from the itemsets whose support exceeds a predetermined threshold. ...
... Extended Q-grams Blocking raises PQ and RR at a limited, if any, cost in PC. MFIBlocks significantly reduces the number of blocks and matching candidates (i.e., very high PQ and RR) [75], but this may come at the cost of missed matches (insufficient PC) when the resulting blocking keys are too restrictive for matches with noisy descriptions [128]. ...
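A small illustrative sketch of the q-gram key transformation described in the excerpts above (key construction only, not the MFIBlocks algorithm itself): every blocking-key value is decomposed into overlapping character q-grams, which can then serve as the items fed to a maximal frequent itemset miner. The padding, attribute qualification, and all names are assumptions for illustration.

```python
def qgrams(value, q=3):
    # Decompose a string into overlapping character q-grams, with padding so that
    # prefixes and suffixes also contribute grams.
    padded = "#" * (q - 1) + value.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_itemset(record, key_attrs, q=3):
    # Concatenate the q-gram keys of the chosen attributes into one itemset,
    # qualifying each gram by its attribute so attributes stay distinguishable.
    items = set()
    for attr in key_attrs:
        items |= {f"{attr}:{g}" for g in qgrams(str(record.get(attr, "")), q)}
    return items

print(sorted(qgram_itemset({"name": "Smith"}, ["name"])))
# ['name:##s', 'name:#sm', 'name:h##', 'name:ith', 'name:mit', 'name:smi', 'name:th#']
```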
Full-text available
Article
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that correspond to the same real-world object. Due to its inherently quadratic complexity, a series of techniques accelerate it so that it scales to voluminous data. In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. The former restricts comparisons to entity pairs that are more likely to match, while the latter quickly identifies entity pairs that are likely to satisfy predetermined similarity thresholds. We also elaborate on hybrid approaches that combine different characteristics. For each framework we provide a comprehensive list of the relevant works, discussing them in the greater context. We conclude with the most promising directions for future work in the field.
... The reduction achieved by blocking may result in a significant efficiency improvement, but it may also reduce accuracy since the applied heuristic may falsely exclude valid linked records. Several blocking-based techniques were introduced for the non-incremental cases and include Multi-pass blocking [13,14], Q-gram based indexing [9,15,16], Canopy clustering [17], Suffix blocking [18,19], Mapping [4], MFIBlocks [20], and Meta-blocking [2,21]. Unfortunately, these methods cannot directly be used for the incremental case. ...
... Unfortunately, these methods cannot directly be used for the incremental case. Some, like Multi-pass blocking [13,14] and Q-gram based indexing [9,15,16], need some modifications [22], while others, like Canopy clustering [17], Suffix blocking [18,19], Mapping [4], MFIBlocks [20], and Meta-blocking [2,21], need a complete redesign. ...
... One of the most prominent blocking methods is based on suffixes [20,23]. In more detail, suffix-based blocking methods construct a structure combining the suffixes of all records in the dataset D [18,19]. ...
Article
Record linkage is the problem of identifying the different records that represent the same real-world object. Entity resolution is the problem of ensuring that a real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after the insertion of a new record in the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually utilized to address these problems in order to avoid comparing all record pairs. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we consider the non-incremental variation of record linkage and present a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that are able to handle the tested dataset in a few seconds (while a current state-of-the-art technique requires more than eight hours). Our second technique proposes a novel method that keeps a history of the deleted records and the merging process. Thus, we are able to discover alternative matches for the inserted record that are not possible for existing methods and improve the accuracy of the algorithm. We have implemented and extensively experimentally evaluated all our methods. We offer two implementations of our proposals. The first one is memory-based and offers the best efficiency while the second one is disk-based and scales seamlessly to very large datasets.
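As a rough illustration of the general suffix-blocking idea discussed above (not the specific incremental algorithms of this paper), the sketch below turns every sufficiently long suffix of a record's blocking-key value into a block key and keeps only blocks that suggest at least one comparison; the minimum suffix length and maximum block size are illustrative parameters.

```python
from collections import defaultdict

def suffixes(value, min_len=4):
    # All suffixes of the key value that are at least min_len characters long.
    v = value.lower()
    return {v[i:] for i in range(len(v) - min_len + 1)} if len(v) >= min_len else {v}

def suffix_blocks(records, key_func, min_len=4, max_block_size=50):
    blocks = defaultdict(set)
    for rid, rec in enumerate(records):
        for suf in suffixes(key_func(rec), min_len):
            blocks[suf].add(rid)
    # Very common suffixes produce huge, uninformative blocks; drop them, and drop
    # singleton blocks since they suggest no comparisons.
    return {suf: ids for suf, ids in blocks.items() if 1 < len(ids) <= max_block_size}

records = [{"name": "catherine smith"}, {"name": "katherine smith"}, {"name": "bob jones"}]
print(suffix_blocks(records, key_func=lambda r: r["name"].replace(" ", "")))
```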
... Among the available blocking algorithms that offer soft clusters, i.e., clusters that may share records as an outcome, we have selected MFIBlocks, due to four major unique features that make it best suited for uncertain entity resolution. For a detailed literature comparison, we refer the interested reader to [18]. ...
... We omit some of the configuration options and implementation details for brevity. For a detailed description see [18]. ...
... In our second evaluation we examine how the system's runtime scales with dataset size and the minsup parameter. We employ the method reported in [18] to prune the 0.03% most frequent items and compare the runtime with and without pruning. ...
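A sketch of the frequent-item pruning step mentioned in the excerpt, under the assumption that pruning "the 0.03% most frequent items" means dropping that fraction of distinct items with the highest document frequency before mining; the exact rule used in [18] may differ.

```python
from collections import Counter

def prune_frequent_items(transactions, prune_fraction=0.0003):
    # Document frequency: in how many transactions (records) each item appears.
    df = Counter(item for t in transactions for item in t)
    ranked = [item for item, _ in df.most_common()]
    n_prune = int(len(ranked) * prune_fraction)
    too_frequent = set(ranked[:n_prune])
    # Over-frequent items inflate block sizes without adding discriminative power
    # and slow down the itemset miner, so they are removed before mining.
    return [t - too_frequent for t in transactions], too_frequent

# Example: with 1,000,000 distinct items, the 300 most frequent ones are dropped.
```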
Article
In this work we present a multi-source uncertain entity resolution model and show its implementation in a use case of Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of being both massively multi-source and requiring multi-level entity resolution. With today's abundance of information sources, this project motivates the use of multi-source resolution on a big-data scale. We instantiate the proposed model using the MFIBlocks entity resolution algorithm and a machine learning approach, based upon decision trees, to transform soft clusters into a ranked clustering of records representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset that make it a good candidate for multi-source entity resolution. We conclude by proposing avenues for future research in this realm.
... Regarding the performance of a block collection, a common premise in the literature is that it is independent of the entity matching method that executes the pair-wise comparisons [4,2,10]. The rationale is that two duplicates can be detected as long as they co-occur in at least one block. ...
... More formally, two established measures are used for assessing the performance of a block collection [11,2,3,10]. [Figure 2: The formal definition of the weighting schemes of Meta-blocking. For EJS, |V_B| stands for the order (i.e., number of nodes) of the blocking graph G_B, while |n_x| denotes the degree of node n_x.] ...
... Measures. To assess the effectiveness of the (restructured) block collections, we employ four established measures [2,3,10,11,17]: ...
Article
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. In order to enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is being used to clean the overlapping blocks from unnecessary comparisons, increasing precision by orders of magnitude at a small cost in recall. Despite its high time efficiency though, using Meta-blocking in practice to solve the entity resolution problem on very large datasets is still challenging: applying it to 7.4 million entities takes (almost) 8 full days on a modern high-end server. In this paper, we introduce scalable algorithms for Meta-blocking, exploiting the MapReduce framework. Specifically, we describe a strategy for parallel execution that explicitly targets the core concept of Meta-blocking, the blocking graph. Furthermore, we propose two more advanced strategies, aiming to reduce the overhead of data exchange. The comparison-based strategy creates the blocking graph implicitly, while the entity-based strategy is independent of the blocking graph, employing fewer MapReduce jobs with a more elaborate processing. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the feasibility and superiority of our advanced strategies, and demonstrates their scalability to very large datasets.
... The code and the data of our experiments are publicly available for any interested researcher. 3 The rest of the paper is structured as follows: in Section 2, we delve into the most relevant works in the literature, while in Section 3, we formally define the task of Meta-blocking, elaborating on its main notions. Section 4 introduces our novel techniques, and Section 5 presents our thorough experimental evaluation. ...
... The redundancy-positive methods ensure that the more blocks two entities share, the more likely they are to be matching. In this category fall the Suffix Arrays [13], Q-grams Blocking [14], MFIBlocks [3], Attribute Clustering [2] and Token Blocking [9]. 2. The redundancy-negative methods ensure that the most similar entities share just one block. ...
... Performance Evaluation Metrics. To assess the performance of a blocking method, we follow the best practice in the literature, treating entity matching as an orthogonal task [1,3,2]. We assume that two duplicate entities can be detected using any of the available methods as long as they co-occur in at least one block. ...
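For reference, a compact sketch of the established blocking measures referred to in these excerpts: pair completeness (PC, the recall of matching pairs), pairs quality (PQ, the precision of the suggested comparisons), and reduction ratio (RR, the fraction of the all-pairs comparisons avoided). The block and ground-truth representations used here are assumptions for illustration.

```python
from itertools import combinations

def candidate_pairs(blocks):
    # Distinct comparisons suggested by a block collection (a pair appearing in
    # several blocks is counted once), following the premise that a match is
    # found if the pair co-occurs in at least one block.
    pairs = set()
    for block in blocks:
        pairs.update(frozenset(p) for p in combinations(sorted(block), 2))
    return pairs

def blocking_quality(blocks, gold_pairs, n_records):
    cands = candidate_pairs(blocks)
    gold = {frozenset(p) for p in gold_pairs}
    detected = cands & gold
    all_pairs = n_records * (n_records - 1) // 2
    pc = len(detected) / len(gold) if gold else 1.0        # pair completeness
    pq = len(detected) / len(cands) if cands else 0.0      # pairs quality
    rr = 1 - len(cands) / all_pairs if all_pairs else 0.0  # reduction ratio
    return pc, pq, rr

blocks = [{0, 1, 2}, {2, 3}]
gold_pairs = [(0, 1), (2, 3)]
print(blocking_quality(blocks, gold_pairs, n_records=5))  # (1.0, 0.5, 0.6)
```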
Article
Entity Resolution constitutes a quadratic task that typically scales to large entity collections through blocking. The resulting blocks can be restructured by Meta-blocking to raise precision at a limited cost in recall. At the core of this procedure lies the blocking graph, where the nodes correspond to entities and the edges connect the comparable pairs. There are several configurations for Meta-blocking, but no hints on best practices. In general, the node-centric approaches are more robust and suitable for a series of applications, but suffer from low precision, due to the large number of unnecessary comparisons they retain.
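A toy sketch of the blocking graph described above: nodes are entities, edges connect pairs that co-occur in at least one block, and a simple redundancy-positive weight counts the blocks a pair shares. Meta-blocking defines several richer weighting and pruning schemes; this shows only the common-blocks flavor, with illustrative names.

```python
from collections import Counter
from itertools import combinations

def blocking_graph(blocks):
    # Edge weight = number of blocks the two entities co-occur in
    # (the "common blocks" weighting; other schemes exist).
    weights = Counter()
    for block in blocks:
        for a, b in combinations(sorted(block), 2):
            weights[(a, b)] += 1
    return weights

def prune_edges(weights, min_weight=2):
    # Weight-edge pruning: keep only comparisons supported by enough shared blocks.
    return {edge: w for edge, w in weights.items() if w >= min_weight}

blocks = [{0, 1, 2}, {0, 1}, {1, 3}]
print(prune_edges(blocking_graph(blocks), min_weight=2))  # {(0, 1): 2}
```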
... Among the available blocking algorithms that offer soft clusters, i.e., clusters that may share records as an outcome, we have selected MFIBlocks, due to four major unique features that make it best suited for the analysis of the Yad Vashem data. For a detailed literature comparison, we refer the interested reader to [16]. ...
... We omit some of the configuration options and implementation details for brevity. For a detailed description see [16]. ...
... In our second evaluation we examine how the system's runtime scales with dataset size and the minsup parameter. We employ the method reported in [16] to prune the 0.03% most frequent items and compare the runtime with and without pruning. ...
Conference Paper
In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of being both massively multi-source and requiring multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss a set of requirements that led us to choose the MFIBlocks entity resolution algorithm in achieving the goals of the application. We also provide a machine learning approach, based upon decision trees, to transform soft clusters into a ranked clustering of records representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlighting the shortcomings of current methods and proposing avenues for future research in this realm.
... Entity Resolution (ER) aims at "cleaning" noisy data collections by identifying entity profiles, or simply entities, that represent the same real-world object. With a body of research that spans over multiple decades, ER has a wealth of formal models [7,11], efficient and effective algorithmic solutions [18,26], as well as a bulk of systems and benchmarks that allow for comparative analyses of solutions [4]. Elmagarmid et al. provide a comprehensive survey covering the complete deduplication process [6]. ...
... Block Building (BlBu) takes as input one or two entity collections and clusters them into blocks. Typically, each entity is represented by multiple blocking keys that are determined a priori, except for MFIBlocks [18], where the keys are the result of mining; blocks are then created based on the similarity, or equality, of these keys. As an illustrating example, consider Figure 2, which demonstrates the functionality of Standard Blocking [5] when using the unsupervised, schema-agnostic keys proposed in [24]. ...
... Q-grams Blocking (QGBl) [15] transforms the blocking keys of StBl into a format more resilient to noise: it converts every token into sub-sequences of q characters (q-grams) and builds blocks on these q-grams. [Figure 3: The relations between lazy and proactive methods, covering Standard Blocking (StBl) [5,26], Sorted Neighborhood (SoNe) [16], Q-grams Blocking (QGBl) [15], Extended Q-grams Blocking (EQGBl) [5], Suffix Arrays (SuAr) [1], Extended Suffix Arrays (ESuAr) [5], Canopy Clustering (CaCl) [21], Extended Canopy Clustering (ECaCl) [5], Attribute Clustering (ACl) [26], TYPiMatch (TYPiM) [20], MFIBlocks (MFIB) [18], Extended Sorted Neighborhood (ESoNe) [5].] ...
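A minimal sketch of block building with unsupervised, schema-agnostic keys in the spirit of the excerpt above: every token of every attribute value becomes a blocking key, regardless of the attribute it appears in. The tokenization and names are illustrative, not the exact scheme of [24].

```python
from collections import defaultdict

def token_blocking(entities):
    # Schema-agnostic keys: every token of every attribute value is a blocking key.
    blocks = defaultdict(set)
    for eid, profile in entities.items():
        for value in profile.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # Blocks with a single entity suggest no comparisons and can be dropped.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

entities = {
    "e1": {"name": "John Smith", "job": "plumber"},
    "e2": {"full_name": "Jon Smith"},   # different schema, shared token "smith"
}
print(token_blocking(entities))  # {'smith': {'e1', 'e2'}}
```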
Full-text available
Article
Entity Resolution is a core task for merging data collections. Due to its quadratic complexity, it typically scales to large volumes of data through blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. There are numerous blocking methods, and the aim of this work is to offer a comprehensive empirical survey, extending the dimensions of comparison beyond what is commonly available in the literature. We consider 17 state-of-the-art blocking methods and use 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency. We also investigate their scalability over a corpus of 7 established synthetic datasets that range from 10,000 to 2 million entities.
... Many tools exist to query, transform and analyze KGs. Notable examples include graph databases, such as RDF triple stores and Neo4J; tools for operating on RDF such as graphy and RDFlib; entity linking tools such as WAT [18] or BLINK [26]; entity resolution tools such as MinHash-LSH [14] or MFIBlocks [12]; libraries to compute graph embeddings such as PyTorch-BigGraph [13]; and libraries for graph analytics, such as graph-tool and NetworkX. There are three main challenges when using these tools together. ...
... Other efforts employed a similar set of processing steps [25]. These range from mapping the CORD-19 data to RDF, to adding annotations to the articles in the dataset pointing to entities extracted from the text, obtained from various sources [8]. A common thread among these efforts involves leveraging existing KGs such as Wikidata and Microsoft Academic Graph to, for example, build a citation network of the papers, authors, affiliations, etc. Other efforts focused on extraction of relevant entities (genes, proteins, cells, chemicals, diseases), relations (causes, upregulates, treats, binds), and linking them to KGs such as Wikidata and DBpedia. ...
Full-text available
Chapter
Knowledge graphs (KGs) have become the preferred technology for representing, sharing and adding knowledge to modern AI applications. While KGs have become a mainstream technology, the RDF/SPARQL-centric toolset for operating with them at scale is heterogeneous, difficult to integrate and only covers a subset of the operations that are commonly needed in data science applications. In this paper we present KGTK, a data science-centric toolkit designed to represent, create, transform, enhance and analyze KGs. KGTK represents graphs in tables and leverages popular libraries developed for data science applications, enabling a wide audience of developers to easily construct knowledge graph pipelines for their applications. We illustrate the framework with real-world scenarios where we have used KGTK to integrate and manipulate large KGs, such as Wikidata, DBpedia and ConceptNet.
... Table 1 maps all methods discussed in Sections 3.3 and 3.4 to our taxonomy.
(a) Rule-based, schema-aware methods:
[45]: hash-based, redundancy-free, lazy, static, structured data
Suffix Arrays Blocking (SA) [3]: hash-based, redundancy-positive, proactive, static, structured data
Extended Suffix Arrays Blocking [23,100]: hash-based, redundancy-positive, proactive, static, structured data
Improved Suffix Arrays Blocking [31]: hash-based, redundancy-positive, proactive, static, structured data
Q-Grams Blocking [23,100]: hash-based, redundancy-positive, lazy, static, structured data
Extended Q-Grams Blocking [11,23,100]: hash-based, redundancy-positive, lazy, static, structured data
MFIBlocks [67]: hash-based, redundancy-positive, proactive, static, structured data
Sorted Neighborhood (SN) [56,57,120]: sort-based, redundancy-neutral, proactive, static, structured data
Extended Sorted Neighborhood [23]: sort-based, redundancy-neutral, lazy, static, structured & XML data
Incrementally Adaptive SN [165]: sort-based, redundancy-neutral, proactive, static, structured data
Accumulative Adaptive SN [165]: sort-based, redundancy-neutral, proactive, static, structured data
Duplicate Count Strategy (DCS) [39]: sort-based, redundancy-neutral, proactive, dynamic, structured data
DCS++ [39]: sort-based, redundancy-neutral, proactive, dynamic, structured data
Sorted Blocks [38]: hybrid, redundancy-neutral, lazy, static, structured data
Sorted Blocks New Partition [38]: hybrid, redundancy-neutral, proactive, static, structured data
Sorted Blocks Sliding Window [38]: hybrid, redundancy-neutral, proactive, static, structured data
(b) ML-based (supervised), schema-aware methods:
ApproxRBSetCover [16]: hash-based, redundancy-positive, lazy, static, structured data
ApproxDNF [16]: hash-based, redundancy-positive, lazy, static, structured data
Blocking Scheme Learner (BSL) [89]: hash-based, redundancy-positive, lazy, static, structured data
Conjunction Learner [19] (semi-supervised): hash-based, redundancy-positive, lazy, static, structured data
BGP [44]: hash-based, redundancy-positive, lazy, static, structured data
CBlock [133]: hash-based, redundancy-positive, proactive, static, structured data
DNF Learner [50]: hash-based, redundancy-positive, lazy, dynamic, structured data
FisherDisjunctive [64] (unsupervised): hash-based, redundancy-positive, lazy, static, structured data ...
... A more advanced q-gram-based approach is MFIBlocks [67]. Its transformation function concatenates keys of Q-Grams Blocking into itemsets and uses a maximal frequent itemset algorithm for defining new blocking keys. ...
Full-text available
Preprint
Efficiency techniques have been an integral part of Entity Resolution since its infancy. In this survey, we organized the bulk of works in the field into Blocking, Filtering and hybrid techniques, facilitating their understanding and use. We also provided an in-depth coverage of each category, further classifying the corresponding works into novel sub-categories. Lately, the efficiency techniques have received more attention, due to the rise of Big Data. This includes large volumes of semi-structured data, which pose challenges not only to the scalability of efficiency techniques, but also to their core assumptions: the requirement of Blocking for schema knowledge and of Filtering for high similarity thresholds. The former led to the introduction of schema-agnostic Blocking in conjunction with Block Processing techniques, while the latter led to more relaxed criteria of similarity. Our survey covers these new fields in detail, putting in context all relevant works.
... This is done by either utilizing expert knowledge or using supervised machine learning, e.g., [1]. In this work we seek to utilize an unsupervised method for key selection, using a blocking algorithm called MFIBlocks [5]. This method selectively chooses all (or a subset of the) attributes to be used as clustering (blocking) criteria. ...
... We briefly describe the MFIBlocks algorithm, first described in [5], starting with MFIs, maximal frequent itemsets. ...
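As a complement to the description above, maximal frequent itemsets can also be mined for experimentation with an off-the-shelf library rather than a dedicated implementation; the sketch below assumes the mlxtend library (TransactionEncoder and fpmax) and toy attribute-qualified tokens, and is not part of MFIBlocks or [5].

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpmax  # maximal frequent itemsets

transactions = [
    ["name:john", "name:smith", "city:dublin"],
    ["name:jon", "name:smith", "city:dublin"],
    ["name:mary", "name:jones", "city:cork"],
]

# One-hot encode the transactions, then mine maximal frequent itemsets.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# mlxtend expects relative support, so an absolute minsup of 2 (roughly the
# expected duplicate-set size) becomes 2 / len(transactions).
mfis = fpmax(onehot, min_support=2 / len(transactions), use_colnames=True)
print(mfis)  # each row: support + a maximal frequent itemset (a frozenset of items)
```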
Full-text available
Conference Paper
In this work we explore relationships between financial entities for the purpose of community detection. We use the MFIBlocks algorithm to perform the task via subspace clustering and present some initial results over the FEIII 2018 challenge dataset.
... Finally, [60] introduces a method for building blocks using Maximal Frequent Itemsets (MFI) as blocking keys. Abstractly, each MFI (an itemset can be a set of tokens) of a specific attribute in the schema of a description defines a block, and descriptions containing the tokens of an MFI for this attribute are placed in a common block. ...
... [Blocking methods considered: Standard blocking [42,61], Q-grams [46], Suffixes [2], Sorted neighborhood [48,63], Adaptive sorted neighborhood [103], MFI [60], Token blocking [78], Attribute clustering blocking [81], Prefix-infix(-suffix) blocking [80], ppjoin+ [98,102], LSH blocking [70].] ... different levels of heterogeneity in the input entity collections. Since ppjoin+ and LSH blocking require a pre-defined similarity threshold for pairs to be considered as candidate matches and there is no generic or efficient way of setting it, we have not included these methods in our experimental study. ...
Full-text available
Thesis
Entity resolution (ER) is the problem of identifying descriptions of the same real-world entities among or within knowledge bases (KBs). In this PhD thesis, we study the problem of ER in the Web of data, in which entities are described using graph-structured RDF data, following the principles of the Linked Data paradigm. The two core ER problems are: (a) how can we effectively compute similarity of Web entities, and (b) how can we efficiently resolve sets of entities within or across KBs. Compared to deduplication of entities described by tabular data, the new challenges for these problems stem from the Variety (i.e., multiple entity types and cross-domain descriptions), the Volume (i.e., thousands of Web KBs with billions of facts, hosting millions of entity descriptions) and Veracity (i.e., various forms of inconsistencies and errors) of entity descriptions published in the Web of data. At the core of an ER task lies the process of deciding whether a given pair of descriptions refer to the same real-world entity i.e., if they match (problem a). The matching decision typically depends on the assessment of the similarity of two descriptions, based on their content or their neighborhood descriptions (i.e., of related entity types). This process is usually iterative, as matches found in one iteration help the decisions at the next iteration, via similarity propagation until no more matches are found. The number of iterations to converge clearly depends on the size and the complexity of the resolved entity collections. Moreover, pairwise entity matching is by nature quadratic to the number of entity descriptions, and thus prohibitive at the Web scale (problem b). In this respect, blocking aims to discard as many comparisons as possible without missing matches. It places entity descriptions into overlapping or disjoint blocks, leaving to the matching phase comparisons only between descriptions belonging to the same block. For this reason, overlapping blocking methods are accompanied by Meta-blocking filtering techniques, which aim to discard comparisons suggested by blocking that are either repeated (i.e., suggested by different blocks) or unnecessary (i.e., unlikely to result in matches) due to the noise in entity descriptions.To address ER at the Web-scale, we need to relax a number of assumptions underlying several methods and techniques proposed in the context of database, machine learning and semantic Web communities. Overall, the Big Data characteristics of entity descriptions in the Web of data call for novel ER frameworks supporting: (i) near similarity (identify matches with low similarity in their content), (ii) schema-free (do not rely on a given set of attributes used by all descriptions), (iii) no human in the loop (do not rely on domain-experts for training data, aligned relations, matching rules), (iv) non-iterative (avoiding data-convergence methods at several iteration steps), and (v) scalable to very large volumes of entity collections (massively parallel architecture needed).To satisfy the requirements of a Web-scale ER, we introduce the MinoanER framework. Our framework exploits new similarity metrics for assessing matching evidence based on both the content and the neighbors of entities, without requiring knowledge or alignment of the entity types. These metrics allow for a compact representation of similarity evidence that can be obtained from different blocking schemes on the names and values of the descriptions, but also on the values of their entity neighbors. 
This enables the identification of nearly similar matches even from the step of blocking. This composite blocking, accompanied by a novel composite Meta-blocking capturing the similarity evidence from the different types of blocks, sets the ground for non-iterative matching. The matching algorithm, built on a massively parallel architecture, is equipped with computationally cheap heuristics to detect matches in a fixed number of steps. The main contribution of MinoanER is that it achieves at least equivalent results over homogeneous KBs (stemming from common data sources, thus exhibiting strongly similar matches) and significantly better results over heterogeneous KBs (stemming from different sources, thus exhibiting many nearly similar matches) compared to state-of-the-art ER tools, without requiring any domain-specific knowledge, in a non-iterative and highly efficient way.
... Rule-based methods group tuples by static keys or decision rules that are derived by experts or from mere heuristics. Sorting-based methods (Papadakis et al. 2015; Kenig and Gal 2013) group tuples by efficiently sorting their textual similarities measured by various similarity functions. Hash-based approaches adopt hashing techniques (e.g., Min-Hashing (Steorts et al. 2014; Wang, Cui, and Liang 2015) and LSH (Ebraheem et al. 2018)) to map tuples into hash buckets. ...
Article
BERT has set a new state-of-the-art performance on the entity resolution (ER) task, largely owing to fine-tuning pre-trained language models and the deep pair-wise interaction. Albeit remarkably effective, it comes with a steep increase in computational cost, as the deep interaction requires exhaustively computing every tuple pair to search for co-references. For the ER task, this is often prohibitively expensive due to the large cardinality to be matched. To tackle this, we introduce a siamese network structure that independently encodes tuples using BERT but delays the pair-wise interaction via an enhanced alignment network. This siamese structure enables a dedicated blocking module to quickly filter out obviously dissimilar tuple pairs, and thus drastically reduces the cardinality of fine-grained matching. Further, the blocking and entity matching are integrated into a multi-task learning framework for facilitating both tasks. Extensive experiments on multiple datasets demonstrate that our model significantly outperforms state-of-the-art models (including BERT) in both efficiency and effectiveness.
... The blocking technique is widely adopted to prune the record pair set to an affordable size before EM [28,47]. However, most conventional blocking approaches [1,25,31,35] are key-based and learning-free, and their performance depends heavily on fine-tuning (see Survey [37]), while the deep blocking approaches [11,47] require a large amount of labels for training. The key issue is that these methods can hardly measure their effect on the following EM when deciding their hyperparameters. ...
Full-text available
Article
Entity matching (EM), as a fundamental task in data cleansing and integration, aims to identify the data records in databases that refer to the same real-world entity. While recent deep learning technologies significantly improve the performance of EM, they are often restrained by large-scale noisy data and insufficient labeled examples. In this paper, we present a novel EM approach based on deep neural networks and adversarial active learning. Specifically, we design a deep EM model to automatically complete missing textual values and capture both similarity and difference between records. Given that learning massive parameters in the deep model needs expensive labeling cost, we propose an adversarial active learning framework, which leverages active learning to collect a small amount of “good” examples and adversarial learning to augment the examples for stability enhancement. Additionally, to deal with large-scale databases, we present a dynamic blocking method that can be interactively tuned with the deep EM model. Our experiments on benchmark datasets demonstrate the superior accuracy of our approach and validate the effectiveness of all the proposed modules.
... At the heart of the data integration realm lies the matching task [5], in charge of aligning attributes of data sources both at a schema and data level, in order to enable formal mappings. Numerous algorithmic attempts (matchers) were suggested over the years for efficient and effective integration (e.g., [7,11,22,24]). Both practitioners and researchers also discussed data spaces as an appropriate data integration concept for DEs. ...
Article
A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that DEs face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven DE architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Last, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.
... In the blocking key construction, the attributes that compose the blocking key are selected and the blocking key method is determined. ER is only performed between records located in the same block to enhance efficiency and accuracy [36,37]. A taxonomy of the blocking dimension can be found in [38,39]. ...
Full-text available
Article
Entity Resolution (ER) is defined as the process of identifying records/objects that correspond to real-world objects/entities. To define a good ER approach, the schema of the data should be well-known. In addition, schema alignment of multiple datasets is not an easy task and may require either a domain expert or an ML algorithm to select which attributes to match. Schema-agnostic meta-blocking tries to solve such a problem by considering each token as a blocking key regardless of the attributes it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires the exact match of tokens, which is very hard to occur in actual datasets, and it results in very low precision. To overcome such issues, we propose a novel and efficient ER approach for big data implemented in Apache Spark. The proposed approach is employed to avoid schema alignment as it treats the attributes as a bag of words and generates a set of n-grams which is transformed to vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is a generic one as it can accept all types of datasets. It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected, such as the used blocking key, the significant attributes, NLP techniques, the ER threshold, and the used scenario of ER, 4) ER pipeline construction, and 5) Clustering, where the similar records are grouped into the same cluster. The ER pipeline can accept two types of attributes: the Weighted Attributes (WA) or the Compound Attributes (CA). In addition, it accepts all the settings selected in the fourth module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. Phase 3) Applying hashing Term Frequency (TF) to convert each n-gram to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classification of the objects as duplicates or not according to the calculated similarity between them. We introduced seven different scenarios as an input to the ER pipeline. To minimize the number of comparisons, we proposed the length filter, which greatly contributes to improving the effectiveness of the proposed approach, as it achieves the highest F-measure with the existing computational resources and scales well with the available working nodes. Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenarios 3 and 4 achieve the best performance time because using Soundex and Stemming contributes to reducing the performance time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because, by utilizing the length filter, we only compare records that are within a pre-determined percentage of increase or decrease of string length. LSH is used to map similar input items to the same buckets with a higher probability than dissimilar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates, which in turn increases the accuracy of the approach.
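A hedged sketch of the kind of Spark pipeline outlined in phases 1-4 above (tokenization, n-grams, hashing TF, MinHash LSH), built from standard pyspark.ml components; the column names, parameter values, and toy data are illustrative assumptions and not taken from the paper.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, col
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, MinHashLSH

spark = SparkSession.builder.appName("er-blocking-sketch").getOrCreate()

# Treat the selected attributes as a bag of words; "name" and "city" are
# illustrative column names, not a benchmark schema.
records = spark.createDataFrame(
    [(0, "john smith", "dublin"), (1, "jon smith", "dublin"), (2, "mary jones", "cork")],
    ["id", "name", "city"],
).withColumn("text", concat_ws(" ", col("name"), col("city")))

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),                            # phase 1: tokens
    NGram(n=2, inputCol="tokens", outputCol="ngrams"),                         # phase 2: n-grams
    HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18),   # phase 3: hashing TF
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3),      # phase 4: LSH
])
model = pipeline.fit(records)
hashed = model.transform(records)

# Phase 5 (simplified): pairs whose estimated Jaccard distance falls below a
# threshold are candidate duplicates; symmetric and self-pairs are filtered out.
candidates = model.stages[-1].approxSimilarityJoin(hashed, hashed, 0.8, distCol="jaccard_dist")
candidates.filter("datasetA.id < datasetB.id").show(truncate=False)
```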
... At the heart of the data integration realm lies the matching task [5], in charge of aligning attributes of data sources both at a schema and data level, in order to enable formal mappings. Numerous algorithmic attempts (matchers) were suggested over the years for efficient and effective integration (e.g., [7,11,22,24]). Both practitioners and researchers also discussed data spaces as an appropriate data integration concept for DEs. ...
Full-text available
Preprint
A Data Ecosystem offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that data ecosystems face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven data ecosystem architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Lastly, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.
... Many tools exist to query, transform and analyze KGs. Notable examples include graph databases such as RDF triple stores and Neo4J; 2 tools for operating on RDF such as graphy 3 and RDFlib 4 , entity linking tools such as WAT [16] or BLINK [24], entity resolution tools such as MinHash-LSH [12] or MFIBlocks [10], libraries to compute graph embeddings such as PyTorch-BigGraph [11] and libraries for graph analytics, such as graph-tool 5 and NetworkX. 6 There are three main challenges when using these tools together. ...
Full-text available
Preprint
Knowledge graphs (KGs) have become the preferred technology for representing, sharing and adding knowledge to modern AI applications. While KGs have become a mainstream technology, the RDF/SPARQL-centric toolset for operating with them at scale is heterogeneous, difficult to integrate and only covers a subset of the operations that are commonly needed in data science applications. In this paper, we present KGTK, a data science-centric toolkit to represent, create, transform, enhance and analyze KGs. KGTK represents graphs in tables and leverages popular libraries developed for data science applications, enabling a wide audience of developers to easily construct knowledge graph pipelines for their applications. We illustrate KGTK with real-world scenarios in which we have used KGTK to integrate and manipulate large KGs, such as Wikidata, DBpedia and ConceptNet, in our own work.
... ML techniques have been used for entity resolution as well. Kenig and Gal (2013) used an unsupervised ML technique called maximal frequent itemsets (MFI) to learn the optimal clusters in which to search for duplicates. Sagi et al. (2017) expanded upon this work by training an alternating decision tree model (Freund and Mason, 1999) to classify pairs within the blocks into matched and unmatched entities. ...
Full-text available
Article
Oceanographic research is a multidisciplinary endeavor that involves the acquisition of an increasing amount of in-situ and remotely sensed data. A large and growing number of studies and data repositories are now available on-line. However, manually integrating different datasets is a tedious and grueling process leading to a rising need for automated integration tools. A key challenge in oceanographic data integration is to map between data sources that have no common schema and that were collected, processed, and analyzed using different methodologies. Concurrently, artificial agents are becoming increasingly adept at extracting knowledge from text and using domain ontologies to integrate and align data. Here, we deconstruct the process of ocean science data integration, providing a detailed description of its three phases: discover, merge, and evaluate/correct. In addition, we identify the key missing tools and underutilized information sources currently limiting the automation of the integration process. The efforts to address these limitations should focus on (i) development of artificial intelligence-based tools for assisting ocean scientists in aligning their schema with existing ontologies when organizing their measurements in datasets; (ii) extension and refinement of conceptual coverage of – and conceptual alignment between – existing ontologies, to better fit the diverse and multidisciplinary nature of ocean science; (iii) creation of ocean-science-specific 'entity resolution' benchmarks to accelerate the development of tools utilizing ocean science terminology and nomenclature; (iv) creation of ocean-science-specific schema matching and mapping benchmarks to accelerate the development of matching and mapping tools utilizing semantics encoded in existing vocabularies and ontologies; (v) annotation of datasets, and development of tools and benchmarks for the extraction and categorization of data quality and preprocessing descriptions from scientific text; and (vi) creation of large-scale word embeddings trained upon ocean science literature to accelerate the development of information extraction and matching tools based on artificial intelligence.
... Since one blocking key may combine one or more attributes and one attribute may have multiple tokens, each record can be assigned to many groups. In this category fall the traditional blocking approach [6], suffix-array [9,10], q-gram based methods [11,12], sorted neighborhood [6,13], MFIBlocks [14], LSH-based methods [15][16][17] and PPJoin and its variants [18]. MFIBlocks is an effective blocking method based on maximal frequent itemsets but is limited by the high time complexity enforced by the MFI mining algorithms, despite some improvements given by the authors. ...
Article
Entity resolution is a well-known challenge in data management due to the lack of unique identifiers of records and the various errors hidden in the data, undermining the identifiability of the entities they refer to. To reveal matching records, every record potentially needs to be compared with all other records in the database, which is computationally intractable even for moderately-sized databases. To circumvent this quadratic challenge, blocking methods are typically employed to restrict comparisons to promising pairs within small subsets of records, called blocks. Existing effective methods typically rely on blocking keys created by experts to capture matches, which inevitably involves a large amount of human labor and does not guarantee high-quality results. To reduce manual labor and promote accuracy, machine learning approaches have been investigated to meet the challenge, with limited success due to the high requirements of training data and inefficiency, especially for large databases. The exhaustive method produces exact results but suffers from efficiency problems. In this paper, we propose a paradigm of divide-and-conquer entity resolution, named recursive blocking, which derives comparatively good results while largely alleviating efficiency concerns. Specifically, recursive blocking refines blocks and traps matches in an iterative fashion to derive high-quality results. We study two types of recursive blocking, i.e., redundancy- and partition-based approaches, and investigate their relative performance. Comprehensive experiments on both real-world and synthetic datasets verified the superiority of our approaches over existing ones.
... Finally, MFIBlocks [99] uses maximal frequent itemsets as blocking keys. Each itemset is a collection of concatenated tokens from a specific attribute. ...
Preprint
One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.
... Another recent schema-based blocking method uses Maximal Frequent Itemsets (MFI) as blocking keys [19] - an itemset can be a set of tokens. Abstractly, each MFI of a specific attribute in the schema of a description defines a block, and descriptions containing the tokens of an MFI for this attribute are placed in a common block. ...
Full-text available
Preprint
Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety.
... Another recent schema-based blocking method uses Maximal Frequent Itemsets (MFI) as blocking keys [19] - an itemset can be a set of tokens. Abstractly, each MFI of a specific attribute in the schema of a description defines a block, and descriptions containing the tokens of an MFI for this attribute are placed in a common block. ...
Full-text available
Conference Paper
Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety.
... It picks one or more specific fields to be used as a blocking key [7]. Then it compares the chosen blocking key values (BKVs); only the records with the same BKV are inserted into the same block. The most well-known algorithms implementing the standard blocking are the naïve pairwise and the sorted neighborhood blocking. ...
Chapter
Entity resolution is a critical process to enable big data integration. It aims to identify records that refer to the same real-world entity over one or several data sources. Over time, entity resolution has become a more problematic and challenging process due to the continuous increase in data volume and variety. Therefore, blocking techniques have been developed to overcome entity resolution limitations by partitioning datasets into "blocks" of records. This partitioning step allows them to be processed in parallel, applying entity resolution methods within each block individually. Current blocking techniques are categorized into two main types: efficient or effective. The effective category includes the techniques that target the accuracy and quality of results. On the other hand, the efficient category includes fast techniques that nevertheless report low accuracy. However, no technique has succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique to fill the existing gap, achieving high efficiency with no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. The extensive simulation studies conducted on a benchmark dataset proved the ability to combine both concepts in one technique while avoiding their drawbacks. The results report an outstanding performance in terms of scalability, efficiency and effectiveness and promise a new step forward in the entity resolution field.
... • If the organization sends the product which Robert Davis (original profile a1) has purchased to the address shown in the Robert Davis profile a2, the product will have to be re-sent to the correct address in profile a1, incurring increased cost and loss for the organization. The task of identifying records which refer to the same real-world entity has been studied extensively under various terminologies such as merge-purge, data de-duplication, instance identification, database hardening, co-reference resolution, identity uncertainty, entity resolution and duplicate detection [32][33][34]. Some of these methods are based on a single database, whereas others are based on the analysis of two or more databases. ...
Article
We propose an interactive decision-making framework to assist a Customer Service Representative (CSR) in the efficient and effective recognition of customer records in a database with many ambiguous entries. Our proposed framework consists of three integrated modules. The first module focuses on the detection and resolution of duplicate records to improve effectiveness and efficiency in customer recognition. The second module determines the level of ambiguity in recognizing an individual customer when there are multiple records with the same name. The third module recommends the series of feature-related questions that the CSR should ask the customer to enable rapid recognition, based on that level of ambiguity. In the first module, the F-Swoosh approach for duplicate detection is used, and in the second module a dynamic programming-based technique is used to determine the level of ambiguity within the customer database for a given name. In the third module, Levenshtein edit distance is used for feature selection in combination with weights based on the Inverse Document Frequency (IDF) of terms. The algorithm that requires the minimum number of questions to be put to the customer to achieve recognition is the algorithm that is chosen. We evaluate the proposed framework on a synthetic dataset and demonstrate how it assists the CSR to rapidly recognize the correct customer.
... Kenig and Gal [61] propose another technique that clusters tuples according to their overlap of common attributes. However, this approach relies on knowledge about the estimated size of the duplicate clusters and considers only full attributes. ...
Thesis
Carrying out business processes successfully is closely linked to the quality of the data inventory in an organization. Deficiencies in data quality lead to problems: Incorrect address data prevents (timely) shipments to customers. Erroneous orders lead to returns and thus to unnecessary effort. Wrong pricing forces companies to miss out on revenues or to impair customer satisfaction. If orders or customer records cannot be retrieved, complaint management takes longer. Due to erroneous inventories, too few or too many supplies might be reordered. A special problem with data quality and the reason for many of the issues mentioned above are duplicates in databases. Duplicates are different representations of the same real-world objects in a dataset. However, these representations differ from each other and are for that reason hard to match by a computer. Moreover, the number of required comparisons to find those duplicates grows with the square of the dataset size. To cleanse the data, these duplicates must be detected and removed. Duplicate detection is a very laborious process. To achieve satisfactory results, appropriate software must be created and configured (similarity measures, partitioning keys, thresholds, etc.). Both require much manual effort and experience. This thesis addresses automation of parameter selection for duplicate detection and presents several novel approaches that eliminate the need for human experience in parts of the duplicate detection process. A pre-processing step is introduced that analyzes the datasets in question and classifies their attributes semantically. Not only do these annotations help in understanding the respective datasets, but they also facilitate subsequent steps, for example, by selecting appropriate similarity measures or normalizing the data upfront. This approach works without schema information. Following that, we show a partitioning technique that strongly reduces the number of pair comparisons for the duplicate detection process. The approach automatically finds particularly suitable partitioning keys that simultaneously allow for effective and efficient duplicate retrieval. By means of a user study, we demonstrate that this technique finds partitioning keys that outperform expert suggestions and additionally does not need manual configuration. Furthermore, this approach can be applied independently of the attribute types. To measure the success of a duplicate detection process and to execute the described partitioning approach, a gold standard is required that provides information about the actual duplicates in a training dataset. This thesis presents a technique that uses existing duplicate detection results and crowdsourcing to create a near gold standard that can be used for the purposes above. Another part of the thesis describes and evaluates strategies for how to reduce these crowdsourcing costs and to achieve a consensus with less effort.
... Therefore, in recent years, researchers have proposed block-based ER techniques, which divide datasets into smaller data blocks according to certain features or rules. Then, ER is executed within these blocks to improve the efficiency of the algorithms [3,4]. ...
... Schema matching [2], process model matching [49], ontology alignment [17], music similarity [4], and Web service composition [7] are all examples of such problems. With a body of research that spans over multiple decades, data integration has a wealth of formal models of integration [24,18,22,2], algorithmic solutions for efficient and effective integration [46,26,21], and a body of systems, benchmarks and competitions that allow comparative empirical analysis of integration solutions [8,9]. ...
Full-text available
Article
The evolution of data accumulation, management, analytics, and visualization has led to the coining of the term big data, which challenges the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures, based upon binary similarity matrices (match/non-match), have dominated evaluation practices of matching tasks. However, in the presence of big data, such measures no longer suffice. In this work, we propose evaluation methods for non-binary matrices as well. Non-binary evaluation is formally defined together with several new, non-binary measures using a vector space representation of the matching outcome. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterparts in several problem domains.
... Common approaches include key based blocking that partitions tuples into blocks based on their values on certain attributes and rule based blocking where a decision rule determines which block a tuple falls into. There has been limited work on simplifying this process by either learning blocking schemes such as [37] or tuning the blocking [42]. In contrast, our work automates the blocking process by requiring minimal input from the domain expert. ...
Article
Entity Resolution (ER) is a fundamental problem with many applications. Machine learning (ML)-based and rule-based approaches have been widely studied for decades, with many efforts being geared towards which features/attributes to select, which similarity functions to employ, and which blocking function to use - complicating the deployment of an ER system as a turn-key system. In this paper, we present DeepER, a turn-key ER system powered by deep learning (DL) techniques. The central idea is that distributed representations and representation learning from DL can alleviate the above human efforts for tuning existing ER systems. DeepER makes several notable contributions: encoding a tuple as a distributed representation of attribute values, building classifiers using these representations and a semantic-aware blocking based on LSH, and learning and tuning the distributed representations for ER. We evaluate our algorithms on multiple benchmark datasets and achieve competitive results while requiring minimal interaction with experts.
... More than 25 million articles exist in Baidu Baike and Hudong Baike, while the number of instances to be matched is typically small. Thus, blocking techniques [46,60-63] should be used to reduce the number of instances for matching in advance. ...
Full-text available
Article
Instance matching is the problem of determining whether two instances describe the same real-world entity or not. Instance matching plays a key role in data integration and data cleansing, especially for building a knowledge base. For example, we can regard each article in encyclopedias as an instance, and a group of articles which refer to the same real-world object as an entity. Therefore, articles about Washington should be distinguished and grouped into different entities such as Washington, D.C. (the capital of the USA), George Washington (first president of the USA), Washington (a state of the USA), Washington (a village in West Sussex, England), Washington F.C. (a football club based in Washington, Tyne and Wear, England), and Washington, D.C. (a novel). In this paper, we propose a novel instance matching approach, Active Instance Matching with Pairwise Constraints, which brings the human into the loop of instance matching. The proposed approach generates candidate pairs in advance to reduce the computational complexity, and then iteratively selects the most informative pairs according to the uncertainty, influence, connectivity and diversity of the pairs. We evaluated our approach on two publicly available datasets, AMINER and WIDE-NUS, and then applied it to two large-scale real-world datasets, Baidu Baike and Hudong Baike, to build a Chinese knowledge base. The experiments and practice illustrate the effectiveness of our approach.
... Kenig and Gal [17] focus on the subproblem of blocking entities for achieving better performance in the entity matching problem. Blocking is the process of grouping tuples together in such a way that tuples in different groups (blocks) must not refer to the same entity and tuples in the same group may refer to the same entity, serving as a preprocessing step for more complex comparisons inside the blocks. ...
Article
Data deduplication is the process of discovering multiple representations of the same entity in an information system. Blocking has been a benchmark technique for avoiding pair-wise record comparisons in data deduplication. Standard blocking (SB) aims at putting potential duplicate records in the same block on the basis of a blocking key. Afterwards, detailed comparisons are made only among the records residing in the same block. The selection of a blocking key is a tedious process that involves exponentially many alternatives. The outcome of SB varies considerably with a change of blocking key. To this end, we propose a robust blocking technique called Locality Sensitive Blocking (LSB) that does not require the selection of a blocking key. The experimental results show an increase of up to 0.448 in F-score as compared with SB. Furthermore, LSB is found to be more robust towards blocking parameters and data noise.
Article
Entity matching (EM) aims to identify whether two records refer to the same underlying real-world entity. Traditional entity matching methods mainly focus on structured data, where the attribute values are short and atomic. Recently, there has been an increasing demand for matching textual records, such as matching descriptions of products that correspond to long spans of text, which challenges the applications of these methods. Although a few deep learning (DL) solutions have been proposed, these solutions tend to use the DL techniques directly and treat EM as a generic NLP task without addressing the unique demands of EM. Thus, the performance of these DL-based solutions is still far from satisfactory. In this paper, we present JointMatcher, a novel EM method based on pre-trained Transformer-based language models, so that the generated features of the textual records contain context information. We realize that paying more attention to the similar segments and number-containing segments of a record pair is crucial for accurate matching. To integrate the highly contextualized features while paying more attention to the similar segments and the number-containing segments, JointMatcher is equipped with a relevance-aware encoder and a numerically-aware encoder. Extensive experiments using structured and real-world textual datasets demonstrate that JointMatcher outperforms the previous state-of-the-art (SOTA) results without injecting any domain knowledge when small or medium size training sets are used.
Article
Different use cases have acknowledged the importance of author identities and the non-triviality of determining them. Author disambiguation (AD) is a special case of entity resolution resolving author mentions to actual real-world authors. Like in other entity resolution tasks, AD methods are strongly restricted by scale and person name conventions. So far, this has been addressed by static blocking methods which cannot adapt to such collection-dependent properties. We address this gap by presenting the first progressive method of author disambiguation. Progressive entity resolution tackles large-scale conflation problems by repeatedly increasing the number of pairs compared for potential equivalence. Our method uses lattice structures to model name inclusion in an adaptive and more efficient way than traditional blocking techniques based on alphabetical order or fixed-level generalization. Our work offers additional insights into the relationship between name-matching, different blocking schemes, blocking and clustering as well as cost and benefit. Using the Web of Science as large-scale annotated test data, we observe and compare our model’s performance over time and compare it with various configurations and baselines. Our approach consistently outperforms state-of-the-art blocking methods, underlining its contribution to the field of author disambiguation. Our approach offers a novel alternative for tackling ambiguity in entity resolution, which is a major challenge for many information systems.
Article
Entity resolution refers to the process of identifying, matching, and integrating records belonging to unique entities in a data set. However, a comprehensive comparison across all pairs of records leads to quadratic matching complexity. Therefore, blocking methods are used to group similar entities into small blocks before the matching. Available blocking methods typically do not consider semantic relationships among records. In this paper, we propose a Semantic-aware Meta-Blocking approach called SeMBlock. SeMBlock considers the semantic similarity of records by applying locality-sensitive hashing (LSH) based on word embedding to achieve fast and reliable blocking in a large-scale data environment. To improve the quality of the blocks created, SeMBlock builds a weighted graph of semantically similar records and prunes the graph edges. We extensively compare SeMBlock with 16 existing blocking methods, using three real-world data sets. The experimental results show that SeMBlock significantly outperforms all 16 methods with respect to two relevant measures, F-measure and pair-quality measure. F-measure and pair-quality measure of SeMBlock are approximately 7% and 27%, respectively, higher than recently released blocking methods.
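As a rough illustration of the kind of embedding-based LSH blocking described above (a generic sketch under simple assumptions, not the SeMBlock implementation), the snippet below hashes record embedding vectors with random hyperplanes so that records with similar vectors tend to share a bucket; the embedding step itself is assumed to be given.

```python
import numpy as np

def lsh_blocks(vectors, n_planes=8, seed=0):
    """Bucket embedding vectors by the sign pattern of random hyperplane projections.
    Vectors with small cosine distance are likely to share a bucket (block)."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(vectors.values())))
    planes = rng.normal(size=(n_planes, dim))
    buckets = {}
    for rec_id, vec in vectors.items():
        signature = tuple((planes @ np.asarray(vec) > 0).astype(int))
        buckets.setdefault(signature, []).append(rec_id)
    return buckets

# Hypothetical 4-dimensional "embeddings" of three records.
vectors = {"r1": [0.9, 0.1, 0.0, 0.2], "r2": [0.85, 0.15, 0.05, 0.2], "r3": [-0.7, 0.9, 0.4, -0.1]}
print(lsh_blocks(vectors))
```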
Full-text available
Article
In Papadakis et al. (2020), we presented the latest release of JedAI, an open-source Entity Resolution (ER) system that allows for building a large variety of end-to-end ER pipelines. Through a thorough experimental evaluation, we compared a schema-agnostic ER pipeline based on blocks with another schema-based ER pipeline based on similarity joins. We applied them to 10 established, real-world datasets and assessed them with respect to effectiveness and time efficiency. Special care was taken to juxtapose their scalability, too, using seven established, synthetic datasets. Moreover, we experimentally compared the effectiveness of the batch schema-agnostic ER pipeline with its progressive counterpart. In this companion paper, we describe how to reproduce the entire experimental study that pertains to JedAI’s serial execution through its intuitive user interface. We also explain how to examine the robustness of the parameter configurations we have selected.
Chapter
Sustainable development consists of a set of goals (SDG) associated with a myriad of targets, to be achieved between 2015 and 2030. Although none of the goals refers directly to information and communication technologies (ICT), the latter can accelerate human development and bridge digital gaps, helping to build modern communities. Data cleaning is a field of computing that aims to extract meaningful information that can be used in many areas to help the community. In particular, entity matching is a crucial step for data cleaning and data integration. The task consists of grouping similar instances of real-world entities. The main challenge is to reduce the number of required comparisons, since a pairwise comparison of all records is time consuming (quadratic time complexity). For this reason, the use of indexing techniques such as Sorted Neighborhood Methods and blocking is indispensable, since they divide the data into partitions in such a manner that only records within the same block are compared with each other. In this work, we propose two novel hybrid approaches which use a varying block. Indeed, reducing the similarity distance can improve the number of detected duplicates with a smaller number of comparisons. We theoretically prove the efficiency of our algorithms. Computational tests are performed by evaluating our technique on two real-world datasets and comparing it with three baseline algorithms, and the results show that our proposed approach performs efficiently.
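For readers unfamiliar with the Sorted Neighborhood Method referenced above, the following is a minimal, generic sketch (the sorting key and window size are illustrative assumptions): records are sorted by a key, and only records falling inside a sliding window over the sorted order are compared.

```python
from itertools import combinations

def sorted_neighborhood_pairs(records, key_fn, window=3):
    """Sort records by a key and compare each record only with the records
    that fall inside a fixed-size sliding window over the sorted order."""
    ordered = sorted(records, key=key_fn)
    pairs = set()
    for i in range(len(ordered) - window + 1):
        for a, b in combinations(ordered[i:i + window], 2):
            pairs.add(tuple(sorted((a["id"], b["id"]))))
    return pairs

records = [
    {"id": 1, "name": "meyer, anna"},
    {"id": 2, "name": "meier, anna"},
    {"id": 3, "name": "zhang, wei"},
]
# Sorting on the name brings likely duplicates next to each other.
print(sorted_neighborhood_pairs(records, key_fn=lambda r: r["name"], window=2))
```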
Article
Entity resolution (ER) is a significant task in data integration, which aims to detect all entity profiles that correspond to the same real-world entity. Due to its inherently quadratic complexity, blocking was proposed to ameliorate ER: it offers an approximate solution which clusters similar entity profiles into blocks so that it suffices to perform pairwise comparisons inside each block, reducing the computational cost of ER. This paper presents a comprehensive survey on existing blocking technologies. We summarize and analyze all classic blocking methods with emphasis on different blocking construction and optimization techniques. We find that traditional blocking ER methods which depend on a fixed schema may not work in the context of highly heterogeneous information spaces. How to use schema information flexibly is of great significance for efficiently processing data with the new features of this era. Machine learning is an important tool for ER, but end-to-end and efficient machine learning methods still need to be explored. We also summarize the most promising directions for future work, including real-time blocking ER, incremental blocking ER, deep learning with ER, etc.
Article
Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.
Chapter
With the development of the Internet and cloud computing, there is a need for databases that can store and process big data, and 'Not only SQL' (NoSQL) databases are becoming increasingly used in big data domains, offering strengths such as scalability and flexibility. This paper explains the growing interest in implementing NoSQL in data warehouses. In addition, it investigates the use of data cleaning (the process of detecting and correcting or removing inaccurate records from a database) in NoSQL databases. More precisely, we are interested in adapting data deduplication algorithms to two NoSQL models: document-oriented and column-oriented. Finally, we compare the implemented algorithms and report the results of our simulations.
Full-text available
Article
An increasing number of entities are described by interlinked data rather than documents on the Web. Entity Resolution (ER) aims to identify descriptions of the same real-world entity within one or across knowledge bases in the Web of data. To reduce the required number of pairwise comparisons among descriptions, ER methods typically perform a pre-processing step, called blocking, which places similar entity descriptions into blocks and thus only compare descriptions within the same block. We experimentally evaluate several blocking methods proposed for the Web of data using real datasets, whose characteristics significantly impact their effectiveness and efficiency. The proposed experimental evaluation framework allows us to better understand the characteristics of the missed matching entity descriptions and contrast them with ground truth obtained from different kinds of relatedness links.
Book
In recent years, several knowledge bases have been built to enable large-scale knowledge sharing, but also an entity-centric Web search, mixing both structured data and text querying. These knowledge bases offer machine-readable descriptions of real-world entities, e.g., persons, places, published on the Web as Linked Data. However, due to the different information extraction tools and curation policies employed by knowledge bases, multiple, complementary and sometimes conflicting descriptions of the same real-world entities may be provided. Entity resolution aims to identify different descriptions that refer to the same entity appearing either within or across knowledge bases. The objective of this book is to present the new entity resolution challenges stemming from the openness of the Web of data in describing entities by an unbounded number of knowledge bases, the semantic and structural diversity of the descriptions provided across domains even for the same real-world entities, as well as the autonomy of knowledge bases in terms of adopted processes for creating and curating entity descriptions. The scale, diversity, and graph structuring of entity descriptions in the Web of data essentially challenge how two descriptions can be effectively compared for similarity, but also how resolution algorithms can efficiently avoid examining pairwise all descriptions. The book covers a wide spectrum of entity resolution issues at the Web scale, including basic concepts and data structures, main resolution tasks and workflows, as well as state-of-the-art algorithmic techniques and experimental trade-offs.
Full-text available
Article
Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
Full-text available
Article
Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
Full-text available
Conference Paper
Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
Full-text available
Conference Paper
A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared to other established methods and software, such as biclustering, on the same datasets. Practical efficiency of different algorithms for computing frequent closed sets of attributes is compared.
Full-text available
Conference Paper
Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for generating the candidate matches and deciding which methods to use to compare the selected attributes. These attribute and method choices constitute a blocking scheme. Previous approaches to record linkage address the blocking issue in a largely ad-hoc fashion. This paper presents a machine learning approach to automatically learn effective blocking schemes. We validate our approach with experiments that show our learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.
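The flavor of such learned blocking schemes can be conveyed with a small, hypothetical sketch: given labelled duplicate pairs, greedily pick predicates (here simply "equal value on attribute X") that cover many still-uncovered duplicates while generating few candidate pairs. This is only a simplified illustration of the general idea, not the algorithm from the paper.

```python
from collections import defaultdict
from itertools import combinations

def pairs_covered(records, attr):
    """Candidate pairs produced by the predicate 'equal value on attr'."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[attr]].append(r["id"])
    return {tuple(sorted(p)) for ids in blocks.values() for p in combinations(ids, 2)}

def learn_blocking_scheme(records, attrs, true_duplicates, max_predicates=2):
    """Greedily select predicates that cover many known duplicates with few candidates.
    The 0.01 penalty per generated pair is an arbitrary illustrative trade-off."""
    remaining, scheme = set(true_duplicates), []
    for _ in range(max_predicates):
        best = max(attrs, key=lambda a: len(pairs_covered(records, a) & remaining)
                                        - 0.01 * len(pairs_covered(records, a)))
        covered = pairs_covered(records, best) & remaining
        if not covered:
            break
        scheme.append(best)
        remaining -= covered
    return scheme

records = [{"id": 1, "zip": "10001", "city": "NYC"},
           {"id": 2, "zip": "10001", "city": "New York"},
           {"id": 3, "zip": "94105", "city": "NYC"}]
print(learn_blocking_scheme(records, ["zip", "city"], true_duplicates={(1, 2)}))
```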
Full-text available
Conference Paper
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass.
Full-text available
Conference Paper
In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.
Full-text available
Conference Paper
A large proportion of the massive amounts of data that are being collected by many organisations today is about people, and often contains identifying information like names, addresses, dates of birth, or social security numbers. Privacy and confidentiality are of great concern when such data is being processed and analysed, and when there is a need to share such data between organisations or make it publicly available. The research area of data linkage is especially suffering from a lack of publicly available real-world data sets, as experimental evaluations and comparisons are difficult to conduct without real data. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data with realistic characteristics, such as frequency distributions and error probabilities. Our data generator significantly improves similar earlier approaches, and allows the creation of data containing records for individuals, households and families.
Full-text available
Article
Since the introduction of association rule mining in 1993 by Agrawal, Imielinski and Swami, the frequent itemset mining (FIM) tasks have received a great deal of attention. Within the last decade, a phenomenal number of algorithms have been developed for mining all, closed and maximal frequent itemsets. Every new paper claims to run faster than previously existing algorithms, based on experimental testing that is oftentimes quite limited in scope, since many of the original algorithms...
Full-text available
Article
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
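A minimal sketch of the multi-pass idea described above (the keys, window size, and equality test are illustrative assumptions, not the system's rule module): run one sorted-neighborhood pass per sorting key, collect matching pairs, and merge them through a transitive closure implemented with union-find.

```python
from itertools import combinations

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def multi_pass_closure(records, key_fns, match_fn, window=3):
    """One sorted-neighborhood pass per key; union matching pairs across passes."""
    parent = {r["id"]: r["id"] for r in records}
    for key_fn in key_fns:
        ordered = sorted(records, key=key_fn)
        for i in range(len(ordered)):
            for a, b in combinations(ordered[i:i + window], 2):
                if match_fn(a, b):
                    parent[find(parent, a["id"])] = find(parent, b["id"])
    clusters = {}
    for rid in parent:
        clusters.setdefault(find(parent, rid), set()).add(rid)
    return list(clusters.values())

records = [{"id": 1, "name": "anna meier", "phone": "555-1234"},
           {"id": 2, "name": "anna meyer", "phone": "555-1234"},
           {"id": 3, "name": "wei zhang",  "phone": "555-9999"}]
same_phone = lambda a, b: a["phone"] == b["phone"]
print(multi_pass_closure(records, [lambda r: r["name"], lambda r: r["phone"]], same_phone))
```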
Full-text available
Chapter
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
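A toy, level-wise frequent-itemset sketch can illustrate the support-counting step that underlies association rule generation; it deliberately omits rule generation and the buffer-management and pruning techniques of the paper.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: candidate k-itemsets are built from frequent (k-1)-itemsets,
    exploiting the fact that no superset of an infrequent itemset can be frequent."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    support = lambda s: sum(s <= t for t in transactions)
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = list(current)
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
        result.extend(current)
    return result

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}]
print(frequent_itemsets(transactions, min_support=2))
```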
Full-text available
Article
There have been several recent advancements in the Machine Learning community on the Entity Matching (EM) problem. However, their lack of scalability has prevented them from being applied in practical settings on large real-life datasets. Towards this end, we propose a principled framework to scale any generic EM algorithm. Our technique consists of running multiple instances of the EM algorithm on small neighborhoods of the data and passing messages across neighborhoods to construct a global solution. We prove formal properties of our framework and experimentally demonstrate the effectiveness of our approach in scaling EM algorithms.
Full-text available
Article
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching, we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching to improve efficiency. Special attention is given to the number and size of data partitions as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
Full-text available
Conference Paper
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance
Full-text available
Article
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area
Full-text available
Article
Clustering techniques often define the similarity between instances using distance measures over the various dimensions of the data [12, 14]. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each instance described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the instances in a dataset to be nearly equidistant from each other, completely masking the clusters. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. This paper presents a survey of the various subspace clustering algorithms. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests.
Full-text available
Article
Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms.
Full-text available
Article
This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin. Keywords: computer matching, modeling, iterative fitting, string comparison, optimization
Full-text available
Article
Many applications of the Fellegi-Sunter model use simplifying assumptions and ad hoc modifications to improve matching efficacy. Because of model misspecification, distinctive approaches developed in one application typically cannot be used in other applications and do not always make use of advances in statistical and computational theory. An Expectation-Maximization (EMH) algorithm that constrains the estimates to a convex subregion of the parameter space is given. The EMH algorithm provides probability estimates that yield better decision rules than unconstrained estimates. The algorithm is related to results of Meng and Rubin (1993) on Multi-Cycle Expectation-Conditional Maximization algorithms and makes use of results of Haberman (1977) that hold for large classes of loglinear models. Key Words: MCECM Algorithm, Latent Class, Computer Matching, Error Rate
Article
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an analysis on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.
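To make the pairwise F1 measure mentioned above concrete, the following sketch computes pairwise precision, recall, and F1 from a predicted and a gold clustering; this is the standard textbook computation, not the GMD algorithm of the paper.

```python
from itertools import combinations

def cluster_pairs(clusters):
    """All unordered record pairs placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_f1(predicted, gold):
    """Pairwise precision/recall/F1 between an ER output and a gold standard."""
    pred, true = cluster_pairs(predicted), cluster_pairs(gold)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

predicted = [{"a", "b", "c"}, {"d"}]
gold = [{"a", "b"}, {"c", "d"}]
print(round(pairwise_f1(predicted, gold), 3))  # 0.4 for this toy example
```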
Article
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an extensive survey on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured to adjust two important parameters: sensitivity to error type and sensitivity to cluster size. This flexibility enables us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is also a special case of our GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties. As a result, both Variation of Information and pairwise F1 can be computed in linear time.
Article
Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of twelve variations of six indexing techniques. Their complexity is analysed, and their performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
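As a concrete example of one simple indexing variant in this family, the sketch below builds a q-gram inverted index over a single attribute, so that records sharing at least one q-gram fall into a common block; it is a generic illustration rather than one of the surveyed techniques in particular.

```python
from collections import defaultdict

def qgrams(value, q=3):
    """Character q-grams of a padded string, e.g. 'smith' -> {'##s', '#sm', 'smi', ...}."""
    padded = "#" * (q - 1) + value.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_index(records, attr, q=3):
    """Inverted index from q-grams of an attribute value to the records containing them.
    Records sharing at least one q-gram end up in a common block."""
    index = defaultdict(set)
    for r in records:
        for g in qgrams(r[attr], q):
            index[g].add(r["id"])
    return index

records = [{"id": 1, "surname": "Smith"}, {"id": 2, "surname": "Smyth"}, {"id": 3, "surname": "Jones"}]
index = qgram_index(records, "surname")
print([ids for ids in index.values() if len(ids) > 1])  # blocks with more than one record
```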
Article
A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched). A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison-pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as a link (A1), a non-link (A3), and a possible link (A2). The first two decisions are called positive dispositions. The two types of error are defined as the error of the decision A1 when the members of the comparison pair are in fact unmatched, and the error of the decision A3 when the members of the comparison pair are in fact matched. The probabilities of these errors are defined as μ = Σγ∈Γ u(γ)·P(A1|γ) and λ = Σγ∈Γ m(γ)·P(A3|γ), respectively, where u(γ) and m(γ) are the probabilities of realizing γ (a comparison vector whose components are the coded agreements and disagreements on each characteristic) for unmatched and matched record pairs, respectively. The summation is over the whole comparison space Γ of possible realizations. A linkage rule assigns probabilities P(A1|γ), P(A2|γ), and P(A3|γ) to each possible realization of γ ∈ Γ. An optimal linkage rule L(μ, λ, Γ) is defined for each value of (μ, λ) as the rule that minimizes P(A2) at those error levels. In other words, for fixed levels of error, the rule minimizes the probability of failing to make positive dispositions. A theorem describing the construction and properties of the optimal linkage rule and two corollaries to the theorem which make it a practical working tool are given.
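A small numerical sketch of the decision rule implied by this model may help: each comparison pattern is scored by the log likelihood ratio of its field agreements, and two thresholds separate links, possible links, and non-links. The per-field probabilities below are made-up illustration values; in practice they are estimated, for example via EM.

```python
import math

# Hypothetical agreement probabilities per compared field, for matched (m) and
# unmatched (u) record pairs; these numbers are purely illustrative.
m_prob = {"surname": 0.95, "zip": 0.90, "birth_year": 0.85}
u_prob = {"surname": 0.05, "zip": 0.10, "birth_year": 0.02}

def match_weight(agreement):
    """Sum of log likelihood ratios log(m/u) or log((1-m)/(1-u)) over the fields."""
    w = 0.0
    for field, agrees in agreement.items():
        if agrees:
            w += math.log(m_prob[field] / u_prob[field])
        else:
            w += math.log((1 - m_prob[field]) / (1 - u_prob[field]))
    return w

def decide(agreement, upper=4.0, lower=-4.0):
    """Link (A1), possible link (A2), or non-link (A3), given two thresholds."""
    w = match_weight(agreement)
    return "A1 link" if w >= upper else "A3 non-link" if w <= lower else "A2 possible link"

print(decide({"surname": True, "zip": True, "birth_year": False}))
```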
Article
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, the data from the sources may be extracted using information extraction techniques and so may yield erroneous data. Third, queries to the system may be posed with keywords rather than in a structured form. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we do not know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of probabilistic schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting. Finally, we consider using probabilistic mappings in the scenario of data exchange.
Conference Paper
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once---for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical.
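A minimal sketch of the canopy construction described above, with a cheap token-overlap distance standing in for the approximate measure (the loose and tight thresholds are illustrative):

```python
def cheap_distance(a, b):
    """Approximate distance: 1 minus the token overlap (Jaccard) of the two strings."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def canopies(items, loose=0.8, tight=0.3):
    """Greedy canopy construction: pick a remaining item as center, put every item
    within the loose threshold into its canopy, and remove items within the tight
    threshold from further consideration as centers. Canopies may overlap."""
    remaining = list(items)
    result = []
    while remaining:
        center = remaining[0]
        canopy = [x for x in items if cheap_distance(center, x) <= loose]
        result.append(canopy)
        remaining = [x for x in remaining if cheap_distance(center, x) > tight]
    return result

items = ["anna meier berlin", "anna meyer berlin", "wei zhang shanghai"]
print(canopies(items))
```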
Conference Paper
Mining maximal frequent itemsets is one of the most fundamental problems in data mining. In this paper we study the complexity-theoretic aspects of maximal frequent itemset mining, from the perspective of counting the number of solutions. We present the first formal proof that the problem of counting the number of distinct maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing strong theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard. This result is of particular interest since the associated decision problem of checking the existence of a maximal frequent itemset is in P. We also extend our complexity analysis to other similar data mining problems dealing with complex data structures, such as sequences, trees, and graphs, which have attracted intensive research interests in recent years. Normally, in these problems a partial order among frequent patterns can be defined in such a way as to preserve the downward closure property, with maximal frequent patterns being those without any successor with respect to this partial order. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the number of maximal frequent patterns are all either #P-complete or #P-hard.
Conference Paper
Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Based on this background, this paper proposes a fast and efficient method for linkage detection. The features of the proposed approach are: first, it exploits a suffix array structure that enables linkage detection using variable length n-grams. Second, it dynamically generates blocks of possibly associated records using ‘blocking keys’ extracted from already known reliable linkages. The results from our preliminary experiments where the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records, are also reported in the paper.
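As a rough, generic illustration of suffix-based blocking keys (not the suffix-array structure or the dynamic key extraction of the paper): every sufficiently long suffix of an attribute value becomes a block identifier, so values sharing a long tail end up in the same block.

```python
from collections import defaultdict

def suffixes(value, min_len=3):
    """All suffixes of the value that are at least min_len characters long."""
    v = value.lower()
    return {v[i:] for i in range(len(v) - min_len + 1)}

def suffix_blocks(records, attr, min_len=3):
    """Inverted index from suffixes of an attribute value to record ids."""
    blocks = defaultdict(set)
    for r in records:
        for s in suffixes(r[attr], min_len):
            blocks[s].add(r["id"])
    return blocks

records = [{"id": 1, "name": "katherina"}, {"id": 2, "name": "catherina"}, {"id": 3, "name": "wei"}]
blocks = suffix_blocks(records, "name")
print([ids for ids in blocks.values() if len(ids) > 1])  # records 1 and 2 share long suffixes
```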
Conference Paper
We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.
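Since maximal frequent itemsets are also at the heart of the MFIBlocks approach this page is about, a naive illustrative miner may be useful; the sketch below encodes each item's transaction set as an integer bit mask, enumerates frequent itemsets level by level, and keeps only those without a frequent proper superset. It is far from the optimized depth-first algorithm of the paper.

```python
from itertools import combinations
from functools import reduce

def maximal_frequent_itemsets(transactions, min_support):
    """Naive MFI mining for illustration: frequent itemsets are enumerated level by
    level, then only itemsets with no frequent proper superset (the maximal ones) are kept."""
    items = sorted({i for t in transactions for i in t})
    # Bit j of masks[i] is set iff transaction j contains item i.
    masks = {i: sum(1 << j for j, t in enumerate(transactions) if i in t) for i in items}
    full = (1 << len(transactions)) - 1

    def support(itemset):
        # Intersect the transaction bit sets of all items; popcount gives the support.
        covered = reduce(lambda acc, i: acc & masks[i], itemset, full)
        return bin(covered).count("1")

    frequent = []
    for k in range(1, len(items) + 1):
        level = [s for s in combinations(items, k) if support(s) >= min_support]
        if not level:
            break
        frequent.extend(level)
    return [s for s in frequent if not any(set(s) < set(t) for t in frequent)]

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
print(maximal_frequent_itemsets(transactions, min_support=2))  # ('a', 'b') and ('a', 'c')
```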
Conference Paper
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
Book
With the ever increasing volume of data, data quality problems abound. Multiple, yet different representations of the same real-world objects in data, duplicates, are one of the most intriguing data quality problems. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, catalogs are mailed multiple times to the same household, etc. Automatically detecting duplicates is difficult: First, duplicate representations are usually not identical but slightly differ in their values. Second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture examines closely the two main components to overcome these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to perform on very large volumes of data in search for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection. Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
Book
This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market-leading edition. This was the first (and is still the best and most popular) of its kind. It combines sound theory with truly practical applications to prepare students for real-world challenges in data mining. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases. The first and second editions also established themselves as the market leader for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia and other complex data. The book begins with a conceptual introduction followed by a comprehensive and state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability. The edition offers a comprehensive, practical look at the concepts and techniques needed to get the most out of real business data; updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning; scores of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects; and complete classroom support for instructors as well as bonus content available at the companion website.
Article
In this paper, we present a new record linkage approach that uses entity behavior to decide if potentially different entities are in fact the same. An entity's behavior is extracted from a transaction log that records the actions of this entity with respect to a given data source. The core of our approach is a technique that merges the behavior of two possibly matched entities and computes the gain in recognizing behavior patterns as their matching score. The idea is that if we obtain a well-recognized behavior after the merge, then most likely the original two behaviors belong to the same entity, as the behavior becomes more complete after the merge. We present the necessary algorithms to model entities' behavior and compute a matching score for them. To improve the computational efficiency of our approach, we precede the actual matching phase with a fast candidate generation that uses a "quick and dirty" matching method. Extensive experiments on real data show that our approach can significantly enhance record linkage quality while being practical for large transaction logs.
Article
In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
Book
Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.
Conference Paper
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Conference Paper
This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.