Preprint

Fast Record Linkage for Company Entities

Authors:

Abstract

Record linkage is an essential part of almost all real-world systems that consume data coming from different sources, structured and unstructured. Typically, no common key is available to connect the records, and massive data cleaning and data integration processes often have to be completed before any data analytics and further processing can be performed. Though record linkage is often seen as a somewhat tedious but necessary step, it can reveal valuable insights into the data at hand. These insights guide further analytic approaches over the data and support data visualization. In this work we focus on company entity matching, where company name, location and industry are taken into account. The matching is done on the fly to accommodate real-time processing of streamed data. Our contribution is a system that uses rule-based matching algorithms for scoring operations, which we extend with a machine learning approach to account for short company names. We propose an end-to-end, highly scalable, enterprise-grade system. Linkage time is greatly reduced by efficient decomposition of the search space using MinHash, and high linkage accuracy is reached by a thorough scoring process over the matching candidates. Based on two real-world ground-truth datasets, we show that our approach reaches a recall of 91%, compared to 86% for baseline approaches. These results are achieved while scaling linearly with the number of nodes used in the system.
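To make the MinHash-based decomposition of the search space concrete, the sketch below is a minimal illustration and not the authors' implementation; all function names and parameters are our assumptions. It shingles normalized company names into character 3-grams, computes MinHash signatures, and groups records into candidate blocks with LSH banding, so that only records sharing a block need to be scored in detail.

    import hashlib
    from collections import defaultdict

    def shingles(name, k=3):
        # Character k-grams of a lower-cased, whitespace-stripped company name.
        s = "".join(name.lower().split())
        return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

    def minhash_signature(tokens, num_perm=64):
        # One minimum per salted hash function, standing in for num_perm random permutations.
        return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16) for t in tokens)
                for i in range(num_perm)]

    def candidate_blocks(records, bands=16, rows=4):
        # LSH banding: records whose signatures agree on all rows of at least one band
        # fall into the same bucket and become matching candidates.
        buckets = defaultdict(set)
        for rid, name in records:
            sig = minhash_signature(shingles(name), num_perm=bands * rows)
            for b in range(bands):
                buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rid)
        return [ids for ids in buckets.values() if len(ids) > 1]

    records = [(1, "International Business Machines Corp."),
               (2, "Internat. Business Machines"),
               (3, "Acme Tooling GmbH")]
    print(candidate_blocks(records))  # similar names are likely to share a block

In the setting described by the abstract, the detailed rule-based and machine-learning scoring would then run only within such candidate blocks rather than over the full cross product of records.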

References

Article
Entity resolution (ER) is the problem of accurately identifying multiple, differing, and possibly contradicting representations of unique real-world entities in data. It is a challenging and fundamental task in data cleansing and data integration. In this work, we propose graph differential dependencies (GDDs) as an extension of the recently developed graph entity dependencies (which are formal constraints for graph data) to enable approximate matching of values. Furthermore, we investigate a special discovery of GDDs for ER by designing an algorithm for generating a non-redundant set of GDDs in labelled data. Then, we develop an effective ER technique, Certus, that employs the learned GDDs for improving the accuracy of ER results. We perform extensive empirical evaluation of our proposals on five real-world ER benchmark datasets and a proprietary database to test their effectiveness and efficiency. The results from the experiments show the discovery algorithm and Certus are efficient; and more importantly, GDDs significantly improve the precision of ER without considerable trade-off of recall.
Conference Paper
Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules each having significant coverage of the space of duplicates, thus leading to high recall in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.
Conference Paper
While named entity recognition is a much addressed research topic, recognizing companies in text is of particular difficulty. Company names are extremely heterogeneous in structure, a given company can be referenced in many different ways, their names include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of using the official company name, quite different colloquial names are frequently used by the general public. We present a machine learning (CRF) system that reliably recognizes organizations in German texts. In particular, we construct and employ various dictionaries, regular expressions, text context, and other techniques to improve the results. In our experiments we achieved a precision of 91.11% and a recall of 78.82%, showing significant improvement over related work. Using our system we were able to extract 263,846 company mentions from a corpus of 141,970 newspaper articles.
Article
Entity resolution is an important application in the field of data cleaning. Standard approaches like deterministic methods and probabilistic methods are generally used for this purpose. Many new approaches using single-layer perceptrons, crowdsourcing, etc. have been developed to improve the efficiency and also to reduce the time of entity resolution. The approaches used for this purpose also depend on the type of dataset, labeled or unlabeled. This paper presents a new method for labeled data which uses a single-layer convolutional neural network to perform entity resolution. It also describes how crowdsourcing can be used with the output of the convolutional neural network to further improve the accuracy of the approach while minimizing the cost of crowdsourcing. The paper also discusses the data pre-processing steps used for training the convolutional neural network. Finally, it describes the airplane sensor dataset which is used for demonstration of this approach and then shows the experimental results achieved using the convolutional neural network.
Conference Paper
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B), that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed-size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
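Restated in our own notation, for shingle sets S(A) and S(B) the two measures are simple set ratios; the following minimal sketch (with made-up shingle sets) computes both:

    def resemblance(a, b):
        # "Roughly the same": Jaccard similarity of the two shingle sets.
        return len(a & b) / len(a | b)

    def containment(a, b):
        # "Roughly contained": fraction of A's shingles that also occur in B.
        return len(a & b) / len(a)

    sa = {"acme", "acme gmbh", "gmbh"}
    sb = {"acme", "acme gmbh", "gmbh", "holding"}
    print(resemblance(sa, sb), containment(sa, sb))  # 0.75 1.0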
Article
Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was investigated using pairwise F-measure to determine quality. Results: Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability – although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall. Conclusions: Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.
Conference Paper
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information is published on the Web for human- and machine-consumption. We describe some emerging applications from the DBpedia community and show how website authors can facilitate DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data.
Conference Paper
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept – we present the solution to some of them and we list the rest as open problems.
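The practical consequence of this property is that, for a random π, Pr(min π(A) = min π(B)) equals the Jaccard resemblance of A and B, so averaging collisions over many independent hash functions gives an unbiased estimate. A rough illustration follows; salted hash functions stand in for random permutations, and all names are ours:

    import hashlib

    def minhash(s, salt):
        return min(int(hashlib.sha1(f"{salt}:{x}".encode()).hexdigest(), 16) for x in s)

    def estimated_jaccard(a, b, num_hashes=200):
        # Fraction of hash functions under which the two sets share their minimum element.
        return sum(minhash(a, i) == minhash(b, i) for i in range(num_hashes)) / num_hashes

    a = set("minwise independent permutations".split())
    b = set("minwise independent families of permutations".split())
    print(len(a & b) / len(a | b), estimated_jaccard(a, b))  # estimate is close to the exact 0.6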
Article
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services.
Article
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining have proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach are assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.
Conference Paper
Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
Article
This paper presents a new algorithm for calculating hash signatures of sets which can be directly used for Jaccard similarity estimation. The new approach is an improvement over the MinHash algorithm, because it has a better runtime behavior and the resulting signatures allow a more precise estimation of the Jaccard index.
Article
Minwise hashing is a fundamental and one of the most successful hashing algorithms in the literature. Recent advances based on the idea of densification have shown that it is possible to compute k minwise hashes of a vector with d nonzeros in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic improvement in the query complexity of traditional indexing algorithms based on minwise hashing. Unfortunately, the variance of the current densification techniques is unnecessarily high, which leads to significantly poor accuracy compared to vanilla minwise hashing, especially when the data is sparse. In this paper, we provide a novel densification scheme which relies on carefully tailored 2-universal hashes. We show that the proposed scheme is variance-optimal, and without losing the runtime efficiency, it is significantly more accurate than existing densification techniques. As a result, we obtain a significantly efficient hashing scheme which has the same variance and collision probability as minwise hashing. Experimental evaluations on real sparse and high-dimensional datasets validate our claims. We believe that given the significant advantages, our method will replace minwise hashing implementations in practice.
Article
In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations.
Book
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.
Article
An interesting relationship was demonstrated between rounding algorithms used for rounding fractional solutions of LPs and vector solutions of SDPs on the one hand, and the construction of locality-sensitive hash functions for interesting classes of objects on the other. Thus, rounding algorithms yielded new constructions of locality-sensitive hash functions that were not previously known. Conversely, locality-sensitive hash functions lead to rounding algorithms.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
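For reference, the linear-chain form of the model referred to here can be written as follows (our notation): the probability of a label sequence y given an observation sequence x is globally normalized, which is what avoids the per-state normalization responsible for the label bias of MEMMs.

    p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
    \qquad
    Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)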
Article
A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched). A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as a link (A1), a non-link (A3), and a possible link (A2). The first two decisions are called positive dispositions. The two types of error are defined as the error of the decision A1 when the members of the comparison pair are in fact unmatched, and the error of the decision A3 when the members of the comparison pair are in fact matched. The probabilities of these errors are defined as μ = Σ_{γ∈Γ} u(γ) P(A1|γ) and λ = Σ_{γ∈Γ} m(γ) P(A3|γ), respectively, where u(γ) and m(γ) are the probabilities of realizing γ (a comparison vector whose components are the coded agreements and disagreements on each characteristic) for unmatched and matched record pairs respectively. The summation is over the whole comparison space Γ of possible realizations. A linkage rule assigns probabilities P(A1|γ), P(A2|γ), and P(A3|γ) to each possible realization of γ ∈ Γ. An optimal linkage rule L(μ, λ, Γ) is defined for each value of (μ, λ) as the rule that minimizes P(A2) at those error levels. In other words, for fixed levels of error, the rule minimizes the probability of failing to make positive dispositions. A theorem describing the construction and properties of the optimal linkage rule and two corollaries to the theorem which make it a practical working tool are given.
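Applied to record pairs, this model is typically used by scoring each comparison vector γ with the log-likelihood ratio of its agreement pattern and thresholding the total weight. A minimal sketch follows, with invented field-level m- and u-probabilities used purely for illustration:

    import math

    # Illustrative probabilities: P(field agrees | matched pair) and P(field agrees | unmatched pair).
    M = {"name": 0.95, "city": 0.90, "industry": 0.80}
    U = {"name": 0.01, "city": 0.10, "industry": 0.25}

    def match_weight(gamma):
        # Sum of log2 likelihood ratios over the agreement pattern gamma (field -> True/False).
        w = 0.0
        for field, agrees in gamma.items():
            m, u = M[field], U[field]
            w += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
        return w

    def decide(gamma, upper=6.0, lower=0.0):
        # The thresholds correspond to the stipulated error levels (mu, lambda).
        w = match_weight(gamma)
        return "A1 (link)" if w >= upper else ("A3 (non-link)" if w <= lower else "A2 (possible link)")

    print(decide({"name": True, "city": True, "industry": False}))  # A1 (link)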
Conference Paper
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
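A compressed sketch of the fingerprinting side (Charikar-style SimHash plus the Hamming-distance test) is shown below; the paper's actual contribution, efficiently finding all stored f-bit fingerprints within k bits of a query, is not reproduced here, and all names are ours:

    import hashlib

    def simhash(tokens, bits=64):
        # Per-bit weighted vote over token hashes; the sign pattern is the fingerprint.
        v = [0] * bits
        for t in tokens:
            h = int(hashlib.md5(t.encode()).hexdigest(), 16)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    page_a = "fast record linkage for company entities version one".split()
    page_b = "fast record linkage for company entities version two".split()
    # Near-duplicate pages typically differ in only a few of the 64 bit positions (small k).
    print(hamming(simhash(page_a), simhash(page_b)))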
Article
The nearest neighbor problem is the following: given a set of n points P = {p_1, ..., p_n} in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some ℓ_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: find a point p ∈ P that is an ε-approximate nearest neighbor of the query q in that for all p′ ∈ P, d(p, q) ≤ (1 + ε) d(p′, q). We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly ...
L. Barbosa, V. Crescenzi, X. L. Dong, P. Merialdo, F. Piai, D. Qiu, Y. Shen, and D. Srivastava. Big data integration for product specifications. IEEE Data Eng. Bull., 41(2):71-81, 2018.
L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. Proceedings of the VLDB Endowment, 5:2018-2019, 08 2012.
A. Horvath. MurmurHash3, an ultra fast hash algorithm for C# / .NET. http://blog.teamleadnet.com/2012/08/murmurhash3-ultra-fast-hash-algorithm.html. Accessed: 2019-05-28.
J. Inman. Navigation and Nautical Astronomy: For the Use of British Seamen (3 ed.). London, UK: W. Woodward, C. & J. Rivington, 1835.
P. Konda, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra, S. Das, P. Suganthan G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, and H. Zhang. Magellan: toward building entity matching management systems over data science stacks. Proceedings of the VLDB Endowment, 9:1581-1584, 09 2016.
K. Mirylenka, P. Scotton, C. Miksovic, and A. Schade. Similarity matching system for record linkage. US Patent Application P201704804US01, 2018.
M. S. Charikar (Google). Methods and apparatus for estimating similarity.
S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similarity sketching, 2017.