Article

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Authors: Peter Christen

Abstract

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of twelve variations of six indexing techniques. Their complexity is analysed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
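As a concrete illustration of the indexing idea summarised above, the following Python sketch implements standard blocking: records are grouped under a blocking key and candidate pairs are generated only within each block. This is a generic, minimal sketch with hypothetical records and an arbitrary choice of key, not code from the surveyed paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical example records: (record_id, surname, postcode).
records = [
    (1, "smith", "2600"),
    (2, "smyth", "2600"),
    (3, "miller", "2600"),
    (4, "smith", "2601"),
]

def blocking_key(record):
    # Illustrative blocking key: first letter of the surname plus the postcode.
    _, surname, postcode = record
    return surname[0] + postcode

# Standard blocking: an inverted index from blocking key to record identifiers.
index = defaultdict(list)
for record in records:
    index[blocking_key(record)].append(record[0])

# Candidate pairs are formed only within blocks, instead of all n*(n-1)/2 pairs.
candidate_pairs = sorted(
    pair for block in index.values() for pair in combinations(block, 2)
)
print(candidate_pairs)  # [(1, 2)] -- only the two 's'/'2600' records are compared
```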


... Q-grams (also called 'n-grams'; see [26]) are short character sub-strings (of length q) of the record strings [27]. These methods have been applied in various forms, for example in spelling correction or with inverted indexes [28][29][30]. The Q-gram method is a popular indexing technique in enterprise and web search engines, including those operated by the largest technology companies. ...
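As a minimal sketch of the q-gram idea described in this excerpt (the value of q, the padding character and the example strings are illustrative choices, not prescribed by the cited works):

```python
def qgrams(s, q=2, pad=True):
    """Return the character q-grams of s, optionally padded at both ends."""
    if pad:
        # Padding with q-1 special characters emphasises the start and end of the string.
        s = "#" * (q - 1) + s + "#" * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("peter"))            # ['#p', 'pe', 'et', 'te', 'er', 'r#']
print(qgrams("peter", pad=False)) # ['pe', 'et', 'te', 'er']
```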
... In the numerical study, we will work with datasets commonly employed for evaluating the effectiveness of approximate matching algorithms, including those analyzed in studies such as [2,4,15,28]. These datasets provide a benchmark for comparing algorithm performance across various domains. ...
... However, in large storage situations with millions of records, such time complexity makes the algorithm unusable for real-time approximate string matching and search. A filter, also called a blocking technique, with much lower time complexity can be developed to discard many records up front and significantly reduce the large comparison space [28][29][30]. Practically all methods in use are sub-optimal filters which lose true-positive candidates during filtering. The reason we observe for this is that these methods are not unified in an explicit mathematical framework. ...
Article
Full-text available
In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with an approximately constant time complexity of O(1). In the second stage, FRMS runs in polynomial time of approximately O(n⁴) and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search with real-time performance.
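The second stage described above relies on a maximum-weight matching between the tokens of two records, viewed as a bipartite graph. The sketch below shows the general idea using SciPy's assignment solver; the token similarity (Jaccard over character bigrams), the normalisation and the example records are placeholders, not the FRMS definitions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bigrams(token):
    return {token[i:i + 2] for i in range(len(token) - 1)}

def token_sim(a, b):
    # Placeholder token similarity: Jaccard coefficient over character bigrams.
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def record_similarity(tokens_a, tokens_b):
    # Maximum-weight matching between the two token sets (bipartite graph).
    sim = np.array([[token_sim(a, b) for b in tokens_b] for a in tokens_a])
    rows, cols = linear_sum_assignment(-sim)   # negate to maximise total similarity
    return sim[rows, cols].sum() / max(len(tokens_a), len(tokens_b))

print(record_similarity(["john", "smith"], ["smith", "jon"]))  # 0.625
```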
... The traditional pairwise entity resolution method [2] classifies record pairs based on similarities calculated by comparing semantically corresponding attributes or derived embeddings obtained from pretrained language models [15] between two data sources. Filtering or blocking techniques [3,23] mitigate the time complexity associated with pairwise comparisons, significantly reducing the number of candidate pairs. For each candidate pair, a set of similarity measures is computed using string similarity metrics and numerical distance functions. ...
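For illustration only (the attributes, similarity functions and records below are assumptions, not taken from the cited work), a comparison vector for one candidate pair could be computed as follows before being passed to a classifier:

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    # Simple normalised string similarity; edit-distance or Jaro-Winkler
    # based measures are common alternatives.
    return SequenceMatcher(None, a, b).ratio()

def numeric_sim(a, b, max_diff):
    # Distance-based similarity for numerical attributes.
    return max(0.0, 1.0 - abs(a - b) / max_diff)

# Hypothetical candidate pair of records.
rec_a = {"name": "Jane Smith", "city": "Sydney", "year_of_birth": 1984}
rec_b = {"name": "Jane Smyth", "city": "Sydny", "year_of_birth": 1985}

comparison_vector = [
    string_sim(rec_a["name"], rec_b["name"]),
    string_sim(rec_a["city"], rec_b["city"]),
    numeric_sim(rec_a["year_of_birth"], rec_b["year_of_birth"], max_diff=10),
]
print(comparison_vector)  # roughly [0.9, 0.91, 0.9] -- input features for the classifier
```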
... Method Parameters: We use the following configuration summarized in Tab. 3 where the values written in bold are default values. We split the available linkage problems into 50% solved linkage problems LP I and consequently 50% unsolved linkage problems LP U for the Dexter data set. ...
... In contrast to an active learning method, TransER is designed to reuse existing training data for new linkage problems, so we treat all linkage problems LP I among integrated data sources as training data. 3 https://github.com/nishadi/TransER TransER's parameter values were set at = 10, = 0.9, = 0.9, and = 0.9, as determined through pre-experimental tuning. ...
Preprint
Full-text available
Entity resolution is essential for data integration, facilitating analytics and insights from complex systems. Multi-source and incremental entity resolution address the challenges of integrating diverse and dynamic data, which is common in real-world scenarios. A critical question is how to classify matches and non-matches among record pairs from new and existing data sources. Traditional threshold-based methods often yield lower quality than machine learning (ML) approaches, while incremental methods may lack stability depending on the order in which new data is integrated. Additionally, reusing training data and existing models for new data sources is unresolved for multi-source entity resolution. Even the approach of transfer learning does not consider the challenge of which source domain should be used to transfer model and training data information for a certain target domain. Naive strategies for training new models for each new linkage problem are inefficient. This work addresses these challenges and focuses on creating as well as managing models with a small labeling effort and the selection of suitable models for new data sources based on feature distributions. The results of our method StoRe demonstrate that our approach achieves comparable qualitative results. Regarding efficiency, StoRe outperforms both a multi-source active learning and a transfer learning approach, running up to 48 times faster than the active learning approach and 163 times faster than the transfer learning method.
... Then given a data set of M records aggregated from many data sources with possibly numerous duplicated entities perturbed by noise, the task of entity resolution is to identify and remove the duplicate entities. For a review of entity resolution see (Winkler, 2006;Christen, 2012;Liseo and Tancredi, 2013). ...
... Due to this (and to the best of our knowledge), accurate entity resolution algorithms scale quadratically or worse (O(M²) or higher), making them computationally intractable for large data sets. Reducing the computational cost in entity resolution is known as blocking, which, via deterministic or probabilistic algorithms, places similar records into blocks or bins (Christen, 2012; Steorts et al., 2014). The computational efficiency comes at the cost of missed links and reduced accuracy for entity resolution. ...
... Instead, we rely on the observation that record pairs with high similarity have a higher chance of being duplicate records. That is, we assume that when two entities R_i and R_j are similar in their attributes, it is more likely that they refer to the same entity (Christen, 2012). We note that this probabilistic observation is the weakest possible assumption, and almost always true for entity resolution tasks, because linking records by a similarity score is one simple way of approaching entity resolution (Christen, 2012; Winkler, 2006; Fellegi and Sunter, 1969). ...
Preprint
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of 191,874 ± 1,772 documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
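As a hedged illustration of how locality sensitive hashing can avoid all-to-all comparisons (a generic MinHash/banding sketch, not the estimator proposed in the preprint; all parameters and records are illustrative):

```python
import random
from collections import defaultdict

random.seed(42)

def shingles(text, k=3):
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

NUM_HASHES, BANDS = 24, 8                # 8 bands of 3 rows each
ROWS = NUM_HASHES // BANDS
salts = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def minhash_signature(shingle_set):
    # One min-hash value per salted hash function.
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

records = ["Jane Smith, Sydney", "Jane Smyth, Sydney", "Bob Jones, Perth"]

buckets = defaultdict(set)
for rid, rec in enumerate(records):
    sig = minhash_signature(shingles(rec))
    for b in range(BANDS):
        band = tuple(sig[b * ROWS:(b + 1) * ROWS])
        buckets[(b, band)].add(rid)      # records colliding in any band become candidates

candidate_pairs = {(i, j)
                   for ids in buckets.values() if len(ids) > 1
                   for i in ids for j in ids if i < j}
print(candidate_pairs)  # with high probability {(0, 1)}: only the similar pair collides
```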
... Over the years, a wide range of blocking techniques has been developed, from simple heuristic-based methods to advanced approaches involving deep learning (for surveys, see [21], [22], [8]). These methods can be categorized in various ways, each providing a different perspective on their function and applications. ...
... Various metrics have been used to evaluate blocking quality, but three are most commonly considered comprehensive: RR, PQ, and PC [27], [28], [22]. RR measures how much a blocking method reduces the total number of comparisons, PQ represents the percentage of candidate pairs that are true matches after blocking, and PC indicates the percentage of true matches present in the candidate set after blocking. ...
... Several studies also considered runtime as a main evaluation measure in blocking [33], [34], [25]. There is also research that employed the harmonic mean of PC and RR [35], [36], [37], [38], [39] or the harmonic mean of PC and PQ [22]. We will formally define these metrics in Section III and discuss which measures provide a more meaningful way to evaluate the blocking methods considered in Section V. ...
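To make these measures concrete, here is a minimal sketch of how RR, PC and PQ (and the two harmonic means mentioned above) can be computed from a candidate pair set and a gold standard; the toy inputs are hypothetical.

```python
def blocking_metrics(candidate_pairs, true_matches, num_records):
    """Reduction ratio (RR), pairs completeness (PC) and pairs quality (PQ)."""
    total_pairs = num_records * (num_records - 1) // 2    # all possible record pairs
    candidates = {tuple(sorted(p)) for p in candidate_pairs}
    matches = {tuple(sorted(p)) for p in true_matches}
    found = len(candidates & matches)

    rr = 1.0 - len(candidates) / total_pairs              # fraction of comparisons saved
    pc = found / len(matches)                              # recall of the true matches
    pq = found / len(candidates)                           # precision of the candidates

    harmonic = lambda x, y: 2 * x * y / (x + y) if x + y else 0.0
    return {"RR": rr, "PC": pc, "PQ": pq,
            "F(PC,RR)": harmonic(pc, rr), "F(PC,PQ)": harmonic(pc, pq)}

print(blocking_metrics(candidate_pairs={(1, 2), (3, 4), (5, 6)},
                       true_matches={(1, 2), (7, 8)},
                       num_records=10))
```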
Preprint
Full-text available
Entity Matching (EM) is crucial for identifying equivalent data entities across different sources, a task that becomes increasingly challenging with the growth and heterogeneity of data. Blocking techniques, which reduce the computational complexity of EM, play a vital role in making this process scalable. Despite advancements in blocking methods, the issue of fairness, where blocking may inadvertently favor certain demographic groups, has been largely overlooked. This study extends traditional blocking metrics to incorporate fairness, providing a framework for assessing bias in blocking techniques. Through experimental analysis, we evaluate the effectiveness and fairness of various blocking methods, offering insights into their potential biases. Our findings highlight the importance of considering fairness in EM, particularly in the blocking phase, to ensure equitable outcomes in data integration tasks.
... Most filtering techniques in the literature are crafted for textual attribute values and are based on heuristics, operating in a learning-free manner that requires no labelled instances for training [3,5,8]. The reason is that labelled data are rarely available and too expensive for such a coarse-grained process [9]. ...
... 1. Blocking workflows transform every input entity profile into a set of representative signatures that facilitates the efficient creation of clusters [3,8,10]. 2. Nearest neighbor (NN) workflows transform every input entity into a sparse [11,12] or dense numeric vector [13]. These vectors are indexed through an efficient data structure, which is queried by part or all of the entity profiles so as to yield as candidates the most similar vectors/entities. ...
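A minimal sketch of the second (nearest-neighbour) kind of workflow, using scikit-learn to index sparse TF-IDF vectors and query them; the entity profiles, the character n-gram range and the number of neighbours are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical entity profiles, concatenated into one sentence each (schema-agnostic).
profiles = [
    "jane smith sydney 1984",
    "jane smyth sydny 1985",
    "bob jones perth 1972",
    "robert jones perth 1972",
]

# Sparse numeric vectors built from character n-grams.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(profiles)

# Index the vectors and query every profile for its nearest neighbours.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, neighbours = index.kneighbors(vectors)

for i, (dist, nbrs) in enumerate(zip(distances, neighbours)):
    # The first neighbour is the profile itself; the second is its top candidate match.
    print(profiles[i], "->", profiles[nbrs[1]], round(1 - dist[1], 3))
```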
... Nevertheless, no prior work has investigated their relative performance. The past experimental analyses examine each type independently of the other: blocking workflows are tested in [8,10,14], whereas the sparse [11,12,15] and the dense vector-based NN workflows [13] have only been evaluated with respect to run-time and approximation quality, which is not related to their ER performance (cf. Sect. ...
Article
Full-text available
Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To cover this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
... As presented in Fig. 2 and Fig. 5, these two diversity criteria, i.e. diversity in safety categories and data content, allow us initially to collect 529,816 harmful instruction samples from 18 sources spanning all eight safety categories (a full description of these sources is provided in Tab. 4). Recognizing the presence of significant redundancy in the raw data, we apply three standard deduplication techniques - n-gram matching (Lin, 2004), cosine similarity on TF-IDF vectors (Christen, 2011), and sentence embedding similarity (Reimers & Gurevych, 2019) - to remove duplicate or near-identical samples. This refinement process results in a final dataset comprising 40,961 unique harmful instructions. ...
... To ensure data quality and reduce redundancy, we applied a multi-step filtering pipeline consisting of n-gram matching (Lin, 2004), TF-IDF cosine similarity (Christen, 2011), and sentence embedding similarity (Reimers & Gurevych, 2019). Below, we provide details on the specific thresholds and procedures used in each step. ...
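As a toy sketch of the first of these deduplication steps, word n-gram overlap with a Jaccard threshold can drop near-identical samples; the threshold, tokenisation and example texts below are illustrative and do not reproduce the cited pipeline's settings.

```python
def word_ngrams(text, n=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=3):
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

samples = [
    "please summarize the main findings of the attached research paper in three sentences",
    "please summarize the main findings of the attached research paper in two sentences",
    "explain the concept of photosynthesis to a primary school student",
]

THRESHOLD = 0.5
kept = []
for sample in samples:
    # Keep a sample only if it is not a near-duplicate of an already kept one.
    if all(ngram_jaccard(sample, k) < THRESHOLD for k in kept):
        kept.append(sample)
print(kept)  # the second sample is dropped as a near-duplicate of the first
```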
Preprint
Full-text available
This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
... Indeed, its quadratic time complexity means every possible entity pair must be compared in the worst case. Traditional solutions like blocking [1,2,5,6] and nearest-neighbor [4,7,8] techniques reduce comparisons by filtering unlikely matches [3]. However, these methods often struggle with graph data, where entity relationships provide crucial context for identification. ...
... Finally, to our knowledge, the only two existing on-demand ER solutions in the literature [10,14] are designed for relational data and are thus ill-equipped to handle Requirement 1. We present an on-demand ER framework for property graphs, FastER, that meets all six (6) requirements. Contributions: We present a graph differential dependencies (GDDs) [20] based framework for on-demand entity resolution in property graphs. ...
Preprint
Full-text available
Entity resolution (ER) is the problem of identifying and linking database records that refer to the same real-world entity. Traditional ER methods use batch processing, which becomes impractical with growing data volumes due to high computational costs and lack of real-time capabilities. In many applications, users need to resolve entities for only a small portion of their data, making full data processing unnecessary -- a scenario known as "ER-on-demand". This paper proposes FastER, an efficient ER-on-demand framework for property graphs. Our approach uses graph differential dependencies (GDDs) as a knowledge encoding language to design effective filtering mechanisms that leverage both structural and attribute semantics of graphs. We construct a blocking graph from filtered subgraphs to reduce the number of candidate entity pairs requiring comparison. Additionally, FastER incorporates Progressive Profile Scheduling (PPS), allowing the system to incrementally produce results throughout the resolution process. Extensive evaluations on multiple benchmark datasets demonstrate that FastER significantly outperforms state-of-the-art ER methods in computational efficiency and real-time processing for on-demand tasks while ensuring reliability. We make FastER publicly available at: https://anonymous.4open.science/r/On_Demand_Entity_Resolution-9DFB
... See e.g. Steorts et al. (2014); Christen (2012b) and Baxter et al. (2003) for reviews. We reserve discussion of these for Section 6. ...
... We have focused on three simple but extremely common indexing methods: blocking, indexing by disjunctions, and filtering. A range of other techniques for indexing exist, including sophisticated approaches based on various hashing algorithms (Christen, 2012b;Steorts et al., 2014). These are perhaps best understood as fast approximations to filtering, and our discussion here applies more or less directly (especially if these indexing methods are followed by a subsequent filtering step to remove unlikely pairs that escape indexing). ...
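A minimal sketch of indexing by a disjunction of blocking keys (the union of the candidate pairs produced by each key), followed by a simple filtering step; the records, keys and filter tolerance are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {"surname": "smith", "postcode": "2600", "birth_year": 1984},
    2: {"surname": "smyth", "postcode": "2600", "birth_year": 1984},
    3: {"surname": "smith", "postcode": "2611", "birth_year": 1984},
    4: {"surname": "jones", "postcode": "2600", "birth_year": 1955},
}

blocking_keys = [
    lambda r: r["surname"][:3],   # first blocking key: surname prefix
    lambda r: r["postcode"],      # second blocking key: postcode
]

# Disjunction: a pair is a candidate if ANY key places both records in the same block.
candidates = set()
for key in blocking_keys:
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    for block in blocks.values():
        candidates.update(combinations(sorted(block), 2))

# Filtering: drop candidate pairs whose birth years disagree by more than one year.
filtered = {(a, b) for a, b in candidates
            if abs(records[a]["birth_year"] - records[b]["birth_year"]) <= 1}
print(sorted(candidates))  # union of the per-key candidate pairs
print(sorted(filtered))    # pairs that survive the filtering step
```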
Preprint
Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of methods have been developed for this task. These methods go under a variety of names, including indexing and blocking, and have seen significant development. However, methods for inferring linkage structure that account for indexing, blocking, and additional filtering steps have not seen commensurate development. In this paper we review the implications of indexing, blocking and filtering within the popular Fellegi-Sunter framework, and propose a new model to account for particular forms of indexing and filtering.
... Datasets for Blocking Scalability. The datasets in Table 3(b) are common in the literature for assessing the scalability of blocking methods [73][74][75]. They formulate Dirty ER or Deduplication tasks, where the goal is to detect matches between all possible pairs of entities [3,7]. ...
... To assess the relative blocking performance of language models, we estimate the recall of the candidate pairs they yield, a measure also known as pairs completeness [74,75,78]. This constitutes the most critical aspect of blocking, because it sets the upper bound for the subsequent matching step-missed duplicates cannot be detected by matching unless complex and time-consuming iterative algorithms are employed [4,5]. ...
Article
Full-text available
Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used both for blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being most popular. However, there is no detailed analysis of their strengths and weaknesses. We cover this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models and we further justify their relative performance by looking into the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques in the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out both in schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
... Record deduplication, a critical process in data management, refers to the identification and removal of duplicate records in databases. This process is essential for maintaining data quality and integrity, especially in large databases where duplicate records can lead to inconsistent, misleading, or erroneous data analysis [1,4]. ...
... Recent advances in machine learning and natural language processing have opened up new opportunities for more sophisticated deduplication strategies [1,8,9]. These approaches use complex algorithms to identify duplicates with higher accuracy, even in data sets with high variability and noise. ...
Chapter
Full-text available
Record linkage is the process of matching records from multiple data sources that refer to the same entities. When applied to a single data source, this process is known as deduplication. With the increasing size of data sources, recently referred to as big data, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent decades, several blocking, indexing and filtering techniques have been developed. Their purpose is to reduce the number of record pairs to be compared by removing obvious non-matching pairs in the deduplication process, while maintaining high quality of matching. Currently developed algorithms and traditional techniques are not efficient, as they still lose a significant proportion of true matches when removing comparison pairs. This paper proposes more efficient algorithms for removing non-matching pairs, with an explicitly proven mathematical lower bound on the recently used state-of-the-art approximate string matching method, Fuzzy Jaccard Similarity. The algorithm is also much more efficient in classification, using density-based spatial clustering of applications with noise (DBSCAN) in log-linear time complexity O(|E| log(|E|)).
... Several blocking and indexing techniques (Christen, 2012) have been proposed for ER, and the usage of a single blocking key is a dominant approach in finding pairs of references that are potential matches (Papadakis et al., 2020). However, certain pairs that are supposed to be matched do not get matched because such pairs did not end up in the same block, hence reducing the number of complete pairs. ...
... This increases the computational time and places a heavy workload on computing resources. To solve this problem and reduce the computational complexity, blocking is used in ER to group references having similar characteristics into the same block before comparing them (Christen, 2012). Blocking is a form of rough matching before the main similarity comparison and linking process in ER and is useful, especially when processing larger volumes of data sets. ...
Article
Full-text available
Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual processes and making them as unsupervised as possible. An additional challenge is to make these unsupervised processes scalable to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many different data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computation effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is an already proposed unsupervised ER system which clusters references from diverse data sources. This work aims at solving the single-threaded nature of the DWM by adopting the parallel processing model of Hadoop MapReduce. However, the proposed parallelization method can be applied to both supervised systems, where matching rules are created by experts, and unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java are not scalable beyond a few thousand records and rely on large, shared memory. The objective of this research is to solve the two major shortcomings of the current design of the DWM, namely the creation and usage of shared memory and the lack of scalability, by leveraging the power of Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is displayed using publicly available ER datasets. Based on results from our experiment, we conclude that HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
... Christen [29] presented a study of indexing schemes for record linkage and deduplication in different databases and analyzed their performance in terms of scalability and matching quality. Xia et al. [30] discussed key features and background of deduplication with current industry trends and publicly available sources for deduplication-oriented research and studies. ...
... Problems related to cross-user deduplication still face challenges in designing chunking schemes. Similarly, different fingerprint indexing [29] schemes, using various tree-based [85] and hash-based [86] data structures, are used to index unique incoming data at a new index and to discard duplicates that already exist at some index in the database. Redundancy elimination can also be performed based on data locality, including temporal locality, based upon time of reference [87], and spatial locality of data. Deduplication (secure and non-secure) has been envisioned widely by researchers in the context of the cloud computing environment, but it is still in its infancy in the IoT environment, where heterogeneous data is expanding exponentially due to data-intensive applications using diverse sensing devices. ...
Article
Full-text available
The Internet of Things (IoT), an enlarged Internet-based network, is a key element of the next information technology revolution. With the evolution of numerous IoT-based smart applications like smart city and healthcare, huge amounts of heterogeneous data, called big data, with varying volume, velocity and variety are accumulating on various storage systems. Data redundancy is a serious problem that wastes a lot of storage capacity and network bandwidth and weakens data security in setups that blend cloud-integrated IoT data storage. Data deduplication/redundancy elimination strategies can effectively decrease and control this issue by removing duplicate data in cloud-integrated IoT storage systems. Security and privacy of data is another major concern. To use storage effectively and securely, with maintained data integrity and confidentiality, minimal storage cost and increased storage utilization, data deduplication (DD) over encrypted data is also a key problem in cloud-integrated IoT storage and computing environments. Given the dynamic nature of big data, the majority of current data deduplication techniques, which primarily target static settings such as backup and archive systems, are not applicable to real-time scenarios in the IoT environment. To address the aforementioned issues, this survey presents an analysis of the literature on conventional deduplication techniques. It highlights the need for deduplication in IoT-oriented big data, its parameters and properties of effectiveness, with a taxonomy for secure and conventional deduplication and the scope of implementing it with new technologies such as blockchain. Further, it elaborates on issues and challenges along with the future scope of deduplication schemes in various application domains.
... Previous research [18][19][20] has predominantly utilised contextual information, including local contexts of entity mentions and document-level coherence of referenced entities, for disambiguation. On the other hand, entity matching, entity resolution, or record linkage involves matching records from different relational tables that refer to the same entities [21][22][23][24]. When applied within the same relational table, it is referred to as deduplication. ...
... The matching process involves comparing attribute values using specific similarity measures and aggregating comparison results across all attributes. To reduce the number of record pairs to compare, indexing or blocking techniques are commonly employed to filter out obvious non-matching pairs [22]. ...
Article
Full-text available
Entity alignment plays an essential role in the integration of knowledge graphs (KGs) as it seeks to identify entities that refer to the same real-world objects across different KGs. Recent research has primarily centred on embedding-based approaches. Among these approaches, there is a growing interest in graph neural networks (GNNs) due to their ability to capture complex relationships and incorporate node attributes within KGs. Despite the presence of several surveys in this area, they often lack comprehensive investigations specifically targeting GNN-based approaches. Moreover, they tend to evaluate overall performance without analysing the impact of individual components and methods. To bridge these gaps, this paper presents a framework for GNN-based entity alignment that captures the key characteristics of these approaches. We conduct a fine-grained analysis of individual components and assess their influences on alignment results. Our findings highlight specific module options that significantly affect the alignment outcomes. By carefully selecting suitable methods for combination, even basic GNN networks can achieve competitive alignment results.
... Recently, several approaches were published that provide speed and scalability. Indexing methods are surveyed in [5]. Blocking methods are described in [23] in detail. ...
... Access to per-file metadata labels enables flexible dataset splits for generative and MIR tasks. A central concern of ours was entity resolution (ER) (Christen, 2011), i.e., identifying the compositional source of each recording and using this information to mitigate overrepresentation of popular pieces. ...
Preprint
Full-text available
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
... This POC is being developed at the University of Arkansas at Little Rock (UALR) and has been named Data Washing Machine (DWM). The DWM employs frequency-based blocking, a multi-token scoring matrix for ER matching, and entropy-based quality evaluation of clustering [19][20][21][22]. Modules were added to implement 18 context-aware correction methods required for this research. ...
... For example, in census studies datafiles are often partitioned according to ZIP Codes (postal codes) and then only records sharing the same ZIP Code are attempted to be linked, that is, pairs of records with different ZIP Codes are assumed to be non-matches (Herzog et al., 2007). Blocking can be used with any record linkage approach and there are different variations (see Christen, 2012, for an extensive survey). Our presentation in this paper assumes that no blocking is being used, but in practice if blocking is needed the methodologies can simply be applied independently to each block. ...
Preprint
The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador.
... An alternative is to exploit prior knowledge on the kinds of errors that would be unlikely for a certain field, thereby declaring as non-coreferent any record pair that disagrees more than a predefined amount in that field. There also exist other more sophisticated techniques to detect sets of non-coreferent pairs, which are extensively surveyed by Christen (2012). ...
Preprint
Multiple-systems or capture-recapture estimation are common techniques for population size estimation, particularly in the quantitative study of human rights violations. These methods rely on multiple samples from the population, along with the information of which individuals appear in which samples. The goal of record linkage techniques is to identify unique individuals across samples based on the information collected on them. Linkage decisions are subject to uncertainty when such information contains errors and missingness, and when different individuals have very similar characteristics. Uncertainty in the linkage should be propagated into the stage of population size estimation. We propose an approach called linkage-averaging to propagate linkage uncertainty, as quantified by some Bayesian record linkage methodologies, into a subsequent stage of population size estimation. Linkage-averaging is a two-stage approach in which the results from the record linkage stage are fed into the population size estimation stage. We show that under some conditions the results of this approach correspond to those of a proper Bayesian joint model for both record linkage and population size estimation. The two-stage nature of linkage-averaging allows us to combine different record linkage models with different capture-recapture models, which facilitates model exploration. We present a case study from the Salvadoran civil war, where we are interested in estimating the total number of civilian killings using lists of witnesses' reports collected by different organizations. These lists contain duplicates, typographical and spelling errors, missingness, and other inaccuracies that lead to uncertainty in the linkage. We show how linkage-averaging can be used for transferring the uncertainty in the linkage of these lists into different models for population size estimation.
... Such methods can be seen as a preprocessing step which identifies records which are not likely to be duplicates, such that the pairwise feature similarity only needs to be computed for those features that co-appear in likely duplicates. A survey of various such indexing methods is given in [15]. We did not include an indexing step in our experiments in this paper, so that our experiments are run without excluding any record pairings a priori, but such a step can be incorporated into our method. ... Pay-as-you-go [67] or progressive duplicate detection methods [52,34] have been developed for applications in which the duplicate detection has to happen in limited time on data which is acquired in small batches or in (almost) real-time [41]. ...
Preprint
We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different, despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into "unique entities", and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups typically works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
... This process is computationally expensive and often inefficient when dealing with big data. To solve this problem and reduce the computational complexity, record blocking (Christen, 2012; Papadakis et al., 2015, 2020) is used to group references based on a common blocking key, and only references in a particular block are compared for similarity. ...
Article
Full-text available
Data volume has been one of the fast-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our system achieves the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.
... The dataset used is the North Carolina voter registration list (NCVR), which contains publicly available real information. The experiment was analyzed from the following four aspects: runtime, recall, precision, and F-measure [37]. The runtime reflects the efficiency of the method in processing large amounts of data and is a key standard for measuring the scalability of the method. ...
Article
Full-text available
With the world’s data volume growing exponentially, it becomes critical to link it and make decisions. Privacy-preserving record linkage (PPRL) aims to identify all the record information corresponding to the same entity from multiple data sources, without disclosing sensitive information. Previous works on multi-party PPRL methods typically adopt homomorphic encryption technology due to its ability to perform computations on encrypted data without needing to decrypt it first, thus maintaining data confidentiality. However, these methods have notable shortcomings, such as the risk of collusion among participants leading to the potential disclosure of private keys, high computational costs, and decreased efficiency. The advent of trusted execution environments (TEEs) offers a solution by protecting computations involving private data through hardware isolation, thereby eliminating reliance on trusted third parties, preventing malicious collusion, and improving efficiency. Nevertheless, TEEs are vulnerable to side-channel attacks. In this work, we propose an enhanced PPRL method based on TEE technology. Our methodology involves processing plaintext data within a TEE using the inner product mask technique, which effectively obfuscates the data, making it impervious to side-channel attacks. The experimental results demonstrate that our approach not only significantly improves resistance to side-channel attacks but also enhances efficiency, showing better performance and privacy preservation compared to existing methods. This work provides a robust solution to the challenges faced by current PPRL methods and sets the stage for future research aimed at further enhancing scalability and security.
... The indexing stage reduces the number of FPs from the overall data processed. In a single database, complete indexing compares every record with every other record, producing |A| × (|A| − 1) / 2 candidate pairs, where |A| is the number of records in the database [42]. Using the record linkage toolkit library, the distribution of several blocking mechanisms is presented in Table 2. Considering the data distribution, the field/record comparison stage uses the least number of candidate pairs: 806 in the WoS dataset, 65,541 in the Scopus dataset, and 904,008 in the GS dataset. ...
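For instance, with a hypothetical data source of |A| = 10,000 records (not one of the datasets above), complete indexing yields 10,000 × 9,999 / 2 = 49,995,000 candidate pairs, which is exactly what the blocking mechanisms compared here aim to avoid:

```python
def complete_indexing_pairs(num_records):
    # |A| * (|A| - 1) / 2 comparisons when every record is paired with every other one.
    return num_records * (num_records - 1) // 2

print(complete_indexing_pairs(10_000))  # 49995000
```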
Article
Full-text available
Bibliographic databases are used to measure the performance of researchers, universities and research institutions. Thus, high data quality is required and data duplication is avoided. One of the weaknesses of the threshold-based approach in duplication detection is the low accuracy level. Therefore, another approach is required to improve duplication detection. This study proposes a method that combines threshold-based and rule-based approaches to perform duplication detection. These two approaches are implemented in the comparison stage. The cosine similarity function is used to create weight vectors from the features. Then, the comparison operator is used to determine whether the pair of records are grouped as duplicates or not. Three research databases in the Science and Technology Index (SINTA) database are investigated: Web of Science (WoS), Scopus, and Google Scholar (GS). Rule 4 and Rule 5 provide the best performance. For the WoS dataset, the accuracy, precision, recall, and F1-measure values were all 100.00%. For the Scopus dataset, the accuracy and precision values were 100.00%, recall was 98.00%, and the F1-measure was 98.00%. For the GS dataset, the accuracy was 100.00%, precision 99.00%, recall 97.00%, and the F1-measure 98.00%. The proposed method is a potential tool for accurate detection of duplicate records in publication databases.
... Since the mid-1990s, many applications of record linkage have concerned the issue of linking historical census data, as in the works of Ferrie (1996), Rosenwaike et al. (1998) and Ruggles (2002); more recent literature on record linkage concerns different real-life issues, such as health issues (Mumme et al. 2022; Heidinger et al. 2022), deduplication issues (Christen et al. 2011; Tancredi et al. 2020) and crime and fraud detection (Vatsalan et al. 2013), among others. ...
Article
Full-text available
Nowadays public authorities and research organizations compile and disseminate statistics based on granular data with rapidly increasing volumes. Efficient statistical methods for data quality management are essential to ensure high quality in the produced statistics and consequently in the policy decisions. In order to guarantee smooth data quality checks, such methods need to be automatic, especially during situations of constraints on human resources. This paper deals with an issue of anomaly detection in very granular insurance data which are periodically used by central banks to produce European statistics. Since 2016, insurance corporations have been reporting granular assets data in Solvency II templates on a quarterly basis. Assets are uniquely identified by codes that by regulation must be kept stable and consistent over time; nevertheless, due to reporting errors, unexpected changes in these codes may occur, leading to inconsistencies when compiling statistics and analysing balance sheets. The current work addresses the data quality issue as a record linkage problem and proposes different supervised classification models to detect anomalies in the data. Test results for the selected random forest model provide excellent performance metrics, robust to different periods and types of assets.
... Postal patrol is the first 5 digits of the postal code. In other words, to reduce the possibly very large number of pairs of records that need to be compared, indexing techniques are applied [33] in SVM methods. These techniques filter out record pairs that are very unlikely to correspond to matches. ...
Article
Full-text available
Today, most activities of the statistical offices need to be adapted to the modernization policies of the national statistical system. Therefore, the application of machine learning techniques is mandatory for the main activities of statistical centers. These include important issues such as coding business activities, address matching, prediction of response propensities, and many others. One of the common applications of machine learning methods in official statistics is to match a statistical address to a postal address, in order to establish a link between register-based census and traditional censuses with the aim of providing time series census information. Since there is no unique identifier to directly map the records from different databases, text-based approaches can be applied. In this paper, a novel application of machine learning will be investigated to integrate data sources of governmental records and census, employing text-based learning. Additionally, three new methods of machine learning classification algorithms are proposed. A simulation study has been performed to evaluate the robustness of methods in terms of the degree of duplication and purity of the texts. Due to the limitation of the R programming environment on big data sets, all programming has been successfully implemented on SAS (Statistical analysis system) software.
... Extended suffix array blocking (22), which considers each substring of the blocking keys, looks like a much more promising technique. However, the generated blocks are overwhelmed with records, because a match between a pair of substrings does not imply similarity between the corresponding keys. ...
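As a hedged illustration of suffix-based blocking (a generic sketch of the idea, not the exact extended variant discussed in this excerpt), every sufficiently long suffix of a blocking key is inserted into an inverted index, so records whose keys share a suffix end up in the same block; the keys and the minimum suffix length are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def suffixes(key, min_len=4):
    # All suffixes of the blocking key with at least min_len characters.
    return {key[i:] for i in range(len(key) - min_len + 1)}

# Hypothetical blocking-key values (e.g. concatenated surname and postcode).
keys = {1: "christen2600", 2: "kristen2600", 3: "miller2611"}

index = defaultdict(set)
for rid, key in keys.items():
    for suffix in suffixes(key):
        index[suffix].add(rid)

candidate_pairs = {pair
                   for block in index.values() if len(block) > 1
                   for pair in combinations(sorted(block), 2)}
print(sorted(candidate_pairs))  # [(1, 2)]: the keys share suffixes such as 'risten2600'
```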
Article
Organizations leverage massive volumes of information and new types of data to generate unprecedented insights and improve their outcomes. Correctly identifying duplicate records that represent the same entity, such as a user, customer, patient and so on, a process commonly known as record linkage, can improve service levels, accelerate sales, or elevate healthcare decision support. Towards this direction, blocking methods are used with the aim of grouping matching records in the same block using a combination of their attributes as blocking keys. This paper introduces a suite of randomized algorithms specifically crafted for streaming record linkage settings. Using a bounded in-memory data structure, in terms of the number of blocks and positions within each block, our algorithms guarantee that the most frequently accessed and the most recently used blocks remain in main memory and, additionally, the records within a block are renewed on a rolling basis. The operation of our algorithms relies on simple random choices, instead of utilizing cumbersome sorting data structures, which ensures that the probability of inactive blocks and older records remaining in main memory decays in order to free space for more promising blocks and fresher records, respectively. We also introduce an algorithm that performs approximate blocking to tackle the problem of misspellings and typos present in the blocking keys. The experimental evaluation showcases that our proposed algorithms scale efficiently to data streams while providing certain accuracy guarantees.
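As a heavily simplified, hypothetical sketch of the general idea of bounded-memory blocking with random eviction for streams (it does not reproduce the paper's actual algorithms or guarantees):

```python
import random

class BoundedBlockIndex:
    """Toy bounded-memory blocking structure for a stream of records.

    At most max_blocks blocks and max_per_block record ids per block are kept;
    random eviction gradually rotates out older records and inactive blocks.
    """

    def __init__(self, max_blocks=100, max_per_block=10, seed=7):
        self.max_blocks = max_blocks
        self.max_per_block = max_per_block
        self.blocks = {}
        self.rng = random.Random(seed)

    def insert(self, record_id, blocking_key):
        candidates = list(self.blocks.get(blocking_key, []))
        if blocking_key not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                # Evict a randomly chosen block to make room for the new one.
                self.blocks.pop(self.rng.choice(list(self.blocks)))
            self.blocks[blocking_key] = []
        block = self.blocks[blocking_key]
        if len(block) < self.max_per_block:
            block.append(record_id)
        else:
            # Overwrite a random position so that older records rotate out over time.
            block[self.rng.randrange(self.max_per_block)] = record_id
        return candidates   # records already in the block are compared with the new one

index = BoundedBlockIndex(max_blocks=2, max_per_block=2)
for rid, key in [(1, "sm"), (2, "sm"), (3, "jo"), (4, "sm"), (5, "mi")]:
    print(rid, "compare with", index.insert(rid, key))
```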
... Blocking, comparing, and classifying are the three distinct stages of deduplication. By grouping records on shared attribute values, the blocking phase seeks to minimize the number of comparisons [Christen (2012)]. For instance, in simple blocking methods, any records that share the initial character of a name or another attribute are placed in the same block, so that candidate pairs are only formed within blocks. ...
... Clean-Clean ER is also called Record Linkage, in contrast to Deduplication or Dirty ER, where a single entity collection E is given as input with duplicates in itself [1], [11]. ...
Conference Paper
Full-text available
Entity Resolution has been an active research topic for the last three decades, with numerous algorithms proposed in the literature. However, putting them into practice is often a complex task that requires implementing, combining and configuring complementary individual algorithms into comprehensive end-to-end workflows. To facilitate this process, we are developing pyJedAI, a novel system that provides a unifying framework for the main types of works in the field (i.e., both unsupervised and learning-based ones). Our vision is to enable both novice and expert users to use and combine these algorithms through a series of principled approaches for automatically configuring and benchmarking end-to-end pipelines.
... While the problem of EA was introduced a few years ago, the more generic version of the problem - identifying entity records referring to the same real-world entity from different data sources - has been investigated from various angles by different communities, under the names of entity resolution (ER) [15,18,45], entity matching [13,42], record linkage [8,34], deduplication [16], instance/ontology matching [20,35,49-51], link discovery [43,44], and entity linking/entity disambiguation [11,29]. Next, we describe the related work and the scope of this book. ...
Chapter
Full-text available
In this section, we provide a concise overview of the entity alignment task and also discuss other related tasks that have a close connection to entity alignment.
... One of the main challenges of deduplication is its quadratic complexity: in the worst case, it examines all possible pairs of records. Blocking is generally used to mitigate this complexity, especially with large data volumes (Christen, 2012b; Papadakis et al., 2020b). Blocking gathers similar records into groups called blocks by applying blocking schemes or functions. ...
Conference Paper
Full-text available
Deduplication is the task of recognising multiple representations of the same real-world object. Most existing solutions focus on textual data and often neglect Boolean and numerical attributes, while the problem of missing values is not sufficiently covered. Supervised solutions cannot be applied without an adequate number of labelled examples, which entails time-consuming labelling processes. In this paper we propose D-HAT, an unsupervised pipeline that is inherently capable of handling high-dimensional, sparse and heterogeneous attribute types. At the core of this pipeline are: (i) a new matching function that effectively summarises multiple matching signals, and (ii) MutMax, a greedy clustering algorithm that designates as duplicates the pairs with a mutually maximal matching score. We evaluate D-HAT on five real-world datasets and demonstrate that our approach significantly outperforms the state of the art.
Article
Full-text available
Today, the volume of information is growing rapidly, and the process of drawing conclusions about each person using records from different databases is continually being refined. Making decisions by matching records from different databases and aggregating them into a single database optimises work processes and brings economic benefits. In medicine, online commerce, and employment systems, such matching improves time and economic indicators. For this reason, this article studies the methods, models, and approaches for matching records across different databases.
Article
This work presents an open-source Python library, named pyJedAI, which provides functionalities supporting the creation of algorithms related to product entity resolution. Building over existing state-of-the-art resolution algorithms, the tool offers a plethora of important tasks required for processing product data collections. It can be easily used by researchers and practitioners for creating algorithms analyzing products, such as real-time ad bidding, sponsored search, or pricing determination. In essence, it allows users to easily import product data from the possible sources, compare products in order to detect either similar or identical products, generate a graph representation using the products and desired relationships, and either visualize or export the outcome in various forms. Our experimental evaluation on data from well-known online retailers illustrates high accuracy and low execution time for the supported tasks. To the best of our knowledge, this is the first Python package to focus on product entities and provide this range of product entity resolution functionalities. History: Accepted by Ted Ralphs, Area Editor for Software Tools. Funding: This was partially funded by the EU project STELAR (Horizon Europe) [Grant 101070122]. Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0410 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2023.0410 ). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .
Article
In the era of data information explosion, there are different observations on an object (e.g., the height of the Himalayas) from different sources on the web, social sensing, crowd sensing, and data sensing applications. Observations from different sources on an object can conflict with each other due to errors, missing records, typos, outdated data, etc. How to discover truth facts for objects from various sources is essential and urgent. In this paper, we aim to deliver a comprehensive and exhaustive survey on truth discovery problems from the perspectives of concepts, methods, applications, and opportunities. We first systematically review and compare problems from objects, sources, and observations. Based on these problem properties, different methods are analyzed and compared in depth from observation with single or multiple values, independent or dependent sources, static or dynamic sources, and supervised or unsupervised learning, followed by the surveyed applications in various scenarios. For future studies in truth discovery fields, we summarize the code sources and datasets used in above methods. Finally, we point out the potential challenges and opportunities on truth discovery, with the goal of shedding light and promoting further investigation in this area.
Article
Analysis of integrated data often requires record linkage in order to join together the data residing in separate sources. In case linkage errors cannot be avoided, due to the lack of a unique identity key that can be used to link the records unequivocally, standard statistical techniques may produce misleading inference if the linked data are treated as if they were true observations. In this paper, we propose methods for categorical data analysis based on linked data that are not prepared by the analyst, such that neither the match‐key variables nor the unlinked records are available. The adjustment is based on the proportion of false links in the linked file and our approach allows the probabilities of correct linkage to vary across the records without requiring that one is able to estimate this probability for each individual record. It also accommodates the general situation where unmatched records that cannot possibly be correctly linked exist in all the sources. The proposed methods are studied by simulation and applied to real data.
Article
Full-text available
The product of this research work takes raw atmospheric science data as input and generates clean, standardized, and redundancy-free data as output. There are two major tasks involved in the work: data processing and information retrieval. Data processing involves removing inaccuracies and resolving inconsistencies among data. Information retrieval involves identifying the information needed, extracting the data, and consolidating similar entries. There are four challenges encountered in this study: the inaccuracy of the raw data, the inconsistency of records, the need for multiple criteria for validating a record, and the large number of alternatives for certain records. Some existing research dealt with some of the complexities involved in this study. Unfortunately, a solution that targets all these challenges at once was not found in the literature. The contribution of this work is to find a comprehensive approach that works with all four problems and produces a reliable outcome. In particular, fuzzy matching and fuzzy rule-based inference systems have been used for removing inconsistencies among data entries, retrieving information from certain sections of data files, and consolidating information from different sources. A rule-based system is chosen to represent the factors that are associated with the type of measurement as well as their interrelationships, and then make decisions on the category of the measurement. The retrieval of instrument information is aided by a structural-based approach using the natural language processing (NLP) technique. The output of NLP is used to match an entry in the instrument dictionary using a fuzzy rule-based inference system. A multi-criteria decision-making algorithm is used to aggregate information from different sources and select the instrument classification by factoring in the significance of each source. A software package based on the algorithms presented in this paper has been developed; the package has been deployed for real-world applications.
Chapter
Record Linkage is the process of merging data from several sources and identifying records that are associated with the same entities, or individuals, where a unique identifier is not available. Record Linkage has applications in several domains such as master data management, law enforcement, health care, social networking, historical research, etc. A straightforward algorithm for record linkage would compare every pair of records and hence take at least quadratic time. In a typical application of interest, the number of records could be in the millions or more. Thus quadratic algorithms may not be feasible in practice. It is imperative to create novel record linkage algorithms that are very efficient. In this paper, we address this crucial problem. One of the popular techniques used to speed up record linkage algorithms is blocking. Blocking can be thought of as a step in which potentially unrelated record pairs are pruned from distance calculations. A large number of blocking techniques have been proposed in the literature. In this paper we offer novel blocking techniques that are based on mapping k-mers into a suitable range. We also study the effect of distance metrics on the efficiency of record linkage algorithms. Specifically, we offer some novel variations of existing metrics that lead to improvements in the run times.
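The following sketch illustrates, under assumptions of our own, the general idea of blocking by mapping k-mers into a fixed range of bucket ids: records that share at least one bucket become candidate pairs. The bucket count, key choice and helper names are hypothetical and not taken from the paper.

# Sketch of blocking by hashing k-mers (length-k substrings of a blocking
# key) into a fixed range of buckets; records sharing a bucket become
# candidate pairs. Python's built-in hash is fine within a single run.
from collections import defaultdict
from itertools import combinations

def kmers(s: str, k: int = 3):
    s = s.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def kmer_blocks(records, key, k=3, num_buckets=1024):
    blocks = defaultdict(set)
    for rid, rec in enumerate(records):
        for g in kmers(key(rec), k):
            blocks[hash(g) % num_buckets].add(rid)
    return blocks

def candidate_pairs(blocks):
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [{"name": "john smith"}, {"name": "jon smith"}, {"name": "mary jones"}]
blocks = kmer_blocks(records, key=lambda r: r["name"])
print(candidate_pairs(blocks))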
Chapter
Entity resolution (ER) finds records that refer to the same entities in the real world. Blocking is an important task in ER, filtering out unnecessary comparisons and speeding up ER. Blocking is usually an unsupervised task. In this paper, we develop an unsupervised blocking framework based on pre-trained language models (B-PLM). B-PLM exploits the powerful linguistic expressiveness of pre-trained language models. The design space for B-PLM contains two steps. (1) The Record Embedding step generates record embeddings with pre-trained language models like BERT and Sentence-BERT. (2) The Block Generation step generates blocks with clustering algorithms and similarity search methods. We explore multiple combinations along these two dimensions of B-PLM. We evaluate B-PLM on six datasets (structured, dirty and textual). B-PLM is superior to previous deep learning methods on textual and dirty datasets. We perform extensive experiments to compare and analyze different combinations of record embedding and block generation. Finally, we recommend some good combinations in B-PLM.
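A minimal sketch of the embedding-plus-clustering design space described above, assuming a placeholder encoder in place of a real pre-trained language model; the clustering step uses scikit-learn's KMeans, and all names are illustrative rather than the authors' API.

# Sketch of embedding-based blocking in the spirit of B-PLM: embed each
# record's textual serialization, then cluster the embeddings so that each
# cluster becomes a block. The embed() stub stands in for a pre-trained
# language model encoder (e.g. a Sentence-BERT model).
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    # Placeholder encoder: random vectors, purely so the sketch runs end to
    # end; replace with a real sentence-embedding model in practice.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 32))

def blocks_by_clustering(records, n_blocks=2):
    texts = [" ".join(str(v) for v in r.values()) for r in records]
    vectors = embed(texts)
    labels = KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(vectors)
    blocks = {}
    for rid, lab in enumerate(labels):
        blocks.setdefault(int(lab), []).append(rid)
    return blocks

records = [{"title": "iphone 12 64gb"}, {"title": "apple iphone 12"},
           {"title": "galaxy s21"}, {"title": "samsung galaxy s21 5g"}]
print(blocks_by_clustering(records))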
Conference Paper
This paper presents a system for intellectual property management, capable of extracting and organising the data published in the journal of the Instituto Nacional de Propriedade Industrial about an intellectual property (IP) asset. For an institution to stay abreast of information about its IP assets, such as patents and software registrations, the publications in the journal must be monitored weekly, which leaves room for human error and can demand considerable time. The proposed system offers a solution by extracting and interpreting semi-structured data, performing deduplication over different occurrences of an institution's name, and presenting the information to the user in an organised manner.
Article
Full-text available
Duplicate detection is the problem of identifying pairs of records that represent the same real-world object, and could thus be merged into a single record. To avoid a prohibitively expensive comparison of all pairs of records, a common technique is to carefully partition the records into smaller subsets. If duplicate records appear in the same partition, only the pairs within each partition need to be compared. Two competing approaches are often cited: Blocking methods strictly partition records into disjoint subsets, for instance using zip-codes as partitioning key. Windowing methods, in particular the Sorted-Neighborhood method, sort the data according to some key, such as zip-code, and then slide a window of fixed size across the sorted data and compare pairs only within the window. Herein we compare both approaches qualitatively and experimentally. Further, we present a new generalized algorithm, the Sorted Blocks method, with the competing methods as extreme cases. Experiments show that the windowing algorithm is better than blocking and that the generalized algorithm slightly improves upon it in terms of efficiency (detected duplicates vs. overall number of comparisons).
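For reference, a minimal sketch of the windowing (Sorted-Neighborhood) idea discussed above: records are sorted on a key and only records falling within a sliding window of fixed size w are paired. The key choice and window size are illustrative.

# Sorted-Neighborhood sketch: sort record ids by a sorting key, slide a
# window of size w over the sorted order, and emit all pairs inside each
# window position as candidate comparisons.
from itertools import combinations

def sorted_neighbourhood_pairs(records, key, w=3):
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for start in range(len(order) - w + 1):
        window = order[start:start + w]
        pairs.update((min(a, b), max(a, b)) for a, b in combinations(window, 2))
    return pairs

records = [{"zip": "2600", "name": "smith"}, {"zip": "2601", "name": "smyth"},
           {"zip": "2600", "name": "jones"}, {"zip": "2914", "name": "smith"}]
print(sorted_neighbourhood_pairs(records, key=lambda r: r["zip"] + r["name"], w=2))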
Article
Full-text available
Record Linkage (RL) is an important component of data cleansing and integration. For years, many efforts have focused on improving the performance of the RL process, either by reducing the number of record comparisons or by reducing the number of attribute comparisons, which reduces the computational time, but very often decreases the quality of the results. However, the real bottleneck of RL is the post-process, where the results have to be reviewed by experts that decide which pairs or groups of records are real links and which are false hits. In this paper, we show that exploiting the relationships (e.g. foreign key) established between one or more data sources, makes it possible to find a new sort of semantic blocking method that improves the number of hits and reduces the amount of review effort.
Article
Full-text available
Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" nonduplicate records from the same source can be used as training examples alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.
Conference Paper
Full-text available
Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.
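A simplified sketch of suffix-based blocking in the spirit of the approach above: every suffix of a record's blocking-key value, down to a minimum length, is used as an index key, so records sharing a sufficiently long suffix fall into the same block. The parameters and helper names are illustrative, not the authors' implementation.

# Suffix-based blocking sketch: index each record under every suffix of its
# blocking-key value that has at least min_len characters.
from collections import defaultdict

def suffix_blocks(records, key, min_len=4):
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        value = key(rec).lower()
        for start in range(0, max(1, len(value) - min_len + 1)):
            index[value[start:]].add(rid)
    return index

records = [{"name": "catherine"}, {"name": "katherine"}, {"name": "kathy"}]
blocks = suffix_blocks(records, key=lambda r: r["name"])
# Suffixes such as "atherine" and "therine" are shared by records 0 and 1.
print([s for s, ids in blocks.items() if len(ids) > 1])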
Conference Paper
Full-text available
Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for generating the candidate matches and deciding which methods to use to compare the selected attributes. These attribute and method choices constitute a blocking scheme. Previous approaches to record linkage address the blocking issue in a largely ad-hoc fashion. This paper presents a machine learning approach to automatically learn effective blocking schemes. We validate our approach with experiments that show our learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.
Conference Paper
Full-text available
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash-values as document signatures through the locality sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than the commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, which is a desired property lacked by existing methods.
Conference Paper
Full-text available
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass.
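The multi-pass idea with transitive closure can be sketched with a union-find structure: each pass (run with a different sort key) contributes matched pairs, and the closure merges them into clusters. The match results used below are illustrative placeholders.

# Transitive closure over the matches of several independent passes, using
# a small union-find (disjoint-set) structure with path halving.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def transitive_closure(num_records, passes):
    """passes: an iterable of sets of matched (i, j) pairs, one per pass."""
    uf = UnionFind(num_records)
    for matched_pairs in passes:
        for i, j in matched_pairs:
            uf.union(i, j)
    clusters = {}
    for rid in range(num_records):
        clusters.setdefault(uf.find(rid), []).append(rid)
    return list(clusters.values())

# Pass 1 (sorted on name) matched (0, 1); pass 2 (sorted on address) matched (1, 2).
print(transitive_closure(4, [{(0, 1)}, {(1, 2)}]))   # [[0, 1, 2], [3]]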
Conference Paper
Full-text available
A large proportion of the massive amounts of data that are being collected by many organisations today is about people, and often contains identifying information like names, addresses, dates of birth, or social security numbers. Privacy and confidentiality are of great concern when such data is being processed and analysed, and when there is a need to share such data between organisations or make it publicly available. The research area of data linkage is especially suffering from a lack of publicly available real-world data sets, as experimental evaluations and comparisons are difficult to conduct without real data. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data with realistic characteristics, such as frequency distributions and error probabilities. Our data generator significantly improves similar earlier approaches, and allows the creation of data containing records for individuals, households and families.
Conference Paper
Full-text available
Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined based on domain-specific similarity functions over several attributes. In this paper, a novel approach is proposed that uses a two level matching based on double embedding. First, records are embedded into a metric space of dimension K, then they are embedded into a smaller dimension K′. The first matching phase operates on the K′-vectors, performing a quick-and-dirty comparison, pruning a large number of true negatives while ensuring a high recall. Then a more accurate matching phase is performed on the matching pairs in the K-dimension. Experiments have been conducted on real data sets and results revealed a gain in time performance ranging from 30% to 60% while achieving the same level of recall and accuracy as in previous single embedding schemes.
Article
Full-text available
Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. Duplicate detection is a problem of critical importance in many applications, including customer relationship management, personal information management, or data mining. In this paper, we present how a research prototype, namely DogmatiX, which was designed to detect duplicates in hierarchical XML data, was successfully extended and applied on a large scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main business line is to store and retrieve credit histories of over 60 million individuals. Here, correctly identifying duplicates is critical both for individuals and companies: On the one hand, an incorrectly identified duplicate potentially results in a false negative credit history for an individual, who will then not be granted credit anymore. On the other hand, it is essential for companies that Schufa detects duplicates of a person that deliberately tries to create a new identity in the database in order to have a clean credit history. Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected, because of the considerable size of the database. We describe our solution to coping with both problems and present a comprehensive evaluation based on large volumes of real-world data.
Article
Full-text available
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
Article
Full-text available
Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main memory based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to Suffix Array based indexing.
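A minimal sketch of the Bloom-filter optimisation mentioned above: before issuing a costly database query for an index key, an in-memory Bloom filter is consulted to check whether the key can possibly exist. The bit-array size and hash count below are arbitrary illustrations, not the paper's settings.

# Simple Bloom filter: k hash positions per key; a membership test that
# returns False guarantees the key was never added, so the (hypothetical)
# database lookup can be skipped.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
for suffix in ("atherine", "therine", "herine"):
    bf.add(suffix)
# Only keys that *might* be indexed trigger a costly database query.
for probe in ("therine", "zzz"):
    print(probe, "query database" if probe in bf else "skip")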
Article
Full-text available
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
Article
Full-text available
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
Article
Full-text available
Record-linkage is the process of identifying whether two separate records refer to the same real-world entity when some elements of the record's identifying information (attributes) agree and others disagree. Existing record-linkage decision methodologies use the outcomes from the comparisons of the whole set of attributes. Here, we propose an alternative scheme that assesses the attributes sequentially, allowing a decision to be made at any attribute's comparison stage, and thus before exhausting all available attributes. The scheme we develop is optimum in that it minimizes a well-defined average cost criterion, while the corresponding optimum solution can be easily mapped into a decision tree to facilitate the record-linkage decision process. Experimental results on real datasets indicate the superiority of our methodology compared to existing approaches.
Article
Full-text available
Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set.
Article
Full-text available
Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking is more accurate and efficient than blocking, especially for large datasets.
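A much-simplified sketch of the iterative blocking idea: blocks are swept repeatedly, every new match merges the records' profiles so that later blocks (and earlier blocks, on the next sweep) see the enriched record, and processing stops once a full sweep yields no new merge. The value-set profiles and the overlap-based match rule below are illustrative only, not the paper's system.

# Iterative blocking sketch: merge matching records inside blocks and keep
# sweeping all blocks until a full pass produces no further merges.
def iterative_blocking(blocks, profiles):
    profiles = {rid: set(vals) for rid, vals in profiles.items()}
    owner = {rid: rid for rid in profiles}            # record id -> cluster id

    changed = True
    while changed:
        changed = False
        for block in blocks:
            clusters = sorted({owner[r] for r in block})
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    a, b = clusters[i], clusters[j]
                    if a in profiles and b in profiles and profiles[a] & profiles[b]:
                        profiles[a] |= profiles.pop(b)            # merge b into a
                        owner.update({r: a for r, c in owner.items() if c == b})
                        changed = True
    return owner

blocks = [[0, 1], [1, 2]]
profiles = {0: {"j smith", "02-6125"}, 1: {"john smith", "02-6125"}, 2: {"john smith"}}
# 0 and 1 merge via the phone number; the merged profile then matches 2 by name.
print(iterative_blocking(blocks, profiles))   # {0: 0, 1: 0, 2: 0}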
Article
Full-text available
This article outlines a protocol for facilitating access to administrative data for the purpose of health services research, when these data are sourced from multiple organisations. This approach is designed to promote confidence in the community and among data custodians that there are benefits of linked health information being used and that individual privacy is being rigorously protected. Linked health administration data can provide an unparalleled resource for the monitoring and evaluation of health care services. However, for a number of reasons, these data have not been readily available to researchers. In Australia, an additional barrier to research is the result of health data sets being collected by different levels of government - thus all are not available to any one authority. To improve this situation, a practical blue-print for the conduct of data linkage is proposed. This should provide an approach suitable for most projects that draw large volumes of information from multiple sources, especially when this includes organisations in different jurisdictions. Health data, although widely and diligently collected, continue to be under-utilised for research and evaluation in most countries. This protocol aims to make these data more easily available to researchers by providing a controlled and secure mechanism that guarantees privacy protection.
Article
Full-text available
Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs). HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems. Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed. Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
Article
This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that is assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying the Poisson binomial recurrence technique of quadratic complexity. In this paper, we theoretically as well as experimentally show that our framework reduces this to a linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
Article
A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched). A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison-pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as a link (A1), a non-link (A3), and a possible link (A2). The first two decisions are called positive dispositions. The two types of error are defined as the error of the decision A1 when the members of the comparison pair are in fact unmatched, and the error of the decision A3 when the members of the comparison pair are, in fact, matched. The probabilities of these errors are defined as μ = Σ_{γ∈Γ} u(γ) P(A1|γ) and λ = Σ_{γ∈Γ} m(γ) P(A3|γ) respectively, where u(γ) and m(γ) are the probabilities of realizing γ (a comparison vector whose components are the coded agreements and disagreements on each characteristic) for unmatched and matched record pairs respectively. The summation is over the whole comparison space Γ of possible realizations. A linkage rule assigns probabilities P(A1|γ), P(A2|γ), and P(A3|γ) to each possible realization of γ ∈ Γ. An optimal linkage rule L(μ, λ, Γ) is defined for each value of (μ, λ) as the rule that minimizes P(A2) at those error levels. In other words, for fixed levels of error, the rule minimizes the probability of failing to make positive dispositions. A theorem describing the construction and properties of the optimal linkage rule and two corollaries to the theorem which make it a practical working tool are given.
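In practice the optimal rule of this model is usually applied in its likelihood-ratio form; the following display restates that standard reading, with the threshold symbols chosen here purely for illustration:

\[
  R(\gamma) \;=\; \frac{m(\gamma)}{u(\gamma)}, \qquad
  \text{decide } A_1 \text{ (link) if } R(\gamma) \ge T_{\mu}, \quad
  A_3 \text{ (non-link) if } R(\gamma) \le T_{\lambda}, \quad
  A_2 \text{ (possible link) otherwise},
\]

with the thresholds \(T_{\mu} > T_{\lambda}\) determined by the admissible error levels \((\mu, \lambda)\).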
Article
The need to consolidate the information contained in heterogeneous data sources has been widely documented in recent years. In order to accomplish this goal, an organization must resolve several types of heterogeneity problems, especially the entity heterogeneity problem that arises when the same real-world entity type is represented using different identifiers in different data sources. Statistical record linkage techniques could be used for resolving this problem. However, the use of such techniques for online record linkage could pose a tremendous communication bottleneck in a distributed environment (where entity heterogeneity problems are often encountered). In order to resolve this issue, we develop a matching tree, similar to a decision tree, and use it to propose techniques that reduce the communication overhead significantly, while providing matching decisions that are guaranteed to be the same as those obtained using the conventional linkage technique. These techniques have been implemented, and experiments with real-world and synthetic databases show significant reduction in communication overhead.
Article
This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files.
Conference Paper
The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves a SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
Conference Paper
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once---for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. U...
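A compact sketch of the canopy idea described above, assuming a cheap distance measure and two thresholds T1 > T2; the one-dimensional points and the absolute-difference distance are toy choices for illustration.

# Canopy clustering sketch: pick a random centre, put every point within T1
# of it into a canopy, and remove points within T2 from the candidate pool.
# Expensive comparisons are later restricted to points sharing a canopy.
import random

def canopies(points, cheap_dist, t1, t2):
    assert t1 > t2
    remaining = set(range(len(points)))
    result = []
    while remaining:
        centre = random.choice(sorted(remaining))
        canopy = {i for i in remaining if cheap_dist(points[centre], points[i]) < t1}
        result.append(canopy)
        remaining -= {i for i in canopy if cheap_dist(points[centre], points[i]) < t2}
    return result

random.seed(0)
points = [1.0, 1.2, 1.3, 5.0, 5.1, 9.0]
print(canopies(points, cheap_dist=lambda a, b: abs(a - b), t1=1.5, t2=0.5))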
Conference Paper
The process of identifying record pairs that represent the same real-world entity in multiple databases, commonly known as record linkage, is one of the important steps in many data mining applications. In this paper, we address one of the sub-tasks in record linkage, i.e., the problem of assigning record pairs with an appropriate matching status. Techniques for solving this problem are referred to as decision models. Most existing decision models rely on good training data, which is, however, not commonly available in real-world applications. Decision models based on unsupervised machine learning techniques have recently been proposed. In this paper, we review several existing decision models and then propose an enhancement to cluster-based decision models. Experimental results show that our proposed decision model achieves the same accuracy of existing models while significantly reducing the number of record pairs required for manual review. The proposed model also provides a mechanism to trade off the accuracy with the number of record pairs required for clerical review.
Conference Paper
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidences. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
Conference Paper
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
Conference Paper
Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Based on this background, this paper proposes a fast and efficient method for linkage detection. The features of the proposed approach are: first, it exploits a suffix array structure that enables linkage detection using variable length n-grams. Second, it dynamically generates blocks of possibly associated records using ‘blocking keys’ extracted from already known reliable linkages. The results from our preliminary experiments where the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records, are also reported in the paper.
Conference Paper
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage...
Conference Paper
Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
Conference Paper
Traditionally, record linkage algorithms have played an important role in maintaining digital libraries, i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, we argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate our hypothesis, we take a classical record linkage algorithm, the sorted neighborhood method (SNM), and demonstrate how we can achieve improved accuracy and performance by adaptively changing its fixed sliding window size. Our claim is analytically and empirically validated using both real and synthetic data sets of digital libraries and other domains.
Conference Paper
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
Article
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pair of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
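For context, the classical q-gram count filter that this line of work builds on (and that Ed-Join strengthens with mismatch-based bounds) can be sketched as follows; the bound and code are a generic illustration, not the paper's algorithm.

# Count-filter sketch: if edit_distance(s, t) <= k, then s and t must share
# at least max(|s|, |t|) - q + 1 - k*q q-grams, so pairs with a smaller
# overlap can be pruned before any edit-distance computation.
from collections import Counter

def qgrams(s, q=2):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def passes_count_filter(s, t, k, q=2):
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())   # multiset intersection
    required = max(len(s), len(t)) - q + 1 - k * q
    return shared >= required

print(passes_count_filter("jonathan", "jonthan", k=1))   # True: survives the filter
print(passes_count_filter("jonathan", "marianne", k=1))  # False: safely pruned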
Conference Paper
Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented in research proof-of-concept systems only, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them into a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Therefore, Febrl can be seen as a tool that allows researchers to compare various existing record linkage techniques with their own ones, enabling the record linkage research community to conduct their work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
Article
With the rapid development of various optical, infrared, and radar sensors and GPS techniques, there are a huge amount of multidimensional uncertain data collected and accumulated everyday. Recently, considerable research efforts have been made in the field of indexing, analyzing, and mining uncertain data. As shown in a recent book on uncertain data, in order to efficiently manage and mine uncertain data, effective indexing techniques are highly desirable. Based on the observation that the existing index structures for multidimensional data are sensitive to the size or shape of uncertain regions of uncertain objects and the queries, in this paper, we introduce a novel R-Tree-based inverted index structure, named UI-Tree, to efficiently support various queries including range queries, similarity joins, and their size estimation, as well as top-k range query, over multidimensional uncertain objects against continuous or discrete cases. Comprehensive experiments are conducted on both real data and synthetic data to demonstrate the efficiency of our techniques.
Article
Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
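Two measures commonly used in this literature to characterise an indexing or blocking step are the reduction ratio (complexity) and pairs completeness (quality); they are stated here for reference, with s_M and s_N the numbers of true-matching and non-matching candidate record pairs produced by the blocking step, and n_M and n_N the corresponding totals over all record pairs:

\[
  \mathrm{RR} \;=\; 1 \;-\; \frac{s_M + s_N}{n_M + n_N},
  \qquad
  \mathrm{PC} \;=\; \frac{s_M}{n_M}.
\]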
Article
The field of Record Linkage is concerned with identifying records from one or more datasets which refer to the same underlying entities. Where entity-unique identifiers are not available and errors occur, the process is non-trivial. Many techniques developed in this field require human intervention to set parameters, manually classify possibly matched records, or provide examples of matched and non-matched records. Whilst of great use and providing high quality results, the requirement of human input, besides being costly, means that if the parameters or examples are not produced or maintained properly, linkage quality will be compromised. The contributions of this paper are a critical discussion on the record linkage process, arguing for a more restrictive use of blocking in research, and evaluating and modifying the farthest-first clustering technique to produce results close to a supervised technique.
Conference Paper
Today's Web is so huge and diverse that it arguably reflects the real world. For this reason, searching the Web is a promising approach to find things in the real world. We present NEXAS, an extension to Web search engines that attempts to find real-world entities relevant to a topic. Its basic idea is to extract proper names from the Web pages retrieved for the topic. A main advantage of this approach is that users can query any topic and learn about relevant real-world entities without dedicated databases for the topic. In particular, we focus on an application for finding authoritative people from the Web. This application is practically important because once personal names are obtained, they can lead users from the Web to managed information stored in digital libraries. To explore effective ways of finding people, we first examine the distribution of Japanese personal names by analyzing about 50 million Japanese Web pages. We observe that personal names appear frequently on the Web, but the distribution is highly influenced by automatically generated texts. To remedy the bias and find widely acknowledged people accurately, we utilize the number of Web servers containing a name instead of the number of Web pages. We show its effectiveness by an experiment covering a wide range of topics. Finally, we demonstrate several examples and suggest possible applications.
Article
BACKGROUND: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs). METHODS: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems. RESULTS: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when the approach was used with simpler name data. Possible reasons for this poorer performance are discussed. CONCLUSION: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex, variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
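As a toy illustration of lexicon-based tokenisation followed by HMM decoding, the sketch below tags address tokens with output fields using a tiny hand-specified model. The states, lexicon, and probabilities are assumed purely for illustration; in the paper the HMM parameters are trained from annotated data.

```python
# Lexicon-based tokenisation plus Viterbi decoding in a small hand-set HMM
# that segments an address string into output fields (illustrative only).
LEXICON = {"st": "WAYFARE_TYPE", "street": "WAYFARE_TYPE", "rd": "WAYFARE_TYPE"}

def tokenise(text):
    """Map raw tokens to coarse observation symbols via a small lexicon."""
    obs = []
    for tok in text.lower().replace(",", " ").split():
        obs.append("NUMBER" if tok.isdigit() else LEXICON.get(tok, "WORD"))
    return obs

STATES = ["wayfare_no", "wayfare_name", "wayfare_type", "locality"]
START  = {"wayfare_no": 0.7, "wayfare_name": 0.3, "wayfare_type": 0.0, "locality": 0.0}
TRANS  = {
    "wayfare_no":   {"wayfare_no": 0.05, "wayfare_name": 0.9, "wayfare_type": 0.05, "locality": 0.0},
    "wayfare_name": {"wayfare_no": 0.0,  "wayfare_name": 0.3, "wayfare_type": 0.6,  "locality": 0.1},
    "wayfare_type": {"wayfare_no": 0.0,  "wayfare_name": 0.0, "wayfare_type": 0.0,  "locality": 1.0},
    "locality":     {"wayfare_no": 0.0,  "wayfare_name": 0.0, "wayfare_type": 0.0,  "locality": 1.0},
}
EMIT = {
    "wayfare_no":   {"NUMBER": 0.95, "WORD": 0.05, "WAYFARE_TYPE": 0.0},
    "wayfare_name": {"NUMBER": 0.05, "WORD": 0.9,  "WAYFARE_TYPE": 0.05},
    "wayfare_type": {"NUMBER": 0.0,  "WORD": 0.1,  "WAYFARE_TYPE": 0.9},
    "locality":     {"NUMBER": 0.1,  "WORD": 0.9,  "WAYFARE_TYPE": 0.0},
}

def viterbi(obs):
    """Return the most likely field sequence for the observation symbols."""
    v = [{s: START[s] * EMIT[s].get(obs[0], 0.0) for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[-1][p] * TRANS[p][s])
            col[s] = v[-1][best_prev] * TRANS[best_prev][s] * EMIT[s].get(o, 0.0)
            ptr[s] = best_prev
        v.append(col)
        back.append(ptr)
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(tokenise("42 Miller St Sydney")))
# ['wayfare_no', 'wayfare_name', 'wayfare_type', 'locality']
```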
Article
The frequency of early fatality and the transient nature of emergency medical care mean that a single database will rarely suffice for population based injury research. Linking records from multiple data sources is therefore a promising method for injury surveillance or trauma system evaluation. The purpose of this article is to review the historical development of record linkage, provide a basic mathematical foundation, discuss some practical issues, and consider some ethical concerns. Clerical or computer assisted deterministic record linkage methods may suffice for some applications, but probabilistic methods are particularly useful for larger studies. The probabilistic method attempts to simulate human reasoning by comparing each of several elements from the two records. The basic mathematical specifications are derived algebraically from fundamental concepts of probability, although the theory can be extended to include more advanced mathematics. Probabilistic, deterministic, and clerical techniques may be combined in different ways depending upon the goal of the record linkage project. If a population parameter is being estimated for a purely statistical study, a completely probabilistic approach may be most efficient; for other applications, where the purpose is to make inferences about specific individuals based upon their data contained in two or more files, the need for a high positive predictive value would favor a deterministic method or a probabilistic method with careful clerical review. Whatever techniques are used, researchers must realize that the combination of data sources entails additional ethical obligations beyond the use of each source alone.
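The algebraic core of the probabilistic method can be illustrated with Fellegi-Sunter style agreement weights, as in the hedged sketch below; the m- and u-probabilities and the field set are assumed values for illustration, not taken from the article.

```python
# Likelihood-ratio weights for probabilistic record linkage: m is the chance a
# field agrees given a true match, u the chance it agrees among non-matches.
import math

fields = {                      # field: (m, u), illustrative values
    "surname":    (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}

def field_weight(m, u, agrees):
    """log2 likelihood-ratio weight for a single field comparison."""
    return math.log2(m / u) if agrees else math.log2((1.0 - m) / (1.0 - u))

def pair_weight(agreements):
    """Sum of per-field weights; compared against upper and lower thresholds to
    classify a pair as a match, a possible match (clerical review), or a non-match."""
    return sum(field_weight(*fields[f], agrees) for f, agrees in agreements.items())

print(pair_weight({"surname": True, "given_name": True, "birth_year": False}))
# surname agrees: log2(95) ~ 6.6, given_name agrees: log2(18) ~ 4.2,
# birth_year disagrees: log2(0.02/0.98) ~ -5.6  ->  total ~ 5.1
```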
Conference Paper
The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data. The learned similarity function is subsequently used in clustering to determine which records are co-referent and should be linked. We present an online machine learning method for addressing this problem, where a composite similarity function based on a linear combination of basis functions is learned incrementally. We illustrate the efficacy of this approach on several real-world datasets from an Internet comparison shopping site, and show that our method is able to effectively learn various distance functions for product data with differing characteristics. We also provide experimental results that show the importance of considering multiple performance measures in record linkage evaluation.
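A rough sketch of the general setting described above, not the paper's actual learner, is given below: a composite similarity score is a linear combination of basis similarity functions, and its weights are adjusted incrementally from streamed, labelled record pairs. The basis functions, initial weights, and perceptron-style update rule are assumptions made for illustration.

```python
# Incrementally learned linear combination of basis similarity functions for
# record-pair scoring (perceptron-style update, illustrative only).
import difflib

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def seq_ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

BASIS = [jaccard, seq_ratio]              # basis similarity functions

class OnlineLinearSimilarity:
    def __init__(self, n_basis: int, lr: float = 0.1):
        self.w = [0.5] * n_basis          # start from uniform weights
        self.lr = lr

    def score(self, rec_a: str, rec_b: str) -> float:
        return sum(w * f(rec_a, rec_b) for w, f in zip(self.w, BASIS))

    def update(self, rec_a: str, rec_b: str, is_match: bool, threshold: float = 0.5):
        """Adjust weights only when the thresholded score disagrees with the
        streamed label (simple mistake-driven update)."""
        pred = self.score(rec_a, rec_b) >= threshold
        if pred != is_match:
            sign = 1.0 if is_match else -1.0
            for i, f in enumerate(BASIS):
                self.w[i] += self.lr * sign * f(rec_a, rec_b)

model = OnlineLinearSimilarity(len(BASIS))
model.update("apple ipod nano 8gb", "Apple iPod Nano 8 GB", is_match=True)
model.update("apple ipod nano 8gb", "sony walkman 8gb", is_match=False)
print(model.w, model.score("apple ipod nano 8gb", "Apple iPod nano 8GB"))
```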