Fig 3 (figure from the source publication)
Source publication
Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data coming from different sources. Typically no common key is available for connecting records. Massive data integration processes often have to be completed before any data analytics and further processing can be performed. In this work w...
Context in source publication
Context 1
... that correct matches with a score < 0.8 are rare, the 75% figure has been arbitrarily chosen as a tradeoff between performance and matching accuracy. The row-band configurations are shown in Figure 3 and Table I.

TABLE I: Minhash matching probabilities

minhash    4/10     5/18     6/30
σ = 0.5    47.5%    43.5%    37.6%
σ = 0.6    75.0%    76.7%    76.1%
σ = 0.7    93.5%    96.3%    97.6%
σ = 0.8    99.4%    99.9%    99.9%

A higher number of rows and bands yields a sharper S-curve. ...
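These probabilities follow the standard LSH banding estimate P = 1 − (1 − σ^r)^b, where r is the number of rows per band and b the number of bands. The sketch below is a minimal illustration (reading the configuration labels as rows/bands, e.g. 4/10 = 4 rows and 10 bands), which reproduces the values in Table I:

```python
# Probability that two records with MinHash (Jaccard) similarity sigma become
# a candidate pair under LSH banding with r rows per band and b bands.
def candidate_probability(sigma: float, r: int, b: int) -> float:
    return 1.0 - (1.0 - sigma ** r) ** b

# Reproduce Table I (configurations read as rows/bands).
configs = [(4, 10), (5, 18), (6, 30)]
for sigma in (0.5, 0.6, 0.7, 0.8):
    probs = [candidate_probability(sigma, r, b) for r, b in configs]
    print(f"sigma={sigma}: " + "  ".join(f"{p:.1%}" for p in probs))
```

Increasing r and b sharpens this S-curve around the chosen threshold, as the context above notes.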
Similar publications
Tensor train (TT) decomposition, a powerful tool for analysing multidimensional data, exhibits superior performance in many signal processing and machine learning tasks. However, existing methods for TT decomposition either require knowledge of the true TT ranks, or extensive fine-tuning of the balance between model complexity and representation a...
Citations
... MinHash algorithm, when used with the LSH forest data structure, represents a text similarity method that approximates the Jaccard set similarity score [32]. MinHash was used to replace the large sets of string data with smaller "signatures" that still preserve the underlying similarity metric, hence producing a signature matrix, but a pair-wise signature comparison was still needed. Here the LSH Forest comes into play. ...
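One way to realize the MinHash + LSH Forest combination described in this context is the datasketch Python library; the snippet below is only an illustrative sketch (the tokenization, parameters, and example strings are assumptions, not taken from the cited work):

```python
from datasketch import MinHash, MinHashLSHForest

def signature(text: str, num_perm: int = 128) -> MinHash:
    # Replace a string with a small MinHash "signature" that preserves the
    # approximate Jaccard similarity between token sets.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

names = ["International Business Machines Corp",
         "IBM Corporation",
         "Intl Business Machines"]
sigs = {name: signature(name) for name in names}

# Index all signatures in an LSH Forest so that candidate lookups avoid a
# full pair-wise comparison of every signature.
forest = MinHashLSHForest(num_perm=128)
for name, sig in sigs.items():
    forest.add(name, sig)
forest.index()

# Query the top-2 most similar indexed names for a new string.
query = signature("International Business Machines")
print(forest.query(query, 2))
```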
Privacy is a fundamental human right according to the Universal Declaration of Human Rights of the United Nations. The adoption of the General Data Protection Regulation (GDPR) in the European Union in 2018 was a turning point in the management of personal data, specifically personal identifiable information (PII). Although many privacy laws existed before, the GDPR has brought privacy into the regulatory spotlight. Its two most important novelties are the seven basic principles related to the processing of personal data and the huge fines defined for violations of the regulation. Many other countries have followed the EU with the adoption of similar legislation. Personal data management processes in companies, especially in analytical systems and Data Lakes, must comply with the regulatory requirements. In Data Lakes, there are no standard architectures or solutions for discovering personal identifiable information, matching data about the same person from different sources, or removing expired personal data. It is necessary to upgrade the existing Data Lake architectures and metadata models to support these functionalities. The goal is to study current Data Lake architectures and metadata models and to propose enhancements that improve the collection, discovery, storage, processing, and removal of personal identifiable information. In this paper, a new metadata model that supports the handling of personal identifiable information in a Data Lake is proposed.
... Record linkage can be understood as a process for extracting records from various data sources and combining them to form a single entity [6], for both structured and unstructured data [7]. A similar point is made by [8], which defines record linkage as a step to identify records that refer to the same thing. ...
Merging databases from different data sources is one of the important tasks in the data integration process. This study integrates lecturer data from two sources, the academic information system and the research information system at the Sriwijaya State Polytechnic. The integrated lecturer data will later serve as master data that can be used by other applications. The lecturer data from the academic section contains 444 records, while that from the p3m section contains 443 records. An important task in the database merging process is eliminating duplicate records. The master data is built with the Record Linkage Toolkit, implemented in the Python programming language. The steps taken are pre-processing, generating candidate record pairs, comparing pairs, scoring pairs, and finally linking the data to merge the two data sources. In this study, five fields from each data source, namely username, name, place of birth, date of birth, and gender, were used to measure the level of record similarity. The result of this research is the formation of lecturer master data from the merging of the two sources.
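A rough sketch of the pipeline described above, using the Python Record Linkage Toolkit (recordlinkage): the five comparison fields follow the abstract, while the toy data, blocking key, and acceptance threshold are illustrative assumptions rather than the study's actual configuration.

```python
import pandas as pd
import recordlinkage

# Toy stand-ins for the two lecturer tables (academic and p3m sources).
academic = pd.DataFrame({
    "username": ["asari", "bwijaya"],
    "name": ["Andi Sari", "Budi Wijaya"],
    "birth_place": ["Palembang", "Medan"],
    "birth_date": ["1980-05-01", "1975-11-23"],
    "gender": ["F", "M"],
})
p3m = pd.DataFrame({
    "username": ["asari", "b.wijaya"],
    "name": ["Andi Sari, M.Kom", "Budi Wijaya"],
    "birth_place": ["Palembang", "Medan"],
    "birth_date": ["1980-05-01", "1975-11-23"],
    "gender": ["F", "M"],
})

# 1) Generate candidate record pairs (blocking on gender keeps the example small).
indexer = recordlinkage.Index()
indexer.block("gender")
candidates = indexer.index(academic, p3m)

# 2) Compare pairs field by field.
compare = recordlinkage.Compare()
compare.exact("username", "username", label="username")
compare.string("name", "name", method="jarowinkler", threshold=0.85, label="name")
compare.exact("birth_place", "birth_place", label="birth_place")
compare.exact("birth_date", "birth_date", label="birth_date")
compare.exact("gender", "gender", label="gender")
features = compare.compute(candidates, academic, p3m)

# 3) Score pairs and link: accept pairs that agree on at least 4 of the 5 fields.
matches = features[features.sum(axis=1) >= 4]
print(matches)
```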
... Since the early 2000s, supervised machine learning approaches such as decision trees or logistic regression have been used, and meanwhile deep learning has found its way into the RL process [6]. The use of deep learning for RL can improve results, especially for unstructured and messy data [7,8]. A key challenge is that using machine learning for RL requires large amounts of labelled training data with high data quality. ...
... Creating such training data requires a lot of manual effort. Thus, the limited amount of training data is a bottleneck for supervised learning in RL [7,9]. Another reason why handling a substantial volume of training data presents difficulties is the practical impossibility of merging data from various sources in a centralized location due to factors like data privacy, legal requirements, or limited resources [10]. ...
... For example, in the data preparation, legal forms are classified and standardized by neural networks [4], blocking is performed with deep learning approaches [42], [43], word embeddings are applied to compare candidate pairs [44], [45] and the classification into match and non-match is done by using neural networks [46]. The major limitation to the application of ML in DI is the lack of sufficient amounts of annotated training data [7]. For DI to be performed with high quality, robust models that can handle diverse data problems are needed. ...
Data integration is utilized to integrate heterogeneous data from multiple sources, representing a crucial step to improve information value in data analysis and mining. Incorporating machine and deep learning into data integration has proven beneficial, particularly for messy data. Yet, a significant challenge is the scarcity of training data, impeding the development of robust models. Federated learning has emerged as a promising solution to the challenge of limited training data in various research domains. Through collaborative training, robust models can be built while upholding data privacy and security. This paper explores the potential of applying federated learning to data integration through a structured literature review, offering insights into the current state of the art and future directions in this interdisciplinary field.
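To illustrate the federated learning idea referenced here, the sketch below shows minimal federated averaging (FedAvg): clients train locally on data that never leaves them, and the server only averages model weights. The linear model and synthetic client data are assumptions chosen to keep the example self-contained, not the setup of the cited survey.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    # A few local gradient steps on one client's private data (linear regression).
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Synthetic "clients": each holds its own data that never leaves the client.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

# Federated averaging: local training, then the server averages the weights.
w_global = np.zeros(2)
for _ in range(20):
    local_weights = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_weights, axis=0)

print("estimated weights:", w_global)
```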
... Formally, the address matching task may be considered a binary classification problem [4,3,5,6,7] where the predicted class is either Match or No Match. However, given two companies with the same name, it is important to identify addresses that are partially similar, such as those sharing the same city and road but differing in the house number, or cases where both addresses are correct but one corresponds to a former company address, in order to complete addresses with up-to-date information. ...
... Earlier address matching approaches [6,7] are based on similarity measures and matching rules. However, these methods perform a structural comparison of addresses and are unable to identify relationships between two addresses when they have little literal overlap [3]. ...
In this paper, we describe a solution for a specific Entity Matching problem, where entities contain (postal) address information. The matching process is very challenging as addresses are often prone to (data) quality issues such as typos and missing or redundant information. Besides, they do not always comply with a standardized (address) schema and may contain polysemous elements. Recent address matching approaches combine static word embedding models with machine learning algorithms. While the solutions provided in this setting partially solve data quality issues, they neither handle polysemy nor leverage geolocation information. In this paper, we propose GeoRoBERTa, a semantic address matching approach based on RoBERTa, a Transformer-based model, enhanced by geographical knowledge. We validate the approach by conducting experiments on two different real datasets and demonstrate its effectiveness in comparison to baseline methods.
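As a rough illustration of the semantic matching idea (a generic pre-trained RoBERTa with mean pooling, not the authors' GeoRoBERTa model and without its geographical enhancement), two address strings can be embedded and compared by cosine similarity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def embed(address: str) -> torch.Tensor:
    # Mean-pool the last hidden states into one vector per address.
    inputs = tokenizer(address, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

a = embed("12 Main Street, Springfield")
b = embed("12 Main St., Springfield")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

A Match/No Match decision would then be obtained by thresholding this score or by feeding the embeddings into a classifier.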
... Quite often, these vast amounts of data include data that refer to persons. Due to the many different attributes that refer to the same person, it is very common for organizations and data controllers to keep duplicate instances that, in some cases, may be identical but, in most, differ slightly and could, thus, be mistakenly treated as referring to different persons. Furthermore, as data volumes grow, storage also needs to increase, rendering the minimization of storage space a key challenge in order to build more efficient backup processes [4]. ...
... Based on this, Gschwind et al. introduced their proposed solution which comprises rule-based linkage algorithms and ML models. Their study achieved a 91% recall rate on a real-world dataset [13]. ...
Analysis of extreme-scale data is an emerging research topic; the explosion in available data raises the need for suitable content verification methods and tools to decrease the analysis and processing time of various applications. Personal data, for example, are a very valuable source of information for several purposes of analysis, such as marketing, billing and forensics. However, the extraction of such data (referred to as person instances in this study) is often faced with duplicate or similar entries about persons that are not easily detectable by the end users. In this light, the authors of this study present a machine learning- and deep learning-based approach to mitigate the problem of duplicate person instances. The main concept of this approach is to gather different types of information referring to persons, compare different person instances and predict whether they are similar or not. Using the Jaro algorithm for person attribute similarity calculation and by cross-examining the information available for person instances, recommendations can be provided to users regarding whether two person instances are similar. The degree of importance of each attribute was also examined, in order to gain better insight into which of the declared features play a more important role.
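The Jaro similarity used for attribute comparison in this work can be implemented in a few lines; the version below is a generic textbook implementation, not the authors' code:

```python
def jaro(s1: str, s2: str) -> float:
    # Jaro similarity: combines the number of matching characters (within a
    # sliding window) and the number of transpositions between two strings.
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)

    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count transpositions among the matched characters.
    t = 0
    k = 0
    for i, c in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if c != s2[k]:
                t += 1
            k += 1
    t //= 2

    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(jaro("MARTHA", "MARHTA"))  # ~0.944
```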
... Big data and its emerging technologies, including business intelligence and big data analytics systems, have become a mainstream market adopted broadly across industries, organizations, and geographic regions to facilitate data-driven decision making and significantly affect the way that decision-makers, such as CEOs, operate and run their business [2,3]. One of the fundamental functionalities a business intelligence system must have is the ability to integrate many data sources [2,4]. The integration of multiple sources usually requires linking between the significant entities that exist across data sources and systems. ...
... The integration of multiple sources usually requires linking between the significant entities that exist across data sources and systems. Usually, those entities function as a "primary key" in each system [4]. The technique used for performing such linkage is commonly referred to as "Record Linkage," "Data Deduplication," "Object Matching," or "Entity Matching" [4,5]. ...
Entity Matching is an essential part of all real-world systems that take in structured and unstructured data coming from different sources. Typically no common key is available for connecting records. Massive data cleaning and integration processes require completion before any data analytics or further processing can be performed. Although record linkage is frequently regarded as a somewhat tedious but necessary step, it reveals valuable insights, supports data visualization, and guides further analytic approaches to the data. Here, we focus on organization entity matching. We introduce CompanyName2Vec, a novel algorithm to solve company entity matching (CEM) using a neural network model to learn company name semantics from a job ad corpus, without relying on any information about the matched company besides its name. Based on real-world data, we show that CompanyName2Vec outperforms other evaluated methods and solves the CEM challenge with an average success rate of 89.3%.
... We test our approach on the company domain in Wikidata. The company domain has many applications in the areas of enterprise and finance where there is a focus on market intelligence or stock market data (Gschwind et al., 2019). We focus on company entities as they present a useful microcosm of the overall challenges of knowledge graph entity translation, such as the mix of translation and transliteration in cross-lingual labelling. ...
Content on the web is predominantly in English, which makes it inaccessible to individuals who exclusively speak other languages. Knowledge graphs can store multilingual information, facilitate the creation of multilingual applications, and make these accessible to more language communities. In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages.
We explore the current state of language distribution in knowledge graphs by developing a framework - based on existing standards, frameworks, and guidelines - to measure label and language distribution in knowledge graphs. We apply this framework to a dataset representing the web of data, and to Wikidata. We find that there is a lack of labelling on the web of data, and a bias towards a small set of languages. Due to its multilingual editors, Wikidata has a better distribution of languages in labels.

We explore how this knowledge about labels and languages can be used in the domain of question answering. We show that we can apply our framework to the task of ranking and selecting knowledge graphs for a set of user questions.

A way of overcoming the lack of multilingual information in knowledge graphs is to transliterate and translate knowledge graph labels and aliases. We propose the automatic classification of labels into transliteration or translation in order to train a model for each task. Classification before generation improves results compared to using either a translation- or transliteration-based model on their own.

A use case of multilingual labels is the generation of article placeholders for Wikipedia using neural text generation in lower-resourced languages. On the basis of surveys and semi-structured interviews, we show that Wikipedia community members find the placeholder pages, and especially the generated summaries, helpful, and are highly likely to accept and reuse the generated text.
... This method describes our approach to analysing our eleven existing data sources (see table 1) and integrating several of them through an RL process to find general RL challenges for the real-world entity company. One of the most relevant attributes in company entity matching is the company name [15,16], which we will focus on in this paper. The legal form of a company is also an important attribute, as it is discriminative when comparing companies [15]. ...
... We have identified the papers of Schild and Schultz [15], Cuffe and Goldschlag [25], and Gschwind et al. [16] as research papers that focus specifically on company entity matching. Schild and Schultz [15] present in their paper a self-developed RL process to integrate different data sources containing companies for research purposes of the Deutsche Bundesbank. ...
... Gschwind et al. [16] focus on company entity matching to integrate data sources needed for further processing, such as data analytics. The attributes company name, location, and industry are used. ...
This paper explores record linkage, a step of the data integration process. Thereby we focus on the company entity. For the integration of company data, the company name is a crucial attribute, which often includes the legal form. This legal form is not concisely and consistently represented across different data sources, which leads to considerable data quality problems for the further process steps in record linkage. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our hybrid approach, any company data sets from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
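The rule-based half of such a hybrid approach can be pictured as a small dictionary of legal-form patterns applied to the company name. The patterns below are a tiny illustrative subset, not the rule set developed in the paper:

```python
import re

# A few legal-form patterns (illustrative subset only); a real rule set covers
# many more forms, spellings, and abbreviations per jurisdiction.
LEGAL_FORMS = {
    "GmbH": r"\bG\.?m\.?b\.?H\.?\b",
    "AG": r"\bAG\b",
    "Inc.": r"\bInc\.?\b",
    "Ltd.": r"\bLtd\.?\b",
    "LLC": r"\bL\.?L\.?C\.?\b",
}

def split_legal_form(company_name):
    """Return (company name without legal form, normalized legal form or None)."""
    for normalized, pattern in LEGAL_FORMS.items():
        match = re.search(pattern, company_name, flags=re.IGNORECASE)
        if match:
            remainder = company_name[:match.start()] + company_name[match.end():]
            remainder = " ".join(remainder.replace(",", " ").split()).strip(" .")
            return remainder, normalized
    return company_name.strip(), None

print(split_legal_form("Example Software G.m.b.H."))  # ('Example Software', 'GmbH')
print(split_legal_form("Acme Holdings, Inc"))         # ('Acme Holdings', 'Inc.')
```

The ML half of the hybrid would then handle names whose legal form does not match any rule.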
... Although knowledge graphs (KGs) and ontologies have been exploited successfully for data integration [Trivedi et al. 2018;Azmy et al. 2019], entity matching involving structured and unstructured sources has usually been performed by treating records without explicitly taking into account the natural graph representation of structured sources and the potential graph representation of unstructured data [Mudgal et al. 2018;Gschwind et al. 2019]. To address this limitation, we propose a methodology for leveraging graph-structured information in entity matching. ...
... The resulting training graph contains approximately 40k nodes organized into 1.7k business entities. As a data-augmentation step, we generate an additional canonical or normalized version of a company name and link it to the real name in the graph, using conditional random fields, as described in [Gschwind et al. 2019]. This step yields an enriched training graph with 70k nodes. ...
... Experiments. We compared our S-GCN model against three baselines, namely (i) a record-linkage system (RLS) designed for company entities [Gschwind et al. 2019], (ii) a feed-forward neural network (NN), and (iii) a model based on graph convolutional networks (GCN). Both the GCN and NN models use BERT features as input and a softmax output layer. (Table 1: Accuracy of entity matching on the test set.) ...
Data integration has been studied extensively for decades and approached from different angles. However, this domain still remains largely rule-driven and lacks universal automation. Recent developments in machine learning and in particular deep learning have opened the way to more general and efficient solutions to data-integration tasks. In this paper, we demonstrate an approach that allows modeling and integrating entities by leveraging their relations and contextual information. This is achieved by combining siamese and graph neural networks to effectively propagate information between connected entities and support high scalability. We evaluated our approach on the task of integrating data about business entities, demonstrating that it outperforms both traditional rule-based systems and other deep learning approaches.
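Below is a generic sketch of the siamese + graph-convolution idea described here; it is not the S-GCN architecture of the paper, and random toy features stand in for the BERT inputs, with a hand-made example graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    # One graph-convolution step: aggregate neighbor features through a
    # normalized adjacency matrix, then apply a shared linear map.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        return F.relu(self.linear(adj_norm @ x))

class SiameseGCN(nn.Module):
    # Shared GCN encoder applied to both entities; a small head scores
    # whether the two node embeddings refer to the same entity.
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.gcn1 = GCNLayer(in_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, 2)  # non-match / match logits

    def encode(self, x, adj_norm):
        return self.gcn2(self.gcn1(x, adj_norm), adj_norm)

    def forward(self, x, adj_norm, idx_a, idx_b):
        h = self.encode(x, adj_norm)
        pair = torch.cat([h[idx_a], h[idx_b]], dim=-1)
        return self.head(pair)

# Toy graph: 4 nodes with 8-dim features (BERT features in the cited work).
x = torch.randn(4, 8)
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
adj_norm = adj / adj.sum(dim=1, keepdim=True)  # simple row normalization

model = SiameseGCN(in_dim=8, hidden_dim=16)
logits = model(x, adj_norm, idx_a=torch.tensor([0]), idx_b=torch.tensor([2]))
print(F.softmax(logits, dim=-1))  # probabilities for non-match / match
```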