Fast Record Linkage for Company Entities
Thomas Gschwind, Christoph Miksovic, Julian Minder, Katsiaryna Mirylenka, Paolo Scotton
IBM Research Zurich
Rüschlikon, Switzerland
{thg, cmi, jmd, kmi, psc}@zurich.ibm.com
Abstract—Record linkage is an essential part of nearly all
real-world systems that consume structured and unstructured
data coming from different sources. Typically no common key
is available for connecting records. Massive data integration
processes often have to be completed before any data analytics
and further processing can be performed. In this work we focus
on company entity matching, where company name, location
and industry are taken into account. Our contribution is a
highly scalable, enterprise-grade end-to-end system that uses
rule-based linkage algorithms in combination with a machine
learning approach to account for short company names. Linkage
time is greatly reduced by an efficient decomposition of the search
space using MinHash. Based on real-world ground truth datasets,
we show that our approach reaches a recall of 91% compared
to 73% for baseline approaches, while scaling linearly with the
number of nodes used in the system.
I. INTRODUCTION
Enterprise artificial intelligence applications require the
integration of many data sources. In such applications, one
of the most important entity attributes to be linked is often
the company name. It acts as a “primary key” across multiple
datasets such as company descriptions, marketing intelligence
databases, ledger databases, or stock market related data. The
technique used to perform such a linkage is commonly referred
to as record linkage or entity matching.
Record linkage (RL) is in charge of joining various representations of the
same entity (e.g., a company, an organization, a product,
etc.) residing in structured records coming from different
datasets [23]. RL has been extensively studied
in recent decades. It was formalized by Fellegi and Sunter
in 1969 [8]. The tutorial by Lise Getoor [10] provides an
excellent overview of use cases and techniques. Essentially,
RL has been used to link entities from different sets or to
deduplicate/canonize entities within a given set. To this extent,
several approaches have been envisaged ranging from feature
matching or rule-based to machine learning approaches.
Typically, RL is performed in batch mode to link a large
number of entities between two or more datasets [10]. A
challenge of enterprise applications is the ever-increasing
amount of unstructured data such as news, blogs and social
media content to be integrated with enterprise data. As a
consequence, RL has to be performed between structured
records and unstructured documents. This large amount of data
may flow in streams for rapid consumption and analysis by
enterprise systems. Therefore, RL needs to be executed on the
fly and with stringent time constraints.
We use the Watson Natural Language Understanding ser-
vice [14] to identify mentions in unstructured text. These
entities are then passed to RL in a structured fashion in
the form of a record containing attributes that, for example,
represent company names, locations, industries and others. RL
is in charge of linking this record against one or multiple
reference datasets.
The main contributions of this work are
1) an end-to-end RL system that is highly scalable and
provides an enterprise-grade RL service,
2) scoring functions for various attribute types together with
a hierarchical scoring tree that allows the efficient and
flexible implementation of multi-criteria scoring, and
3) the automatic extraction of short company names, an
important feature of the company entity, based on condi-
tional random fields.
The paper is organized as follows. Section II presents related
work and discusses the general background of RL. Section III
describes the proposed system in detail. The performance of
the proposed system is discussed in Section IV and Section V
presents future research directions and concludes the paper.
II. BACKGROUND
Various record linkage systems have been proposed in
recent decades [2], [12], [16], [17], [20]. As mentioned in
the introduction, they can usually be divided into rule-based
and machine learning-based systems. Konda et al. [16] have
proposed a system to perform RL on a variety of entity types,
providing great flexibility in defining the linkage workflow.
This system allows the user to select the various algorithms
being used at various stages of the linkage process. Despite
its flexibility, this approach does not address the performance
problem at the center of the class of applications that we are
addressing. The Certus system proposed in [17] exploits graph
differential dependencies for the RL task. Even though there
is no need for an expert to create these graphs manually, a
substantial amount of training data is still needed to leverage
the graphs automatically. However, we cannot apply such
techniques as we consider cases where the amount of training
data is very limited.
In the domain of RL, locality-sensitive hashing (LSH)
methods are generally used to provide entities with signatures
in such a way that similar entities have identical signatures
with high probability [29]. These signatures are commonly
referred to as blocking keys, which denote blocks. Blocks are
used to limit the number of comparisons needed during the
scoring phase.
Fig. 1: Preprocessing and runtime pipeline. (The short name extraction component trains the short name service from DBpedia and the original reference database; the preprocessing pipeline cleans the reference database for blocking and scoring, applies the LSH function, and compiles the reference and blocking-key databases; the runtime pipeline cleans the query, applies the LSH function, retrieves candidate IDs, retrieves and scores the candidates, and compiles the query matches.)
Typical LSH algorithms are MinHash [3], [4], [19] and
SimHash [5], [6]. MinHash can be parametrized by decomposing
the hash functions into rows and bands [19]. The
row-band parameter settings for a desired minimal similarity
threshold can be determined by an “S-curve”. In our current
setup, we chose MinHash and tuned it for a high recall rate,
a key requirement for RL.
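The row-band tradeoff can be made concrete with the standard S-curve formula: for a pair of records with Jaccard similarity s, the probability of sharing at least one band is 1 − (1 − s^r)^b. The following minimal sketch (an illustration, not part of the system itself) evaluates this formula for the configurations considered later in this work:

```python
# Minimal sketch: probability that two records with Jaccard similarity s
# share at least one MinHash band, for b bands of r rows each.
def match_probability(s: float, r: int, b: int) -> float:
    # A band matches only if all r row hashes agree (probability s**r);
    # the pair becomes a candidate if any of the b bands matches.
    return 1.0 - (1.0 - s ** r) ** b

if __name__ == "__main__":
    for r, b in [(4, 10), (5, 18), (6, 30)]:
        probs = ", ".join(f"s={s:.1f}: {match_probability(s, r, b):.1%}"
                          for s in (0.5, 0.6, 0.7, 0.8))
        print(f"r={r}, b={b} -> {probs}")
```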
A first approach to use machine learning techniques for
record linkage was proposed in 2003 by Elfeky et al. [7]. A
trained classifier approach is compared to unsupervised clus-
tering and to probabilistic approaches. Although the trained
classifier outperforms the other approaches, the authors em-
phasize the difficulty of obtaining training data. More recent
studies [11], [26] assess the applicability of neural networks
to record linkage. In particular, Mudgal et al. [26] show
that, compared to “classical” approaches, deep learning brings
significant advantages for unstructured and noisy data.
The major limitation to using machine learning techniques
for record linkage is the difficulty of finding sufficient an-
notated training data. This is especially true with company
names. Moreover, for each new reference dataset introduced in
the system, a specific new training dataset must be developed.
To alleviate this problem, some promising approaches such as
the use of active learning [27] have been proposed. However,
the application of machine learning techniques to record
linkage remains limited at the moment. Nevertheless, machine
learning can be applied to sub-problems within record linkage.
In this work, we propose a novel machine learning-based
technique to extract a short name from a conventional com-
pany name. Full company names usually contain many ac-
companying words, e.g. “Systems, Inc.” in “Cisco Systems,
Inc.”, that contain additional information about a company’s
organizational entity type, its location, line of business, size
and share in the international market. The accompanying
words often vary greatly across datasets. For example, some
systems will have just “Cisco” instead of the conventional
name “Cisco Systems, Inc.”. Short company names (also
referred to as colloquial or normalized company names) represent the
most discriminative substring in a company name string and
are particularly popular in unstructured data sources such as
media publications or financial reports, where many company
mentions are aggregated.
It has been shown by Loster et al. [20] that taking short
(colloquial) company names into account is greatly beneficial
for company record linkage. However, the company entity
matching system described in [20] used a manually created
short company name corpus, whereas in this work we focus
on automated short name extraction.
III. RECORD LINKAGE SYSTEM
As mentioned above, we consider the problem of RL
performed on the fly, i.e. dynamically linking an incoming
record to records in one or more reference datasets. A record is
defined as a collection of attributes, each of which corresponds
to a column in the dataset. Attributes typically include the
company name, street address, city, postal code, country code,
industry, etc. Different reference datasets might not contain the
same attribute types, and/or attributes might be referenced by
different names.
The record linkage system essentially comprises three com-
ponents (Figure 1). Short name extraction is in charge of
training the service to extract short company names. The
preprocessing pipeline prepares the reference datasets. Finally,
the runtime pipeline is responsible for matching incoming
requests against candidate records and returning the best
matches.
A. Short Name Extraction
We use two data sources as the training corpus for the short
name extraction. DBpedia [1], the first source, contains some
65K company entities derived from the English version of
Wikipedia. The company entities contain a name, a label and
a homepage of a company. We use all these fields to derive
a company short name, which, in most cases, appears either
in the label or on the homepage of a company. For example,
the company “Aston Martin Lagonda Limited” has the label
“Aston Martin”. In this and similar cases, based on a handful
of heuristically devised rules, we conclude that “Aston Martin”
is the short name of the company.

Fig. 2: CRF performance for company short name extraction: (a) DBpedia corpus; (b) aggregated DBpedia and commercial corpus. (Bar charts of precision, recall and F1-score for class IN, class OUT, micro average and macro average, together with a class support pie chart.)
Another source of training data is a commercial company
database that contains company entities, such as branches,
subsidiaries and headquarters, all having individual local and
global identifiers. The set of all identifiers associated with a
company can be represented hierarchically. Based on these
hierarchies, we identify the families of companies which are
represented as a tree structure. For each family of companies,
we extracted the common tokens of the company names as
a short name for the entire family. After extracting common
tokens, additional checks were performed to exclude legal
entity types of companies from the token list. The remaining
tokens were combined and used as a short name for all the
company names in the family. For example, from a family of
companies that have two distinct names “Zumu Holdings Pty
Ltd” and “Zumu Foods Pty Ltd”, we extracted “Zumu” to be
the representative short name. Given this data source, we were
able to extract 950K long–short name pairs for training.
In total, more than a million (long name, short name) pairs
were used as the corpus for the automatic extraction of short
names. The task of extracting short names in the case of
the commercial company data is more difficult because the
variability of names within the family of companies is greater,
and the short name is often the most discriminative part of
the name, whereas some other quite discriminative words
should be omitted. As can be seen from the support pie
chart in Figure 2b, indeed, for the overall corpus, where the
commercial company data portion is dominant, the number
of words that should be omitted is slightly greater than the
number of words that should be kept in a short name.
We treat the short name learning process as a sequence
labeling task, where for each word in a sequence, we need to
decide whether the word is kept or omitted from a company
name. Conditional Random Fields (CRF) [18] is one of the
best-performing models applied for sequence labeling [13]. In
our case, we have only two labels: “IN” and “OUT” to indicate
whether the word is included in or omitted from a short name,
respectively.
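The paper does not prescribe a particular CRF implementation or feature set; the sketch below shows one plausible setup using the sklearn-crfsuite package with simple word-shape features, trained on toy examples, to label each token of a company name as IN or OUT.

```python
# Hedged sketch: token-level IN/OUT labeling of company names with a
# linear-chain CRF. The feature set, hyperparameters, and training pairs are
# illustrative assumptions, not the ones used in the paper.
import sklearn_crfsuite

LEGAL_SUFFIXES = {"inc", "inc.", "ltd", "ltd.", "gmbh", "ag", "corp", "corp."}

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "is_legal_suffix": tok.lower() in LEGAL_SUFFIXES,
        "position": i,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def featurize(name):
    tokens = name.split()
    return [token_features(tokens, i) for i in range(len(tokens))]

# Toy training pairs (long name -> IN/OUT label per token).
train = [
    ("Cisco Systems , Inc.", ["IN", "OUT", "OUT", "OUT"]),
    ("Aston Martin Lagonda Limited", ["IN", "IN", "OUT", "OUT"]),
]
X = [featurize(name) for name, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

def short_name(name):
    labels = crf.predict([featurize(name)])[0]
    return " ".join(t for t, l in zip(name.split(), labels) if l == "IN")

print(short_name("Zumu Holdings Pty Ltd"))  # ideally keeps the discriminative token(s)
```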
To evaluate CRF for the task of short name extraction, preci-
sion, recall and F1-score are computed separately for “IN” and
“OUT” classes. We also present micro and macro averages for
each performance measure. The plots for the DBpedia corpus and
for the aggregated DBpedia and commercial company corpus
are shown in Figure 2.
The results demonstrate that CRF is able to distinguish
between discriminative and non-discriminative words in a
company name as all the performance measures are greater
than 0.76 for all the datasets under consideration. Indeed, the
task for DBpedia names is easier, and CRF achieves an overall
accuracy of approximately 0.9 for both classes. For the larger
corpus, the model struggled to recover all words that should
have been included in the short name, yielding a recall of
0.76 for the “IN” class. For the other performance
measures, the values are close to 0.81.
The trained CRF model is applied to extract short names
in the main record linkage system described above, and the
corresponding results are reported in Section IV.
B. Preprocessing Pipeline
The preprocessing pipeline reads records from a given
source format, converts strings into their decomposed UTF-8
representation [32], collapses multiple consecutive spaces
into a single space, and generates a binary database that
supports the efficient retrieval of the records. Once the binary
database has been generated, a blocking key database is built
by computing for each record a set of blocking key values
corresponding to an LSH function. The blocking key database
stores the corresponding record indices for each blocking
key. As discussed in Section II, our implementation uses
MinHash [3], [4] as its LSH function.
The computation of the blocking key is based on a cleaned
version of the company name and the company’s short name.
The cleaning ensures that records with notational variations are
assigned the same blocking key (for instance, by consistently
omitting the legal entity type). This cleaning will generate a
number of incorrect matches that will have to be removed by
the scoring algorithm of the runtime pipeline. Other than for
the computation of the blocking keys, we do not perform any
additional cleaning as data cleaning destroys information [28].
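To make the blocking step concrete, the following self-contained sketch derives banded MinHash blocking keys from a cleaned company name; the shingle size, the hash construction, and the cleaning rules beyond Unicode decomposition and whitespace collapsing are illustrative assumptions rather than the exact choices of the preprocessing pipeline.

```python
# Hedged sketch: banded MinHash blocking keys over character shingles of a
# cleaned company name. Shingle size, hash construction, and cleaning rules
# are illustrative assumptions.
import hashlib
import unicodedata

R, B = 6, 30                      # rows per band, number of bands (cf. Section IV)

def clean_for_blocking(name: str) -> str:
    name = unicodedata.normalize("NFD", name)     # decomposed Unicode form
    name = " ".join(name.split())                 # collapse whitespace
    return name.lower()

def shingles(text: str, k: int = 3):
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _hash(value: str, seed: int) -> int:
    digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def blocking_keys(name: str):
    text = clean_for_blocking(name)
    sig = [min(_hash(s, seed) for s in shingles(text)) for seed in range(R * B)]
    # One blocking key per band: a hash of that band's r minimum values.
    return {hashlib.sha1(repr(sig[b * R:(b + 1) * R]).encode()).hexdigest()
            for b in range(B)}

# Two notational variants of the same company will very likely share a key.
print(bool(blocking_keys("Cisco Systems, Inc.") & blocking_keys("Cisco Systems Inc")))
```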
C. Runtime Pipeline
The runtime pipeline links entity queries to the entities
stored in the entity database. It computes the blocking keys
and retrieves the corresponding candidate entities. It also
transforms the query into a more efficient representation in the
form of a scoring tree which is evaluated against the candidate
entities. Once the scores for the candidate records have been
computed, they are sorted.
The scoring tree uses different scoring algorithms, depend-
ing on the type of data to be processed. If the data describes an
address, we use a geographic scoring, whereas if it describes a
company name, we use a scoring algorithm tuned for company
names. If multiple types of data are present, the scoring
tree combines the scores into a single value. More formally,
the scoring tree represents a scoring function $s(R_q, R_r)$ that
evaluates the similarity between a query record $R_q$ and a
record $R_r$ in the reference dataset such that $s(R_q, R_r) \in [0, 1]$
and $s(R_q, R_r) = 1$ iff the records $R_q$ and $R_r$ are identical.
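The internal structure of the scoring tree is not spelled out further here, so the following sketch is only one plausible shape: leaf nodes wrap per-attribute scorers and inner nodes combine child scores, assuming a weighted-average combiner and illustrative weights.

```python
# Hedged sketch of a scoring tree: leaves score single attributes, inner nodes
# combine child scores. The weighted-average combiner and the weights are
# illustrative assumptions, not the combination used in the actual system.
class Leaf:
    def __init__(self, attribute, scorer):
        self.attribute, self.scorer = attribute, scorer

    def score(self, query, reference):
        q, r = query.get(self.attribute), reference.get(self.attribute)
        return self.scorer(q, r) if q and r else 0.0

class WeightedNode:
    def __init__(self, children_with_weights):
        self.children = children_with_weights    # list of (node, weight) pairs

    def score(self, query, reference):
        total = sum(w for _, w in self.children)
        return sum(w * c.score(query, reference) for c, w in self.children) / total

def exact(a, b):                                 # placeholder attribute scorer
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

# Example tree: the company name dominates, the location contributes the rest.
tree = WeightedNode([
    (Leaf("name", exact), 0.7),
    (WeightedNode([(Leaf("city", exact), 0.6), (Leaf("country", exact), 0.4)]), 0.3),
])
print(tree.score({"name": "IBM", "city": "Zurich", "country": "CH"},
                 {"name": "IBM", "city": "Zurich", "country": "CH"}))  # 1.0
```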
Scoring Company Names. In order to score company
names, we started with different string similarity functions,
such as the Jaccard similarity $j(\cdot)$ on which MinHash is
based, or the Levenshtein distance $l(\cdot)$, which we convert into a
similarity function to obtain a score value in $[0, 1]$. Hence, we
compute the Jaccard and Levenshtein scores as follows, where
$n_1$ and $n_2$ represent company names of length $|n_1|$ and $|n_2|$:

$$s_{Jac}(n_1, n_2) = j(n_1, n_2) \quad\text{and}\quad s_{Lev}(n_1, n_2) = 1 - \frac{l(n_1, n_2)}{|n_1| + |n_2|}$$
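As a concrete reference, here is a direct transcription of the two base scores in Python; the Jaccard similarity is computed over word tokens, which is an assumption since the text does not state the unit of tokenization.

```python
# Hedged sketch of the two base scores. The Levenshtein distance is the
# standard dynamic-programming edit distance; the Jaccard similarity is
# computed over word tokens (an assumption about the tokenization).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def s_lev(n1: str, n2: str) -> float:
    return 1.0 - levenshtein(n1, n2) / (len(n1) + len(n2))

def s_jac(n1: str, n2: str) -> float:
    t1, t2 = set(n1.lower().split()), set(n2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 1.0

print(s_lev("Cisco Systems, Inc.", "Cisco Systems Inc"))
print(s_jac("Cisco Systems, Inc.", "Cisco Systems Inc"))
```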
A first limitation we observed is that both Jaccard and
Levenshtein scores give too much weight to diacritics. A naïve
approach is simply to remove all the diacritics as part of the
cleaning step. However, there are company names that are only
differentiated by the presence of diacritics. To tackle this prob-
lem, we leverage a property of Unicode representation where
diacritics are represented as special combining characters. The
combining characters are given a lower weight in the scoring
process.
Another challenge is to deal with legal entity types of
companies such as “inc.” or “ltd.”, which may or may not
be included in the company name. In our initial attempt, we
simply removed these legal entity type identifiers. However,
we soon came across companies where the names differ only
in the legal entity type but are actually distinct companies. This
is one of several occurrences where cleaning had a negative
effect on scoring, which confirms the observations made by
Randall et al. [28]. Generally, one approach to alleviate the
problem related to special mentions (e.g. legal entity types)
is to assign them a lower weight in the scoring process.
Therefore we adopted the approach of assigning legal entity
types the same weight as a single character minus a small
value of $\epsilon = 1/256$ (the smallest value available in our weight
representation). Subtracting $\epsilon$ allows us to give
precedence to changes that are not in the legal entity type.
In some situations, the city name can be included in
the company name. For example, IBM Research Zurich is
sometimes indicated as IBM Research if it is clear from the
context that the geographic region is Switzerland. To handle
this, we detect city name mentions in a company name and
reduce its weight if the city is in the company’s vicinity. This
allows more flexibility with regard to names. To look up city
names, we use a fast trie implementation.
Additionally, as described previously, we derive for each
company name a short name. Words that are part of the short
name are weighted three times the normal weight. This ap-
proach allows us to place more emphasis on the characteristic
words of the company compared to other elements present in
the name.
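Putting these weighting rules together, the sketch below assigns per-character weights before a weighted similarity is computed; the 3x short-name weight and the 1 − 1/256 legal-entity weight follow the text above, while the weights for combining characters and nearby city mentions are illustrative assumptions.

```python
# Hedged sketch: per-character weights feeding a weighted similarity.
# The 3x short-name weight and the (1 - 1/256) legal-entity weight come from
# the text; the weights for combining characters and city mentions are
# illustrative assumptions.
import unicodedata

EPSILON = 1.0 / 256
LEGAL_TOKENS = {"inc", "ltd", "gmbh", "ag", "llc"}

def _key(token):
    return unicodedata.normalize("NFC", token).lower().strip(".,")

def char_weights(name, short_name_tokens, nearby_cities):
    """Return one weight per character of the NFD-decomposed name."""
    weights = []
    for token in unicodedata.normalize("NFD", name).split():
        base = 1.0
        if _key(token) in LEGAL_TOKENS:
            # A legal-entity token weighs like a single character, minus epsilon.
            base = (1.0 - EPSILON) / len(token)
        elif _key(token) in nearby_cities:
            base = 0.3          # assumption: down-weight a city mention in the name
        elif _key(token) in short_name_tokens:
            base = 3.0          # short-name tokens count three times
        for ch in token:
            # Combining marks (diacritics) barely count; the 0.1 is an assumption.
            weights.append(base * (0.1 if unicodedata.combining(ch) else 1.0))
    return weights

print(char_weights("IBM Research Zürich AG", {"ibm"}, {"zürich"}))
```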
In the following, we represent the Levenshtein and Jaccard
similarities that incorporate these modifications as $s_{Lev'}$ and $s_{Jac'}$.
The final company name score is computed as:

$$s(n_1, n_2) = 0.9 \cdot \max(s_{Jac'}(n_1, n_2), s_{Lev'}(n_1, n_2)) + 0.1 \cdot \min(s_{Jac'}(n_1, n_2), s_{Lev'}(n_1, n_2))$$
The rationale behind this choice is that the Jaccard score
allows for word permutations, whereas the Levenshtein score
relies on the character sequence. We do not simply use the
maximum of the two similarities because, in certain cases, the
Jaccard similarity may return a similarity of 1 for names that
are different. The min term ensures that a match with a Jaccard
similarity of 1 is not chosen coincidentally over a Levenshtein
similarity of 1, which is only possible if the strings are equal.
The values of 0.9 and 0.1 have been chosen arbitrarily.
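Numerically, the combination rule is simple; the sketch below (reusing plain Jaccard and Levenshtein scores in place of the primed variants, for brevity) also reproduces the arithmetic used later to compare max-min with the weighted average.

```python
# Hedged sketch of the final combination: 0.9 * max + 0.1 * min of the scores.
def rls_name_score(jac: float, lev: float) -> float:
    return 0.9 * max(jac, lev) + 0.1 * min(jac, lev)

# Why not the arithmetic mean: with jac = 1.0 and lev = 0.4 the mean is 0.7
# (barely above a typical threshold), whereas max-min stays close to the
# Jaccard score.
print(rls_name_score(1.0, 0.4))          # 0.94
print(0.5 * 1.0 + 0.5 * 0.4)             # 0.70
```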
We dubbed this measure RLS. In Section IV we will
compare this approach against the
1) Jaccard score ($s_{Jac}$),
2) Levenshtein score ($s_{Lev}$),
3) weighted average of the Jaccard and Levenshtein scores
($s_{weighted} = 0.5 \cdot s_{Jac} + 0.5 \cdot s_{Lev}$), and the
4) max-min scoring function ($s_{maxmin} = 0.9 \cdot \max(s_{Jac}, s_{Lev}) + 0.1 \cdot \min(s_{Jac}, s_{Lev})$),
i.e., the RLS formula with the optimizations for diacritic
characters, legal entity types, and city mentions disabled
(max-min).
Scoring Other Attributes. Multiple attributes can be taken
into consideration for scoring; in this section we describe
geolocation and industry scoring. A geographical location is
represented by an address element. This element contains the
street address, postal code, city and country code attributes.
Each component is scored using a specific algorithm. The
street address is currently scored using a tokenized string
matching (e.g. Levenshtein tokenized distance [25]). This
provides a reasonable measure between street address strings,
especially if street number and street name appear in different
orders. Postal codes are evaluated according to the number
of matching digits or characters. The rationale behind this
approach is that, to the best of our knowledge, the vast majority
of postal code systems are organized in a hierarchical fashion.
However, this scoring can be improved by using a geographic
location lookup service. The city is scored using the Haversine
distance [15] if the GPS location is available in the reference
dataset. To retrieve the GPS location of the city mentioned in
the query record, we use a trie data structure, which contains
the names and GPS position of some 195,000 cities worldwide
obtained from geonames.org [9]. To evaluate the score, we
compute the Haversine distance between the cities and map it
to a similarity via an exponential decay, yielding values in (0, 1]. As a fallback, if the GPS
position is not available or the city in the query record cannot
be found in the trie, we use the Levenshtein score (described
previously) between city names. Finally, the country code
score simply returns 1 if the country matches and 0 otherwise.
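A sketch of the city and country scores described above; the Haversine formula is standard, while the 50 km decay constant of the exponential is an assumption, as the exact decay is not stated here.

```python
# Hedged sketch of the geographic scores: Haversine distance between two GPS
# positions mapped to (0, 1] via exponential decay, plus an exact country
# match. The 50 km decay constant is an illustrative assumption.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def city_score(gps_query, gps_reference, decay_km=50.0):
    return math.exp(-haversine_km(*gps_query, *gps_reference) / decay_km)

def country_score(c1, c2):
    return 1.0 if c1.upper() == c2.upper() else 0.0

# Rüschlikon vs. Zurich: a few kilometres apart, so the city score stays high.
print(city_score((47.3069, 8.5506), (47.3769, 8.5417)), country_score("ch", "CH"))
```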
Industries are typically represented by four-digit Standard
Industry Classification (SIC) codes [30]. Similar to postal
codes, SIC industry codes are also hierarchical: The first
two left-most digits represent a “Major Group” (e.g. Mining,
Manufacturing and others), the following digit is the “Indus-
trial Group” and the last digit is the specific industry within
the industrial group. When representing an industry, codes
of variable length can be used, depending on the level of
generality of the representation. To evaluate the industry score,
we use a measure similar to the one used for postal codes.
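Since both postal codes and SIC codes are hierarchical, a single prefix-based score can serve both; the normalization by the longer code length in the sketch below is an assumption about how variable-length codes are handled.

```python
# Hedged sketch: score two hierarchical codes (postal or SIC) by the length of
# their common prefix relative to the longer code. Normalizing by the longer
# length is an assumption about handling variable-length codes.
def prefix_score(code1: str, code2: str) -> float:
    code1, code2 = code1.strip(), code2.strip()
    if not code1 or not code2:
        return 0.0
    common = 0
    for a, b in zip(code1, code2):
        if a != b:
            break
        common += 1
    return common / max(len(code1), len(code2))

print(prefix_score("8803", "8804"))   # same postal region, different town -> 0.75
print(prefix_score("3571", "35"))     # SIC industry vs. its major group -> 0.5
```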
D. Implementation
The design of our RL system is driven by three main goals:
versatility, speed, and scalability.
Versatility is given by the generality of the approach. As we
have shown in the previous sections, the various components
have been designed to be able to accommodate virtually any
reference dataset and to perform RL on a large variety of
entities. The scoring function set can be extended to other
attribute types, e.g. product names, person names, and others.
Also, the scoring tree can be adapted to accommodate these
new attribute types with appropriate combining functions. The
central element of the system is a generic “linker”, which
can easily be configured to load a preprocessed dataset and
perform linkage. To maximize performance in terms of speed,
the linker has been written in C++ and loads the entity database
into memory. Therefore, once the linker is started and initial-
ized, all operations are performed in memory. Additionally, the
linker uses a multi-threaded approach such that RL requests
can be processed in a parallel fashion to exploit the cores
available on the physical system. Each reference dataset and,
therefore, the associated entity databases are loaded into a
specific linker.
To ensure scalability, we have adopted a containerized
approach; each linker runs in an individual container. In
conjunction with a container orchestration system, such as Ku-
bernetes, it is possible to run and dispatch linkers on multiple
physical machines. This approach allows linear scaling with
the number of nodes that are added to the cluster as well
as the ability to run linkages simultaneously against multiple
datasets. Moreover, the overall system is resilient to node
failures, which is an important characteristic for an enterprise-
grade application.
IV. EVALUATION
We evaluated our RL system across multiple dimensions.
First, we assessed the parameters under which it yields the best
performance. That is, identifying the right tradeoffs between
memory usage and performance as well as the different scoring
strategies available. Second, we compare our scoring algorithm
to using just standard scoring algorithms. Third, we compare
our RL system to two baseline systems: using a simple case
insensitive string lookup of the company name (to identify
the number of trivial matches) as well as using Apache Solr,
a state-of-the-art distributed indexing system that powers the
search and navigation features of many of the world’s largest
internet sites. Fourth, we evaluate the scalability in relation to
the number of parallel clients accessing the system [31].
To evaluate these dimensions, we rely on two commercial
but publicly available company databases. The first company
database comprises 150 million records and is used by any
company that engages in government contracts in the United
States. The second is used by financial analysts and comprises
approximately 15 million records. Finally, we also use internal
accounting data comprising 2 million records. We denote these
company databases as 150m, 15m, and 2m respectively.
Swiss dataset. We randomly selected 450 companies located
in Switzerland from the 15m database and manually matched them
against the 150m database. The use case behind this dataset is
to identify local companies with a random mix between small,
medium, and big companies as would be encountered by a user
with a strong local interest. Switzerland was chosen for two
reasons: (i) it has four different official languages allowing us
to assess the system in combination with different languages
and (ii) our familiarity with the region was instrumental in
correctly identifying records referring to the same company.
Company records were divided into the following types of
records:
Matched (296 records, 196 unique records) that show all cor-
rect matches (potentially multiple matches, for instance,
if one database was missing the address or listed many
subsidiaries).
Unmatched (114 records) where no corresponding company
was present in the other database.
Undecided (80 records) where we were unable to decide con-
clusively whether companies are the same or if one of the
companies had been renamed, e.g. following a merger.
Undecided records were counted neither as true positives nor
as negatives but only as false positives if they were matched
against a different record.
Accounting dataset. We leveraged internally available fi-
nancial data that maps accounting company data from the
2m database to the 150m database. This dataset consists of
55k records. As this linkage was manually performed by
domain experts, we can assume that >99% of detected links
are accurate. The use case behind this dataset is to match
accounting data against a reference database, as it is usually
performed in large companies.
News dataset. The dataset is based on a random selection
of 104 current news articles from different sources. It was
manually curated and lists for each article the companies that
should be found from the reference company database. The
use case behind this dataset is to mine data about companies
from unstructured data sources.

Fig. 3: S-curves for the row-band configurations r = 4, b = 10; r = 5, b = 18; and r = 6, b = 30 (left: full; right: zoomed; x-axis: Jaccard similarity; y-axis: matching probability).
A. Performance Tuning
In this section, we evaluate different tuning parameters and
their effects on the performance of our RL system. First,
we analyze different MinHash row-band configurations. From
experience, we know that correct matches typically have a
Jaccard similarity greater than 0.8. However, some correct
matches have a score as low as 0.6. Using these numbers, we
have chosen three row-band configurations such that records
with a Jaccard similarity of at least 0.8 are matched with a
probability > 99% and those with a similarity between 0.6 and 0.8 with
a probability > 75%. Considering that correct matches with a
score < 0.8 are rare, the 75% figure has been arbitrarily chosen
as a tradeoff between performance and matching accuracy. The
row-band configurations are shown in Figure 3 and Table I.
MinHash     4/10     5/18     6/30
σ = 0.5     47.5%    43.5%    37.6%
σ = 0.6     75.0%    76.7%    76.1%
σ = 0.7     93.5%    96.3%    97.6%
σ = 0.8     99.4%    99.9%    99.9%
TABLE I: MinHash matching probabilities
A higher number of rows and bands yields a sharper S-
curve. Hence, entities with a low score are less probable to
be considered a match. However, this comes at the expense of
having to compute more MinHashes (rows ×bands) as well
as consuming more memory to store the additional bands.
Table II shows the results of linking the Swiss dataset
against the 150m database. The recall figures for the different
configurations are between 86.6% and 87.2%. This is not
surprising, considering that the S-curve was configured to
capture all company names with a Jaccard similarity of 0.8
with a > 99% probability.

MinHash          4/10       5/18       6/30
recall           86.67%     87.18%     87.18%
database size    38.8 GiB   57.6 GiB   99.5 GiB
comparisons      72.9k      55.0k      23.6k
TABLE II: Memory and performance comparison

Fig. 4: Different scoring functions (Jaccard, Levenshtein, weighted, max-min, RLS) for different MinHash configurations: r = 4, b = 10 (encircled data point), followed by r = 5, b = 18 and r = 6, b = 30 (x-axis: precision [%]; y-axis: recall [%]).

Memory consumption grows almost
linearly with the number of bands. The number of comparisons
necessary, however, looks surprising, because the S-curves
are relatively close to each other. Once we consider that the
score distribution among candidate entities is heavy-tailed,
with considerably more candidate entities having a low Jaccard
similarity, this difference is easily explained.
Finally, the time spent computing the 6 × 30 MinHashes for
each record to be matched was negligible given that, as the
table shows, each record must be compared with 23 to 72
thousand candidate records.
We chose the configuration with 30 bands for our RL system
as it provides the best tradeoff between memory consumption
and comparison operations required.
B. Scoring Evaluation
In Section III-C, we have described our algorithm for scor-
ing company names. Figure 4 shows the precision and recall
numbers for the different strategies. The similarity functions
are shown for the row-band configurations discussed previously:
r = 4, b = 10 (encircled), followed by r = 5, b = 18
and r = 6, b = 30. The results are very similar for the
different band configurations, and slightly better for those with
higher band numbers, which is consistent with the fact that the
matching probability for similarities > 0.6, which include most
outliers, is slightly higher for higher band numbers.
The Jaccard similarity has a lower recall than the Leven-
shtein similarity because the former is more sensitive to small
changes in the name such as diacritics. As a consequence, its
precision is higher. The weighted approach lies somewhere in
the middle.
The max-min strategy compared to the weighted strategy
yields similar results in terms of recall but with lower preci-
sion. This can be explained in the case where two company
names have a “high” Jaccard score and a “low” Levenshtein
score. For example, if the Jaccard score is 1.0 and the
Levenshtein score is 0.4, then the arithmetic mean is 0.7,
which is barely above our threshold, whereas max-min yields
a score of 0.94, which closely resembles the Jaccard similarity.
The comprehensive RLS approach shows significant im-
provements in terms of recall. The precision is similar to the
max-min strategy but below the weighted or Jaccard strategies.
This is because it finds matches for records in the ground truth
dataset that have no corresponding matches in the reference
database. In this case it is almost impossible to discern close
matches from non-matches. Considering that we favor recall
over precision, this is an acceptable tradeoff.
C. Matching Accuracy
In this section, we evaluate the matching accuracy of the RL
system in terms of precision and recall. We compare to a trivial
case insensitive string comparison approach, to identify the
number of trivial matches, as well as to a Solr based approach.
Solr was configured in a manner to allow for flexible search
operations that do not impose rigid restrictions on the type
and structure of the query terms. A default search field based
on the “solr.TextField” class including standard tokenization
and lowercasing was used to copy all the relevant company
attributes into one multivalued field. This setup allows us to
submit compact query data, e.g. company name only, as well
as complex query phrases that contain a company name and
arbitrary additional attributes like address, city, and country.
The performance results for all our datasets (Swiss, Ac-
counting, News) are summarized in Table III.
Swiss dataset. For the first use case, we see that the trivial
matching algorithm is already able to correctly match 57% of
the records of the Swiss dataset. We assume that this is because
many companies ensure that their information is correctly
stored in the 150m and 15m databases. Precision is at 100%
because all matched names have been identified correctly.
Compared to the trivial approach, Solr is able to match
another 15% of the records, i.e., 35% of those records not
matched by the trivial approach, typically when the name
is similar or has been slightly shortened or extended. RLS
is able to match yet another 19% of the records that have
not been matched by Solr, or in other terms, 79% of the
records not matched by the trivial approach. The reason is
that RLS is aware of matching semantics of different artifacts
that compose a company record. The relatively low precision
is explained by the fact that both Solr and RLS try to find a
match for every record. The precision for Solr is lower as
it is less specialized for the task of company matching.
Accounting dataset. The recall values for the accounting
dataset mostly mirror those of the Swiss dataset, even though it
contains mostly bigger and international companies that
frequently use English words in their names. Interestingly, the
trivial matching performs much worse. This seems to be because
the accounting database with 2m records is only available
internally and hence at times uses unofficial name variations
of a company. These variations are mostly trivial and hence
both Solr and RLS perform similarly to the Swiss dataset.
The precision and recall values are identical for Solr and
RLS because this dataset only contains records that are present
in both databases and both systems have returned a match for
every record.

             Trivial     Solr       RLS
Swiss        R: 57%      R: 72%     R: 91%
             P: 100%     P: 41%     P: 65%
Accounting   R: 28%      R: 73%     R: 89%
             P: 100%     P: 73%     P: 89%
News         R: 33%      R: 42%     R: 64%
             P: 37%      P: 42%     P: 58%
TABLE III: Matching performance: Trivial vs. Solr vs. RLS
News dataset. For this dataset, as mentioned previously,
we use Watson NLU to identify company names in news
articles. Subsequently, these names are matched to the 150m
database. This task is much harder as very limited context
is available and company names may vary substantially. This
leads to lower accuracy results. Due to its specialization and
extra processing, RLS again outperforms Solr.
It has to be noted that we only used the company name
as other attributes are often not reliable or not present in the
unstructured case. For example, an article may contain several
company names, cities and countries and therefore it can be
ambiguous to an automated entity resolution system which
city/country refers to a given company.
D. Scalability
Each RLS instance accepts up to 8 parallel client requests.
A request can contain multiple individual queries that are
distributed over 4 threads. This gives a theoretical maximum
of 32 queries being processed in parallel. Each instance runs
on a server with two Intel® Xeon® E5-2630 CPUs at 2.2 GHz
(40 threads in total) and 400 GB of memory.
Fig. 5: Scalability analysis of the RL system (x-axis: number of clients; left y-axis: CPU load; right y-axis: time per request [ms]; curves: load and time/request).
We have deployed this service on a node which is part of
our Kubernetes cluster. Requests are issued by a range of one
to twelve parallel clients. Each client sends 10’000 requests,
each containing 80 queries. The scalability results are shown
in Figure 5. Up to eight clients, the CPU load increases almost
linearly. We notice that the average processing time decreases
as a benefit of parallel processing. It converges to a value of
17ms per request. With more than eight clients, the CPU load
and the performance gain level off as requests start to be queued.
Scaling requests over multiple nodes is performed by the
load balancer of Kubernetes. Since each instance keeps its
own copy of the reference company database and hence
runs independently, no performance penalties are incurred by
Kubernetes scaling the service over multiple nodes.
V. CONCLUSIONS AND FUTURE WORK
The proposed RL system is able to accurately match about
30% more records compared to the baselines. This improve-
ment is due to two contributions: (i) the introduction of
short company name extractions and their use both in the
preprocessing phase as well as in the scoring phase and (ii)
specific improvements of the scoring function, namely taking
into account diacritic characters, legal entity type, and the
ability to identify geographic locations in a company name.
Additionally, despite being deployed on a single node
only in our three-node cluster, the system is capable of an
aggregated processing time of 17ms per record, which means
that we are able to match approximately 5M records per day.
These performance figures scale linearly with the number of
nodes, making the system perfectly suited for analyzing high-
volume streamed content.
Our future work will: (i) explore the use of other LSH
functions such as SimHash [5] to assess whether our recall
values can be improved further, (ii) maintain automatic pa-
rameter learning and automatic training dataset augmentation,
(iii) consider the historical evolution of company names and
additional company modeling [21], [22], [24].
REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives.
DBpedia: A nucleus for a web of open data. In Proceedings of the 6th
International The Semantic Web and 2nd Asian Conference on Asian
Semantic Web Conference, pages 722–735. Springer-Verlag, 2007.
[2] L. Barbosa, V. Crescenzi, X. L. Dong, P. Merialdo, F. Piai, D. Qiu,
Y. Shen, and D. Srivastava. Big data integration for product specifica-
tions. IEEE Data Engineering Bulletin, 41(2):71–81, June 2018.
[3] A. Z. Broder. On the resemblance and containment of documents. In
Proceedings. Compression and Complexity of SEQUENCES 1997, pages
21–29, June 1997.
[4] A. Z. Broder, M. S. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-
Wise independent permutations (extended abstract). In Proceedings of
the 30th ACM Symposium on the Theory of Computing, 1998.
[5] M. S. Charikar. Methods and apparatus for estimating similarity. US
Patent No. 7158961, 2001.
[6] M. S. Charikar. Similarity estimation techniques from rounding algo-
rithms. In Proceedings of the 34th ACM Symposium on the Theory of
Computing, 2002.
[7] M. Elfeky, V. Verykios, A. Elmagarmid, T. Ghanem, and A. Huwait.
Record linkage: A machine learning approach, a toolbox, and a digital
government web service. Technical Report 1573, Purdue University,
2003. Computer Science Technical Reports.
[8] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of
the American Statistical Association, 64(328):1183–1210, 1969.
[9] GeoNames. https://www.geonames.org/.
[10] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice
& open challenges. Proceedings of the VLDB Endowment, 5(12):2018–
2019, 2012.
[11] R. D. Gottapu, C. H. Dagli, and A. Bahrami. Entity resolution using
convolutional neural network. Procedia Computer Science, 95:153–158,
Nov. 2016.
[12] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current
practice and future directions. CSIRO Mathematical and Information
Sciences Technical Report, 3, June 2003.
[13] Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for
sequence tagging. CoRR, abs/1508.01991, 2015.
[14] IBM. Watson Natural Language Understanding. https://www.ibm.com/watson/services/natural-language-understanding/.
[15] J. Inman. Navigation and Nautical Astronomy: For the Use of British
Seamen (3rd ed.). London, UK: W. Woodward, C. & J. Rivington, 1835.
[16] P. Konda, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra,
S. Das, P. Suganthan G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li,
F. Panahi, and H. Zhang. Magellan: Toward building entity matching
management systems over data science stacks. Proceedings of the VLDB
Endowment, 9(12):1581–1584, Aug. 2016.
[17] S. Kwashie, J. Liu, J. Li, L. Liu, M. Stumptner, and L. Yang. Certus: An
effective entity resolution approach with graph differential dependencies
(GDDs). Proceedings of the VLDB Endowment, 12(6):653–666, Feb.
2019.
[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence
data. In Proceedings of the 18th International Conference on Machine
Learning (ICML), pages 282–289, 2001.
[19] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive
Datasets. Cambridge University Press, 2nd edition, 2014.
[20] M. Loster, Z. Zuo, F. Naumann, O. Maspfuhl, and D. Thomas. Improv-
ing company recognition from unstructured text by using dictionaries.
In Proceedings of the 20th International Conference on Extending
Database Technology (EDBT), pages 610–619, 2017.
[21] K. Mirylenka, C. Miksovic, and P. Scotton. Applicability of latent
dirichlet allocation for company modeling. In Industrial Conference
on Data Mining (ICDM’2016), 2016.
[22] K. Mirylenka, C. Miksovic, and P. Scotton. Recurrent neural net-
works for modeling company-product time series. Proceedings of
2nd ECML/PKDD Workshop on Advanced Analytics and Learning on
Temporal Data (AALTD), pages 29–36, 2016.
[23] K. Mirylenka, P. Scotton, C. Miksovic, and S.-E. B. Alaoui. Linking IT
product records. In Proceedings of the Data Integration and Applications
Workshop (DINA), 2019.
[24] K. Mirylenka, P. Scotton, C. Miksovic, and J. Dillon. Hidden layer
models for company representations and product recommendations. In
Advances in Database Technology - 22nd International Conference on
Extending Database Technology (EDBT), pages 468–476, 2019.
[25] K. Mirylenka, P. Scotton, C. Miksovic, and A. Schade. Similarity match-
ing system for record linkage. US Patent Application P201704804US01,
2018.
[26] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep,
E. Arcaute, and V. Raghavendra. Deep learning for entity matching:
A design space exploration. In Proceedings of the 2018 International
Conference on Management of Data (SIGMOD), pages 19–34, 2018.
[27] K. Qian, L. Popa, and P. Sen. Active learning for large-scale entity
resolution. In Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM), pages 1379–1388, Nov. 2017.
[28] S. M. Randall, A. M. Ferrante, J. H. Boyd, and J. B. Semmens.
The effect of data cleaning on record linkage quality. BMC Medical
Informatics and Decision Making, 13(1), June 2013.
[29] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor:
Towards removing the curse of dimensionality. Theory of Computing,
8:321–350, 2012.
[30] What is a SIC code? https://siccode.com/page/what-is-a-sic-code.
[31] Solr. https://lucene.apache.org/solr/.
[32] The Unicode Consortium. The Unicode® Standard, 2019. https://www.unicode.org/standard/standard.html.
... Since the early 2000's supervised machine learning approaches such as decision trees or logistic regression have been used, and meanwhile deep learning has found its way into the RL process [6]. The use of deep learning for RL can improve results, especially for unstructured and messy data [7,8]. A key challenge is that using machine learning for RL requires large amounts of labelled training data with high data quality. ...
... Creating such training data requires a lot of manual effort. Thus, the limited amount of training data is a bottleneck for supervised learning in RL [7,9]. Another reason why handling a substantial volume of training data presents difficulties is the practical impossibility of merging data from various sources in a centralized location due to factors like data privacy, legal requirements, or limited resources [10]. ...
... For example, in the data preparation, legal forms are classified and standardized by neural networks [4], blocking is performed with deep learning approaches [42], [43], word embeddings are applied to compare candidate pairs [44], [45] and the classification into match and non-match is done by using neural networks [46]. The major limitation to the application of ML in DI is the lack of sufficient amounts of annotated training data [7]. For DI to be performed with high quality, robust models that can handle diverse data problems are needed. ...
Conference Paper
Data integration is utilized to integrate heterogeneous data from multiple sources, representing a crucial step to improve information value in data analysis and mining. Incorporating machine and deep learning into data integration has proven beneficial, particularly for messy data. Yet, a significant challenge is the scarcity of training data, impeding the development of robust models. Federated learning has emerged as a promising solution to address the challenge of limited training data in various research domains. Through collaborative model training, robust model training is enabled while upholding data privacy and security. This paper explores the potential of applying federated learning to data integration through a structured literature review, offering insights into the current state-of-the-art and future directions in this interdisciplinary field.
... Although knowledge graphs (KGs) and ontologies have been exploited successfully for data integration [Trivedi et al. 2018;Azmy et al. 2019], entity matching involving structured and unstructured sources has usually been performed by treating records without explicitly taking into account the natural graph representation of structured sources and the potential graph representation of unstructured data [Mudgal et al. 2018;Gschwind et al. 2019]. To address this limitation, we propose a methodology for leveraging graph-structured information in entity matching. ...
... The resulting training graph contains approximately 40k nodes organized into 1.7k business entities. As a data-augmentation step, we generate an additional canonical or normalized version of a company name and link it to the real name in the graph, using conditional random fields, as described in [Gschwind et al. 2019]. This step yields an enriched training graph with 70k nodes. ...
... Experiments. We compared our S-GCN model against three baselines, namely (i) a record-linkage system (RLS) designed for company entities [Gschwind et al. 2019], (ii) a Table 1: Accuracy of entity matching on the test set feed-forward neural network (NN), and (iii) a model based on graph convolutional networks (GCN). Both the GCN and NN models use BERT features as input and a softmax output layer. ...
Article
Data integration has been studied extensively for decades and approached from different angles. However, this domain still remains largely rule-driven and lacks universal automation. Recent developments in machine learning and in particular deep learning have opened the way to more general and efficient solutions to data-integration tasks. In this paper, we demonstrate an approach that allows modeling and integrating entities by leveraging their relations and contextual information. This is achieved by combining siamese and graph neural networks to effectively propagate information between connected entities and support high scalability. We evaluated our approach on the task of integrating data about business entities, demonstrating that it outperforms both traditional rule-based systems and other deep learning approaches.
... Formally, the address matching task may be considered as a binary classification problem [4,3,5,6,7] where the predicted class is either Match or No Match. However, given two companies with the same name, it is important to identify addresses that are partially similar, such as those having the same city and the same road but differ in the house number or in the case where both addresses are correct but one of them corresponds to a former address company, in order to complete addresses with up-to-date information. ...
... Former address matching approaches [6,7] are based on similarity measures and matching rules. However, these methods perform a structural comparison of addresses and are unable to identify some relationship between two addresses when they have few literal overlaps [3]. ...
Article
Full-text available
In this paper, we describe a solution for a specific Entity Matching problem, where entities contain (postal) address information. The matching process is very challenging as addresses are often prone to (data) quality issues such as typos, missing or redundant information. Besides, they do not always comply with a standardized (address) schema and may contain polysemous elements. Recent address matching approaches combine static word embedding models with machine learning algorithms. While the solutions provided in this setting partially solve data quality issues, neither they handle polysemy, nor they leverage of geolocation information. In this paper, we propose GeoRoBERTa, a semantic address matching approach based on RoBERTa, a Transformer-based model, enhanced by geographical knowledge. We validate the approach in conducting experiments on two different real datasets and demonstrate its effectiveness in comparison to baseline methods.
... Quite often, these vast amounts of data include data that refer to persons. Due to the many different attributes that refer to the same person, it is very common for organizations and data controllers to keep duplicate instances that, in some cases, may be identical but, in most, differ slightly Information 2022, 13,116 2 of 15 and could, thus, be mistakenly treated as referring to different persons. Furthermore, as data volumes grow, storage also needs to increase, rendering the minimization of storage space a key challenge in order to build more efficient backup processes [4]. ...
... Based on this, Gschwind et al. introduced their proposed solution which comprises rule-based linkage algorithms and ML models. Their study achieved a 91% recall rate on a real-world dataset [13]. ...
Article
Full-text available
Analysis of extreme-scale data is an emerging research topic; the explosion in available data raises the need for suitable content verification methods and tools to decrease the analysis and processing time of various applications. Personal data, for example, are a very valuable source of information for several purposes of analysis, such as marketing, billing and forensics. However, the extraction of such data (referred to as person instances in this study) is often faced with duplicate or similar entries about persons that are not easily detectable by the end users. In this light, the authors of this study present a machine learning- and deep learning-based approach in order to mitigate the problem of duplicate person instances. The main concept of this approach is to gather different types of information referring to persons, compare different person instances and predict whether they are similar or not. Using the Jaro algorithm for person attribute similarity calculation and by cross-examining the information available for person instances, recommendations can be provided to users regarding the similarity or not between two person instances. The degree of importance of each attribute was also examined, in order to gain a better insight with respect to the declared features that play a more important role.
... MinHash algorithm, when used with the LSH forest data structure, represents a text similarity method that approximates the Jaccard set similarity score [32] MinHash was used to replace the large sets of string data with smaller "signatures" that still preserve the underlying similarity metric, hence producing a signature matrix, but a pair-wise signature comparison was still needed. Here the LSH Forest comes into play. ...
Article
Full-text available
Privacy is a fundamental human right according to the Universal Declaration of Human Rights of the United Nations. Adoption of the General Data Protection Regulation (GDPR) in European Union in 2018 was turning point in management of personal data, specifically personal identifiable information (PII). Although there were many previous privacy laws in existence before, GDPR has brought privacy topic in the regulatory spotlight. Two most important novelties are seven basic principles related to processing of personal data and huge fines defined for violation of the regulation. Many other countries have followed the EU with the adoption of similar legislation. Personal data management processes in companies, especially in analytical systems and Data Lakes, must comply with the regulatory requirements. In Data Lakes, there are no standard architectures or solutions for the need to discover personal identifiable information, match data about the same person from different sources, or remove expired personal data. It is necessary to upgrade the existing Data Lake architectures and metadata models to support these functionalities. The goal is to study the current Data Lake architecture and metadata models and to propose enhancements to improve the collection, discovery, storage, processing, and removal of personal identifiable information. In this paper, a new metadata model that supports the handling of personal identifiable information in a Data Lake is proposed.
... Record linkage can be understood as a process for extracting records from various data sources and combining them to form a single entity [6], for both structured and unstructured data [7]. The same thing was also conveyed by [8] which defines that record linkage is a step to identify a number of records that refer to the same thing. ...
Article
Full-text available
Merging databases from different data sources is one of the important tasks in the data integration process. This study will integrate lecturer data from data sources in the application of academic information systems and research information systems at the Sriwijaya State Polytechnic. This integration of lecturer data will later be used as a single data as master data that can be used by other applications. Lecturer data in the academic section contains 444 records, while those from the p3m section contain 443 records. An important task in the database merging process is to eliminate duplicate records. One of the important libraries in the formation of this master data management uses the record linkage toolkit which is implemented in the python programming language. The steps taken are pre-processing, generating candidate record pairs, compare pairs, score pairs, and finally the data link to merge the two data sources. In this study, 5 fields, namely username, name, place of birth, date of birth, and gender, from each data source were used to measure the level of record similarity. The result of this research is the formation of lecturer master data from the merging of the two sources.
... We test our approach on the company domain in Wikidata. The company domain has many applications in the areas of enterprise and finance where there is a focus on market intelligence or stock market data (Gschwind et al., 2019). We focus on company entities as they present a useful microcosm of the overall challenges of knowledge graph entity translation, such as the mix of translation and transliteration in cross-lingual labelling. ...
Thesis
Full-text available
Content on the web is predominantly in English, which makes it inaccessible to individuals who exclusively speak other languages. Knowledge graphs can store multilingual information, facilitate the creation of multilingual applications, and make these accessible to more language communities. In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages. We explore the current state of language distribution in knowledge graphs by developing a framework - based on existing standards, frameworks, and guidelines - to measure label and language distribution in knowledge graphs. We apply this framework to a dataset representing the web of data, and to Wikidata. We find that there is a lack of labelling on the web of data, and a bias towards a small set of languages. Due to its multilingual editors, Wikidata has a better distribution of languages in labels. We explore how this knowledge about labels and languages can be used in the domain of question answering. We show that we can apply our framework to the task of ranking and selecting knowledge graphs for a set of user questions. A way of overcoming the lack of multilingual information in knowledge graphs is to transliterate and translate knowledge graph labels and aliases. We propose the automatic classification of labels into transliteration or translation in order to train a model for each task. Classification before generation improves results compared to using either a translation- or transliteration-based model on their own. A use case of multilingual labels is the generation of article placeholders for Wikipedia using neural text generation in lower-resourced languages. On the basis of surveys and semi-structured interviews, we show that Wikipedia community members find the placeholder pages, and especially the generated summaries, helpful, and are highly likely to accept and reuse the generated text.
Article
This article contributes to the field of matching techniques by introducing a new algorithm based on labor market data enrichment. The approach is able to collect and balance the training and test samples for data integration purposes. By setting thresholds for textual matching and geographic proximity, it simplifies the process of finding suitable company matches. On datasets that have so far been insufficiently studied, the experimental findings show that the performance of the proposed models differs depending on the similarity thresholds used.
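The abstract does not state which similarity metrics or threshold values are used. Purely as an illustration of combining a textual-matching threshold with a geographic-proximity threshold, a minimal Python sketch (standard library only, with made-up thresholds and records) could look like this:

import math
from difflib import SequenceMatcher

# Illustrative thresholds; the article does not publish its exact values.
NAME_THRESHOLD = 0.85          # minimum textual similarity between company names
DISTANCE_THRESHOLD_KM = 25.0   # maximum geographic distance between locations


def name_similarity(a: str, b: str) -> float:
    """Simple textual similarity in [0, 1] (a stand-in for any string metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def is_company_match(rec_a: dict, rec_b: dict) -> bool:
    """Accept a pair only if both the textual and the geographic test pass."""
    return (name_similarity(rec_a["name"], rec_b["name"]) >= NAME_THRESHOLD
            and haversine_km(rec_a["lat"], rec_a["lon"],
                             rec_b["lat"], rec_b["lon"]) <= DISTANCE_THRESHOLD_KM)


a = {"name": "Acme Tooling AG", "lat": 47.37, "lon": 8.54}
b = {"name": "ACME Tooling", "lat": 47.39, "lon": 8.51}
print(is_company_match(a, b))  # True with the illustrative thresholds above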
Article
Full-text available
Entity resolution (ER) is the problem of accurately identifying multiple, differing, and possibly contradicting representations of unique real-world entities in data. It is a challenging and fundamental task in data cleansing and data integration. In this work, we propose graph differential dependencies (GDDs) as an extension of the recently developed graph entity dependencies (which are formal constraints for graph data) to enable approximate matching of values. Furthermore, we investigate a special case of GDD discovery for ER by designing an algorithm that generates a non-redundant set of GDDs from labelled data. Then, we develop an effective ER technique, Certus, that employs the learned GDDs to improve the accuracy of ER results. We perform an extensive empirical evaluation of our proposals on five real-world ER benchmark datasets and a proprietary database to test their effectiveness and efficiency. The results from the experiments show that the discovery algorithm and Certus are efficient; and, more importantly, that GDDs significantly improve the precision of ER without a considerable trade-off in recall.
Conference Paper
Full-text available
Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules, each having significant coverage of the space of duplicates, thus leading to high recall in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.
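An ER rule "in the form of conjunctions of matching predicates" can be pictured as a boolean function over a record pair; taking the union of several such high-precision rules is what raises recall. The sketch below is our own toy illustration with invented attributes and thresholds, not the rules learned by the system described above.

from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Toy string similarity used by the predicates below."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


RULES = [
    # Rule 1: very similar names AND identical city.
    lambda x, y: sim(x["name"], y["name"]) >= 0.92 and x["city"] == y["city"],
    # Rule 2: identical website domain AND moderately similar names.
    lambda x, y: x["domain"] == y["domain"] and sim(x["name"], y["name"]) >= 0.75,
]


def is_duplicate(x: dict, y: dict) -> bool:
    """A pair is a duplicate if ANY high-precision rule fires (their union boosts recall)."""
    return any(rule(x, y) for rule in RULES)


x = {"name": "Globex Corporation", "city": "Springfield", "domain": "globex.com"}
y = {"name": "Globex Corp", "city": "Springfield", "domain": "globex.com"}
print(is_duplicate(x, y))  # True: rule 2 fires even though rule 1 does not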
Article
Full-text available
The nearest neighbor problem is the following: given a set of n points P in some metric space X, preprocess P so as to efficiently answer queries that require finding the point in P closest to a query point q in X. We focus on the particularly interesting case of the d-dimensional Euclidean space, where X = R^d under some l_p norm.
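For concreteness, the brute-force baseline for this problem can be written in a few lines of Python/NumPy; it is shown only to make the problem statement explicit and is, of course, not the sublinear solution the paper is concerned with.

import numpy as np

def nearest_neighbor(P: np.ndarray, q: np.ndarray, p: float = 2.0) -> int:
    """Return the index of the point in P closest to q under the l_p norm (linear scan)."""
    distances = np.linalg.norm(P - q, ord=p, axis=1)
    return int(np.argmin(distances))


rng = np.random.default_rng(0)
P = rng.random((1000, 8))   # n = 1000 points in d = 8 dimensions
q = rng.random(8)
idx = nearest_neighbor(P, q, p=2.0)
print(idx, np.linalg.norm(P[idx] - q))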
Conference Paper
Full-text available
While named entity recognition is a much-addressed research topic, recognizing companies in text is particularly difficult. Company names are extremely heterogeneous in structure: a given company can be referenced in many different ways, and company names include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of the official company name, quite different colloquial names are frequently used by the general public. We present a machine learning (CRF) system that reliably recognizes organizations in German texts. In particular, we construct and employ various dictionaries, regular expressions, text context, and other techniques to improve the results. In our experiments we achieved a precision of 91.11% and a recall of 78.82%, showing significant improvement over related work. Using our system we were able to extract 263,846 company mentions from a corpus of 141,970 newspaper articles.
Article
Full-text available
Entity resolution is an important application in the field of data cleaning. Standard approaches such as deterministic and probabilistic methods are generally used for this purpose. Many new approaches, using single-layer perceptrons, crowdsourcing and so on, have been developed to improve the efficiency and reduce the time of entity resolution. The approaches used also depend on the type of dataset, labeled or unlabeled. This paper presents a new method for labeled data that uses a single-layer convolutional neural network to perform entity resolution. It also describes how crowdsourcing can be used with the output of the convolutional neural network to further improve the accuracy of the approach while minimizing the cost of crowdsourcing. The paper also discusses the data pre-processing steps used for training the convolutional neural network. Finally, it describes the airplane sensor dataset used to demonstrate this approach and presents the experimental results achieved with the convolutional neural network.
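The paper's exact record encoding and network configuration are not reproduced here; the following PyTorch sketch merely illustrates the general shape of a single-convolutional-layer pair classifier for entity resolution, with all vocabulary sizes, sequence lengths and dimensions chosen arbitrarily as assumptions.

import torch
import torch.nn as nn

VOCAB_SIZE = 64    # size of a character vocabulary (assumption)
MAX_LEN = 40       # records are character-encoded and padded to this length
EMBED_DIM = 16


class PairCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        # The single convolutional layer; both records of a pair share its weights.
        self.conv = nn.Conv1d(EMBED_DIM, 32, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * 32, 1)  # match / non-match score

    def encode(self, x):
        h = self.embed(x).transpose(1, 2)   # (batch, EMBED_DIM, MAX_LEN)
        h = torch.relu(self.conv(h))        # (batch, 32, MAX_LEN)
        return h.max(dim=2).values          # global max pooling -> (batch, 32)

    def forward(self, rec_a, rec_b):
        features = torch.cat([self.encode(rec_a), self.encode(rec_b)], dim=1)
        return torch.sigmoid(self.classifier(features)).squeeze(1)


model = PairCNN()
rec_a = torch.randint(0, VOCAB_SIZE, (8, MAX_LEN))  # a batch of 8 encoded records
rec_b = torch.randint(0, VOCAB_SIZE, (8, MAX_LEN))
print(model(rec_a, rec_b).shape)  # torch.Size([8]) -- match probabilities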
Book
Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman (Cambridge University Press).
Chapter
Today’s enterprise decision making relies heavily on insights derived from vast amounts of data from different sources. To acquire these insights, the available data must be cleaned, integrated and linked. In this work, we focus on the problem of linking records that contain textual descriptions of IT products.
Conference Paper
Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
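As a rough illustration of the simplest of these solution families, an SIF-style matcher represents each textual record by a frequency-weighted average of word vectors and scores pairs by cosine similarity (the full SIF method additionally removes the first principal component). Everything in the sketch below, from the toy vectors to the frequency table and weighting constant, is an invented stand-in; real systems use pretrained embeddings.

import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["international", "business", "machines", "corp", "corporation", "ibm"]
EMB = {w: rng.normal(size=50) for w in VOCAB}            # stand-in word vectors
FREQ = {w: 1.0 / (i + 2) for i, w in enumerate(VOCAB)}   # stand-in unigram probabilities
A = 1e-3                                                 # SIF weighting constant


def sif_embedding(text: str) -> np.ndarray:
    """Frequency-weighted average of word vectors: rare words weigh more."""
    words = [w for w in text.lower().split() if w in EMB]
    if not words:
        return np.zeros(50)
    weights = np.array([A / (A + FREQ[w]) for w in words])
    vectors = np.array([EMB[w] for w in words])
    return weights @ vectors / len(words)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


score = cosine(sif_embedding("International Business Machines Corp"),
               sif_embedding("IBM Corporation"))
print(score)  # a real matcher would threshold or classify this similarity score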
Article
Entity matching (EM) has been a long-standing challenge in data management. Most current EM works, however, focus only on developing matching algorithms. We argue that far more effort should be devoted to building EM systems. We discuss the limitations of current EM systems, then present Magellan, a new kind of EM system that addresses these limitations. Magellan is novel in four important aspects. (1) It provides a how-to guide that tells users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) The tools are built on top of the data science stack in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and allows users to quickly write code to "patch" the system. We have extensively evaluated Magellan with 44 students and users at various organizations. In this paper we propose demonstration scenarios that show the promise of the Magellan approach.