All content in this area was uploaded by Thomas Gschwind on May 10, 2023
Fast Record Linkage for Company Entities
Thomas Gschwind, Christoph Miksovic, Julian Minder, Katsiaryna Mirylenka, Paolo Scotton
IBM Research — Zurich
Rüschlikon, Switzerland
{thg, cmi, jmd, kmi, psc}@zurich.ibm.com
Abstract—Record linkage is an essential part of nearly all
real-world systems that consume structured and unstructured
data coming from different sources. Typically no common key
is available for connecting records. Massive data integration
processes often have to be completed before any data analytics
and further processing can be performed. In this work we focus
on company entity matching, where company name, location
and industry are taken into account. Our contribution is a
highly scalable, enterprise-grade end-to-end system that uses
rule-based linkage algorithms in combination with a machine
learning approach to account for short company names. Linkage
time is greatly reduced by an efficient decomposition of the search
space using MinHash. Based on real-world ground truth datasets,
we show that our approach reaches a recall of 91% compared
to 73% for baseline approaches, while scaling linearly with the
number of nodes used in the system.
I. INTRODUCTION
Enterprise artificial intelligence applications require the
integration of many data sources. In such applications, one
of the most important entity attributes to be linked is often
the company name. It acts as a “primary key” across multiple
datasets such as company descriptions, marketing intelligence
databases, ledger databases, or stock market related data. The
technique used to perform such a linkage is commonly referred
to as record linkage or entity matching.
Record linkage (RL) is in charge of joining various
representations of the same entity (e.g., a company, an
organization, or a product) residing in structured records
coming from different datasets [23]. RL has been extensively
studied in recent decades. It was formalized by Fellegi and Sunter
in 1969 [8]. The tutorial by Lise Getoor [10] provides an
excellent overview of use cases and techniques. Essentially,
RL has been used to link entities from different sets or to
deduplicate/canonicalize entities within a given set. To this end,
several approaches have been envisaged, ranging from feature
matching and rule-based approaches to machine learning approaches.
Typically, RL is performed in batch mode to link a large
number of entities between two or more datasets [10]. A
challenge of enterprise applications is the ever-increasing
amount of unstructured data such as news, blogs and social
media content to be integrated with enterprise data. As a
consequence, RL has to be performed between structured
records and unstructured documents. This large amount of data
may flow in streams for rapid consumption and analysis by
enterprise systems. Therefore, RL needs to be executed on the
fly and with stringent time constraints.
We use the Watson Natural Language Understanding ser-
vice [14] to identify mentions in unstructured text. These
entities are then passed to RL in a structured fashion in
the form of a record containing attributes that, for example,
represent company names, locations, industries and others. RL
is in charge of linking this record against one or multiple
reference datasets.
The main contributions of this work are
1) an end-to-end RL system that is highly scalable and
provides an enterprise-grade RL service,
2) scoring functions for various attribute types together with
a hierarchical scoring tree that allows the efficient and
flexible implementation of multi-criteria scoring, and
3) the automatic extraction of short company names, an
important feature of the company entity, based on
conditional random fields.
The paper is organized as follows. Section II presents related
work and discusses the general background of RL. Section III
describes the proposed system in detail. The performance of
the proposed system is discussed in Section IV and Section V
presents future research directions and concludes the paper.
II. BACKGROUND
Various record linkage systems have been proposed in
recent decades [2], [12], [16], [17], [20]. As mentioned in
the introduction, they can usually be divided into rule-based
and machine learning-based systems. Konda et al. [16] have
proposed a system to perform RL on a variety of entity types,
providing a great flexibility in defining the linkage workflow.
This system allows the user to select the various algorithms
being used at various stages of the linkage process. Despite
its flexibility, this approach does not address the performance
problem at the center of the class of applications that we are
addressing. The Certus system proposed in [17] exploits graph
differential dependencies for the RL task. Even though there
is no need for an expert to create these graphs manually, a
substantial amount of training data is still needed to learn
the graphs automatically. We therefore cannot apply such
techniques as we consider cases where the amount of training
data is very limited.
In the domain of RL, locality-sensitive hashing (LSH)
methods are generally used to provide entities with signatures
in such a way that similar entities have identical signatures
with high probability [29]. These signatures are commonly
referred to as blocking keys, which denote blocks. Blocks are
used to limit the number of comparisons needed during the
scoring phase.
[Block diagram with three parts: Short Name Extraction,
Preprocessing Pipeline, and Runtime Pipeline.]
Fig. 1: Preprocessing and runtime pipeline.
Typical LSH algorithms are MinHash [3], [4], [19] and
SimHash [5], [6]. MinHash can be parametrized by decom-
posing the hashing functions in rows and bands [19]. The
row-band parameter settings for a desired minimal similarity
threshold can be determined by an “S-curve”. In our current
setup, we chose MinHash and tuned it for a high recall rate,
a key requirement for RL.
A first approach to use machine learning techniques for
record linkage was proposed in 2003 by Elfeky et al. [7]. A
trained classifier approach is compared to unsupervised clus-
tering and to probabilistic approaches. Although the trained
classifier outperforms the other approaches, the authors em-
phasize the difficulty of obtaining training data. More recent
studies [11], [26] assess the applicability of neural networks
to record linkage. In particular, Mudgal et al. [26] show
that, compared to “classical” approaches, deep learning brings
significant advantages for unstructured and noisy data.
The major limitation to using machine learning techniques
for record linkage is the difficulty of finding sufficient an-
notated training data. This is especially true with company
names. Moreover, for each new reference dataset introduced in
the system, a specific new training dataset must be developed.
To alleviate this problem, some promising approaches such as
the use of active learning [27] have been proposed. However,
the application of machine learning techniques to record
linkage remains limited at the moment. Nevertheless, machine
learning can be applied to sub-problems within record linkage.
In this work, we propose a novel machine learning-based
technique to extract a short name from a conventional com-
pany name. Full company names usually contain many ac-
companying words, e.g. “Systems, Inc.” in “Cisco Systems,
Inc.”, that contain additional information about a company’s
organizational entity type, its location, line of business, size
and share in the international market. The accompanying
words often vary greatly across datasets. For example, some
systems will have just “Cisco” instead of the conventional
name “Cisco Systems, Inc.”. Short company names (also referred
to as colloquial or normalized company names) represent the
most discriminative substring in a company name string and
are particularly popular in unstructured data sources such as
media publications or financial reports, where many company
mentions are aggregated.
It has been shown by Loster et al. [20] that taking short
(colloquial) company names into account is greatly beneficial
for company record linkage. However, the company entity
matching system described in [20] used a manually created
short company name corpus, whereas in this work we focus
on automated short name extraction.
III. RECORD LINKAGE SYSTEM
As mentioned above, we consider the problem of RL
performed on the fly, i.e. dynamically linking an incoming
record to records in one or more reference datasets. A record is
defined as a collection of attributes, each of which corresponds
to a column in the dataset. Attributes typically include the
company name, street address, city, postal code, country code,
industry, etc. Different reference datasets might not contain the
same attribute types, and/or attributes might be referenced by
different names.
The record linkage system essentially comprises three com-
ponents (Figure 1). Short name extraction is in charge of
training the service to extract short company names. The
preprocessing pipeline prepares the reference datasets. Finally,
the runtime pipeline is responsible for matching incoming
requests against candidate records and returning the best
matches.
A. Short Name Extraction
We use two data sources as the training corpus for the short
name extraction. DBpedia [1], the first source, contains some
65K company entities derived from the English version of
Wikipedia. The company entities contain a name, a label and
a homepage of a company. We use all these fields to derive
a company short name, which, in most cases, appears either
in the label or on the homepage of a company. For example,
the company “Aston Martin Lagonda Limited” has the label
“Aston Martin”. In this and similar cases, based on a handful
of heuristically devised rules, we conclude that “Aston Martin”
is the short name of the company.
Fig. 2: CRF performance for company short name extraction
(precision, recall, F1-score and support for class IN, class OUT,
micro avg and macro avg). (a) DBpedia corpus. (b) Aggregated
DBpedia and commercial corpus.
Another source of training data is a commercial company
database that contains company entities, such as branches,
subsidiaries and headquarters, all having individual local and
global identifiers. The set of all identifiers associated with a
company can be represented hierarchically. Based on these
hierarchies, we identify the families of companies which are
represented as a tree structure. For each family of companies,
we extracted the common tokens of the company names as
a short name for the entire family. After extracting common
tokens, additional checks were performed to exclude legal
entity types of companies from the token list. The remaining
tokens were combined and used as a short name for all the
company names in the family. For example, from a family of
companies that have two distinct names “Zumu Holdings Pty
Ltd” and “Zumu Foods Pty Ltd”, we extracted “Zumu” to be
the representative short name. Given this data source, we were
able to extract 950K long-short name pairs for training.
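The family-based extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `family_short_name`, the tokenization, and the list of legal entity type tokens are assumptions made for the example.

```python
# Hypothetical list of legal entity type tokens; the paper does not give its list.
LEGAL_ENTITY_TOKENS = {"pty", "ltd", "inc", "llc", "gmbh", "ag", "sa", "corp", "plc"}


def family_short_name(names):
    """Return the tokens shared by all names in a family, minus legal entity types."""
    if not names:
        return ""
    # Simplistic tokenization (assumption): treat commas and periods as spaces.
    tokenized = [name.replace(",", " ").replace(".", " ").split() for name in names]
    # Tokens (compared case-insensitively) that occur in every name of the family.
    common = {tok.lower() for tok in tokenized[0]}
    for tokens in tokenized[1:]:
        common &= {tok.lower() for tok in tokens}
    # Keep the order and casing of the first name; drop legal entity types.
    kept = [
        tok for tok in tokenized[0]
        if tok.lower() in common and tok.lower() not in LEGAL_ENTITY_TOKENS
    ]
    return " ".join(kept)


print(family_short_name(["Zumu Holdings Pty Ltd", "Zumu Foods Pty Ltd"]))  # prints "Zumu"
```

With the two names from the example in the text, the shared tokens are "Zumu", "Pty" and "Ltd"; removing the legal entity types leaves "Zumu".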
In total, more than a million long name, short name pairs
were used as the corpus for the automatic extraction of short
names. The task of extracting short names in the case of
the commercial company data is more difficult because the
variability of names within the family of companies is greater,
and the short name is often the most discriminative part of
the name, whereas some other quite discriminative words
should be omitted. As can be seen from the support pie
chart in Figure 2b, indeed, for the overall corpus, where the
commercial company data portion is dominant, the number
of words that should be omitted is slightly greater than the
number of words that should be kept in a short name.
We treat the short name learning process as a sequence
labeling task, where for each word in a sequence, we need to
decide whether the word is kept or omitted from a company
name. Conditional Random Fields (CRF) [18] is one of the
best-performing models applied for sequence labeling [13]. In
our case, we have only two labels: “IN” and “OUT” to indicate
whether the word is included in or omitted from a short name,
respectively.
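To make the labeling scheme concrete, the sketch below derives “IN”/“OUT” labels from a long name and its short name, and builds simple per-token features of the kind a CRF toolkit could consume. The feature set and the helper names `label_tokens` and `token_features` are hypothetical; the paper does not list the features it uses.

```python
def label_tokens(long_name, short_name):
    """Label each token of long_name as "IN" (kept) or "OUT" (omitted)."""
    short_tokens = {tok.lower() for tok in short_name.split()}
    return [
        (tok, "IN" if tok.strip(",.").lower() in short_tokens else "OUT")
        for tok in long_name.split()
    ]


def token_features(tokens, i):
    """Hypothetical features for the i-th token, including its neighbors."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


print(label_tokens("Aston Martin Lagonda Limited", "Aston Martin"))
```

Each (long name, short name) pair in the corpus would yield one such labeled sequence for training.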
To evaluate CRF for the task of short name extraction, preci-
sion, recall and F1-score are computed separately for “IN” and
“OUT” classes. We also present micro and macro averages for
each performance measure. The plots for the DBpedia corpus and
for the aggregated DBpedia and commercial company corpus
are shown in Figure 2.
The results demonstrate that CRF is able to distinguish
between discriminative and non-discriminative words in a
company name, as all the performance measures are at least
0.76 for all the datasets under consideration. The task for
DBpedia names is easier, and CRF achieves an overall
accuracy of approximately 0.9 for both classes. For the larger
corpus, the model struggled to identify all words that should
have been included in the short name, yielding a recall of
0.76 for the “IN” class. The other performance measures
are close to 0.81.
The trained CRF model is applied to extract the short names
in the main record linkage system presented above, with the
results in Section IV.
B. Preprocessing Pipeline
The preprocessing pipeline reads records from a given
source format, converts the strings into their decomposed
UTF-8 representation [32], collapses multiple consecutive spaces
into a single space, and generates a binary database that
supports the efficient retrieval of the records. Once the binary
database has been generated, a blocking key database is built
by computing for each record a set of blocking key values
corresponding to an LSH function. The blocking key database
stores the corresponding record indices for each blocking
key. As discussed in Section II, our implementation uses
MinHash [3], [4] as its LSH function.
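The preprocessing steps can be illustrated with a small sketch. Several details are assumptions made for the example and are not taken from the paper: character 3-gram shingles, seeded BLAKE2b hashes standing in for a family of hash functions, and blocking on the normalized name alone (the system described here blocks on a cleaned name and the short name). The production linker is written in C++; Python is used only for illustration.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Decompose characters (Unicode NFD) and collapse runs of spaces."""
    decomposed = unicodedata.normalize("NFD", text)
    return re.sub(r" +", " ", decomposed)


def shingles(text, k=3):
    """Character k-grams of the lower-cased text (k=3 is an assumption)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + item.encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def blocking_keys(text, rows=6, bands=30):
    """Split the signature into `bands` groups of `rows` values; each is one key."""
    sig = minhash_signature(shingles(normalize(text)), rows * bands)
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def build_blocking_index(records, rows=6, bands=30):
    """Map each blocking key to the indices of the records that produce it."""
    index = defaultdict(list)
    for record_id, name in enumerate(records):
        for key in blocking_keys(name, rows, bands):
            index[key].append(record_id)
    return index
```

At query time, the same `blocking_keys` function would be applied to the query, and the union of the record indices stored under those keys forms the candidate set.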
The computation of the blocking key is based on a cleaned
version of the company name and the company’s short name.
The cleaning ensures that records with notational variations are
assigned the same blocking key (for instance, by consistently
omitting the legal entity type). This cleaning will generate a
number of incorrect matches that will have to be removed by
the scoring algorithm of the runtime pipeline. Other than for
the computation of the blocking keys, we do not perform any
additional cleaning as data cleaning destroys information [28].
C. Runtime Pipeline
The runtime pipeline links entity queries to the entities
stored in the entity database. It computes the blocking keys
and retrieves the corresponding candidate entities. It also
transforms the query into a more efficient representation in the
form of a scoring tree which is evaluated against the candidate
entities. Once the scores for the candidate records have been
computed, they are sorted.
The scoring tree uses different scoring algorithms, depend-
ing on the type of data to be processed. If the data describes an
address, we use a geographic scoring, whereas if it describes a
company name, we use a scoring algorithm tuned for company
names. If multiple types of data are present, the scoring
tree combines the scores into a single value. More formally,
the scoring tree represents a scoring function $s(R_q, R_r)$ that
evaluates the similarity between a query record $R_q$ and a
record in the reference dataset $R_r$ such that $s(R_q, R_r) \in [0, 1]$
and $s(R_q, R_r) = 1$ iff records $R_q$ and $R_r$ are identical.
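One possible shape for such a scoring tree is sketched below. The paper does not specify how inner nodes combine the scores of their children, so the weighted average used here, the class names `Leaf` and `WeightedNode`, and the handling of missing attributes are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Leaf:
    """Scores one attribute with an attribute-specific scoring function."""
    attribute: str
    scorer: Callable[[str, str], float]

    def score(self, query: Record, reference: Record) -> float:
        # Assumption: a missing attribute contributes a score of 0.
        if self.attribute not in query or self.attribute not in reference:
            return 0.0
        return self.scorer(query[self.attribute], reference[self.attribute])


@dataclass
class WeightedNode:
    """Combines child scores into a single value in [0, 1] (weighted average)."""
    children: List[Tuple[float, object]]

    def score(self, query: Record, reference: Record) -> float:
        total = sum(weight for weight, _ in self.children)
        weighted = sum(weight * child.score(query, reference)
                       for weight, child in self.children)
        return weighted / total


def exact(a: str, b: str) -> float:
    """Placeholder scorer: 1 for identical strings, 0 otherwise."""
    return 1.0 if a == b else 0.0


tree = WeightedNode([(2.0, Leaf("name", exact)), (1.0, Leaf("country", exact))])
```

If every leaf scorer returns 1 only for identical values, the tree as a whole returns 1 only for records that agree on all scored attributes, which matches the property stated above.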
Scoring Company Names. In order to score company
names, we started with different string similarity functions,
such as the Jaccard similarity j() on which MinHash is
based, or the Levenshtein distance l() which we convert into a
similarity function to obtain a score value in $[0, 1]$. Hence, we
compute the Jaccard and Levenshtein scores as follows, where
$n_1$ and $n_2$ represent company names of length $|n_1|$ and $|n_2|$:
$$s_{Jac}(n_1, n_2) = j(n_1, n_2) \quad \text{and} \quad s_{Lev}(n_1, n_2) = 1 - \frac{l(n_1, n_2)}{|n_1| + |n_2|}$$
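A minimal sketch of the two baseline scores follows. It assumes that the Jaccard similarity is taken over lower-cased word sets (the paper does not state the tokenization) and uses the textbook unit-cost Levenshtein distance, normalized by $|n_1| + |n_2|$ as in the formula in the text.

```python
def jaccard_score(n1, n2):
    """Jaccard similarity of the two names' word sets (tokenization is assumed)."""
    a, b = set(n1.lower().split()), set(n2.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(n1, n2):
    """Classic edit distance with unit costs, computed row by row."""
    previous = list(range(len(n2) + 1))
    for i, c1 in enumerate(n1, start=1):
        current = [i]
        for j, c2 in enumerate(n2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_score(n1, n2):
    """s_Lev = 1 - l(n1, n2) / (|n1| + |n2|)."""
    if not n1 and not n2:
        return 1.0
    return 1.0 - levenshtein(n1, n2) / (len(n1) + len(n2))
```

Under this tokenization, a word permutation such as "IBM Research Zurich" versus "Zurich IBM Research" yields a Jaccard score of 1 while the Levenshtein score drops, which is the behavior the combined score is designed to handle.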
A first limitation we observed is that both Jaccard and
Levenshtein scores give too much weight to diacritics. A naïve
approach is simply to remove all the diacritics as part of the
cleaning step. However, there are company names that are only
differentiated by the presence of diacritics. To tackle this prob-
lem, we leverage a property of Unicode representation where
diacritics are represented as special combining characters. The
combining characters are given a lower weight in the scoring
process.
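The idea of down-weighting combining characters can be sketched with a weighted edit distance. The weight of 0.25 and the cost model (an edit costs the weight of the character involved, a substitution the larger of the two weights) are assumptions; the paper only states that combining characters receive a lower weight.

```python
import unicodedata

COMBINING_WEIGHT = 0.25  # assumed value; the paper only says "a lower weight"


def char_weight(ch):
    """Combining marks (e.g. a diaeresis after NFD) count less than base characters."""
    return COMBINING_WEIGHT if unicodedata.combining(ch) else 1.0


def weighted_levenshtein(n1, n2):
    """Edit distance over NFD strings with per-character edit costs."""
    a = unicodedata.normalize("NFD", n1)
    b = unicodedata.normalize("NFD", n2)
    # First row: cost of inserting each prefix of b.
    previous = [0.0]
    for ch in b:
        previous.append(previous[-1] + char_weight(ch))
    for c1 in a:
        current = [previous[0] + char_weight(c1)]
        for j, c2 in enumerate(b, start=1):
            substitution = 0.0 if c1 == c2 else max(char_weight(c1), char_weight(c2))
            current.append(min(
                previous[j] + char_weight(c1),     # delete c1
                current[j - 1] + char_weight(c2),  # insert c2
                previous[j - 1] + substitution,    # substitute or match
            ))
        previous = current
    return previous[-1]
```

With these weights, "Zürich" and "Zurich" differ by 0.25 rather than by a full character, so they remain distinguishable but are treated as closer than names that differ in a base letter.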
Another challenge is to deal with legal entity types of
companies such as “inc.” or “ltd.”, which may or may not
be included in the company name. In our initial attempt, we
simply removed these legal entity type identifiers. However,
we soon came across companies where the names differ only
in the legal entity type but are actually distinct companies. This
is one of several occurrences where cleaning had a negative
effect on scoring, which confirms the observations made by
Randall et al. [28]. Generally, one approach to alleviate the
problem related to special mentions (e.g. legal entity types)
is to assign them a lower weight in the scoring process.
Therefore we adopted the approach of assigning legal entity
types the same weight as a single character minus a small
value of $\epsilon = 1/256$ (the smallest value available in our weight
representation). Subtracting $\epsilon$ allows us to give
precedence to changes that are not in the legal entity type.
In some situations, the city name can be included in
the company name. For example, IBM Research Zurich is
sometimes indicated as IBM Research if it is clear from the
context that the geographic region is Switzerland. To handle
this, we detect city name mentions in a company name and
reduce its weight if the city is in the company’s vicinity. This
allows more flexibility with regard to names. To look up city
names, we use a fast trie implementation.
Additionally, as described previously, we derive for each
company name a short name. Words that are part of the short
name are weighted three times the normal weight. This ap-
proach allows us to place more emphasis on the characteristic
words of the company compared to other elements present in
the name.
In the following, we denote the Levenshtein and Jaccard
similarities that include these modifications as $s_{Lev'}$ and $s_{Jac'}$.
The final company name score is computed as:
$$s(n_1, n_2) = 0.9 \cdot \max(s_{Jac'}(n_1, n_2), s_{Lev'}(n_1, n_2)) + 0.1 \cdot \min(s_{Jac'}(n_1, n_2), s_{Lev'}(n_1, n_2))$$
The rationale behind this choice is that the Jaccard score
allows for word permutations, whereas the Levenshtein score
relies on the character sequence. We do not simply use the
maximum of the two similarities because, in certain cases,
the Jaccard similarity may return a similarity of 1 for names
that are different. The weighted minimum term ensures that a
match with a Jaccard similarity of 1 is not chosen coincidentally
over a match with a Levenshtein similarity of 1, which is only
possible if the strings are equal. The values of 0.9 and 0.1
have been chosen arbitrarily.
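The combination step itself is small enough to show directly. The function name `combine_max_min` is hypothetical; the weights 0.9 and 0.1 are the ones given in the text.

```python
def combine_max_min(s_jac, s_lev, w_max=0.9, w_min=0.1):
    """Weighted combination of the larger and the smaller of the two scores."""
    return w_max * max(s_jac, s_lev) + w_min * min(s_jac, s_lev)


# A permuted name (Jaccard 1.0, Levenshtein 0.4) scores approximately 0.94 ...
permuted = combine_max_min(1.0, 0.4)
# ... and therefore ranks below an exact match, where both scores are 1.0.
exact = combine_max_min(1.0, 1.0)
print(round(permuted, 2), round(exact, 2))
```

Taking only the maximum would give both candidates a score of 1.0; the 0.1 weight on the minimum breaks this tie in favor of the exact match.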
We dubbed this measure RLS. In Section IV we will
compare this approach against the
1) Jaccard score ($s_{Jac}$),
2) Levenshtein score ($s_{Lev}$),
3) weighted average of the Jaccard and Levenshtein scores
($s_{weighted} = 0.5 \cdot s_{Jac} + 0.5 \cdot s_{Lev}$), and the
4) RLS scoring function with the optimizations for diacritic
characters, legal entity types, and city names disabled
($s_{maxmin} = 0.9 \cdot \max(s_{Jac}, s_{Lev}) + 0.1 \cdot \min(s_{Jac}, s_{Lev})$),
referred to as max-min.
Scoring Other Attributes. Multiple attributes can be taken
into consideration for scoring; in this section we describe
geolocation and industry scoring. A geographical location is
represented by an address element. This element contains the
street address, postal code, city and country code attributes.
Each component is scored using a specific algorithm. The
street address is currently scored using a tokenized string
matching (e.g. Levenshtein tokenized distance [25]). This
provides a reasonable measure between street address strings,
especially if street number and street name appear in different
orders. Postal codes are evaluated according to the number
of matching digits or characters. The rationale behind this
approach is that, to the best of our knowledge, the vast majority
of postal code systems are organized in a hierarchical fashion.
However, this scoring can be improved by using a geographic
location lookup service. The city is scored using the Haversine
distance [15] if the GPS location is available in the reference
dataset. To retrieve the GPS location of the city mentioned in
the query record, we use a trie data structure, which contains
the names and GPS position of some 195,000 cities worldwide
obtained from geonames.org [9]. To evaluate the score, we
compute the Haversine distance between the two cities and map it
to a score in (0, 1] using an exponential decay. As a fallback, if the GPS
position is not available or the city in the query record cannot
be found in the trie, we use the Levenshtein score (described
previously) between city names. Finally, the country code
score simply returns 1 if the country matches and 0 otherwise.
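The city scoring can be sketched as follows. The decay scale of 50 km is an assumption, since the paper does not state the decay constant, and the coordinates are passed in directly rather than looked up in a trie.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
DECAY_SCALE_KM = 50.0     # assumed; the paper does not state the decay constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def city_score(coords_query, coords_reference):
    """Exponential decay of the distance: 1 for identical locations, toward 0 far away."""
    distance = haversine_km(*coords_query, *coords_reference)
    return math.exp(-distance / DECAY_SCALE_KM)


zurich = (47.3769, 8.5417)
geneva = (46.2044, 6.1432)
print(haversine_km(*zurich, *geneva), city_score(zurich, geneva))
```

A smaller decay scale makes the score more local: nearby suburbs still score high, while cities in another region quickly approach 0.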
Industries are typically represented by four-digit Standard
Industrial Classification (SIC) codes [30]. Similar to postal
codes, SIC industry codes are also hierarchical: The first
two left-most digits represent a “Major Group” (e.g. Mining,
Manufacturing and others), the following digit is the “Indus-
trial Group” and the last digit is the specific industry within
the industrial group. When representing an industry, codes
of variable length can be used, depending on the level of
generality of the representation. To evaluate the industry score,
we use a measure similar to the one used for postal codes.
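A prefix-based measure for hierarchical codes, usable for both postal codes and SIC codes, might look like the sketch below. Normalizing by the length of the longer code is an assumption; the paper only states that the score depends on the number of matching digits or characters.

```python
def prefix_score(code_a, code_b):
    """Fraction of leading characters that agree, relative to the longer code."""
    a, b = code_a.strip().upper(), code_b.strip().upper()
    if not a or not b:
        return 0.0
    matching = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matching += 1
    return matching / max(len(a), len(b))


# Two SIC codes in the same major group and industrial group, different industry.
print(prefix_score("3571", "3572"))
```

Counting only the leading characters reflects the hierarchy: a mismatch in the first digit means a different major group or postal region, regardless of how many later digits happen to agree.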
D. Implementation
The design of our RL system is driven by three main goals:
versatility, speed, and scalability.
Versatility is given by the generality of the approach. As we
have shown in the previous sections, the various components
have been designed to be able to accommodate virtually any
reference dataset and to perform RL on a large variety of
entities. The scoring function set can be extended to other
attribute types, e.g. product names, person names, and others.
Also, the scoring tree can be adapted to accommodate these
new attribute types with appropriate combining functions. The
central element of the system is a generic “linker”, which
can easily be configured to load a preprocessed dataset and
perform linkage. To maximize performance in terms of speed,
the linker has been written in C++ and loads the entity database
into memory. Therefore, once the linker is started and initial-
ized, all operations are performed in memory. Additionally, the
linker uses a multi-threaded approach such that RL requests
can be processed in a parallel fashion to exploit the cores
available on the physical system. Each reference dataset and,
therefore, the associated entity databases are loaded into a
specific linker.
To ensure scalability, we have adopted a containerized
approach; each linker runs in an individual container. In
conjunction with a container orchestration system, such as Ku-
bernetes, it is possible to run and dispatch linkers on multiple
physical machines. This approach allows linear scaling with
the number of nodes that are added to the cluster as well
as the ability to run linkages simultaneously against multiple
datasets. Moreover, the overall system is resilient to node
failures, which is an important characteristic for an enterprise-
grade application.
IV. EVALUATION
We evaluated our RL system across multiple dimensions.
First, we assess the parameters under which it yields the best
performance, that is, we identify the right tradeoffs between
memory usage and performance as well as among the different
scoring strategies available. Second, we compare our scoring
algorithm with standard scoring algorithms. Third, we compare
our RL system to two baseline systems: a simple case-insensitive
string lookup of the company name (to identify the number of
trivial matches) and Apache Solr, a state-of-the-art distributed
indexing system that powers the search and navigation features
of many of the world’s largest internet sites. Fourth, we evaluate
the scalability in relation to the number of parallel clients
accessing the system [31].
To evaluate these dimensions, we rely on two commercial
but publicly available company databases. The first company
database comprises 150 million records and is used by any
company that engages in government contracts in the United
States. The second is used by financial analysts and comprises
approximately 15 million records. Finally, we also use internal
accounting data comprising 2 million records. We denote these
company databases as 150m, 15m, and 2m respectively.
Swiss dataset. We randomly selected 450 companies located
in Switzerland from the 15m database and manually matched them
against the 150m database. The use case behind this dataset is
to identify local companies with a random mix between small,
medium, and big companies as would be encountered by a user
with a strong local interest. Switzerland was chosen for two
reasons: (i) it has four different official languages allowing us
to assess the system in combination with different languages
and (ii) our familiarity with the region was instrumental in
correctly identifying records referring to the same company.
Company records were divided into the following types of
records:
Matched (296 records, 196 unique records) that show all cor-
rect matches (potentially multiple matches, for instance,
if one database was missing the address or listed many
subsidiaries).
Unmatched (114 records) where no corresponding company
was present in the other database.
Undecided (80 records) where we were unable to decide con-
clusively whether companies are the same or if one of the
companies had been renamed, e.g. following a merger.
Undecided records were counted neither as true positive nor
as negative but only as false positive if they were matched
against a different record.
Accounting dataset. We leveraged internally available fi-
nancial data that maps accounting company data from the
2m database to the 150m database. This dataset consists of
55k records. As this linkage was manually performed by
domain experts, we can assume that >99% of detected links
are accurate. The use case behind this dataset is to match
accounting data against a reference database as it is usually
performed in large companies.
News dataset. The dataset is based on a random selection
of 104 current news articles from different sources. It was
manually curated and lists for each article the companies that
should be found from the reference company database. The
use case behind this dataset is to mine data about companies
from unstructured data sources.
Fig. 3: S-curves for the row-band configurations r=4, b=10;
r=5, b=18; and r=6, b=30 (left: full; right: zoomed; x-axis: Jaccard
similarity; y-axis: matching probability)
A. Performance Tuning
In this section, we evaluate different tuning parameters and
their effects on the performance of our RL system. First,
we analyze different MinHash row-band configurations. From
experience, we know that correct matches typically have a
Jaccard similarity greater than 0.8. However, some correct
matches have a score as low as 0.6. Using these numbers, we
have chosen three row-band configurations such that records
with a Jaccard similarity of ≥ 0.8 are matched with a
probability >99% and those with a similarity between 0.6 and 0.8 with
a probability >75%. Considering that correct matches with a
score <0.8 are rare, the 75% figure has been arbitrarily chosen
as a tradeoff between performance and matching accuracy. The
row-band configurations are shown in Figure 3 and Table I.
MinHash (rows/bands)   4/10     5/18     6/30
σ = 0.5                47.5%    43.5%    37.6%
σ = 0.6                75.0%    76.7%    76.1%
σ = 0.7                93.5%    96.3%    97.6%
σ = 0.8                99.4%    99.9%    99.9%
TABLE I: MinHash matching probabilities
A higher number of rows and bands yields a sharper S-
curve. Hence, entities with a low score are less likely to
be considered a match. However, this comes at the expense of
having to compute more MinHashes (rows × bands) as well
as consuming more memory to store the additional bands.
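The S-curve follows from the standard banding analysis of MinHash: with $r$ rows and $b$ bands, two entities with Jaccard similarity $\sigma$ share at least one band with probability $1 - (1 - \sigma^r)^b$. The short sketch below evaluates this expression for the three configurations; the function name `match_probability` is an illustrative choice. The computed values agree with Table I to within 0.1 percentage points (the table appears to truncate rather than round).

```python
def match_probability(similarity, rows, bands):
    """Probability that at least one of `bands` bands of `rows` hashes agrees."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


CONFIGS = [(4, 10), (5, 18), (6, 30)]

for sigma in (0.5, 0.6, 0.7, 0.8):
    cells = "  ".join(
        f"{100 * match_probability(sigma, rows, bands):5.1f}%" for rows, bands in CONFIGS
    )
    print(f"sigma = {sigma}: {cells}")
```

Raising the number of rows lowers the probability for every similarity below 1, while raising the number of bands increases it; choosing both together is what moves and sharpens the curve.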
Table II shows the results of linking the Swiss dataset
against the 150m database. The recall figures for the different
configurations are between 86.6% and 87.2%. This is not
surprising, considering that the S-curve was configured to
capture all company names with a Jaccard similarity of ≥ 0.8
with a >99% probability. Memory consumption grows almost
linearly with the number of bands. The number of comparisons
necessary, however, looks surprising, because the S-curves
are relatively close to each other. After considering that the
score distribution among candidate entities is heavy-tailed,
with considerably more candidate entities having
a low Jaccard similarity, this difference is easily explained.
Finally, the time spent computing the 6×30 MinHashes for
each record to be matched was negligible, which, looking at
the table, is due to the fact that each record must be compared
to 23 to 72 thousand candidate records.
MinHash (rows/bands)   4/10       5/18       6/30
recall                 86.67%     87.18%     87.18%
database size          38.8 GiB   57.6 GiB   99.5 GiB
comparisons            72.9k      55.0k      23.6k
TABLE II: Memory and performance comparison
Fig. 4: Different scoring functions (Jaccard, Levenshtein,
weighted, max-min, RLS) for different MinHash
configurations: r = 4, b = 10 (encircled data point), followed
by r = 5, b = 18 and r = 6, b = 30 (x-axis: precision [%];
y-axis: recall [%])
We chose the configuration with 30 bands for our RL system
as it provides the best tradeoff between memory consumption
and comparison operations required.
B. Scoring Evaluation
In Section III-C, we have described our algorithm for scor-
ing company names. Figure 4 shows the precision and recall
numbers for the different strategies. The similarity functions
are shown for the row-band configurations discussed
previously: r = 4, b = 10 (encircled), followed by r = 5, b = 18
and r = 6, b = 30. The results are very similar for the
different band configurations and slightly better for those with
higher band numbers. This is consistent with the fact that the
matching probability for similarities >0.6, which covers most
outliers, is slightly higher for higher band numbers.
The Jaccard similarity has a lower recall than the Leven-
shtein similarity because the former is more sensitive to small
changes in the name such as diacritics. As a consequence, its
precision is higher. The weighted approach lies somewhere in
the middle.
The max-min strategy compared to the weighted strategy
yields similar results in terms of recall but with lower preci-
sion. This can be explained in the case where two company
names have a “high” Jaccard score and a “low” Levenshtein
score. For example, if the Jaccard score is 1.0 and the
Levenshtein score is 0.4, then the arithmetic mean is 0.7,
which is barely above our threshold, whereas max-min yields
a score of 0.94, which closely resembles the Jaccard similarity.
The comprehensive RLS approach shows significant im-
provements in terms of recall. The precision is similar to the
max-min strategy but below the weighted or Jaccard strategies.
This is because it finds matches for records in the ground truth
dataset that have no corresponding matches in the reference
database. In this case it is almost impossible to discern close
matches from non-matches. Considering that we favor recall
over precision, this is an acceptable tradeoff.
C. Matching Accuracy
In this section, we evaluate the matching accuracy of the RL
system in terms of precision and recall. We compare it against
a trivial case-insensitive string comparison, which identifies
the number of trivial matches, and against a Solr-based approach.
Solr was configured in a manner to allow for flexible search
operations that do not impose rigid restrictions on the type
and structure of the query terms. A default search field based
on the “solr.TextField” class including standard tokenization
and lowercasing was used to copy all the relevant company
attributes into one multivalued field. This setup allows
submitting compact queries, e.g., company name only, as well
as complex query phrases that contain a company name and
arbitrary additional attributes like address, city, and country.
The performance results for all our datasets (Swiss, Ac-
counting, News) are summarized in Table III.
Swiss dataset. For the first use case, we see that the trivial
matching algorithm is already able to correctly match 57% of
the records of the Swiss dataset. We assume that this is because
many companies ensure that their information is correctly
stored in the 150m and 15m databases. Precision is at 100%
because all matched names have been identified correctly.
Compared to the trivial approach, Solr is able to match
another 15% of the records, i.e., 35% of those records not
matched by the trivial approach. These are typically cases
where the name is similar or has been slightly shortened or
extended. RLS is able to match yet another 19% of the records
that have not been matched by Solr, or in other terms, 79% of
the records not matched by the trivial approach. The reason is
that RLS is aware of the matching semantics of the different
artifacts that compose a company record. The relatively low
precision is explained by the fact that both Solr and RLS try
to find a match for every record. The precision for Solr is
lower as it is less specialized for the task of company matching.
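The relative figures quoted for the Swiss dataset follow directly from the
recall values in Table III. The short sketch below reproduces them; the
helper name is chosen for this example.

```python
def share_of_residual(recall_base, recall_new):
    """Fraction of the records missed by the baseline that the new
    system additionally matches."""
    return (recall_new - recall_base) / (1.0 - recall_base)


# Swiss-dataset recall values from Table III.
TRIVIAL, SOLR, RLS = 0.57, 0.72, 0.91

solr_share = share_of_residual(TRIVIAL, SOLR)   # about 0.35
rls_share = share_of_residual(TRIVIAL, RLS)     # about 0.79
```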
Accounting dataset. The recall values of the accounting
dataset mostly mirror those of the Swiss dataset, despite it
storing mostly bigger and international companies that frequently
use English words in their names. Interestingly, trivial matching
performs much worse. This seems to be because the accounting
database with 2m records is only internally available and hence
at times uses unofficial name variations of the company. These
variations are mostly trivial and hence both Solr and RLS
perform similarly to the Swiss dataset.
For both Solr and RLS, the precision and recall values are
identical because this dataset only contains records that are
present in both databases and both systems have returned a
match for every record.

Dataset      Metric   Trivial   Solr   RLS
Swiss        R        57%       72%    91%
             P        100%      41%    65%
Accounting   R        28%       73%    89%
             P        100%      73%    89%
News         R        33%       42%    64%
             P        37%       42%    58%
TABLE III: Matching Performance: Trivial vs. Solr vs. RLS
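Why precision and recall coincide on the Accounting dataset can be seen from
a minimal sketch. It assumes the usual definitions (correct matches divided
by returned matches, and correct matches divided by records that have a true
match); the toy identifiers and the function name are hypothetical.

```python
def precision_recall(predicted, truth):
    """Compute (precision, recall) for record-to-entity assignments.

    `predicted` and `truth` map a record id to an entity id, or to None
    when no match is returned / no true match exists.
    """
    returned = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, v in returned.items() if truth.get(k) == v)
    expected = sum(1 for v in truth.values() if v is not None)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Every record has a true match and the system answers for every record,
# so both denominators are equal and precision equals recall.
truth = {"r1": "A", "r2": "B", "r3": "C", "r4": "D"}
always_answers = {"r1": "A", "r2": "B", "r3": "X", "r4": "D"}
p_all, r_all = precision_recall(always_answers, truth)

# A system that abstains on the uncertain record trades recall for precision.
abstains = {"r1": "A", "r2": "B", "r3": None, "r4": "D"}
p_abs, r_abs = precision_recall(abstains, truth)
```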
News dataset. For this dataset, as mentioned previously,
we use Watson NLU to identify company names in news
articles. Subsequently, these names are matched to the 150m
database. This task is much harder as very limited context
is available and company names may vary substantially. This
leads to lower accuracy results. Due to its specialization and
extra-processing, RLS again outperforms Solr.
Note that we only used the company name, as other attributes
are often unreliable or absent in the unstructured case. For
example, an article may contain several company names, cities,
and countries, so it can be ambiguous to an automated entity
resolution system which city or country refers to a given
company.
D. Scalability
Each RLS instance accepts up to 8 parallel client requests.
A request can contain multiple individual queries that are
distributed over 4 threads. This gives a theoretical maximum
of 32 queries being processed in parallel. Each instance runs
on a server with two Intel® Xeon® E5-2630 CPUs at 2.2 GHz
(40 threads in total) and 400 GB of memory.
[Figure: CPU Load and Time/Req t [ms] plotted against the Number of Clients]
Fig. 5: Scalability Analysis of the RL System
We have deployed this service on a node which is part of
our Kubernetes cluster. Requests are issued by a range of one
to twelve parallel clients. Each client sends 10,000 requests,
each containing 80 queries. The scalability results are shown
in Figure 5. Up to eight clients, the CPU load increases almost
linearly. The average processing time decreases as a benefit
of parallel processing and converges to a value of 17 ms per
request. With more than eight clients, the CPU load and the
performance gain level off as requests start to be queued.
Scaling requests over multiple nodes is performed by the
load balancer of Kubernetes. Since each instance keeps its
own copy of the reference company database and hence
runs independently, no performance penalties are incurred by
Kubernetes scaling the service over multiple nodes.
V. CONCLUSIONS AND FUTURE WORK
The proposed RL system is able to accurately match about
30% more records compared to the baselines. This improve-
ment is due to two contributions: (i) the introduction of
short company name extractions and their use both in the
preprocessing phase as well as in the scoring phase and (ii)
specific improvements of the scoring function, namely taking
into account diacritic characters, legal entity type, and the
ability to identify geographic locations in a company name.
Additionally, despite being deployed on only a single node
of our three-node cluster, the system achieves an aggregate
processing time of 17 ms per record, which means that we are
able to match approximately 5M records per day. These
performance figures scale linearly with the number of nodes,
making the system well suited for analyzing high-volume
streamed content.
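The daily throughput follows from the per-record processing time by simple
arithmetic. The sketch below reproduces the single-node figure and, assuming
ideal linear scaling, extrapolates it to several nodes; the function name is
chosen for this example.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000 ms


def records_per_day(ms_per_record, nodes=1):
    """Daily throughput, assuming ideal linear scaling across nodes."""
    return nodes * MS_PER_DAY / ms_per_record


single_node = records_per_day(17)            # roughly 5.08 million records
three_nodes = records_per_day(17, nodes=3)   # roughly 15.2 million records
```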
Our future work will: (i) explore the use of other LSH
functions such as SimHash [5] to assess whether our recall
values can be improved further, (ii) maintain automatic pa-
rameter learning and automatic training dataset augmentation,
(iii) consider the historical evolution of company names and
additional company modeling [21], [22], [24].
REFERENCES
[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives.
DBpedia: A nucleus for a web of open data. In Proceedings of the 6th
International Semantic Web Conference and 2nd Asian Semantic Web
Conference, pages 722–735. Springer-Verlag, 2007.
[2] L. Barbosa, V. Crescenzi, X. L. Dong, P. Merialdo, F. Piai, D. Qiu,
Y. Shen, and D. Srivastava. Big data integration for product specifica-
tions. IEEE Data Engineering Bulletin, 41(2):71–81, June 2018.
[3] A. Z. Broder. On the resemblance and containment of documents. In
Proceedings. Compression and Complexity of SEQUENCES 1997, pages
21–29, June 1997.
[4] A. Z. Broder, M. S. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-
Wise independent permutations (extended abstract). In Proceedings of
the 30th ACM Symposium on the Theory of Computing, 1998.
[5] M. S. Charikar. Methods and apparatus for estimating similarity. US
Patent No. 7158961, 2001.
[6] M. S. Charikar. Similarity estimation techniques from rounding algo-
rithms. In Proceedings of the 34th ACM Symposium on the Theory of
Computing, 2002.
[7] M. Elfeky, V. Verykios, A. Elmagarmid, T. Ghanem, and A. Huwait.
Record linkage: A machine learning approach, a toolbox, and a digital
government web service. Technical Report 1573, Purdue University,
2003. Computer Science Technical Reports.
[8] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of
the American Statistical Association, 64(328):1183–1210, 1969.
[9] GeoNames. https://www.geonames.org/.
[10] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice
& open challenges. Proceedings of the VLDB Endowment, 5(12):2018–
2019, 2012.
[11] R. D. Gottapu, C. H. Dagli, and A. Bahrami. Entity resolution using
convolutional neural network. Procedia Computer Science, 95:153–158,
Nov. 2016.
[12] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current
practice and future directions. CSIRO Mathematical and Information
Sciences Technical Report, 3, June 2003.
[13] Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for
sequence tagging. CoRR, abs/1508.01991, 2015.
[14] IBM. Watson Natural Language Understanding.
https://www.ibm.com/watson/services/natural-language-understanding/.
[15] J. Inman. Navigation and Nautical Astronomy: For the Use of British
Seamen (3rd ed.). London, UK: W. Woodward, C. & J. Rivington, 1835.
[16] P. Konda, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra,
S. Das, P. Suganthan G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li,
F. Panahi, and H. Zhang. Magellan: Toward building entity matching
management systems over data science stacks. Proceedings of the VLDB
Endowment, 9(12):1581–1584, Aug. 2016.
[17] S. Kwashie, J. Liu, J. Li, L. Liu, M. Stumptner, and L. Yang. Certus: An
effective entity resolution approach with graph differential dependencies
(GDDs). Proceedings of the VLDB Endowment, 12(6):653–666, Feb.
2019.
[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence
data. In Proceedings of the 18th International Conference on Machine
Learning (ICML), pages 282–289, 2001.
[19] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive
Datasets. Cambridge University Press, 2nd edition, 2014.
[20] M. Loster, Z. Zuo, F. Naumann, O. Maspfuhl, and D. Thomas. Improv-
ing company recognition from unstructured text by using dictionaries.
In Proceedings of the 20th International Conference on Extending
Database Technology (EDBT), pages 610–619, 2017.
[21] K. Mirylenka, C. Miksovic, and P. Scotton. Applicability of latent
dirichlet allocation for company modeling. In Industrial Conference
on Data Mining (ICDM’2016), 2016.
[22] K. Mirylenka, C. Miksovic, and P. Scotton. Recurrent neural net-
works for modeling company-product time series. Proceedings of
2nd ECML/PKDD Workshop on Advanced Analytics and Learning on
Temporal Data (AALTD), pages 29–36, 2016.
[23] K. Mirylenka, P. Scotton, C. Miksovic, and S.-E. B. Alaoui. Linking IT
product records. In Proceedings of the Data Integration and Applications
Workshop (DINA), 2019.
[24] K. Mirylenka, P. Scotton, C. Miksovic, and J. Dillon. Hidden layer
models for company representations and product recommendations. In
Advances in Database Technology - 22nd International Conference on
Extending Database Technology (EDBT), pages 468–476, 2019.
[25] K. Mirylenka, P. Scotton, C. Miksovic, and A. Schade. Similarity match-
ing system for record linkage. US Patent Application P201704804US01,
2018.
[26] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep,
E. Arcaute, and V. Raghavendra. Deep learning for entity matching:
A design space exploration. In Proceedings of the 2018 International
Conference on Management of Data (SIGMOD), pages 19–34, 2018.
[27] K. Qian, L. Popa, and P. Sen. Active learning for large-scale entity
resolution. In Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM), pages 1379–1388, Nov. 2017.
[28] S. M. Randall, A. M. Ferrante, J. H. Boyd, and J. B. Semmens.
The effect of data cleaning on record linkage quality. BMC Medical
Informatics and Decision Making, 13(1), June 2013.
[29] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor:
Towards removing the curse of dimensionality. Theory of Computing,
8:321–350, 2012.
[30] What is a SIC code? https://siccode.com/page/what-is-a-sic-code.
[31] Solr. https://lucene.apache.org/solr/.
[32] The Unicode Consortium. The Unicode® Standard, 2019.
https://www.unicode.org/standard/standard.html.