String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching

Abstract

Person entities are important linking nodes both within and between Linked Open Data resources across different domains and use cases. Efficient identity management is therefore a crucial part of resource development and maintenance. This case study is concerned with the semi-automatic population of a newly developed domain knowledge graph, LexBib Wikibase [https://lexbib.elex.is/wiki/Main_Page], with high-quality person data. We aim to transform person name literals taken from publication metadata into Semantic Web entities, to enable improved retrieval and entity enrichment for the domain-specific discovery portal ElexiFinder [http://finder.elex.is/intelligence?type=articles]. In a prototype workflow for this transformation, the open source tool OpenRefine is used as a one-tool solution to perform deduplication (synonym problem), disambiguation (homonym problem) and reconciliation of person names with reference datasets, using a sample of 3,104 name literals taken from the LexBib domain bibliography. We closely examine OpenRefine’s clustering functions and their underlying string matching algorithms, to gain a better understanding of their ability to account for different error types that frequently occur in person name matching, such as spelling errors, phonetic variations, initials, or double names. Following the same approach, we examine the string matching processes implemented in two widely used reconciliation services, for Wikidata [https://github.com/wetneb/openrefine-wikibase] and VIAF [https://github.com/codeforkjeff/conciliator]. OpenRefine offers various features to support further processing of algorithmic output, so we also analyse the usefulness of these features for the presented use case. The results of this case study may contribute to a better understanding and further development of interlinking features in OpenRefine and adjoining reconciliation services.
By offering empirical data on OpenRefine’s underlying string matching algorithms, the study’s results supplement existing guides and tutorials on clustering and reconciliation, especially for person name matching projects.
String matching algorithms in OpenRefine clustering and reconciliation functions
A case study of person name matching
Christiane Klaes
Hildesheim University / University Library Braunschweig
c.klaes@tu-braunschweig.de
01.12.2021
Agenda
1. Use case: domain knowledge base "LexBib"
2. String matching measures for person names
3. Clustering algorithms in OpenRefine
4. Matching algorithms in reconciliation services
1 Domain knowledge base "LexBib"
Data flow
OpenRefine
Data cleaning: error handling, deduplication, assign preferred labels
Reconciliation: reconciliation services, disambiguation, enrichment
(Lindemann et al., 2019; Klaes, 2021)
1 Domain knowledge base "LexBib"
Harmonizing name literals
Initials
Double names
Nicknames
Order of name components
Spelling errors
2 String matching measures for person names
Levenshtein
N-grams, Skip-grams
Phonetic measures:
Soundex, Metaphone (for English)
Cologne (for German)
Jaro, JaroWinkler
(Christen 2006; Recchia/Louwerse 2013; Pilania/Kumaran 2019)
Minerich, Richard. 2012. "Levenshtein Distance and the Triangle Inequality." Inviting Epiphany, September 04. https://devopedia.org/levenshtein-distance#Minerich-2012
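As an illustration of the first measure listed on this slide, the following is a minimal Python sketch of the Levenshtein (edit) distance, using the standard dynamic-programming formulation. The function name `levenshtein` is chosen here for illustration; it is not taken from OpenRefine or from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# A spelling variant from the LexBib sample: one deleted character.
print(levenshtein("Franck Sajous", "Frank Sajous"))  # prints 1
```

A small distance relative to the length of the name suggests a spelling variant rather than a different person, which is why this measure suits typing errors but not initials or reordered name components.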
3 Clustering algorithms in OpenRefine
Mixed methods approach
conservative → liberal
Name strings → Person entities
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
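To make the most conservative end of this spectrum concrete, here is a simplified Python sketch of key-collision clustering with a fingerprint key. It approximates the steps described in the Clustering-In-Depth documentation (trim, lowercase, ASCII folding, punctuation removal, token sorting and deduplication); it is not OpenRefine's actual implementation, and the function names `fingerprint` and `cluster` are assumptions made for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Build a normalised key; names with the same key are cluster candidates."""
    s = name.strip().lower()
    # Fold accented characters to ASCII (e.g. "è" becomes "e").
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping letters, digits, underscores and whitespace.
    s = re.sub(r"[^\w\s]", "", s)
    # Sort and deduplicate tokens so that word order does not matter.
    return " ".join(sorted(set(s.split())))


def cluster(names: list[str]) -> list[list[str]]:
    """Group names by fingerprint and keep only groups with several members."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(cluster(["Agnès Tutin", "Agnes Tutin", "Franck Sajous"]))
```

Because tokens are sorted, an inverted form such as "Tutin, Agnès" receives the same key as "Agnès Tutin". A hyphenated double name does not match its unhyphenated form, however, since removing the hyphen fuses the two tokens into one.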
3 Clustering algorithms in OpenRefine
Manual validation and post-processing
3 Clustering algorithms in OpenRefine
Results for clustering algorithms
| Clustering algorithm | Number of clusters | Precision of clusters | Typical deviations |
|---|---|---|---|
| Fingerprint | 60 | 1.0 | Agnès Tutin / Agnes Tutin |
| Bi-gram fingerprint | 9 | 1.0 | Sene-Mongaba / Sene Mongaba |
| Metaphone 3 | 66 | 0.65 | Hannu Tommola / Hannu Tammala |
| Cologne | 42 | 0.19 | Hannu Tommola / Hannu Tuomola |
| Levenshtein | 5 | 1.0 | Franck Sajous / Frank Sajous |
| PPM (setting: 1.0 / 6) | 15 | 1.0 | Bolette S. Pedersen / Bolette Sandford Pedersen |
| PPM (setting: 2.0 / 4) | 103 | 0.43 | B. T. Sue Atkins / Sue Atkins / Beryl T. Sue Atkins |

Sample: 3,104 person names from LexBib
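The bi-gram fingerprint row can be illustrated with a short Python sketch. It is a simplified approximation of the n-gram fingerprint method described in the OpenRefine clustering documentation, not the tool's own code; the function name `ngram_fingerprint` and its parameter `n` are assumptions made for this example.

```python
import re
import unicodedata


def ngram_fingerprint(name: str, n: int = 2) -> str:
    """Build a key from the sorted, deduplicated character n-grams of a name."""
    s = unicodedata.normalize("NFKD", name.lower())
    s = s.encode("ascii", "ignore").decode("ascii")  # fold accents to ASCII
    s = re.sub(r"[\W_]", "", s)  # drop punctuation AND whitespace
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))


# The typical deviation reported for this method in the table:
print(ngram_fingerprint("Sene-Mongaba") == ngram_fingerprint("Sene Mongaba"))
```

Since whitespace and hyphens are both removed before the n-grams are built, the two forms of the double name yield the same key, which the token-based fingerprint cannot achieve.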
3 Clustering algorithms in OpenRefine
Automatic vs. manual clustering
3 Clustering algorithms in OpenRefine
Lessons learned
Dierent name forms as input for clustering algorithms
01.12.2021
Christiane Klaes11
4 Matching algorithms in reconciliation services
(Lindemann et al., 2019; Klaes, 2021)
OpenRefine
Data cleaning: error handling, deduplication, assign preferred labels
Reconciliation: reconciliation services (Levenshtein distance), disambiguation, enrichment
4 Matching algorithms in reconciliation services
Reconciliation results for Wikidata and VIAF
[Chart: number of names by candidate score, for Wikidata and VIAF; categories: autoMatch, no match]
4 Matching algorithms in reconciliation services
Validation
Judgement
Validation: precision of matches

| | Wikidata | VIAF |
|---|---|---|
| autoMatches | 0.9 | 0.91 |
| Linking candidates, score 100 - 95 | 0.29 | 1.00 |
| All linking candidates | 0.18 | 0.67 |
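For readers unfamiliar with the measure, precision is the share of proposed matches that manual validation judged correct. The Python sketch below shows the calculation. The counts in the usage line are hypothetical and are not taken from the study, which reports only the resulting ratios.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of proposed matches that were validated as correct."""
    proposed = true_positives + false_positives
    if proposed == 0:
        raise ValueError("precision is undefined when nothing was proposed")
    return true_positives / proposed


# Hypothetical counts for illustration only: 9 correct out of 10 proposed.
print(precision(true_positives=9, false_positives=1))  # prints 0.9
```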
4 Matching algorithms in reconciliation services
Misleading deviations in VIAF name strings
Birth year as part of name strings
Punctuation as part of name strings
Similarity score (Levenshtein distance)
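The effect named on this slide can be sketched in Python as follows. The example assumes a similarity score of one minus the edit distance divided by the longer string's length; the slides do not state the formula used by the reconciliation services, so this normalisation is an assumption. The helper names and the name "Doe, Jane, 1950-" are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower otherwise."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def strip_dates_and_punctuation(name: str) -> str:
    """Remove life dates such as '1950-' or '1950-2010' and all punctuation."""
    s = re.sub(r"\d{4}\s*-?\s*(\d{4})?", " ", name)
    s = re.sub(r"[^\w\s]", " ", s)
    return " ".join(s.split())  # collapse the leftover whitespace


query, candidate = "Doe, Jane", "Doe, Jane, 1950-"  # hypothetical strings
print(similarity(query, candidate))  # prints 0.5625
print(similarity(strip_dates_and_punctuation(query),
                 strip_dates_and_punctuation(candidate)))  # prints 1.0
```

Under this assumed formula, a birth year appended to an otherwise identical name lowers the score considerably, so cleaning both strings before comparison is one way to keep such candidates from being ranked too low.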
Thank you!
Questions & comments welcome
Acknowledgements
This presentation is based on the author’s master's thesis titled "Linked Open Data-Strategien zum Identity Management in einer Fachontologie – prototypische Entwicklung eines Workflows zur Aufbereitung und zum Interlinking von Personennamen" (Linked Open Data strategies for identity management in a domain ontology: prototypical development of a workflow for preparing and interlinking person names), University of Hildesheim, August 2021.
Many thanks to PD Dr. Laura Giacomini and Prof. Dr. Ulrich Heid, and to Dr. David Lindemann.
References
Christen, Peter (2006): A Comparison of Personal Name Matching: Techniques and Practical Issues. Sixth IEEE International Conference on Data Mining - Workshops (ICDMW’06), 2006, Hong Kong, China: IEEE, 290–294. http://doi.org/10.1109/ICDMW.2006.2
Christen, Peter (2012): Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Berlin; New York: Springer.
Delpeuch, Antonin (2019): A Survey of OpenRefine Reconciliation Services. http://arxiv.org/abs/1906.08092
Färber, Michael/Bartscherer, Frederic/Menne, Carsten/Rettinger, Achim (2017): Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. In: Zaveri, Amrapali et al. (eds.), Semantic Web, 9 (1), 77–129.
Heath, Tom/Bizer, Christian (2011): Linked Data: Evolving the Web into a Global Data Space, Vol. 1. 1st ed. Morgan & Claypool. http://linkeddatabook.com/book
Klaes, Christiane (2021): Linked Open Data-Strategien zum Identity Management in einer Fachontologie - Prototypische Entwicklung eines Workflows zur Aufbereitung und zum Interlinking von Personennamen. Hildesheim: Universität Hildesheim.
Lindemann, David/Klaes, Christiane/Zumstein, Philipp (2019): Metalexicography as Knowledge Graph. In: Eskevich, Maria/De Melo, Gerard/Fäth, Christian/McCrae, John P./Buitelaar, Paul/Chiarcos, Christian/Klimek, Bettina/Dojchinovski, Milan (eds.), OASIcs, 70. https://doi.org/10.4230/OASIcs.LDK.2019.19
Pilania, Ankita/Kumaran, Gnanamani Mayyil Muthuil Muthu (2019): Comparative Study of Name Matching Algorithms. Proceedings of the 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), Bharati Vidyapeeth, New Delhi: IEEE Computer Society, 1174–1178. https://ieeexplore.ieee.org/document/8991380
Pratter, Yves (2020): Clustering in Depth: Methods and Theory Behind the Clustering Functionality in OpenRefine. GitHub. https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
Recchia, Gabriel/Louwerse, Max (2013): A Comparison of String Similarity Measures for Toponym Matching. COMP 2013 - ACM SIGSPATIAL International Workshop on Computational Models of Place, 5 November 2013, Orlando, Florida, USA, 54–61. https://doi.org/10.1145/2534848.2534850