Julian Minder’s scientific contributions

Publications (1)


Fig. 1: Preprocessing and runtime pipeline.
Fig. 3: S-curves (left: full; right: zoomed; x-axis: Jaccard similarity; y-axis: matching probability)
Fig. 4: Different scoring functions for different MinHash configurations: r = 4, b = 10 (encircled data point), followed by r = 5, b = 18 and r = 6, b = 30 (x-axis: precision [%]; y-axis: recall [%])
Fig. 5: Scalability Analysis of the RL System
Fast Record Linkage for Company Entities
  • Conference Paper

December 2019 · 123 Reads · 18 Citations · Christoph Miksovic · Julian Minder · [...]

Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data coming from different sources. Typically no common key is available for connecting records. Massive data integration processes often have to be completed before any data analytics and further processing can be performed. In this work we focus on company entity matching, where company name, location and industry are taken into account. Our contribution is a highly scalable, enterprise-grade end-to-end system that uses rule-based linkage algorithms in combination with a machine learning approach to account for short company names. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. Based on real-world ground truth datasets, we show that our approach reaches a recall of 91% compared to 73% for baseline approaches, while scaling linearly with the number of nodes used in the system.
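
The search-space decomposition mentioned in the abstract can be illustrated with a minimal MinHash banding sketch. This is not the authors' implementation. It assumes character 3-gram shingles of the company name, a signature of r × b hash values split into b bands of r rows each (the usual LSH reading of the r and b values in the Fig. 4 caption), and BLAKE2b as a stand-in hash family. All function names are hypothetical. Under these assumptions, two records with Jaccard similarity s become a candidate pair with probability 1 − (1 − s^r)^b, which is the kind of S-curve the Fig. 3 caption describes.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, k=3):
    """Character k-gram set of a lower-cased name (assumed tokenization)."""
    s = name.lower().strip()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; BLAKE2b stands in for a MinHash hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """One minimum per seeded hash function over the token set."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records, r=4, b=10):
    """Return id pairs that share at least one identical band of r rows."""
    buckets = defaultdict(set)
    for rec_id, name in records.items():
        sig = minhash_signature(shingles(name), r * b)
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def match_probability(s, r=4, b=10):
    """Probability that two sets with Jaccard similarity s share a band."""
    return 1.0 - (1.0 - s ** r) ** b


records = {
    "a": "Acme Corporation",
    "b": "Acme Corporation",
    "c": "Zyxw Holdings Ltd",
}
pairs = candidate_pairs(records)
```

Only the pairs returned by `candidate_pairs` would then be passed to a more expensive scoring step, so the number of full comparisons depends on bucket sizes rather than on all record pairs. Raising r makes the curve steeper (fewer false candidates), while raising b shifts it toward lower similarities (fewer missed matches).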

Citations (1)


... MinHash algorithm, when used with the LSH forest data structure, represents a text similarity method that approximates the Jaccard set similarity score [32]. MinHash was used to replace the large sets of string data with smaller "signatures" that still preserve the underlying similarity metric, hence producing a signature matrix, but a pair-wise signature comparison was still needed. Here the LSH Forest comes into play. ...

Reference:

Managing Personal Identifiable Information in Data Lakes
Fast Record Linkage for Company Entities