Erhard Rahm

University of Leipzig · Institute of Computer Science

Dr.-Ing.

About

384
Publications
98,595
Reads
18,044
Citations
Additional affiliations
April 1994 - present
University of Leipzig
Position
  • Professor (Full)

Publications (384)
Article
Full-text available
Temporal property graphs are graphs whose structure and properties change over time. Temporal graph datasets tend to be large due to stored historical information, which calls for scalable analysis capabilities. We give a complete overview of Gradoop, a graph dataflow system for scalable, distributed analytics of temporal property graphs which has been...
Article
Full-text available
Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous sources of data in order to integrate them, which has become increasingly relevant with the development of web technologies and open data initiatives. One of the involved tasks is the matching of data properties, which attempts to fin...
Preprint
Full-text available
This work is a summarized view on the results of a one-year cooperation between Oracle Corp. and the University of Leipzig. The goal was to research the organization of relationships within multi-dimensional time-series data, such as sensor data from the IoT area. We showed in this project that temporal property graphs with some extensions are a pr...
Article
Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple sim...
Article
Full-text available
Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exc...
Preprint
Full-text available
Entity Resolution (ER) is an essential part of integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations...
Preprint
Entity resolution aims to identify records that represent the same real-world entity from one or more datasets. A major challenge in learning-based entity resolution is how to reduce the label cost for training. Due to the quadratic nature of record pair comparison, labeling is a costly task that often requires a significant effort from human...
Preprint
Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple sim...
Conference Paper
Full-text available
We present and evaluate new methods for incremental entity resolution as needed for the completion of knowledge graphs integrating data from multiple sources. Compared to previous approaches we aim at reducing the dependency on the order in which new sources and entities are added. For this purpose, we consider sets of new entities for an optimized...
Chapter
We present and evaluate new methods for incremental entity resolution as needed for the completion of knowledge graphs integrating data from multiple sources. Compared to previous approaches we aim at reducing the dependency on the order in which new sources and entities are added. For this purpose, we consider sets of new entities for an optimized...
Chapter
Entity Resolution is a crucial task to integrate data from different sources to identify records that represent the same entity. Entity resolution commonly employs supervised learning techniques based on training data of matching and non-matching pairs of records and their attribute similarities as represented by similarity vectors. To reduce the a...
Chapter
The temporal analysis of evolving graphs is an important requirement in many domains. We are therefore extending the distributed graph analysis framework Gradoop and its graph data model to support temporal graph analysis. This paper contains an overview of our work in progress and an example use case from the financial domain demonstrating the fle...
Poster
A product of the community: - Federated record linkage (Bloom filter) - Federated record linkage (SMPC) - Blocking and locality-sensitive hashing - Audit trail for clinical studies - Handling of eGK numbers - MainzelLibrary - Continuous integration - Docker image - GCP validation Joint development in numbers: - Open source since 201...
Article
Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem...
Preprint
Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem...
Preprint
Full-text available
Given a large graph, a graph sample determines a subgraph with similar characteristics for certain metrics of the original graph. The samples are much smaller thereby accelerating and simplifying the analysis and visualization of large graphs. We focus on the implementation of distributed graph sampling for Big Data frameworks and in-memory dataflo...
Article
The temporal analysis of evolving graphs is an important requirement in many domains but hardly supported in current graph database and graph processing systems. We have therefore started extending the distributed graph analysis framework Gradoop for temporal graph analysis by adding time properties to vertices, edges and graphs and using them...
Article
Privacy-preserving record linkage (PPRL) is increasingly demanded in real-world applications, e.g., in the health-care domain, to combine person-related data for data analysis while preserving the privacy of individuals. However, the adoption of PPRL is hampered by the absence of easy-to-use and powerful PPRL tools covering the entire PPRL process....
Article
Analyzing large amounts of graph data, e.g., from social networks or bioinformatics, has recently gained much attention. Unfortunately, tool support for handling and analyzing such graph data is still weak and scalability to large data volumes is often limited. We introduce the BIGGR approach providing a novel tool for the user-friendly and efficie...
Article
Privacy-preserving record linkage (PPRL) supports the matching and integration of person-related data, e.g., on patients or customers without compromising privacy. It is based on the encoding of sensitive attribute values needed for matching and often involves trusted parties for linkage. We report on recent research results from the Big Data cente...
Chapter
The use of similarity measures in various domains is a cornerstone of different tasks ranging from ontology alignment to information retrieval. To this end, existing metrics can be classified into several categories among which lexical and semantic families of similarity measures predominate but have rarely been combined to complete the aforemention...
Chapter
Full-text available
There exist many tools to annotate mentions of medical entities in documents with concepts from biomedical ontologies. To improve the overall quality of the annotation process, we propose the use of machine learning to combine the results of different annotation tools. We comparatively evaluate the results of the machine-learning based approach wit...
Article
Since its launch in October 2014, the Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig has carried out collaborative research on Big Data methods and their use in challenging data science applications of different domains, leading to both general and application-specific solutions and services. In this article, we giv...
Conference Paper
The use of similarity measures in various domains is a cornerstone of different tasks ranging from ontology alignment to information retrieval. To this end, existing metrics can be classified into several categories among which lexical and semantic families of similarity measures predominate but have rarely been combined to complete the aforementio...
Article
The efficient and intelligent handling of large, often distributed and heterogeneous data sets increasingly determines the scientific and economic competitiveness in most application areas. Mobile applications, social networks, multimedia collections, sensor networks, data intense scientific experiments, and complex simulations nowadays generate a...
Article
The extensive use of semantic annotations in the medical domain to enhance information retrieval or to encode clinical notes for improved information reuse and sharing demands high-quality annotation generation and services for guaranteeing their validity over time. In this paper we present the extension of an existing framework supporting the (s...
Chapter
Privacy-preserving record linkage (PPRL) supports the integration of person-related data from different sources while protecting the privacy of individuals by encoding sensitive information needed for linkage. The use of encoded data makes it challenging to achieve high linkage quality in particular for dirty data containing errors or inconsistenci...
Article
Full-text available
The maintenance and use of metadata such as provenance and time-related information is of increasing importance in the Semantic Web, especially for Big Data applications that work on heterogeneous data from multiple sources and which require high data quality. In an RDF dataset, it is possible to store metadata alongside the actual RDF data and se...
Article
Full-text available
We demonstrate Gradoop, an open source framework that combines and extends features of graph database systems with the benefits of distributed graph processing. Using a rich graph data model and powerful graph operators, users can declaratively express graph analytical programs for distributed execution without needing advanced programming experien...
Article
Full-text available
Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on the German Medical Informatics Initiative. "Smart Medical Information Technology for Healthcare (SMITH)" is one of four consortia funded by the German Medical Informatics Initiative (MI-I) to create an alliance of universities, university hospitals, rese...
Chapter
Knowledge graphs holistically integrate information about entities from multiple sources. A key step in the construction and maintenance of knowledge graphs is the clustering of equivalent entities from different sources. Previous approaches for such an entity clustering suffer from several problems, e.g., the creation of overlapping clusters or th...
Conference Paper
Full-text available
Cyber attacks such as ransomware can do great damage. Intrusion detection systems can help to detect those attacks. Especially with anomaly detection methods, it is possible to detect previously unknown attacks. In this paper, we present a graph-based approach in combination with existing methods trying to increase recognition rates and reduce false...
Conference Paper
Transactional frequent subgraph mining identifies frequent structural patterns in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that util...
Conference Paper
Full-text available
The annotation of entities with concepts from standardized terminologies and ontologies is of high importance in the life sciences to enhance semantic interoperability, information retrieval and meta-analysis. Unfortunately, medical documents such as clinical forms or electronic health records are still rarely annotated despite the availability of...
Conference Paper
Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We propose a distributed holis...
Conference Paper
Full-text available
Frequent pattern mining is an important research field and can be applied to different labeled data structures ranging from itemsets to graphs. There are scenarios where a label can be assigned to a taxonomy and generalized patterns can be mined by replacing labels by their ancestors. In this work, we propose a novel approach to generalized frequen...
Conference Paper
Full-text available
Entity resolution identifies semantically equivalent entities, e.g., describing the same product or customer. It is especially challenging for big data applications where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all match...
Article
Full-text available
Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We here propose a distributed...
Conference Paper
Graph pattern matching is an important and challenging operation on graph data. Typical use cases are related to graph analytics. Since analysts are often non-programmers, a graph system will only gain acceptance if there is a comprehensible way to declare pattern matching queries. However, respective query languages are currently only supported b...
Conference Paper
Full-text available
Property graphs are an intuitive way to model, analyze and visualize complex relationships among heterogeneous data objects, for example, as they occur in social, biological and information networks. These graphs typically contain thousands or millions of vertices and edges and their entire representation can easily overwhelm an analyst. One way to...
Article
Full-text available
Transactional frequent subgraph mining identifies frequent subgraphs in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the f...
Chapter
Full-text available
The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (P...
Chapter
Full-text available
Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and to analyze such graph data should meet a number of challenging requirements including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining c...
Conference Paper
Full-text available
Semantic annotations are often used to enrich documents such as clinical trials and electronic health records. However, the usability of these annotations tends to decrease over time due to the evolution of the domain ontologies. The maintenance of these annotations is critical for tools that exploit them (e.g., search engines and decision support syste...
Article
Full-text available
Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well...
Conference Paper
Full-text available
Graph grouping supports data analysts in decision making based on the characteristics of large-scale, heterogeneous networks containing millions or even billions of vertices and edges. We demonstrate graph grouping with Gradoop, a scalable system supporting declarative programs composed from multiple graph operations. Using social network data, we...
Article
Full-text available
Links build the backbone of the Linked Data Cloud. With the steady growth in size of datasets comes an increased need for end users to know which frameworks to use for deriving links between datasets. In this survey, we comparatively evaluate current Link Discovery tools and frameworks. For this purpose, we outline general requirements and derive a...
Conference Paper
Full-text available
Pairwise link discovery approaches for the Web of Data do not scale to many sources thereby limiting the potential for data integration. We thus propose a holistic approach for linking many data sources based on a clustering of entities representing the same real-world object. Our clustering approach utilizes existing links and can deal with entiti...
Conference Paper
Full-text available
The integration, mining, and analysis of person-specific data can provide enormous opportunities for organizations, governments, and researchers to leverage today's massive data collections. However, the use of personal or otherwise sensitive data also raises concerns about the privacy, confidentiality, and potential discrimination of people. Pri...
Conference Paper
Full-text available
This paper deals with the problem of maintenance of semantic annotations produced based on domain ontologies. Many annotated texts have been produced and made available to end-users. If not reviewed regularly, the quality of these annotations tends to decrease over time due to the evolution of the domain ontologies. The quality of these annotations...
Conference Paper
Full-text available
Annotations are useful to semantically enrich documents and other datasets with concepts of standardized vocabularies and ontologies. In the medical domain, many documents are not annotated at all and manual annotation is a difficult process making automatic annotation methods highly desirable to support human annotators. We propose a reuse-based a...
Conference Paper
Full-text available
Current data integration approaches are mostly limited to few data sources, partly due to the use of binary match approaches between pairs of sources. We thus advocate for the development of more holistic, clustering-based data integration approaches that scale to many data sources. We outline different use cases and provide an overview of initial...
Article
Full-text available
Biomedical ontologies are heavily used to annotate data, and different ontologies are often interlinked by ontology mappings. These ontology-based mappings and annotations are used in many applications and analysis tasks. Since biomedical ontologies are continuously updated, dependent artifacts can become outdated and need to undergo evolution as we...
Conference Paper
Full-text available
Graphs are an intuitive way to model complex relationships between real-world data objects. Thus, graph analytics plays an important role in research and industry. As graphs often reflect heterogeneous domain data, their representation requires an expressive data model including the abstraction of graph collections, for example, to analyze communi...
Article
Full-text available
The analysis of person-related data in Big Data applications faces the tradeoff of finding useful results while preserving a high degree of privacy. This is especially challenging when person-related data from multiple sources need to be integrated and analyzed. Privacy-preserving record linkage (PPRL) addresses this problem by encoding sensitive a...
Article
Full-text available