Christian Bizer

University of Mannheim ·  School of Business Informatics and Mathematics

Professor

About

183
Publications
75,756
Reads
32,150
Citations
Introduction
Christian Bizer explores technical and economic questions concerning the development of global, decentralized information environments. His current research focus is the evolution of the World Wide Web from a medium for the publication of documents into a global dataspace.
Additional affiliations
April 2008 - July 2012
Freie Universität Berlin
Position
  • Juniorprofessor

Publications

Publications (183)
Preprint
Structured product data, in the form of attribute-value pairs, is essential for e-commerce platforms to support features such as faceted product search and attribute-based product comparison. However, vendors often provide unstructured product descriptions, making attribute value extraction necessary to ensure data consistency and usability. Large...
Preprint
Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the pote...
Chapter
Entity Matching is the task of deciding if two entity descriptions refer to the same real-world entity. State-of-the-art entity matching methods often rely on fine-tuning Transformer models such as BERT or RoBERTa. Two major drawbacks of using these models for entity matching are that (i) the models require significant amounts of fine-tuning data f...
Preprint
Structured product data in the form of attribute/value pairs is the foundation of many e-commerce applications such as faceted product search, product comparison, and product recommendation. Product offers often only contain textual descriptions of the product attributes in the form of titles or free text. Hence, extracting attribute/value pairs fr...
Preprint
Entity Matching is the task of deciding if two entity descriptions refer to the same real-world entity. State-of-the-art entity matching methods often rely on fine-tuning Transformer models such as BERT or RoBERTa. Two major drawbacks of using these models for entity matching are that (i) the models require significant amounts of fine-tuning data f...
Preprint
The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets can be computationally intensive, leading to long runtimes. To reduce these runtimes, entity resolution pipelines are constructed of two parts: a blocker that applies a computationally c...
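The two-stage pipeline described above — a cheap blocker that prunes the candidate space, followed by a more expensive matcher — can be sketched in a few lines of Python. The token-overlap blocker and Jaccard matcher below are illustrative stand-ins, not the components evaluated in the paper:

```python
from itertools import product

def tokens(s):
    return set(s.lower().split())

def block(records_a, records_b):
    """Cheap blocker: only keep pairs that share at least one token."""
    pairs = []
    for a, b in product(records_a, records_b):
        if tokens(a) & tokens(b):
            pairs.append((a, b))
    return pairs

def match(pairs, threshold=0.5):
    """More expensive matcher: Jaccard similarity over token sets."""
    matches = []
    for a, b in pairs:
        ta, tb = tokens(a), tokens(b)
        sim = len(ta & tb) / len(ta | tb)
        if sim >= threshold:
            matches.append((a, b))
    return matches

offers_a = ["apple iphone 13 128gb", "samsung galaxy s22"]
offers_b = ["iphone 13 128gb black", "google pixel 7"]
candidates = block(offers_a, offers_b)
print(match(candidates))  # only the two iPhone offers survive both stages
```

Blocking cuts the quadratic comparison space before the matcher runs, which is what keeps runtimes tractable on large datasets.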
Preprint
The difficulty of an entity matching task depends on a combination of multiple factors such as the number of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions o...
Chapter
Entity matching aims at identifying records in different data sources that describe the same real-world entity. Entity matching is the foundational technique for setting RDF links in the context of the Web of Data. By applying active learning methods for training entity matchers, it is possible to reduce the human labeling effort by selecting infor...
Preprint
Contrastive learning has seen increasing success in the fields of computer vision and information retrieval in recent years. This poster is the first work that applies contrastive learning to the task of product matching in e-commerce using product offers from different e-shops. More specifically, we employ a supervised contrastive learning techniq...
Preprint
Transformer-based matching methods have significantly moved the state-of-the-art for less-structured matching tasks involving textual entity descriptions. In order to excel on these tasks, Transformer-based matching methods require a decent number of training pairs. Providing enough training data can be challenging, especially if a matcher for non-...
Chapter
Supervised entity resolution methods rely on labeled record pairs for learning matching patterns between two or more data sources. Active learning minimizes the labeling effort by selecting informative pairs for labeling. The existing active learning methods for entity resolution all target two-source matching scenarios and ignore signals that only...
Article
An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for o...
Chapter
Entity resolution is one of the central challenges when integrating data from large numbers of data sources. Active learning for entity resolution aims to learn high-quality matching models while minimizing the human labeling effort by selecting only the most informative record pairs for labeling. Most active learning methods proposed so far, start...
Chapter
Full-text available
Data from relational web tables can be used to augment cross-domain knowledge bases like DBpedia, Wikidata, or the Google Knowledge Graph with descriptions of entities that are not yet part of the knowledge base. Such long-tail entities can include for instance small villages, niche songs, or athletes that play in lower-level leagues. In previous w...
Conference Paper
The Web contains millions of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these...
Conference Paper
The goal of entity resolution, also known as duplicate detection and record linkage, is to identify all records in one or more data sets that refer to the same real-world entity. To achieve this goal, matching rules, encoding the matching patterns in the data, can be learned with the help of manually annotated record pairs. Active learning for enti...
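The core selection step of active learning — ask the human to label the pair the current model is least sure about — can be illustrated with a toy uncertainty-sampling function. The token-overlap scorer here is a placeholder standing in for a learned matching model, not the learner from the paper:

```python
def select_most_informative(pairs, score):
    """Pick the candidate pair whose match score is closest to 0.5,
    i.e. the pair the current model is most uncertain about."""
    return min(pairs, key=lambda p: abs(score(p) - 0.5))

# Placeholder scorer: token-overlap ratio instead of a trained model.
def score(pair):
    a, b = (set(s.split()) for s in pair)
    return len(a & b) / len(a | b)

pairs = [
    ("ipad pro 11", "ipad pro 11"),    # clearly a match (score 1.0)
    ("ipad pro 11", "galaxy tab s8"),  # clearly a non-match (score 0.0)
    ("ipad pro 11", "ipad air 11"),    # borderline -> worth a human label
]
print(select_most_informative(pairs, score))
```

Labeling the borderline pair teaches the model more than labeling either of the obvious cases, which is how active learning reduces annotation effort.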
Conference Paper
The Web contains a large number of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for...
Article
Deep neural networks are increasingly used for tasks such as entity resolution, sentiment analysis, and information extraction. Since these methods are rather data-hungry, large training sets are necessary for them to play to their strengths. Millions of websites have started to annotate structured data within HTML p...
Conference Paper
A current research question in the area of entity resolution (also called link discovery or duplicate detection) is whether and in which cases embeddings and deep neural network based matching methods outperform traditional symbolic matching methods. The problem with answering this question is that deep learning based matchers need large amounts of...
Chapter
Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical...
Conference Paper
Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain str...
Article
HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the...
Conference Paper
The 10th Linked Data on the Web workshop (LDOW2017) was held in Perth, Western Australia on April 3, 2017, co-located with the 26th International World Wide Web Conference (WWW2017). In its 10th anniversary edition, the LDOW workshop aims to stimulate discussion and further research into the challenges of publishing, consuming, and integrating stru...
Conference Paper
Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching...
Article
The increasing variety of Linked Data on the Web makes it challenging to determine the quality of this data and, subsequently, to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data Quality, the output of such tools is not suitable for machine consumption, and thus con...
Conference Paper
Relational tables collected from HTML pages ("web tables") are used for a variety of tasks including table extension, knowledge base completion, and data transformation. Most of the existing algorithms for these tasks assume that the data in the tables has the form of binary relations, i.e., relates a single entity to a value or to another entity....
Conference Paper
A subset of the HTML tables on the Web contains relational data. The data in these tables covers a multitude of topics and is thus very useful for complementing or validating cross-domain knowledge bases, such as DBpedia, YAGO, or the Google Knowledge Graph. A large fraction of the data in these knowledge bases is time-dependent, meaning that the c...
Conference Paper
Analysts are increasingly confronted with the situation that data which they need for a data mining project exists somewhere on the Web or in an organization’s intranet but they are unable to find it. The data mining tools that are currently available on the market offer a wide range of powerful data mining methods but hardly support analysts in se...
Conference Paper
The ninth workshop on Linked Data on the Web (LDOW2016) is held in Montreal, Quebec, Canada on April 12, 2016 and co-located with the 25th International World Wide Web Conference (WWW2016). The Web is developing from a medium for publishing textual documents into a medium for sharing structured data. This trend is fueled on the one hand by the adop...
Conference Paper
Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers...
Conference Paper
Semantically annotated data, using markup languages like RDFa and Microdata, has become more and more publicly available in the Web, especially in the area of e-commerce. Thus, a large amount of structured product descriptions are freely available and can be used for various applications, such as product search or recommendation. However, little ef...
Conference Paper
Millions of HTML tables containing structured data can be found on the Web. With their wide coverage, these tables are potentially very useful for filling missing values and extending cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. As a prerequisite for being able to use table data for knowledge base extension, th...
Conference Paper
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploy...
Article
Lots of data from different domains are published as Linked Open Data (LOD). While there are quite a few browsers for such data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this system paper, we introduce the RapidMiner Linked Open Dat...
Article
A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to ext...
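The extension step of a Search Join — look up each local key in a corpus of heterogeneous tables and pull back a requested attribute — can be sketched as follows. The corpus contents, table layout, and attribute names are invented for illustration:

```python
# Hypothetical corpus of web tables, each mapping an entity key to attributes.
corpus = [
    {"Apple": {"revenue": "394B"}, "Siemens": {"revenue": "72B"}},
    {"Siemens": {"employees": "311k"}, "BASF": {"employees": "111k"}},
]

def search_join(local_rows, attribute):
    """Extend each local row with `attribute`, taking the value from
    the first corpus table that knows both the entity and the attribute."""
    extended = []
    for row in local_rows:
        value = None
        for table in corpus:
            attrs = table.get(row["company"], {})
            if attribute in attrs:
                value = attrs[attribute]
                break
        extended.append({**row, attribute: value})
    return extended

local = [{"company": "Siemens"}, {"company": "BASF"}]
print(search_join(local, "employees"))
```

A real Search Join additionally has to match heterogeneous schemas and fuse conflicting values from multiple tables; this sketch only shows the lookup-and-extend core.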
Conference Paper
This paper presents a brief summary of the eighth workshop on Linked Data on the Web. The LDOW 2013 workshop is held in conjunction with the World Wide Web conference 2013. The focus is on data publishing, integration and consumption using RDF and other semantic representation formalisms and technologies.
Article
Full-text available
The datasets that are part of the Linking Open Data cloud diagram (LOD cloud) are classified into the following topical categories: media, government, publications, life sciences, geographic, social networking, user-generated content, and cross-domain. The topical categories were manually assigned to the datasets. In this paper, we investigate to...
Conference Paper
The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topic...
Conference Paper
Outlier detection used for identifying wrong values in data is typically applied to single datasets to search them for values of unexpected behavior. In this work, we instead propose an approach which combines the outcomes of two independent outlier detection runs to get a more reliable result and to also prevent problems arising from natural outli...
Conference Paper
In order to support web applications in understanding the content of HTML pages, an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity des...
Conference Paper
Full-text available
Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, whic...
Chapter
Full-text available
The central assumption of Linked Data is that data providers ease the integration of Web data by setting RDF links between data sources. In addition to linking entities, Web data integration also requires the alignment of the different vocabularies that are used to describe entities as well as the resolution of data conflicts between data sources....
Article
Full-text available
Previous research on the overall graph structure of the World Wide Web mostly focused on the page level, meaning that the graph that directly results from hyperlinks between individual web pages was analyzed. This paper aims to provide additional insights about the macroscopic structure of the World Wide Web by analyzing an aggregated version of a r...
Conference Paper
Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we describe and analyse a large, publicly accessible crawl of the web...
Conference Paper
Large numbers of websites have started to markup their content using standards such as Microdata, Microformats, and RDFa. The marked-up content elements comprise descriptions of people, organizations, places, events, products, ratings, and reviews. This development has accelerated in recent years as major search engines such as Google, Bing and Yahoo...
Conference Paper
Full-text available
In this paper, we present a workflow model together with an implementation following the Linked Data principles and the principles for RESTful web services. By means of RDF-based specifications of web services, workflows, and runtime information, we establish a full provenance chain for all resources created within these workflows.
Conference Paper
In order to efficiently use the ever growing amounts of structured data on the web, methods and tools for quality-aware data integration should be devised. In this paper we propose an approach to automatically learn the conflict resolution strategies, which is a crucial step in large-scale data integration. The approach is implemented as an extensi...
Article
Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical...
Article
The World Wide Web is changing from a medium for publishing texts into a medium for publishing structured data. Besides Web 2.0 APIs, Linked Data technologies increasingly play a central role in this development. Linked Data technologies enable the interlinking of databases via data links on the basis of the...
Article
Full-text available
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikiped...
Article
A large number of e-commerce websites have started to markup their products using standards such as Microdata, Microformats, and RDFa. However, the markup is mostly not as fine-grained as desirable for applications and mostly consists of free text properties. This paper discusses the challenges that arise in the task of matching descriptions of ele...
Conference Paper
Type information is very valuable in knowledge bases. However, most large open knowledge bases are incomplete with respect to type information, and, at the same time, contain noisy and incorrect data. That makes classic type inference by reasoning difficult. In this paper, we propose the heuristic link-based type inference mechanism SDType, which c...
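The gist of link-based type inference — every property pointing at an entity casts a weighted vote for the types typically observed at that end of the property — can be sketched like this. The property statistics below are made up; a system like SDType derives them from the knowledge base itself:

```python
# Hypothetical statistics: P(type | entity is the object of property).
type_given_property = {
    "dbo:birthPlace": {"dbo:Place": 0.9, "dbo:Person": 0.05},
    "dbo:capitalOf":  {"dbo:Place": 0.95},
}

def infer_types(incoming_properties, threshold=0.5):
    """Average the type distributions of all incoming properties
    and keep the types whose aggregated score clears the threshold."""
    scores = {}
    for prop in incoming_properties:
        for t, p in type_given_property.get(prop, {}).items():
            scores[t] = scores.get(t, 0.0) + p
    n = len(incoming_properties)
    return {t: s / n for t, s in scores.items() if s / n >= threshold}

# An entity that is the object of both properties is almost surely a Place.
print(infer_types(["dbo:birthPlace", "dbo:capitalOf"]))
```

Because the vote is averaged over many links, a few noisy or incorrect statements do not flip the inferred type — which is the point of using a heuristic rather than strict logical reasoning on noisy data.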
Conference Paper
More and more websites embed structured data describing for instance products, reviews, blog posts, people, organizations, events, and cooking recipes into their HTML pages using markup standards such as Microformats, Microdata and RDFa. This development has accelerated in the last two years as major Web companies, such as Google, Facebook, Yahoo!,...
Article
Full-text available
The High-Level Expert Group on Scientific Data (HLEG, 2010) was given the responsibility by the European Commission's Directorate-General for Information Society and Media to prepare a 'vision 2030' for the evolution of e-infrastructure for scientific data.
Article
We present ONTOCOM, a method to estimate the costs of ontology engineering, as well as project management tools that support the application of the method. ONTOCOM is part of a broader framework we have developed over the past five years, whose aim is to ...
Article
A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object....
Conference Paper
The amount of data that is available as Linked Data on the Web has grown rapidly over the last years. However, the linkage between data sources remains sparse as setting RDF links means effort for the data publishers. Many existing methods for generating these links rely on explicit linkage rules which specify the conditions which must hold true fo...
Article
Full-text available
The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consum...
Chapter
Over the last years, an increasing number of web sites have started to embed structured data into HTML documents as well as to publish structured data in addition to HTML documents directly on the Web. This trend has led to the extension of the Web with a global data space—the Web of Data. As the classic document Web, the Web of Data covers a wide...
Conference Paper
We have developed DBpedia Spotlight, a flexible concept tagging system that is able to tag – i.e. annotate – entities, topics and other terms in natural language text. The system starts by recognizing phrases to annotate in the input text, and subsequently disambiguates them to a reference knowledge base extracted from Wikipedia. In this paper we e...
Article
Full-text available
Linked Data sources on the Web use a wide range of different vocabularies to represent data describing the same type of entity. For some types of entities, like people or bibliographic records, common vocabularies have emerged that are used by multiple data sources. But even for representing data of these common types, different user communities u...
Article
Full-text available
The DBpedia project extracts structured information from Wikipedia editions in 97 different languages and combines this information into a large multi-lingual knowledge base covering many specific domains and general world knowledge. The knowledge base contains textual descriptions (titles and abstracts) of concepts in up to 97 languages. It also c...
Article
Full-text available
Twenty-five Semantic Web and Database researchers met at the 2011 STI Semantic Summit in Riga, Latvia July 6-8, 2011[1] to discuss the opportunities and challenges posed by Big Data for the Semantic Web, Semantic Technologies, and Database communities. The unanimous conclusion was that the greatest shared challenge was not only engineering Big Data...
Conference Paper
Full-text available
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with...
Article
This paper studies concept drift over time. We first define the meaning of a concept in terms of intension, extension and label. Then we study concept drift over time using two theories: one based on concept identity and one based on concept morphing. ...
Conference Paper
Linked Data technologies provide for setting links between records in distinct databases and thus to connect these databases into a global data space. Over the last years, Linked Data technologies have been adopted by an increasing number of data providers, including the US and UK governments as well as major players in the media and pharmaceutical...
Conference Paper
Full-text available
Enriching knowledge bases with multimedia information makes it possible to complement textual descriptions with visual and audio information. Such complementary information can help users to understand the meaning of assertions, and in general improve the user experience with the knowledge base. In this paper we address the problem of how to enrich...
Conference Paper
Full-text available
The Web has developed into a global information space consisting not just of linked documents, but also of Linked Data. In 2010, we have seen significant growth in the size of the Web of Data, as well as in the number of communities contributing to its creation. In addition to publishing and interlinking datasets, there is intensive work on develop...
Book
The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only...
Conference Paper
Full-text available
Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets....
Chapter
The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open Web settings. As SPARQL is taken up by the community, there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protoc...
Chapter
This book has introduced the concept and basic principles of Linked Data, along with overviews of supporting technologies such as URIs, HTTP, and RDF. Collectively, these technologies and principles enable a style of publishing that weaves data into the very fabric of the Web - a unique characteristic that exposes Linked Data to the rigorous and bo...
Chapter
A significant number of individuals and organisations have adopted Linked Data as a way to publish their data, not just placing it on the Web but using Linked Data to ground it in the Web [80]. The result is a global data space we call the Web of Data [30]. The Web of Data forms a giant global graph [17] consisting of billions of RDF statements fro...
Chapter
All data that is published on the Web, according to the Linked Data principles becomes part of a single, global data space. This chapter will discuss how applications use this Web of Data. In general, applications are built to exploit the following properties of the Linked Data architecture: 1. Standardized Data Representation and Access. Integrati...
Chapter
So far this book has introduced the basic principles of Linked Data (Chapter 2) and given an overview of how these principles are being applied to the publication of data from a wide variety of domains (Chapter 3). This chapter will discuss the primary design considerations that must be taken into account when preparing data to be published as Link...
Chapter
The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web. These best practices were introduced by Tim Berners-Lee in his Web architecture note Linked Data [16] and have become known as the Linked Data principles. These principles are the following: 1. Use URIs as names for things. 2. Use HTTP...
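In code, the four principles boil down to: name things with HTTP URIs, make those URIs resolve to useful data, and include links to other URIs so clients can discover more. A minimal in-memory sketch — the URIs, properties, and triples are invented for illustration, and the dereference function stands in for an HTTP GET with content negotiation:

```python
# Principles 1 & 2: things are named with HTTP URIs.
ALICE = "http://example.org/person/alice"
BERLIN = "http://example.org/place/berlin"

# Principle 3: looking up a URI yields useful, RDF-like data.
store = {
    ALICE: [(ALICE, "foaf:name", '"Alice"'),
            # Principle 4: include links to other URIs for discovery.
            (ALICE, "foaf:based_near", BERLIN)],
    BERLIN: [(BERLIN, "rdfs:label", '"Berlin"')],
}

def dereference(uri):
    """Stand-in for an HTTP GET with content negotiation."""
    return store.get(uri, [])

# Follow-your-nose navigation: dereferencing Alice surfaces a link
# to Berlin, which can in turn be dereferenced.
for s, p, o in dereference(ALICE):
    if o in store:
        print(o, "->", dereference(o))
```

This follow-your-nose pattern is what turns independently published datasets into a single navigable data space.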
Chapter
This chapter will examine various common patterns for publishing Linked Data, which demonstrate how Linked Data complements rather than replaces existing data management infrastructures. Following this conceptual overview, the chapter introduces a series of recipes for publishing Linked Data on the Web that build on the design considerations outlin...
