Paolo Papotti
EURECOM · Data Science Department

PhD

About

125 Publications · 18,616 Reads
2,662 Citations
Introduction
Paolo Papotti is an Assistant Professor in the Data Science Department at EURECOM (France). His research focuses on data management. He is a regular program committee member for data management venues such as ACM SIGMOD, VLDB, and IEEE ICDE, and an associate editor of the ACM Journal of Data and Information Quality (JDIQ).

Publications (125)
Preprint
Full-text available
Fact-checking is an effective solution for fighting online misinformation. However, traditional fact-checking requires scarce expert human resources and thus does not scale well on social media, given the continuous flow of new content to be checked. Methods based on crowdsourcing have been proposed to tackle this challen...
Preprint
Full-text available
Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve promising results for these two tasks, there is no clear solution for the more general problem of matching tex...
Preprint
While pre-trained language models (PLMs) are the go-to solution to tackle many natural language processing problems, they are still very limited in their ability to capture and to use common-sense knowledge. In fact, even if information is available in the form of approximate (soft) logical rules, it is not clear how to transfer it to a PLM in orde...
Conference Paper
The reporting and the analysis of current events around the globe have expanded from professional, editor-led journalism all the way to citizen journalism. Nowadays, politicians and other key players enjoy direct access to their audiences through social media, bypassing the filters of official cables or traditional media. However, the multiple adva...
Preprint
Full-text available
The reporting and analysis of current events around the globe have expanded from professional, editor-led journalism all the way to citizen journalism. Politicians and other key players enjoy direct access to their audiences through social media, bypassing the filters of official cables or traditional media. However, the multiple advantages of free...
Article
The Future of Work (FoW) is witnessing an evolution where AI systems (broadly machines or businesses) are used to the benefit of humans. Work here refers to all forms of paid and unpaid labor, in both physical and virtual workplaces, that is enabled by AI systems. This covers crowdsourcing platforms such as Amazon Mechanical Turk, online labor ma...
Research Proposal
Full-text available
* Context: Deep learning (DL) has recently been used successfully for monitoring and improving data quality (DQ). Examples include data integration tasks such as entity resolution and schema matching, data cleaning tasks such as error detection and repair, and data curation in general. The data curation community has successfully leveraged deep...
Article
Entity-centric knowledge graphs (KGs) are now a popular way to collect facts about entities. KGs have rich schemas, with a large number of different types and predicates to describe the entities and their relationships. On these rich schemas, logical rules are used to represent dependencies between the data elements. While rules are useful in query answer...
Article
Full-text available
Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists of making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicti...
Article
We describe RuDiK, an algorithm and a system for mining declarative rules over RDF knowledge graphs (KGs). RuDiK can discover rules expressing both positive relationships between KG elements, e.g., “if two persons share at least one parent, they are likely to be siblings,” and negative patterns identifying data contradictions, e.g., “if two persons...
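As a minimal illustration of the two rule kinds RuDiK mines, the sketch below evaluates a hand-written positive rule (shared parent implies siblings) and a negative rule (child and spouse of the same person is a contradiction) over a toy triple set; the rule encoding is illustrative, not RuDiK's actual rule language or discovery algorithm.

```python
# Illustrative sketch (not RuDiK's engine): applying one positive and one
# negative rule over a toy KG stored as (subject, predicate, object) triples.
from itertools import combinations

triples = {
    ("anna", "parent", "carl"), ("ben", "parent", "carl"),
    ("carl", "child", "anna"), ("carl", "spouse", "dora"),
}

def parents(kg, person):
    return {o for s, p, o in kg if s == person and p == "parent"}

people = {s for s, _, _ in triples}

# Positive rule: a shared parent implies a (likely) sibling pair.
siblings = {(a, b) for a, b in combinations(sorted(people), 2)
            if parents(triples, a) & parents(triples, b)}

# Negative rule: nobody can be both child and spouse of the same person.
violations = {(s, o) for s, p, o in triples
              if p == "child" and (s, "spouse", o) in triples}

print(siblings)    # {('anna', 'ben')}
print(violations)  # set(): no contradiction in this toy data
```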
Research
Full-text available
A content-based system to verify (textual) statistical claims about the coronavirus spread and its effects against data from official sources. Available in multiple languages (EN, IT, FR, GE) at https://coronacheck.eurecom.fr
Preprint
Organizations such as the International Energy Agency (IEA) spend significant amounts of time and money to manually fact check text documents summarizing data. The goal of the Scrutinizer system is to reduce verification overheads by supporting human fact checkers in translating text claims into SQL queries on an associated database. Scrutinizer co...
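To make the claim-to-query idea concrete, here is a hedged sketch: a natural-language claim is mapped to a SQL check against an associated database. The table, columns, and values are invented stand-ins; Scrutinizer's actual pipeline learns this translation and keeps human checkers in the loop.

```python
# Hypothetical schema and data, for illustration only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE energy (country TEXT, year INT, coal_twh REAL)")
con.executemany("INSERT INTO energy VALUES (?, ?, ?)",
                [("France", 2019, 7.9), ("France", 2018, 11.0)])

# Claim: "French coal generation fell below 8 TWh in 2019."
# A checker (or a learned translator) turns it into a SQL test.
row = con.execute(
    "SELECT coal_twh FROM energy WHERE country = ? AND year = ?",
    ("France", 2019)).fetchone()
verdict = row is not None and row[0] < 8
print("claim supported:", verdict)  # claim supported: True
```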
Preprint
Full-text available
We present a novel method - LIBRE - to learn an interpretable classifier, which materializes as a set of Boolean rules. LIBRE uses an ensemble of bottom-up weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a...
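The resulting model form, a set of Boolean rules whose disjunction gives the prediction, can be sketched as follows; the rules here are hand-written placeholders rather than LIBRE's learned output.

```python
# A rule is a conjunction of (feature, required value) tests over binary
# features; the classifier fires if any rule is fully satisfied.
rules = [
    {"fever": 1, "cough": 1},        # rule 1: fever AND cough
    {"rash": 1, "vaccinated": 0},    # rule 2: rash AND NOT vaccinated
]

def predict(record, rule_set):
    return any(all(record.get(f) == v for f, v in rule.items())
               for rule in rule_set)

print(predict({"fever": 1, "cough": 1, "rash": 0, "vaccinated": 1}, rules))  # True
print(predict({"fever": 0, "cough": 1, "rash": 0, "vaccinated": 1}, rules))  # False
```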
Conference Paper
Fact checking is the task of determining if a given claim holds. Several algorithms have been developed to check claims with reference information in the form of facts in a knowledge base. While individual algorithms have been experimentally evaluated in the past, we provide the first comprehensive and publicly available benchmark infrastructure fo...
Preprint
Full-text available
Integrating information from heterogeneous data sources is one of the fundamental problems facing any enterprise. Recently, it has been shown that deep learning based techniques such as embeddings are a promising approach for data integration problems. Prior efforts directly use pre-trained embeddings or simplistically adapt techniques from natural...
Article
Fact checking is the task of determining if a given claim holds. Several algorithms have been developed to check facts with reference information in the form of knowledge bases. We demonstrate BUCKLE, an open-source benchmark for comparing and evaluating fact checking algorithms in a level playing field across a range of scenarios. The demo is cent...
Preprint
Full-text available
One challenge in fact checking is the ability to improve the transparency of the decision. We present a fact checking method that uses reference information in knowledge graphs (KGs) to assess claims and explain its decisions. KGs contain a formal representation of knowledge with semantic descriptions of entities and their relationships. We exploit...
Article
The definition of mappings between heterogeneous schemas is a critical activity of any database application. Existing tools provide high-level interfaces for the discovery of correspondences between elements of schemas, but schema mappings need to be manually specified from scratch every time, even if the scenario at hand is similar to one that has...
Article
Full-text available
RuDiK is a system for the discovery of declarative rules over knowledge-bases (KBs). RuDiK discovers both positive rules, which identify relationships between entities, e.g., "if two persons have the same parent, they are siblings", and negative rules, which identify data contradictions, e.g., "if two persons are married, one cannot be the child of...
Article
Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data...
Conference Paper
Full-text available
We present RUDIK, a system for the discovery of declarative rules over knowledge-bases (KBs). RUDIK discovers rules that express positive relationships between entities, such as "if two persons have the same parent, they are siblings", and negative rules, i.e., patterns that identify contradictions in the data, such as "if two persons are married,...
Conference Paper
Full-text available
Fact checking is the task of determining if a given claim holds. Several algorithms have been developed to check facts with reference information in the form of knowledge bases. While individual algorithms have been experimentally evaluated, we provide a first publicly available benchmark evaluating fact checking implementations across a range of a...
Article
Full-text available
We study black-box attacks on machine learning classifiers where each query to the model incurs some cost or risk of detection to the adversary. We focus explicitly on minimizing the number of queries as a major objective. Specifically, we consider the problem of attacking machine learning classifiers subject to a budget of feature modification cos...
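A toy sketch of this budgeted query setting: greedily try cheap feature flips against a query-only classifier, counting queries and skipping modifications whose cost exceeds the budget. The classifier, costs, and greedy strategy are illustrative stand-ins, not the paper's method.

```python
def classify(x):                 # black box: the attacker only sees outputs
    return int(x[0] + x[1] >= 2)

def attack(x, costs, budget):
    x = list(x)
    queries = 0
    # Try features in order of increasing modification cost.
    for i in sorted(range(len(x)), key=lambda i: costs[i]):
        if costs[i] > budget:    # too expensive to modify
            break
        x[i] ^= 1                # flip binary feature i
        queries += 1
        if classify(x) == 0:     # evasion succeeded
            return x, queries
        x[i] ^= 1                # undo the unhelpful flip
    return None, queries

print(attack([1, 1, 0], costs=[5, 1, 3], budget=4))  # ([1, 0, 0], 1)
```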
Conference Paper
Full-text available
A volunteered geographic information system, e.g., OpenStreetMap (OSM), collects data from volunteers to generate geospatial maps. To keep the map consistent, volunteers are expected to perform the tedious task of updating the underlying geospatial data at regular intervals. This map curation step takes time and considerable human effort. In this p...
Chapter
Full-text available
Rule discovery is a challenging but inevitable process in several data-centric applications. The main challenges arise from the huge search space that needs to be explored and from the noise in the data, which can make the mining results hardly useful. While existing state-of-the-art systems place the users at the beginning and the end of the mining p...
Article
Full-text available
Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar des...
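An interpretable EM rule of the kind targeted here can be written as a Boolean combination of attribute similarity predicates. The sketch below is a hand-coded example with made-up thresholds and a generic string similarity, not a rule synthesized by the system.

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Generic string similarity in [0, 1]; a stand-in for the grammar's
    # library of similarity functions.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(r1, r2):
    # Rule: (title similar AND same year) OR authors nearly identical.
    return ((sim(r1["title"], r2["title"]) > 0.9 and r1["year"] == r2["year"])
            or sim(r1["authors"], r2["authors"]) > 0.95)

a = {"title": "Inequality Joins", "year": 2015, "authors": "Khayyat et al."}
b = {"title": "Inequality joins", "year": 2015, "authors": "Z. Khayyat et al."}
print(match(a, b))  # True: the title branch fires
```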
Chapter
Full-text available
Schema mapping management is an important research area in data transformation, integration, and cleaning systems. The reasons for its success can be found in the declarative nature of its building block (thus enabling clean semantics and easy-to-use design tools), paired with the efficiency and modularity of the deployment step. In this chapter we...
Conference Paper
The chase is a family of algorithms used in a number of data management tasks, such as data exchange, answering queries under dependencies, query reformulation with constraints, and data cleaning. It is well established as a theoretical tool for understanding these tasks, and in addition a number of prototype systems have been developed. While indi...
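For intuition, a single chase step for a tuple-generating dependency can be sketched in a few lines: when a premise holds but no witness tuple exists, one is added with a fresh labeled null. This is only the textbook step, not any of the benchmarked systems.

```python
# Dependency: every Emp(name) must appear in WorksIn(name, dept).
emp = [("ann",), ("bob",)]
works_in = [("ann", "sales")]

fresh = 0
for (name,) in emp:
    # If no witness exists, chase in a tuple with a fresh labeled null.
    if not any(w[0] == name for w in works_in):
        fresh += 1
        works_in.append((name, f"_N{fresh}"))
print(works_in)  # [('ann', 'sales'), ('bob', '_N1')]
```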
Conference Paper
Full-text available
Entity matching (EM) is a critical part of data integration and cleaning. In many applications, the users need to understand why two entities are considered a match, which reveals the need for interpretable and concise EM rules. We model EM rules in the form of General Boolean Formulas (GBFs) that allow arbitrary attribute matching combined by con...
Article
Entity matching (EM) is a critical part of data integration and cleaning. In many applications, the users need to understand why two entities are considered a match, which reveals the need for interpretable and concise EM rules. We model EM rules in the form of General Boolean Formulas (GBFs) that allow arbitrary attribute matching combi...
Article
Full-text available
This is in response to recent feedback from some readers, which requires some clarifications regarding our IEJoin algorithm published in [1]. The feedback revolves around four points: (1) a typo in our illustrative example of the join process; (2) a naming error for the index used by our algorithm to improve the bit array scan; (3) the sort order u...
Article
Full-text available
Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as B+-tree, R*-tree, and Bitmap. However, inequality joins have...
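To see why sorting helps, consider a single condition R.a < S.b: after sorting S once, each R tuple finds its matches with a binary search instead of a full scan. This toy sketch shows only that one-condition intuition; the published IEJoin handles pairs of conditions with sorted arrays plus a bit array.

```python
from bisect import bisect_right

R = [3, 8, 1]
S = [2, 9, 5, 7]

S_sorted = sorted(S)          # sort once, reuse for every R tuple
pairs = []
for a in R:
    # All S values strictly greater than a satisfy a < b.
    for b in S_sorted[bisect_right(S_sorted, a):]:
        pairs.append((a, b))
print(pairs)  # [(3, 5), (3, 7), (3, 9), (8, 9), (1, 2), (1, 5), (1, 7), (1, 9)]
```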
Article
Full-text available
Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integri...
Conference Paper
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bo...
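The repair language is plain SQL, so a user-driven fix can be as simple as the update below; the table and values are invented for illustration, and Falcon's contribution lies in generalizing such updates to other likely violations.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, zip TEXT, city TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [("Ann", "10001", "Boston"), ("Bob", "10001", "New York")])

# The user spots a wrong city for zip 10001 and repairs it with SQL.
con.execute("UPDATE customers SET city = 'New York' WHERE zip = '10001'")
print(con.execute("SELECT * FROM customers").fetchall())
```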
Conference Paper
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases a system that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for mul...
Conference Paper
Full-text available
Repairing erroneous or conflicting data that violate a set of constraints is an important problem in data management. Many automatic or semi-automatic data-repairing algorithms have been proposed in the last few years, each with its own strengths and weaknesses. Bart is an open-source error-generation system conceived to support thorough experiment...
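The core idea of constraint-aware error generation can be sketched in a few lines: perturb a clean table so that a chosen constraint (here the functional dependency zip -> city) is violated, and record the gold values for benchmarking. This toy omits everything that makes Bart interesting, such as detectability guarantees and scalability.

```python
import random

clean = [("Ann", "10001", "New York"), ("Bob", "10001", "New York")]
random.seed(0)  # deterministic toy run

dirty = list(clean)
i = random.randrange(len(dirty))
name, zip_code, _ = dirty[i]
dirty[i] = (name, zip_code, "Boston")   # zip -> city is now violated

ground_truth = {i: clean[i]}            # gold answer for scoring repairs
print(dirty, ground_truth)
```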
Conference Paper
Full-text available
Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R*-tree and Bitmap, inequality joins have received little atte...
Chapter
Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts,in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques h...
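The basic profiling check such discovery tools build on is easy to state: a candidate functional dependency A -> B holds in a table iff no A value co-occurs with two different B values. A naive sketch, ignoring the candidate-lattice pruning and noise tolerance real miners need:

```python
def fd_holds(rows, lhs, rhs):
    # A -> B holds iff each lhs value maps to a single rhs value.
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

rows = [{"zip": "10001", "city": "New York"},
        {"zip": "10001", "city": "New York"},
        {"zip": "75014", "city": "Paris"}]
print(fd_holds(rows, "zip", "city"))   # True
print(fd_holds(rows, "city", "zip"))   # True on this tiny sample
```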
Article
Full-text available
Data curation includes the many tasks needed to ensure data maintains its value over time. Given the maturity of many data curation tasks, including data transformation and data cleaning, it is surprising that rigorous empirical evaluations of research ideas are so scarce. In this work, we argue that thorough evaluation of data curation systems imp...
Chapter
Full-text available
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation prob...
Conference Paper
Data analytics is at the core of any organization that wants to obtain measurable value from its growing data assets. Data analytic tasks may range from simple to extremely complex pipelines, such as data extraction, transformation and loading, online analytical processing, graph processing, and machine learning (ML). Following the dictum “one size...
Article
Full-text available
Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts, in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques...
Article
Full-text available
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation prob...
Article
Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enter...
Conference Paper
Full-text available
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present...
Conference Paper
Full-text available
While syntactic transformations require the application of a formula on the input values, such as unit conversion or date format conversions, semantic transformations, such as zip code to city, require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a w...
Conference Paper
Full-text available
Data transformation is a crucial step in data integration. While some transformations, such as liters to gallons, can be easily performed by applying a formula or a program on the input values, others, such as zip code to city, require sifting through a repository containing explicit value mappings. There are already powerful systems that provide...
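The contrast between the two transformation kinds is easy to sketch: a syntactic transformation is a formula, while a semantic one needs reference data (below, a hard-coded dictionary standing in for the Web tables and forms the system actually harvests).

```python
def liters_to_gallons(liters):          # syntactic: a pure formula
    return liters / 3.78541

zip_to_city = {"10001": "New York",     # semantic: requires reference data
               "75014": "Paris"}

print(round(liters_to_gallons(10), 2))  # 2.64
print(zip_to_city.get("75014"))         # Paris
```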
Chapter
Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even in the presence of external sources, mainly knowledge bases and humans, effectively leveraging them still faces many challenges, such as aligning heterogeneous data...
Article
Full-text available
It is widely recognized that whenever different data sources need to be integrated into a single target database errors and inconsistencies may arise, so that there is a strong need to apply data-cleaning techniques to repair the data. Despite this need, database research has so far investigated mappings and data repairing essentially in isolation....
Article
Full-text available
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the vi...
Conference Paper
Full-text available
The ability to predict future movements for moving objects enables better decisions in terms of time, cost, and impact on the environment. Unfortunately, future location prediction is a challenging task. Existing works exploit techniques to predict a trip destination, but they are effective only when location data are precise (e.g., GPS data) and m...