Benny Kimelfeld

Benny Kimelfeld

About

164
Publications
14,540
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,931
Citations

Publications

Publications (164)
Article
Datasets may include errors, and specifically violations of integrity constraints, for various reasons. Standard techniques for "minimalcost" database repairing resolve these violations by aiming for a minimum change in the data, and in the process, may sway representations of different sub-populations. For instance, the repair may end up deleting...
Article
Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential of being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it i...
Preprint
Full-text available
Assessing data quality is crucial to knowing whether and how to use the data for different purposes. Specifically, given a collection of integrity constraints, various ways have been proposed to quantify the inconsistency of a database. Inconsistency measures are particularly important when we wish to assess the quality of private data without reve...
Preprint
In Approval-Based Committee (ABC) voting, each voter lists the candidates they approve and then a voting rule aggregates the individual approvals into a committee that represents the collective choice of the voters. An extensively studied class of such rules is the class of ABC scoring rules, where each voter contributes to each possible committee...
Article
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from the industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying....
Article
This editorial accompanies the second issue dedicated to the PODS research track of the Proceedings of the ACM on Management of Data (PACMMOD) journal. The journal hosts a SIGMOD and a PODS research track. The PODS research track aims to provide a solid scientific basis for methods, techniques, and solutions for the data management challenges that...
Preprint
Full-text available
Datasets may include errors, and specifically violations of integrity constraints, for various reasons. Standard techniques for ``minimal-cost'' database repairing resolve these violations by aiming for minimum change in the data, and in the process, may sway representations of different sub-populations. For instance, the repair may end up deleting...
Preprint
Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a Euclidean space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The...
Preprint
Full-text available
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from the industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying....
Article
Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the mod...
Article
We consider spatial voting where candidates are located in the Euclidean d-dimensional space, and each voter ranks candidates based on their distance from the voter's ideal point. We explore the case where information about the location of voters' ideal points is incomplete: for each dimension, we are given an interval of possible values. We study...
Article
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the...
Article
Deploying possible world semantics and the challenge of computing the certain answers to queries.
Article
Attribution scores can be applied in data management to quantify the contribution of individual items to conclusions from the data, as part of the explanation of what led to these conclusions. In Artificial Intelligence, Machine Learning, and Data Management, some of the common scores are deployments of the Shapley value, a formula for profit shari...
Article
Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggre...
Preprint
Full-text available
We study the fine-grained complexity of conjunctive queries with grouping and aggregation. For some common aggregate functions (e.g., min, max, count, sum), such a query can be phrased as an ordinary conjunctive query over a database annotated with a suitable commutative semiring. Specifically, we investigate the ability to evaluate such queries by...
Preprint
Full-text available
We consider spatial voting where candidates are located in the Euclidean $d$-dimensional space and each voter ranks candidates based on their distance from the voter's ideal point. We explore the case where information about the location of voters' ideal points is incomplete: for each dimension, we are given an interval of possible values. We study...
Article
We study the question of when we can provide direct access to the k -th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of...
Preprint
Full-text available
A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the context of noisy operations such as missing data imputation, where we automatically fill in missing values, column...
Preprint
Full-text available
A path query extracts vertex tuples from a labeled graph, based on the words that are formed by the paths connecting the vertices. We study the computational complexity of measuring the contribution of edges and vertices to an answer to a path query, focusing on the class of conjunctive regular path queries. To measure this contribution, we adopt t...
Article
We investigate approval-based committee voting with incomplete information about the approval preferences of voters. We consider several models of incompleteness where each voter partitions the set of candidates into approved, disapproved, and unknown candidates, possibly with ordinal preference constraints among candidates in the latter category....
Article
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the pr...
Article
Full-text available
Quantifying the inconsistency of a database is motivated by various goals including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility to the overall inconsistency, and thereby prioritize tuples in the explanation or inspection of dirt. Theref...
Article
Full-text available
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions w...
Preprint
The Shapley value is a game-theoretic notion for wealth distribution that is nowadays extensively used to explain complex data-intensive computation, for instance, in network analysis or machine learning. Recent theoretical works show that query evaluation over relational databases fits well in this explanation paradigm. Yet, these works fall short...
Article
Full-text available
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It has been established already in the 1950s, and i...
Article
A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of the d...
Article
We investigate the practical aspects of computing the necessary and possible winners in elections over incomplete voter preferences. In the case of the necessary winners, we show how to implement and accelerate the polynomial-time algorithm of Xia and Conitzer. In the case of the possible winners, where the problem is NP-hard, we give a natural red...
Article
Database tuples can be seen as players in the game of jointly realizing the answer to a query. Some tuples may contribute more than others to the outcome, which can be a binary value in the case of a Boolean query, a number for a numerical aggregate query, and so on. To quantify the contributions of tuples, we use the Shapley value that was introdu...
Article
We study the problem of counting the repairs of an inconsistent database in the case where constraints are Functional Dependencies (FDs). A repair is then a maximal independent set of the conflict graph, wherein nodes represent facts and edges represent violations. We establish a dichotomy in data complexity for the complete space of FDs: when the...
Preprint
We investigate approval-based committee voting with incomplete information about voters' approval preferences. We consider several models of incompleteness where each voter partitions the set of candidates into approved, disapproved, and unknown candidates, possibly with ordinal preference constraints among candidates in the latter category. For a...
Preprint
Full-text available
We study the problem of computing an embedding of the tuples of a relational database in a manner that is extensible to dynamic changes of the database. Importantly, the embedding of existing tuples should not change due to the embedding of newly inserted tuples (as database applications might rely on existing embeddings), while the embedding of al...
Article
We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and we want to find feature queries that linearly separate the two sets of examples. We focus on conjunctive feature queries, and explore two problems: (a) deciding if separating feature queries exist (separability), and...
Preprint
We investigate the problem of computing the probability of winning in an election where voter attendance is uncertain. More precisely, we study the setting where, in addition to a total ordering of the candidates, each voter is associated with a probability of attending the poll, and the attendances of different voters are probabilistically indepen...
Preprint
We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering...
Preprint
Full-text available
Quantifying the inconsistency of a database is motivated by various goals including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility to the overall inconsistency, and thereby prioritize tuples in the explanation or inspection of dirt. Theref...
Preprint
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the...
Preprint
Full-text available
We present a web-based system called ViS-\'A-ViS aiming to assist literary scholars in detecting repetitive patterns in an annotated textual corpus. Pattern detection is made possible using distant reading visualizations that highlight potentially interesting patterns. In addition, the system uses time-series alignment algorithms, and in particular...
Article
We present an algorithm that enumerates all the minimal triangulations of a graph in incremental polynomial time. Consequently, we get an algorithm for enumerating all the proper tree decompositions, in incremental polynomial time, where “proper” means that the tree decomposition cannot be improved by removing or splitting a bag. The algorithm can...
Article
The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints, including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining from data approximate DCs, that is, DCs that are "...
Preprint
Full-text available
The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisf...
Preprint
In an election via a positional scoring rule, each candidate receives from each voter a score that is determined only by the position of the candidate in the voter's total ordering of the candidates. A winner (respectively, unique winner) is a candidate who receives a score not smaller than (respectively, strictly greater than) the remaining candid...
Preprint
Full-text available
We investigate the practical aspects of computing the necessary and possible winners in elections over incomplete voter preferences. In the case of the necessary winners, we show how to implement and accelerate the polynomial-time algorithm of Xia and Conitzer. In the case of the possible winners, where the problem is NP-hard, we give a natural red...
Article
In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a “minimal way.” Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded more reliable than another, or a more recent fact should be preferred to an...
Preprint
Full-text available
Preference analysis is widely applied in various domains such as social choice and e-commerce. A recently proposed framework augments the relational database with a preference relation that represents uncertain preferences in the form of statistical ranking models, and provides methods to evaluate Conjunctive Queries (CQs) that express preferences...
Article
Preference analysis is widely applied in various domains such as social choice and e-commerce. A recently proposed frame- work augments the relational database with a preference re- lation that represents uncertain preferences in the form of statistical ranking models, and provides methods to evaluate Conjunctive Queries (CQs) that express preferen...
Preprint
Full-text available
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (intervals identified by their start and end indices) from text. Based on these Fagin et al. introduced regular document spanners which are the closure of regex formulas under Relational Algebra. In this work, we study the computational complexity...
Article
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair), which is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair),...
Preprint
Full-text available
We introduce the problem of Automatic Location Type Classification from social media posts. Our goal is to correctly associate a set of messages posted in a small radius around a given location with their corresponding location type, e.g., school, church, restaurant or museum. We provide a dataset of locations associated with tweets posted in close...
Preprint
The Shapley value is a conventional and well-studied function for determining the contribution of a player to the coalition in a cooperative game. Among its applications in a plethora of domains, it has recently been proposed to use the Shapley value for quantifying the contribution of a tuple to the result of a database query. In particular, we ha...
Preprint
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the pr...
Article
The challenge of entity matching is that of identifying when different data items (often referred to as records or mentions) refer to the same real-life entity. Popular instantiations of this problem include deduplication, where the items are database records that include duplicate representations of the same entity (e.g., duplicate profiles in a s...
Conference Paper
Personalization plays a key role in electronic commerce, adjusting the products presented to users through search and recommendations according to their personality and tastes. Current personalization efforts focus on the adaptation of product selections, while the description of a given product remains the same regardless of the user who views it....
Preprint
Full-text available
We introduce annotated document spanners, which are document spanners that can annotate their output tuples with elements from a semiring. Such spanners are useful for modeling soft constraints, which are popular in practical information extraction tools. We introduce a finite automaton model for such spanners, which generalizes vset-automata and w...
Preprint
Full-text available
We study the problem of model counting for Boolean Conjunctive Queries (CQs): given a database, how many of its subsets satisfy the CQ? This problem is computationally equivalent to the evaluation of a CQ over a tuple-independent probabilistic database (i.e., determining the probability of satisfying the CQ) when the probability of every tuple is 1...
Article
Full-text available
We study the complexity of estimating the probability of an outcome in an election over probabilistic votes. The focus is on voting rules expressed as positional scoring rules, and two models of probabilistic voters: the uniform distribution over the completions of a partial voting profile (consisting of a partial ordering of the candidates by each...
Conference Paper
A tree decomposition of a graph facilitates computations by grouping vertices into bags that are interconnected in an acyclic structure; hence their importance in a plethora of problems such as query evaluation over databases and inference over probabilistic graphical models. The relative benefit from different tree decompositions is measured by di...
Conference Paper
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allo...
Conference Paper
We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating f...
Conference Paper
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e....
Conference Paper
Election databases are the main elements of a recently introduced framework that aims to create bridges between the computational social choice and the data management communities. An election database consists of incomplete information about the preferences of voters, in the form of partial orders, alongside with standard database relations that p...
Preprint
Graph pattern matching (e.g., finding all cycles and cliques) has become an important component in many critical domains such as social networks, biology, and cyber-security. This development motivated research to develop faster algorithms that target graph pattern matching. In recent years, the database community has shown that mapping graph patte...
Preprint
Full-text available
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It has been established already in the 1950s, and i...
Preprint
How should a cleaning system measure the amount of inconsistency in the database? Proper measures are important for quantifying the progress made in the cleaning process relative to the remaining effort and resources required. Similarly to effective progress bars in interactive systems, inconsistency should ideally diminish steadily and continuousl...
Preprint
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e....
Preprint
Full-text available
We propose a visual query language for interactively exploring large-scale knowledge graphs. Starting from an overview, the user explores bar charts through three interactions: class expansion, property expansion, and subject/object expansion. A major challenge faced is performance: a state-of-the-art SPARQL engine may require tens of minutes to co...
Article
In the design of analytical procedures and machine learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this article, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classificat...
Preprint
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allo...
Article
In the design of analytical procedures and machine-learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. We embark on the establishment of database foundations for feature engineering. Specifically, we propose a formal framework for classification...
Conference Paper
We develop a novel framework that aims to create bridges between the computational social choice and the database management communities. This framework enriches the tasks currently supported in computational social choice with relational database context, thus making it possible to formulate sophisticated queries about voting rules, candidates, vo...
Conference Paper
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) th...
Conference Paper
Regular expressions with capture variables, also known as "regex formulas,'' extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational Algebra as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of q...
Conference Paper
Full-text available
Models of uncertain preferences, such as Mallows, have been extensively studied due to their plethora of application domains. In a recent work, a conceptual and theoretical framework has been proposed for supporting uncertain preferences as first-class citizens in a relational database. The resulting database is probabilistic, and, consequently, qu...
Preprint
We develop a novel framework that aims to create bridges between the computational social choice and the database management communities. This framework enriches the tasks currently supported in computational social choice with relational database context, thus making it possible to formulate sophisticated queries about voting rules, candidates, vo...
Article
Distributions over rankings are used to model user preferences in various settings including political elections and electronic commerce. The Repeated Insertion Model (RIM) gives rise to various known probability distributions over rankings, in particular to the popular Mallows model. However, probabilistic inference on RIM is computationally chall...
Article
Traditional modeling of inconsistency in database theory casts all possible "repairs" equally likely. Yet, effective data cleaning needs to incorporate statistical reasoning. For example, yearly salary of \$100k and age of 22 are more likely than \$100k and 122 and two people with same address are likely to share their last name (i.e., a functional...
Article
A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well studied language for expressing spanners is that of the regular spanners: relational algebra over regex fo...
Article
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) th...
Article
Probabilistic programming languages are used for developing statistical models. They typically consist of two components: a specification of a stochastic process (the prior) and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algor...
Article
Tree decompositions facilitate computations on complex graphs by grouping vertices into bags interconnected in an acyclic structure; hence their importance in a plethora of problems such as query evaluation over databases and inference over probabilistic graphical models. Different applications take varying benefits from different tree decompositio...
Article
For a relation that violates a set of functional dependencies, we consider the task of finding a maximum number of pairwise-consistent tuples, or what is known as a "cardinality repair." We present a polynomial-time algorithm that, for certain fixed relation schemas (with functional dependencies), computes a cardinality repair. Moreover, we prove t...
Conference Paper
In the design of analytical procedures and machine-learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this framework paper, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for cla...
Conference Paper
We present an algorithm that enumerates all the minimal triangulations of a graph in incremental polynomial time. Consequently, we get an algorithm for enumerating all the proper tree decompositions, in incremental polynomial time, where ``proper'' means that the tree decomposition cannot be improved by removing or splitting a bag. The algorithm ca...
Conference Paper
In the traditional sense, a subset repair of an inconsistent database refers to a consistent subset of facts (tuples) that is maximal under set containment. Preferences between pairs of facts allow to distinguish a set of preferred repairs based on relative reliability (source credibility, extraction quality, recency, etc.) of data items. Previous...
Conference Paper
We propose a novel framework wherein probabilistic preferences can be naturally represented and analyzed in a probabilistic relational database. The framework augments the relational schema with a special type of a relation symbol---a preference symbol. A deterministic instance of this symbol holds a collection of binary relations. Abstractly, the...
Article
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying...

Network

Cited By