Hector Garcia-Molina's research while affiliated with Stanford University and other places

Publications (537)

Conference Paper
In this paper, we present CrowdDQS, a system that uses the most recent set of crowdsourced voting evidence to dynamically issue questions to workers on Amazon Mechanical Turk (AMT). CrowdDQS posts all questions to AMT in a single batch, but delays the decision of the exact question to issue a worker until the last moment, concentrating votes on unc...
Conference Paper
In Entity Resolution, the objective is to find which records of a dataset refer to the same real-world entity. Crowd Entity Resolution uses humans, in addition to machine algorithms, to improve the quality of the outcome. We study a hybrid approach that combines two common interfaces for human tasks in Crowd Entity Resolution, taking into account k...
Conference Paper
We focus on data fusion, i.e., the problem of unifying conflicting data from data sources into a single representation by estimating the source accuracies. We propose SLiMFast, a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models, which in many cases correspond to logistic regression. In...
Article
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize “interesting” groups of tuples. Each group of tuples is described by a rule. For instance, the rule $(a, b, \star, 1000)$ tells us that there are 1,000 tuples with value $a$ in the first column and $b$ in the second column (and a...
Conference Paper
We study the problem of graph tracking with limited information. In this paper, we focus on updating a social graph snapshot. Say we have an existing partial snapshot, G1, of the social graph stored at some system. Over time G1 becomes out of date. We want to update G1 through a public API to the actual graph, restricted by the number of API calls...
Conference Paper
We study the problem of using the crowd to perform entity resolution (ER) on a set of records. For many types of records, especially those involving images, such a task can be difficult for machines, but relatively easy for humans. Typical crowd-based ER approaches ask workers for pairwise judgments between records, which quickly becomes prohibitiv...
Conference Paper
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ⋆, 1000) tells us that there are a thousand tuples with value a in the first column and b in the second column (and any value in...
Article
Crowdsourcing refers to solving large problems by involving human workers that solve component sub-problems or tasks. In data crowdsourcing, the problem involves data acquisition, management, and analysis. In this paper, we provide an overview of data crowdsourcing, giving examples of problems that the authors have tackled, and presenting the key d...
Article
We study the problem of estimating the quality of data sources in data fusion settings. In contrast to existing models that rely only on conflicting observations across sources to infer quality (internal signals), we propose a data fusion model, called FUSE, that combines internal signals with external data-source features. We show both theoretical...
Conference Paper
An important problem that online work marketplaces face is grouping clients into clusters, so that in each cluster clients are similar with respect to their hiring criteria. Such a separation allows the marketplace to "learn" more accurately the hiring criteria in each cluster and recommend the right contractor to each client, for a successful coll...
Chapter
We present a data exploration system equipped with smart drilldown, a novel operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each such group of tuples is represented by a rule. For instance, the rule (a; b; *; 1000) tells us that there are a thousand tuples with value a in the first c...
Article
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule $(a, b, \star, 1000)$ tells us that there are a thousand tuples with value $a$ in the first column and $b$ in the second col...
Conference Paper
Latency is a critical factor when using a crowdsourcing platform to solve a problem like entity resolution or sorting. In practice, most frameworks attempt to reduce latency by heuristically splitting a budget of questions into rounds, so that after each round the answers are analyzed and new questions are selected. We focus on one of the most exte...
Article
We study the problem of Entity Resolution (ER) with limited information. ER is the problem of identifying and merging records that represent the same real-world entity. In this paper, we focus on the resolution of a single node g from one social graph (Google+ in our case) against a second social graph (Twitter in our case). We want to find the bes...
Article
Given a set of records, an Entity Resolution (ER) algorithm finds records that refer to the same real-world entity. Humans can often determine if two records refer to the same entity, and hence we study the problem of selecting questions to ask error-prone humans. We give a Maximum Likelihood formulation for the problem of finding the 'most benefic...
Article
Evaluating workers is a critical aspect of any crowdsourcing system. In this paper, we devise techniques for evaluating workers by finding confidence intervals on their error rates. Unlike prior work, we focus on "conciseness", that is, giving as tight a confidence interval as possible. Conciseness is of utmost importance because it allows us to b...
Article
User-Defined Functions (UDFs) are used increasingly to augment query languages with extra, application-dependent functionality. Selection queries involving UDF predicates tend to be expensive, either in terms of monetary cost or latency. In this paper, we study ways to efficiently evaluate selection queries with UDF predicates. We provide a family of...
Article
Traditional search engines are unable to support a large number of potential queries issued by users, for instance, queries containing non-textual fragments such as images or videos, queries that are very long, ambiguous, or those that require subjective judgment, or semantically-rich queries over non-textual corpora. We demonstrate DataSift, a cro...
Article
We focus on crowd-powered filtering, i.e., filtering a large set of items using humans. Filtering is one of the most commonly used building blocks in crowdsourcing applications and systems. While solutions for crowd-powered filtering exist, they make a range of implicit assumptions and restrictions, ultimately rendering them not powerful enough for rea...
Conference Paper
We consider the problem of using humans to find a bounded number of items satisfying certain properties, from a data set. For instance, we may want humans to identify a select number of travel photos from a data set of photos to display on a travel website, or a candidate set of resumes that meet certain requirements from a large pool of applicants...
Article
Entity resolution (ER) identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We first address the problem of keeping the ER result up-to-date when the ER logic or data “evolve” frequently. A naïve approach...
Article
Data scientists rely on visualizations to interpret the data returned by queries, but finding the right visualization remains a manual task that is often laborious. We propose a DBMS that partially automates the task of finding the right visualizations for a query. In a nutshell, given an input query Q, the new DBMS optimizer will explore not only the spa...
Article
Traditional information retrieval systems have limited functionality. For instance, they are not able to adequately support queries containing non-textual fragments such as images or videos, queries that are very long or ambiguous, or semantically-rich queries over non-textual corpora. In this paper, we present DataSift, an expressive and accurate...
Conference Paper
We study the problem of disinformation. We assume that an "agent" has some sensitive information that the "adversary" is trying to obtain. For example, a camera company (the agent) may secretly be developing its new camera model, and a user (the adversary) may want to know in advance the detailed specs of the model. The agent's goal is to disse...
Conference Paper
form only given. It may sound contradictory to use humans to analyze big data, since humans cannot process huge amounts of data, may be error prone and are relatively slow. However, humans can do certain tasks much better than machines, e.g., tasks that involve image analysis or natural language. In this talk I will discuss how humans can be judici...
Conference Paper
Worker quality control is a crucial aspect of crowdsourcing systems, typically occupying a large fraction of the time and money invested in crowdsourcing. In this work, we devise techniques to generate confidence intervals for worker error rate estimates, thereby enabling a better evaluation of worker quality. We show that our techniques generate c...
Conference Paper
We propose an algorithm that obtains the top-k list of items out of a larger itemset, using human workers (e.g., through crowdsourcing) to perform comparisons among items. An example application is finding the best photographs in a large collection by asking humans to evaluate different photos. Our algorithm has to address several challenges: obtai...
Article
Entity resolution (ER) is the problem of identifying which records in a database refer to the same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data from the web may simply be too large to completely resolve with a reasonable amount of work. As...
Article
We study the problem of enhancing Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity and can be an extremely difficult process for computer algorithms alone. For example, figuring out which images refer to the same person can be a hard task for computers, but an ea...
Article
Deco is a comprehensive system for answering declarative queries posed over stored relational data together with data obtained on-demand from the crowd. In this overview paper, we describe Deco's data model, query language, and system prototype, summarizing material from earlier papers. Deco's data model was designed to be general, flexible, and pr...
Conference Paper
Crowdsourcing enables programmers to incorporate "human computation" as a building block in algorithms that cannot be fully automated, such as text analysis and image recognition. Similarly, humans can be used as a building block in data-intensive applications: providing, comparing, and verifying data used by applications. Building upon the decades...
Conference Paper
We study data privacy in the context of information leakage. As more of our sensitive data gets exposed to merchants, health care providers, employers, social sites and so on, there is a higher chance that an adversary can "connect the dots" and piece together a lot of our information. The more complete the integrated information, the more our pri...
Conference Paper
We study quality control mechanisms for a crowdsourcing system where workers perform object comparison tasks. We study error masking techniques (e.g., voting) and detection of bad workers. For the latter, we consider using gold-standard questions, as well as disagreement with the plurality answer. We perform experiments on Mechanical Turk that yiel...
Article
Deco is a system that enables declarative crowdsourcing: answering SQL queries posed over data gathered from the crowd as well as existing relational data. Deco implements a novel push-pull hybrid execution model in order to support a flexible data model and a precise query semantics, while coping with the combination of latency, monetary cost, and...
Article
Use of links to enhance page ranking has been widely studied. The underlying assumption is that links convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because the majority of the links in a website are used to organize information and convey no recommendati...
Conference Paper
Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution...
Conference Paper
In sponsored search auctions advertisers compete for ad slots in the search engine results page, by bidding on keywords of interest. To improve advertiser expressiveness, we augment the bidding process with conflict constraints. With such constraints, advertisers can condition their bids on the non-appearance of certain undesired ads on the results...
Article
Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution...
Article
This paper presents a new randomized algorithm for quickly finding approximate nearest neighbor matches between image patches. Our algorithm offers substantial performance improvements over the previous state of the art (20–100×), enabling its use in new interactive image editing tools, computer vision, and video applications. Previously, the cost...
Article
How to address user information needs amidst a preponderance of data.
Conference Paper
We study the impact of display advertising on user search behavior using a field experiment. In such an experiment, the treatment group users are exposed to some display advertising campaign, while the control group users are not. During the campaign and the post-campaign period we monitor the user search queries and we label them as relevant or ir...
Article
In this paper, we consider the problem of constructing wrappers for web information extraction that are robust to changes in websites. We consider two models to study robustness formally: the adversarial model, where we look at the worst-case robustness of wrappers, and probabilistic model, where we look at the expected robustness of wrappers, as w...
Article
Users of websites such as Facebook, eBay and Yahoo! demand fast response times, and these sites replicate data across globally distributed datacenters to achieve this. However, it is not necessary to replicate all data to all locations: if a European user's record is never accessed in Asia, it does not make sense to pay the bandwidth and disk costs...
Article
We study the problem of making recommendations when the objects to be recommended must also satisfy constraints or requirements. In particular, we focus on course recommendations: the courses taken by a student must satisfy requirements (e.g., take two out of a set of five math courses) in order for the student to graduate. Our work is done in the...
Article
We consider the problem of human-assisted graph search: given a directed acyclic graph with some (unknown) target node(s), we consider the problem of finding the target node(s) by asking an omniscient human questions of the form "Is there a target node that is reachable from the current node?". This general problem has applications in many domains...
Conference Paper
Transmission of infectious diseases, propagation of information, and spread of ideas and influence through social networks are all examples of diffusion. In such cases we say that a contagion spreads through the network, a process that can be modeled by a cascade graph. Studying cascades and network diffusion is challenging due to missing data. Eve...
Article
We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having...
Article
We consider the problem of human-assisted graph search: given a directed acyclic graph with some (unknown) target node(s), we consider the problem of finding the target node(s) by asking an omniscient human questions of the form "Is there a target node that is reachable from the current node?". This general problem has applications in many domains...
Article
We consider a crowdsourcing database system that may cleanse, populate, or filter its data by using human workers. Just like a conventional DB system, such a crowdsourcing DB system requires data manipulation functions such as select, aggregate, maximum, average, and so on, except that now it must rely on human operators (that for example compare t...
Conference Paper
We present "Turkalytics," a novel analytics tool for human computation systems. Turkalytics processes and reports logging events from workers in real-time and has been shown to scale to over one hundred thousand logging events per day. We present a state model for worker interaction that covers the Mechanical Turk (the SCRAP model) and a data model...
Article
Our work investigates the problem of retrieving the maximum item from a set in crowdsourcing environments. We first develop parameterized families of max algorithms, that take as input a set of items and output an item from the set that is believed to be the maximum. Such max algorithms could, for instance, select the best Facebook profile that mat...
Conference Paper
We examine the creation of a tag cloud for exploring and understanding a set of objects (e.g., web pages, documents). In the first part of our work, we present a formal system model for reasoning about tag clouds. We then present metrics that capture the structural properties of a tag cloud, and we briefly present a set of tag selection algorithms...
Conference Paper
The advent of database services has resulted in privacy concerns on the part of the client storing data with third party database service providers. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. A distributed architecture for secure database services is proposed as a...
Article
Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We forma...
Article
Output URL bidding is a new bidding mechanism for sponsored search, where advertisers bid on search result URLs, as opposed to keywords in the input query. For example, an advertiser may want his ad to appear whenever the search result includes the sites www.imdb.com and en.wikipedia.org, instead of bidding on keywords that lead to these sites, e.g...
Article
Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extracti...
Article
The high rate of change and the unprecedented scale of the Web pose enormous challenges to search engines that wish to provide the most up-to-date and highly relevant information to their users. The VLDB 2000 paper "The Evolution of the Web and Implications for an Incremental Crawler" tried to address part of this challenge by collecting and analyzing...
Article
Entity resolution (ER) identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic "evolves" frequently. A naïve approach that re-...
Article
Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extracti...
Article
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application...
Article
The panelists will discuss what characterizes data management in the cloud, and how this differs from the broad range of applications that conventional database management systems have supported over the past few decades. They will examine whether we need to develop new technologies to address demonstrably new challenges, or whether we can largely...
Conference Paper
We study recommendations in applications where there are temporal patterns in the way items are consumed or watched. For example, a student who has taken the Advanced Algorithms course is more likely to be interested in Convex Optimization, but a student who has taken Convex Optimization need not be interested in Advanced Algorithms in the future....
Article
Web graphs are approximate snapshots of the web, created by search engines. They are essential to monitor the evolution of the web and to compute global properties like PageRank values of web pages. Their continuous monitoring requires a notion of graph similarity to help measure the amount and significance of changes in the evolving web. As a resu...
Conference Paper
We consider the problem of recommending the best set of k items when there is an inherent ordering between items, expressed as a set of prerequisites (e.g., the movie 'Godfather I' is a prerequisite of 'Godfather II'). Since this general problem is computationally intractable, we develop 3 approximation algorithms to solve this problem for various...
Conference Paper
Given a database instance and a corresponding view instance, we address the view definitions problem (VDP): Find the most succinct and accurate view definition, when the view query is restricted to a specific family of queries. We study the tradeoffs among succinctness, level of approximation, and the family of queries through algorithms and complex...
Conference Paper
A fundamental premise of tagging systems is that regular users can organize large collections for browsing and other tasks using uncontrolled vocabularies. Until now, that premise has remained relatively unexamined. Using library data, we test the tagging approach to organizing a collection. We find that tagging systems have three major large scale...
Article
The high rate of change and the unprecedented scale of the Web pose enormous challenges to search engines that wish to provide the most up-to-date and highly relevant information to their users. The VLDB 2000 paper "The Evolution of the Web and Implications for an Incremental Crawler" tried to address part of this challenge by collecting and analyzing...
Article
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific applicati...
Article
Entity resolution (ER) (also known as deduplication or merge-purge) is a process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain “inconsistencies,” either due to mistakes by the match and merge function writers or changes in the application semantics. To remove the incon...
Article
Social sites such as FaceBook, Orkut, Flickr, MySpace and many others have become immensely popular. At these sites, users share their resources (e.g., photos, profiles, blogs) and learn from each other. On the other hand, higher education applications help students and administrators track and manage academic information such as grades, course eva...
Article
Social sites have become extremely popular among users but have they attracted equal attention from the research community? Are they good only for simple tasks, such as tagging and poking friends? Do they present any new or interesting research challenges? In this paper, we describe the insights we have obtained implementing CourseRank, a course ev...
Article
The Stanford Archival Repository Project aims to build a robust archiving system that can protect digital objects from failures over very long time spans. Objects are replicated among cooperating digital archives, so that if any archive fails its objects survive. We have designed an architecture for digital archives, and developed techniques for ef...
Article
We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a high-dimensional multi-valued attribute space. The goal is to rapidly find the set of Boolean expressions that evaluate to true for a given assignment of values to attributes. A solution to this problem has appl...
Conference Paper
We show how an online advertising network can use filtering, predictive pricing and revenue sharing together to manage the quality of cost-per-click (CPC) traffic. Our results suggest that predictive pricing alone can and should be used instead of filtering to manage organic traffic quality, whereas either method can be used to deter click inflatio...
Conference Paper
Recommendation systems have become very popular but most recommendation methods are 'hard-wired' into the system making experimentation with and implementation of new recommendation paradigms cumbersome. In this paper, we propose FlexRecs, a framework that decouples the definition of a recommendation process from its execution and supports fl...
Conference Paper
Special-purpose social sites can offer valuable services to well-defined, closed communities, e.g., in a university or in a corporation. The purpose of this demo is to show the challenges, special features and potential of a focused social system in action through CourseRank, a course evaluation and planning social system.