April 2019 · 40 Reads · 11 Citations
August 2017 · 19 Reads
May 2017 · 95 Reads · 37 Citations
In this paper, we present CrowdDQS, a system that uses the most recent set of crowdsourced voting evidence to dynamically issue questions to workers on Amazon Mechanical Turk (AMT). CrowdDQS posts all questions to AMT in a single batch, but delays the decision of exactly which question to issue a worker until the last moment, concentrating votes on uncertain questions to maximize accuracy. Unlike previous work, CrowdDQS also (1) can optionally decide when it is more beneficial to issue gold standard questions with known answers than to solicit new votes (both can help us estimate worker accuracy, but gold standard questions provide a less noisy estimate of worker accuracy at the expense of not obtaining new votes), (2) estimates worker accuracies in real-time even with limited evidence (with or without gold standard questions), and (3) infers the distribution of worker skill levels to actively block poor workers. We deploy our system live on AMT to over 1,000 crowdworkers, and find that CrowdDQS can accurately answer questions using up to 6x fewer votes than standard approaches. We also find there are many non-obvious practical challenges involved in deploying such a system seamlessly to crowdworkers, and discuss techniques to overcome these challenges.
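The vote-concentration step described above can be made concrete with a small sketch. This is only an illustration of uncertainty-driven question selection under a weighted-vote model, not CrowdDQS itself; the 0.7 default accuracy for unseen workers is an assumed placeholder.

```python
import math

def posterior_yes(votes, accuracies):
    """Combine binary votes in log-odds space, each vote weighted by
    the estimated accuracy of the worker who cast it."""
    log_odds = 0.0
    for worker, vote in votes:
        p = accuracies.get(worker, 0.7)  # assumed prior for unseen workers
        weight = math.log(p / (1 - p))
        log_odds += weight if vote else -weight
    return 1 / (1 + math.exp(-log_odds))

def next_question(question_votes, accuracies):
    """Issue the question whose posterior is closest to 0.5: the one
    where one more vote is most informative."""
    return min(question_votes, key=lambda q: abs(
        posterior_yes(question_votes[q], accuracies) - 0.5))
```

Under this model, a question with conflicting votes from accurate workers stays near a 0.5 posterior and therefore attracts the next vote, which is the intuition behind concentrating votes on uncertain questions.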
May 2017 · 62 Reads · 53 Citations
In Entity Resolution, the objective is to find which records of a dataset refer to the same real-world entity. Crowd Entity Resolution uses humans, in addition to machine algorithms, to improve the quality of the outcome. We study a hybrid approach that combines two common interfaces for human tasks in Crowd Entity Resolution, taking into account key observations about the advantages and disadvantages of the two interfaces. We give a formal definition to the problem of human task selection and we derive algorithms with strong optimality guarantees. Our experiments with four real-world datasets show that our hybrid approach gives an improvement of 50% to 300% in the crowd cost to resolve a dataset, compared to using a single interface.
May 2017 · 98 Reads · 79 Citations
We focus on data fusion, i.e., the problem of unifying conflicting data from data sources into a single representation by estimating the source accuracies. We propose SLiMFast, a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models, which in many cases correspond to logistic regression. In contrast to previous approaches that use complex generative models, discriminative models make fewer distributional assumptions over data sources and allow us to obtain rigorous theoretical guarantees. Furthermore, we show how SLiMFast enables incorporating domain knowledge into data fusion, yielding accuracy improvements of up to 50% over state-of-the-art baselines. Building upon our theoretical results, we design an optimizer that obviates the need for users to manually select an algorithm for learning SLiMFast's parameters. We validate our optimizer on multiple real-world datasets and show that it can accurately predict the learning algorithm that yields the best data fusion results.
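A toy rendition of the discriminative setup may help: a plain logistic regression (fit here by stochastic gradient descent, standing in for SLiMFast's learned models) maps source features to a predicted accuracy, and conflicting facts are fused by accuracy-weighted voting. The feature values and source names below are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_accuracy_model(X, y, lr=0.5, epochs=500):
    """Plain logistic regression by stochastic gradient descent.
    Each row of X is a source's feature vector (e.g. domain-knowledge
    features); y[i] = 1 if that source's claim on labeled data was correct."""
    w = [0.0] * (len(X[0]) + 1)               # weights plus bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xb)))
            for j, xj in enumerate(xb):
                w[j] += lr * (yi - p) * xj
    return w

def fuse(claims, features, w):
    """Resolve one binary fact: each source votes with a weight equal
    to the log-odds of its predicted accuracy."""
    score = 0.0
    for src, value in claims.items():
        xb = list(features[src]) + [1.0]
        acc = sigmoid(sum(wj * xj for wj, xj in zip(w, xb)))
        acc = min(max(acc, 0.05), 0.95)       # keep log-odds finite
        score += math.log(acc / (1 - acc)) * (1 if value else -1)
    return score > 0
```

The discriminative angle is that source accuracy is predicted from features rather than inferred by a generative latent-variable model, which is what makes it amenable to standard supervised-learning guarantees.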
March 2017 · 46 Reads · 41 Citations
IEEE Transactions on Knowledge and Data Engineering
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize “interesting” groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ⋆, 1000) tells us that there are 1,000 tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-Hard, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
October 2016 · 29 Reads · 11 Citations
We study the problem of graph tracking with limited information. In this paper, we focus on updating a social graph snapshot. Say we have an existing partial snapshot, G1, of the social graph stored at some system. Over time G1 becomes out of date. We want to update G1 through a public API to the actual graph, restricted by the number of API calls allowed. Periodically recrawling every node in the snapshot is prohibitively expensive. We propose a scheme where we exploit indegrees and outdegrees to discover changes to the actual graph. When there is ambiguity, we probe the graph and verify edges. We propose a novel strategy designed for limited information that can be adapted to different levels of staleness. We evaluate our strategy against recrawling on real datasets and show that it saves an order of magnitude of API calls while introducing minimal errors.
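The degree-based discovery step can be sketched roughly as follows. This is a simplification: real API calls, rate limits, and the edge-verification probes are omitted, and the function and argument names are invented for illustration.

```python
def nodes_to_probe(snapshot_degree, api_degree, budget):
    """Rank nodes by the gap between the stored snapshot degree and the
    current API-reported degree; recrawl only the top `budget` of them.
    Degree lookups are assumed far cheaper than full edge-list crawls."""
    changed = []
    for node, old in snapshot_degree.items():
        new = api_degree.get(node, old)   # unseen nodes: assume unchanged
        if new != old:
            changed.append((abs(new - old), node))
    changed.sort(reverse=True)            # largest discrepancies first
    return [node for _, node in changed[:budget]]
```

The point of the heuristic is that a changed in- or out-degree is cheap evidence that a node's edges changed, so the limited API budget is spent only where the snapshot has likely gone stale.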
October 2016 · 34 Reads · 21 Citations
We study the problem of using the crowd to perform entity resolution (ER) on a set of records. For many types of records, especially those involving images, such a task can be difficult for machines, but relatively easy for humans. Typical crowd-based ER approaches ask workers for pairwise judgments between records, which quickly becomes prohibitively expensive even for moderate numbers of records. In this paper, we reduce the cost of pairwise crowd ER approaches by soliciting the crowd for attribute labels on records, and then asking for pairwise judgments only between records with similar sets of attribute labels. However, due to errors induced by crowd-based attribute labeling, a naive attribute-based approach becomes extremely inaccurate even with few attributes. To combat these errors, we use error mitigation strategies which allow us to control the accuracy of our results while maintaining significant cost reductions. We develop a probabilistic model which allows us to determine the optimal, lowest-cost combination of error mitigation strategies needed to achieve a minimum desired accuracy. We test our approach with actual crowdworkers on a dataset of celebrity images, and find that our results yield crowd ER strategies which achieve high accuracy yet are significantly lower cost than pairwise-only approaches.
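The attribute-label blocking idea reduces pairwise comparisons roughly as in this minimal sketch; the overlap threshold `min_overlap` is an assumed knob for illustration, not a parameter from the paper, and the error-mitigation strategies are not modeled here.

```python
def candidate_pairs(labels, min_overlap=2):
    """Generate record pairs for pairwise crowd judgment only when their
    crowdsourced attribute-label sets share at least `min_overlap`
    labels; all other pairs are treated as non-matches without asking
    the crowd, which is where the cost savings come from."""
    records = sorted(labels)
    pairs = []
    for i, r1 in enumerate(records):
        for r2 in records[i + 1:]:
            if len(labels[r1] & labels[r2]) >= min_overlap:
                pairs.append((r1, r2))
    return pairs
```

With n records and a handful of attribute labels, this prunes most of the O(n²) pairwise questions, at the risk (addressed in the paper by error mitigation) of missing matches whose labels were crowdsourced incorrectly.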
May 2016 · 101 Reads · 33 Citations
We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ⋆, 1000) tells us that there are a thousand tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-Hard, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
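The rule semantics are easy to make concrete. In this minimal sketch the ⋆ wildcard is written `*`, and `rule_count` simply counts the tuples a rule covers; it is an illustration of the notation, not the smart drill-down operator itself.

```python
STAR = "*"  # stands in for the paper's wildcard symbol

def rule_count(table, rule):
    """Count the tuples covered by a rule: a tuple matches if every
    non-wildcard rule value equals the tuple's value in that column."""
    return sum(all(r == STAR or r == t for r, t in zip(rule, row))
               for row in table)
```

For the abstract's example, a rule (a, b, ⋆) covers every tuple whose first two columns are a and b, regardless of the third column.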
April 2016 · 752 Reads · 139 Citations
IEEE Transactions on Knowledge and Data Engineering
Crowdsourcing refers to solving large problems by involving human workers that solve component sub-problems or tasks. In data crowdsourcing, the problem involves data acquisition, management, and analysis. In this paper, we provide an overview of data crowdsourcing, giving examples of problems that the authors have tackled, and presenting the key design steps involved in implementing a crowdsourced solution. We also discuss some of the open challenges that remain to be solved.
... • Explicit: online workers have the perception that they are collaborating with other people when carrying out tasks, and their individual outcomes can be influenced by other responses (Huang and Sundar 2020; Koutrika et al. 2009). • Implicit: the result of the task is the joint effort of multiple workers without a clear perception of collaboration with other users. ...
March 2009
Proceedings of the International AAAI Conference on Web and Social Media
... These algorithms summarize either datasets or blocks to speed up online EM tasks. Verroios et al. define Top-k EM [24], which detects only the k most popular entities (corresponding to the k largest record clusters) in a dataset. They adaptively use locality-sensitive hashing (LSH) to rapidly estimate the number of records corresponding to a specific entity. ...
April 2019
... Many methods relying on tournament sorting have been presented in the literature to implement crowd-based top-k algorithms. Polychronopoulos et al. (2013) presented a human-powered top-k method in which the crowd is asked to rank tuples directly and the rankings are aggregated using a median-rank aggregation algorithm. This leads to the final top-k result being identified based on the judgments of human workers. ...
June 2013
... [31] modeled the complex collaborative crowdsourcing task-assignment problem as a combinatorial optimization problem based on maximum flow and computed the optimal task assignment with a Slide-Container Queue (SCQ). Beyond that, similar to Boolean crowdsourcing, the use of historical worker data [225], answer distributions [4,226], gold-question tests [220], and behavioral data [19] is also generally adopted in open-ended crowdsourcing for better worker estimation and task assignment. ...
May 2017
... - Major: a baseline that resolves conflicts by selecting the most frequent value from the candidate set using kNN. - SlimFast [70]: a state-of-the-art truth-discovery model for conflict resolution in single-fact scenarios using weighted kNN. (Bold values indicate the best-performing results among the compared methods for each metric.) To ensure a fair comparison, we balance the labeled data available for SlimFast and ICLCR. ...
May 2017
... Steven Euijong Whang et al. proposed a similarity-based probabilistic model for question selection, which estimates the likelihood of the outcome of a manually answered question [15]. Crowdsourcing-based entity resolution subsequently attracted substantial research, which can be roughly divided into three categories: probability-based crowdsourced entity resolution [19,21-23], clustering-based crowdsourced entity resolution [16,24], and partial-order-based entity resolution [25-27]. Falcon and Corleone crowdsourced a whole workflow [28,29]. ...
May 2017
... Multi-dimensional data aggregation. Previous work on multi-dimensional data aggregation developed methods that extend the traditional drill-down and roll-up operators to find the most interesting data parts for exploration [4,7,40,85,105]. Other works have focused on assessing the similarity between data cubes [10], or discovering intriguing data visualizations [99,115]. ...
March 2017
IEEE Transactions on Knowledge and Data Engineering
... This is the only required local information, not exact knowledge (identities) of the nodes whose outgoing edges point to each node. Such an in-degree can be obtained without much difficulty in online social networks with follower-followee relationships such as Google+, Twitter, and Instagram, since the in-degree is usually available as part of a user profile, simply as the number of followers [73]. It is worth noting that social networks are often much more stringent about retrieving the IDs of the in-links and out-links of each user than about looking up user profiles [73]. ...
October 2016
... The existing methods usually need to obtain the matching relationships between pairs of records [12-15] and use these for truth inference in order to resolve the whole dataset [12,16-19]. In a database, large amounts of structured data are stored, consisting of different attributes, and obtaining the matching relationships between attributes can also achieve entity resolution. ...
October 2016
... Similar crowdsourcing platforms include CrowdFlower, Samasource, etc. Designing an efficient crowdsourcing model usually needs to consider three points: label-quality control [7], cost control [8], and time control [9]. In this article, we mainly study the quality control of labels. As the workers are not professionals at the crowdsourcing task, their understanding of the labeling task is uneven, which often leads to poor label quality. ...
May 2015