ABSTRACT: Many web databases are "hidden" behind proprietary search interfaces that
enforce the top-$k$ output constraint, i.e., each query returns at most $k$ of
all matching tuples, preferentially selected and returned according to a
proprietary ranking function. In this paper, we initiate research into the
novel problem of skyline discovery over top-$k$ hidden web databases. Since
skyline tuples provide critical insights into the database and include the
top-ranked tuple for every possible ranking function following the monotonic
order of attribute values, skyline discovery from a hidden web database can
enable a wide variety of innovative third-party applications over one or
multiple web databases. Our research shows that the critical
factor affecting the cost of skyline discovery is the type of search interface
controls provided by the website. As such, we develop efficient algorithms for
the three most popular types, i.e., one-ended range, free range, and point
predicates, and then combine them to support web databases that feature a
mixture of these types. Rigorous theoretical analysis and extensive real-world
online and offline experiments demonstrate the effectiveness of our proposed
techniques and their superiority over baseline solutions.
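The skyline notion the abstract relies on can be made concrete with a minimal sketch (not from the paper; the data is illustrative and a larger attribute value is assumed better):

```python
def dominates(a, b):
    """a dominates b: at least as good on every attribute (larger is
    better here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(tuples):
    """All tuples not dominated by any other tuple; these include the
    top-ranked tuple of every monotonic ranking function."""
    return [t for t in tuples
            if not any(dominates(u, t) for u in tuples if u is not t)]

points = [(5, 3), (4, 4), (3, 5), (2, 2), (5, 1)]
print(skyline(points))  # (2, 2) and (5, 1) are dominated
```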
ABSTRACT: Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the \(t\) most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
Article · Sep 2015 · Data Mining and Knowledge Discovery
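The codeword mapping described above can be sketched as follows (not the paper's implementation; the codeword length `q` and the greedy left-to-right matching are simplifying assumptions):

```python
from collections import Counter

def top_codewords(s, q, t):
    """The t most frequent length-q substrings (codewords) of a string;
    the codeword length q is an illustrative choice."""
    counts = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return [w for w, _ in counts.most_common(t)]

def map_string(s, codewords):
    """Replace each codeword occurrence with a single new letter and drop
    the remaining characters, yielding a compact representation."""
    letters = {w: chr(ord("A") + i) for i, w in enumerate(codewords)}
    out, i = [], 0
    while i < len(s):
        for w in codewords:
            if s.startswith(w, i):
                out.append(letters[w])
                i += len(w)
                break
        else:
            i += 1  # character not starting a codeword: dropped
    return "".join(out)

query = "ACGTACGTTT"
cw = top_codewords(query, 3, 2)
print(cw, map_string(query, cw))
```

Both the query and every database string would be mapped this way before the filter step; candidates surviving the filter are then refined under the true edit distance.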
ABSTRACT: The rise of social media has turned the Web into an online community where people connect, communicate, and collaborate with each other. Structured analytics in social media is the process of discovering the structure of the relationships emerging from this social media use. It focuses on identifying the users involved, the activities they undertake, the actions they perform, and the items (e.g., movies, restaurants, blogs, etc.) they create and interact with. There are two key challenges facing these tasks: how to organize and model social media content, which is often unstructured in its raw form, in order to employ structured analytics on it; and how to employ analytics algorithms to capture both explicit link-based relationships and implicit behavior-based relationships. In this tutorial, we systematize and summarize the research so far in analyzing social interactions between users and items in the Web from data mining and database perspectives. We start with a general overview of the topic, including a discussion of various exciting and practical applications. Then, we discuss the state of the art for modeling the data, formalizing the mining task, developing the algorithmic solutions, and evaluating them on real datasets. We also emphasize open problems and challenges for future research in the area of structured analytics and social media.
Article · Aug 2015 · Proceedings of the VLDB Endowment
ABSTRACT: Many emerging applications such as collaborative editing, multi-player games, or fan-subbing require forming a team of experts to accomplish a task together. Existing research has investigated how to assign workers to such team-based tasks to ensure the best outcome, assuming the skills of individual workers to be known. In this work, we investigate how to estimate an individual worker's skill based on the outcome of the team-based tasks they have undertaken. We consider two popular skill aggregation functions and estimate the skill of the workers, where skill is either a deterministic value or a probability distribution. We propose efficient solutions for worker skill estimation using continuous and discrete optimization techniques. We present comprehensive experiments and validate the scalability and effectiveness of our proposed solutions using multiple real-world datasets.
Article · Jul 2015 · Proceedings of the VLDB Endowment
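Under an additive (sum) aggregation, one of the common choices the abstract alludes to, skill estimation reduces to a least-squares fit. A minimal sketch using plain incremental gradient descent (illustrative only; not the paper's algorithms, and all data is made up):

```python
def estimate_skills(teams, outcomes, n_workers, lr=0.05, iters=2000):
    """Fit per-worker skills s so that, under a sum aggregation, each
    team's predicted quality sum(s[w]) matches its observed outcome;
    plain incremental gradient descent on the squared error."""
    s = [0.0] * n_workers
    for _ in range(iters):
        for team, y in zip(teams, outcomes):
            err = sum(s[w] for w in team) - y
            for w in team:
                s[w] -= lr * err  # gradient step on the squared error
    return s

# three workers; observed quality of three two-person team tasks
teams = [(0, 1), (1, 2), (0, 2)]
outcomes = [3.0, 5.0, 4.0]
skills = estimate_skills(teams, outcomes, 3)
print([round(x, 2) for x in skills])  # recovers skills 1, 2, 3
```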
ABSTRACT: In recent years, there has been much research in the adoption of the Ranked Retrieval model (in addition to the Boolean retrieval model) in structured databases, especially those in a client-server environment (e.g., web databases). With this model, a search query returns the top-k tuples according to not just exact matches of selection conditions, but a suitable ranking function. While much research has gone into the design of ranking functions and the efficient processing of top-k queries, this paper studies a novel problem on the privacy implications of database ranking. The motivation is a novel yet serious privacy leakage we found on real-world web databases which is caused by the ranking function design. Many such databases feature private attributes - e.g., a social network allows users to specify certain attributes as only visible to him/herself, but not to others. While these websites generally respect the privacy settings by not directly displaying private attribute values in search query answers, many of them nevertheless take into account such private attributes in the ranking function design. The conventional belief might be that tuple ranks alone are not enough to reveal the private attribute values. Our investigation, however, shows that this is not the case in reality. To address the problem, we introduce a taxonomy of the problem space with two dimensions, (1) the type of query interface and (2) the capability of adversaries. For each subspace, we develop a novel technique which either guarantees the successful inference of private attributes, or does so for a significant portion of real-world tuples. We demonstrate the effectiveness and efficiency of our techniques through theoretical analysis, extensive experiments over real-world datasets, as well as successful online attacks over websites with tens to hundreds of millions of users - e.g., Amazon Goodreads and Renren.com.
Article · Jun 2015 · Proceedings of the VLDB Endowment
ABSTRACT: Location based services (LBS) have become very popular in recent years. They
range from map services (e.g., Google Maps) that store geographic locations of
points of interests, to online social networks (e.g., WeChat, Sina Weibo,
FourSquare) that leverage user geographic locations to enable various
recommendation functions. The public query interfaces of these services may be
abstractly modeled as a kNN interface over a database of two dimensional points
on a plane: given an arbitrary query point, the system returns the k points in
the database that are nearest to the query point. In this paper we consider the
problem of obtaining approximate estimates of SUM and COUNT aggregates by only
querying such databases via their restrictive public interfaces. We distinguish
between interfaces that return location information of the returned tuples
(e.g., Google Maps), and interfaces that do not return location information
(e.g., Sina Weibo). For both types of interfaces, we develop aggregate
estimation algorithms that are based on novel techniques for precisely
computing or approximately estimating the Voronoi cell of tuples. We discuss a
comprehensive set of real-world experiments for testing our algorithms,
including experiments on Google Maps, WeChat, and Sina Weibo.
Article · May 2015 · Proceedings of the VLDB Endowment
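The core estimator idea, that a tuple's chance of being returned by a 1-NN query at a uniformly random location equals the relative area of its Voronoi cell, can be sketched as follows. Here the cell area is itself estimated by Monte Carlo sampling, a simplification of the paper's exact and approximate Voronoi cell computations, and the database and interface are simulated:

```python
import random
random.seed(7)

# hidden database: points in the unit square, reachable only via a 1-NN query
db = [(random.random(), random.random()) for _ in range(30)]

def nn_query(q):
    """All the interface exposes: the tuple nearest to query point q."""
    return min(db, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def estimate_count(n_queries=80, n_cell=600):
    """Average 1/p(t) over tuples t returned at random query points, where
    p(t), the area of t's Voronoi cell, is itself estimated by sampling."""
    total = 0.0
    for _ in range(n_queries):
        t = nn_query((random.random(), random.random()))
        hits = sum(nn_query((random.random(), random.random())) == t
                   for _ in range(n_cell))
        p = max(hits / n_cell, 1.0 / n_cell)  # guard against zero hits
        total += 1.0 / p
    return total / n_queries

est = estimate_count()
print(round(est))  # an estimate of the hidden COUNT (true value: 30)
```

Since E[1/p(t)] over returned tuples equals the number of tuples, the average is an estimate of COUNT; SUM can be estimated analogously by weighting each returned tuple's measure attribute by 1/p(t).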
ABSTRACT: With the massive amounts of data available on online social networks, how to enable efficient analytics over such data has been an increasingly
important research problem. Given the sheer size of such social networks, many
existing studies resort to sampling techniques that draw random nodes from an
online social network through its restrictive web/API interface. Almost all of
them use the exact same underlying technique of random walk - a Markov Chain
Monte Carlo based method which iteratively transits from one node to one of
its randomly chosen neighbors.
Random walk fits naturally with this problem because, for most online social
networks, the only query we can issue through the interface is to retrieve the
neighbors of a given node (i.e., no access to the full graph topology). A
problem with random walks, however, is the "burn-in" period which requires a
large number of transitions/queries before the sampling distribution converges
to a stationary value that enables the drawing of samples in a statistically
valid manner.
In this paper, we consider a novel problem of speeding up the fundamental
design of random walks (i.e., reducing the number of queries it requires)
without changing the stationary distribution it achieves - thereby enabling a
more efficient "drop-in" replacement for existing sampling-based analytics
techniques over online social networks. Our main idea is to leverage the
history of random walks to construct a higher-ordered Markov chain. We develop
two algorithms, Circulated Neighbors and Groupby Neighbors Random Walk (CNRW
and GNRW) and prove that, no matter what the social network topology is, CNRW
and GNRW offer better efficiency than baseline random walks while achieving the
same stationary distribution. We demonstrate through extensive experiments on
real-world social networks and synthetic graphs the superiority of our
techniques over the existing ones.
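The Circulated Neighbors idea can be sketched on a toy graph: on each visit to a node, the next hop is drawn without replacement from a shuffled "deck" of its neighbors, reshuffled only when exhausted. This is a simplified illustration, not the paper's full algorithm:

```python
import random
random.seed(1)

# adjacency lists of a small undirected graph
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}

def cnrw(graph, start, steps):
    """Circulated Neighbors sketch: at each node keep a shuffled 'deck' of
    its neighbors, deal the next transition from the deck, and reshuffle
    only when the deck runs out (sampling without replacement)."""
    decks = {}
    node, visits = start, []
    for _ in range(steps):
        if not decks.get(node):
            decks[node] = random.sample(graph[node], len(graph[node]))
        node = decks[node].pop()
        visits.append(node)
    return visits

visits = cnrw(graph, 0, 20000)
freq = {v: visits.count(v) / len(visits) for v in graph}
# the walk still visits each node in proportion to its degree,
# the stationary distribution of a simple random walk
print({v: round(freq[v], 2) for v in graph})
```

Because every neighbor is used exactly once per circulation, long-run transition frequencies match the simple random walk's, preserving the stationary distribution while reducing variance.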
ABSTRACT: We present SmartCrowd, a framework for optimizing task assignment in knowledge-intensive crowdsourcing (KI-C). SmartCrowd distinguishes itself by formulating, for the first time, the problem of worker-to-task assignment in KI-C as an optimization problem, by proposing efficient adaptive algorithms to solve it and by accounting for human factors, such as worker expertise, wage requirements, and availability inside the optimization process. We present rigorous theoretical analyses of the task assignment optimization problem and propose optimal and approximation algorithms with guarantees, which rely on index pre-computation and adaptive maintenance. We perform extensive performance and quality experiments using real and synthetic data to demonstrate that the SmartCrowd approach is necessary to achieve efficient, high-quality task assignments under a guaranteed cost budget.
ABSTRACT: In this work, we initiate the investigation of optimization opportunities in
collaborative crowdsourcing. Many popular applications, such as collaborative
document editing, sentence translation, or citizen science resort to this
special form of human-based computing, where crowd workers with appropriate
skills and expertise are required to form groups to solve complex tasks.
Central to any collaborative crowdsourcing process is the aspect of successful
collaboration among the workers, which, for the first time, is formalized and
then optimized in this work. Our formalism considers two main
collaboration-related human factors, affinity and upper critical mass,
appropriately adapted from organizational science and social theories. Our
contributions are (a) proposing a comprehensive model for collaborative
crowdsourcing optimization, (b) rigorous theoretical analyses to understand the
hardness of the proposed problems, (c) an array of efficient exact and
approximation algorithms with provable theoretical guarantees. Finally, we
present a detailed set of experimental results stemming from two real-world
collaborative crowdsourcing applications using Amazon Mechanical Turk, as well
as synthetic data analyses on the scalability and qualitative aspects of our
proposed algorithms. Our experimental results successfully demonstrate the
efficacy of our proposed solutions.
ABSTRACT: Micro blogs and collaborative content sites such as Twitter and Amazon are popular among millions of users who generate huge numbers of tweets, posts, and reviews every day. Despite their popularity, these sites only provide rudimentary mechanisms to navigate their sites, programmatically or through a browser, like a keyword search interface or a get-neighbors (e.g., Friends) interface. Many interesting queries cannot be directly answered by any of these interfaces, e.g., Find Twitter users in Los Angeles that have tweeted the word 'diabetes' in the last year. Note that the Twitter programming interface does not allow conditions on the user's home location. In this paper, we introduce the novel problem of querying hidden attributes in micro blogs and collaborative content sites by leveraging the existing search mechanisms offered by those sites. We model these data sources as heterogeneous graphs and their two key access interfaces, Local Search and Content Search, which search through keywords and neighbors respectively. We show which of these two approaches is better for which types of hidden attribute searches. We conduct experiments on Twitter, Amazon, and Rate MDs to evaluate the performance of the search approaches.
ABSTRACT: The kNN query interface is a popular search interface for many real-world web
databases. Given a user-specified query, the top-k nearest neighboring tuples
(ranked by a predetermined ranking function) are returned. For example, many
websites now provide social network features that recommend to a user others
who share similar properties, interests, etc. Our studies of real-world
websites unveil a novel yet serious privacy leakage caused by the design of
such interfaces and ranking functions. Specifically, we find that many of such
websites feature private attributes that are only visible to a user
him/herself, but not to other users (and therefore will not be visible in the
query answer). Nonetheless, these websites also take into account such private
attributes in the design of the ranking function. While the conventional belief
might be that tuple ranks alone are not enough to reveal the private attribute
values, our investigation shows that this is not the case in reality.
Specifically, we define a novel problem of rank based inference, and
introduce a taxonomy of the problem space according to two dimensions, (1) the
type of query interfaces widely used in practice and (2) the capability of
adversaries. For each subspace of the problem, we develop a novel technique
which either guarantees the successful inference of private attributes, or
(when such an inference is provably infeasible in the worst-case scenario)
accomplishes such an inference attack for a significant portion of real-world
tuples. We demonstrate the effectiveness and efficiency of our techniques
through theoretical analysis and extensive experiments over real-world
datasets, including successful online attacks over popular services such as
Amazon Goodreads and Catch22dating.
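A toy simulation of rank-based inference (all names, scores, the additive ranking model, and the probe strategy are hypothetical; the paper's techniques cover far more general interfaces and adversaries):

```python
# Toy model: the site ranks users by score = public + private_flag * BONUS
# and displays only the resulting order.
BONUS = 10

users = {            # (public score, hidden private flag)
    "alice": (52, 1),
    "bob":   (61, 0),
}

def ranked(pool):
    """What the interface reveals: names in rank order, no scores."""
    return sorted(pool, key=lambda u: -(pool[u][0] + pool[u][1] * BONUS))

def infer_private(victim, public_score, probes):
    """Insert probe accounts with known public scores (and no private
    bonus), then use the victim's position among them to lower-bound the
    victim's total score; a total above the public part reveals the flag."""
    pool = dict(probes)
    pool[victim] = users[victim]
    order = ranked(pool)
    below = order[order.index(victim) + 1:]
    lower = max((probes[p][0] for p in below), default=float("-inf"))
    return lower >= public_score

probes = {f"probe{s}": (s, 0) for s in range(40, 80, 2)}
print(infer_private("alice", 52, probes))  # True: alice's flag is set
print(infer_private("bob", 61, probes))   # False
```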
ABSTRACT: CrewScout is an expert-team finding system based on the concept of skyline teams and efficient algorithms for finding such teams. Given a set of experts, CrewScout finds all k-expert skyline teams, which are not dominated by any other k-expert teams. The dominance between teams is governed by comparing their aggregated expertise vectors. The need for finding expert teams prevails in applications such as question answering, crowdsourcing, panel selection, and project team formation. The new contributions of this paper include an end-to-end system with an interactive user interface that assists users in choosing teams and a demonstration of its application domains.
ABSTRACT: We assume a dataset of transactions generated by a set of users over structured items, where each item could be described through a set of features. In this paper, we are interested in identifying the frequent featuresets (sets of features) by mining item transactions. For example, in a news website, items correspond to news articles, the features are the named-entities/topics in the articles, and an item transaction would be the set of news articles read by a user within the same session. We show that mining frequent featuresets over structured item transactions is a novel problem, and that straightforward extensions of existing frequent itemset mining techniques provide unsatisfactory results. This is due to the fact that while users are drawn to each item in the transaction due to a subset of its features, the transaction by itself does not provide any information about such underlying preferred features of users. In order to overcome this hurdle, we propose a featureset uncertainty model where each item transaction could have been generated by various featuresets with different probabilities. We describe a novel approach to transform item transactions into uncertain transactions over featuresets and estimate their probabilities using a constrained least squares approach. We propose diverse algorithms to mine frequent featuresets. Our experimental evaluation provides a comparative analysis of the different approaches proposed.
Article · Nov 2014 · Proceedings of the VLDB Endowment
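The uncertain-transaction model lends itself to an expected-support notion of frequency. A naive sketch (illustrative data and brute-force enumeration; not the paper's mining algorithms, and the probabilities here are given rather than estimated by least squares):

```python
from itertools import combinations

# each uncertain transaction: candidate featuresets with probabilities
uncertain = [
    [({"politics"}, 0.7), ({"politics", "sports"}, 0.3)],
    [({"politics", "economy"}, 0.6), ({"economy"}, 0.4)],
    [({"politics"}, 1.0)],
]

def expected_support(featureset, transactions):
    """Expected number of transactions whose generating featureset
    contains the given featureset."""
    return sum(p for t in transactions for fs, p in t if featureset <= fs)

def frequent_featuresets(transactions, minsup):
    """Brute-force enumeration of featuresets with expected support
    at least minsup."""
    features = sorted({f for t in transactions for fs, _ in t for f in fs})
    out = {}
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            sup = expected_support(set(combo), transactions)
            if sup >= minsup:
                out[frozenset(combo)] = sup
    return out

res = frequent_featuresets(uncertain, 1.0)
print(res)  # {politics} and {economy} are frequent at minsup = 1.0
```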
ABSTRACT: In this paper, we introduce a novel, general-purpose technique for faster
sampling of nodes over an online social network. Specifically, unlike
traditional random walks, which wait for the convergence of the sampling
distribution to a predetermined target distribution - a waiting process that
incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much
shorter random walk, and then proactively estimates the sampling probability
of the node reached, before using acceptance-rejection sampling to adjust the sampling probability
to the predetermined target distribution. We present a novel backward random
walk technique which provides provably unbiased estimations for the sampling
probability, and demonstrate the superiority of WALK-ESTIMATE over traditional
random walks through theoretical analysis and extensive experiments over real
world online social networks.
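The accept/reject step can be sketched on a toy graph. Here the short-walk distribution is computed exactly from the full topology as a stand-in for the paper's backward-walk estimator (which works without topology access), and the target distribution is uniform:

```python
import random
random.seed(3)

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}

def walk_distribution(graph, start, steps):
    """Exact distribution of a simple random walk after `steps` steps
    (a stand-in for the paper's backward-walk probability estimator)."""
    dist = {v: 0.0 for v in graph}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in graph}
        for v, p in dist.items():
            for u in graph[v]:
                nxt[u] += p / len(graph[v])
        dist = nxt
    return dist

def walk_estimate_sample(graph, start, steps, n_samples):
    """Short walk + acceptance-rejection so that accepted landing nodes
    follow a uniform target distribution."""
    p = walk_distribution(graph, start, steps)
    target = 1.0 / len(graph)
    c = max(target / p[v] for v in graph if p[v] > 0)  # envelope constant
    samples = []
    while len(samples) < n_samples:
        v = start
        for _ in range(steps):
            v = random.choice(graph[v])
        if random.random() < (target / p[v]) / c:
            samples.append(v)
    return samples

samples = walk_estimate_sample(graph, 0, 4, 20000)
freq = {v: samples.count(v) / len(samples) for v in graph}
print({v: round(f, 2) for v, f in freq.items()})  # each close to 0.25
```

An accepted node v occurs with probability proportional to p(v) · target(v)/(c · p(v)) = target(v)/c, so the accepted samples follow the target regardless of how short the initial walk is.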
ABSTRACT: The prevalence of social media has sparked novel advertising models, vastly different from the traditional keyword-based bidding model adopted by search engines. One such model is topic-based advertising, popular with micro-blogging sites. Instead of bidding on keywords, the approach is based on bidding on topics, with the winning bidder allowed to disseminate messages to users interested in the specific topic. Naturally, topics have varying costs depending on multiple factors (e.g., how popular or prevalent they are). Similarly, users in a micro-blogging site have diverse interests. Assuming one wishes to disseminate a message to a set V of users interested in a specific topic, a question arises whether it is possible to disseminate the same message by bidding on a set of topics that collectively reach the same users in V, albeit at a cheaper cost. In this paper, we show how an alternative set of topics R with a lower cost can be identified to target (most) users in V. Two approximation algorithms with strong bounds are presented to address the problem. Theoretical analysis and extensive quantitative and qualitative experiments over real-world data sets at realistic scale, containing millions of users and topics, demonstrate the effectiveness of our approach.
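Finding a cheaper topic set that covers the target users is an instance of weighted set cover. A standard greedy sketch (topic names, costs, and audiences are illustrative; the paper's algorithms and bounds differ):

```python
def greedy_topic_cover(topics, costs, V):
    """Greedy weighted set cover: repeatedly pick the topic with the best
    cost per newly covered user, the classic ln(n)-approximation."""
    uncovered, chosen, total = set(V), [], 0.0
    while uncovered:
        best = min(
            (t for t in topics if topics[t] & uncovered),
            key=lambda t: costs[t] / len(topics[t] & uncovered),
            default=None)
        if best is None:
            break  # remaining users unreachable via any topic
        chosen.append(best)
        total += costs[best]
        uncovered -= topics[best]
    return chosen, total

topics = {
    "marathon": {1, 2, 3},
    "nutrition": {3, 4},
    "cycling": {1, 4, 5},
    "fitness": {1, 2, 3, 4, 5},
}
costs = {"marathon": 3.0, "nutrition": 1.0, "cycling": 2.0, "fitness": 7.0}
chosen, total = greedy_topic_cover(topics, costs, {1, 2, 3, 4, 5})
print(chosen, total)  # cheaper than bidding on "fitness" alone (cost 7.0)
```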