Gautam Das

University of Texas at Arlington, Arlington, Texas, United States

Publications (170) · 37.27 Total Impact

  • Source
    ABSTRACT: Many web databases are "hidden" behind proprietary search interfaces that enforce the top-$k$ output constraint, i.e., each query returns at most $k$ of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-$k$ hidden web databases. Since skyline tuples provide critical insights into the database and include the top-ranked tuple for every ranking function that is monotonic in the attribute values, skyline discovery from a hidden web database can enable a wide variety of innovative third-party applications over one or multiple web databases. Our research in the paper shows that the critical factor affecting the cost of skyline discovery is the type of search interface controls provided by the website. As such, we develop efficient algorithms for the three most popular types, i.e., one-ended range, free range and point predicates, and then combine them to support web databases that feature a mixture of these types. Rigorous theoretical analysis and extensive real-world online and offline experiments demonstrate the effectiveness of our proposed techniques and their superiority over baseline solutions.
    Full-text · Article · Mar 2016
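The dominance notion underlying skyline tuples can be illustrated with a minimal sketch (hypothetical 2-attribute tuples; larger values assumed better — not the paper's interface-aware discovery algorithms, which must work through a top-$k$ query interface):

```python
def dominates(a, b):
    """a dominates b when a is at least as good on every attribute and
    strictly better on at least one (here, larger values are better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(tuples):
    """The skyline: tuples not dominated by any other tuple. The top-1 tuple
    of any monotonic ranking function is guaranteed to lie in this set."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]

# (3, 3) dominates (2, 1) and (2, 3); (1, 4) trades attribute 1 for attribute 2.
pts = [(3, 3), (2, 1), (1, 4), (2, 3)]
```

Calling `skyline(pts)` keeps only `(3, 3)` and `(1, 4)`; the hard part the paper addresses is computing this set when the tuples are only reachable through restrictive search controls.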
  • Source
    Abolfazl Asudeh · Nan Zhang · Gautam Das

    Full-text · Article · Jan 2016
  •
    ABSTRACT: Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the t most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
    No preview · Article · Sep 2015 · Data Mining and Knowledge Discovery
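The query-sensitive mapping can be sketched roughly as follows; the choice of fixed-length q-grams as codewords and the single-letter replacement alphabet are illustrative assumptions, not the paper's exact construction:

```python
from collections import Counter

def top_codewords(query, q=3, t=2):
    """Pick the t most frequent length-q substrings of the query as codewords."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    return [g for g, _ in Counter(grams).most_common(t)]

def map_string(s, codewords):
    """Replace each codeword occurrence with a fresh single letter and drop
    every character not covered by a codeword, shrinking the string."""
    letters = {cw: chr(ord('A') + i) for i, cw in enumerate(codewords)}
    out, i = [], 0
    while i < len(s):
        for cw in codewords:
            if s.startswith(cw, i):
                out.append(letters[cw])
                i += len(cw)
                break
        else:
            i += 1  # character outside any codeword: removed
    return ''.join(out)
```

Range candidates would then be filtered in this compact space and only the survivors refined with the exact edit distance on the original strings.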
  • Mahashweta Das · Gautam Das
    ABSTRACT: The rise of social media has turned the Web into an online community where people connect, communicate, and collaborate with each other. Structured analytics in social media is the process of discovering the structure of the relationships emerging from this social media use. It focuses on identifying the users involved, the activities they undertake, the actions they perform, and the items (e.g., movies, restaurants, blogs, etc.) they create and interact with. There are two key challenges facing these tasks: how to organize and model social media content, which is often unstructured in its raw form, in order to employ structured analytics on it; and how to employ analytics algorithms to capture both explicit link-based relationships and implicit behavior-based relationships. In this tutorial, we systematize and summarize the research so far in analyzing social interactions between users and items in the Web from data mining and database perspectives. We start with a general overview of the topic, including a discussion of various exciting and practical applications. Then, we discuss the state of the art for modeling the data, formalizing the mining task, developing the algorithmic solutions, and evaluating on real datasets. We also emphasize open problems and challenges for future research in the area of structured analytics and social media.
    No preview · Article · Aug 2015 · Proceedings of the VLDB Endowment
  •
    ABSTRACT: Many emerging applications such as collaborative editing, multi-player games, or fan-subbing require forming a team of experts to accomplish a task together. Existing research has investigated how to assign workers to such team-based tasks to ensure the best outcome, assuming the skills of individual workers to be known. In this work, we investigate how to estimate individual workers' skills based on the outcomes of the team-based tasks they have undertaken. We consider two popular skill aggregation functions and estimate the skill of the workers, where skill is either a deterministic value or a probability distribution. We propose efficient solutions for worker skill estimation using continuous and discrete optimization techniques. We present comprehensive experiments and validate the scalability and effectiveness of our proposed solutions using multiple real-world datasets.
    No preview · Article · Jul 2015 · Proceedings of the VLDB Endowment
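Under an additive (sum) aggregation function, the deterministic variant of this estimation reduces to a least-squares fit over task membership; a minimal sketch with hypothetical teams and noiseless outcomes (the paper's setting also covers other aggregations and probabilistic skills):

```python
import numpy as np

# Hypothetical data: 3 workers, 4 completed team tasks. Row i of A flags
# which workers formed team i; y[i] is the observed quality of task i.
# With sum-aggregation (team quality = sum of member skills), worker
# skills are recovered by ordinary least squares.
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
true_skill = np.array([2.0, 3.0, 1.0])   # ground truth, for illustration only
y = A @ true_skill                       # noiseless task outcomes
est, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Since `A` has full column rank, `est` recovers the true skills exactly; with noisy outcomes the same fit yields the least-squares estimate.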

  • No preview · Conference Paper · Jun 2015
  •
    ABSTRACT: In recent years, there has been much research in the adoption of the ranked retrieval model (in addition to the Boolean retrieval model) in structured databases, especially those in a client-server environment (e.g., web databases). With this model, a search query returns the top-k tuples according not just to exact matches of selection conditions but to a suitable ranking function. While much research has gone into the design of ranking functions and the efficient processing of top-k queries, this paper studies a novel problem on the privacy implications of database ranking. The motivation is a novel yet serious privacy leakage we found on real-world web databases, caused by the ranking function design. Many such databases feature private attributes - e.g., a social network allows users to specify certain attributes as only visible to him/herself, but not to others. While these websites generally respect the privacy settings by not directly displaying private attribute values in search query answers, many of them nevertheless take into account such private attributes in the ranking function design. The conventional belief might be that tuple ranks alone are not enough to reveal the private attribute values. Our investigation, however, shows that this is not the case in reality. To address the problem, we introduce a taxonomy of the problem space with two dimensions, (1) the type of query interface and (2) the capability of adversaries. For each subspace, we develop a novel technique which either guarantees the successful inference of private attributes, or does so for a significant portion of real-world tuples. We demonstrate the effectiveness and efficiency of our techniques through theoretical analysis, extensive experiments over real-world datasets, as well as successful online attacks over websites with tens to hundreds of millions of users - e.g., Amazon Goodreads and Catch22dating.
    No preview · Article · Jun 2015 · Proceedings of the VLDB Endowment
  • Source
    Azade Nazi · Mahashweta Das · Gautam Das

    Full-text · Conference Paper · May 2015
  • Source
    ABSTRACT: Location based services (LBS) have become very popular in recent years. They range from map services (e.g., Google Maps) that store geographic locations of points of interest, to online social networks (e.g., WeChat, Sina Weibo, FourSquare) that leverage user geographic locations to enable various recommendation functions. The public query interfaces of these services may be abstractly modeled as a kNN interface over a database of two-dimensional points on a plane: given an arbitrary query point, the system returns the k points in the database that are nearest to the query point. In this paper we consider the problem of obtaining approximate estimates of SUM and COUNT aggregates by only querying such databases via their restrictive public interfaces. We distinguish between interfaces that return location information for the returned tuples (e.g., Google Maps), and interfaces that do not return location information (e.g., Sina Weibo). For both types of interfaces, we develop aggregate estimation algorithms that are based on novel techniques for precisely computing or approximately estimating the Voronoi cell of tuples. We discuss a comprehensive set of real-world experiments for testing our algorithms, including experiments on Google Maps, WeChat, and Sina Weibo.
    Full-text · Article · May 2015 · Proceedings of the VLDB Endowment
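The Voronoi-based COUNT estimator can be illustrated on a toy database. Everything below is hypothetical (the points, the unit-square region, the brute-force interface, and the direct probing of cell areas); in the real algorithms the Voronoi cells themselves must be computed or estimated through the restrictive kNN interface:

```python
import random

random.seed(7)
# Hypothetical hidden database: 2-D points in the unit square, reachable
# only through a k=1 nearest-neighbor query interface.
db = [(0.2, 0.3), (0.7, 0.8), (0.5, 0.1), (0.9, 0.4)]

def nn_query(q):
    """The restrictive public interface: nearest tuple to query point q."""
    return min(db, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Approximate each tuple's Voronoi cell area (as a fraction of the unit
# square) by uniform probing.
probes = [(random.random(), random.random()) for _ in range(20000)]
hits = {}
for q in probes:
    t = nn_query(q)
    hits[t] = hits.get(t, 0) + 1
vor_frac = {t: c / len(probes) for t, c in hits.items()}

# A uniform query point lands in tuple t's cell with probability
# vor_frac[t], so E[1 / vor_frac[nn_query(q)]] equals COUNT (= 4 here).
sample = [(random.random(), random.random()) for _ in range(2000)]
count_est = sum(1.0 / vor_frac[nn_query(q)] for q in sample) / len(sample)
```

The estimate converges to the true COUNT of 4 because each tuple's inverse cell probability exactly cancels its chance of being hit.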
  • Source
    Zhuojie Zhou · Nan Zhang · Gautam Das
    ABSTRACT: How to enable efficient analytics over online social network data has been an increasingly important research problem. Given the sheer size of such social networks, many existing studies resort to sampling techniques that draw random nodes from an online social network through its restrictive web/API interface. Almost all of them use the exact same underlying technique of random walk - a Markov Chain Monte Carlo based method which iteratively transits from one node to its random neighbor. Random walk fits naturally with this problem because, for most online social networks, the only query we can issue through the interface is to retrieve the neighbors of a given node (i.e., no access to the full graph topology). A problem with random walks, however, is the "burn-in" period, which requires a large number of transitions/queries before the sampling distribution converges to a stationary distribution that enables the drawing of samples in a statistically valid manner. In this paper, we consider a novel problem of speeding up the fundamental design of random walks (i.e., reducing the number of queries it requires) without changing the stationary distribution it achieves - thereby enabling a more efficient "drop-in" replacement for existing sampling-based analytics techniques over online social networks. Our main idea is to leverage the history of random walks to construct a higher-order Markov chain. We develop two algorithms, Circulated Neighbors Random Walk (CNRW) and Groupby Neighbors Random Walk (GNRW), and prove that, no matter what the social network topology is, CNRW and GNRW offer better efficiency than baseline random walks while achieving the same stationary distribution. We demonstrate through extensive experiments on real-world social networks and synthetic graphs the superiority of our techniques over the existing ones.
    Full-text · Article · Apr 2015
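The contrast between a baseline random walk and the circulated-neighbors idea can be sketched as follows (a simplification of CNRW, assuming the interface exposes only a neighbors-of-node query; the paper's stationarity and efficiency proofs are not reproduced here):

```python
import random

def random_walk(neighbors, start, steps, rng=random):
    """Baseline simple random walk: each transition picks a uniform random
    neighbor via the only query the interface supports."""
    node = start
    for _ in range(steps):
        node = rng.choice(neighbors(node))
    return node

def cnrw_walk(neighbors, start, steps, rng=random):
    """Circulated-neighbors flavor (sketch): at each node, draw neighbors
    without replacement from a shuffled pool, refilling once the pool is
    exhausted, so repeated visits spread over distinct neighbors."""
    pools = {}
    node = start
    for _ in range(steps):
        if not pools.get(node):
            pool = list(neighbors(node))
            rng.shuffle(pool)
            pools[node] = pool
        node = pools[node].pop()
    return node
```

Both walks issue one neighbors query per transition; the circulated variant reduces the variance of revisit patterns, which is the intuition behind its faster convergence.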
  • Source
    ABSTRACT: We present SmartCrowd, a framework for optimizing task assignment in knowledge-intensive crowdsourcing (KI-C). SmartCrowd distinguishes itself by formulating, for the first time, the problem of worker-to-task assignment in KI-C as an optimization problem, by proposing efficient adaptive algorithms to solve it, and by accounting for human factors, such as worker expertise, wage requirements, and availability, inside the optimization process. We present rigorous theoretical analyses of the task assignment optimization problem and propose optimal and approximation algorithms with guarantees, which rely on index pre-computation and adaptive maintenance. We perform extensive performance and quality experiments using real and synthetic data to demonstrate that the SmartCrowd approach is necessary to achieve efficient, high-quality task assignments under a guaranteed cost budget.
    Full-text · Article · Mar 2015 · The VLDB Journal
  • Source
    ABSTRACT: In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science, resort to this special form of human-based computing, where crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collaborative crowdsourcing process is the aspect of successful collaboration among the workers, which, for the first time, is formalized and then optimized in this work. Our formalism considers two main collaboration-related human factors, affinity and upper critical mass, appropriately adapted from organizational science and social theories. Our contributions are (a) proposing a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, and (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing applications using Amazon Mechanical Turk, as well as synthetic data analyses on scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.
    Full-text · Article · Feb 2015
  • Source

    Full-text · Article · Feb 2015
  •
    ABSTRACT: Microblogs and collaborative content sites such as Twitter and Amazon are popular among millions of users who generate huge numbers of tweets, posts, and reviews every day. Despite their popularity, these sites only provide rudimentary mechanisms to navigate their content, programmatically or through a browser, such as a keyword search interface or a get-neighbors (e.g., Friends) interface. Many interesting queries cannot be directly answered by any of these interfaces, e.g., find Twitter users in Los Angeles that have tweeted the word 'diabetes' in the last year. Note that the Twitter programming interface does not allow conditions on the user's home location. In this paper, we introduce the novel problem of querying hidden attributes in microblogs and collaborative content sites by leveraging the existing search mechanisms offered by those sites. We model these data sources as heterogeneous graphs and their two key access interfaces, Local Search and Content Search, which search through keywords and neighbors respectively. We show which of these two approaches is better for which types of hidden attribute searches. We conduct experiments on Twitter, Amazon, and RateMDs to evaluate the performance of the search approaches.
    No preview · Article · Jan 2015
  • Source
    ABSTRACT: The kNN query interface is a popular search interface for many real-world web databases. Given a user-specified query, the top-k nearest neighboring tuples (ranked by a predetermined ranking function) are returned. For example, many websites now provide social network features that recommend to a user others who share similar properties, interests, etc. Our studies of real-world websites unveil a novel yet serious privacy leakage caused by the design of such interfaces and ranking functions. Specifically, we find that many of such websites feature private attributes that are only visible to a user him/herself, but not to other users (and therefore will not be visible in the query answer). Nonetheless, these websites also take into account such private attributes in the design of the ranking function. While the conventional belief might be that tuple ranks alone are not enough to reveal the private attribute values, our investigation shows that this is not the case in reality. Specifically, we define a novel problem of rank based inference, and introduce a taxonomy of the problem space according to two dimensions, (1) the type of query interfaces widely used in practice and (2) the capability of adversaries. For each subspace of the problem, we develop a novel technique which either guarantees the successful inference of private attributes, or (when such an inference is provably infeasible in the worst-case scenario) accomplishes such an inference attack for a significant portion of real-world tuples. We demonstrate the effectiveness and efficiency of our techniques through theoretical analysis and extensive experiments over real-world datasets, including successful online attacks over popular services such as Amazon Goodreads and Catch22dating.
    Full-text · Article · Nov 2014
  •
    ABSTRACT: CrewScout is an expert-team finding system based on the concept of skyline teams and efficient algorithms for finding such teams. Given a set of experts, CrewScout finds all k-expert skyline teams, which are not dominated by any other k-expert teams. The dominance between teams is governed by comparing their aggregated expertise vectors. The need for finding expert teams prevails in applications such as question answering, crowdsourcing, panel selection, and project team formation. The new contributions of this paper include an end-to-end system with an interactive user interface that assists users in choosing teams, and a demonstration of its application domains.
    No preview · Conference Paper · Nov 2014
  •
    ABSTRACT: We assume a dataset of transactions generated by a set of users over structured items, where each item could be described through a set of features. In this paper, we are interested in identifying the frequent featuresets (sets of features) by mining item transactions. For example, in a news website, items correspond to news articles, the features are the named-entities/topics in the articles, and an item transaction would be the set of news articles read by a user within the same session. We show that mining frequent featuresets over structured item transactions is a novel problem, and that straightforward extensions of existing frequent itemset mining techniques provide unsatisfactory results. This is due to the fact that while users are drawn to each item in the transaction due to a subset of its features, the transaction by itself does not provide any information about such underlying preferred features of users. In order to overcome this hurdle, we propose a featureset uncertainty model where each item transaction could have been generated by various featuresets with different probabilities. We describe a novel approach to transform item transactions into uncertain transactions over featuresets and estimate their probabilities using a constrained least-squares approach. We propose diverse algorithms to mine frequent featuresets. Our experimental evaluation provides a comparative analysis of the different approaches proposed.
    No preview · Article · Nov 2014 · Proceedings of the VLDB Endowment

  • No preview · Article · Nov 2014
  • Source
    ABSTRACT: In this paper, we introduce a novel, general-purpose technique for faster sampling of nodes over an online social network. Specifically, unlike traditional random walks, which wait for the convergence of the sampling distribution to a predetermined target distribution - a waiting process that incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much shorter random walk and then proactively estimates the sampling probability of the node reached, before using acceptance-rejection sampling to adjust the sampling probability to the predetermined target distribution. We present a novel backward random walk technique which provides provably unbiased estimations of the sampling probability, and demonstrate the superiority of WALK-ESTIMATE over traditional random walks through theoretical analysis and extensive experiments over real-world online social networks.
    Full-text · Article · Oct 2014
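The acceptance-rejection step can be sketched generically; the discrete proposal/target pair below is a hypothetical illustration, not the paper's walk-specific probability estimator:

```python
import random

def accept_reject(sample_proposal, p_proposal, p_target, c, rng=random):
    """Draw from the target distribution by filtering proposal draws:
    accept x with probability p_target(x) / (c * p_proposal(x)), where
    c upper-bounds the ratio p_target / p_proposal."""
    while True:
        x = sample_proposal()
        if rng.random() < p_target(x) / (c * p_proposal(x)):
            return x

# Hypothetical example: uniform proposal over {0, 1, 2}, skewed target.
# c = max ratio = 0.5 / (1/3) = 1.5.
random.seed(1)
target = {0: 0.5, 1: 0.25, 2: 0.25}
draws = [accept_reject(lambda: random.randrange(3),
                       lambda x: 1 / 3,
                       lambda x: target[x],
                       c=1.5)
         for _ in range(3000)]
```

The accepted draws follow the target distribution exactly, at the cost of some rejected proposals; WALK-ESTIMATE applies the same correction with estimated sampling probabilities in place of the exact ones.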
  •
    ABSTRACT: The prevalence of social media has sparked novel advertising models, vastly different from the traditional keyword-based bidding model adopted by search engines. One such model is topic-based advertising, popular with micro-blogging sites. Instead of bidding on keywords, the approach is based on bidding on topics, with the winning bid allowed to disseminate messages to users interested in the specific topic. Naturally, topics have varying costs depending on multiple factors (e.g., how popular or prevalent they are). Similarly, users in a micro-blogging site have diverse interests. Assuming one wishes to disseminate a message to a set V of users interested in a specific topic, a natural question is whether it is possible to disseminate the same message by bidding on a set of topics that collectively reach the same users in V, albeit at a cheaper cost. In this paper, we show how an alternative set of topics R with a lower cost can be identified to target (most) users in V. Two approximation algorithms with strong bounds are presented to address the problem. Theoretical analysis and extensive quantitative and qualitative experiments over real-world data sets at realistic scale, containing millions of users and topics, demonstrate the effectiveness of our approach.
    No preview · Conference Paper · Oct 2014
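A standard greedy weighted-cover heuristic conveys the flavor of the problem (a sketch, not the paper's approximation algorithms with proven bounds): repeatedly pick the topic that covers the most still-untargeted users per unit cost.

```python
def greedy_topic_cover(topics, costs, targets):
    """Greedy weighted-cover heuristic. topics maps topic -> set of users
    it reaches; costs maps topic -> bidding cost; targets is the user set
    V to reach. Returns the chosen topics and any unreachable users."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(
            (t for t in topics if topics[t] & uncovered),
            key=lambda t: len(topics[t] & uncovered) / costs[t],
            default=None)
        if best is None:  # remaining users unreachable via any topic
            break
        chosen.append(best)
        uncovered -= topics[best]
    return chosen, uncovered
```

For instance, with a cheap topic reaching two target users and an expensive one reaching three, the greedy step picks whichever offers more newly covered users per unit cost, then repeats on the remainder.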

Publication Stats

5k Citations
37.27 Total Impact Points


  • 2-2015
    • University of Texas at Arlington
      • Department of Computer Sciences & Engineering
      Arlington, Texas, United States
  • 2014
    • University of Washington Tacoma
      • Institute of Technology
      Tacoma, Washington, United States
  • 2012-2014
    • George Washington University
      • Department of Computer Science
      Washington, D.C., United States
    • Qatar Computing Research Institute
      Ad Dawḩah, Baladīyat ad Dawḩah, Qatar
  • 2006
    • Banner College - Arlington
      Arlington, Texas, United States
  • 1993-2006
    • The University of Memphis
      • Department of Mathematical Sciences
      Memphis, Tennessee, United States
  • 2000-2004
    • Microsoft
      Redmond, Washington, United States
  • 1995
    • University of São Paulo
      São Paulo, São Paulo, Brazil