Gautam Das

George Washington University, Washington, D.C., United States

Publications (163) · 31.21 Total Impact

  • ABSTRACT: Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the \(t\) most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
    Data Mining and Knowledge Discovery 09/2015; 29(5). DOI:10.1007/s10618-015-0413-2 · 1.99 Impact Factor
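A minimal Python sketch of the filter-and-refine idea described in the abstract above (an illustration under assumptions, not the paper's implementation): codewords are taken to be the query's t most frequent q-grams, each codeword occurrence is replaced by a fresh symbol, candidates are filtered with the edit distance in the compact space, and survivors are verified with the exact edit distance. The q and t defaults and all function names are illustrative.

```python
from collections import Counter

def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def compact(s, codewords, letter):
    # query-sensitive mapping: replace each codeword occurrence with a new
    # single letter and drop everything else (reduces dimensionality)
    out, i = [], 0
    while i < len(s):
        for cw in codewords:
            if s.startswith(cw, i):
                out.append(letter[cw])
                i += len(cw)
                break
        else:
            i += 1
    return "".join(out)

def range_query(query, database, radius, q=3, t=4):
    grams = Counter(query[i:i + q] for i in range(len(query) - q + 1))
    codewords = [g for g, _ in grams.most_common(t)]                   # t most frequent codewords
    letter = {cw: chr(0x2460 + k) for k, cw in enumerate(codewords)}   # fresh symbols
    cq = compact(query, codewords, letter)
    # filter step (approximate: may miss some true matches) ...
    candidates = [s for s in database if edit_distance(cq, compact(s, codewords, letter)) <= radius]
    # ... then refine in the original string space
    return [s for s in candidates if edit_distance(query, s) <= radius]

print(range_query("ACGTACGTGGT", ["ACGTACGTGGA", "TTTTTTTTTTT"], radius=2))
```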
  • Mahashweta Das · Gautam Das

    Proceedings of the VLDB Endowment 08/2015; 8(12):2046-2047. DOI:10.14778/2824032.2824135

  • Proceedings of the VLDB Endowment 07/2015; 8(11):1142-1153. DOI:10.14778/2809974.2809977

  • ICWSM 2015; 06/2015

  • Proceedings of the VLDB Endowment 06/2015; 8(10):1106-1117. DOI:10.14778/2794367.2794379
  • Azade Nazi · Mahashweta Das · Gautam Das

    SIGMOD 2015; 05/2015
  • ABSTRACT: Location based services (LBS) have become very popular in recent years. They range from map services (e.g., Google Maps) that store geographic locations of points of interest, to online social networks (e.g., WeChat, Sina Weibo, FourSquare) that leverage user geographic locations to enable various recommendation functions. The public query interfaces of these services may be abstractly modeled as a kNN interface over a database of two dimensional points on a plane: given an arbitrary query point, the system returns the k points in the database that are nearest to the query point. In this paper we consider the problem of obtaining approximate estimates of SUM and COUNT aggregates by only querying such databases via their restrictive public interfaces. We distinguish between interfaces that return location information of the returned tuples (e.g., Google Maps), and interfaces that do not return location information (e.g., Sina Weibo). For both types of interfaces, we develop aggregate estimation algorithms that are based on novel techniques for precisely computing or approximately estimating the Voronoi cell of tuples. We discuss a comprehensive set of real-world experiments for testing our algorithms, including experiments on Google Maps, WeChat, and Sina Weibo.
    Proceedings of the VLDB Endowment 05/2015; 8(12). DOI:10.14778/2824032.2824034
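The core sampling idea above can be illustrated with a toy simulation (a sketch under simplifying assumptions, not the paper's algorithms): for a top-1 interface, a uniform query point returns tuple t with probability equal to the relative area of t's Voronoi cell, so averaging the inverse of that probability yields a COUNT estimate. Here the cell area is itself approximated by extra Monte-Carlo probes of the same interface, a crude stand-in for the paper's exact and approximate Voronoi-cell computations.

```python
import random

def knn_oracle(points, q, k=1):
    # stand-in for a restrictive public kNN interface: returns the k tuples
    # nearest to the query point (simulated here over a hidden point set)
    return sorted(points, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)[:k]

def estimate_count(points, region=(0.0, 1.0), n_outer=200, n_cell=400):
    """Estimate COUNT(*) using only the top-1 interface.  A uniform query point q
    returns the tuple t whose Voronoi cell contains q, so P(t) equals the relative
    area of that cell; averaging 1 / P(t) over uniform queries estimates COUNT.
    Here P(t) is approximated with extra Monte-Carlo probes of the same interface,
    a crude (slightly biased) stand-in for the paper's Voronoi-cell computations."""
    lo, hi = region
    total = 0.0
    for _ in range(n_outer):
        q = (random.uniform(lo, hi), random.uniform(lo, hi))
        t = knn_oracle(points, q)[0]
        hits = sum(knn_oracle(points, (random.uniform(lo, hi), random.uniform(lo, hi)))[0] == t
                   for _ in range(n_cell))
        total += n_cell / max(hits, 1)        # 1 / estimated relative Voronoi area of t
    return total / n_outer

if __name__ == "__main__":
    random.seed(7)
    hidden_db = [(random.random(), random.random()) for _ in range(50)]
    print("true COUNT =", len(hidden_db), " estimate =", round(estimate_count(hidden_db), 1))
```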
  • Zhuojie Zhou · Nan Zhang · Gautam Das
    ABSTRACT: How to enable efficient analytics over online social network data has been an increasingly important research problem. Given the sheer size of such social networks, many existing studies resort to sampling techniques that draw random nodes from an online social network through its restrictive web/API interface. Almost all of them use the exact same underlying technique of random walk - a Markov Chain Monte Carlo based method which iteratively transits from one node to a random neighbor. Random walk fits naturally with this problem because, for most online social networks, the only query we can issue through the interface is to retrieve the neighbors of a given node (i.e., there is no access to the full graph topology). A problem with random walks, however, is the "burn-in" period, which requires a large number of transitions/queries before the sampling distribution converges to the stationary distribution that enables the drawing of samples in a statistically valid manner. In this paper, we consider a novel problem of speeding up the fundamental design of random walks (i.e., reducing the number of queries required) without changing the stationary distribution they achieve - thereby enabling a more efficient "drop-in" replacement for existing sampling-based analytics techniques over online social networks. Our main idea is to leverage the history of random walks to construct a higher-order Markov chain. We develop two algorithms, Circulated Neighbors Random Walk and Groupby Neighbors Random Walk (CNRW and GNRW), and prove that, no matter what the social network topology is, CNRW and GNRW offer better efficiency than baseline random walks while achieving the same stationary distribution. We demonstrate through extensive experiments on real-world social networks and synthetic graphs the superiority of our techniques over the existing ones.
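A highly simplified sketch of the circulated-neighbors idea above (illustrative only; the paper's CNRW/GNRW constructions and their stationarity proofs are more involved): instead of drawing a neighbor with replacement at every visit, each node hands out its neighbors in a random order without replacement and reshuffles once the list is exhausted.

```python
import random
from collections import defaultdict

def simple_random_walk(graph, start, steps):
    # baseline: at every step pick a uniformly random neighbor (with replacement)
    walk = [start]
    for _ in range(steps):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

def circulated_neighbors_walk(graph, start, steps):
    # circulated-neighbors walk: each node serves its neighbors in a shuffled
    # order *without replacement*, reshuffling once all have been used
    queues = defaultdict(list)
    walk = [start]
    for _ in range(steps):
        u = walk[-1]
        if not queues[u]:
            queues[u] = random.sample(graph[u], len(graph[u]))  # fresh circulation
        walk.append(queues[u].pop())
    return walk

if __name__ == "__main__":
    random.seed(1)
    # tiny undirected graph given as adjacency lists
    g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
    print("plain:", simple_random_walk(g, 0, 15))
    print("CNRW :", circulated_neighbors_walk(g, 0, 15))
```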
  • ABSTRACT: In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science, resort to this special form of human-based computing, where crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collaborative crowdsourcing process is the aspect of successful collaboration among the workers, which, for the first time, is formalized and then optimized in this work. Our formalism considers two main collaboration-related human factors, affinity and upper critical mass, appropriately adapted from organizational science and social theories. Our contributions are (a) proposing a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, and (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing applications using Amazon Mechanical Turk, as well as synthetic data analyses on the scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.
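A toy greedy baseline in the spirit of the model above (not one of the paper's algorithms): grow a team that covers the task's skill requirements, never exceeding a size cap standing in for the upper critical mass, and break ties by affinity to the workers already picked. All names and the scoring rule are illustrative.

```python
def greedy_team(workers, required_skills, affinity, max_size):
    """workers: name -> set of skills; affinity: pairwise scores in [0, 1].
    Greedily add the worker covering the most missing skills, breaking ties by
    average affinity to the current team; stop at max_size (upper critical mass)."""
    team, missing = [], set(required_skills)
    while missing and len(team) < max_size:
        def score(w):
            gain = len(workers[w] & missing)
            aff = (sum(affinity.get(frozenset((w, t)), 0.0) for t in team) / len(team)) if team else 0.0
            return (gain, aff)
        best = max((w for w in workers if w not in team), key=score, default=None)
        if best is None or not (workers[best] & missing):
            break  # nobody adds any coverage
        team.append(best)
        missing -= workers[best]
    return team, missing  # empty `missing` means all required skills are covered

if __name__ == "__main__":
    workers = {"ana": {"translate", "edit"}, "bo": {"translate"}, "cy": {"edit", "review"}}
    affinity = {frozenset(("ana", "bo")): 0.9, frozenset(("ana", "cy")): 0.4, frozenset(("bo", "cy")): 0.2}
    print(greedy_team(workers, {"translate", "edit", "review"}, affinity, max_size=3))
```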

  • ABSTRACT: Microblogs and collaborative content sites such as Twitter and Amazon are popular among millions of users who generate huge numbers of tweets, posts, and reviews every day. Despite their popularity, these sites only provide rudimentary mechanisms for navigating them, programmatically or through a browser, such as a keyword search interface or a get-neighbors (e.g., Friends) interface. Many interesting queries cannot be directly answered by any of these interfaces, e.g., find Twitter users in Los Angeles that have tweeted the word 'diabetes' in the last year. Note that the Twitter programming interface does not allow conditions on the user's home location. In this paper, we introduce the novel problem of querying hidden attributes in microblogs and collaborative content sites by leveraging the existing search mechanisms offered by those sites. We model these data sources as heterogeneous graphs and their two key access interfaces, Local Search and Content Search, which search through keywords and neighbors respectively. We show which of these two approaches is better for which types of hidden attribute searches. We conduct experiments on Twitter, Amazon, and RateMDs to evaluate the performance of the search approaches.
  • ABSTRACT: The kNN query interface is a popular search interface for many real-world web databases. Given a user-specified query, the top-k nearest neighboring tuples (ranked by a predetermined ranking function) are returned. For example, many websites now provide social network features that recommend to a user others who share similar properties, interests, etc. Our studies of real-world websites unveil a novel yet serious privacy leakage caused by the design of such interfaces and ranking functions. Specifically, we find that many such websites feature private attributes that are only visible to a user him/herself, but not to other users (and therefore will not be visible in the query answer). Nonetheless, these websites also take into account such private attributes in the design of the ranking function. While the conventional belief might be that tuple ranks alone are not enough to reveal the private attribute values, our investigation shows that this is not the case in reality. Specifically, we define a novel problem of rank based inference, and introduce a taxonomy of the problem space according to two dimensions: (1) the type of query interfaces widely used in practice and (2) the capability of adversaries. For each subspace of the problem, we develop a novel technique which either guarantees the successful inference of private attributes, or (when such an inference is provably infeasible in the worst-case scenario) accomplishes such an inference attack for a significant portion of real-world tuples. We demonstrate the effectiveness and efficiency of our techniques through theoretical analysis and extensive experiments over real-world datasets, including successful online attacks over popular services such as Amazon Goodreads and Catch22dating.
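A deliberately tiny illustration of why ranks alone can leak a private attribute (a sketch, not the paper's inference algorithms): the hidden ranking function mixes a public distance with a private match bonus, so an adversary controlling two probe accounts that differ only in the private attribute can compare the victim's rank under each probe. All attributes and values are made up.

```python
def hidden_rank_score(query_user, candidate):
    public = abs(query_user["age"] - candidate["age"])                        # visible attribute
    private_bonus = 0 if query_user["smoker"] == candidate["smoker"] else 10  # never shown
    return public + private_bonus

def knn_interface(query_user, database, k=3):
    # the site only exposes the ordered list of candidate ids, never the scores
    return [c["id"] for c in sorted(database, key=lambda c: hidden_rank_score(query_user, c))][:k]

def infer_smoker(victim_id, database):
    # two probe accounts identical except for the private attribute
    probe_yes = {"age": 30, "smoker": True}
    probe_no = {"age": 30, "smoker": False}
    rank_yes = knn_interface(probe_yes, database, k=len(database)).index(victim_id)
    rank_no = knn_interface(probe_no, database, k=len(database)).index(victim_id)
    return rank_yes < rank_no   # victim ranks better for the probe that matches

database = [{"id": "v", "age": 31, "smoker": True},
            {"id": "a", "age": 29, "smoker": False},
            {"id": "b", "age": 35, "smoker": False}]
print("inferred smoker =", infer_smoker("v", database))  # True
```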
  • ABSTRACT: CrewScout is an expert-team finding system based on the concept of skyline teams and efficient algorithms for finding such teams. Given a set of experts, CrewScout finds all k-expert skyline teams, which are not dominated by any other k-expert teams. The dominance between teams is governed by comparing their aggregated expertise vectors. The need for finding expert teams prevails in applications such as question answering, crowdsourcing, panel selection, and project team formation. The new contributions of this paper include an end-to-end system with an interactive user interface that assists users in choosing teams and a demonstration of its application domains.
    CIKM; 11/2014
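A brute-force sketch of k-expert skyline teams as defined above (CrewScout itself relies on far more efficient algorithms): aggregate each k-subset's expertise vectors element-wise and keep the teams whose aggregated vector is not dominated by any other team's.

```python
from itertools import combinations

def dominates(u, v):
    # u dominates v if it is at least as good in every skill and strictly better in one
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def skyline_teams(experts, k):
    """experts: name -> expertise vector (tuple of skill scores).
    Brute force: aggregate every k-subset by element-wise sum, keep non-dominated teams."""
    teams = {}
    for combo in combinations(sorted(experts), k):
        teams[combo] = tuple(sum(vals) for vals in zip(*(experts[e] for e in combo)))
    return [(t, v) for t, v in teams.items()
            if not any(dominates(v2, v) for t2, v2 in teams.items() if t2 != t)]

experts = {"ann": (5, 1), "bob": (1, 5), "cal": (3, 3), "dee": (1, 1)}
for team, vec in skyline_teams(experts, k=2):
    print(team, vec)   # ('ann', 'bob') (6, 6), ('ann', 'cal') (8, 4), ('bob', 'cal') (4, 8)
```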
  • ABSTRACT: We assume a dataset of transactions generated by a set of users over structured items, where each item can be described by a set of features. In this paper, we are interested in identifying the frequent featuresets (sets of features) by mining item transactions. For example, in a news website, items correspond to news articles, the features are the named entities/topics in the articles, and an item transaction would be the set of news articles read by a user within the same session. We show that mining frequent featuresets over structured item transactions is a novel problem, and that straightforward extensions of existing frequent itemset mining techniques provide unsatisfactory results. This is due to the fact that while users are drawn to each item in the transaction by a subset of its features, the transaction by itself does not provide any information about such underlying preferred features of users. In order to overcome this hurdle, we propose a featureset uncertainty model where each item transaction could have been generated by various featuresets with different probabilities. We describe a novel approach to transform item transactions into uncertain transactions over featuresets and estimate their probabilities using a constrained least squares based approach. We propose diverse algorithms to mine frequent featuresets. Our experimental evaluation provides a comparative analysis of the different approaches proposed.
    Proceedings of the VLDB Endowment 11/2014; 8(3):257-268.
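A naive sketch of the expected-support idea behind uncertain featureset transactions (illustrative assumptions only: here every non-empty subset of an item's features is taken as equally likely to have attracted the user, whereas the paper estimates these probabilities with a constrained least squares approach).

```python
from collections import defaultdict
from itertools import combinations

# item -> its features (e.g., named entities in a news article)
item_features = {"a1": {"flu", "school"}, "a2": {"flu", "travel"}, "a3": {"school"}}

# each transaction is the set of items one user consumed in a session
transactions = [{"a1", "a2"}, {"a1", "a3"}, {"a2"}]

def expected_support(transactions, item_features, max_size=2):
    """Expected-support counting over uncertain featureset transactions, with the
    naive assumption that each non-empty feature subset of an item is equally
    likely to be the reason the user picked it."""
    support = defaultdict(float)
    for txn in transactions:
        for item in txn:
            feats = sorted(item_features[item])
            subsets = [frozenset(c) for r in range(1, max_size + 1)
                       for c in combinations(feats, r)]
            p = 1.0 / len(subsets)
            for fs in subsets:
                support[fs] += p
    return dict(support)

for fs, s in sorted(expected_support(transactions, item_features).items(), key=lambda x: -x[1]):
    print(set(fs), round(s, 2))
```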

  • ABSTRACT: In this paper, we introduce a novel, general-purpose technique for faster sampling of nodes over an online social network. Specifically, unlike traditional random walks, which wait for the convergence of the sampling distribution to a predetermined target distribution - a waiting process that incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much shorter random walk, and then proactively estimates the sampling probability of the node reached before using acceptance-rejection sampling to adjust the sampling probability to the predetermined target distribution. We present a novel backward random walk technique which provides provably unbiased estimations of the sampling probability, and demonstrate the superiority of WALK-ESTIMATE over traditional random walks through theoretical analysis and extensive experiments over real-world online social networks.
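A toy simulation of the walk-then-correct idea above (not WALK-ESTIMATE itself): run a deliberately short random walk, then use acceptance-rejection to adjust the node's sampling probability to the target distribution. In this simulation the short-walk distribution is computed exactly from the full graph; the paper's contribution is estimating that probability, provably without bias, through backward random walks over the restrictive interface.

```python
import random
from collections import Counter

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
nodes = sorted(graph)

def walk_distribution(start, steps):
    # exact node distribution of a simple random walk after `steps` hops
    # (simulation only; the paper estimates this via backward random walks)
    dist = {v: 1.0 if v == start else 0.0 for v in nodes}
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for v, p in dist.items():
            for w in graph[v]:
                nxt[w] += p / len(graph[v])
        dist = nxt
    return dist

def walk_then_accept(start, steps, target, n_samples):
    p_t = walk_distribution(start, steps)
    m = max(target[v] / p_t[v] for v in nodes if p_t[v] > 0)   # rejection constant
    out = []
    while len(out) < n_samples:
        v = start
        for _ in range(steps):
            v = random.choice(graph[v])
        if random.random() < target[v] / (m * p_t[v]):          # acceptance-rejection step
            out.append(v)
    return out

random.seed(3)
uniform = {v: 1.0 / len(nodes) for v in nodes}                   # target: uniform over nodes
samples = walk_then_accept(start=0, steps=3, target=uniform, n_samples=5000)
print(Counter(samples))  # roughly equal counts despite the very short walk
```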
  • Article: HDBTracker

  • ABSTRACT: We present IQR, a system that demonstrates optimization-based interactive relaxations for queries that return an empty answer. Given an empty answer, IQR dynamically suggests one relaxation of the original query conditions at a time to the user, based on certain optimization objectives, and the user responds by either accepting or declining the relaxation, until the user arrives at a non-empty answer, or a non-empty answer is impossible to achieve with any further relaxations. The relaxation suggestions hinge on a probabilistic framework that takes into account the probability of the user accepting a suggested relaxation, as well as how much that relaxation serves towards the optimization objective. IQR accepts a wide variety of optimization objectives - user-centric objectives, such as minimizing the number of user interactions (i.e., effort) or returning relevant results, as well as seller-centric objectives, such as maximizing profit. IQR offers principled exact and approximate solutions for generating relaxations that are demonstrated using multiple, large real datasets.
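A minimal sketch of the expected-value style relaxation choice described above (illustrative only; IQR's probabilistic framework, objectives, and optimizations are richer): each candidate relaxation is scored by its acceptance probability times its contribution to the objective, and suggestions are made one at a time until the user accepts one or none remain.

```python
def best_relaxation(candidates):
    """candidates: list of (name, p_accept, objective_gain_if_accepted).
    Pick the relaxation maximizing the expected contribution to the objective."""
    return max(candidates, key=lambda c: c[1] * c[2])

def interactive_session(candidates, user_accepts):
    # suggest one relaxation at a time until one is accepted or we run out
    remaining = list(candidates)
    while remaining:
        suggestion = best_relaxation(remaining)
        print("suggest:", suggestion[0])
        if user_accepts(suggestion[0]):
            return suggestion[0]
        remaining.remove(suggestion)
    return None

candidates = [("drop color=red", 0.7, 120), ("widen price by 10%", 0.9, 60), ("drop brand", 0.3, 400)]
print("accepted:", interactive_session(candidates, user_accepts=lambda name: name == "widen price by 10%"))
```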
  • ABSTRACT: Microblogging platforms such as Twitter have experienced a phenomenal growth of popularity in recent years, making them attractive platforms for research in diverse fields from computer science to sociology. However, most microblogging platforms impose strict access restrictions (e.g., API rate limits) that prevent scientists with limited resources - e.g., those who cannot afford the microblog-data-access subscriptions offered by vendors such as GNIP - from leveraging the wealth of microblogs for analytics. For example, Twitter allows only 180 queries per 15 minutes, and its search API only returns tweets posted within the last week. In this paper, we consider a novel problem of estimating aggregate queries over microblogs, e.g., "how many users mentioned the word 'privacy' in 2013?". We propose novel solutions exploiting the user-timeline information that is publicly available in most microblogging platforms. Theoretical analysis and extensive real-world experiments over Twitter, Google+ and Tumblr confirm the effectiveness of our proposed techniques.
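A bare-bones estimator in the spirit of the aggregate-estimation problem above (a sketch under strong assumptions, not the paper's technique): it presumes the user population can be enumerated and sampled uniformly, which real platforms do not offer and which the paper's timeline-based estimators are designed to work around. fetch_timeline is a hypothetical helper standing in for a timeline API call.

```python
import random

def estimate_total_mentions(user_ids, fetch_timeline, sample_size, keyword):
    """Draw a simple random sample of users and scale up the per-user counts to
    estimate the total number of matching posts across all users (a basic
    Horvitz-Thompson style estimate, far simpler than the paper's techniques)."""
    sample = random.sample(user_ids, sample_size)
    per_user = [sum(keyword in post.lower() for post in fetch_timeline(u)) for u in sample]
    return len(user_ids) * sum(per_user) / sample_size

if __name__ == "__main__":
    random.seed(0)
    # simulated platform: 1000 users, each with a small timeline
    timelines = {u: ["thinking about privacy" if random.random() < 0.1 else "lunch!"
                     for _ in range(20)] for u in range(1000)}
    est = estimate_total_mentions(list(timelines), timelines.get, sample_size=100, keyword="privacy")
    true = sum(sum("privacy" in p for p in posts) for posts in timelines.values())
    print("true =", true, " estimate =", round(est))
```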
  • ABSTRACT: The rise of Web 2.0 is signaled by sites such as Flickr and YouTube, and social tagging is essential to their success. A typical tagging action involves three components: user, item (e.g., photos in Flickr), and tags (i.e., words or phrases). Analyzing how tags are assigned by certain users to certain items has important implications in helping users search for desired information. In this paper, we develop a dual mining framework to explore tagging behavior. This framework is centered around two opposing measures, similarity and diversity, applied to one or more tagging components, and therefore enables a wide range of analysis scenarios such as characterizing similar users tagging diverse items with similar tags or diverse users tagging similar items with diverse tags. By adopting different concrete measures for similarity and diversity in the framework, we show that a wide range of concrete analysis problems can be defined and that they are NP-Complete in general. We design four sets of efficient algorithms for solving many of those problems and demonstrate, through comprehensive experiments over real data, that our algorithms significantly outperform the exact brute-force approach without compromising analysis result quality.
    The VLDB Journal 04/2014; 23(2). DOI:10.1007/s00778-013-0341-y · 1.57 Impact Factor
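One concrete way to instantiate the opposing similarity and diversity measures from the framework above (an illustrative choice; the framework admits many other measures): pairwise Jaccard similarity over tag sets, with a group called similar when every pair is close and diverse when every pair is far.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def group_character(tag_sets, sim_threshold=0.5, div_threshold=0.2):
    """Label a group of tagging components (users, items, or tags) as 'similar'
    if every pair is close, 'diverse' if every pair is far, else 'mixed'."""
    pairs = [jaccard(a, b) for a, b in combinations(tag_sets, 2)]
    if all(p >= sim_threshold for p in pairs):
        return "similar"
    if all(p <= div_threshold for p in pairs):
        return "diverse"
    return "mixed"

users = {"u1": {"sunset", "beach"}, "u2": {"sunset", "beach", "sea"}, "u3": {"protest", "news"}}
print(group_character([users["u1"], users["u2"]]))   # similar
print(group_character(list(users.values())))         # mixed
```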

Publication Stats

4k Citations
31.21 Total Impact Points


  • 2012-2014
    • George Washington University
      • Department of Computer Science
Washington, D.C., United States
    • Qatar Computing Research Institute
      Ad Dawḩah, Baladīyat ad Dawḩah, Qatar
  • 2-2014
    • University of Texas at Arlington
• Department of Computer Science & Engineering
      Arlington, Texas, United States
  • 2006
    • Banner College - Arlington
      Arlington, Texas, United States
  • 1993-2006
    • The University of Memphis
      • Department of Mathematical Sciences
Memphis, Tennessee, United States
  • 2000-2004
    • Microsoft
Washington, United States