Rajeev Rastogi

Google Inc., New York, New York, United States

Publications (207)

  • Minos Garofalakis, Rajeev Rastogi, Kyuseok Shim
    ABSTRACT: Sequential pattern mining under various constraints is a challenging data mining task. The paper provides a generic framework based on constraint programming to discover sequence patterns defined by constraints on local patterns (e.g., gap, regular expressions) or constraints involving combinations of local patterns, such as relevant subgroups and top-k patterns. This framework enables the user to mine both kinds of patterns in a declarative way. The solving step is done by exploiting the machinery of constraint programming. For complex patterns involving combinations of local patterns, we improve the mining step by using dynamic CSP. Finally, we present two case studies in biomedical information extraction and stylistic analysis in linguistics.
    2014 IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI 2014); 11/2014
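To give a concrete sense of the "local" constraints mentioned in the abstract above, here is a minimal illustration of a maximum-gap constraint check on a single sequence. The function name and the simple recursive search are invented for illustration; this is not the paper's constraint-programming encoding.

```python
def occurs_with_gap(pattern, sequence, max_gap):
    """Return True if `pattern` occurs as a subsequence of `sequence` with at
    most `max_gap` items skipped between consecutive matched symbols.
    Illustrative only; the paper encodes such constraints inside a CP solver."""
    def search(pi, start):
        if pi == len(pattern):
            return True
        # the first symbol may match anywhere; later symbols must respect the gap
        end = len(sequence) if pi == 0 else min(len(sequence), start + max_gap + 1)
        for i in range(start, end):
            if sequence[i] == pattern[pi] and search(pi + 1, i + 1):
                return True
        return False
    return search(0, 0)

print(occurs_with_gap(list("ac"), list("abbc"), max_gap=1))  # False: two items skipped
print(occurs_with_gap(list("ac"), list("abc"), max_gap=1))   # True: one item skipped
```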
  • ABSTRACT: The highly dynamic nature of online commenting environments makes accurate ratings prediction for new comments challenging. In such a setting, in addition to exploiting comments with high predicted ratings, it is also critical to explore comments with high uncertainty in the predictions. In this paper, we propose a novel upper confidence bound (UCB) algorithm called LOGUCB that balances exploration with exploitation when the average rating of a comment is modeled using logistic regression on its features. At the core of our LOGUCB algorithm lies a novel variance approximation technique for the Bayesian logistic regression model that is used to compute the UCB value for each comment. In experiments with a real-life comments dataset from Yahoo! News, we show that LOGUCB with bag-of-words and topic features outperforms state-of-the-art explore-exploit algorithms.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 10/2012
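As a rough sketch of the explore/exploit scoring described above: score each comment by its predicted mean rating plus an uncertainty bonus derived from a Gaussian approximation to the logistic-regression posterior. The specific variance approximation used by LOGUCB is not reproduced here; the alpha parameter and the diagonal prior below are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ucb_scores(X, mu, Sigma, alpha=1.0):
    """UCB-style score per comment: predicted rating (exploit) plus a bonus
    proportional to posterior uncertainty along the feature vector (explore).
    mu, Sigma: mean and covariance of an approximate posterior over weights."""
    mean = sigmoid(X @ mu)
    var = np.einsum('ij,jk,ik->i', X, Sigma, X)   # x^T Sigma x for each row x
    return mean + alpha * np.sqrt(var)

# toy usage: three comments, two features each; pick the highest-scoring one
X = np.array([[1.0, 0.2], [0.5, 0.9], [0.1, 0.1]])
mu, Sigma = np.zeros(2), np.eye(2)
print(int(np.argmax(ucb_scores(X, mu, Sigma))))
```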
  • ABSTRACT: Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations, with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 10/2012
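To make the importance-weighted cosine similarity above concrete, here is a small sketch. The token weights are hard-coded stand-ins for the search-based importance scores the paper computes, and title enrichment is not shown.

```python
import math
from collections import Counter

def weighted_cosine(tokens_a, tokens_b, weight):
    """Cosine similarity between two token lists, each token scaled by an
    importance weight (tokens without a weight count as 0)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] * weight.get(t, 0.0) ** 2 for t in va.keys() & vb.keys())
    na = math.sqrt(sum((c * weight.get(t, 0.0)) ** 2 for t, c in va.items()))
    nb = math.sqrt(sum((c * weight.get(t, 0.0)) ** 2 for t, c in vb.items()))
    return dot / (na * nb) if na and nb else 0.0

weights = {"ipod": 3.0, "nano": 2.5, "8gb": 1.5, "mp3": 0.5, "player": 0.3}
t1 = "apple ipod nano 8gb mp3 player".split()
t2 = "ipod nano 8gb silver".split()
print(round(weighted_cosine(t1, t2, weights), 3))
```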
  • Abhinav Mishra, Rajeev Rastogi
    ABSTRACT: In many instances, offensive comments on the internet attract a disproportionate number of positive ratings from highly biased users. This results in an undesirable scenario where these offensive comments are the top rated ones. In this paper, we develop semi-supervised learning techniques to correct the bias in user ratings of comments. Our scheme uses a small number of comment labels in conjunction with user rating information to iteratively compute user bias and unbiased ratings for unlabeled comments. We show that the running time of each iteration is linear in the number of ratings, and the system converges to a unique fixed point. To select the comments to label, we devise an active learning algorithm based on empirical risk minimization. Our active learning method incrementally updates the risk for neighboring comments each time a comment is labeled, and thus can easily scale to large comment datasets. On real-life comments from Yahoo! News, our semi-supervised and active learning algorithms achieve higher accuracy than simple baselines, with few labeled examples.
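A minimal sketch of the iterative idea described above, under simplifying assumptions (additive per-user bias, plain averaging, labeled comments pinned to their labels). The paper's actual update rules, convergence argument, and linear-time implementation differ; this loop is unoptimized.

```python
def debias(ratings, labels, iters=20):
    """ratings: list of (user, comment, rating in [0, 1]);
    labels: dict comment -> trusted rating for the few labeled comments.
    Alternately re-estimate per-user bias and bias-corrected comment scores."""
    users = {u for u, _, _ in ratings}
    comments = {c for _, c, _ in ratings}
    score = {c: labels.get(c, 0.5) for c in comments}
    bias = {u: 0.0 for u in users}
    for _ in range(iters):
        for u in users:   # user bias = mean gap between the user's ratings and current scores
            gaps = [r - score[c] for uu, c, r in ratings if uu == u]
            bias[u] = sum(gaps) / len(gaps)
        for c in comments:  # unlabeled comment score = mean bias-corrected rating
            if c not in labels:
                adj = [r - bias[u] for u, cc, r in ratings if cc == c]
                score[c] = sum(adj) / len(adj)
    return score, bias

ratings = [("u1", "c1", 1.0), ("u1", "c2", 1.0), ("u2", "c2", 0.2)]
print(debias(ratings, labels={"c1": 0.1}))
```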
  • ABSTRACT: In this paper, we study the problem of efficiently computing multiple aggregation queries over a data stream. In order to share computation, prior proposals have suggested instantiating certain intermediate aggregates which are then used to generate the final answers for input queries. In this work, we make a number of important contributions aimed at improving the execution and generation of query plans containing intermediate aggregates. These include: (1) a different hashing model, which has low eviction rates, and also allows us to accurately estimate the number of evictions, (2) a comprehensive query execution cost model based on these estimates, (3) an efficient greedy heuristic for constructing good low-cost query plans, (4) provably near-optimal and optimal algorithms for allocating the available memory to aggregates in the query plan when the input data distribution is Zipf-like and Uniform, respectively, and (5) a detailed performance study with real-life IP flow data sets, which shows that our multiple aggregates computation techniques consistently outperform the best-known approach.
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on; 05/2011
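The intermediate-aggregate idea mentioned above, in its simplest form: answer two coarse queries from one finer-grained aggregate instead of maintaining both directly. This sketch ignores the hashing/eviction model and the cost-based plan construction that are the paper's main contributions; the field names are invented.

```python
from collections import defaultdict

# Queries: total bytes per srcIP and total bytes per dstIP over an IP-flow stream.
# Shared plan: maintain one intermediate aggregate keyed on (srcIP, dstIP),
# then roll it up into both final answers.
intermediate = defaultdict(int)

def process(flow):
    intermediate[(flow["src"], flow["dst"])] += flow["bytes"]

def answers():
    by_src, by_dst = defaultdict(int), defaultdict(int)
    for (src, dst), b in intermediate.items():
        by_src[src] += b
        by_dst[dst] += b
    return dict(by_src), dict(by_dst)

for f in [{"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 100},
          {"src": "10.0.0.1", "dst": "10.0.0.3", "bytes": 50}]:
    process(f)
print(answers())
```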
  • ABSTRACT: Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called the Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) all words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.
    Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011; 01/2011
  • ABSTRACT: In this paper, we consider the problem of extracting structured data from web pages, taking into account both the content of individual attributes and the structure of pages and sites. We use Markov Logic Networks (MLNs) to capture both content and structural features in a single unified framework, and this enables us to perform more accurate inference. We show that inference in our information extraction scenario reduces to solving an instance of the maximum weight subgraph problem. We develop specialized procedures for solving the maximum weight subgraph variants that are far more efficient than previously proposed inference methods for MLNs that solve variants of MAX-SAT. Experiments with real-life datasets demonstrate the effectiveness of our approach.
    Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011; 01/2011
  • ABSTRACT: Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
    Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany; 01/2011
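Since Vertex's wrappers are XPath-based extraction rules, a tiny example of applying such a rule may help. The HTML snippet, attribute names, and XPath expressions below are invented for illustration and are not Vertex's learned rules; the example only requires the lxml package.

```python
from lxml import html

page = """
<html><body>
  <div class="product">
    <h1 id="title">Acme Phone X</h1>
    <span class="price">$199.99</span>
  </div>
</body></html>
"""

# A wrapper here is a set of XPath rules, one per attribute of the record.
rules = {
    "title": "//div[@class='product']/h1/text()",
    "price": "//div[@class='product']/span[@class='price']/text()",
}

tree = html.fromstring(page)
record = {attr: (tree.xpath(xp) or [None])[0] for attr, xp in rules.items()}
print(record)  # {'title': 'Acme Phone X', 'price': '$199.99'}
```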
  • ABSTRACT: We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
    09/2010; 3:578-587. DOI:10.1145/1772690.1772826
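To ground the idea above, here is a stripped-down sketch of matching seed attribute values against fixed positions within a new site's pages. The paper's Apriori-style algorithm enumerates multi-attribute position configurations, which this per-attribute counting does not attempt; all data below is invented.

```python
from collections import defaultdict

# Seed records extracted from earlier sites: values we try to re-find on a new site.
seed = {"brand": {"acme", "globex"}, "price": {"$199.99", "$24.50"}}

# Each page of the new template-based site, flattened to (position, text) pairs,
# where "position" stands in for a path in the page template (e.g., an XPath).
pages = [
    [("p1", "acme"), ("p2", "$199.99"), ("p3", "free shipping")],
    [("p1", "globex"), ("p2", "$24.50"), ("p3", "in stock")],
]

# Count, per attribute, how many pages carry a seed value at each position;
# positions with enough support become candidate extraction rules for the site.
support = defaultdict(lambda: defaultdict(int))
for page in pages:
    for pos, text in page:
        for attr, values in seed.items():
            if text.lower() in values:
                support[attr][pos] += 1

min_support = 2
rules = {attr: [p for p, c in pos_counts.items() if c >= min_support]
         for attr, pos_counts in support.items()}
print(rules)  # {'brand': ['p1'], 'price': ['p2']}
```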
  • ABSTRACT: Long-distance multi-hop wireless networks have been used in recent years to provide connectivity to rural areas. The salient features of such networks include TDMA channel access, nodes with multiple radios, and point-to-point long-distance wireless links established using high-gain directional antennas mounted on high towers. It has been demonstrated previously that in such network architectures, nodes can transmit concurrently on multiple radios, as well as receive concurrently on multiple radios. However, concurrent transmission on one radio and reception on another radio causes interference. Under this scheduling constraint, given a set of source-destination demand rates, we consider the problem of satisfying the maximum fraction of each demand (also called the maximum concurrent flow problem). We give a novel joint routing and scheduling scheme for this problem, based on linear programming and graph coloring. We analyze our algorithm theoretically and prove that at least 50% of a satisfiable set of demands is satisfied by our algorithm for most practical networks (with maximum node degree at most 5).
    INFOCOM, 2010 Proceedings IEEE; 04/2010
  • ABSTRACT: In this paper, we develop a framework for achieving scalable and communication-efficient dissemination of content in pub/sub systems. To maximize communication sharing across subscriptions, our routing framework groups subscriptions based on similarity, and transmits content matching one or more subscriptions in a group over a single dissemination tree for the group. We develop a cost model that uses published content samples in conjunction with the knowledge of consumer subscriptions to estimate the communication cost of a set of routing trees for subscription groups. The problem of computing a communication-optimal set of routing trees is then formulated as an optimization problem that seeks to find trees with the minimum cost. It turns out that the problem of computing a minimum-cost tree for a subscription group is a new generalization of the well-known Steiner tree problem, and an interesting problem in its own right. We develop an approximation algorithm that uses low-stretch spanning trees to compute a tree whose communication cost is within a polylogarithmic factor of the optimum. We use this to compute trees for various subscription-grouping configurations generated using a greedy clustering strategy, and select the one with the lowest cost. Our experimental study demonstrates the effectiveness of our content-aware routing approach compared to traditional routing based on content-oblivious spanning trees.
    INFOCOM 2009, IEEE; 05/2009
  • ABSTRACT: Randomized techniques, based on computing small “sketch” synopses for each stream, have recently been shown to be a very effective tool for approximating the result of a single SQL query over streaming data tuples. In this paper, we investigate the problems arising when data-stream sketches are used to process multiple such queries concurrently. We demonstrate that, in the presence of multiple query expressions, intelligently sharing sketches among concurrent query evaluations can result in substantial improvements in the utilization of the available sketching space and the quality of the resulting approximation error guarantees. We provide necessary and sufficient conditions for multi-query sketch sharing that guarantee the correctness of the result-estimation process. We also investigate the difficult optimization problem of determining sketch-sharing configurations that are optimal (e.g., under a certain error metric for a given amount of space). We prove that optimal sketch sharing typically gives rise to NP-hard questions, and we propose novel heuristic algorithms for finding good sketch-sharing configurations in practice. Results from our experimental study with queries from the TPC-H benchmark verify the effectiveness of our approach, clearly demonstrating the benefits of our sketch-sharing methodology.
    Information Systems 04/2009; 34:209-230. DOI:10.1016/j.is.2008.06.002 · 1.24 Impact Factor
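The abstract above takes "sketch" synopses as given; for readers unfamiliar with them, here is a minimal AMS-style sketch estimating a self-join size, just to make "sketching space" concrete. Multi-query sharing and space allocation, the paper's actual subject, are not shown, and the hash-based +/-1 function is a stand-in for the 4-wise independent hash families used in practice.

```python
import hashlib

def xi(seed: int, item: str) -> int:
    """Pseudo-random +/-1 value for (seed, item); a hash-based stand-in for
    the 4-wise independent hash families used in real AMS sketches."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return 1 if digest[0] & 1 else -1

class AMSSketch:
    """Each counter holds sum_i f_i * xi_i; its square is an unbiased estimate
    of sum_i f_i^2 (the self-join size). More counters = more sketching space
    and lower variance of the averaged estimate."""
    def __init__(self, copies: int = 64):
        self.seeds = list(range(copies))
        self.counters = [0] * copies

    def update(self, item: str, count: int = 1) -> None:
        for j, s in enumerate(self.seeds):
            self.counters[j] += count * xi(s, item)

    def self_join_size(self) -> float:
        return sum(c * c for c in self.counters) / len(self.counters)

sk = AMSSketch()
for item, freq in [("a", 3), ("b", 2), ("c", 1)]:
    sk.update(item, freq)
print(sk.self_join_size())  # estimates 3*3 + 2*2 + 1*1 = 14
```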
  • Yigal Bejerano, Rajeev Rastogi
    ABSTRACT: This chapter introduces a greedy approach for delay monitoring of IP networks. It proposes a two-phase monitoring scheme that ensures complete coverage of the network from both the link and path points of view, while minimizing the monitoring overhead on the underlying production network. In the first phase, it computes the locations of monitoring stations such that all network links or paths are covered by the minimal number of stations. In the second phase, it computes the minimal set of probe messages to be sent by each station such that the latency of every routing path can be measured. Unfortunately, both the station selection and the probe assignment problems are NP-hard. However, by using greedy approximation algorithms, the scheme finds solutions close to the best possible approximations for both problems. The experimental results demonstrate the effectiveness of the presented algorithms, accurately monitoring large networks with very few monitoring stations and a number of probe messages close to the number of network links.
    Greedy Algorithms, 11/2008; ISBN: 978-953-7619-27-5
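The first phase described above is a set-cover-style problem solved greedily. The sketch below shows only the standard greedy rule (pick the candidate station covering the most still-uncovered links) on toy data; the coverage sets are invented, and the chapter's probe-assignment phase is not shown.

```python
def greedy_stations(coverage, universe):
    """coverage: candidate node -> set of links (or paths) it could monitor;
    universe: all links that must be covered.
    Standard greedy set cover, which gives a logarithmic approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda v: len(coverage[v] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            raise ValueError("remaining links cannot be covered")
        chosen.append(best)
        uncovered -= gain
    return chosen

coverage = {
    "a": {"l1", "l2", "l3"},
    "b": {"l3", "l4"},
    "c": {"l4", "l5"},
}
print(greedy_stations(coverage, {"l1", "l2", "l3", "l4", "l5"}))  # ['a', 'c']
```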
  • K.V.M. Naidu, Debmalya Panigrahi, Rajeev Rastogi
    ABSTRACT: In this paper, we propose new "low-overhead" network monitoring techniques to detect violations of path-level QoS guarantees like end-to-end delay, loss, etc. Unlike existing path monitoring schemes, our approach does not calculate QoS parameters for all paths. Instead, it monitors QoS values for only a few paths, and exploits the fact that path anomalies are rare and anomalous states are well separated from normal operation, to rule out path QoS violations in most situations. We propose a heuristic to select a small subset of network paths to monitor while ensuring that no QoS violations are missed. Experiments with an ISP topology from the Rocketfuel data set show that our heuristic can deliver almost a 50% decrease in monitoring overhead compared to previous schemes.
    INFOCOM 2008. The 27th Conference on Computer Communications. IEEE; 05/2008
  • ABSTRACT: In this paper, we present a new channel allocation scheme for IEEE 802.11-based mesh networks with point-to-point links, designed for rural areas. Our channel allocation scheme allows continuous full-duplex data transfer on every link in the network. Moreover, we do not require any synchronization across the links, as the channel assignment prevents cross-link interference. Our approach is simple. We consider any link in the network as made up of two directed edges. To each directed edge at a node, we assign a non-interfering IEEE 802.11 channel so that the set of channels assigned to the outgoing edges is disjoint from the channels assigned to the incoming edges. Evaluation of this scheme in a testbed demonstrates throughput gains of 50-100%, and significantly lower end-to-end delays, over existing link scheduling/channel allocation protocols (such as 2P [11]) designed for point-to-point mesh networks. Formally speaking, this channel allocation scheme is equivalent to an edge-coloring problem that we call the directed edge coloring (DEC) problem. We establish a relationship between this coloring problem and the classical vertex coloring problem, and thus show that this problem is NP-hard. More precisely, we give an algorithm that, given a $k$-vertex coloring of a graph, can directed-edge-color it using $\xi(k)$ colors, where $\xi(k)$ is the smallest integer $n$ such that $\binom{n}{\lfloor n/2 \rfloor} \geq k$.
    INFOCOM 2008. The 27th Conference on Computer Communications. IEEE; 05/2008
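One standard way to realize the bound stated above (an assumption about the intended construction, not quoted from the paper): map each vertex-color class to a distinct floor(n/2)-element subset of n channels, and give the directed edge (u, v) a channel that u's subset contains and v's subset lacks, so outgoing and incoming channels at every node are disjoint. The sketch assumes a proper vertex coloring is supplied and abstracts away channel non-interference.

```python
from itertools import combinations
from math import comb

def dec_channels(edges, vcolor):
    """Directed edge coloring built from a proper vertex coloring `vcolor`.
    Each color class gets a distinct floor(n/2)-subset of n channels; the
    edge (u, v) uses a channel in S(u) \\ S(v), which is nonempty because
    equal-size distinct subsets never contain each other. Outgoing channels
    at a node lie inside its subset, incoming channels outside it."""
    k = len(set(vcolor.values()))
    n = 1
    while comb(n, n // 2) < k:          # xi(k): smallest n with C(n, floor(n/2)) >= k
        n += 1
    subsets = list(combinations(range(n), n // 2))[:k]
    S = {c: set(s) for c, s in zip(sorted(set(vcolor.values())), subsets)}
    return {(u, v): min(S[vcolor[u]] - S[vcolor[v]]) for u, v in edges}

# toy 3-node path a-b-c, properly vertex-colored with 2 colors
vcolor = {"a": 0, "b": 1, "c": 0}
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
print(dec_channels(edges, vcolor))
```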
  • ABSTRACT: Wireless mesh networks based on IEEE 802.11 WiFi equipment have recently been proposed as an inexpensive approach to connect far-flung rural areas. Such networks are built using high-gain directional antennas that can establish long-distance wireless point-to-point links. Some nodes in the network (called gateway nodes) are directly connected to the wired internet, and the remaining nodes connect to the gateway(s) using one or more hops. The dominant cost of constructing such a mesh network is the cost of constructing antenna towers at nodes. The cost of a tower depends on its height, which in turn depends on the length of its links and the physical obstructions along those links. We investigate the problem of selecting which links should be established such that all nodes are connected, while the cost of constructing the antenna towers required to establish the selected links is minimized. We show that this problem is NP-hard and that a better than O(log n) approximation cannot be expected, where n is the number of vertices in the graph. We then present the first algorithm in the literature for this problem with provable performance bounds. More precisely, we present a greedy algorithm that is an O(log n) approximation algorithm for this problem. Finally, through simulations, we compare our approximation algorithm with both the optimal solution and a naive heuristic.
    INFOCOM 2008. The 27th Conference on Computer Communications. IEEE; 05/2008
  • ABSTRACT: Cisco's NetFlow collector (NFC) is a powerful example of a real-world product that supports multiple aggregate queries over a continuous stream of IP flow records. NFC enables a plethora of network management tasks like traffic demand estimation, application traffic profiling, etc. In this paper, we investigate two computation sharing techniques for enabling streaming applications such as NFC to scale to hundreds of queries. Our first technique instantiates certain intermediate aggregates which are then used to generate the final answers for input queries. Our second technique coalesces the filter conditions of similar queries and uses the coalesced filter to pre-filter stream data input to these queries. Using these techniques, we propose a heuristic to compute a good query plan and perform extensive simulations to show that our heuristic delivers a factor of over 3 performance improvement compared to a naive approach.
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
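The filter-coalescing technique above, reduced to its essence: OR together the predicates of similar queries and use the combined predicate to discard non-matching records with a single cheap test before any per-query work. The query names, field names, and predicates are invented for illustration.

```python
# Two similar queries over an IP-flow stream, each with its own filter.
queries = {
    "http_by_src": lambda r: r["dst_port"] == 80,
    "https_by_src": lambda r: r["dst_port"] == 443,
}

def coalesced(record):
    # Hand-coalesced OR of both filters, applied once per record.
    return record["dst_port"] in (80, 443)

matched = []  # which queries each surviving record feeds (for the demo)

def process(record):
    if not coalesced(record):            # most records are rejected by this one test
        return
    for name, pred in queries.items():
        if pred(record):
            matched.append((name, record["src"]))

process({"src": "10.0.0.1", "dst_port": 53})    # dropped by the coalesced filter
process({"src": "10.0.0.1", "dst_port": 443})   # reaches only the HTTPS query
print(matched)  # [('https_by_src', '10.0.0.1')]
```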
  • ABSTRACT: Modern communication networks are vulnerable to attackers who send unsolicited messages to innocent users, wasting network resources and user time. Some examples of such attacks are spam emails, annoying tele-marketing phone calls, viral marketing in social networks, etc. Existing techniques to identify these attacks are tailored to certain specific domains (like email spam filtering), but are not applicable to a majority of other networks. We provide a generic abstraction of such attacks, called the Random Link Attack (RLA), that can be used to describe a large class of attacks in communication networks. In an RLA, the malicious user creates a set of false identities and uses them to communicate with a large, random set of innocent users. We mine the social networking graph extracted from user interactions in the communication network to find RLAs. To the best of our knowledge, this is the first attempt to conceptualize the attack definition, applicable to a variety of communication networks. In this paper, we formally define RLA and show that the problem of finding an RLA is NP-complete. We also provide two efficient heuristics to mine subgraphs satisfying the RLA property; the first (GREEDY) is based on greedy set-expansion, and the second (TRWALK) on randomized graph traversal. Our experiments with a real-life data set demonstrate the effectiveness of these algorithms.
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
  • ABSTRACT: Detecting constraint violations in large-scale distributed systems has recently attracted plenty of attention from the research community due to its varied applications (security, network monitoring, etc.). Communication efficiency of these systems is a critical concern and determines their practicality. In this paper, we introduce a new set of methods called non-zero slack schemes to implement distributed SUM queries efficiently. We show, both analytically and empirically, that these methods can lead to a considerable reduction in the amount of communication. We propose three adaptive non-zero slack schemes that adapt to changing data distributions; our best scheme is a lightweight reactive scheme that probabilistically adjusts local constraints based on the occurrence of certain events (using only a periodic probability estimation). We conduct an extensive experimental study using real-life and synthetic data sets, and show that our non-zero slack schemes incur significantly less communication overhead compared to the state-of-the-art zero-slack scheme (over a 60% savings).
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
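A bare-bones sketch of local-threshold monitoring for a distributed SUM constraint, to make the notion of slack concrete: the coordinator splits the global threshold across sites while keeping some slack in reserve, and a site communicates only when its local value crosses its local threshold. The even split and fixed slack below are illustrative assumptions, not the paper's adaptive non-zero slack schemes.

```python
# Global constraint: the SUM of local values across all sites must stay below T.
T = 100.0
sites = ["s1", "s2", "s3"]
slack = 10.0                                   # portion of T held back at the coordinator
local_threshold = {s: (T - slack) / len(sites) for s in sites}
local_value = {s: 0.0 for s in sites}

def update(site, delta):
    """Apply a local update; return True if the site must report to the coordinator."""
    local_value[site] += delta
    return local_value[site] > local_threshold[site]

print(local_threshold)          # each site may reach 30.0 before reporting
print(update("s1", 25.0))       # False: stays silent, no communication
print(update("s1", 10.0))       # True: 35.0 > 30.0, report to the coordinator
```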

Publication Stats

10k Citations
64.02 Total Impact Points

Institutions

  • 2011
    • Google Inc.
      New York, New York, United States
  • 2008–2011
    • Yahoo! Labs
      Sunnyvale, California, United States
  • 2002–2008
    • Alcatel-Lucent
      Paris, Île-de-France, France
  • 2007
    • Kent State University
      • Department of Computer Science
      Kent, Ohio, United States
  • 2006
    • Stanford University
      Palo Alto, California, United States
  • 2004
    • Cornell University
      Ithaca, New York, United States
  • 2003
    • Oregon Health and Science University
      Portland, Oregon, United States
    • Carnegie Mellon University
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 1999
    • Research Center on Scientific and Technical Information
      Algiers, Algeria
  • 1970–1999
    • University of Texas at Austin
      • Department of Computer Science
      Austin, Texas, United States
  • 1998
    • AT&T Labs
      Austin, Texas, United States