G. Cormode

AT&T Labs, Austin, TX, USA

Publications (9) · 5.35 Total impact

  • Conference Proceeding: On Signatures for Communication Graphs
    ABSTRACT: Communications between individuals can be represented by (weighted, multi-) graphs. Many applications operate on communication graphs associated with telephone calls, emails, instant messages (IM), blogs, web forums, e-business relationships and so on. These applications include identifying repetitive fraudsters, message board aliases, multiusage of IP addresses, etc. Tracking electronic identities in communication networks can be achieved if we have a reliable "signature" for nodes and activities. While many examples of ad hoc signatures can be proposed for particular tasks, what is needed is a systematic study of the principles behind the usage of signatures for any task. We develop a formal framework for the use of signatures in communication graphs and identify three fundamental properties that are natural to signature schemes: persistence, uniqueness and robustness. We argue for the importance of these properties by showing how they impact a set of applications. We then explore several signature schemes - previously defined and new - in our framework and evaluate them on real data in terms of these properties. This provides insights into suitable signature schemes for desired applications. Finally, as case studies, we focus on two concrete applications in enterprise network traffic. We apply signature schemes to these problems and demonstrate their effectiveness.
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on; 05/2008
  • Conference Proceeding: Conquering the Divide: Continuous Clustering of Distributed Data Streams
    ABSTRACT: Data is often collected over a distributed network, but in many cases, is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost, still providing guaranteed accuracy of the clustering. We focus on the k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
    Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on; 05/2007
  • Conference Proceeding: What’s Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams
    ABSTRACT: Emerging applications in sensor systems and network-wide IP traffic analysis present many technical challenges. They need distributed monitoring and continuous tracking of events. They have severe resource constraints not only at each site in terms of per-update processing time and archival space for high-speed streams of observations, but also, crucially, communication constraints for collaborating on the monitoring task. These elements have been addressed in a series of recent works. A fundamental issue that arises is that one cannot make the "uniqueness" assumption on observed events which is present in previous works, since wide-scale monitoring invariably encounters the same events at different points. For example, within the network of an Internet Service Provider, packets of the same flow will be observed in different routers; similarly, the same individual will be observed by multiple mobile sensors in monitoring wild animals. Aggregates of interest on such distributed environments must be resilient to duplicate observations. We study such duplicate-resilient aggregates that measure the extent of the duplication (how many unique observations there are, how many observations are unique), as well as standard holistic aggregates such as quantiles and heavy hitters over the unique items. We present accuracy-guaranteed, highly communication-efficient algorithms for these aggregates that work within the time and space constraints of high-speed streams. We also present results of a detailed experimental study on both real-life and synthetic data.
    Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on; 05/2006
  • Article: What's new: finding significant differences in network data streams
    G. Cormode, S. Muthukrishnan
    ABSTRACT: Monitoring and analyzing network traffic usage patterns is vital for managing IP Networks. An important problem is to provide network managers with information about changes in traffic, informing them about "what's new". Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative or variational. We present novel algorithms for finding the most significant deltoids in high-speed traffic data, and prove that they use small space, very small time per update, and are guaranteed to find significant deltoids with pre-specified accuracy. In experimental evaluation with real network traffic, our algorithms perform well and recover almost all deltoids. This is the first work to provide solutions capable of working over the data with one pass, at network traffic speeds.
    IEEE/ACM Transactions on Networking 01/2006; 13(6):1219-1232. · 2.03 Impact Factor
  • Conference Proceeding: Effective computation of biased quantiles over data streams
    ABSTRACT: Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the "high-biased" and the "targeted" quantiles problems, respectively, and presenting algorithms with provable guarantees that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.
    Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on; 05/2005
  • Article: On automated lesson construction from electronic textbooks
    ABSTRACT: An electronic book may be viewed as an application with a multimedia database. We define an electronic textbook as an electronic book that is used in conjunction with instructional resources such as lectures. We propose an electronic textbook data model with topics, topic sources, metalinks (relationships among topics), and instructional modules, which are multimedia presentations possibly capturing real-life lectures of instructors. Using the data model, the system provides users a topic-guided multimedia lesson construction. We concentrate, in detail, on the use of one metalink type in lesson construction, namely, prerequisite dependencies, and provide a sound and complete axiomatization of prerequisite dependencies. We present a simple automated way of constructing lessons for users where the user lists a set of topic names (s)he is interested in, and the system automatically constructs and delivers the "best" user-tailored lesson as a multimedia presentation, where "best" is characterized in terms of both topic closures with respect to prerequisite dependencies and what the user knows about topics. We model and present sample lesson construction requests for users, discuss their complexity, and give algorithms that evaluate such requests. For expensive lesson construction requests, we list heuristics and empirically evaluate their performance. We also discuss the worst-case performance guarantees of lesson request algorithms.
    IEEE Transactions on Knowledge and Data Engineering 04/2004; · 1.66 Impact Factor
  • Article: Comparing data streams using Hamming norms (how to zero in)
    ABSTRACT: Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l_0 sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
    IEEE Transactions on Knowledge and Data Engineering 06/2003; 15(3):529-540. · 1.66 Impact Factor
  • Conference Proceeding: Electronic books in digital libraries
    ABSTRACT: An electronic book is an application with a multimedia database of instructional resources, which include hyperlinked text, instructor's audio/video clips, slides, animation, still images, etc., as well as content-based information about these data and metadata such as annotations, tags, and cross-referencing information. Electronic books on the Internet or on CDs today are not easy to learn from. We propose the use of a multimedia database of instructional resources in constructing and delivering multimedia lessons about topics in an electronic book. We introduce an electronic book data model containing (a) topic objects and (b) instructional resources, called instruction module objects, which are multimedia presentations possibly capturing real-life lectures of instructors. We use the notion of topic prerequisites for topics at different detail levels, to allow electronic book users to request/compose multimedia lessons about topics in the electronic book. We present automated construction of the “best” user-tailored lesson (as a multimedia presentation)
    Advances in Digital Libraries, 2000. ADL 2000. Proceedings. IEEE; 02/2000
  • Article: Small synopses for group-by query verification on outsourced data streams
    ABSTRACT: Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In the standard data outsourcing model, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Existing solutions proposed in this model all build upon cryptographic primitives such as signatures and collision-resistant hash functions, which only work for certain types of queries, e.g., simple selection/aggregation queries. In this paper, we consider another common type of queries, namely, "GROUP BY, SUM" queries, which previous techniques fail to support. Our new solutions are not based on cryptographic primitives, but instead use algebraic and probabilistic techniques to compute a small synopsis on the true query result, which is then communicated to the client so as to verify the correctness of the query result returned by the server. The synopsis uses a constant amount of space irrespective of the result size, has an extremely small probability of failure, and can be maintained using no extra space when the query result changes as elements stream by. We then generalize our synopsis to allow some tolerance on the number of erroneous groups, in order to support semantic load shedding on the server. When the number of erroneous groups is indeed tolerable, the synopsis can be strengthened so that we can locate and even correct these errors. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
    ACM Transactions on Database Systems Month 20YY.

Institutions

  • 2007–2008
    • AT&T Labs
      Austin, TX, USA
  • 2003–2006
    • Rutgers, The State University of New Jersey
      • Department of Mathematics & Computer Science
      New Brunswick, NJ, USA
  • 2004
    • Case Western Reserve University
      • Department of Electrical Engineering and Computer Science
      Cleveland, OH, USA