Chris Jermaine

Rice University, Houston, Texas, United States

Publications (25) · 4.35 Total impact

  • Supriya Nirkhiwale, Alin Dobra, Chris Jermaine
    ABSTRACT: As of 2005, sampling has been incorporated in all major database systems. While efficient sampling techniques are realizable, determining the accuracy of an estimate obtained from the sample is still an unresolved problem. In this paper, we present a theoretical framework that allows an elegant treatment of the problem. We base our work on generalized uniform sampling (GUS), a class of sampling methods that subsumes a wide variety of sampling techniques. We introduce a key notion of equivalence that allows GUS sampling operators to commute with selection and join, and derivation of confidence intervals. We illustrate the theory through extensive examples and give indications on how to use it to provide meaningful estimations in database systems.
    06/2013; DOI:10.14778/2556549.2556563
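    The following is a minimal sketch (not the GUS framework itself) of the kind of estimate-plus-confidence-interval computation the paper formalizes: a Horvitz-Thompson estimate of SUM over a Bernoulli (uniform) sample with a normal-approximation interval. The table and sampling fraction are illustrative.

      import math
      import random

      def estimate_sum_with_ci(table, sample_fraction=0.01, z=1.96):
          """Estimate SUM(x) over `table` from a Bernoulli (uniform) sample.

          Each row is kept independently with probability `sample_fraction`;
          the Horvitz-Thompson estimator scales each sampled value by the
          inverse inclusion probability, and a normal-approximation
          confidence interval is attached to the estimate.
          """
          p = sample_fraction
          sample = [x for x in table if random.random() < p]
          estimate = sum(x / p for x in sample)
          # Unbiased estimate of Var = sum_i x_i^2 * (1 - p) / p under
          # Bernoulli sampling, computed from the sample itself.
          var_hat = sum((x * x) * (1 - p) / (p * p) for x in sample)
          half_width = z * math.sqrt(var_hat)
          return estimate, (estimate - half_width, estimate + half_width)

      # illustrative use on synthetic data:
      # data = [random.gauss(100.0, 20.0) for _ in range(100_000)]
      # est, (lo, hi) = estimate_sum_with_ci(data, sample_fraction=0.01)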
  • ABSTRACT: Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs between the different synopsis types.
    Foundations and Trends in Databases 01/2012; 4(1–3):1-294. DOI:10.1561/1900000004
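    As a concrete example of one of the four synopsis families covered by the survey (sketches), here is a minimal Count-Min sketch for approximate frequency counts; the width/depth parameters and hash choice are illustrative and not taken from the survey.

      import hashlib

      class CountMinSketch:
          """Minimal Count-Min sketch: approximate frequency counts in small
          space, with one-sided (over-)estimation error."""

          def __init__(self, width=2048, depth=5):
              self.width, self.depth = width, depth
              self.counts = [[0] * width for _ in range(depth)]

          def _buckets(self, item):
              for row in range(self.depth):
                  h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
                  yield row, int.from_bytes(h.digest(), "big") % self.width

          def add(self, item, count=1):
              for row, col in self._buckets(item):
                  self.counts[row][col] += count

          def estimate(self, item):
              # The true count never exceeds the minimum over the rows.
              return min(self.counts[row][col] for row, col in self._buckets(item))

      # cms = CountMinSketch(); cms.add("x"); cms.estimate("x")  # >= true count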
  • ABSTRACT: Virtually all work on topic modeling has assumed that the topics are to be learned over a text-based document corpus. However, there exist important applications where topic models must be learned over an audio corpus of spoken language. Unfortunately, speech-to-text programs can have very low accuracy. We therefore propose a novel topic model for spoken language that incorporates a statistical model of speech-to-text software behavior. Crucially, our model exploits the uncertainty numbers returned by the software. Our ideas apply to any domain in which it would be useful to build a topic model over data in which uncertainties are explicitly represented.
    Data Mining (ICDM), 2012 IEEE 12th International Conference on; 01/2012
  • ABSTRACT: The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses. In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
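    A toy illustration (in Python rather than SQL, and nothing like MCDB's tuple-bundle execution engine) of the query semantics described above: an attribute governed by a stochastic model is instantiated many times, and the aggregate is computed per Monte Carlo instance to yield a distribution over query answers. The table and model here are invented.

      import random
      import statistics

      # Hypothetical "random relation": revenue is either observed or given
      # by a stochastic model (here, a normal imputation model).
      rows = [
          {"region": "US", "revenue": 120.0},
          {"region": "US", "revenue": ("normal", 100.0, 15.0)},  # modeled value
          {"region": "EU", "revenue": ("normal", 80.0, 10.0)},   # modeled value
      ]

      def instantiate(value):
          if isinstance(value, tuple) and value[0] == "normal":
              return random.gauss(value[1], value[2])
          return value

      def monte_carlo_total(rows, n_trials=1000):
          """Empirical distribution of SELECT SUM(revenue) over Monte Carlo
          instances of the random relation."""
          return [sum(instantiate(r["revenue"]) for r in rows) for _ in range(n_trials)]

      totals = monte_carlo_total(rows)
      print("mean", statistics.mean(totals), "stdev", statistics.stdev(totals))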
  • ABSTRACT: Given a spatial dataset placed on an n × n grid, our goal is to find the rectangular regions within which subsets of the dataset exhibit anomalous behavior. We develop algorithms that, given any user-supplied arbitrary likelihood function, conduct a likelihood ratio hypothesis test (LRT) over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed this process, we develop methods to prune rectangles without computing their associated LRT statistics.
    ACM Transactions on Knowledge Discovery from Data 10/2010; 4:20. DOI:10.1145/1857947.1857952 · 1.15 Impact Factor
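    A minimal brute-force version of the scan described above, with the pruning left out: every axis-aligned rectangle on the grid is scored with a likelihood ratio statistic (a simple Poisson model stands in for the user-supplied likelihood) and the top-k rectangles are returned.

      import math

      def poisson_lrt(c_in, n_in, c_out, n_out):
          """Log-likelihood-ratio score under a simple Poisson model: is the
          rate inside the rectangle different from the rate outside? (Just
          one possible choice of the user-supplied likelihood.)"""
          def ll(c, n):
              if c == 0 or n == 0:
                  return 0.0
              rate = c / n
              return c * math.log(rate) - rate * n
          return ll(c_in, n_in) + ll(c_out, n_out) - ll(c_in + c_out, n_in + n_out)

      def top_rectangles(grid, k=5, lrt=poisson_lrt):
          """Score every axis-aligned rectangle on an n x n grid of counts and
          return the k highest-scoring ones. Brute force, so only practical
          for small grids; the paper's pruning techniques are omitted."""
          n = len(grid)
          total, cells = sum(map(sum, grid)), n * n
          scored = []
          for r1 in range(n):
              for r2 in range(r1, n):
                  for c1 in range(n):
                      for c2 in range(c1, n):
                          c_in = sum(grid[r][c] for r in range(r1, r2 + 1)
                                                for c in range(c1, c2 + 1))
                          n_in = (r2 - r1 + 1) * (c2 - c1 + 1)
                          score = lrt(c_in, n_in, total - c_in, cells - n_in)
                          scored.append((score, (r1, c1, r2, c2)))
          return sorted(scored, key=lambda t: t[0], reverse=True)[:k]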
  • ABSTRACT: We consider the problem of similarity search in applications where the cost of computing the similarity between two records is very expensive, and the similarity measure is not a metric. In such applications, comparing even a tiny fraction of the database records to a single query record can be orders of magnitude slower than reading the entire database from disk, and indexing is often not possible. We develop a general-purpose, statistical framework for answering top-k queries in such databases, when the database administrator is able to supply an inexpensive surrogate ranking function that substitutes for the actual similarity measure. We develop a robust method that learns the relationship between the surrogate function and the similarity measure. Given a query, we use Bayesian statistics to update the model by taking into account the observed partial results. Using the updated model, we construct bounds on the accuracy of the result set obtained via the surrogate ranking. Our experiments show that our models can produce useful bounds for several real-life applications.
    Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA; 01/2010
  • Alin Dobra, Chris Jermaine, Florin Rusu, Fei Xu
    ABSTRACT: DBO is a database system that utilizes randomized algorithms to give statistically meaningful estimates for the final answer to a multi-table, disk-based query from start to finish during query execution. However, DBO's "time 'til utility" (or "TTU"; that is, the time until DBO can give a useful estimate) can be overly large, particularly in the case that many database tables are joined in a query, or in the case that a join query includes a very selective predicate on one or more of the tables, or when the data are skewed. In this paper, we describe Turbo DBO, which is a prototype database system that can answer multi-table join queries in a scalable fashion, just like DBO. However, Turbo DBO often has a much lower TTU than DBO. The key innovation of Turbo DBO is that it makes use of novel algorithms that look for and remember "partial match" tuples in a randomized fashion. These are tuples that satisfy some of the boolean predicates associated with the query, and can possibly be grown into tuples that actually contribute to the final query result at a later time.
    08/2009; 2:419-430. DOI:10.14778/1687627.1687675
  • Mingxi Wu, Chris Jermaine
    ABSTRACT: For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, Top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.
    The VLDB Journal 04/2009; 18(2):571-597. DOI:10.1007/s00778-009-0133-6 · 1.70 Impact Factor
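    A toy sketch of the general idea only (not the paper's model): assume, purely for illustration, that the values are i.i.d. normal, draw from the standard noninformative posterior over the unknown mean and variance, and simulate the maximum of the full data set for each posterior draw to obtain a predictive distribution for the extreme value.

      import numpy as np

      def posterior_max_samples(sample, population_size, n_draws=200, seed=0):
          """Draws from a predictive distribution for the maximum of a data
          set of `population_size` values, given only a small uniform sample
          and an (illustrative) assumption that values are i.i.d. normal.
          Uses the standard noninformative conjugate posterior for the
          normal mean and variance."""
          rng = np.random.default_rng(seed)
          x = np.asarray(sample, dtype=float)
          n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
          maxes = np.empty(n_draws)
          for i in range(n_draws):
              sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)  # sigma^2 | data
              mu = rng.normal(xbar, np.sqrt(sigma2 / n))    # mu | sigma^2, data
              maxes[i] = rng.normal(mu, np.sqrt(sigma2), population_size).max()
          return maxes

      # e.g. a 95% predictive interval for the data set's maximum:
      # np.quantile(posterior_max_samples(my_sample, 100_000), [0.025, 0.975])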
  • ABSTRACT: Given a spatial data set placed on an n × n grid, our goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. We develop algorithms that, given any user-supplied arbitrary likelihood function, conduct a likelihood ratio hypothesis test (LRT) over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed this process, we develop methods to prune rectangles without computing their associated LRT statistics.
    Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009; 01/2009
  • ABSTRACT: This paper describes query processing in the DBO database system. Like other database systems designed for ad-hoc, analytic processing, DBO is able to compute the exact answer to queries over a large relational database in a scalable fashion. Unlike any other system designed for analytic processing, DBO can constantly maintain a guess as to the final answer to an aggregate query throughout execution, along with statistically meaningful bounds for the guess's accuracy. As DBO gathers more and more information, the guess gets more and more accurate, until it is 100% accurate as the query is completed. This allows users to stop the execution at any time that they are happy with the query accuracy, and encourages exploratory data analysis.
    ACM Transactions on Database Systems 11/2008; 33(4). DOI:10.1145/1247480.1247560 · 0.75 Impact Factor
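    A sketch of the single-table version of this behavior (DBO's estimators for multi-table joins are far more involved): scan the data in random order and periodically report a running estimate of SUM together with a CLT-style confidence bound that shrinks to zero as the scan completes.

      import math
      import random

      def online_sum(values, report_every=10_000, z=1.96):
          """Scan `values` in random order and periodically yield a running
          estimate of SUM together with a CLT-style confidence bound that
          shrinks to zero as the scan completes."""
          random.shuffle(values)  # the scan order must be statistically random
          n = len(values)
          seen = 0
          s = sq = 0.0
          for x in values:
              seen += 1
              s += x
              sq += x * x
              if seen % report_every == 0 or seen == n:
                  mean = s / seen
                  var = max(sq / seen - mean * mean, 0.0)
                  estimate = n * mean
                  # Finite-population correction: the bound hits 0 at completion.
                  half = z * n * math.sqrt(var / seen) * math.sqrt((n - seen) / max(n - 1, 1))
                  yield seen, estimate, half

      # for seen, est, half in online_sum(list_of_values):
      #     print(seen, "rows:", est, "+/-", half)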
  • Conference Paper: The DBO database system.
    ABSTRACT: We demonstrate our prototype of the DBO database system. DBO is designed to facilitate scalable analytic processing over large data archives. DBO's analytic processing performance is competitive with other database systems; however, unlike any other existing research or industrial system, DBO maintains a statistically meaningful guess to the final answer to a query from start to finish during query processing. This guess may be quite accurate after only a few seconds or minutes, while answering a query exactly may take hours. This can result in significant savings in both user and computer time, since a user can abort a query as soon as he or she is happy with the guess' accuracy.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008; 01/2008
  • Xiuyao Song, Chris Jermaine, Sanjay Ranka, John Gums
    ABSTRACT: Classic mixture models assume that the prevalence of the various mixture components is fixed and does not vary over time. This presents problems for applications where the goal is to learn how complex data distributions evolve. We develop models and Bayesian learning algorithms for inferring the temporal trends of the components in a mixture model as a function of time. We show the utility of our models by applying them to the real-life problem of tracking changes in the rates of antibiotic resistance in Escherichia coli and Staphylococcus aureus. The results show that our methods can derive meaningful temporal antibiotic resistance patterns.
    Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008; 01/2008
  • Fei Xu, Chris Jermaine
    ABSTRACT: Many aspects of the data integration problem have been considered in the literature: how to match schemas across different data sources, how to decide when different records refer to the same entity, how to efficiently perform the required entity resolution in a batch fashion, and so on. However, what has largely been ignored is a way to efficiently deploy these existing methods in a realistic, distributed enterprise integration environment. The straightforward use of existing methods often requires that all data be shipped to a coordinator for cleaning, which is often unacceptable. We develop a set of randomized algorithms that allow efficient application of existing entity resolution methods to the answering of aggregate queries over data that have been distributed across multiple sites. Using our methods, it is possible to efficiently generate aggregate query results that account for duplicate and inconsistent values scattered across a federated system.
    Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007; 09/2007
  • Mingxi Wu, Chris Jermaine
    ABSTRACT: For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to two specific problems that arise in the context of data management.
    Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007; 09/2007
  • ABSTRACT: One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm called the Sort-Merge-Shrink (SMS) Join for computing the answer to such a query over large, disk-based input tables. The key innovation of the SMS join is that if the input data are clustered in a statistically random fashion on disk, then at all times, the join provides an online, statistical estimator for the eventual answer to the query as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy or run the algorithm to completion with a total time requirement that is not much longer than that of other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into main memory.
    ACM Transactions on Database Systems 12/2006; 31:1382-1416. DOI:10.1145/1189769.1189775 · 0.75 Impact Factor
  • Mingxi Wu, Chris Jermaine
    ABSTRACT: An effective approach to detecting anomalous points in a data set is distance-based outlier detection. This paper describes a simple sampling algorithm to efficiently detect distance-based outliers in domains where each and every distance computation is very expensive. Unlike any existing algorithms, the sampling algorithm requires a fixed number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. The experimental study on two expensive domains as well as ten additional real-life data sets demonstrates both the efficiency and effectiveness of the sampling algorithm in comparison with the state-of-the-art algorithm and the reliability of the accuracy guarantees.
    Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006; 08/2006
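    A simplified variant of the idea, for illustration only: score every point by its distance to the nearest member of a small random reference sample, so the number of expensive distance computations is fixed in advance. The published algorithm and its accuracy guarantees are more refined than this sketch.

      import random

      def sampled_outlier_scores(points, distance, sample_size=20, seed=0):
          """Score each point by its distance to the nearest member of a small
          random reference sample, so the number of expensive distance
          computations is fixed at len(points) * sample_size. Points with the
          largest scores are the outlier candidates."""
          rng = random.Random(seed)
          reference = rng.sample(points, sample_size)
          scored = []
          for p in points:
              score = min(distance(p, r) for r in reference if r is not p)
              scored.append((score, p))
          return sorted(scored, key=lambda t: t[0], reverse=True)

      # with an "expensive" distance, e.g. Euclidean over long vectors:
      # def euclid(a, b): return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
      # candidates = sampled_outlier_scores(vectors, euclid)[:10]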
  • ABSTRACT: One important way in which sampling for approximate query processing in a database environment differs from traditional applications of sampling is that in a database, it is feasible to collect accurate summary statistics from the data in addition to the sample. This paper describes a set of sampling-based estimators for approximate query processing that make use of simple summary statistics to greatly increase the accuracy of sampling-based estimators. Our estimators are able to give tight probabilistic guarantees on estimation accuracy. They are suitable for low or high dimensional data, and work with categorical or numerical attributes. Furthermore, the information used by our estimators can easily be gathered in a single pass, making them suitable for use in a streaming environment.
    Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on; 05/2006
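    As one classical example of combining a sample with an exactly known summary statistic (not the paper's specific estimators), the regression estimator below sharpens an estimate of SUM(y) using the known total of a correlated attribute x.

      import statistics

      def regression_estimator(sample_pairs, population_size, known_x_total):
          """Estimate SUM(y) from a uniform sample of (x, y) pairs, using the
          exactly known total of a correlated attribute x as the summary
          statistic that sharpens the plain sampling estimate."""
          xs = [x for x, _ in sample_pairs]
          ys = [y for _, y in sample_pairs]
          N = population_size
          xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
          sxx = sum((x - xbar) ** 2 for x in xs)
          sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          b = sxy / sxx if sxx > 0 else 0.0  # sample regression slope of y on x
          return N * ybar + b * (known_x_total - N * xbar)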
  • ABSTRACT: The largest databases in use today are so large that answering a query exactly can take minutes, hours, or even days. One way to address this problem is to make use of approximation algorithms. Previous work on online aggregation has considered how to give online estimates with ever-increasing accuracy for aggregate functions over relational join and selection queries. However, no existing work is applicable to online estimation over subset-based SQL queries: those queries with a correlated subquery linked to an outer query via a NOT EXISTS, NOT IN, EXISTS, or IN clause (other queries such as EXCEPT and INTERSECT can also be seen as subset-based queries). In this paper we develop algorithms for online estimation over such queries, and consider the difficult problem of providing probabilistic accuracy guarantees at all times during query execution.
    Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005; 01/2005
  • Abhijit Pol, Chris Jermaine
    ABSTRACT: Statistical estimation and approximate query processing have become increasingly prevalent applications for database systems. However, approximation is usually of little use without some sort of guarantee on estimation accuracy, or "confidence bound." Analytically deriving probabilistic guarantees for database queries over sampled data is a daunting task, not suitable for the faint of heart, and certainly beyond the expertise of the typical database system end-user. This paper considers the problem of incorporating into a database system a powerful "plug-in" method for computing confidence bounds on the answer to relational database queries over sampled or incomplete data. This statistical tool, called the bootstrap, is simple enough that it can be used by a database programmer with a rudimentary mathematical background, but general enough that it can be applied to almost any statistical inference problem. Given the power and ease-of-use of the bootstrap, we argue that the algorithms presented for supporting the bootstrap should be incorporated into any database system which is intended to support analytic processing.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005; 01/2005
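    A minimal percentile-bootstrap sketch of the "plug-in" idea (the paper's contribution is supporting this machinery efficiently inside the database engine, which this does not attempt): resample the sample with replacement, recompute the estimator, and read confidence bounds off the empirical quantiles.

      import random
      import statistics

      def bootstrap_ci(sample, estimator, n_boot=2000, alpha=0.05, seed=0):
          """Percentile-bootstrap confidence interval for any estimator
          computed from a sample: resample with replacement, recompute the
          estimator, and read off the empirical quantiles."""
          rng = random.Random(seed)
          n = len(sample)
          replicates = sorted(
              estimator([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)
          )
          lo = replicates[int((alpha / 2) * n_boot)]
          hi = replicates[int((1 - alpha / 2) * n_boot) - 1]
          return estimator(sample), (lo, hi)

      # e.g. bound the average of a sampled column:
      # est, (lo, hi) = bootstrap_ci(sampled_values, statistics.mean)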