Luis Leopoldo Perez

Rice University, Houston, Texas, United States

Publications (8)

0 Total impact

  • ABSTRACT: This paper describes the SimSQL system, which allows for SQL-based specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.
    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data; 06/2013
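The core idea above can be sketched in a few lines: a database-valued Markov chain is simulated by repeatedly applying a VG-style generator whose output table is parameterized by the previous step's table. This is a minimal Python analogue, not SimSQL's actual SQL interface; the names `vg_update` and `simulate_chain` and the random-walk model are illustrative assumptions.

```python
import random

def vg_update(prev_table, rng):
    """VG-style generator: the new stochastic table is parameterized
    by the previous one (here, a Gaussian random walk on each balance)."""
    return {cust: bal + rng.gauss(0.0, 1.0) for cust, bal in prev_table.items()}

def simulate_chain(initial_table, steps, seed=0):
    """Simulate the database-valued chain; return the table at every step."""
    rng = random.Random(seed)
    tables = [dict(initial_table)]
    for _ in range(steps):
        tables.append(vg_update(tables[-1], rng))
    return tables

tables = simulate_chain({"alice": 100.0, "bob": 50.0}, steps=10)
# A "query" over the chain: average balance at the final time step.
final_avg = sum(tables[-1].values()) / len(tables[-1])
```

In SimSQL proper, each step's table would be declared in SQL and the generation parallelized via MapReduce; the structure (one stochastic table parameterizing the next) is the same.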
  • ABSTRACT: The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses. In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
    ACM Trans. Database Syst. 01/2011; 36:18.
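The MCDB workflow described above, a random relation filled in by a VG function and a query answer treated as a distribution, can be sketched as follows. The table, the lognormal noise model, and all names here are illustrative assumptions, not MCDB's interface.

```python
import random
import statistics

ORDERS = [("o1", 100.0), ("o2", 250.0), ("o3", 40.0)]  # (order_id, mean_revenue)

def vg_revenue(mean, rng):
    # VG function: uncertain revenue modeled as lognormal noise around the mean.
    return mean * rng.lognormvariate(0.0, 0.25)

def query_total_revenue(rng):
    # SELECT SUM(revenue) FROM orders  -- evaluated over one realized instance
    return sum(vg_revenue(m, rng) for _, m in ORDERS)

def monte_carlo(n, seed=0):
    # Repeat the query over many realized database instances and summarize
    # the resulting distribution of answers.
    rng = random.Random(seed)
    samples = [query_total_revenue(rng) for _ in range(n)]
    return statistics.mean(samples), statistics.stdev(samples)

mean_total, sd_total = monte_carlo(1000)
```

MCDB's contribution is doing this inside the database, with tuple-bundle techniques that avoid re-running the query plan for every Monte Carlo iteration.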
  • Luis Leopoldo Perez, Subi Arumugam, Christopher M. Jermaine
    ABSTRACT: MCDB is a prototype database system for managing stochastic models for uncertain data. In this paper, we study the problem of how to use MCDB to answer statistical queries that search for database objects which satisfy some filter condition with greater (or less than) a user-specified probability. For example: "Which packages will arrive late with > 5% probability?" "Which regions will see more than a 2% decline in sales with > 50% probability?" "What items will be out of stock by Friday with > 20% probability?" We consider both the systems aspects and the statistical aspects of the problem.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010; 01/2010
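A statistical filter query of the kind quoted above ("which packages arrive late with > 5% probability?") can be sketched by estimating each object's probability via Monte Carlo and filtering on the threshold. The delay model (Gaussian around an expected transit time) and all names are illustrative assumptions.

```python
import random

PACKAGES = {"p1": 2.0, "p2": 4.5, "p3": 6.0}  # expected transit days
DEADLINE = 5.0

def prob_late(expected_days, n, rng):
    # Monte Carlo estimate of P(arrival time > deadline) for one package.
    late = sum(1 for _ in range(n) if rng.gauss(expected_days, 1.0) > DEADLINE)
    return late / n

def late_with_probability_over(threshold, n=5000, seed=0):
    # The filter query: keep packages whose estimated lateness probability
    # exceeds the user-specified threshold.
    rng = random.Random(seed)
    return sorted(p for p, d in PACKAGES.items()
                  if prob_late(d, n, rng) > threshold)

risky = late_with_probability_over(0.05)
```

The statistical subtlety studied in the paper is that each probability is only an estimate, so objects near the threshold can be misclassified; the sketch above ignores that and simply compares point estimates.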
  • ABSTRACT: Enterprises often need to assess and manage the risk arising from uncertainty in their data. Such uncertainty is typically modeled as a probability distribution over the uncertain data values, specified by means of a complex (often predictive) stochastic model. The probability distribution over data values leads to a probability distribution over database query results, and risk assessment amounts to exploration of the upper or lower tail of a query-result distribution. In this paper, we extend the Monte Carlo Database System to efficiently obtain a set of samples from the tail of a query-result distribution by adapting recent "Gibbs cloning" ideas from the simulation literature to a database setting.
    PVLDB. 01/2010; 3:782-793.
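For context on why tail exploration needs special machinery, the sketch below shows only the naive baseline: plain Monte Carlo estimation of P(query result > t), which wastes nearly all of its samples when t is far in the tail. The "query" (a sum of exponential losses) is an illustrative assumption; Gibbs-cloning methods target exactly this regime with far fewer samples.

```python
import random

def one_query_result(rng, n_positions=20):
    # Hypothetical query answer: total loss over 20 uncertain positions.
    return sum(rng.expovariate(1.0) for _ in range(n_positions))

def tail_probability(threshold, n=20000, seed=0):
    # Naive Monte Carlo: realize the query answer n times and count
    # how often it lands in the tail of interest.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if one_query_result(rng) > threshold)
    return hits / n

p_tail = tail_probability(threshold=30.0)
```

Here the tail event still occurs a few percent of the time, so the naive estimator works; push the threshold further out and the hit count collapses to zero, which is the problem the cloning-based sampler addresses.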
  • ABSTRACT: Since the 1970s, database systems have been "compute-centric". When a computation needs the data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy, and second, it makes it difficult or impossible for multiple queries or operations that are interested in the same data to amortize the bandwidth and latency costs associated with their data access. In this paper, we describe a purely push-based research prototype database system called DataPath. DataPath is "data-centric". In DataPath, queries do not request data. Instead, data are automatically pushed onto processors, where they are then processed by any interested computation. We show experimentally on a multi-terabyte benchmark that this basic design principle makes for a very lean and fast database system.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010; 01/2010
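The push-based principle can be sketched in miniature: a single scan pushes every data chunk to all registered computations, so concurrent queries share one pass over the data instead of pulling it independently. The class and function names below are illustrative, not DataPath's actual interfaces.

```python
class CountRows:
    """One interested computation: SELECT COUNT(*)."""
    def __init__(self):
        self.count = 0
    def consume(self, chunk):
        self.count += len(chunk)

class SumColumn:
    """Another computation sharing the same scan: SELECT SUM(col)."""
    def __init__(self, col):
        self.col, self.total = col, 0.0
    def consume(self, chunk):
        self.total += sum(row[self.col] for row in chunk)

def push_scan(chunks, computations):
    # One scan of the data; every interested computation sees each chunk
    # as it streams by, amortizing bandwidth across all of them.
    for chunk in chunks:
        for comp in computations:
            comp.consume(chunk)

data = [[{"qty": 2, "price": 9.5}, {"qty": 1, "price": 3.0}],
        [{"qty": 4, "price": 1.25}]]
q1, q2 = CountRows(), SumColumn("price")
push_scan(data, [q1, q2])
```

Both "queries" finish after a single pass over `data`; in the pull model each would have driven its own scan.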
  • ABSTRACT: To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty model to be dynamically parameterized according to the current state of the database. We introduce MCDB, a system for managing uncertain data that is based on a Monte Carlo approach. MCDB represents uncertainty via "VG functions," which are used to pseudorandomly generate realized values for uncertain attributes. VG functions can be parameterized on the results of SQL queries over "parameter tables" that are stored in the database, facilitating what-if analyses. By storing parameters, and not probabilities, and by estimating, rather than exactly computing, the probability distribution over possible query answers, MCDB avoids many of the limitations of prior systems. For example, MCDB can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles. To achieve good performance, MCDB uses novel query processing techniques, executing a query plan exactly once, but over "tuple bundles" instead of ordinary tuples. Experiments indicate that our enhanced functionality can be obtained with acceptable overheads relative to traditional systems.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008; 01/2008
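The "tuple bundle" idea above can be sketched directly: an uncertain attribute is carried as an array of N Monte Carlo instantiations, and each operator runs once, processing all N instantiations positionwise, instead of re-running the plan N times. The schema and VG function are illustrative assumptions.

```python
import random

N = 1000  # Monte Carlo iterations carried inside each bundle

def make_bundle(mean, sd, rng):
    # One tuple, with its uncertain attribute instantiated N times.
    return [rng.gauss(mean, sd) for _ in range(N)]

rng = random.Random(0)
bundles = [make_bundle(100.0, 10.0, rng), make_bundle(60.0, 5.0, rng)]

# SELECT SUM(x) ... executed once over the bundles: positionwise aggregation
# yields N samples from the query-answer distribution in a single pass.
sum_samples = [sum(b[i] for b in bundles) for i in range(N)]

# The query answer is a distribution; report functionals of it.
est_mean = sum(sum_samples) / N
p95 = sorted(sum_samples)[int(0.95 * N)]
```

Note how the aggregation loop runs once regardless of N; that single-plan-execution property is what keeps the Monte Carlo overhead acceptable.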
  • Conference Paper: The DBO database system.
    ABSTRACT: We demonstrate our prototype of the DBO database system. DBO is designed to facilitate scalable analytic processing over large data archives. DBO's analytic processing performance is competitive with other database systems; however, unlike any other existing research or industrial system, DBO maintains a statistically meaningful guess to the final answer to a query from start to finish during query processing. This guess may be quite accurate after only a few seconds or minutes, while answering a query exactly may take hours. This can result in significant savings in both user and computer time, since a user can abort a query as soon as he or she is happy with the guess's accuracy.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008; 01/2008
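The running-guess behavior described in the DBO abstract can be sketched with a deliberately simplified estimator (not DBO's actual algorithm): scan the data in random order and, at any point, scale the sample mean up to an unbiased estimate of the final SUM, with a rough 95% half-width that shrinks as more data is seen.

```python
import math
import random
import statistics

def online_sum(values, checkpoints, seed=0):
    """Scan in random order; report (estimate, 95% half-width) at checkpoints."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n, seen, reports = len(shuffled), [], {}
    for i, v in enumerate(shuffled, start=1):
        seen.append(v)
        if i in checkpoints:
            est = n * statistics.mean(seen)  # unbiased guess at the final SUM
            half = (1.96 * n * statistics.stdev(seen) / math.sqrt(i)
                    if i > 1 else float("inf"))
            reports[i] = (est, half)
    return reports

values = [float(x % 97) for x in range(10000)]
reports = online_sum(values, checkpoints={100, 1000, 10000})
```

After 1% of the scan the guess is already usable and its interval tightens roughly as the square root of the rows seen, which is the behavior that lets a user abort early; DBO's actual estimators also handle joins and disk-order effects, which this sketch ignores.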

Publication Stats

121 Citations

Institutions

  • 2010–2011
    • Rice University
      Houston, Texas, United States
  • 2008
    • University of Florida
      • Department of Computer and Information Science and Engineering
      Gainesville, Florida, United States