Michael J. Franklin’s research while affiliated with University of Chicago and other places


Publications (349)


VUS: Effective and Efficient Accuracy Measures for Time-Series Anomaly Detection
  • Preprint
  • File available

February 2025 · 13 Reads · Ashwin K. Krishna · Marine Bruel · [...]

Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outliers spanning multiple observations). Nevertheless, it is common to use traditional point-based information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of methods by thresholding the anomaly score to mark each point as an anomaly or not. However, mapping discrete labels into continuous data introduces unavoidable shortcomings, complicating the evaluation of range-based anomalies. Notably, the choice of evaluation measure may significantly bias the experimental outcome. Despite over six decades of attention, there has never been a large-scale systematic quantitative and qualitative analysis of time-series AD evaluation measures. This paper extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold (i.e., AUC-ROC and AUC-PR) are more suitable for time-series AD. Motivated by this observation, we first extend the AUC-based measures to account for range-based anomalies. Then, we introduce a new family of parameter-free and threshold-independent measures, Volume Under the Surface (VUS), to evaluate methods while varying parameters. We also introduce two optimized implementations for VUS that significantly reduce the execution time of the initial implementation. Our findings demonstrate that our four measures are significantly more robust in assessing the quality of time-series AD methods.
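
The contrast between threshold-dependent and threshold-free evaluation is easy to see in a few lines of code. The sketch below is only illustrative and assumes nothing about the authors' reference implementation: it computes a point-based F1 at a fixed threshold, a threshold-free AUC-ROC via the rank formulation, and a crude VUS-style score obtained by averaging AUC-ROC while a tolerance buffer around the labeled anomaly range is varied. All function names (f1_at_threshold, dilate_labels, vus_roc_sketch) are hypothetical.

```python
# Illustrative sketch only: threshold-dependent F1 vs. threshold-free AUC-ROC,
# plus a crude VUS-style average over a range of tolerance buffers.
import numpy as np

def f1_at_threshold(scores, labels, threshold):
    """Point-based F1 after hard-thresholding the anomaly score."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def auc_roc(scores, labels):
    """Threshold-free AUC-ROC via the rank (Mann-Whitney U) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos, neg = labels == 1, labels == 0
    n_pos, n_neg = pos.sum(), neg.sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def dilate_labels(labels, buffer):
    """Extend each labeled anomaly point by `buffer` positions on both sides."""
    out = labels.copy()
    for i in np.flatnonzero(labels == 1):
        out[max(0, i - buffer): i + buffer + 1] = 1
    return out

def vus_roc_sketch(scores, labels, max_buffer=10):
    """Average range-aware AUC-ROC over buffer lengths 0..max_buffer
    (a crude stand-in for the Volume Under the Surface idea)."""
    return np.mean([auc_roc(scores, dilate_labels(labels, b))
                    for b in range(max_buffer + 1)])

rng = np.random.default_rng(0)
labels = np.zeros(500, dtype=int)
labels[200:220] = 1                              # one range-based anomaly
scores = rng.normal(0, 1, 500)
scores[195:225] += 3                             # slightly misaligned detector
print(f1_at_threshold(scores, labels, 2.0))      # depends on the chosen threshold
print(auc_roc(scores, labels))                   # threshold-free
print(vus_roc_sketch(scores, labels))            # additionally tolerant to misalignment
```

Changing the 2.0 threshold in the last three lines shows how sensitive the point-based score is to that choice, whereas the two AUC-style values do not depend on a threshold at all.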


Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing

January 2025 · 10 Reads

A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large collections of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. In this paper, we present PALIMPZEST, a system that enables programmers to pose AI-powered analytical queries over arbitrary collections of unstructured data in a simple declarative language. The system uses a cost optimization framework, which explores the search space of AI models, prompting techniques, and related foundation model optimizations. PALIMPZEST implements the query while navigating the trade-offs between runtime, financial cost, and output data quality. We introduce a novel language for AI-powered analytics tasks, the optimization methods that PALIMPZEST uses, and the prototype system itself. We evaluate PALIMPZEST on a real-world workload. Our system produces plans that are up to 3.3x faster and 2.9x cheaper than a baseline method when using a single-thread setup, while also achieving superior F1-scores. PALIMPZEST applies its optimizations automatically, requiring no additional work from the user.
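
To make the declarative flavor concrete, here is a hypothetical sketch of what such a query could look like. The Dataset, sem_filter, and sem_extract names and the NaiveOptimizer are illustrative stand-ins and are not PALIMPZEST's actual API; the point is simply that the programmer states what to compute and a separate optimizer decides which model and prompt to use for each operator.

```python
# Hypothetical, illustrative sketch of a declarative AI-analytics query.
from dataclasses import dataclass

@dataclass
class Op:
    kind: str          # "filter" or "extract"
    instruction: str   # natural-language predicate or output description

class Dataset:
    """A lazily built logical plan over a collection of unstructured files."""
    def __init__(self, source):
        self.source = source
        self.ops = []

    def sem_filter(self, instruction):
        self.ops.append(Op("filter", instruction))
        return self

    def sem_extract(self, instruction):
        self.ops.append(Op("extract", instruction))
        return self

    def run(self, optimizer):
        plan = optimizer.choose_plan(self.ops)   # pick a model/prompt per operator
        print("executing physical plan:", plan)
        # ... execution against the actual files would go here ...

class NaiveOptimizer:
    """Stand-in for a cost optimizer: cheap model for filters, stronger model for extraction."""
    def choose_plan(self, ops):
        return [(op.kind, op.instruction,
                 "small-model" if op.kind == "filter" else "large-model")
                for op in ops]

# Declarative query: find papers about time-series anomaly detection and pull out
# the evaluation measures they propose.
q = (Dataset("papers/*.pdf")
     .sem_filter("the paper is about time-series anomaly detection")
     .sem_extract("name of each evaluation measure proposed in the paper"))
q.run(NaiveOptimizer())
```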


Databases Unbound: Querying All of the World's Bytes with AI

November 2024 · 56 Reads · 5 Citations

Proceedings of the VLDB Endowment

Over the past five decades, the relational database model has proven to be a scalable and adaptable model for querying a variety of structured data, with use cases in analytics, transactions, graphs, streaming and more. However, most of the world's data is unstructured. Thus, despite their success, the reality is that the vast majority of the world's data has remained beyond the reach of relational systems. The rise of deep learning and generative AI offers an opportunity to change this. These models provide a stunning capability to extract semantic understanding from almost any type of document, including text, images, and video, which can extend the reach of databases to all the world's data. In this paper we explore how these new technologies will transform the way we build database management software, creating new systems that can ingest, store, process, and query all data. Building such systems presents many opportunities and challenges. In this paper we focus on three: scalability, correctness, and reliability, and argue that the declarative programming paradigm that has served relational systems so well offers a path forward in the new world of AI data systems as well. To illustrate this, we describe several examples of such declarative AI systems we have built in document and video processing, and provide a set of research challenges and opportunities to guide research in this exciting area going forward.

"And lovely apparitions, dim at first, Then radiant, as the mind arising bright From the embrace of beauty (whence the forms Of which these are the phantoms) casts on them The gathered rays which are reality, Shall visit us the progeny immortal Of Painting, Sculpture, and rapt Poesy, And arts, though unimagined, yet to be." (Prometheus Unbound, Percy Bysshe Shelley)


A Declarative System for Optimizing AI Workloads

May 2024 · 16 Reads

Modern AI models provide the key to a long-standing dream: processing analytical queries about almost any kind of data. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or insights from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present Palimpzest, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework -- which explores the search space of AI models, prompting techniques, and related foundation model optimizations -- to implement the query with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that Palimpzest uses, and the prototype system itself. We evaluate Palimpzest on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster, 2.9x cheaper, and offers better data quality than the baseline method. With parallelism enabled, Palimpzest can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These require no additional work by the user.
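
As a rough illustration of the optimization problem described here (and not the paper's actual algorithm), the sketch below enumerates candidate physical plans from hypothetical per-record estimates of runtime, dollar cost, and quality for a few model/prompt combinations, then keeps only the Pareto-optimal plans. All numbers and names are made up for the example.

```python
# Illustrative sketch: enumerate candidate plans and keep the Pareto frontier
# over (runtime, financial cost, output quality). All estimates are invented.
from itertools import product

MODELS = {          # hypothetical per-record estimates: (seconds, dollars, quality)
    "gpt-4":     (2.0, 0.03, 0.95),
    "gpt-3.5":   (0.8, 0.002, 0.85),
    "local-llm": (1.5, 0.0, 0.75),
}
PROMPTS = {         # multipliers on runtime/cost, additive bump on quality
    "zero-shot":        (1.0, 1.0, 0.00),
    "chain-of-thought": (1.6, 1.6, 0.04),
}

def enumerate_plans():
    for (m, (t, c, q)), (p, (tm, cm, qb)) in product(MODELS.items(), PROMPTS.items()):
        yield {"model": m, "prompt": p,
               "runtime": t * tm, "cost": c * cm, "quality": min(1.0, q + qb)}

def dominates(a, b):
    """Plan a dominates plan b if it is no worse on all axes and strictly better on one."""
    no_worse = (a["runtime"] <= b["runtime"] and a["cost"] <= b["cost"]
                and a["quality"] >= b["quality"])
    strictly_better = (a["runtime"] < b["runtime"] or a["cost"] < b["cost"]
                       or a["quality"] > b["quality"])
    return no_worse and strictly_better

plans = list(enumerate_plans())
pareto = [p for p in plans if not any(dominates(other, p) for other in plans)]
for plan in sorted(pareto, key=lambda p: p["cost"]):
    print(plan)
```

A real optimizer must also estimate these numbers from the query and data, but the selection step itself reduces to exactly this kind of multi-objective pruning.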



Will LLMs Reshape, Supercharge, or Kill Data Science? (VLDB 2023 Panel)

September 2023 · 35 Reads · 2 Citations

Proceedings of the VLDB Endowment

Large language models (LLMs) have recently taken the world by storm, promising potentially game-changing opportunities in multiple fields. Naturally, there is significant promise in applying LLMs to the management of structured data, or more generally, to the processes involved in data science. At the very least, LLMs have the potential to provide substantial advancements in long-standing challenges that our community has been tackling for decades. On the other hand, they may introduce completely new capabilities that we have only dreamed of thus far. This panel will bring together a few leading experts who have been thinking about these opportunities from various perspectives and fielding them in research prototypes and even in commercial applications.


How Large Language Models Will Disrupt Data Management

August 2023 · 86 Reads · 49 Citations

Proceedings of the VLDB Endowment

Large language models (LLMs), such as GPT-4, are revolutionizing software's ability to understand, process, and synthesize language. The authors of this paper believe that this advance in technology is significant enough to prompt introspection in the data management community, similar to previous technological disruptions such as the advents of the world wide web, cloud computing, and statistical machine learning. We argue that the disruptive influence that LLMs will have on data management will come from two angles. (1) A number of hard database problems, namely, entity resolution, schema matching, data discovery, and query synthesis, hit a ceiling of automation because the system does not fully understand the semantics of the underlying data. Based on large training corpora of natural language, structured data, and code, LLMs have an unprecedented ability to ground database tuples, schemas, and queries in real-world concepts. We will provide examples of how LLMs may completely change our approaches to these problems. (2) LLMs blur the line between predictive models and information retrieval systems with their ability to answer questions. We will present examples showing how large databases and information retrieval systems have complementary functionality.
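
A concrete instance of point (1) is entity resolution. The sketch below is a minimal, hedged illustration of the idea rather than any system described in the paper: two messy records are embedded in a natural-language prompt and the model is asked whether they refer to the same real-world entity. The call_llm function is a hypothetical placeholder for whatever model client is available.

```python
# Minimal illustration of LLM-based entity resolution; call_llm is a placeholder.

def build_prompt(record_a: dict, record_b: dict) -> str:
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Answer with exactly 'yes' or 'no'."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion client here.
    raise NotImplementedError("wire up a model of your choice")

def same_entity(record_a: dict, record_b: dict) -> bool:
    answer = call_llm(build_prompt(record_a, record_b))
    return answer.strip().lower().startswith("yes")

# Example pair that a purely syntactic string-similarity matcher would likely miss,
# because the match requires real-world knowledge rather than character overlap:
a = {"name": "Intl. Business Machines", "hq": "Armonk, NY"}
b = {"name": "IBM Corporation", "city": "Armonk"}
# same_entity(a, b)  # a capable model grounds both records in the same company
```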


Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances

June 2023 · 14 Reads · 10 Citations

Proceedings of the VLDB Endowment

Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) measures prune unnecessary comparisons with elastic distances to accelerate similarity search. Despite decades of attention, there has never been a study to assess the progress in this area. In addition, the research has disproportionately focused on one popular elastic measure, while other accurate measures have received little or no attention. Therefore, there is merit in developing a framework to accumulate knowledge from previously developed LBs and eliminate the notoriously challenging task of designing separate LBs for each elastic measure. In this paper, we perform the first comprehensive study of 11 LBs spanning 5 elastic measures using 128 datasets. We identify four properties that constitute the effectiveness of LBs and propose the Generalized Lower Bounding (GLB) framework to satisfy all desirable properties. GLB creates cache-friendly summaries, adaptively exploits summaries of both query and target time series, and captures boundary distances in an unsupervised manner. GLB outperforms all LBs in speedup (e.g., up to 13.5× faster against the strongest LB in terms of pruning power), establishes new state-of-the-art results for the 5 elastic measures, and provides the first LBs for 2 elastic measures with no known LBs. Overall, GLB enables the effective development of LBs to facilitate fast similarity search.
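
For readers unfamiliar with lower bounding, the sketch below illustrates the classic idea this study builds on, using the well-known LB_Keogh bound for band-constrained DTW; it is not a reproduction of GLB. The envelope of the query is precomputed once, and the cheap bound lets 1-NN search skip the expensive elastic distance whenever the bound already exceeds the best distance found so far.

```python
# Illustrative LB_Keogh-style lower bounding for constrained DTW (not the GLB framework).
import numpy as np

def keogh_envelope(query: np.ndarray, r: int):
    """Upper/lower envelopes of the query within a Sakoe-Chiba band of radius r."""
    n = len(query)
    upper = np.array([query[max(0, i - r): i + r + 1].max() for i in range(n)])
    lower = np.array([query[max(0, i - r): i + r + 1].min() for i in range(n)])
    return upper, lower

def lb_keogh(candidate: np.ndarray, upper: np.ndarray, lower: np.ndarray) -> float:
    """Lower bound on the squared band-constrained DTW distance to the query."""
    above = np.clip(candidate - upper, 0, None)
    below = np.clip(lower - candidate, 0, None)
    return float(np.sum(above ** 2 + below ** 2))

def dtw_sq(x: np.ndarray, y: np.ndarray, r: int) -> float:
    """Squared DTW distance with a Sakoe-Chiba band of radius r."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def search(query, candidates, r):
    """1-NN search: compute full DTW only when the lower bound cannot prune."""
    upper, lower = keogh_envelope(query, r)
    best, best_dist = None, np.inf
    for i, c in enumerate(candidates):
        if lb_keogh(c, upper, lower) >= best_dist:
            continue                      # pruned without computing DTW
        d = dtw_sq(query, c, r)
        if d < best_dist:
            best, best_dist = i, d
    return best, best_dist

rng = np.random.default_rng(1)
query = np.sin(np.linspace(0, 6, 128)) + 0.05 * rng.normal(size=128)
candidates = [np.sin(np.linspace(0, 6, 128) + s) for s in (0.1, 1.5, 3.0)]
print(search(query, candidates, r=8))     # index of nearest neighbor and its distance
```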


Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow

May 2023 · 52 Reads

Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log. We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show that Data Station: i) outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) is orders of magnitude faster than alternative secure data-sharing frameworks; and iii) introduces small overhead on the critical path.
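
The core escrow pattern is easy to sketch in a few lines, purely as a conceptual illustration and not Data Station's implementation (which relies on hardware enclaves): owners register data, grant consent for specific functions, and users receive only derived results while every invocation lands in an audit log. All class and method names below are hypothetical.

```python
# Conceptual sketch of the delegated-computation escrow pattern described in the abstract.
import datetime

class DataEscrow:
    def __init__(self):
        self._data = {}          # dataset_id -> (owner, payload)
        self._consent = set()    # (dataset_id, function_name) pairs approved by owners
        self.audit_log = []

    def register_data(self, owner, dataset_id, payload):
        self._data[dataset_id] = (owner, payload)

    def grant_consent(self, owner, dataset_id, function_name):
        if self._data.get(dataset_id, (None,))[0] == owner:
            self._consent.add((dataset_id, function_name))

    def run(self, user, function, dataset_ids):
        """Delegated computation: only derived results leave the escrow."""
        for d in dataset_ids:
            if (d, function.__name__) not in self._consent:
                raise PermissionError(f"no consent for {function.__name__} on {d}")
        result = function([self._data[d][1] for d in dataset_ids])
        self.audit_log.append((datetime.datetime.now(datetime.timezone.utc).isoformat(),
                               user, function.__name__, tuple(dataset_ids)))
        return result

# Example: two hospitals share an aggregate without releasing any records.
def total_rows(datasets):
    return sum(len(rows) for rows in datasets)

escrow = DataEscrow()
escrow.register_data("hospital_a", "a_records", [{"id": 1}, {"id": 2}])
escrow.register_data("hospital_b", "b_records", [{"id": 7}])
escrow.grant_consent("hospital_a", "a_records", "total_rows")
escrow.grant_consent("hospital_b", "b_records", "total_rows")
print(escrow.run("researcher", total_rows, ["a_records", "b_records"]))  # -> 3
print(escrow.audit_log)
```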



Citations (76)


... In addition, SICs are expressed in the same language that defines data processing and integrate seamlessly into the relational model. These design decisions are in line with previous efforts that argue for statement and solution specifications on LLM-based system components [44] as well as an (extended) relational model in AI-augmented DPSs [26,29,31,37]. ...

Reference:

Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems
Databases Unbound: Querying All of the World's Bytes with AI
  • Citing Article
  • November 2024

Proceedings of the VLDB Endowment

... New research fields are opening strong opportunities for the definition of conceptual models [13,14]. Simultaneously, research has also been devoted to addressing data integration with novel approaches [15][16][17]. As a result, it is imperative to investigate advancements in both fields: modeling and integration. ...

Will LLMs Reshape, Supercharge, or Kill Data Science? (VLDB 2023 Panel)
  • Citing Article
  • September 2023

Proceedings of the VLDB Endowment

... New AI tools leveraging these models, such as Elicit and NotebookLM, demonstrate potential in revolutionizing research practices, including discovering information, conducting literature surveys, and performing meta-analyses. Specialized applications of LLMs in data management, such as entity resolution, schema matching, and query synthesis, as discussed by Fernandez et al. (2023), further highlight their utility. LLMs are also transforming human-computer interaction in scientific analysis by lowering the barrier to using complex scientific data sets. ...

How Large Language Models Will Disrupt Data Management
  • Citing Article
  • August 2023

Proceedings of the VLDB Endowment

... The latter results in ordered sequences of real-valued data points commonly referred to as time series. Analytical tasks over time series data are becoming increasingly important in virtually every domain [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], including astronomy [20], energy [21], environmental [22], and social [23] sciences. ...

Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances
  • Citing Article
  • June 2023

Proceedings of the VLDB Endowment

... [48] discusses techniques to protect sensitive data when these data depend on other disclosable attributes. Data Station [66] is a data escrow that provides a policy framework to coordinate and execute datasharing tasks in a trustworthy environment. IoT Notary [46] is a privacy-preserving system that enforces data-capturing policies in IoT devices. ...

Data station: delegated, trustworthy, and auditable computation to enable data-sharing consortia with a data escrow
  • Citing Article
  • July 2022

Proceedings of the VLDB Endowment

... Therefore, researchers have recognized that the classical PW metrics are inadequate for TAD evaluation, as they treat each time point as an individual unit, failing to account for the temporal continuity of anomaly events. In recent years, improved TAD evaluation methods [17]-[20] that take the continuity of time series into consideration have been proposed, which can be broadly classified into two categories: point-based and event-based methods. Point-based methods treat each anomaly point as equal, with specific predicted points adjusted to account for the temporal context. ...

Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection

Proceedings of the VLDB Endowment

... The third category, Collective anomalies, corresponds to sequences of points that do not repeat a typical (previously observed) pattern. [Figure 3: Example of interactive systems, (a) the GraphAn system, which allows the user to dive into computational steps [6], and (b) the Theseus system, which depicts experimental results [7].] ...

Theseus: Navigating the Labyrinth of Time-Series Anomaly Detection

Proceedings of the VLDB Endowment

... Massive collections of time-varying measurements, commonly referred to as time series, have become a reality in virtually every scientific and industrial domain [5,39,44,43,47,19,6,42,40]. Notably, there is an increasingly pressing need for developing techniques for efficient and effective analysis of zettabytes of time series produced by millions of Internet-of-Things (IoT) devices [22,24,46,26,27,32]. ...

Fast Adaptive Similarity Search through Variance-Aware Quantization

... The TSB-UAD anomaly detection benchmark [30] consists of 1980 univariate time series collected from a variety of sources, including human body signals, spacecraft data, environmental data, and network services. These time series are annotated with anomalies based on 18 publicly available anomaly detection datasets from the past decade. ...

TSB-UAD: an end-to-end benchmark suite for univariate time-series anomaly detection

Proceedings of the VLDB Endowment