Michael J. Franklin's research while affiliated with University of Chicago and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (339)
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built...
The detection of anomalies in time series has gained ample academic and industrial attention, yet no comprehensive benchmark exists to evaluate time-series anomaly detection methods. Therefore, there is no final verdict on which method performs best (and under what conditions). Consequently, we often observe methods performing exceptionally we...
Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outli...
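As a concrete illustration of the point-based versus range-based distinction this abstract draws, here is a minimal sketch (my own toy example, not the paper's method): a global z-score flags outlying single observations, while a sliding-window statistic is needed to flag an anomalous subsequence whose individual points look ordinary.

```python
# Toy contrast of point-based vs range-based anomaly detection with NumPy.
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, 1000)
series[400] = 8.0        # a point anomaly: one outlying observation
series[700:720] += 3.0   # a range anomaly: a shifted subsequence

# Point-based: flag observations far from the global mean.
z = np.abs(series - series.mean()) / series.std()
point_anomalies = np.flatnonzero(z > 4.0)  # includes index 400

# Range-based: flag windows whose mean deviates from the global mean.
w = 20
window_means = np.convolve(series, np.ones(w) / w, mode="valid")
range_starts = np.flatnonzero(np.abs(window_means - series.mean()) > 1.5)

print(point_anomalies)        # the single spike
print(range_starts[:5], "…")  # window starts overlapping the shifted subsequence
```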
Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.
The detection of anomalies in time series has gained ample academic and industrial attention. However, no comprehensive benchmark exists to evaluate time-series anomaly detection methods. It is common to use (i) proprietary or synthetic data, often biased to support particular claims; or (ii) a limited collection of publicly available datasets. Con...
Subsequence anomaly detection in long data series is a significant problem. While the demand for real-time analytics and decision-making increases, anomaly detection methods have to operate over streams and handle drifts in data distribution. Nevertheless, existing approaches either require prior domain knowledge or become cumbersome and expensive...
With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent...
This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine...
Existing stream processing and continuous query processing systems eagerly maintain standing queries by consuming all available resources to finish the jobs at hand, which can be a major source of wasted CPU cycles and memory resources. However, users sometimes do not need to see the up-to-date query result right after the data is ready, and thus...
Machine Learning (ML) has become a critical tool enabling new methods of analysis and driving deeper understanding of phenomena across scientific disciplines. There is a growing need for “learning systems” to support various phases in the ML lifecycle. While others have focused on supporting model development, training, and inference, few have focu...
Data only generates value for a few organizations with expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and they do not have incentives to prepare the data in a way that is easy to consume by oth...
Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing...
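To make the idea of contingency analysis concrete, here is a hedged sketch (my own simplified constraint form, not necessarily the paper's framework): given a constraint that at most k rows are missing, each with a value in a known range, the possible values of a SUM aggregate form an interval.

```python
# Sketch: the range a SUM query could take if up to k_missing rows were
# withheld, each with a value known to lie in [lo, hi].
def sum_range(observed_sum, k_missing, lo, hi):
    low = observed_sum + k_missing * min(lo, 0.0)   # missing rows lower the sum only if lo < 0
    high = observed_sum + k_missing * max(hi, 0.0)  # ...and raise it only if hi > 0
    return low, high

# Example: SELECT SUM(sales) observed as 10,000, with up to 50 rows
# withheld, each known to hold a sale between 0 and 200.
print(sum_range(10_000, 50, 0.0, 200.0))  # (10000.0, 20000.0)
```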
Approximately every five years, a group of database researchers meet to do a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the 9th such meeting, held during October 9-10, 2018 in Seattle.
As neural networks are increasingly employed in machine learning practice, organizations will have to determine how to share limited training resources among a diverse set of model training tasks. This paper studies jointly training multiple neural network models on a single GPU. We present an empirical study of this operation, called pack, and en...
The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both...
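As a rough illustration of band-limiting (my own sketch under simple assumptions, not the paper's exact training procedure), one can zero out the high-frequency tail of a filter's spectrum before it is used, so the convolution effectively operates on a reduced frequency band:

```python
# Zero all frequency components of a 2-D filter above a kept fraction
# of the spectrum, via FFT masking.
import numpy as np

def band_limit(filt, keep_fraction=0.5):
    spec = np.fft.fft2(filt)
    h, w = filt.shape
    fy = np.abs(np.fft.fftfreq(h))[:, None]  # per-axis normalized frequencies
    fx = np.abs(np.fft.fftfreq(w))[None, :]
    mask = (fy <= 0.5 * keep_fraction) & (fx <= 0.5 * keep_fraction)
    return np.real(np.fft.ifft2(spec * mask))

filt = np.random.default_rng(1).normal(size=(7, 7))
print(band_limit(filt, keep_fraction=0.5).shape)  # (7, 7), a smoothed filter
```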
The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of "AI Systems" research where reducing resource consumption, reasoning about transient resource availability, trading off resource con...
The analysis of time series is becoming increasingly prevalent across scientific disciplines and industrial applications. The effectiveness and the scalability of time-series mining techniques critically depend on design choices for three components responsible for (i) representing; (ii) comparing; and (iii) indexing time series. Unfortunately, the...
Many applications ingest data in an intermittent, yet largely predictable, pattern. Existing systems tend to ignore how data arrives when making decisions about how to update (or refresh) an ongoing query. To address this shortcoming we propose a new query processing paradigm, Intermittent Query Processing (IQP), that bridges query execution and po...
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant...
The results collected from crowd workers may not be reliable because (1) some malicious workers return answers at random and (2) some tasks are hard and workers may not be good at them. Thus it is important to exploit the different characteristics of workers and tasks and control the quality in crowdsourcing. Existing studi...
Data completeness is becoming a significant roadblock in data quality. Existing research in this area currently handles the certainty of a query by ignoring the incomplete part and approximating missing attributes on partially complete tuples, but leaves open the question of how the missing data affect the quality of the results. This is particular...
The world has experienced phenomenal growth in data production and storage in recent years, much of which has taken the form of media files. At the same time, computing power has become abundant with multi-core machines, grids, and clouds. Yet it remains a challenge to harness the available power and move toward gracefully searching and retrieving...
While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So-called "data lakes" embrace the storage of data in its natural form, integrating and organizing in a pay-as-you-go fashion. While this m...
Data science promises new insights, helping transform information into knowledge that can drive science and industry.
It is rather inconvenient to interact with crowdsourcing platforms, as the platforms require one to set various parameters and even write code. Thus, inspired by traditional DBMS, crowdsourcing database systems have been designed and built.
This chapter introduces the background of crowdsourcing. Section 2.1 gives an overview of crowdsourcing, and Sect. 2.2 introduces the crowdsourcing workflow. Next, Sect. 2.3 introduces some widely used crowdsourcing platforms, and Sect. 2.4 discusses existing tutorials, surveys, and books on crowdsourcing. Finally, Sect. 2.5 presents the optimizati...
Despite the availability of crowdsourcing platforms, which provide a much cheaper way to ask humans to do some work, it is still quite expensive when there is a lot of work to do. Therefore, a big challenge in crowdsourced data management is cost control, i.e., how to reduce human cost while still keeping good result quality.
To obtain high-quality results, different applications require the use of different crowdsourced operators, which have operator-specific optimization goals over three factors: cost, quality, and latency. This chapter reviews how crowdsourced operators (i.e., crowdsourced selection, crowdsourced collection, crowdsourced join, crowdsourced sort, crow...
Latency refers to the total time of completing a job. Since humans are slower than machines, sometimes even a simple job (e.g., labeling one thousand images) may take hours or even days to complete. Thus, another big challenge in crowdsourced data management is latency control, that is, how to reduce job completion time while still keeping good res...
Crowdsourcing has become more and more prevalent in data management and analytics. This book provides a comprehensive review of crowdsourced data management, including motivation, applications, techniques, and existing systems. This chapter first summarizes this book in Sect. 8.2 and then provides several research challenges and opportunities in Se...
This book provides an overview of crowdsourced data management. Covering all aspects including the workflow, algorithms and research potential, it particularly focuses on the latest techniques and recent advances. The authors identify three key aspects in determining the performance of crowdsourced data management: quality control, cost control and...
Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and when new data streams in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsiste...
Large scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24x7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choos...
We present the Hippo system to enable the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets and the cell...
Computing power has now become abundant with multi-core machines, grids and clouds, but it remains a challenge to harness the available power and move towards gracefully handling web-scale datasets. Several researchers have used automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly...
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this paper, we introduce Clipper, the first general-purpose low-latency prediction serving sy...
As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data system...
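The data skipping mechanism this abstract builds on can be illustrated with a small sketch (my own toy example, not the paper's partitioning scheme): per-block min/max metadata lets a filter query skip blocks whose value range cannot satisfy the predicate.

```python
# Zone-map-style data skipping: keep per-block min/max and skip blocks
# that cannot contain qualifying rows.
import numpy as np

data = np.sort(np.random.default_rng(2).integers(0, 1_000_000, 100_000))
blocks = np.array_split(data, 100)
meta = [(b.min(), b.max()) for b in blocks]  # per-block metadata

def scan_greater_than(threshold):
    hits, scanned = [], 0
    for (lo, hi), block in zip(meta, blocks):
        if hi <= threshold:   # no row in this block can qualify
            continue          # skipped without touching the data
        scanned += 1
        hits.append(block[block > threshold])
    return (np.concatenate(hits) if hits else np.array([])), scanned

result, blocks_scanned = scan_greater_than(900_000)
print(len(result), "rows found;", blocks_scanned, "of 100 blocks scanned")
```

Because the toy data is sorted (clustered on the filter column), most blocks are skipped; fine-grained partitioning aims to create exactly this kind of clustering.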
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distribute...
Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to...
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, HPC tools are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark—a modern platform for data intensive comp...
Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative...
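To make the iterative clean-then-retrain loop concrete, here is a schematic sketch (a heavy simplification of mine; ActiveClean itself prioritizes which records to clean based on their estimated impact on the model, which this sketch does not do):

```python
# Schematic progressive cleaning: train on partially clean data, clean a
# small batch, retrain, repeat.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 500)
y_dirty = y.copy()
dirty_idx = rng.choice(500, 50, replace=False)
y_dirty[dirty_idx] += 100.0  # systematic corruption in 10% of labels

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

uncleaned = set(dirty_idx)
for step in range(5):
    theta = fit(X, y_dirty)            # model trained on partially clean data
    batch = list(uncleaned)[:10]       # "clean" a small batch per iteration
    y_dirty[batch] = y[batch]          # stand-in for a manual or crowd fix
    uncleaned.difference_update(batch)
    print(f"step {step}: coefficient estimate {np.round(theta, 2)}")
```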
For distributed environments including Web Services, data and application integration, and personalized content delivery, XML is becoming the common wire format for data. In this emerging distributed infrastructure, XML message brokers will play a key role as central exchange points for messages sent between applications and/or users. Users (equiva...
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a fron...
Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateC...
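For background on the differential privacy side of this abstract (this is the standard Laplace mechanism, not PrivateClean's specific design): a count query has sensitivity 1, so adding Laplace noise with scale sensitivity/epsilon yields an epsilon-differentially private answer.

```python
# The Laplace mechanism for an epsilon-differentially private count.
import numpy as np

def private_count(values, predicate, epsilon, rng=np.random.default_rng(4)):
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # sensitivity of a count is 1
    return true_count + noise

ages = [23, 35, 41, 29, 52, 64, 37]
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```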
Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean,...
Data cleaning is frequently an iterative process tailored to the requirements of a specific analysis task. The design and implementation of iterative data cleaning tools presents novel challenges, both technical and organizational, to the community. In this paper, we present results from a user survey (N = 29) of data analysts and infrastructure en...
Linear causal analysis is central to a wide range of important applications spanning finance, the physical sciences, and engineering. Much of the existing literature in linear causal analysis operates in the time domain. Unfortunately, the direct application of time-domain linear causal analysis to many real-world time series presents three critical...
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows ca...
Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd. Such systems raise many implementation questions. Perhaps the most fundamental issue is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence the meaning of...
Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve...
Second order stationary models in time series analysis are based on the analysis of essential statistics whose computations follow a common pattern. In particular, with a map-reduce nomenclature, most of these operations can be modeled as mapping a kernel that only depends on short windows of consecutive data and reducing the results produced by ea...
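The pattern this abstract describes can be shown on one such statistic (my own illustration): the lag-k autocovariance is computed by mapping a kernel that needs only k+1 consecutive values over the series and reducing the per-window results with a sum.

```python
# Lag-k autocovariance expressed as a windowed map followed by a reduce.
import numpy as np

def autocovariance(x, k):
    mu = x.mean()
    # map: a kernel that depends only on a window of k+1 consecutive values
    mapped = [(x[i] - mu) * (x[i + k] - mu) for i in range(len(x) - k)]
    # reduce: combine the per-window results with a sum
    return sum(mapped) / len(x)

x = np.sin(np.linspace(0, 20 * np.pi, 2000))
print(autocovariance(x, 0), autocovariance(x, 10))
```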
Scalable distributed dataflow systems have recently experienced widespread adoption, with commodity dataflow engines such as Hadoop and Spark, and even commodity SQL engines routinely supporting increasingly sophisticated analytics tasks (e.g., support vector machines, logistic regression, collaborative filtering). However, these systems' synchrono...
Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on large datasets. When new records arrive at a high rate, it is infeasible to continuously update (maintain) MVs and a common solution is to defer maintenance by batching updates together. Between batches the MVs become increasingly stale with incorre...
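A toy sketch of deferred maintenance (my own example, not the paper's system) makes the staleness trade-off visible: updates are buffered and folded into the stored aggregate in batches, so between batches the view answers from stale state.

```python
# A materialized SUM view with batched (deferred) maintenance.
class SumView:
    def __init__(self):
        self.materialized = 0.0  # last maintained result
        self.pending = []        # updates received since the last batch

    def ingest(self, delta):
        self.pending.append(delta)  # cheap: no view update yet

    def maintain(self):
        self.materialized += sum(self.pending)  # one batched maintenance pass
        self.pending.clear()

    def query(self):
        return self.materialized  # stale whenever pending is non-empty

view = SumView()
for d in [10, -3, 7]:
    view.ingest(d)
print(view.query())  # 0.0 -- stale until maintenance runs
view.maintain()
print(view.query())  # 14.0
```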
The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. A major obstacle to supporting these predictive applications is the challenging and expe...
Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and f...
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big d...
"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current...
The rise of data-intensive "Web 2.0" Internet services has led to a range of popular new programming frameworks that collectively embody the latest incarnation of the vision of Object-Relational Mapping (ORM) systems, albeit at unprecedented scale. In this work, we empirically investigate modern ORM-backed applications' use and disuse of database c...
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e...
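A minimal PySpark example of the kind of mixing this abstract describes (standard public API; the data and names here are my own toy inputs): the same dataset is manipulated through the functional DataFrame API and through declarative SQL over a registered view.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)

# Functional API: a relational transformation expressed in Python.
adults = df.filter(df.age > 30).select("name")

# Declarative SQL over the same data, via a temporary view.
df.createOrReplaceTempView("people")
adults_sql = spark.sql("SELECT name FROM people WHERE age > 30")

adults.show()
adults_sql.show()
spark.stop()
```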
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, o...
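For a flavor of the library, here is a minimal example using MLlib's DataFrame-based API (the spark.ml package, which postdates the paper's original RDD-based API; the toy data is mine):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny labeled training set of (label, feature-vector) rows.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.2, 1.3)),
     (1.0, Vectors.dense(1.9, 0.8))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)  # distributed training under the hood
print(model.coefficients, model.intercept)
spark.stop()
```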
The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs...
Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that Big Data has now become a defining challenge...
Minimizing coordination, or blocking communication between concurrently executing operations, is key to maximizing scalability, availability, and high performance in database systems. However, uninhibited coordination-free execution can compromise application correctness, or consistency. When is coordination necessary for correctness? The classic u...
The seminal 2003 paper by Cosley, Lam, Albert, Konstan, and Riedl demonstrated the susceptibility of recommender systems to rating biases. To facilitate browsing and selection, almost all recommender systems display average ratings before accepting ratings from users, which has been shown to bias ratings. This effect is called Social Influence...
Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand i...
In the SIGMOD 2013 conference, we published a paper extending our earlier work on crowdsourced entity resolution to improve crowdsourced join processing by exploiting transitive relationships [Wang et al. 2013]. The VLDB 2014 conference has a paper that follows up on our previous work [Vesdapunt et al., 2014], which points out and corrects a mistak...
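The transitivity idea behind this line of work can be sketched generically (my own illustration of the general technique, not these papers' algorithms): once the crowd asserts that a and b refer to the same entity and that b and c do too, the match a~c can be deduced with a union-find structure instead of asking the crowd again.

```python
# Union-find over crowd-confirmed matches: transitive matches are deduced
# for free, saving crowd questions.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind()
uf.union("iPhone 5", "iphone5")        # crowd-confirmed match
uf.union("iphone5", "Apple iPhone 5")  # crowd-confirmed match
print(uf.same("iPhone 5", "Apple iPhone 5"))  # True -- no crowd question needed
```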
To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to support training complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the o...