Conference PaperPDF Available

Starfish: A Self-tuning System for Big Data Analytics


Abstract and Figures

Timely and cost-effective analytics over “Big Data ” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces—is a popular choice for big data analytics. Most practitioners of big data analytics—like computational scientists, systems researchers, and business analysts—lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in payas-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges; leading us to different design choices in Starfish. 1.
Components in the Starfish architecture hour), or data-driven. Yahoo! uses data-driven workflows to generate a reconfigured preference model and an updated home-page for any user within seven minutes of a home-page click by the user. Figure 2 is a visual representation of an example workload that a data analyst may want to run on demand or periodically using Amazon Elastic MapReduce. The input data processed by this workload resides as files on Amazon S3. The final results produced by the workload are also output to S3. The input data consists of files that are collected by a personalized Web-site like . The example workload in Figure 2 consists of workflows that load the files from S3 as three datasets: Users, GeoInfo, and Clicks. The workflows process these datasets in order to generate six different results I-VI of interest to the analyst. For example, Result I in Figure 2 is a count of all users with age less than 20. For all users with age greater than 25, Result II counts the number of users per geographic region. For each workflow, one or more MapReduce jobs are generated in order to run the workflow on Amazon Elastic MapReduce or on a local Hadoop cluster. For example, no- tice from Figure 2 that a join of the Users and GeoInfo datasets is needed in order to generate Result II. This logical join operation can be processed using a single MapReduce job. The tuning challenges present at each level of workload processing led us to the Starfish architecture shown in Figure 3. Broadly, the functionality of the components in this architecture can be cate-
Content may be subject to copyright.
Starfish: A Self-tuning System for Big Data Analytics
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong,
Fatma Bilgen Cetin, Shivnath Babu
Department of Computer Science
Duke University
Timely and cost-effective analytics over “Big Data” is now a key
ingredient for success in many businesses, scientific and engineer-
ing disciplines, and government endeavors. The Hadoop software
stack—which consists of an extensible MapReduce execution en-
gine, pluggable distributed storage engines, and a range of proce-
dural to declarative interfaces—is a popular choice for big data ana-
lytics. Most practitioners of big data analytics—like computational
scientists, systems researchers, and business analysts—lack the ex-
pertise to tune the system to get good performance. Unfortunately,
Hadoop’s performance out of the box leaves much to be desired,
leading to suboptimal use of resources, time, and money (in pay-
as-you-go clouds). We introduce Starfish, a self-tuning system for
big data analytics. Starfish builds on Hadoop while adapting to user
needs and system workloads to provide good performance automat-
ically, withoutany need for users to understand and manipulate the
many tuning knobs in Hadoop. While Starfish’s system architecture
is guided by work on self-tuning database systems, we discuss how
new analysis practices over big data pose new challenges; leading
us to different design choices in Starfish.
Timely and cost-effective analytics over “Big Data” has emerged
as a key ingredient for success in many businesses, scientific and
engineering disciplines, and government endeavors [6]. Web search
engines and social networks capture and analyze every user action
on their sites to improve site design, spam and fraud detection,
and advertising opportunities. Powerful telescopes in astronomy,
genome sequencers in biology, and particle accelerators in physics
are putting massive amounts of data into the hands of scientists.
Key scientific breakthroughs are expected to come from compu-
tational analysis of such data. Many basic and applied science
disciplines now have computational subareas, e.g., computational
biology, computational economics, and computational journalism.
Cohen et al. recently coined the acronym MAD—for Magnetism,
Agility, and Depth—to express the features that users expect from
a system for big data analytics [6].
Magnetism: A magnetic system attracts all sources of data ir-
This article is published under a Creative Commons Attribution License
(, which permits distribution
and reproduction in any medium as well allowing derivative works, pro-
vided that you attribute the original work to the author(s) and CIDR 2011.
5th Biennial Conference on Innovative Data Systems Research (CIDR ’11)
January 9-12, 2011, Asilomar, California, USA.
respective of issues like possible presence of outliers, unknown
schema or lack of structure, and missing values that keep many
useful data sources out of conventional data warehouses.
Agility: An agile system adapts in sync with rapid data evolution.
Depth: A deep system supports analytics needs that go far be-
yond conventional rollups and drilldowns tocomplex statistical and
machine-learning analysis.
Hadoop is a MAD system that is becoming popular for big data
analytics. An entire ecosystem of tools is being developed around
Hadoop. Figure 1 shows a summary of the Hadoop software stack
in wide use today. Hadoop itself has two primary components: a
MapReduce execution engine and a distributed filesystem. While
the Hadoop Distributed FileSystem (HDFS)isused predominantly
as the distributed filesystem in Hadoop, other filesystems like Ama-
zon S3 are also supported. Analytics with Hadoop involves loading
data as files into the distributed filesystem, and then running paral-
lel MapReduce computations on the data.
A combination of factors contributes to Hadoop’s MADness.
First, copying files into the distributed filesystem is all it takes
to get data into Hadoop. Second, the MapReduce methodology
is to interpret data (lazily) at processing time, and not (eagerly)
at loading time. These two factors contribute to Hadoop’s mag-
netism and agility. Third, MapReduce computations in Hadoop can
be expressed directly in general-purpose programming languages
like Java or Python, domain-specific languages like R, or generated
automatically from SQL-like declarative languages like HiveQL
and Pig Latin. This coverage of the language spectrum makes
Hadoop well suited for deep analytics. Finally, an unheralded as-
pect of Hadoop is its extensibility, i.e., the ease with which many of
Hadoop’s core components like the scheduler, storage subsystem,
input/output data formats, data partitioner, compression algorithms,
caching layer, and monitoring can be customized or replaced.
Getting desired performance from a MAD system can be a non-
trivial exercise. The practitioners of big data analytics like data
analysts, computational scientists, and systems researchers usually
lack the expertise to tune system internals. Such users would rather
use a system that can tune itself and provide good performance au-
tomatically. Unfortunately, the same properties that make Hadoop
MAD pose new challenges in the path to self-tuning:
Data opacity until processing: The magnetism and agility
that comes with interpreting data only at processing time
poses the difficulty that even the schema may be unknown
until the point when an analysis job has to be run on the data.
File-based processing: Input data for a MapReduce job may
be stored as few large files, millions of small files, or any-
thing in between. Such uncontrolled data layouts are a marked
contrast to the carefully-planned layouts in database systems.
Figure 1: Starfish in the Hadoop ecosystem
Heavy use of programming languages: A sizable fraction
of MapReduce programs will continue to be written in pro-
gramming languages like Java for performance reasons, or
in languages like Python or Rthat a user is most comfortable
with while prototyping new analysis tasks.
Traditional data warehouses are kept nonMAD by its administra-
tors because it is easierto meet performance requirements in tightly
controlled environments; a luxury we cannot afford any more [6].
To further complicate matters, three more features in addition to
MAD are becoming important in analytics systems: Data-lifecycle-
awareness,Elasticity, and Robustness. A system with all six fea-
tures would be MADDER than current analytics systems.
Data-lifecycle-awareness: A data-lifecycle-aware system goes be-
yond query execution to optimize the movement, storage, and pro-
cessing of big data during its entire lifecycle. The intelligence em-
bedded in many Web sites like LinkedIn and Yahoo!—e.g., recom-
mendation of new friends or news articles of potential interest, se-
lection and placement of advertisements—is driven by computation-
intensive analytics. A number of companies today use Hadoop for
such analytics [12]. The input data for the analytics comes from
dozens of different sources on user-facing systems like key-value
stores, databases, and logging services (Figure 1). The data has to
be moved for processing to the analytics system. After processing,
the results are loaded back in near real-time to user-facing systems.
Terabytes of data may go through this cycle per day [12]. In such
settings, data-lifecycle-awareness is needed to: (i) eliminate indis-
criminate data copying that causes bloated storage needs (as high as
20x if multiple departments in the company make their own copy
of the data for analysis [17]); and (ii) reduce resource overheads
and realize performance gains due to reuse of intermediate data or
learned metadata in workflows that are part of the cycle [8].
Elasticity: An elastic system adjusts its resource usage and oper-
ational costs to the workload and user requirements. Services like
Amazon Elastic MapReduce have created a market for pay-as-you-
go analytics hosted on the cloud. Elastic MapReduce provisions
and releases Hadoop clusters on demand, sparing users the hassle
of cluster setup and maintenance.
Robustness: A robust system continues to provide service, possi-
bly with graceful degradation, in the face of undesired events like
hardware failures, software bugs [12], and data corruption.
1.1 Starfish: MADDERandSelf-TuningHadoop
Hadoop has the core mechanisms to be MADDER than exist-
ing analytics systems. However, the use of most of these mecha-
nisms has to be managed manually. Take elasticity as an example.
Hadoop supports dynamic node addition as well as decommission-
ing of failed or surplus nodes. However, these mechanisms do not
magically make Hadoop elastic because of the lack of control mod-
ules to decide (a) when to add new nodes or to drop surplus nodes,
and (b) when and howto rebalance the data layout in this process.
Starfish is a MADDER and self-tuning system for analytics on
big data. An important design decision we made is to build Starfish
on the Hadoop stack as shown in Figure 1. (That is not to say that
Starfish uses Hadoop as is.) Hadoop, as observed earlier, has useful
primitives to help meet thenew requirements of big data analytics.
In addition, Hadoop’s adoption by academic, government, and in-
dustrial organizations is growing at a fast pace.
A number of ongoing projects aim to improve Hadoop’s peak
performance, especially to match the query performance of parallel
database systems [1, 7, 10]. Starfish has a different goal. The peak
performance a manually-tuned system can achieve is not our pri-
mary concern, especially if this performance is for one of the many
phases in the data lifecycle. Regular users may rarely see perfor-
mance close to this peak. Starfish’sgoal is to enable Hadoop users
and applications to get good performance automatically through-
out the data lifecycle in analytics; without any need on their part to
understand and manipulate the many tuning knobs available.
Section 2 gives an overview of Starfish while Sections 3–5 de-
scribe its components. The primary focus of this paper is on using
experimental results to illustrate the challenges in each component
and to motivate Starfish’s solution approach.
The workload that a Hadoop deployment runs can be considered
at different levels. At the lowest level, Hadoop runs MapReduce
jobs. A job can be generated directly from a program written in
a programming language like Java or Python, or generated from a
query in a higher-level language like HiveQL or Pig Latin [15], or
submitted as part of a MapReduce job workflow by systems like
Azkaban, Cascading, Elastic MapReduce, and Oozie. The execu-
tion plan generated for a HiveQL or Pig Latin query is usually a
workflow. Workflows may be ad-hoc, time-driven (e.g., run every
Figure 2: Example analytics workload to be run on Amazon Elastic MapReduce
Figure 3: Components in the Starfish architecture
hour), or data-driven. Yahoo! uses data-driven workflows to gener-
ate a reconfigured preference model and anupdated home-page for
any user within seven minutes of a home-page click by the user.
Figure 2 is a visual representation of an example workload that a
data analyst may want to run on demand or periodically using Ama-
zon Elastic MapReduce. The input data processed by this workload
resides as files on Amazon S3. The final results produced by the
workload are also output to S3. The input data consists offiles that
are collected by a personalized Web-site like
The example workload in Figure 2 consists of workflows that
load the files from S3 as three datasets: Users, GeoInfo, and Clicks.
The workflows process these datasets in order to generate six dif-
ferent results I-VI of interest to the analyst. For example, Result
I in Figure 2 is a count of all users with age less than 20. For all
users with age greater than 25, Result II counts the number of users
per geographic region. For each workow, one or more MapRe-
duce jobs are generated in order to run the workflow on Amazon
Elastic MapReduce or on a local Hadoop cluster. For example, no-
tice from Figure 2 that a join of the Users and GeoInfo datasets is
needed in order to generate Result II. This logical join operation
can be processed using a single MapReduce job.
The tuning challenges present at each level of workload process-
ing led us to the Starfish architecture shown in Figure 3. Broadly,
the functionality of the components in this architecture can be cate-
gorized into job-level tuning, workflow-level tuning, and workload-
level tuning. These components interact to provide Starfish’s self-
tuning capabilities.
2.1 Job-level Tuning
The behavior of a MapReduce job in Hadoop is controlled by
the settings of more than 190 configuration parameters. If the user
does not specify parameter settings during job submission, then de-
fault values—shipped with the system or specified by the system
administrator—are used. Good settings for these parameters de-
pend on job, data, and cluster characteristics. While only a frac-
tion of the parameters can have significant performance impact,
browsing through the Hadoop, Hive, and Pig mailing lists reveals
that users often run into performance problems caused by lack of
knowledge of these parameters.
Consider a user who wants to perform a join of data in the files
users.txt and geoinfo.txt, and writes the Pig Latin script:
Users = Load ‘users.txt’ as (username: chararray,
age: int, ipaddr: chararray)
GeoInfo = Load ‘geoinfo.txt’ as (ipaddr: chararray,
region: chararray)
Result = Join Users by ipaddr, GeoInfo by ipaddr
The schema as well as properties of the data in the files could have
been unknown so far. The system now has to quickly choose the
join execution technique—given the limited information available
so far, and from among 10+ ways to execute joins in Starfish—as
well as the corresponding settings of job configuration parameters.
Starfish’s Just-in-Time Optimizer addresses unique optimization
problems like those above to automatically select efficient execu-
tion techniques for MapReduce jobs. “Just-in-time” captures the
online nature of decisions forced on the optimizer by Hadoop’s
MADDER features. The optimizer takes the help of the Profiler
and the Sampler. The Profiler uses a technique called dynamic in-
strumentation to learn performance models, called job profiles, for
unmodified MapReduce programs written in languages like Java
and Python. The Sampler collects statistics efficiently about the
input, intermediate, and output key-value spaces of a MapReduce
job. A unique feature of the Sampler is that it can sample the exe-
cution of a MapReduce job in order to enable the Profiler to collect
approximate job profiles at a fraction of the full job execution cost.
2.2 Workflow-level Tuning
Workflow execution brings out some critical and unanticipated
interactions between the MapReduce task scheduler and the un-
derlying distributed filesystem. Significant performance gains are
realized in parallel task scheduling by moving the computation to
the data. By implication, the data layout across nodes in the clus-
ter constrains how tasks can be scheduled in a “data-local” fash-
ion. Distributed filesystems have their own policies on how data
written to them is laid out. HDFS, for example, always writes the
first replica of any block on the same node where the writer (in
this case, a map or reduce task) runs. This interaction between
data-local scheduling and the distributed filesystem’s block place-
ment policies can lead to an unbalanced data layout across nodes in
the cluster during workflow execution; causing severe performance
degradation as we will show in Section 4.
Efficient scheduling of a Hadoop workflow is further compli-
cated by concerns like (a) avoiding cascading reexecution under
node failure or data corruption [11], (b) ensuring power propor-
tional computing, and (c) adapting to imbalance in load or cost of
energy across geographic regions and time at the datacenter level
[16]. Starfish’s Workflow-aware Scheduler addresses such concerns
in conjunction with the What-if Engine and the Data Manager.
This scheduler communicates with, but operates outside, Hadoop’s
internal task scheduler.
2.3 Workload-level Tuning
Enterprises struggle with higher-level optimization and provi-
sioning questions for Hadoop workloads. Given a workload con-
sisting of a collection of workflows (like Figure 2), Starfish’s Work-
load Optimizer generates an equivalent, but optimized, collection
of workflows that are handed off to the Workflow-aware Scheduler
for execution. Three important categories of optimization opportu-
nities exist at the workload level:
A. Data-flow sharing, where a single MapReduce job performs
computations for multiple and potentially different logical
nodes belonging to the same or different workflows.
B. Materialization, where intermediate data in a workflow is
stored for later reuse in the same or different workflows.
Effective use of materialization has to consider the cost of
materialization (both in terms of I/O overhead and storage
consumption [8]) and its potential to avoid cascading reexe-
cution of tasks under node failure or data corruption [11].
C. Reorganization, where new data layouts (e.g., with partition-
ing) and storage engines (e.g., key-value stores like HBase
and databases like column-stores [1]) are chosen automat-
ically and transparently to store intermediate data so that
downstream jobs in the same or different workflows can be
executed very efficiently.
While categories A, B, and C are well understood in isolation, ap-
plying them in an integrated manner to optimize MapReduce work-
loads poses new challenges. First, the data output from map tasks
and input to reduce tasks in a job is always materialized in Hadoop
in order to enable robustness to failures. This data—which today is
simply deleted after the job completes—is key-value-based, sorted
on the key, and partitioned using externally-specified partitioning
functions. This unique form of intermediate data is available almost
for free, bringing new dimensions to questions on materialization
and reorganization. Second, choices for A, B, and C potentially in-
teract among each other and with scheduling, data layout policies,
as well as job configuration parameter settings. The optimizer has
to be aware of such interactions.
Hadoop provisioning deals with choices like the number of nodes,
node configuration, and network configuration to meet given work-
load requirements. Historically, such choices arose infrequently
and were dealt with by system administrators. Today, users who
provision Hadoop clusters on demand using services like Ama-
zon Elastic MapReduce and Hadoop On Demand are required to
make provisioning decisions on their own. Starfish’s Elastisizer
automates such decisions. The intelligence in the Elastisizer comes
from a search strategy in combination with the What-if Engine that
uses a mix of simulation and model-based estimation to answer
what-if questions regarding workload performance on a specified
cluster configuration. In the longer term, we aim to automate provi-
sioning decisions at the level of multiple virtual and elastic Hadoop
clusters hosted on a single shared Hadoop cluster to enable Hadoop
Analytics as a Service.
2.4 Lastword: Starfish’s Language for Work-
loads and Data
As described in Section 1.1 and illustrated in Figure 1, Starfish
is built on the Hadoop stack. Starfish interposes itself between
Hadoop and its clients like Pig, Hive, Oozie, and command-line
interfaces to submit MapReduce jobs. These Hadoop clients will
now submit workloads—which can vary from a single MapReduce
job, to a workflow of MapReduce jobs, and to a collection of mul-
tiple workflows—expressed in Lastword1to Starfish. Lastword is
Starfish’s language to accept as well as to reason about analytics
Unlike languages like HiveQL, Pig Latin, or Java, Lastword is
not a language that humans will have to interface with directly.
Higher-level languages like HiveQL and Pig Latin were developed
to support a diverse user community—ranging from marketing an-
alysts and sales managers to scientists, statisticians, and systems
researchers—depending on their unique analytical needs and pref-
erences. Starfish provides language translators to automatically
convert workloads specified in these higher-level languages to Last-
word. A common language like Lastword allows Starfish to exploit
optimization opportunities among the different workloads that run
on the same Hadoop cluster.
A Starfish client submits a workload as a collection of work-
flows expressed in Lastword. Three types of workflows can be rep-
resented in Lastword: (a) physical workflows, which are directed
graphs2where each node is a MapReduce job representation; (b)
logical workflows, which are directed graphs where each node is a
logical specification such as a select-project-join-aggregate (SPJA)
or a user-defined function forperforming operations like partition-
ing, filtering, aggregation, and transformations; and (c) hybrid work-
flows, where a node can be of either type.
An important feature of Lastword is its support for expressing
metadata along with the tasks for execution. Workflows specified
in Lastword can be annotated with metadata at the workflow level
or at the node level. Such metadata is either extracted from in-
puts provided by users or applications, or learned automatically by
Starfish. Examples of metadata include scheduling directives (e.g.,
whether the workflow is ad-hoc, time-driven, or data-driven), data
properties (e.g., full or partial schema, samples, and histograms),
data layouts (e.g., partitioning, ordering, and collocation), and run-
time monitoring information (e.g., execution profiles of map and
reduce tasks in a job).
The Lastword language gives Starfish another unique advantage.
Note that Starfish is primarily a system for running analytics work-
1Language for Starfish Workloads and Data.
2Cycles may be needed to support loops or iterative computations.
WordCount TeraSort
Rules of Based on Rules of Based on
Thumb Job Profile Thumb Job Profile
io.sort.spill.percent 0.80 0.80 0.80 0.80
io.sort.record.percent 0.50 0.05 0.15 0.15
io.sort.mb 200 50 200 200
io.sort.factor 10 10 10 100
mapred.reduce.tasks 27 2 27 400
Running Time (sec) 785 407 891 606
Table 1: Parameter settings from rules of thumb and recom-
mendations from job profiles for WordCount and TeraSort
loads on big data. At the same time, we want Starfish to be usable in
environments where workloads are run directlyon Hadoop without
going through Starfish. Lastword enables Starfish to be used as a
recommendation engine in these environments. The full or partial
Hadoop workload from such an environment can be expressed in
Lastword—we will provide tools to automate this step—and then
input to Starfish which is run in a special recommendation mode.
In this mode, Starfish uses its tuning features to recommend good
configurations at the job, workflow, and workload levels; instead of
running the workload with these congurations as Starfish would
do in its normal usage mode.
The response surfaces in Figure 4 show the impact of various
job configuration parameter settings on the running time of two
MapReduce programs in Hadoop. We use WordCount and TeraSort
which are simple, yet very representative, MapReduce programs.
The default experimental setup used in this paper is a single-rack
Hadoop cluster running on 16 Amazon EC2 nodes of the c1.medium
type. Each node runs at most 3 map tasks and 2 reduce tasks con-
currently. WordCount processes 30GB of data generated using the
RandomTextWriter program in Hadoop. TeraSort processes 50GB
of data generated using Hadoop’s TeraGen program.
Rules of Thumb for Parameter Tuning: The job configuration
parameters varied in Figure 4 are io.sort.mb,io.sort.record.percent,
and mapred.reduce.tasks. All other parameters are kept constant.
Table 1 shows the settings of various parameters for the two jobs
based on popular rules of thumb used today [5, 13]. For exam-
ple, the rules of thumb recommend setting mapred.reduce.tasks
(the number of reduce tasks in the job) to roughly 0.9times the
total number of reduce slots in the cluster. The rationale is to en-
sure that all reduce tasks run in one wave while leaving some slots
free for reexecuting failed or slow tasks. A more complex rule
of thumb sets io.sort.record.percent to 16
16+avg r ecord size based on
the average size of map output records. The rationale here involves
source-code details of Hadoop.
Figure 4 shows that the rule-of-thumb settings gave poor perfor-
mance. In fact, the rule-of-thumb settings for WordCount gave one
of its worst execution times: io.sort.mb and io.sort.record.percent
were set too high. The interaction between these two parameters
was very different and more complex forTeraSort as shown inFig-
ure 4(b). A higher setting for io.sort.mb leads to better performance
for certain settings of the io.sort.record.percent parameter, but hurts
performance for other settings. The complexity of the surfaces and
the failure of rules of thumb highlight the challenges a user faces
if asked to tune the parameters herself. Starfish’s job-level tuning
components—Profiler, Sampler, What-if Engine, and Just-in-Time
Optimizer—help automate this process.
Profiling Using Dynamic Instrumentation: The Profiler uses dy-
namic instrumentation to collect run-time monitoring information
from unmodified MapReduce programs running on Hadoop. Dy-
namic instrumentation has become hugely popular over the last few
years to understand, debug, and optimize complex systems [4]. The
dynamic nature means that there is zero overhead when instrumen-
tation is turned off; an appealing property in production deploy-
ments. The current implementation of the Profiler uses BTrace [2],
a safe and dynamic tracing tool for the Java platform.
When Hadoop runs a MapReduce job, the Starfish Profiler dy-
namically instruments selected Java classes inHadoop to construct
ajob profile. A profile is a concise representation of the job exe-
cution that captures information both at the task and subtask lev-
els. The execution of a MapReduce job is broken down into the
Map Phase and the Reduce Phase. Subsequently, the Map Phase is
divided into the Reading,Map Processing,Spilling, and Merging
subphases. The Reduce Phase is divided into the Shuffling,Sorting,
Reduce Processing, and Writing subphases. Each subphase repre-
sents an important part of the job’s overall execution in Hadoop.
The job profile exposes three views that capture various aspects
of the job’s execution:
1. Timings view: This view gives the breakdown of how wall-
clock time was spent in the various subphases. For exam-
ple, a map task spends time reading input data, running the
user-defined map function, and sorting, spilling, and merging
map-output data.
2. Data-flow view: This view gives the amount of data pro-
cessed in terms of bytes and number of records during the
various subphases.
3. Resource-level view: This view captures the usage trends of
CPU, memory, I/O, and network resources during the vari-
ous subphases of the job’s execution. Usage of CPU, I/O,
and network resources are captured respectively in terms of
the time spent using these resources per byte and per record
processed. Memory usage is captured in terms of the mem-
ory used by tasks as they run in Hadoop.
Wewill illustrate the benefits of job profiles and the insights gained
from them through a real example. Figure 5 shows the Timings
view from the profiles collected for the two configuration parame-
ter settings for WordCount shown in Table 1. We will denote the
execution of WordCount using the “Rules of Thumb” settings from
Table 1 as Job A; and theexecution of WordCount using the “Based
on Job Profile” settings as Job B. Note that the same WordCount
MapReduce program processing the same input dataset is being run
in either case. The WordCount program uses a Combiner to per-
form reduce-style aggregation on the map task side for each spill
of the map task’s output. Table 1 shows that Job Bruns 2x faster
than Job A.
Our first observation from Figure 5 is that the map tasks in Job B
completed on average much faster compared to the map tasks in Job
A; yet the reverse happened to the reduce tasks. Further exploration
of the Data-flow and Resource views showed that the Combiner
in Job Awas processing an extremely large number of records,
causing high CPU contention. Hence, all the CPU-intensive op-
erations in Job A’s map tasks (executing the user-provided map
function, serializing and sorting the map output) were negatively
affected. Compared to Job A, the lower settings for io.sort.mb
and io.sort.record.percent in Job Bled to more, but individually
smaller, map-side spills. Because the Combiner is invoked on these
individually smaller map-side spills in Job B, the Combiner caused
far less CPU contention in Job Bcompared to Job A.
On the other hand, the Combiner drastically decreases the amount
of intermediate data that is spilled to disk as well as transferred over
the network (shuffled) from map to reduce tasks. Since the map
50 100 150 200
WordCount in Hadoop
Running Time (sec)
Rules of
based on
Job Profile
50 100 150 200
TeraSort in Hadoop
Running Time (sec)
400 00.1 0.2 0.3 0.4 0.5
TeraSort in Hadoop
Running Time (sec)
Rules of
based on
Job Profile
Figure 4: Response surfaces of MapReduce programs in Hadoop: (a) WordCount, with io.sort.mb [50,200] and
io.sort.record.percent [0.05,0.5] (b) TeraSort, with io.sort.mb [50,200] and io.sort.record.percent [0.05,0.5] (c) TeraSort,
with io.sort.record.percent [0.05,0.5] and mapred.reduce.tasks [27,400]
Figure 5: Map and reduce time breakdown for two WordCount
jobs run with different settings of job configuration parameters
tasks in Job Bprocessed smaller spills, the data reduction gains
from the Combiner were also smaller; leading to larger amounts
of data being shuffled and processed by the reducers. However,
the additional local I/O and network transfer costs in Job Bwere
dwarfed by the reduction in CPU costs.
Effectively, the more balanced usage of CPU, I/O, and network
resources in the map tasks of Job Bimproved the overall perfor-
mance of the map tasks significantly compared to Job A. Overall,
the benefit gained by the map tasks in Job Boutweighed by far the
loss incurred by the reduce tasks; leading to the 2x better perfor-
mance of Job Bcompared to the performance of Job A.
Predicting Job Performance in Hadoop: The job profile helps in
understanding the job behavior as well as in diagnosing bottlenecks
during job execution for the parameter settings used. More impor-
tantly, given a new setting Sof the configuration parameters, the
What-if Engine can use the job profile and a set of models that we
developed to estimate the new profile if the job were to be run using
S. This what-if capability is utilized by the Just-in-Time Optimizer
in order to recommend good parameter settings.
The What-if Engine is given four inputs when asked to predict
the performance of a MapReduce job J:
1. The job profile generated for Jby the Profiler. The profile
may be available from a previous execution of J. Otherwise,
the Profiler can work in conjunction with Starfish’s Sampler
to generate an approximate job profile efficiently. Figure 6
considers approximate job profiles later in this section.
2. The new setting Sof the job configuration parameters using
which Job Jwill be run.
3. The size, layout, and compression information of the input
dataset on which Job Jwill be run. Note that this input
dataset can be different from the dataset used while gener-
ating the job profile.
4. The cluster setup and resource allocation that will be used to
run Job J. This information includes the number of nodes
and network topology of the cluster, the number of map and
reduce task slots per node, and the memory available for each
task execution.
The What-if Engine uses a set of performance models for predict-
ing (a) the flow of data going through each subphase in the job’s
execution, and (b) the time spent in each subphase. The What-if En-
gine then produces a virtual job profile by combining the predicted
information in accordance with the cluster setup and resource allo-
cation that will be used to run the job. The virtual job profile con-
tains the predicted Timings and Data-flow views of the job when
run with the new parameter settings. The purpose of the virtual
profile is to provide the user with more insights on how the job will
behave when using the new parameter settings, as well as to ex-
pand the use of the What-if Engine towards answering hypothetical
questions at the workflowand workload levels.
Towards Cost-Based Optimization: Table 1 shows the parameter
settings for WordCount and TeraSort recommended by an initial
implementation of the Just-in-Time Optimizer. The What-if En-
gine used the respective job profiles collected from running the jobs
using the rules-of-thumb settings. WordCount runs almost twice
as fast at the recommended setting. As we saw earlier, while the
Combiner reduced the amount of intermediate data drastically, it
was making the map execution heavily CPU-bound and slow. The
configuration setting recommended by the optimizer—with lower
io.sort.mb and io.sort.record.percent—made the map tasks signif-
icantly faster. This speedup outweighed the lowered effectiveness
of the Combiner that caused more intermediate data to be shuffled
and processed by the reduce tasks.
These experiments illustrate the usefulness of the Just-in-Time
Optimizer. One of the main challenges that we are addressing
is in developing an efficient strategy to search through the high-
dimensional space of parameter settings. A related challenge is in
generating job profiles with minimal overhead. Figure 6 shows the
tradeoff between the profiling overhead (in terms of job slowdown)
and the average relative error in the job profile views when profil-
ing is limited to a fraction of the tasks in WordCount. The results
are promising but show room for improvement.
Cause and Effect of Unbalanced Data Layouts: Section 2.2 men-
tioned how interactions between the taskscheduler and the policies
employed by the distributed filesystem can lead to unbalanced data
layouts. Figure 7 shows how even the execution of a single large
Figure 6: (a) Relative job slowdown, and (b) relative error in
the approximate views generated as the percentage of profiled
tasks in a job is varied
Figure 7: Unbalanced data layout
job can cause an unbalanced layout in Hadoop. We ran a partition-
ing MapReduce job (similar to “Partition by age” shown in Figure
2) that partitions a 100GB TPC-H Lineitem table into four parti-
tions relevant to downstream workflow nodes. The data properties
are such that one partition is much larger than the others. All the
partitions are replicated once as done by default for intermediate
workflow data in systems like Pig [11]. HDFS ends up placing all
blocks for the large partition on the node (Datanode 14) where the
reduce task generating this partition runs.
A number of other causes can lead to unbalanced data layouts
rapidly or over time: (a) skewed data, (b) scheduling of tasks in
a data-layout-unaware manner as done by the Hadoop schedulers
available today, and (c) addition or dropping of nodes without run-
ning costly data rebalancing operations. (HDFS does not automat-
ically move existing data when new nodes are added.) Unbalanced
data layouts are a serious problem in big data analytics because
they are prominent causes of task failure (due to insufficient free
disk space for intermediate map outputs or reduce inputs) and per-
formance degradation. We observed a more than 2x slowdown for
a sort job running on the unbalanced layout in Figure 7 compared
to a balanced layout.
Unbalanced data layouts cause a dilemma for data-locality-aware
schedulers (i.e., schedulers that aim to move computation to the
data). Exploiting data locality can have two undesirable conse-
quences in this context: performance degradation due to reduced
parallelism, and worse, making the data layout further unbalanced
because new outputs will go to the over-utilized nodes. Figure 7
also shows how running a map-only aggregation on the large par-
tition leads to the aggregation output being written to the over-
utilized Datanode 14. The aggregation output was small. A larger
output could have made the imbalance much worse. On the other
hand, non-data-local scheduling (i.e., moving data to the computa-
tion) incurs the overhead of data movement. A useful new feature
in Hadoop will be to piggyback on such data movements to rebal-
ance the data layout.
Figure 8: Sort running time on the partitions
Figure 9: Partition creation time
Weran the same partitioning job with a replication factor of two
for the partitions. For our single-rack cluster, HDFS places the
second replica of each block of the partitions on a randomly-chosen
node. The overall layout is still unbalanced, but the time to sort
the partitions improved significantly because the second copy of
the data is spread out over the cluster (Figure 8). Interestingly, as
shown in Figure 9, theoverhead of creating a second replica is very
small on our cluster (which will change if the network becomes the
bottleneck [11]).
Aside from ensuring that the data layout is balanced, other choices
are available such as collocating two or more datasets. Consider a
workflow consisting of three jobs. The first two jobs partition two
separate datasets Rand S(e.g., Users and GeoInfo from Figure
2) using the same partitioning function into npartitions each. The
third job, whose input consists of the outputs of the first two jobs,
performs an equi-join of the respective partitions from Rand S.
HDFS does not provide the ability to collocate the joining parti-
tions from Rand S; so a join job run in Hadoop will have to do
non-data-local reads for one of its inputs.
We implemented a new block placement policy in HDFS that
enables collocation of two or more datasets. (As an excellent ex-
ample of Hadoop’s extensibility, HDFS provides a pluggable inter-
face that simplifies the task of implementing new block placement
policies [9].) Figure 10 shows how the new policy gives a 22%
improvement in the running time of a partition-wise join job by
collocating the joining partitions.
Experimental results like those above motivate the need for a
Workflow-aware Scheduler that can run jobs in a workflow such
that the overall performance of the workow is optimized. Work-
flow performance can be measured in terms of running time, re-
source utilization in the Hadoop cluster, and robustness to failures
(e.g., minimizing the need for cascading reexecution of tasks due to
node failure or data corruption) and transient issues (e.g., reacting
to the slowdown of a node due to temporary resource contention).
As illustrated by Figures 7–10, good layouts of the initial (base),
intermediate (temporary), and final (results) data in a workflow are
vital to ensure good workflow performance.
Workflow-aware Scheduling: A Workflow-aware Scheduler can
ensure that job-level optimization and scheduling policies are co-
Figure 10: Respective execution times of a partition-wise join
job with noncollocated and collocated input partitions
ordinated tightly with the policies for data placement employed
by the underlying distributed filesystem. Rather than making de-
cisions that are locally optimal for individual MapReduce jobs,
Starfish’s Workflow-aware Scheduler makes decisions by consider-
ing producer-consumer relationships among jobs in the workflow.
Figure 11 gives an example of producer-consumer relationships
among three Jobs P,C1, and C2in a workflow. Analyzing these
relationships gives important information such as:
What parts of the data output by a job are used by down-
stream jobs in the workflow? Notice from Figure 11 that the
three writer tasks of Job Pgenerate files File1,File2, and
File3 respectively. (In a MapReduce job, the writer tasks are
map tasks in a map-only job, and reduce tasks otherwise.)
Each file is stored as blocks in the distributed filesystem.
(HDFS blocks are 64MB in size by default.) File1 forms the
input to Job C1, while File1 and File2 form the input to Job
C2. Since File3 is not used by any of the downstream jobs,
a Workflow-aware Scheduler can configure Job Pto avoid
generating File3.
What is the unit of data-level parallelism in each job that
reads the data output by a job? Notice from Figure 11 that
the data-parallel reader tasks of Job C1read and process one
data block each. However, the data-parallel reader tasks of
Job C2read one file each. (In a MapReduce job in a work-
flow, the data-parallel map tasks of the job read the output
of upstream jobs in the workflow.) While not shown in Fig-
ure 11, jobs like the join in Figure 10 consist of data-parallel
tasks that each read a group of files output by upstream jobs
in the workflow. Information about the data-parallel access
patterns of jobs is vital to guarantee good data layouts that,
in turn, will guarantee an efficient mix of parallel and data-
local computation. For File2 in Figure 11, all blocks in the
file should be placed on the same node to ensure data-local
computation (i.e., to avoid having to move data to the compu-
tation). The choice for File1, which is read by both Jobs C1
and C2, is not so easy to make. The data-level parallelism is
at the block-level in Job C1, but at the file-level in Job C2.
Thus, the optimal layout of File1 from Job C1’s perspective
is to spread File1s blocks across the nodes so that C1s map
tasks can run in parallel across the cluster. However, the op-
timal layout of File1 from Job C2s perspective is to place
all blocks on the same node.
Starfish’s Workflow-aware Scheduler works in conjunction with the
What-if Engine and the Just-in-Time Optimizer in order to pick the
job execution schedule as well as the data layouts for a workflow.
The space of choices for data layout includes:
1. What block placement policy to use in the distributed filesys-
tem for the output file of a job? HDFS uses the Local Write
Figure 11: Part of an example workflow showing producer-
consumer relationships among jobs
block placement policy which works as follows: the first
replica of any block is stored on the same node where the
block’s writer (a map or reduce task) runs. We have imple-
mented a new Round Robin block placement policy in HDFS
where the blocks written are stored on the nodes of the dis-
tributed filesystem in a round robin fashion.
2. How many replicas to store—called the replication factor
for the blocks of a file? Replication helps improve perfor-
mance for heavily-accessed files. Replication also improves
robustness by reducing performance variability in case of
node failures.
3. What size to use for blocks of a file? For a very big file, a
block size larger than the default of 64MB can improve per-
formance significantly by reducing the number of map tasks
needed to process the file. The caveat is that the choice of the
block size interacts with the choice of job-level configuration
parameters like io.sort.mb (recall Section 3).
4. Should a job’s output files be compressed for storage? Like
the use of Combiners (recall Section 3), the use of compres-
sion enables the cost of local I/O and network transfers to be
traded for additional CPU cost. Compression is not always
beneficial. Furthermore, like the choice of the block size, the
usefulness of compression depends onthe choice of job-level
The Workflow-aware Scheduler performs a cost-based search for a
good layout for the output data of each job in a given workflow.
The technique we employ here asks a number of questions to the
What-if Engine; and uses the answers to infer the costs and benefits
of various choices. The what-if questions asked for a workow
consisting of the producer-consumer relationships among Jobs P,
C1, and C2shown in Figure 11 include:
(a) What is the expected running time of Job Pif the Round
Robin block placement policy is used for P’s output files?
(b) What will the new data layout in the cluster be if the Round
Robin block placement policy is used for P’s output files?
(c) What is the expected running time of Job C1(C2) if its input
data layout is the one in the answer to Question (b)?
Figure 12: Respective running times of (a) a partition job and
(b) a two-job workflow with the (default) Local Write and the
Round Robin block placement policies used in HDFS
(d) What are the expected running times of Jobs C1and C2if
they are scheduled concurrently when JobPcompletes?
(e) Given the Local Write block placement policy and a repli-
cation factor of 1 for Job P’s output, what is the expected
increase in the running time of Job C1if one node in the
cluster were to fail during C1’s execution?
These questions are answered by the What-if Engine based on a
simulation of the main aspects of workflow execution. This step in-
volves simulating MapReduce job execution, task scheduling, and
HDFS block placement policies. The job-level and cluster-level
information described in Section 3 is needed as input for the simu-
lation of workflow execution.
Figure 12 shows results from an experiment where the Workflow-
aware Scheduler was asked to pick the data layout for a two-job
workflow consisting of a partition job followed by a sort job. The
choice for the data layout involved selecting which block place-
ment policy to use between the (default) Local Write policy and the
Round Robin policy. The remaining choices were kept constant:
replication factor is 1, the block size is 128MB, and compression is
not used. The choice of collocation was not considered since it is
not beneficial to collocate any group of datasets in this case.
The Workflow-aware Scheduler first asks what-if questions re-
garding the partition job. The What-if Engine predicted correctly
that the Round Robin policy will perform better than the Local
Write policy for the output data of the partition job. In our cluster
setting on Amazon EC2, the local I/O within a node becomes the
bottleneck before the parallel writes of data blocks to other storage
nodes over the network. Figure 12(a) shows the actual performance
of the partition job for the two block placement policies.
The next set of what-if questions have todo with the performance
of the sort job for different layouts of the output of the partition job.
Here, using the Round Robin policy for the partition job’s output
emerges a clear winner. The reason is that the Round Robin policy
spreads the blocks over the cluster so that maximum data-level par-
allelism of sort processing can be achieved while performing data-
local computation. Overall, the Workflow-aware Scheduler picks
the Round Robin block placement policy for the entire workflow.
As seen in Figure 12(b), this choice leads to the minimum total run-
ning time of the two-job workflow. Use of the Round Robin policy
gives around 30% reduction in total running time compared to the
default Local Write policy.
Workload Optimizer: Starfish’s Workload Optimizer represents
the workload as a directed graph and applies the optimizations listed
in Section 2.3 as graph-to-graph transformations. The optimizer
Figure 13: Processing multiple SPA workflow nodes on the
same input dataset
uses the What-if Engine to do a cost-based estimation of whether
the transformation will improve performance.
Consider the workflows that produce the results IV, V, and VI
in Figure 2. These workflows have a join of Users and Clicks in
common. The results IV, V, and VI can each be represented as a
Select-Project-Aggregate (SPA) expression over the join. Starfish
has an operator, called the Jumbo operator, that can process any
number of logical SPA workflow nodes over the same table in a
single MapReduce job. (MRShare [14] and Pig [15] also support
similar operators.) Without the Jumbo operator, each SPA node
will have to be processed as a separate job. The Jumbo operator en-
ables sharing of all or some of the map-side scan and computation,
sorting and shuffling, as well as the reduce-side scan, computation,
and output generation. At the same time, the Jumbo operator can
help the scheduler to better utilize the bounded number of map and
reduce task slots in a Hadoop cluster.
Figure 13(a) shows an experimental result where three logical
SPA workflow nodes are processed on a 24GB dataset as: (a) Se-
rial, which runs three separate MapReduce jobs in sequence; (b)
Concurrent, which runs three separate MapReduce jobs concur-
rently; (c) using the Jumbo operator to share the map-side scans
in the SPA nodes; and (d) using the Jumbo operator to share the
map-side scans as well as the intermediate data produced by the
SPA nodes. Figure 13(a) shows that sharing the sorting and shuf-
fling of intermediate data, in addition to sharing scans, provides
additional performance benefits.
Now consider the workflows that produce results I, II, IV, and
V in Figure 2. These four workflows have filter conditions on
the age attribute in the Users dataset. Running a MapReduce job
to partition Users based on ranges of age values will enable the
four workflows to prune out irrelevant partitions efficiently. Figure
13(b) shows the results from applying partition pruning tothe same
three SPA nodes from Figure 13(a). Generating the partitions has
significant overhead—as seen in Figure 13(b)—but possibilities ex-
ist to hide or reduce this overhead by combining partitioning with a
previous job like data copying. Partition pruning improves the per-
formance of all MapReduce jobs in our experiment. At the same
time, partition pruning decreases the performance benefits provided
by the Jumbo operator. These simple experiments illustrate the in-
teractions among different optimization opportunities that exist for
Hadoop workloads.
Elastisizer: Users can now leverage pay-as-you-go resources on
the cloud to meet their analytics needs. Amazon Elastic MapRe-
duce allows users to instantiate a Hadoop cluster on EC2 nodes,
and run workflows. The typical workflow on Elastic MapReduce
accesses data initially from S3, does in-cluster analytics, and writes
final output back to S3 (Figure 2). The cluster can be released when
the workflow completes, and the user pays for the resources used.
While Elastic MapReduce frees users from setting up and main-
taining Hadoop clusters, the burden of cluster provisioning is still
Figure 14: Workload performance under various cluster and Hadoop configurations on Amazon Elastic MapReduce
Figure 15: Performance Vs. pay-as-you-go costs for a workload on Amazon Elastic MapReduce
on the user. Specifically, users have to specify the number and type
of EC2 nodes (from among 10+ types) as well as whether to copy
data from S3 into the in-cluster HDFS. The space of provisioning
choices is further complicated by Amazon Spot Instances which
provide a market-based option for leasing EC2 nodes. In addition,
the user has to specify the Hadoop-level as well as job-level con-
figuration parameters for the provisioned cluster.
One of the goals of Starfish’s Elastisizer is to automatically deter-
mine the best cluster and Hadoop configurations to process a given
workload subject to user-specified goals (e.g., on completion time
and monetary costs incurred). To illustrate this problem, Figure 14
shows how the performance of a workload Wconsisting of a sin-
gle workflow varies across different cluster configurations (number
and type of EC2 nodes) and corresponding Hadoop configurations
(number of concurrent map and reduce slots per node).
The user could have multiple preferences and constraints for the
workload, which poses a multi-objective optimization problem. For
example, the goal may be to minimize the monetary cost incurred to
run the workload, subject to a maximum tolerable workload com-
pletion time. Figures 15(a) and 15(b) show the running time aswell
as cost incurred on Elastic MapReduce for the workload Wfor dif-
ferent cluster configurations. Some observations from the figures:
If the user wants to minimize costs subject to a completion time
of 30 minutes, then the Elastisizer should recommend a cluster
of four m1.large EC2 nodes.
If the user wants to minimize costs, then two m1.small nodes
are best. However, the Elastisizer can suggest that by paying
just 20% more, the completion time can be reduced by 2.6x.
To estimate workload performance for various cluster configura-
tions, the Elastisizer invokes the What-if Engine which, in turn,
uses a mix of simulation and model-based estimation. As dis-
cussed in Section 4, the What-if Engine simulates the task schedul-
ing and block-placement policies over a hypothetical cluster, and
uses performance models to predict the data flow and performance
of the MapReduce jobs in the workload. The latest Hadoop release
includes a Hadoop simulator, called Mumak, that we initially at-
tempted to use in the What-if Engine. However, Mumak needs a
workload execution trace for a specific cluster size as input, and
cannot simulate workload execution for a different cluster size.
Hadoop is now a viable competitor to existing systems for big
data analytics. While Hadoop currently trails existing systems in
peak query performance, a number of research efforts are address-
ing this issue [1, 7, 10]. Starfish fills a different void by enabling
Hadoop users and applications to get good performance automat-
ically throughout the data lifecycle in analytics; without any need
on their part to understand and manipulate the many tuning knobs
available. A system like Starfish is essential as Hadoop usage con-
tinues to grow beyond companies like Facebook and Yahoo! that
have considerable expertise in Hadoop. New practitioners of big
data analytics like computational scientists and systems researchers
lack the expertise to tune Hadoop to get good performance.
Starfish’s tuning goals and solutions are related to projects like
Hive, Manimal, MRShare, Nectar, Pig, Quincy, andScope [3, 8, 14,
15, 18]. The novelty in Starfish’s approach comes from how it fo-
cuses simultaneously on different workload granularities—overall
workload, workflows, and jobs (procedural and declarative)—as
well as across various decision points—provisioning, optimization,
scheduling, and data layout. This approach enables Starfish to han-
dle the significant interactions arising among choices made at dif-
ferent levels.
Weare extremely grateful to Amazon Web Services for awarding
us a grant for the use of resources on the Amazon Elastic Compute
Cloud (EC2) and Elastic MapReduce.
[1] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and
A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce
and DBMS Technologies for Analytical Workloads. PVLDB, 2(1),
[2] BTrace: A Dynamic Instrumentation Tool for Java.
[3] M. J. Cafarella and C. R´e. Manimal: Relational Optimization for
Data-Intensive Programs. In WebDB, 2010.
[4] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic
Instrumentation of Production Systems. In USENIX Annual
Technical Conference, 2004.
[5] Cloudera: 7 tips for Improving MapReduce Performance.
[6] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton.
MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2),
[7] J. Dittrich, J.-A. Quian´e-Ruiz, A. Jindal, Y. Kargin, V. Setty, and
J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah.
PVLDB, 3(1), 2010.
[8] P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang.
Nectar: Automatic Management of Data and Computation in
Datacenters. In OSDI, 2010.
[9] Pluggable Block Placement Policies in HDFS.
[10] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of
MapReduce: An In-depth Study. PVLDB, 3(1), 2010.
[11] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. Making Cloud Intermediate
Data Fault-tolerant. In SoCC, 2010.
[12] TeraByte-scale Data Cycle at LinkedIn.
[13] Hadoop MapReduce Tutorial.
[14] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas.
MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB,
3(1), 2010.
[15] C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic
Optimization of Parallel Dataflow Programs. In USENIX Annual
Technical Conference, 2008.
[16] A. Qureshi, R. Weber, H. Balakrishnan, J. V. Guttag, and B. V.
Maggs. Cutting the Electric Bill for Internet-scale Systems. In
SIGCOMM, 2009.
[17] Agile Enterprise Analytics. Keynote by Oliver Ratzesberger at the
Self-Managing Database Systems Workshop 2010.
[18] J. Zhou, P.-˚
A. Larson, and R. Chaiken. Incorporating Partitioning and
Parallel Plans into the SCOPE Optimizer. In ICDE, 2010.
When a MapReduce job executes in a Hadoop cluster, a lot of in-
formation is generated including logs, counters, resource utilization
metrics, and profiling data. This information is organized, stored,
and managed by Starfish’s Metadata Manager in a catalog that can
be viewed using Starfish’s Visualizer. A user can employ the Vi-
sualizer to get a deep understanding of a job’s behavior during ex-
ecution, and to ultimately tune the job. Broadly, the functionality
of the Visualizer can be categorized into Timeline views,Data-flow
views, and Profile views.
A.1 Timeline Views
Timeline views are used to visualize the progress of a job exe-
cution at the task level. Figure 16 shows the execution timeline of
map and reduce tasks that ran during a MapReduce job execution.
The user can observe information like how many tasks were run-
ning at any point in time on each node, when each task started and
ended, or how many map or reduce waves occurred. The user is
able to quickly spot any variance in the task execution times and
discover potential load balancing issues.
Moreover, Timeline views can be used to compare different ex-
ecutions of the same job run at different times or with different
parameter settings. Comparison of timelines will show whether
the job behavior changed over time as well as help understand the
impact that changing parameter settings has on job execution. In
addition, the Timeline views support a What-if mode using which
the user can visualize what the execution of a job will be when run
using different parameter settings. For example, the user can deter-
mine the impact of decreasing the value of io.sort.mb on map task
execution. Under the hood, the Visualizer invokes the What-if En-
gine to generate a virtual job profile for the job in the hypothetical
setting (recall Section 3).
A.2 Data-flow Views
The Data-flow views enable visualization of the flow of data
among the nodes and racks of a Hadoop cluster, and between the
map and reduce tasks of a job. They form an excellent way of iden-
tifying data skew issues and realizing the need for a better parti-
tioner in a MapReduce job. Figure 17 presents the data flow among
the nodes during the execution of a MapReduce job. The thickness
of each line is proportional to the amount of data that was shuffled
between the corresponding nodes. The user also has the ability to
specify a set of filter conditions (see the left side of Figure 17) that
allows her to zoom in on a subset of nodes or on the large data
transfers. An important feature of the Visualizer is the Video mode
that allows users to play back a job execution from the past. Us-
ing the Video mode (Figure 17), the user can inspect how data was
processed and transfered between the map and reduce tasks of the
job, and among nodes and racks of the cluster, as time went by.
A.3 Profile Views
In Section 3, we saw how a job profile contains a lot of useful
information like the breakdown of task execution timings, resource
usage, and data flow per subphase. The Profile views help visualize
the job profiles, namely, the information exposed by the Timings,
Data-flow, and Resource-level views in a profile; allowing an in-
depth analysis of the task behavior during execution. For example,
Figure 5 shows parts of two Profile views that display the break-
down of time spent on average in each map and reduce task for two
WordCount job executions. Job Awas run using the parameter set-
tings as specified by rules of thumb, whereas Job Bwas run using
the settings recommended by the Just-in-time Optimizer (Table 1
in Section 3). The main difference caused by the two settings was
more, but smaller, map-side spills for Job Bcompared to Job A.
We can observe that the map tasks in Job Bcompleted on av-
erage much faster compared to the map tasks in Job A; yet the
reverse happened to the reduce tasks. The Profile views allow us to
see exactly which subphases benefit the most from the parameter
settings. It is obvious from Figure 5 that the time spent perform-
ing the map processing and the spilling in Job Bwas significantly
lower compared to Job A.
On the other hand, the Combiner drastically decreases the amount
of intermediate data spilled to disk (which can be observed in the
Data-flow views not shown here). Since the map tasks in Job B
processed smaller spills, the reduction gains from the Combiner
were also smaller; leading to larger amounts of data being shuffled
and processed by the reducers. The Profile views show exactly how
much more time was spent in Job Bfor shuffling and sorting the
intermediate data, as well as performing the reduce computation.
Overall, the benefit gained by the map tasks in Job Boutweighed
by far the loss incurred by the reduce tasks, leading to a 2x better
performance than Job A.
Figure 16: Execution timeline of the map and reduce tasks of a MapReduce job
Figure 17: Visual representation of the data-flow among the Hadoop nodes during a MapReduce job execution
... In-distribution Data slicing [53], algorithmic recourse [120] Out-of-distribution Generating adversarial samples [165], generating samples with distribution shift [125] Prompt engineering Manual prompt engineering [206], automated prompt engineering [238] Data maintenance Visual summarization [37], clustering for visualization [72], Understanding visualization recommendation [253], valuation [83] Quality assurance Quality assessment [195], quality improvement [19] Storage & retrieval Resource allocation [100], query index selection [223], query rewriting [12] • Inference data development: The objective is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs. There are three sub-goals in this effort: 1) in-distribution evaluation and 2) out-of-distribution evaluation aim to generate samples that adhere to or differ from the training data distribution, respectively, while 3) prompt enginerring tunes the prompt in language models to get the desired predictions. ...
... Various learning-based strategies have been developed for resource allocation in data processing systems. For instance, Starfish [100] proposes a profile-predict-optimize approach that generates job profiles with dataflow and cost statistics, which are then used to predict virtual job profiles for task scheduling. More recently, machine learning approaches such as OtterTune [232] have been developed to automatically select the most important parameters, map workloads, and recommend parameters to improve latency and throughput. ...
Full-text available
Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on
... Moreover, as the number of data sources increases rapidly (every year in our case) and involves sophisticated data analysis as well, so does the amount of data-related storage required to facilitate data analysts' adapting and yielding data swiftly. To ensure this, an agile database is needed that can acclimate both its physical and logical contents in synchronization with the fast evolution of newly growing data [31]. ...
The advancement of any modern nation depends largely on the standard of higher education. The fourth industrial revolution is all about the growth of information and frontier technologies. Consequently, it is impossible to maintain the standard of higher education without the help of emerging technologies. The higher education admission process has access to enormous heterogeneous data. This data can be used as an essential factor in making the appropriate decisions for educational institutions. With this point of view, we are proposing a Data-Driven Decision-Making System that will analyze available data from the previously proposed system Centralized Admission Hub (CAH) and will generate consummate reports based on that data. Decision-makers from universities can take strategic decisions about the standards of admission tests, seat distribution, and admission criteria based on analysis like department-wise student interest, overall interest in being admitted into certain universities, etc. Students can make decisions about choosing a university or major and based on information like the acceptance rate of the university, student-to-faculty member ratio, etc. The uni-versity's managing authorities can take the required decisions as well, based on relevant reports generated by this system. In this research work, we discussed different aspects of the proposed system, explored possible features, and aligned those to solve higher education admission-related problems.
... However, once a configuration space becomes large, users can easily get confused by possible interactions among configuration options with distinct impacts on diverse objectives. For example, many software systems have poorly chosen defaults (Van Aken et al. 2017;Herodotou et al. 2011). Hence, it is useful to seek better configurations. ...
Full-text available
Context Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Such systems can suffer from “disagreement” where different models have different (or even opposite) insights and tactics on how to optimize a system. For configuration problems, we show that (a) model disagreement is rampant; yet (b) prior to this paper, it has barely been explored. Objective We aim at helping practitioners and researchers better solve multi-objective configuration optimization problems, by resolving model disagreement. Method We propose a dimension reduction method called VEER that builds a useful one-dimensional approximation to the original N-objective space. Traditional model-based optimizers use Pareto search to locate Pareto-optimal solutions to a multi-objective problem, which is computationally heavy on large-scale systems. VEER builds a surrogate that can replace the Pareto sorting step after deployment. Results Compared to the prior state-of-the-art, for 11 configurable systems, VEER significantly reduces disagreement and execution time, without compromising the optimization performance in most cases. For our largest problem (with tens of thousands of possible configurations), optimizing with VEER finds as good or better optimizations with zero model disagreements, three orders of magnitude faster. Conclusion When employing model-based optimizers for multi-objective optimization, we recommend to apply VEER, which not only improves the execution time, but also resolves the potential model disagreement problem.
... In this paper, we focus on tuning online Spark SQL tasks. Unlike offline tuning where it is tolerable to simulate various configurations in a non-production cluster [4,15,20,40], each configuration can only be evaluated in real production due to high overhead of offline evaluations or security concerns. Thus, the configuration should be carefully selected to achieve high performance quickly (efficient) and low risks of negative results (safe). ...
Distributed data analytic engines like Spark are common choices to process massive data in industry. However, the performance of Spark SQL highly depends on the choice of configurations, where the optimal ones vary with the executed workloads. Among various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focus on transferring tuning history, while expert knowledge from Spark engineers is of great potential to improve the tuning performance but is not well studied so far; 2) history tasks should be carefully utilized, where using dissimilar ones lead to a deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address the challenges, we propose generalized transfer learning to boost the tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.
... Similarly, many solutions have been developed to address MapReduce's gaps, such as delayed data loading and data reuse. Among those tools are Starfish, which is a Hadoop-based framework aiming to improve the performance of MapReduce jobs through the use of data lifecycle analytics, as well as being a self-tuning system that adapts to users' needs and systems' workloads without requiring users to configure or change the underlying settings or parameters [54]. Spark is an alternative to MapReduce that aims to overcome disk I/O limitations and improve the performance of prior solutions. ...
Full-text available
Big Data is a phenomenon that affects today’s world, with new data being generated every second. Today’s enterprises face major challenges from the increasingly diverse data, as well as from indexing, searching, and analyzing such enormous amounts of data. In this context, several frameworks and libraries for processing and analyzing Big Data exist. Among those frameworks Hadoop MapReduce, Mahout, Spark, and MLlib appear to be the most popular, although it is unclear which of them best suits and performs in various data processing and analysis scenarios. This paper proposes EverAnalyzer, a self-adjustable Big Data management platform built to fill this gap by exploiting all of these frameworks. The platform is able to collect data both in a streaming and in a batch manner, utilizing the metadata obtained from its users’ processing and analytical processes applied to the collected data. Based on this metadata, the platform recommends the optimum framework for the data processing/analytical activities that the users aim to execute. To verify the platform’s efficiency, numerous experiments were carried out using 30 diverse datasets related to various diseases. The results revealed that EverAnalyzer correctly suggested the optimum framework in 80% of the cases, indicating that the platform made the best selections in the majority of the experiments.
... Como trabalhos futuros, seria interessante explorar outros algoritmos paralelos (p.ex outros da MLlib), novas consultas interativas e fluxos completos de processamento de dados, compostos por vários algoritmos ou consultas, assim como cenários com jobs paralelos e aplicações concorrentes no mesmo cluster. Também, cabe avaliar o modelo para adicionar ou remover recursos de forma proativa ao cluster respeitando a qualidade de serviço e provendo maior eficiência.É também interessante avaliar o comportamento do modelo para os algoritmos selecionados com diferentes entradas, bem como comparar os resultados com outros modelos como, por exemplo, o Starfish [Herodotou et al. 2011]. ...
Conference Paper
Aplicações de big data têm tipicamente propriedades bem específicas, tais como heterogeneidade e irregularidade nos padrões de acesso aos dados, que tornam a alocação de recursos de hardware e software muito desafiadora. Por outro lado, a flexibilidade e a elasticidade provida por plataformas de computação na nuvem facilitam esta alocaçãoá medida em que recursos são alocados sob demanda. Entretanto, estas características também tornam a previsão de desempenho (p.ex: tempo de resposta das aplicações) mais complexa. Este trabalho explora um modelo analítico para a previsão de tempo de resposta de aplicações executando na plataforma Spark, muito popular para processamento de dados em larga escala, que é parametrizado a partir de logs de execuções prévias. O modelo é avaliado em diversos cenários e aplicações, obtendo um erro relativo no tempo de resposta previsto inferior a 8%, em média.
... Additionally, there are prevailing works that proposed autonomic computing concept with selfconfigure feature in big data processing framework. Starfish [7] introduced self-tuning system on Hadoop based on user needs and system workloads for better performance using cost-based modeling and simulation. [8] used cron with python automated script to provide self-configure feature in Hadoop. ...
Conference Paper
Link prediction analysis becomes vital to acquire a deeper understanding of events underlying social networks interactions and connections especially in current evolving and large-scale social networks. Traditional link prediction approaches underperformed for most large-scale social networks in terms of its scalability and efficiency. Spark is a distributed open-source framework that facilitate scalable link prediction efficiency in large-scale social networks. The framework provides numerous tunable properties for users to manually configure the parameters for the applications. However, manual configurations open to performance issue when the applications start scaling tremendously, which is hard to set up and expose to human errors. This paper introduced a novel Self-Configured Framework (SCF) to provide an autonomous feature in Spark that predicts and sets the best configuration instantly before the application execution using XGBoost classifier. SCF is evaluated on the Twitter social network using three link prediction applications: Graph Clustering (GC), Overlapping Community Detection (OCD), and Redundant Graph Clustering (RGD) to assess the impact of shifting data sizes on different applications in Twitter. The result demonstrates a 40% reduction in prediction time as well as a balanced resource consumption that makes full use of resources, especially for limited number and size of clusters.
... Moreover, these configuration options also affect performance, i.e., non-functional properties such as energy consumption, response time, and throughput. The task of (optimally) configuring a software system is of paramount importance because (1) many systems are shipped with a sub-optimal default configuration (Aken et al. 2017;Herodotou et al. 2011), (2) manually exploring configurations does not scale (Xu et al. 2015), and (3) fine-grained tuning can improve performance up to several orders of magnitude (Jamshidi and Casale 2016;Zhu et al. 2017). This is why certain disciplines have whole branches of parameter optimization (Bergstra et al. 2011) or algorithm selection (Rice 1976), but with substantially smaller and usually unconstrained configuration spaces. ...
Full-text available
Understanding the influence of configuration options on the performance of a software system is key for finding optimal system configurations, system understanding, and performance debugging. In the literature, a number of performance-influence modeling approaches have been proposed, which model a configuration option’s influence and a configuration’s performance as a scalar value. However, these point estimates falsely imply a certainty regarding an option’s influence that neglects several sources of uncertainty within the assessment process, such as (1) measurement bias, choices of model representation and learning process, and incomplete data. This leads to the situation that different approaches and even different learning runs assign different scalar performance values to options and interactions among them. The true influence is uncertain, though. There is no way to quantify this uncertainty with state-of-the-art performance modeling approaches. We propose a novel approach, P4, which is based on probabilistic programming, that explicitly models uncertainty for option influences and consequently provides a confidence interval for each prediction alongside a scalar. This way, we can explain, for the first time, why predictions may be erroneous and which option’s influence may be unreliable. An evaluation on 13 real-world subject systems shows that P4’s accuracy is in line with the state of the art while providing reliable confidence intervals, in addition to scalar predictions. We qualitatively explain how uncertain influences of individual options and interactions cause inaccurate predictions.
This paper addresses the issue of model establishment and performance control for nonlinear computing systems based on Hadoop-Mapreduce platform. The parameter is selected as the control input data and the CPU usage is selected as the output data. By using least-squares principle, a Hammerstein model for this relationship is identified and validated. The control law consists of a nonlinear compensator and a PID feedback. By testing in Simulink, the Hammerstein model is verified appropriate and the designed nonlinear control drives the CPU usage to achieve the target value asymptotically.
Full-text available
MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop — an open-source implementation of MapReduce — often does not match the one of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.
Conference Paper
Full-text available
Massive data analysis on large clusters presents new opportunities and challenges for query optimization. Data partitioning is crucial to performance in this environment. However, data repartitioning is a very expensive operation so minimizing the number of such operations can yield very significant performance improvements. A query optimizer for this environment must therefore be able to reason about data partitioning including its interaction with sorting and grouping. SCOPE is a SQL-like scripting language used at Microsoft for massive data analysis. A transformation-based optimizer is responsible for converting scripts into efficient execution plans for the Cosmos distributed computing platform. In this paper, we describe how reasoning about data partitioning is incorporated into the SCOPE optimizer. We show how relational operators affect partitioning, sorting and grouping properties and describe how the optimizer reasons about and exploits such properties to avoid unnecessary operations. In most optimizers, consideration of parallel plans is an afterthought done in a postprocessing step. Reasoning about partitioning enables the SCOPE optimizer to fully integrate consideration of parallel, serial and mixed plans into the cost-based optimization. The benefits are illustrated by showing the variety of plans enabled by our approach.
Conference Paper
Full-text available
The MapReduce distributed programming framework is very popular, but currently lacks the optimization techniques that have been standard with relational database systems for many years. This paper proposes Manimal, which uses static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce programs. For example, a programmer's map function that emits data only when an if... statement holds true is essentially encoding a selection condition; code analysis can detect and characterize these conditions. If Manimal has an appropriate index available, it can then alter MapReduce execution to use it. Manimal can address many different optimization opportunities, including projections, structure-aware data compression, and others. However, this paper illustrates the system by focusing on one: efficient selection. We give a static analysis algorithm that can detect selections in user programs, and cover how Manimal can employ a B+Tree to execute these selections efficiently at runtime. Testing Manimal on several standard MapReduce programs, we show that selection alone can automatically reduce a standard program's runtime to 63% of conventional MapReduce execution time on identical hardware. We also give an in-depth discussion of other optimization targets and detection techniques.
Conference Paper
Full-text available
Managing data and computation is at the heart of datacenter computing. Manual management of data can lead to data loss, wasteful consumption of storage, and laborious bookkeeping. Lack of proper management of computation can result in lost opportunities to share common computations across multiple jobs or to compute results incrementally. Nectar is a system designed to address the aforementioned problems. It automates and unifies the management of data and computation within a datacenter. In Nectar, data and computation are treated interchangeably by associating data with its computation. Derived datasets, which are the results of computations, are uniquely identified by the programs that produce them, and together with their programs, are automatically managed by a datacenter wide caching service. Any derived dataset can be transparently regenerated by re-executing its program, and any computation can be transparently avoided by using previously cached results. This enables us to greatly improve datacenter management and resource utilization: obsolete or infrequently used derived datasets are automatically garbage collected, and shared common computations are computed only once and reused by others. This paper describes the design and implementation of Nectar, and reports on our evaluation of the system using analytic studies of logs from several production clusters and an actual deployment on a 240-node cluster.
Conference Paper
This paper presents DTrace, a new facility for dynamic instrumentation of production systems. DTrace features the ability to dynamically instrument both user-level and kernel-level software in a unified and absolutely safe fashion. When not explicitly enabled, DTrace has zero probe effect — the system operates exactly as if DTrace were not present at all. DTrace allows for many tens of thousands of instrumentation points, with even the smallest of systems offering on the order of 30,000 such points in the kernel alone. We have developed a C-like high-level control language to describe the predicates and actions at a given point of instrumentation. The lan- guage features user-defined variables, including thread- local variables and associative arrays. To eliminate the need for most postprocessing, the facility features a scal- able mechanism for aggregating data and a mechanism for speculative tracing. DTrace has been integrated into the Solaris operating system and has been used to find serious systemic performance problems on production systems — problems that could not be found using pre- existing facilities.
Conference Paper
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention re- cently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more main- tainable code. These languages engender greater trans- parency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.
Conference Paper
Parallel dataflow programs generate enormous amounts of distributed data that are short-lived, yet are critical for com- pletion of the job and for good run-time performance. We call this class of data as intermediate data. This paper is the first to address intermediate data as a first-class citizen, specifically targeting and minimizing the effect of run-time server failures on the availability of intermediate data, and thus on performance metrics such as job completion time. We propose new design techniques for a new storage sys- tem called ISS (Intermediate Storage System), implement these techniques within Hadoop, and experimentally evalu- ate the resulting system. Under no failure, the performance of Hadoop augmented with ISS (i.e., job completion time) turns out to be comparable to base Hadoop. Under a fail- ure, Hadoop with ISS outperforms base Hadoop and incurs up to 18% overhead compared to base no-failure Hadoop, depending on the testbed setup.
Large-scale data analysis lies in the core of modern enter- prises and scientific research. With the emergence of cloud computing, the use of an analytical query processing in- frastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular frame- work in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batch mode. Taking into account that dierent jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reducing monetary charges incurred while utilizing the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more eciently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Experiments in our prototype, built on top of Hadoop, demonstrate the overall eectiveness of our approach and substantial savings.
As massive data acquisition and storage becomes increas- ingly aordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Ag- ile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intel- ligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Interactive Media, us- ing the Greenplum parallel database system. We describe database design methodologies that support the agile work- ing style of analysts in these settings. We present data- parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reect on database system features that enable agile design and exi- ble algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.