
A View of the Parallel Computing Landscape



Industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Jeopardy for the IT industry means opportunity for the research community. If researchers meet the parallel challenge, the future of IT is rosy. If they don't, it's not. Hence, there are few restrictions on potential solutions. Given an excuse to reinvent the whole software/hardware stack, this opportunity is also a once-in-a-career chance to fix other weaknesses in computing that have accumulated over the decades like barnacles on the hull of an old ship. In this article, the authors lay out one view of the opportunities, then, as an example, describe in more depth the approach of the Berkeley Parallel Computing Lab, or Par Lab, updating two long technical reports that include more detail. The goal is to recruit more parallel revolutionaries.
56 COMMUNICATIONS OF THE ACM | OCTOBER 2009 | VOL. 52 | NO. 10
contributed articles
Writing programs that scale with increasing
numbers of cores should be as easy as writing
programs for sequential computers.
INDUSTRY NEEDS HELP from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Here, we review the issues and, as an example, describe an integrated approach we’re developing at the Parallel Computing Laboratory, or Par Lab, to tackle the parallel challenge.
Over the past 60 years, the IT industry has improved the cost-performance of sequential computing by about 100 billion times overall.20 For most of the past 20 years, architects have used the rapidly increasing transistor speed and budget made possible by silicon technology advances to double performance every 18 months. The implicit hardware/software contract was that increased transistor count and power dissipation were OK as long as architects maintained the existing sequential programming model. This contract led to innovations that were inefficient in terms of transistors and power (such as multiple instruction issue, deep pipelines, out-of-order execution, speculative execution, and prefetching) but that increased performance while preserving the sequential programming model.
The contract worked fine until we hit the power limit a chip is able to dissipate. Figure 1 reflects this abrupt change, plotting the projected microprocessor clock rates of the International Technology Roadmap for Semiconductors in 2005 and then again just two years later.16 The 2005 prediction was that clock rates should have exceeded 10GHz in 2008, topping 15GHz in 2010. Note that Intel products are today far below even the conservative 2007 prediction.
After crashing into the power wall, architects were forced to find a new paradigm to sustain ever-increasing performance. The industry decided the only viable option was to replace the single power-inefficient processor with many efficient processors on the same chip. The whole microprocessor industry thus declared that its future was in parallel computing, with increasing numbers of processors, or cores, each technology generation every two years. This style of chip was labeled a multicore microprocessor. Hence, the leap to multicore is not based on a breakthrough in programming or architecture and is actually a retreat from the more difficult task of building power-efficient, high-clock-rate, single-core chips.5
Many startups have sold parallel computers over the years, but all failed, as programmers accustomed to continuous improvement in sequential performance saw little need to explore parallelism. Convex, Encore, Floating Point Systems, Inmos, Kendall Square
Research, MasPar, nCUBE, Sequent,
Silicon Graphics, and Thinking Ma-
chines are just the best-known mem-
bers of the Dead Parallel Computer So-
ciety. Given this sad history, multicore
pessimism abounds. Quoting comput-
ing pioneer John Hennessy, President
of Stanford University:
“…when we start talking about par-
allelism and ease of use of truly parallel
computers, we’re talking about a problem
that’s as hard as any that computer sci-
ence has faced. …I would be panicked if I
were in industry.”19
Jeopardy for the IT industry means
opportunity for the research commu-
nity. If researchers meet the paral-
lel challenge, the future of IT is rosy.
If they don’t, it’s not. Hence, there
are few restrictions on potential so-
lutions. Given an excuse to reinvent
the whole software/hardware stack,
this opportunity is also a once-in-a-career
chance to fix other weaknesses
in computing that have accumulated
over the decades like barnacles on the
hull of an old ship.
Here, we lay out one view of the op-
portunities, then, as an example, de-
scribe in more depth the approach of
the Berkeley Parallel Computing Lab,
or Par Lab, updating two long techni-
cal reports4,5 that include more detail.
Our goal is to recruit more parallel revolutionaries.
Parallel Bridge
The bridge in Figure 2 represents an
analogy connecting computer users
on the right to the IT industry on the
left. The left tower is hardware, the
right tower is applications, and the
long span in between is software. We
use the bridge analogy throughout
this article. The aggressive goal of the
parallel revolution is to make it as easy
to write programs that are as efficient,
portable, and correct (and that scale as
the number of cores per microproces-
sor increases biennially) as it has been
to write programs for sequential com-
puters. Moreover, we can fail overall
if we fail to deliver even one of these
“parallel virtues.” For example, if par-
allel programming is unproductive,
this weakness will delay and reduce
the number of programs that are able
to exploit new multicore architectures.
Hardware tower. The power wall
forces the change in the traditional
programming model, but the question
for parallel researchers is what kind of
computing architecture should take
its place. There is a technology sweet
spot around a pipelined processor of
five-to-eight stages that is most efficient
in terms of performance per joule
and silicon area.5 Using simple cores
means there is room for hundreds of
them on the same chip. Moreover, hav-
ing many such simple cores on a chip
simplifies hardware design and verification,
since each core is simple, and
replication of cores is nearly trivial.
Just as it’s easy to add spares to mask
manufacturing defects, “manycore”
computers can also have higher yield.
One example of a manycore comput-
er is from the world of network proces-
sors, which has seen a great deal of inno-
vation recently due to the growth of the
networking market. The best-designed
network processor is arguably the Cisco
Silicon Packet Processor, also known as
Metro, which has 188 five-stage RISC
cores, plus four spares to help yield, and
dissipates just 35 watts.
It may be reasonable to assume
that manycore computers will be ho-
mogeneous, like the Metro, but there
is an argument for heterogeneous
manycores as well. For example, sup-
pose 10% of the time a program gets no
speedup on a 100-core computer. To
run this sequential piece twice as fast,
assume a single fat core would need 10
times as many resources as a thin core
due to larger caches, a vector unit, and
other features. Applying Amdahl’s Law,
here are the speedups (relative to one
thin core) of 100 thin cores and 90 thin
cores for the parallel code plus one fat
core for the sequential code:
Speedup_{100} = 1 / (0.1 + 0.9/100) = 9.2 times faster
Speedup_{91} = 1 / (0.1/2 + 0.9/90) = 16.7 times faster
In this example of manycore proces-
sor speedup, a fat core needing 10
times as many resources would be
more effective than the 10 thin cores
it replaces.5,15
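The two speedups above can be reproduced with a few lines of Python. This is a minimal sketch of the Amdahl's Law arithmetic only; the function name `amdahl_speedup` and its parameters are illustrative and do not come from the article.

```python
def amdahl_speedup(serial_fraction, parallel_cores, serial_core_speed=1.0):
    """Speedup relative to one thin core (Amdahl's Law).

    serial_fraction   -- fraction of run time that gets no parallel speedup
    parallel_cores    -- number of thin cores running the parallel part
    serial_core_speed -- how many times faster than a thin core the
                         sequential part runs (2.0 for the assumed fat core)
    """
    parallel_fraction = 1.0 - serial_fraction
    time = serial_fraction / serial_core_speed + parallel_fraction / parallel_cores
    return 1.0 / time


# 100 thin cores, sequential part runs on a thin core.
homogeneous = amdahl_speedup(0.1, 100)
# 90 thin cores plus one fat core that runs the sequential part twice as fast.
heterogeneous = amdahl_speedup(0.1, 90, serial_core_speed=2.0)

print(f"{homogeneous:.1f}")    # 9.2
print(f"{heterogeneous:.1f}")  # 16.7
```

Varying `serial_fraction` shows how quickly the sequential piece dominates: the parallel term shrinks with core count, but the sequential term shrinks only when the core running it gets faster.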
One notable challenge for the hardware
tower is that it takes four to five
years to design and build chips and port
software to evaluate them. Given this
lengthy cycle, how could researchers in-
novate more quickly?
Software span. Software is the main
problem in bridging the gap between
users and the parallel IT industry.
Hence, the long distance of the span in
Figure 2 reflects the daunting magnitude
of the software challenge.
One especially vexing challenge
for the parallel software span is that
sequential programming accommo-
dates the wide range of skills of today’s
programmers. Our experience teach-
ing parallelism suggests that not every
programmer is able to understand the
nitty gritty of concurrent software and
parallel hardware; difficult steps include
locks, barriers, deadlocks, load
balancing, scheduling, and memory
consistency. How can researchers de-
velop technology so all programmers
benefit from the parallel revolution?
A second challenge is that two criti-
cal pieces of system software—com-
pilers and operating systems—have
grown large and unwieldy and hence
Figure 1. Microprocessor clock rates of Intel products vs. projections from the International Technology Roadmap for Semiconductors in 2005 and 2007.16 (Vertical axis: clock rate in GHz; horizontal axis: years 2001 to 2013; series: Intel single core, Intel multicore, 2007 Roadmap, 2005 Roadmap.)
resistant to change. One estimate is
that it takes a decade for a new compil-
er optimization to become part of pro-
duction compilers. How can research-
ers innovate rapidly if compilers and
operating systems evolve so glacially?
A final challenge is how to measure
improvement in parallel programming
languages. The history of these
languages largely reflects researchers
deciding what they think would be
better and then building it for oth-
ers to try. As humans write programs,
we wonder whether human psychol-
ogy and human-subject experiments
shouldn’t be allowed to play a larger
role in this revolution.17
Applications tower. The goal of re-
search into parallel computing should
be to find compelling applications that
thirst for more computing than is currently
available and absorb a biennially
increasing number of cores for the next
decade or two. Success does not require
improvement in the performance of
all legacy software. Rather, we need to
create compelling applications that ef-
fectively utilize the growing number of
cores while providing software environ-
ments that ensure that legacy code still
works with acceptable performance.
Note that the notion of “better” is not defined by only average performance; advances could be in, say, worst-case response time, battery life, reliability, or security. To save the IT industry, researchers must demonstrate greater end-user value from an increasing number of cores.
Par Lab
As a concrete example of the parallel landscape, we describe Berkeley’s Par Lab project,a exploring one of many potential approaches, though we won’t know for years which of our ideas will bear fruit. We hope it inspires more researchers to participate, increasing the chance of finding a solution before it’s too late for the IT industry.
Given a five-year project, we project the state of the field in five to 10 years, anticipating that IT will be driven to extremes in size due to the increasing popularity of software as a service, or SaaS:
The datacenter is the server. Amazon, Google, Microsoft, and other major IT vendors are racing to construct buildings with 50,000 or more servers to run SaaS, inspiring the new catchphrase “cloud computing.”b They have also begun renting thousands of machines by the hour to enable smaller companies to benefit from cloud computing. We expect these trends to accelerate; and
The mobile device (laptops and handhelds) is the client. In 2007, Hewlett-Packard, the largest maker of PCs, shipped more laptops than desktops. Millions of cellphones are shipped each day with ever-increasing functionality, a trend we expect to accelerate as well.
Surprisingly, these extremes in computing share many characteristics. Both concern power and energy: the datacenter due to the cost of power and cooling and the mobile client due to battery life. Both concern cost: the datacenter because server cost is replicated 50,000 times and mobile clients because of a lower unit-price target. Finally, the software stacks are becoming similar, with more layers for mobile clients and increasing concern about protection and security.
a In March 2007, Intel and Microsoft invited 25 universities to propose five-year centers for parallel computing research; the Berkeley and Illinois efforts were ranked first and second.
b See Armbrust, M. et al. Above the Clouds: A Berkeley View of Cloud Computing. University of California, Berkeley, Technical Report
Figure 2. Bridge analogy connecting users to a parallel IT industry, inspired by the view of the Golden Gate Bridge from Berkeley, CA.
Many datacenter applications have
ample parallelism across independent
users, so the Par Lab focuses on paral-
lelizing applications for clients. The
multicore and manycore chips in the
datacenter stand to benefit from the
same tools and techniques developed
for similar chips in mobile clients.
Given this projection, we decided to
take a fresh approach: the Par Lab will
be driven top-down, applications first,
then software, and finally hardware.
Par Lab application tower. An unfor-
tunate computer science tradition is we
build research prototypes, then wonder
why applications people don’t use them.
In the Par Lab, we instead selected ap-
plications up-front to drive research and
provide concrete goals and metrics to
evaluate progress. We selected each ap-
plication based on five criteria: compel-
ling in terms of likely market or social
impact, with short-term feasibility and
longer-term potential; requiring signifi-
cant speedup or smaller, more efficient
platform to work as intended; covering
the possible platforms and markets
likely to dominate usage; enabling tech-
nology for other applications; and in-
volvement of a local committed expert
application partner to help design, use,
and evaluate our technology.
Here are the five initial applications
we’re developing:
Music/hearing. High-performance
signal processing will permit: concert-
quality sound-delivery systems for
home sound systems and conference
calls; composition and gesture-driven
live-performance systems; and much
improved hearing aids;
Speech understanding. Dramatically
improved automatic speech recogni-
tion in moderately noisy and rever-
berant environments would greatly
improve existing applications and en-
able new ones, like, say, a real-time
meeting transcriber with rewind and
search. Depending on acoustic condi-
tions, current transcribers can gener-
ate many errors;
Content-based image retrieval. Con-
sumer-image databases are growing so
dramatically they require automated
search instead of manual labeling. Low
error rates require processing very high
dimensional feature spaces. Current
image classifiers are too slow to deliver
adequate response times;
Intraoperative risk assessment for
stroke patients. Advanced physiological
blood-flow modeling based on com-
putational analysis of 3D medical im-
ages of a patient’s cerebral vasculature
enables “virtual stress testing” to risk-
stratify stroke victims intraoperatively.
Patients thus identified at low risk of
complications can then be treated to
mitigate the effects of the stroke. This
technology will ultimately lower compli-
cation rates in treating stroke victims,
improve quality of life, reduce medical
care expenditures, and save lives; and
Parallel browser. The browser will
be the largest and most important ap-
plication on many mobile devices. We
will first parallelize sequential browser
bottlenecks. Rather than parallelizing
JavaScript programs, we are pursuing
an actor language with implicit paral-
lelism. Such a language may be acces-
sible to Web programmers while al-
lowing them to extract the parallelism
in the browser’s JIT compiler, thereby
turning all Web-site developers un-
knowingly into parallel programmers.
Application-domain experts are
first-class members of the Par Lab proj-
ect. Rather than try to answer design
questions abstractly, we ask our experts
what they prefer in each case. Project
success is judged by the user experience
with the collective applications on our
hardware-software prototypes. If suc-
cessful, we imagine building on these
five applications to create other appli-
cations that are even more compelling,
as in the following two examples:
Name Whisperer. Imagine that your
mobile client peeking out of your shirt
pocket is able to recognize the per-
son walking toward you to shake your
hand. It would search a personal im-
age database, then whisper in your
ear, “This man is John Smith. He got
an A– from you in CS152 in 1993”; and
Health Coach. As your mobile client
is always with you, you could take pic-
tures and weigh your dishes (assum-
ing it has a built-in scale) before and
after each meal. It would also record
how much you exercise. Given calories
consumed and burned and an image of
your body, it could visualize what you’re
likely to look like in six months at this
rate and what you’d look like if you ate
less or exercised more.
Par Lab software span. Software
is the major effort of the project, and
we’re taking a different path from pre-
vious parallel projects, emphasizing
software architecture, autotuning, and
separate support for productivity vs.
performance programming.
Architecting parallel software with
design patterns, not just parallel pro-
gramming languages. Our situation is
similar to that found in other engineer-
ing disciplines where a new challenge
emerges that requires a top-to-bottom
rethinking of the entire engineering
process; for example, in civil architec-
ture, Filippo Brunelleschi’s solution
in 1418 for how to construct the dome
for the Cathedral of Florence required
innovations in tools and building tech-
niques, as well as rethinking the whole
process of developing an architecture.
All computer science faces a similar
challenge; parallel programming is
overdue for a fundamental rethinking
of the process of designing software.
Programmers have been trying to
craft parallel code for decades and
learned a great deal about what works
and what doesn’t work. Automatic par-
allelism doesn’t work. Compilers are
great at low-level scheduling decisions
but can’t discover new algorithms to
exploit concurrency. Programmers in
high-performance computing have
shown that explicit technologies (such
as MPI and OpenMP) can be made to
work but too often require heroic ef-
fort untenable for most commercial
software vendors.
To engineer high-quality parallel
software, we plan to rearchitect the
software through a “design pattern language.”
In his 1977 book, civil architect
Christopher Alexander wrote that
“design patterns” describe
time-tested solutions to recurring prob-
lems within a well-defined context.3
An example is Alexander’s “family of
entrances” pattern, addressing how to
simplify comprehension of multiple
entrances for a first-time visitor to a
site. He defined a “pattern language”
as a collection of related and interlock-
ing patterns, constructed such that the
patterns flow into each other as the de-
signer solves a design problem.
Computer scientists are trained to
think in well-defined formalisms. Pat-
tern languages encourage a less for-
mal, more associative way of thinking
about a problem. A pattern language
does not impose a rigid methodol-
ogy; rather, it fosters creative problem
solving by providing a common vo-
cabulary to capture the problems en-
countered during design and identify
potential solutions from among fami-
lies of proven designs.
The observation that design patterns
and pattern languages might be useful
for software design is not new. An exam-
ple is Gamma et al.’s 1994 book Design
Patterns, which outlined patterns use-
ful for object-oriented programming.12
In building our own pattern language,
we found Shaw’s and Garlan’s report,23
which described a variety of architec-
tural styles useful for organizing soft-
ware, to be very effective. That these
architectural styles may also be viewed
as design patterns was noted earlier
by Buschmann in his 1996 book Pat-
tern-Oriented Software Architecture.7 In
particular, we adopted Pipe-and-Filter,
Agent-and-Repository, Process Control,
and Event-Based architectural styles as
structural patterns within our pattern
language. To this list, we add MapReduce
and Iterator as structural design patterns.
These patterns define the structure
of a program but do not indicate what
is actually computed. To address this
blind spot, another key part of our pat-
tern language is the set of “dwarfs” of
the Berkeley View reports4,5 (see Fig-
ure 3). Dwarfs are best understood as
computational patterns providing the
computational interior of the structural
patterns discussed earlier. By analogy,
the structural patterns describe a fac-
tory’s physical structure and general
workflow. The computational patterns
describe the factory’s machinery, flow
of resources, and work products. Struc-
tural and computational patterns can
be combined to architect arbitrarily
complex parallel software systems.
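The split between structural and computational patterns can be illustrated with a small Python sketch. It is not the Par Lab pattern language or any of its frameworks; `map_reduce`, `dot_product`, and `longest_word_length` are hypothetical names chosen for this example. The structural pattern (MapReduce) fixes how work flows, while the computation plugged into it is supplied separately. A thread pool stands in for real parallel hardware, so the example shows structure rather than speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(mapper, reducer, items, initial, workers=4):
    """Structural pattern: defines the flow of work, not what is computed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, items))  # map phase, input order preserved
    return reduce(reducer, mapped, initial)     # reduce phase


# Computational interior 1: a dense linear algebra kernel (dot product).
def dot_product(xs, ys):
    pairs = list(zip(xs, ys))
    return map_reduce(lambda p: p[0] * p[1], lambda a, b: a + b, pairs, 0)


# Computational interior 2: an unrelated computation in the same structure.
def longest_word_length(words):
    return map_reduce(len, max, words, 0)


print(dot_product([1, 2, 3], [4, 5, 6]))                # 32
print(longest_word_length(["pipe", "filter", "map"]))   # 6
```

Both functions reuse the same structure unchanged, which mirrors the factory analogy: the building and workflow stay fixed while the machinery inside differs.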
Convention holds that truly useful
patterns are not invented but mined
from successful software applications.
To arrive at our list of useful computational
patterns, we began with the “seven dwarfs of
high-performance computing” compiled by
Phillip Colella of Lawrence Berkeley National
Laboratory. Then, in 2006 and 2007
we worked with domain experts to
broadly survey other application ar-
eas, including embedded systems,
general-purpose computing (SPEC
benchmarks), databases, games, arti-
ficial intelligence/machine learning,
computer-aided design of integrated
circuits, and high-performance com-
puting. We then focused in depth on
the patterns in the applications we de-
scribed earlier. Figure 3 shows the re-
sults of our pattern mining.
Computational and structural pat-
terns can be hierarchically composed
to define an application’s high-level
software architecture, but a complete
pattern language for application de-
sign must at least span the full range,
from high-level architecture to detailed
software implementation and tuning.
Mattson et al.’s 2004 book Patterns for
Parallel Programming18 was the first
such attempt to systematize parallel
programming using a complete pattern
language. In our pattern language, the structural
and computational patterns mentioned earlier
sit directly on top of the algorithmic structures
and implementation structures of the pattern
language in Mattson’s book. The resulting pattern language is
still under development but is already
employed by the Par Lab to develop the
software architectures and parallel im-
plementations of such diverse applica-
tions as content-based image retrieval,
large-vocabulary continuous speech
recognition, and timing analysis for in-
tegrated circuit design.
Patterns are conceptual tools that
help a programmer reason about a
software project and develop an ar-
chitecture but are not themselves
implementation mechanisms for
producing code.
Split productivity and efficiency lay-
ers, not just a single general-purpose
layer. A key Par Lab research objective
is to enable programmers to easily
write programs that run as efficiently
on manycore systems as on sequential
ones. Productivity, efficiency, and cor-
rectness are inextricably linked and
must be addressed together. These ob-
jectives cannot be accomplished with
a single-point solution (such as a uni-
versal language). In our approach, pro-
ductivity is addressed in a productivity
layer that uses a common composition
and coordination language to glue to-
gether the libraries and programming
frameworks produced by the efficien-
cy-layer programmer. Efficiency is prin-
cipally handled through an efficiency
layer that is targeted for use by expert
parallel programmers.
The key to generating a successful
multicore software developer commu-
nity is to maximally leverage the efforts
of parallel programming experts by en-
capsulating their software for use by the
programming masses. We use the term
“programming framework” to mean
a software environment that supports
implementation of the solution pro-
posed by the associated design pattern.
The difference between a programming
framework and a general programming
model or language is that in a program-
ming framework the customization is
performed only at specified points that
are harmonious with the style embod-
ied in the original design pattern. An
example of a successful sequential
programming framework is the Ruby
on Rails framework, which is based
on the Model-View-Controller pat-
tern.26 Users have ample opportunity
to customize the framework but only
in harmony with the core Model-View-
Controller pattern.
Frameworks include libraries, code
generators, and runtime systems that
assist programmers with implementa-
tion by abstracting difficult portions
of the computation and incorporating
them into the framework itself. Histor-
ically successful parallel frameworks
encode the collective experience of the
programming community’s solutions
to recurring problems. Basing frame-
works on pervasive design patterns will
help make parallel frameworks broad-
ly applicable.
Productivity-layer programmers will
compose libraries and programming
frameworks into applications with the
help of a composition and coordina-
tion language.13 The language will be
implicitly parallel; that is, its composi-
tion will have serial semantics, mean-
ing the composed programs will be
safe (such as race-free) and virtualized
with respect to processor resources. It
will document and check interface re-
strictions to avoid concurrency bugs
resulting from incorrect composition,
as in, say, instantiating a framework
with a stateful function when a state-
less one is required. Finally, it will
support definition of domain-specific
abstractions for constructing frame-
works for specific applications, offer-
ing a programming experience similar
to MATLAB and SQL.
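A toy Python sketch can make two of these ideas concrete: composition with serial semantics, and an interface restriction that rejects a stateful function where a stateless one is required. Everything here is hypothetical and is not the Par Lab composition language. The `@stateless` marker is an honor-system annotation rather than a real analysis, and `parallel_map` is an invented stand-in for a framework whose only customization point is the function passed to it.

```python
from concurrent.futures import ThreadPoolExecutor


def stateless(func):
    """Hypothetical annotation: the author promises func has no side effects."""
    func.is_stateless = True
    return func


def parallel_map(func, items, workers=4):
    """Toy framework: customizable only through `func`, which must be stateless."""
    if not getattr(func, "is_stateless", False):
        raise TypeError("parallel_map requires a function marked @stateless")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))  # results come back in input order


@stateless
def square(x):
    return x * x


# Serial semantics: the parallel result equals the obvious sequential one.
result = parallel_map(square, range(8))
serial = [square(x) for x in range(8)]

# A stateful function is rejected at composition time, before it can race.
log = []

def record(x):
    log.append(x)
    return x

try:
    parallel_map(record, range(8))
    rejected = False
except TypeError:
    rejected = True
```

A real composition language would check the restriction rather than trust an annotation; the sketch only shows where such a check would sit in the interface.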
Parallel programs in the efficiency
layer are written very close to the ma-
chine, with the goal of allowing the best
possible algorithm to be written in the
primitives of the layer. Unfortunately,
existing multicore systems do not of-
fer a common low-level programming
model for parallel code. We are thus
defining a thin portability layer that
runs efficiently across single-socket
platforms and includes features for
parallel job creation, synchronization,
memory allocation, and bulk-memory
access. To provide a common model of
memory across machines with coher-
ent caches, local stores, and relatively
slow off-chip memory, we are defining
an API based on the idea of logically
partitioned shared memory, inspired
by our experience with Unified Parallel
C,27 which partitions memory among
processors but not (currently) between
on- and off-chip.
We may implement this efficiency
language either as a set of runtime
primitives or as a language extension of
C. It will be extensible with libraries to
experiment with various architectural
features (such as transactions, dynamic
multithreading, active messages, and
collective communication). The API will
be implemented on some existing mul-
ticore and manycore platforms and on
our own emulated manycore design.
To engineer parallel software, programmers must be able to start with effective software architectures, and the software engineer would describe the solution to a problem in terms of a design pattern language. Based on this language, the Par Lab is creating a family of frameworks to help turn a design into working code. The general-purpose programmer will work largely with the frameworks and stay within what we call the productivity layer. Specialist programmers trained in the details of parallel programming technology will work within the efficiency layer to implement the frameworks and map them onto specific hardware platforms. This approach will help general-purpose programmers create parallel software without having to master the low-level details of parallel programming.
Figure 3. The color of a cell (for 12 computational patterns in seven general application areas and five Par Lab applications) indicates the presence of that computational pattern in that application; red/high; orange/moderate; green/low; blue/rare. Par Lab application columns: Health, Image, Speech, Music, Browser. Pattern rows: 1. Finite State Mach.; 2. Circuits; 3. Graph Algorithms; 4. Structured Grid; 5. Dense Matrix; 6. Sparse Matrix; 7. Spectral (FFT); 8. Dynamic Prog; 9. Particle Methods; 10. Backtrack/B&B; 11. Graphical Models; 12. Unstructured Grid.
Generating code with search-based au-
totuners, not compilers. Compilers that
automatically parallelize sequential
code may have great commercial value
as computers go from one to two to four
cores, though as described earlier, his-
tory suggests they will be unable to scale
from 32 to 64 to 128 cores. Compiling
will be even more difficult, as the switch
to multicore means microprocessors
are becoming more diverse, since con-
ventional wisdom is not yet established
for multicore architectures. For exam-
ple, the table here shows the diversity
in designs of x86 and SPARC multicore
computers. In addition, as the number
of cores increases, manufacturers
will likely offer products with differing
numbers of cores per chip to cover mul-
tiple price-performance points. They
will also allow each core to vary its clock
frequency to save power. Such diversity
will make the goals of efficiency, scal-
ing, and portability even more difficult
for conventional compilers, at a time
when innovation is desperately needed.
In recent years, autotuners have
become popular for producing high-
quality, portable scientific code for se-
rial microprocessors,10 optimizing a set
of library kernels by generating many
variants of a kernel and measuring each
variant by running on the target plat-
form. The search process effectively
tries many or all optimization switches;
hence, searching may take hours to
complete on the target platform. How-
ever, search is performed only once,
when the library is installed. The result-
ing code is often several times faster
than naive implementations. A single
autotuner can be used to generate high-
quality code for a variety of machines.
In many cases, the autotuned code is
faster than vendor libraries that were
specifically hand-tuned for the target
The synthesized mechanics could be
barrier synchronization expressions or
tricky loop bounds in stencil loops. Our
sketching-based synthesis is to tradi-
tional, deductive synthesis what model
checking is to theorem proving; rather
than interactively deriving a program,
our system searches a space of candi-
date programs with constraint solving.
Efficiency is achieved by reducing the
problem to one solved with two com-
municating SAT solvers. In future work,
we hope to synthesize parallel sparse
matrix codes and data-parallel algo-
rithms for additional problems (such as
Verification and testing, not one or the
other. Correctness is addressed differ-
ently at the two layers. The productiv-
ity layer is free from concurrency prob-
lems because the parallelism models
are restricted, and the restrictions are
enforced. The efficiency-layer code is
checked automatically for subtle con-
currency errors.
A key challenge in verification is
obtaining specifications for programs
to verify. Modular verification and au-
tomated unit-test generation require
the specification of high-level serial se-
mantic constraints on the behavior of
the individual modules (such as paral-
lel frameworks and parallel libraries).
To simplify specification, we use ex-
ecutable sequential programs with the
same behavior as a parallel component,
augmented with atomicity constraints
on a task,21 predicate abstractions of
the interface of a module,14 or multiple
ownership types.8
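As a minimal sketch of using an executable sequential program as the specification of a parallel component, the Python below compares a thread-pool map against a plain serial map. The names `serial_map`, `parallel_map`, and `conforms` are hypothetical and chosen for illustration; this is not the Par Lab tooling, and it checks only input/output equivalence, not atomicity constraints or ownership types.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_map(fn, items):
    """Sequential specification: the behavior the parallel version must match."""
    return [fn(x) for x in items]

def parallel_map(fn, items, workers=4):
    """Parallel component under test; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def conforms(fn, items):
    """True when the parallel component agrees with its serial specification."""
    items = list(items)
    return parallel_map(fn, items) == serial_map(fn, items)

print(conforms(lambda x: x * x, range(100)))
```

A check like this only samples behavior on the inputs supplied; the verification described above aims at guarantees that hold for all inputs.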
Programmers often find it difficult to
specify such high-level contracts involv-
ing large modules; however, most find
it convenient to specify local properties
of programs using assert statements
and type annotations. Local assertions
and type annotations are often generated
from a program’s implicit correctness
requirements (such as data-race freedom,
deadlock freedom, and memory safety).
The system propagates implications
of these local assertions to the module
boundaries through a combination of
static verification and directed automat-
ed unit testing. These implications cre-
ate serial contracts that specify how the
modules (such as frameworks) are used
correctly. When the contracts for the
parallel modules are in place, program-
mers use static program verification to
machine. This surprising result is partly
explained by the way the autotuner tire-
lessly tries many unusual variants of a
particular routine. Unlike libraries, au-
totuners also allow tuning to the partic-
ular problem size. Autotuners also pre-
serve clarity and support portability by
reducing the temptation to mangle the
source code to improve performance
for a particular computer.
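The generate-and-measure loop at the heart of an autotuner can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions: the kernel (a blocked sum), the block sizes, and the names `make_blocked_sum` and `autotune` are all hypothetical, and a real autotuner would generate compiled code variants rather than Python closures.

```python
import timeit

def make_blocked_sum(block):
    """Generate one variant of the kernel: sum a list in chunks of `block` elements."""
    def variant(data):
        total = 0
        for start in range(0, len(data), block):
            total += sum(data[start:start + block])
        return total
    return variant

def autotune(variants, data, repeats=3):
    """Time every variant on the machine running this code; return the fastest."""
    timings = {}
    for name, fn in variants.items():
        # Take the minimum over several repeats to reduce timing noise.
        timings[name] = min(timeit.repeat(lambda: fn(data), number=5, repeat=repeats))
    best = min(timings, key=timings.get)
    return best, timings

data = list(range(20_000))
variants = {f"block={b}": make_blocked_sum(b) for b in (16, 256, 4096)}
best, timings = autotune(variants, data)
print("fastest variant on this machine:", best)
```

Which variant wins depends on the machine, which is the point: the search is rerun on each target rather than fixed in advance.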
Autotuning also helps with produc-
tion of parallel code. However, paral-
lel architectures introduce many new
optimization parameters; so far, there
are few successful autotuners for paral-
lel codes. For any given problem, there
may be several parallel algorithms, each
with alternative parallel data layouts.
The optimal choice may depend not
only on the processor architecture but
also on the parallelism of the computer
and memory bandwidth. Consequent-
ly, in a parallel setting, the search space
will be much larger than for traditional
serial hardware.
The table lists the results of auto-
tuning on three multicores for three
kernels related to the dwarfs’ sparse
matrix, stencil for PDEs, and structured
grids9,30,31 mentioned earlier. This au-
totuned code is the fastest known for
these kernels for all three computers.
Performance increased by factors of two
to four over standard code, much better
than you would expect from an optimiz-
ing compiler.
Efficiency-layer programmers will
be able to build autotuners for use by
domain experts and other efficiency-
layer programmers to help deliver on
the goals of efficiency, portability, and
Synthesis with sketching. One chal-
lenge for autotuning is how to produce
the high-performance implementa-
tions explored by the search. One ap-
proach is to synthesize these complex
programs. In doing so, we rely on the
search for performance tuning, as well
as for programmer productivity. To ad-
dress the main challenge of traditional
synthesis—the need for experts to com-
municate their insight with a formal
domain theory—we allow that insight
to be communicated directly by pro-
grammers who write an incomplete
program, or “sketch.” In it, they provide
an algorithmic skeleton, and the syn-
thesizer supplies the low-level mechan-
ics by filling in the holes in the sketch.
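To make the idea of a sketch concrete, here is a toy Python illustration in which the programmer writes a three-point stencil loop but leaves the two loop bounds as holes. The synthesizer below fills them by brute-force enumeration checked against a slow reference implementation on a few test inputs; that enumeration is a stand-in assumed for illustration, not the constraint-solving approach the article describes, and the names `reference`, `sketch`, and `synthesize` are hypothetical.

```python
from itertools import product

def reference(a):
    """Specification: replace each interior point by the mean of it and its neighbors."""
    out = list(a)
    for i in range(len(a)):
        if 0 < i < len(a) - 1:
            out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3
    return out

def sketch(a, lo, hi_off):
    """Algorithmic skeleton with two holes: the loop runs over range(lo, len(a) + hi_off)."""
    out = list(a)
    for i in range(lo, len(a) + hi_off):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3
    return out

def synthesize(tests, hole_values=range(-2, 3)):
    """Return the first (lo, hi_off) for which the sketch matches the reference on all tests."""
    for lo, hi_off in product(hole_values, repeat=2):
        try:
            if all(sketch(t, lo, hi_off) == reference(t) for t in tests):
                return lo, hi_off
        except IndexError:
            continue  # this candidate reads past the end of the array
    return None

tests = [[1, 4, 9, 16, 25], [2, 7, 1, 8]]
print(synthesize(tests))
```

On these inputs the search settles on bounds that skip the first and last elements, the off-by-one detail a programmer would otherwise have to get right by hand.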
check if the client code composed with
the contracts is correct.
Static program analysis in the pres-
ence of pointers and heap memory
falsely reports many errors that cannot
really occur. For restricted parallelism
models with global synchronization,
this analysis becomes more tractable,
and a recently introduced technique
called “directed automated testing,”
or concolic unit testing, has shown
promise for improving software quality
through automated test generation us-
ing a combination of static and dynam-
ic analyses.21 The Par Lab combines
directed testing with model-checking
algorithms to unit-test parallel frame-
works and libraries composed with se-
rial contracts. Such techniques enable
programmers to quickly test executions
for data races and deadlocks directly,
since a combination of directed test
input generation and model checking
hijacks the underlying scheduler and
controls the synchronization primi-
tives. Our testing techniques will pro-
vide deterministic replay and debug-
ging capabilities at low cost. We will
also develop randomized extensions of
our directed testing techniques to build
a probabilistic model of path cover-
age. The probabilistic models will give
a more realistic estimate of coverage of
race and other concurrency errors in
parallel programs.
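The benefit of controlling the scheduler can be seen in a toy model. The Python sketch below treats two threads as two-step programs (read a shared counter, then write it back incremented), enumerates every interleaving deterministically, and reports the schedules that lose an update. It is a simplified illustration under assumed names (`interleavings`, `run`), not the directed-testing or model-checking machinery described above.

```python
from itertools import permutations

def interleavings(steps_a, steps_b):
    """Every distinct schedule of thread IDs that preserves each thread's own order."""
    ids = "A" * steps_a + "B" * steps_b
    return sorted(set(permutations(ids)))

def run(schedule):
    """Replay one schedule. Each thread does: tmp = counter; counter = tmp + 1."""
    counter = 0
    local = {}
    pc = {"A": 0, "B": 0}          # per-thread program counter
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = counter      # step 0: read the shared counter
        else:
            counter = local[tid] + 1  # step 1: write back the increment
        pc[tid] += 1
    return counter

schedules = interleavings(2, 2)
failing = [s for s in schedules if run(s) != 2]
print(len(schedules), "schedules,", len(failing), "lose an update")
for s in failing:
    print("".join(s))
```

Because each failing schedule is an explicit sequence of thread IDs, it can be replayed exactly, which is what makes deterministic replay and debugging cheap once the scheduler is under the tester's control.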
Parallelism for energy efficiency. While
the earlier computer classes—desktops
and laptops—reused the software of
their own earlier ancestors, the energy
efficiency for handheld operation may
need to come from data parallelism
in tasks that are currently executed se-
quentially, possibly from three sources:
Efficiency. Completing a task on slow
parallel cores will be more efficient than
completing it in the same time sequen-
tially on one fast core;
Energy amortization. Preferring data-
parallel algorithms over other styles of
parallelism, as SIMD and vector com-
puters amortize the energy expended
on instruction delivery; and
Energy savings. Message-passing pro-
grams may be able to save the energy
used by cache coherence.
We apply these principles in our work
on parallel Web browsers. In algorithm
design, we observe that to save energy
with parallelization, parallel algorithms
must be close to “work efficient,” that
is, they should perform no more total
work than a sequential algorithm, or
else parallelization is counterproduc-
tive. The same argument applies to op-
timistic parallelization. Work efficiency
is a demanding requirement, since, for
some “inherently sequential” problems,
like finite-state machines, only work-
inefficient algorithms are known. In this
context, we developed a nearly work-ef-
ficient algorithm for lexical analysis. We
are also working on data-parallel algo-
rithms for Web-page layout and identi-
fying parallelism in future Web-browser
applications, attempting to implement
them with efficient message passing.
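A small Python sketch shows why finite-state machines are awkward for work efficiency. A common way to parallelize one, assumed here for illustration, is to split the input into chunks and run each chunk from every possible start state, since the true start state of a chunk is unknown until the previous chunk finishes. The chunk runs are independent, but total work grows with the number of states. This is not the Par Lab lexing algorithm; the names and the three-state machine are hypothetical.

```python
def run_sequential(delta, start, text):
    """Run the automaton; return (final_state, transitions_performed)."""
    state, steps = start, 0
    for ch in text:
        state = delta[state][ch]
        steps += 1
    return state, steps

def run_chunked(delta, start, text, chunks):
    """Run each chunk from every state (independent work), then stitch the summaries."""
    size = max(1, -(-len(text) // chunks))  # ceiling division, at least 1
    steps = 0
    summaries = []
    for c in range(0, len(text), size):
        piece = text[c:c + size]
        summary = {}
        for s in delta:                      # one speculative run per possible start state
            end, n = run_sequential(delta, s, piece)
            summary[s] = end
            steps += n
        summaries.append(summary)
    state = start
    for summary in summaries:                # cheap sequential stitch
        state = summary[state]
    return state, steps

# Hypothetical 3-state machine: counts occurrences of 'a' modulo 3.
delta = {0: {"a": 1, "b": 0}, 1: {"a": 2, "b": 1}, 2: {"a": 0, "b": 2}}
text = "abba" * 25
seq_state, seq_steps = run_sequential(delta, 0, text)
par_state, par_steps = run_chunked(delta, 0, text, chunks=4)
print(seq_state, seq_steps, par_state, par_steps)
```

Both versions reach the same final state, but the chunked version performs three times as many transitions on this machine, so it saves energy only if the slower cores are more than three times as efficient per transition.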
Space-time partitioning for decon-
structed operating systems. Space-time
partitioning is crucial for manycore cli-
ent operating systems. A spatial partition
(partition for short) is an isolated unit
containing a subset of physical machine
resources (such as cores, cache parti-
tions, guaranteed fractions of memory
or network bandwidth, and energy
budget). Space-time partitioning virtu-
alizes spatial partitions by time-multi-
plexing whole partitions onto available
hardware but at a coarse-enough gran-
ularity to allow efficient programmer-
level scheduling in a partition.
The presence of space-time parti-
tioning leads to restructuring systems
Autotuned performance in GFLOPS/s on three kernels for dual-socket systems.

MPU Type        Intel e5345 Xeon         AMD 2356 Opteron X4      Sun 5140 UltraSPARC T2
                4 out-of-order cores,    4 out-of-order cores,    8 multithreaded cores,
Optimization    SpMV  Stencil  LBMHD     SpMV  Stencil  LBMHD     SpMV  Stencil  LBMHD
Standard        1.0   1.3      3.5       1.4   1.5      3.0       2.1   0.5      3.4
NUMA            1.0   —        3.5       2.4   2.6      3.7       3.5   0.5      3.8
Padding         —     1.3      4.5       —     3.1      5.8       —     0.5      3.8
Vectorization   —     —        4.6       —     —        7.7       —     —        9.7
Unrolling       —     1.7      4.6       —     3.6      8.0       —     0.5      9.7
Prefetching     1.1   1.7      4.6       2.9   3.8      8.1       3.6   0.5      10.5
Compression     1.5   —        —         3.6   —        —         4.1   —        —
$/TLB block     2.2   —        —         4.9   —        —         5.1   —        —
Collab Thread   —     —        —         —     —        —         —     6.7      —
SIMD            —     2.5      5.6       —     8.0      14.1      —     —        —
Final           1.5   2.5      5.6       3.6   8.0      14.1      4.1   6.7      10.5
services as a set of interacting distrib-
uted components. We propose a new
“deconstructed OS” called Tessellation
structured around space-time partition-
ing and two-level scheduling between
the operating system and application
runtimes. Tessellation implements
scheduling and resource management
at the partition granularity. Applica-
tions and OS services (such as file sys-
tems) run within their own partitions.
Partitions are lightweight and can be
resized or suspended with similar over-
heads to a process-context swap.
A key tenet of our approach is that
resources given to a partition are either
exclusive (such as cores or private cach-
es) or guaranteed via a quality-of-service
contract (such as a minimum fraction
of network or memory bandwidth).
During a scheduling quantum, the ap-
plication runtime within a partition
is given unrestricted “bare metal” access
to its resources and may schedule tasks
onto them in whatever way it chooses. Within
a partition, our approach has much in
common with the Exokernel.11 In the
common case, we expect many appli-
cation runtimes to be written as librar-
ies (similar to libOS). Our Tessellation
kernel is a thin layer responsible for
only the coarse-grain scheduling and
assignment of resources to partitions
and implementation of secure restrict-
ed communications among partitions.
The Tessellation kernel is much thin-
ner than traditional kernels or even
hypervisors. It avoids many of the per-
formance issues associated with tra-
ditional microkernels by providing OS
services through secure messaging to
spatially co-resident service partitions,
rather than context-switching to time-
multiplexed service processes.
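The division of labor in two-level scheduling can be sketched as a small Python model. Everything here is a hypothetical illustration rather than Tessellation's actual interface: a `Kernel` hands out whole cores exclusively and bandwidth as a guaranteed percentage, and each `Partition` then places its own tasks on the cores it owns (round-robin, chosen only for simplicity).

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    cores: frozenset        # cores owned exclusively by this partition
    mem_bw_percent: int     # guaranteed share of memory bandwidth

    def schedule(self, tasks):
        """Second level: the runtime inside the partition places tasks on its own cores."""
        cores = sorted(self.cores)
        plan = {c: [] for c in cores}
        for i, task in enumerate(tasks):
            plan[cores[i % len(cores)]].append(task)
        return plan

class Kernel:
    """First level: assigns coarse-grain resources to partitions and nothing finer."""
    def __init__(self, total_cores):
        self.free_cores = set(range(total_cores))
        self.bw_committed = 0
        self.partitions = {}

    def create(self, name, n_cores, mem_bw_percent):
        if n_cores > len(self.free_cores):
            raise ValueError("not enough free cores")
        if self.bw_committed + mem_bw_percent > 100:
            raise ValueError("bandwidth guarantee cannot be honored")
        cores = frozenset(sorted(self.free_cores)[:n_cores])
        self.free_cores -= cores
        self.bw_committed += mem_bw_percent
        part = Partition(name, cores, mem_bw_percent)
        self.partitions[name] = part
        return part

kernel = Kernel(total_cores=8)
browser = kernel.create("browser", n_cores=4, mem_bw_percent=50)
fs = kernel.create("file_system", n_cores=2, mem_bw_percent=30)
print(browser.schedule(["parse", "layout", "render", "script", "decode"]))
```

The kernel never sees individual tasks, and a partition never touches cores it does not own; that separation is what lets the kernel stay thin.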
Par Lab hardware tower. Past parallel
projects were often driven by the hard-
ware determining the application and
software environment. The Par Lab is
driven top down from the applications,
so the question this time is what should
architects do to help with the goals of
productivity, efficiency, correctness,
portability, and scalability?
Here are four examples of this kind
of help that illustrate our approach:
Supporting OS partitioning. Our hard-
ware architecture enforces partition-
ing of not only the cores and on-chip/
off-chip memory but also the commu-
nication bandwidth among these com-
ponents, providing quality-of-service
guarantees. The resulting performance
predictability improves parallel pro-
gram performance, simplifies code au-
totuning and dynamic load balancing,
supports real-time applications, and
simplifies scheduling.
Optional explicit control of the mem-
ory hierarchy. Caches were invented so
hardware could manage a memory hi-
erarchy without troubling the program-
mer. When it takes hundreds of clock
cycles to go to memory, programmers
and compilers try to reverse-engineer
the hardware controllers to make bet-
ter use of the hierarchy. This backward
situation is especially apparent for
hardware prefetchers when program-
mers try to create a particular pattern
that will invoke good prefetching. Our
approach aims to allow programmers
to quickly turn a cache into an explicitly
managed local store and the prefetch
engines into explicitly controlled Di-
rect Memory Access engines. To make it
easy for programmers to port software
to our architecture, we also support a
traditional memory hierarchy. The low-
overhead mechanism we use allows
programs to be composed of methods
that rely on local stores and methods
that rely on memory hierarchies.
Accurate, complete counters of perfor-
mance and energy. Sadly, performance
counters on current single-core com-
puters often miss important measure-
ments (such as prefetched data) or are
unique to a computer and only under-
standable by the machine’s designers.
We will include performance enhance-
ments in the Par Lab architecture only
if they have counters to measure them
accurately and coherently. Since energy
is as important as performance, we also
include energy counters so software can
improve both. Moreover, these coun-
ters must be integrated with the soft-
ware stack to provide insightful mea-
surements to the efficiency-layer and
productivity-layer programmers. Ide-
ally, this research will lead to a standard
for performance counters so schedul-
ers and software development kits can
count on them on any multicore.
Intuitive performance model. The
multicore diversity mentioned earlier
exacerbates the already difficult jobs
performed by programmers, compiler
writers, and architects. Hence, we de-
veloped an easy-to-understand visual
Center is pursuing deterministic mod-
els that allow programmers to reason
with sequential semantics for testing
while naturally exposing a parallel per-
formance model for WYSIWYG perfor-
mance. For reactive programs where
parallelism is part of the problem, it is
pursuing a shared-nothing approach
that leverages actor-like models used in
distributed systems. For application do-
mains that allow greater specialization,
it is developing a framework to gener-
ate domain-specific environments that
either hide concurrency or expose only
specialized forms of concurrency to the
end user while exploiting domain-spe-
cific optimizations and performance
measures. Initial applications and do-
mains include teleimmersion via “vir-
tual teleportation” (multimedia), dy-
namic real-time virtual environments
(computer graphics), learning by read-
ing, and authoring assistance (natural
language processing).
Stanford. The Pervasive Parallelism
Laboratory at Stanford University takes
an application-driven approach toward
parallel computing that extends from
programming models down to hard-
ware architecture. The key technical
concepts are domain-specific languag-
es for increasing programmer produc-
tivity and a common parallel runtime
environment combining dynamic and
static approaches for concurrency
and locality management. There are
domain-specific languages for artifi-
cial intelligence and robotics, business
data analysis, and virtual worlds and
gaming. The experimental platform
is the Flexible Architecture Research
Machine, or FARM, system, combining
commercial processors with FPGAs in
the memory fabric.
Georgia Tech. The Sony, Toshiba,
IBM Center of Competence for the Cell
Broadband Engine Processor (http://sti. at Georgia Tech focuses
on a single multicore computer, as its
name suggests. Researchers explore
versions of programs on Cell, includ-
ing image compression6 and financial
modeling.2 The Center also sponsors
workshops and provides remote access
to Cell hardware.
Rice University. The Habanero Multi-
core Software Project (http://habanero. at
model with built-in performance guide-
lines to identify bottlenecks in the dozen
dwarfs in Figure 3.29 The Roofline mod-
el plots computational and memory-
bandwidth limits, then determines the
best possible performance of a kernel
by examining the average number of op-
erations per memory access. It also plots
ceilings below the “roofline” to suggest
the optimizations that might be useful
for improving performance. One goal of
the performance counters should be to
provide everything needed to automatically
create Roofline models.
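The bound the Roofline model draws can be written as a one-line function. The Python sketch below assumes the usual formulation, in which a kernel's attainable performance is the smaller of the machine's peak compute rate and its peak memory bandwidth multiplied by the kernel's operational intensity (operations per byte of memory traffic). The machine numbers are hypothetical and are not taken from the table in this article.

```python
def roofline(peak_gflops, peak_bandwidth_gbs, intensity):
    """Upper bound on GFLOP/s for a kernel with the given flops-per-byte intensity."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity)

def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Intensity at which a kernel stops being memory-bound and becomes compute-bound."""
    return peak_gflops / peak_bandwidth_gbs

# Hypothetical machine: 75 GFLOP/s peak compute, 20 GB/s peak memory bandwidth.
PEAK, BW = 75.0, 20.0
for name, intensity in [("low-intensity kernel", 0.25), ("high-intensity kernel", 8.0)]:
    print(name, roofline(PEAK, BW, intensity), "GFLOP/s at most")
print("ridge point:", ridge_point(PEAK, BW), "flops/byte")
```

A kernel to the left of the ridge point is limited by memory bandwidth, so optimizations that reduce memory traffic matter most; to the right, compute optimizations do.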
A notable challenge from our ear-
lier description of the hardware tower
is how to rapidly innovate at the hard-
ware/software interface, when it can
take four to five years to build chips and
run programs needed to evaluate them.
Given the capacity of field-programma-
ble gate arrays (FPGAs), researchers can
prototype full hardware and software
systems that run fast enough to inves-
tigate architectural innovations. This
flexibility means researchers can “tape
out” every day, rather than over years.
We will leverage the Research Accelera-
tor for Multiple Processors (RAMP) Proj-
ect ( to
build flexible prototypes fast enough to
run full software stacks—including new
operating systems and our five com-
pelling applications—to enable rapid
architecture innovation using future
prototype software, rather than past
Reasons for Optimism
Given the history of parallel comput-
ing, it’s easy to be pessimistic about our
chances. The good news is that there
are plausible reasons researchers could
succeed this time:
No killer microprocessor. Unlike in the
past, no one is building the faster serial
microprocessor; programmers need-
ing more performance have no option
other than parallel hardware;
New measures of success. Rather than
the traditional goal of linear speedup
for all software as the number of pro-
cessors increases, success can reflect
improved responsiveness or MIPS/Joule
for a few new parallel killer apps;
All the wood behind one arrow. As
there is no alternative, the whole IT in-
dustry is committed, meaning many
more people and companies are work-
ing on the problem;
Manycore synergy with cloud comput-
ing. SaaS applications in data centers
with millions of users are naturally par-
allel and thus aligned with manycore,
even if client apps are not;
Vitality of open source software. The
OSS community is a meritocracy, so it’s
likely to embrace technical advances
rather than be limited by legacy code.
Though OSS has existed for years, it is
more important commercially today
than it was;
Single-chip multiprocessors enable in-
novation. Having all processors on the
same chip enables inventions that were
impractical or uneconomical when
spread across many chips; and
FPGA prototypes shorten the hard-
ware/software cycle. Systems like RAMP
help researchers explore designs of
easy-to-program manycore architec-
tures and build prototypes more quickly
than they ever could with conventional
hardware prototypes.
Given the importance of the chal-
lenges to our shared future in the IT
industry, pessimism is not a sufficient
excuse to sit on the sidelines. The sin is
not lack of success but lack of effort.
Related Projects
Computer science hasn’t solved the
parallel challenge, though not because
it hasn’t tried. There could be a dozen
conferences dedicated to parallelism,
including Principles and Practice of Par-
allel Programming, Parallel Algorithms
and Architectures, Parallel and Distrib-
uted Processing, and Supercomputing.
All traditionally focus on high-perfor-
mance computing; the target hardware
is usually large-scale computers with
thousands of microprocessors. Simi-
larly, there are many high-performance
computing research centers. Rather
than review this material, here we high-
light four centers focused on multicore
computers and their approaches to the
parallel challenge in academia:
Illinois. The Universal Parallel Com-
puting Research Center (http://www. at the University of
Illinois focuses on making it easy for do-
main experts to take advantage of paral-
lelism, so the emphasis is more on pro-
ductivity in specific domains than on
generality or performance.1 It relies on
advancing compiler technology to find
opportunities for parallelism, whereas
the Par Lab focuses on autotuning. The
Rice University is developing languages,
compilers, managed runtimes, concur-
rency libraries, and tools that support
portable parallel abstractions with high
productivity and high performance for
multicores; examples include parallel
language extensions25 and optimized
synchronization primitives.24
We’ve provided a general view of the
parallel landscape, suggesting that the
goal of computer science should be
making parallel computing productive,
efficient, correct, portable, and scal-
able. We highlighted the importance
of finding new compelling applica-
tions and the advantages of manycore
and heterogeneous hardware. We also
described the research of the Berkeley
Par Lab. While it will take years to learn
which of our ideas work well, we share it
here as a concrete example of a coordi-
nated attack on the problem.
Unlike the traditional approach of
making hardware king, the Par Lab
is application-driven, working with
domain experts to create compelling
applications in music, image- and
speech-recognition, health, and par-
allel browsers.
The software span connecting ap-
plications to hardware relies more on
parallel software architectures than
on parallel programming languages.
Instead of traditional optimizing com-
pilers, we depend on autotuners, us-
ing a combination of empirical search
and performance modeling to create
highly optimized libraries tailored to
specific machines. By splitting the soft-
ware stack into a productivity layer and
an efficiency layer and targeting them
at domain experts and programming
experts respectively, we hope to bring
parallel computing to all programmers
while keeping domain experts produc-
tive and allowing expert programmers
to achieve maximum efficiency. Our ap-
proach to correctness relies on verifica-
tion where possible, then uses the same
tools to reduce the amount of testing
where verification is not possible.
The hardware tower of the Par Lab
serves the software span and applica-
tion tower. Examples of such service
include support for OS partitioning, ex-
plicit control for the memory hierarchy,
accurate measurement for performance
and energy, and an intuitive, multicore
10. Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E.,
Petitet, A., Vuduc, R., Whaley, R., and Yelick, K. Self-
adapting linear algebra algorithms and software.
Proceedings of the IEEE, Special Issue on Program
Generation, Optimization, and Adaptation 93, 2 (Feb.
2005), 293–312.
11. Engler, D.R. Exokernel: An operating system
architecture for application-level resource
management. In Proceedings of the 15th Symposium
on Operating Systems Principles (Copper Mountain, CO,
Dec. 3–6, 1995), 251–266.
12. Gamma, E. et al. Design Patterns: Elements of
Reusable Object-Oriented Software. Addison-Wesley
Professional, Reading, MA, 1994.
13. Gelernter, D. and Carriero, N. Coordination languages
and their significance. Commun. ACM 35, 2 (Feb. 1992),
14. Henzinger, T.A. et al. Permissive interfaces. In
Proceedings of the 10th European Software Engineering
Conference (Lisbon, Portugal, Sept. 5–9). ACM Press,
New York, 2005, 31–40.
15. Hill, M. and Marty, M. Amdahl’s Law in the multicore
era. IEEE Computer 41, 7 (2008), 33–38.
16. International Technology Roadmap for
Semiconductors. Executive Summary, 2005 and 2007;
17. Kantowitz, B. and Sorkin, R. Human Factors:
Understanding People-System Relationships. John
Wiley & Sons, Inc., New York, 1983.
18. Mattson, T., Sanders, B., and Massingill, B. Patterns for
Parallel Programming. Addison-Wesley Professional,
Reading, MA, 2004.
19. O’Hanlon, C. A conversation with John Hennessy and
David Patterson. Queue 4, 10 (Dec. 2005/Jan. 2006),
20. Patterson, D. and Hennessy, J. Computer Organization
and Design: The Hardware/Software Interface, Fourth
Edition. Morgan Kaufmann Publishers, Boston, MA, Nov.
21. Sen, K. and Viswanathan, M. Model checking
multithreaded programs with asynchronous atomic
methods. In Proceedings of the 18th International
Conference on Computer-Aided Verification (Seattle,
WA, Aug. 17–20, 2006).
22. Sen, K. et al. CUTE: A concolic unit testing engine for
C. In Proceedings of the Fifth Joint Meeting European
Software Engineering Conference (Lisbon, Portugal,
Sept. 5–9). ACM Press, New York, 2005, 263–272.
23. Shaw, M. and Garlan, D. An Introduction to Software
Architecture. Technical Report CMU/SEI-94-TR-21,
ESC-TR-94-21. CMU Software Engineering Institute,
Carnegie Mellon University, Pittsburgh, PA, 1994.
24. Shirako, J., Peixotto, D., Sarkar, V., and Scherer,
W. Phasers: A unified deadlock-free construct for
collective and point-to-point synchronization. In
Proceedings of the 22nd ACM International Conference
on Supercomputing (Island of Kos, Greece, June 7–12).
ACM Press, New York, 2008, 277–288.
25. Shirako, J., Kasahara, H., and Sarkar, V. Language
extensions in support of compiler parallelization. In
Proceedings of the 20th Workshop on Languages and
Compilers for Parallel Computing (Urbana, IL, Oct.
11–13). Springer-Verlag, Berlin, 2007, 78–94.
26. Thomas, D. et al. Agile Web Development with Rails,
Second Edition. The Pragmatic Bookshelf, Raleigh, NC,
27. UPC Language Specifications, Version 1.2. Technical
Report LBNL-59208. Lawrence Berkeley National
Laboratory, Berkeley, CA, 2005.
28. Wawrzynek, J. et al. RAMP: Research Accelerator for
Multiple Processors. IEEE Micro 27, 2 (Mar. 2007),
29. Williams, S., Waterman, A., and Patterson, D. Roofline:
An insightful visual performance model for floating-
point programs and multicore architectures. Commun.
ACM 52, 4 (Apr. 2009), 65–76.
30. Williams, S. et al. Lattice Boltzmann simulation
optimization on leading multicore platforms. In
Proceedings of the 22nd IEEE International Parallel
and Distributed Processing Symposium (Miami, FL, Apr.
14–18, 2008).
31. Williams, S. et al. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In
Proceedings of the Supercomputing (SC07) Conference
(Reno, NV, Nov. 10–16). ACM Press, New York, 2007.
The authors are all affiliated with the Par Lab (http://parlab. at the University of California, Berkeley.
© 2009 ACM 0001-0782/09/1000 $10.00
performance model. We also plan to try
to scrape off the barnacles that have ac-
cumulated on the hardware/software
stack over the years.
This parallel challenge offers the
worldwide research community an op-
portunity to help IT remain a growth
industry, sustain the parts of the world-
wide economy that depend on the con-
tinuous improvement in IT cost-per-
formance, and take a once-in-a-career
chance to reinvent the whole software/
hardware stack. Though there are rea-
sons for optimism, the difficulty of the
challenge is reflected in the numerous
parallel failures of the past.
Combining upside and downside,
this research challenge represents the
most significant of all IT challenges
over the past 50 years. We hope many
more innovators will join this quest to
build a parallel bridge.
This research is sponsored by the Uni-
versal Parallel Computing Research
Center, which is funded by Intel and
Microsoft (Award # 20080469) and by
matching funds from U.C. Discovery
(Award #DIG07-10227). Additional
support comes from the Par Lab Affili-
ate companies: National Instruments,
NEC, Nokia, NVIDIA, and Samsung.
We wish to thank our colleagues in the
Par Lab and the Lawrence Berkeley Na-
tional Laboratory collaborations who
shaped these ideas.
1. Adve, S. et al. Parallel Computing Research at Illinois:
The UPCRC Agenda. White Paper. University of Illinois,
Urbana-Champaign, IL, Nov. 2008.
2. Agarwal, V., Liu, L.-K., and Bader, D. Financial modeling
on the Cell broadband engine. In Proceedings of 22nd
IEEE International Parallel and Distributed Processing
Symposium (Miami, FL, Apr. 14–18, 2008).
3. Alexander, C. et al. A Pattern Language: Towns,
Buildings, Construction. Oxford University Press, 1997.
4. Asanovic, K. et al. The Parallel Computing Laboratory
at U.C. Berkeley: A Research Agenda Based on the
Berkeley View. UCB/EECS-2008-23, University of
California, Berkeley, Mar. 21, 2008.
5. Asanovic, K. et al. The Landscape of Parallel Computing
Research: A View from Berkeley. UCB/EECS-2006-183,
University of California, Berkeley, Dec. 18, 2006.
6. Bader, D.A. and Patel, S. High-performance MPEG-2
software decoder on the Cell broadband engine. In
Proceedings of the 22nd IEEE International Parallel
and Distributed Processing Symposium (Miami, FL, Apr.
14–18, 2008).
7. Buschmann, F. et al. Pattern-Oriented Software
Architecture: A System of Patterns. John Wiley & Sons,
Inc., New York, 1996.
8. Clarke, D.G. et al. Ownership types for flexible alias
protection. In Proceedings of the OOPSLA Conference
(Vancouver, BC, Canada, 1998), 48–64.
9. Datta, K. et al. Stencil computation optimization and
autotuning on state-of-the-art multicore architectures.
In Proceedings of the ACM/IEEE Supercomputing (SC)
2008 Conference (Austin, TX, Nov. 15–21). IEEE Press,
Piscataway, NJ, 2008.
... • once the overall speedup approaches 1 1−P , it is worth investigating the new bottleneck in 1 − P, rather than refining the optimization of P; ...
... • as N tends to infinity, the maximum overall speedup tends to be 1 1−P . In other words, after a certain threshold adding more processors to a computing system will led to insignificants benefits. ...
... Proof. Base case K = 1: If K = 1, the left side is 2 1 2 = 1 and the right side is 2 1 − 1 = 1. So, the theorem holds when K = 1. ...
Full-text available
From embedded systems to desktop computers, and of course HPC (High Performance Computing) solutions, computing resources today are most of all based on multi-core / many-core architectures. While the presence of parallel hardware is ubiquitous, applications that exploit its full potential are still difficult to write. One particular mention is to Graphics Processing Units (GPUs) programming. Thanks to its data-parallel oriented architecture, a GPU can achieve a higher throughput in terms of floating point operations in time unit and memory bandwidth compared to an off-the-shelf CPU with similar power consumption and cost. Nevertheless, a GPU naïve implementation could be so inefficient as to lose orders of magnitude of performance compared to its optimized counterpart. For this reason, it is fundamental to have enough experience on the reference architecture to provide an optimal solution and make the switch from CPU to GPU advantageous. A pattern language defines a structured collection of design practices within a field of expertise. In the past, pattern languages were proven to be an effective way to communicate experience and help researchers and developers to reduce the learning curve over a particular expertise field. In the field of parallel programming, much work has been done to provide a composable set of patterns that could be used to design an algorithm in a way that makes it completely hardware-agnostic and flawlessy integrable inside algorithmic skeleton frameworks, which actually care of producing optimized code for a target architecture or a heterogenous platform. While algorithmic skeleton frameworks are in many cases portable and efficient, a number of common applications had to be retrofitted to provide good performance on GPUs; this shows the need for the novice developer to get well acquainted with the details of the platform. 
In this dissertation we present a new pattern language, SIMPL (SIMt Pattern Language), that is solely dedicated to the development of optimized code on a SIMT (single-instruction multiple-thread) architecture, which models a modern GPU. To the best of our knowledge, this is the first pattern language exclusively dedicated to General Purpose computing on GPUs (GPGPU). This language is currently made by 16 patterns, structured into 5 categories, and gathers the experience we made on this platform so far, presenting it in a reusable form. Among those patterns, we place particular emphasis on the original approaches that constitute our main contribution to the research field. We discuss in detail a set of case studies which involve the application of our pattern language. Specifically, we describe the implementation of the sparse matrix-vector multiply routine, reviewing the available literature and discussing our own approach to the problem, together with pointers to available software. As our main contribution, we propose three novel matrix storage formats, ELL-G and HLL which were derived from ELL, and HDIA for matrices having mostly a diagonal sparsity pattern. We compare the performance of the proposed formats to the results provided by the state-of-the-art formats with experiments realized on different GPU platforms and test matrices coming from various application domains. Furthermore, we implement the reversal of MD5 and SHA1 hash functions on a cluster of Nvidia GPUs. Our CUDA implementation achieves comparable or even better average performance results when compared to other popular password cracking software, reaching near-maximal throughput over different GPU architectures. Finally, we present the GPU implementation of a broad-phase collision detection algorithm for particles simulation, which uses a uniform grid as spatial partitioning scheme. 
In some tests our original approach achieves a speedup of 2 compared to the fastest known method supporting a fixed maximum number of elements per cell, and a speedup of 7 compared with the fastest method without such a constraint.
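For reference, the baseline ELL (ELLPACK) layout from which ELL-G and HLL are said to be derived pads every row to the same number of stored entries, which gives each row a regular access pattern. The following is a minimal CPU-side Python sketch of that baseline layout only, not of ELL-G, HLL, or HDIA, whose details the abstract does not give. The function names `to_ell` and `ell_spmv` and the padding convention (value 0.0, column 0) are illustrative assumptions.

```python
def to_ell(dense):
    """Convert a dense matrix (list of row lists) to ELL arrays.

    Returns (values, col_idx, width). Every row is padded to `width`
    entries, using value 0.0 and column index 0 as padding (assumed
    convention for this sketch).
    """
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    col_idx = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return values, col_idx, width


def ell_spmv(values, col_idx, width, x):
    """Compute y = A @ x for a matrix A stored in ELL form."""
    y = []
    # On a GPU, each iteration of this outer loop would typically map
    # to one thread; here it is a plain sequential loop.
    for row_vals, row_cols in zip(values, col_idx):
        acc = 0.0
        for k in range(width):
            # Padded entries contribute 0.0 * x[0] and leave acc unchanged.
            acc += row_vals[k] * x[row_cols[k]]
        y.append(acc)
    return y
```

The padding is what makes the format wasteful for matrices whose rows differ widely in length, which is one plausible motivation for derived formats.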
... Processing large quantities of data requires parallel computing, currently carried out on large clusters of commodity computers [2], [3]. One of the most popular parallel programming models today is MapReduce [4], which emerged as an alternative to traditional parallel programming models with the goal of simplifying programming, allowing the programmer to focus on developing the task at hand rather than on the details of parallelizing the computation. ...
Conference Paper
Ever-growing quantities of data, known as Big Data, are a fact of the real world and a challenge in terms of data processing. Sorting is one of the most common tasks in computing, and sorting large volumes of data is a necessity in many processes. The MapReduce parallel programming model has been widely adopted for processing large-scale data on computer clusters. In this paper we present the implementation of two parallel sorting algorithms, Parallel Quicksort and Sample Sort. Both were implemented in the MapReduce/Hadoop programming environment and tested for their performance in sorting data distributed across several machines. A variety of experiments reveals the behavior of both algorithms and indicates that Sample Sort performs better.
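The general idea of sample sort (Ordenação por Amostragem) can be sketched as follows: draw a sample of the keys, pick splitters from it, route every record to the bucket its key falls into, and sort each bucket independently. This is a minimal single-process Python sketch under assumed parameters (`num_buckets`, `sample_size`, `seed` are hypothetical names); it does not use Hadoop and is not the implementation evaluated in the paper.

```python
import bisect
import random


def sample_sort(data, num_buckets, sample_size=32, seed=0):
    """Sort the list `data` by sampling splitters, bucketing, and sorting buckets."""
    if num_buckets < 2 or len(data) < 2:
        return sorted(data)
    rng = random.Random(seed)
    # Draw a sample and pick num_buckets - 1 evenly spaced splitters.
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    step = len(sample) / num_buckets
    splitters = [sample[int(step * i)] for i in range(1, num_buckets)]
    # "Map" phase: route each record to the bucket for its key range.
    buckets = [[] for _ in range(num_buckets)]
    for item in data:
        buckets[bisect.bisect_right(splitters, item)].append(item)
    # "Reduce" phase: buckets are sorted independently, so in a
    # distributed setting each could be handled by a separate worker.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Because the buckets cover disjoint, ordered key ranges, concatenating the sorted buckets yields a globally sorted sequence without a final merge step.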
... In particular for mobile devices, there are hard limits for both energy (battery capacity) and power (maximum heat dissipation). However, over the last decades we have seen more and more heterogeneity in data centers as well [1,5]. Examples are general-purpose graphics processing units (GPGPUs), the Intel Xeon Phi accelerator cards, and the field-programmable gate array (FPGA)-based Amazon EC2 F1 instances released in 2017. ...
Heterogeneous, accelerator-enhanced computing architectures are a common solution in embedded computing, mainly due to constraints on energy and power efficiency. Such accelerator-enhanced systems dispatch data- and compute-intensive tasks to specialized, optimized, and thus efficient hardware units, leaving most control-flow tasks to the more generic but less efficient central processing units (CPUs). Nowadays, high-performance computing (HPC) systems are also becoming more heterogeneous by incorporating accelerators into the computing nodes. In this chapter, we introduce the concept of heterogeneous computing and present the design of a hardware accelerator for solving the Link Assessment (LA) problem, introduced in Chapter 3. The hardware accelerator integrates its main dedicated processing units with a customized cache design and a lightweight data path. We provide detailed area, energy, and timing results for a 28 nm application-specific integrated circuit (ASIC) process and DDR3 memory devices. Compared to a CPU-based cluster, our proposed solution uses 38x less memory and is 1030x more energy efficient for processing a users-movies dataset with half a million edges.
... In other words, a PAP could achieve much better overall performance than any of its member algorithms. Third, considering the tremendous growth of parallel computing architectures [23] (e.g., multi-core CPUs) over the last few decades, leveraging parallelism has become very important in designing effective solvers for hard optimization problems [24][25][26][27][28]. PAPs employ parallel solution strategies and thus allow using modern computing facilities in an extremely simple way. ...
It has been widely observed that there exists no universal best Multi-objective Evolutionary Algorithm (MOEA) dominating all other MOEAs on all possible Multi-objective Optimization Problems (MOPs). In this work, we advocate using the Parallel Algorithm Portfolio (PAP), which runs multiple MOEAs independently in parallel and gets the best out of them, to combine the advantages of different MOEAs. Since the manual construction of PAPs is non-trivial and tedious, we propose to automatically construct high-performance PAPs for solving MOPs. Specifically, we first propose a variant of PAPs, namely MOEAs/PAP, which can better determine the output solution set for MOPs than conventional PAPs. Then, we present an automatic construction approach for MOEAs/PAP with a novel performance metric for evaluating the performance of MOEAs across multiple MOPs. Finally, we use the proposed approach to construct a MOEAs/PAP based on a training set of MOPs and an algorithm configuration space defined by several variants of NSGA-II. Experimental results show that the automatically constructed MOEAs/PAP can even rival the state-of-the-art multi-operator-based MOEAs designed by human experts, demonstrating the huge potential of automatic construction of PAPs in multi-objective optimization.
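The basic portfolio idea described above (run several algorithms independently and keep the best outcome) can be illustrated with a deliberately simplified sketch. It assumes a single-objective minimization problem and two toy member algorithms, `random_search` and `hill_climb`, all of which are hypothetical stand-ins; it does not reproduce MOEAs/PAP, its multi-objective output selection, or the automatic construction approach.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def random_search(objective, bounds, budget, seed):
    """Toy member algorithm: uniform random sampling within bounds."""
    rng = random.Random(seed)
    lo, hi = bounds
    return min((rng.uniform(lo, hi) for _ in range(budget)), key=objective)


def hill_climb(objective, bounds, budget, seed):
    """Toy member algorithm: keep Gaussian steps that improve the objective."""
    rng = random.Random(seed)
    lo, hi = bounds
    best = rng.uniform(lo, hi)
    for _ in range(budget):
        cand = min(hi, max(lo, best + rng.gauss(0.0, 0.1 * (hi - lo))))
        if objective(cand) < objective(best):
            best = cand
    return best


def run_portfolio(members, objective, bounds, budget):
    """Run every member independently and return the best solution found."""
    with ThreadPoolExecutor(max_workers=len(members)) as pool:
        # Each member gets its own seed (its index) and shares no state.
        futures = [pool.submit(m, objective, bounds, budget, seed)
                   for seed, m in enumerate(members)]
        results = [f.result() for f in futures]
    return min(results, key=objective)
```

Threads are used here only to keep the sketch self-contained; in CPython they do not give true CPU parallelism, so a practical portfolio would more likely use separate processes or machines.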
... Because of these difficulties, the classic paradigm neglected the transfer time, so the blocking constraint in the classical paradigm means only logical dependence. Technological computing is based on the Hardware-Software contract (Asanovic 2009): mathematics provides the solid theoretical basis for computing but neglects the data transfer time, and technology must adapt itself to the interface defined by von Neumann three-quarters of a century ago, and for [the timing relations of] vacuum tubes only. ...
In all kinds of implementations of computing, whether technological or biological, some material carrier for the information exists, so in real-world implementations, the propagation speed of information cannot exceed the speed of its carrier. Because of this limitation, one must also consider the transfer time between computing units for any implementation. We need a different mathematical method to consider this limitation: classic mathematics can only describe infinitely fast and small computing system implementations. The difference between mathematical handling methods leads to different descriptions of the computing features of the systems. The proposed handling also explains why biological implementations can have lifelong learning and technological ones cannot. Our conclusion about learning matches published experimental evidence, both in biological and technological computing.
Conference Paper
Modern VLSI (Very-Large-Scale Integration) integrated circuits contain several billion transistors. Systems of this complexity are very difficult to design. Manually designing each transistor at the level of logic gates is beyond the capabilities of even an expert team, and so is manual verification of the chip design. With the increasing complexity of electronic systems, there is a need to automate both the design and verification stages at more abstract levels. This paper describes the concept of a multi-level compiler that converts an algorithm described in Python into an FPGA bitstream. The compiler automatically transforms the high-level description (Python) into the low-level one (bitstream) through several levels, based on configuration files. Testing is also done automatically at several levels of abstraction. This process automation reduces design and testing time. The article proposes a VHDL microinstruction compiler and test tools, with examples of their use.
The fixed route dial-a-ride problem (FRDARP) is a variant of the famous dial-a-ride problem, in which all the requests are chosen between terminals that are located along a fixed route. A reduction to the shortest path problem enables finding an optimal solution for FRDARP in polynomial time. However, the basic graph construction ends up with a huge graph, which makes the reduction impractical due to its memory consumption. To this end, we propose several pruning heuristics that enable us to considerably reduce the size of the graph through its dynamic construction. Additionally, we utilize the special features of the problem to apply parallelization to the graph traversal process. Our experiments show that each of the proposed heuristics on its own improves the practical solvability of FRDARP. Moreover, using them together is considerably more efficient than any single heuristic. Finally, the experiments confirm the efficiency of our suggested parallelization policy.
Technical Report
The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met for nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:
- The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
- The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
- Instead of traditional benchmarks, use 13 "Dwarfs" to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.)
- "Autotuners" should play a larger role than conventional compilers in translating parallel programs.
- To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
- To be successful, programming models should be independent of the number of processors.
- To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
- Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
- Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
- To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost.
Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
This work organizes all of parallel programming into a set of design patterns. It was up to date as of 2004, when programming was largely restricted to the CPU. It does not, however, address data-parallel hardware (vector units and GPUs). Hence, for CPU programming this is still a great book. For GPUs, however, you'll need to wait for the next edition.
Conference Paper
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications on scientific algorithm development.
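To make "stencil (nearest-neighbor) computation" concrete, the sketch below applies one sweep of a 5-point averaging stencil to a 2D grid in plain Python. The 2D 5-point shape and the function name `jacobi_sweep` are assumptions chosen for brevity; the abstract does not specify the stencil used, and none of the optimizations or auto-tuning it describes are shown here.

```python
def jacobi_sweep(grid):
    """Apply one 5-point averaging stencil sweep to the interior of `grid`.

    `grid` is a non-empty list of equal-length rows. Boundary cells are
    copied unchanged, and a new grid is returned (out-of-place update).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point depends only on its four nearest
            # neighbors from the previous iteration.
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out
```

Because every output point reads only the previous grid, all interior points can be updated independently, which is what makes this class of kernels a natural fit for multicore hardware and typically limited by memory traffic rather than arithmetic.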
We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
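Assuming this abstract refers to the Roofline model, its central bound can be written as a one-line function: attainable floating-point performance is the smaller of the machine's peak compute rate and the product of peak memory bandwidth and the kernel's operational intensity (flops per byte of memory traffic). The parameter names and the numbers in the usage note are hypothetical.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, operational_intensity):
    """Upper bound on performance: min(compute roof, memory roof).

    peak_gflops            -- peak floating-point rate (GFLOP/s)
    peak_bandwidth_gbs     -- peak memory bandwidth (GB/s)
    operational_intensity  -- flops performed per byte moved (FLOP/byte)
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)
```

For a hypothetical machine with a 100 GFLOP/s peak and 20 GB/s of bandwidth, a kernel at 0.5 FLOP/byte is bounded by memory at 10 GFLOP/s, while a kernel at 10 FLOP/byte reaches the compute roof.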