Parakeet: A Just-In-Time Parallel Accelerator for Python
Alex Rubinsteyn Eric Hielscher Nathaniel Weinman Dennis Shasha
Computer Science Department, New York University, New York, NY, 10003
{alexr,hielscher,nsw233,shasha}@cs.nyu.edu
Abstract
High level productivity languages such as Python or Mat-
lab enable the use of computational resources by non-
expert programmers. However, these languages often sac-
rifice program speed for ease of use.
This paper proposes Parakeet, a library which provides
a just-in-time (JIT) parallel accelerator for Python. Para-
keet bridges the gap between the usability of Python and
the speed of code written in efficiency languages such as
C++ or CUDA. Parakeet accelerates data-parallel sections
of Python that use the standard NumPy scientific comput-
ing library. Parakeet JIT compiles efficient versions of
Python functions and automatically manages their execu-
tion on both GPUs and CPUs. We assess Parakeet on a
pair of benchmarks and achieve significant speedups.
1 Introduction
Numerical computing is an indispensable tool to profes-
sionals in a wide range of fields, from the natural sciences
to the financial industry. Often, users in these fields ei-
ther (1) aren’t expert programmers; or (2) don’t have time
to tune their software for performance. These users typi-
cally prefer to use productivity languages such as Python
or Matlab rather than efficiency languages such as C++.
Productivity languages facilitate non-expert programmers
by trading off program speed for ease of use [23].
A common problem, however, is that the performance
tradeoff is often very stark – code written in Python
or Matlab [19] often has much worse performance than
code written in C++ or Fortran. This problem is get-
ting worse, as modern processors (multicore CPUs as
well as GPUs) are all parallel, and current implementa-
tions of productivity languages are poorly suited for par-
allelism. Thus a common workflow involves prototyping
algorithms in a productivity language, followed by port-
ing the performance-critical sections to a lower level lan-
guage. This second step can be time-consuming, error-
prone, and it diverts energy from the real focus of these
users’ work.
In this paper, we present Parakeet, a library
that provides a JIT parallel accelerator for NumPy,
the commonly-used scientific computing library for
Python [22]. Parakeet accelerates performance-critical
sections of numerical Python programs to be competitive
with efficiency language code, obviating the need for the
above-mentioned “prototype, port” cycle.
The Parakeet library intercepts programmer-marked
functions and uses high-level operations on NumPy ar-
rays (e.g. mapping a function over the array’s elements)
as sources of parallelism. These functions are just-in-time
compiled to either x86 machine code using LLVM [17],
or GPU code that can be executed on NVIDIA GPUs via
the CUDA framework [20]. These native versions of the
functions are then automatically executed on the appropri-
ate hardware. Parakeet allows complete interoperability
with all of the standard Python tools and libraries.
Parakeet currently supports JIT compilation to paral-
lel GPU programs and single-threaded CPU programs.
While Parakeet is a work in progress, our current results
clearly demonstrate its promise.
2 Overview
Parakeet is an accelerator library for numerical Python al-
gorithms written using the NumPy array extensions [22].
Parakeet does not replace the standard Python runtime but
rather selectively augments it. To run a function within
Parakeet a user must wrap it with the decorator @PAR. For
example, consider the following NumPy code for averag-
ing the value of two arrays:
@PAR
def avg(x, y):
    return (x + y) / 2.0
If the decorator @PAR were removed, then avg would run
as ordinary Python code. Since NumPy’s library func-
tions are compiled separately they always allocate result
arrays (even when the arrays are immediately consumed).
By contrast, Parakeet specializes avg for each distinct input
type, optimizes its body into a single fused map (avoiding
unnecessary allocation), and executes it as parallel native
code.
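To make the contrast concrete, the following pure-Python sketch (ours, not Parakeet's generated code) shows the temporary that separately compiled NumPy operators allocate, next to the single fused loop that a specialized avg effectively becomes:

import numpy as np

def avg_numpy(x, y):
    tmp = x + y        # NumPy's add kernel allocates a temporary array
    return tmp / 2.0   # the divide kernel allocates the final result

def avg_fused(x, y):
    # one pass over the data, one output allocation, no temporary:
    # roughly the loop that a fused map compiles down to
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = (x[i] + y[i]) / 2.0
    return out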
Parakeet is not meant as a general-purpose accelerator
for all Python programs. Rather, it is designed to execute
array-oriented numerical algorithms such as those found
in machine learning, financial computing, and scientific
simulation. In particular, the sections of code that Para-
keet accelerates must obey the following constraints:
• Due to the difficulty of implementing efficient non-uniform data structures on the GPU, we require all values within Parakeet to be either scalars or NumPy arrays. No dictionaries, sets, or user-defined objects are allowed.

• To compile Python into native code we must assign types to each expression. We are still able to retain some of Python's polymorphism by specializing different typed versions of a function for each distinct set of argument types. However, expressions whose types depend on dynamic values are disallowed (e.g. 43 if bool_val else "sausage").

• Only functions which don't modify global state or perform I/O can be executed in parallel. Local mutable variables are always allowed.
A Parakeet function cannot call any other function which
violates these restrictions or one which is not imple-
mented in Python. To enable the use of NumPy li-
brary functions Parakeet must provide equivalent func-
tions written in Python. In general, these restrictions
would be onerous if applied to an entire program but Para-
keet is only intended to accelerate the computational core
of an algorithm. All other code is executed as usual by the
Python interpreter.
Though Parakeet supports loops, it does not parallelize
them in any way. Parallelism is instead achieved through
the use of the following data parallel operators:
map(f, X1, ..., Xn, fixed=[], axis=None)
Apply the function f to each element of the array arguments. By default, f is passed each scalar element of the array arguments. The axis keyword can be used to specify a different iteration pattern (such as applying f to all columns). The fixed keyword is a list of closure arguments for the function f.

allpairs(f, X1, X2, fixed=[], axis=0)
Apply the function f to each pair of elements from the arrays X1 and X2.

reduce(f, X1, ..., Xn, fixed=[], axis=None, init=None)
Combine all the elements of the array arguments using the (n+1)-ary commutative function f. The init keyword is an optional initial value for the reduction. Examples of reductions are the NumPy functions sum and product.

scan(f, X1, ..., Xn, fixed=[], axis=None, init=None)
Combine all the elements of the array arguments and return an array containing all cumulative intermediate values of the combination. Examples of scans are the NumPy functions cumsum and cumprod. (A pure-Python sketch of the sequential semantics of these operators follows the list.)
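The following pure-Python sketch (ours, for exposition) pins down the sequential semantics of these operators in the simple case of one-dimensional arrays with axis=None; multi-array reductions are elided, and Parakeet itself compiles the operators to parallel native code rather than running loops like these:

import numpy as np

def seq_map(f, *arrays, fixed=[]):
    # f receives the fixed closure arguments followed by one
    # element from each array argument
    return np.array([f(*fixed, *elts) for elts in zip(*arrays)])

def seq_allpairs(f, X1, X2, fixed=[]):
    return np.array([[f(*fixed, x, y) for y in X2] for x in X1])

def seq_reduce(f, X, init=None):
    acc = init
    for x in X:
        acc = x if acc is None else f(acc, x)
    return acc

def seq_scan(f, X, init=None):
    # like reduce, but record every cumulative intermediate value
    out, acc = [], init
    for x in X:
        acc = x if acc is None else f(acc, x)
        out.append(acc)
    return np.array(out)

For example, seq_reduce(lambda a, b: a + b, np.array([1, 2, 3])) yields 6, and seq_scan with the same arguments yields [1, 3, 6], matching NumPy's sum and cumsum.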
For each occurrence of a data parallel operator in a pro-
gram, Parakeet may choose to synthesize parallel code
which implements that operator combined with its func-
tion argument. It is not always necessary, however, to ex-
plicitly use one of these operators in order to achieve par-
allelization. Parakeet implements NumPy’s array broad-
casting semantics by implicitly inserting calls to map into
a user’s code. Furthermore, NumPy library functions are
reimplemented in Parakeet using the above data parallel
operators and thus expose opportunities for parallelism.
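As an illustration (a sketch of the idea, not Parakeet's internal rewrite), adding a scalar to a vector under broadcasting behaves exactly like a map with the scalar held fixed:

import numpy as np

def add(x, y):
    return x + y

x = np.array([1.0, 2.0, 3.0])
# x + 10.0 is conceptually map(add, x, fixed=[10.0])
lifted = np.array([add(10.0, xi) for xi in x])
assert (lifted == x + 10.0).all()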
3 Parakeet Runtime and Internals
We will refer to the following code example to help illus-
trate the process by which Parakeet transforms and exe-
cutes code.
import math
import parakeet
from parakeet import PAR

def add(x, y):
    return x + y

def sum(x):
    return parakeet.reduce(add, x)

@PAR
def norm(x):
    return math.sqrt(sum(x * x))

Listing 1: Vector Norm in Parakeet
When the Python interpreter reaches the definition of
norm, it invokes the @PAR decorator which parses the func-
tion’s source and translates it into Parakeet’s untyped in-
ternal representation. There is a fixed set of primitive
functions from NumPy and the Python standard library,
such as math.sqrt, which are translated directly into
Parakeet syntax nodes. The helper functions add and sum
would normally be in the Parakeet module but they are de-
fined here for clarity. These functions are non-primitive,
so they themselves get recursively parsed and translated.
In general, the @PAR decorator will raise an exception if
it encounters a call to a non-primitive function which ei-
ther can't be parsed or violates Parakeet's semantic restrictions.
Figure 1: The Parakeet pipeline. Functions pass from the Python interpreter through lambda lifting and SSA conversion into an untyped IL; specialization produces a typed IL, which is optimized by CSE, inlining, simplification, and array fusion; shape inference, garbage collection, data movement, and a code cache support execution on the interpreter, the CPU (via LLVM), or the GPU (via PTX).
Lastly, before returning execution to Python, Parakeet converts its internal representation to Static Single Assignment form [12].
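A minimal sketch of the definition-time half of this process appears below; the registry and dispatch helper are hypothetical stand-ins, and the real decorator additionally performs lambda lifting, recursive translation of callees, and SSA conversion:

import ast
import inspect
import textwrap

_untyped_functions = {}   # hypothetical registry of translated functions

def _dispatch(fn, args):
    # placeholder: the real runtime specializes on argument types and
    # runs compiled code; here we just fall back to the Python function
    return fn(*args)

def PAR(fn):
    source = textwrap.dedent(inspect.getsource(fn))  # recover source text
    tree = ast.parse(source)                         # parse to a Python AST
    # ... translate `tree` to the untyped representation, raising an
    # exception on unparseable or semantically disallowed constructs ...
    _untyped_functions[fn] = tree
    def wrapper(*args):
        return _dispatch(fn, args)
    return wrapper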
3.1 Type Specialization
Parakeet intercepts calls to norm and uses the argument
types to synthesize a typed version of the function. During
specialization, all functions called by norm are themselves
specialized for particular argument types. In our code
example, if norm were called with a 1D float array then
sum would also be specialized for the same input type,
whereas add would be specialized for pairs of scalar
floats.
In Parakeet’s typed representation, every function must
have unambiguous input and output types. To eliminate
polymorphism Parakeet inserts casts and map operators
where necessary. When norm is specialized for vector arguments, its use of the multiplication operator is rewritten into a 1D map of a scalar multiply.
The actual process of type specialization is imple-
mented by interleaving an abstract interpreter, which
propagates input types to infer local types, and a rewrite
engine which inserts coercions where necessary.
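This machinery can be pictured as a cache of typed specializations keyed by argument types. The sketch below uses invented helper names and stubs out compilation; it is not Parakeet's internal API:

import numpy as np

_specializations = {}   # (function, argument type key) -> compiled version

def _type_key(arg):
    # e.g. a 1D float64 array becomes ('array', dtype('float64'), 1)
    if isinstance(arg, np.ndarray):
        return ('array', arg.dtype, arg.ndim)
    return ('scalar', type(arg))

def specialize(fn, args):
    key = (fn, tuple(_type_key(a) for a in args))
    if key not in _specializations:
        # infer local types by abstract interpretation, insert coercions
        # and maps, then compile; stubbed here as the function itself
        _specializations[key] = fn
    return _specializations[key]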
3.2 Optimization
In addition to standard compiler optimizations (such as
constant folding, function inlining, and common sub-
expression elimination), we employ fusion rules [2, 16,
15] to combine array operators. Fusion enables us to increase the computational density of generated code and to avoid the creation of unnecessary array temporaries.
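The simplest such rule is map fusion: a map consuming the result of another map is rewritten into a single map of the composed function, so the intermediate array is never materialized. A small sketch of the rule's effect:

import numpy as np

def compose(f, g):
    # map(f, map(g, X)) is rewritten to map(compose(f, g), X)
    return lambda x: f(g(x))

def double(t):
    return t * 2.0

def incr(t):
    return t + 1.0

X = np.arange(5.0)
unfused = np.array([double(t) for t in np.array([incr(x) for x in X])])  # temporary
fused = np.array([compose(double, incr)(x) for x in X])                  # one pass
assert (unfused == fused).all()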
3.3 Execution
We have implemented three backends for Parakeet thus
far: CPU and GPU backends for JIT compiling native
code, as well as an interpreter for handling functions that
cannot themselves be parallelized but may contain nested
parallelizable operators.
Once a function has been type specialized and optimized, it is handed off to Parakeet's scheduler, which is responsible for choosing among these three backends.
For each array operator, the scheduler employs a cost-
based heuristic which considers nested array operators,
data sizes, and memory transfer costs to decide where to
execute it.
Accurate prediction of array shapes is necessary both
for the allocation of intermediate values on the GPU and
for the cost model's placement of computations. We use
an abstract interpreter which propagates shape information
through a function using the obvious shape semantics for
each operator. For example, a reduce operation collapses
the outermost dimension of its argument, whereas a map
preserves the outermost dimension.
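Expressed over shape tuples, those rules look like the following sketch (real shapes may also carry symbolic dimensions):

def reduce_shape(shape):
    # reduce collapses the outermost dimension: (n, m) -> (m,)
    return shape[1:]

def map_shape(shape, payload_result_shape=()):
    # map preserves the outermost dimension; the payload's result
    # shape fills in the rest
    return (shape[0],) + payload_result_shape

assert reduce_shape((100, 3)) == (3,)
assert map_shape((100, 3), (3,)) == (100, 3)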
When the scheduler encounters a nested array operator
– e.g. a map whose payload function is itself a reduce –
it needs to choose which operator, if any, will be paral-
lelized. If an array operator is deemed a good candidate
for native hardware execution, the function argument to
the operator is then inlined into a program skeleton that
implements the operator. Parakeet flattens all nested array
computations within the function argument into sequen-
tial loops.
Several systems similar to Parakeet [8, 9] generate GPU
programs by emitting CUDA code which is then compiled
by NVIDIA's nvcc toolchain. Instead, Parakeet
emits PTX (a GPU pseudo-assembly) directly, since the
compile times are dramatically shorter. To generate CPU
code we use the LLVM compiler framework [17].
4 Evaluation
We evaluate Parakeet on two benchmarks: Black-Scholes
option pricing, and K-Means Clustering. We compare
Parakeet against hand-tuned CPU and GPU implementations.

Figure 2: Black Scholes Total Times

Due to space constraints, and since at the time of
writing our CPU backend only supports single-threaded
execution, we only present GPU results for Parakeet.
For Black-Scholes, the CPU reference implementation is
taken from the PARSEC [4] benchmark suite, and the
GPU implementation is taken from the CUDA SDK [21].
For K-Means Clustering both the CPU and GPU reference
versions come from the Rodinia benchmark suite [11].
For both benchmarks, we ported the reference implemen-
tations as directly as possible from their source languages
to Parakeet.
We ran all of our benchmarks on a machine running 64-
bit Linux with an Intel Core i7 960 3.2GHz 4-core CPU
and 16GB of RAM. The GPU used in our system was
an NVIDIA Tesla C1060 with 240 vector lanes, a clock
speed of 1.296 GHz, and 4GB of memory.
4.1 Black-Scholes
Black-Scholes option pricing [5] is a standard algorithm
used for data parallel benchmarking. We compare Para-
keet against the multithreaded OpenMP CPU implemen-
tation from the PARSEC [4] suite with both 1 and 8
threads and the GPU version in the CUDA SDK [21]. We
modified the benchmarks to all use the input data from
the PARSEC implementation so as to have a direct com-
parison of the computation alone. We also modified the
CUDA version to calculate only one of the call or put price
per option so as to match the behavior in PARSEC.
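For reference, a scalar Black-Scholes call pricer in Parakeet style looks roughly as follows. This is our sketch rather than the exact benchmark code, and it assumes math.log, math.exp, math.sqrt, and math.erf are all supported primitives; applied to whole NumPy arrays of option parameters, broadcasting turns it into an implicit map:

import math
from parakeet import PAR

def cnd(d):
    # cumulative normal distribution via the error function
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

@PAR
def black_scholes_call(S, K, T, r, sigma):
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) \
         / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * cnd(d1) - K * math.exp(-r * T) * cnd(d2)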
In Figure 2, we see the total execution times of the var-
ious versions. These times include the time it takes to
transfer data to and from the GPU in the GPU bench-
marks. As expected, Black-Scholes performs very well
on the GPU as compared with the CPU. We see that Parakeet
performs very similarly to the hand-written CUDA
version, with overheads decreasing as a percentage of the
run time as the data sizes grow, since most of them are
fixed costs related to dynamic compilation.

Figure 3: Black Scholes GPU Execution Times
In Figure 3, we break down Parakeet’s performance
as compared with the hand-written CUDA version. The
Parakeet run times range from 24% to 2.4X slower than
those of CUDA, with Parakeet performing better as the
data size increases. We can see that transferring data to
and from the GPU’s memory is expensive and dominates
the runtime of this benchmark. The GPU programs that
Parakeet generates are slightly less efficient than those of
the CUDA version, with approximately 50% higher run
time on average. Most of this slowdown is due to compi-
lation overheads.
4.2 K-Means Clustering
We also tested Parakeet on K-Means clustering, a com-
monly used unsupervised learning algorithm. We chose
K-Means since it includes both loops and nested array op-
erators, and thus illustrates Parakeet’s support for both.
The Parakeet code we used to run our benchmarks can be
seen in Listing 2.
In Figure 4, we see the total run times of K-Means for
the CPU and GPU versions with K = 3 clusters and 30
features on varying numbers of data points. Here, the dis-
tinction between the GPU and the CPU is far less stark. In
fact, for up to 64K data points the 8-thread CPU version
outperforms the GPU. Further, we see that for more than
64K data points, Parakeet actually performs better than
both the CPU and GPU versions.
Figure 4: K-Means Total Times with 30 Features, K = 3
Figure 5: K-Means Total Times with 30 Features, K = 30
The reason Parakeet is able to perform so well with
respect to the CUDA version is due to the difference
in how they compute the new average centroid for the
new clusters in each iteration. The CUDA version brings
the GPU-computed assignment vectors back to the CPU
in order to perform this reduction, as it involves many
unaligned memory accesses and so has the potential to
perform poorly on the GPU. Parakeet's scheduler chooses
to execute this code on the GPU instead, preferring to
avoid the data transfer penalty. For such a small number
of clusters, the Parakeet method ends up performing far
better. However, for larger numbers of clusters (roughly
30 and above), the fixed overhead of launching an indi-
vidual kernel to average each cluster’s points overwhelms
the performance advantage of the GPU and Parakeet ends
up performing worse than the CUDA version. We are
currently implementing better code placement heuristics
based on dynamic information, in order to use the GPU
only when it would actually be advantageous.
import parakeet as par
from parakeet import PAR

def sqr_dist(x, y):
    sqr_diff = (x - y) * (x - y)
    return par.sum(sqr_diff)

def minidx(C, x):
    dists = par.map(sqr_dist, C, fixed=[x])
    return par.argmin(dists)

def calc_centroid(X, a, cluster_id):
    return par.mean(X[a == cluster_id])

@PAR
def kmeans(X, assignments):
    k = par.max(assignments) + 1   # cluster labels are 0 .. k-1
    cluster_ids = par.arange(k)
    a = assignments
    C = par.map(calc_centroid, cluster_ids, fixed=[X, a])
    converged = False
    while not converged:
        last = a
        a = par.map(minidx, X, fixed=[C])
        C = par.map(calc_centroid, cluster_ids, fixed=[X, a])
        converged = par.all(last == a)
    return C

Listing 2: K-Means Parakeet Code
In Figure 5, we present the run times of K-Means for
30 features and K = 30 clusters. In this case, the hand-
written CUDA version performs best in all cases, though
its advantage over the CPU increases and its advantage
over Parakeet decreases with increasing data size. As
discussed, the Parakeet implementation suffers from the
poorly performing averaging computation that it executes
on the GPU, with a best case of an approximate 2X slow-
down over the CUDA version.
Figure 6 shows a breakdown of the different factors
that contribute to the overall K-Means execution times.
Both Rodinia’s CUDA version and Parakeet use the CPU
to perform a significant amount of the computation. The
Rodinia version uses the CPU to compute the averages of
the new clusters in each iteration, while Parakeet spends
much of its CPU time on program analysis and transfor-
mation; garbage collection; and setting up GPU kernel
launches. As mentioned above, Parakeet spends much
more time on the GPU than the Rodinia version does. In
this benchmark, as opposed to Black-Scholes, little of the
execution time is spent on data transfers. It is important to
note that the Python implementation of K-Means is orders
of magnitude shorter than the CUDA implementation.

Figure 6: K-Means GPU Times with 30 Features, K = 3
5 Related Work
The earliest attempts to accelerate array-oriented pro-
grams were sequential compilers for APL. Abrams [1]
designed an innovative two-stage compiler wherein APL
was first statically translated to a high-level untyped inter-
mediate representation and then lazily compiled to a low-
level scalar architecture. Inspired by Abrams’ work, Van
Dyke created an APL environment for the HP 3000 [25]
which also dynamically compiled delayed expressions
and cached generated code using the type, rank, and shape
of variables.
More recently, a variety of compilers have been created
for the Matlab programming language [19]. FALCON [14]
performs source-to-source translation from Matlab to For-
tran 90 and achieves parallelization by adding dependence
annotations into its generated code. MaJIC [3] accelerates
Matlab by pairing a lightweight dynamic compiler with a
static translator in the same spirit as FALCON. Both FAL-
CON and MaJIC infer simultaneous type and shape infor-
mation for local variables by dataflow analysis.
The use of graphics hardware for non-graphical com-
putation has a long history [18]. The Brook language ex-
tended C with “kernels” and “streams”, exposing a pro-
gramming model similar to what is now found in CUDA
and OpenCL [7]. Over the past decade, there have been
several attempts to use an embedded DSL to simplify
the arduous task of GPGPU programming. Microsoft’s
Accelerator [24] was the first project to use high level
(collection-oriented) language constructs as a basis for
GPU execution. Accelerator’s programming model does
not support function abstractions (only expression trees)
and its only underlying parallelism construct is limited to
the production of map-like kernels. Accelerate [10] is a
first-order array-oriented language embedded in Haskell
which attains good performance on preliminary bench-
marks but is unable to describe nested computations and
requires user annotations to move data onto the GPU. A
more sophisticated runtime has been developed for the
Delite project [6], which is able to schedule complex com-
putations expressed in Scala across multiple backends.
Parakeet and Copperhead [8] both attempt to parallelize
a numerical subset of Python using runtime compilation
structured around higher-order array primitives. Parakeet
and Copperhead both support nested array computations
and are able to target either GPUs or CPUs. Since Cop-
perhead uses a purely functional intermediate language,
its subset of Python is more restrictive than the one which
Parakeet aims to accelerate. Unlike Copperhead, Parakeet
allows loops and modification of local variables. Copper-
head infers a single type for each function using an ex-
tension of the Hindley-Milner [13] inference algorithm.
This simplifies compilation to C++ templates but disal-
lows both polymorphism between Python scalars and “ar-
ray broadcasting” [22]. Parakeet, on the other hand, is
able to support polymorphic language constructs by spe-
cializing a function’s body for each distinct set of ar-
gument types and inserting coercions during specializa-
tion. Copperhead does not use dynamic information when
making scheduling decisions and thus must rely on user
annotations. Parakeet’s scheduler dynamically chooses
which level of a nested computation to parallelize.
6 Conclusion
Parakeet allows the programmer to write Python code
using a widely-used numerical computing library while
achieving good performance on modern parallel hard-
ware. Parakeet automatically synthesizes and executes
efficient native binaries from Python code. Parakeet is
a usable system in which moderately complex programs
can be written and executed efficiently. On two bench-
mark programs, Parakeet delivers performance compet-
itive with hand-tuned GPU implementations. Parakeet
code can coexist with standard Python code, allowing full
interoperability with all of Python’s tools and libraries.
We are currently extending our CPU backend to use mul-
tiple cores and improving dynamic scheduling and com-
pilation decisions.
References
[1] ABRAMS, P. S. An APL machine. PhD thesis, Stanford, CA, USA, 1970. AAI7022146.
[2] ABU-SUFAH, W. A.-K. Improving the performance of virtual memory computers. PhD thesis, Champaign, IL, USA, 1979. AAI7915307.
[3] ALMÁSI, G., AND PADUA, D. MaJIC: Compiling MATLAB for speed and responsiveness. In PLDI '02: Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2002), ACM, pp. 294–303.
[4] BIENIA, C., KUMAR, S., SINGH, J. P., AND LI, K. The PARSEC benchmark suite: Characterization and architectural implications. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (October 2008).
[5] BLACK, F., AND SCHOLES, M. The pricing of options and corporate liabilities. The Journal of Political Economy 81, 3 (1973), 637–654.
[6] BROWN, K., SUJEETH, A., LEE, H. J., ROMPF, T., CHAFI, H., ODERSKY, M., AND OLUKOTUN, K. A heterogeneous parallel framework for domain-specific languages. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on (Oct. 2011), pp. 89–100.
[7] BUCK, I., FOLEY, T., HORN, D., SUGERMAN, J., FATAHALIAN, K., HOUSTON, M., AND HANRAHAN, P. Brook for GPUs: Stream computing on graphics hardware. In ACM SIGGRAPH 2004 Papers (New York, NY, USA, 2004), ACM, pp. 777–786.
[8] CATANZARO, B., GARLAND, M., AND KEUTZER, K. Copperhead: Compiling an embedded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (2011), pp. 47–56.
[9] CHAFI, H., SUJEETH, A. K., BROWN, K. J., LEE, H., ATREYA, A. R., AND OLUKOTUN, K. A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2011), ACM, pp. 35–46.
[10] CHAKRAVARTY, M. M., KELLER, G., LEE, S., MCDONELL, T. L., AND GROVER, V. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming (2011), pp. 3–14.
[11] CHE, S., BOYER, M., MENG, J., TARJAN, D., SHEAFFER, J., LEE, S.-H., AND SKADRON, K. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (October 2009), pp. 44–54.
[12] CYTRON, R., FERRANTE, J., ROSEN, B. K., WEGMAN, M. N., AND ZADECK, F. K. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13 (October 1991), 451–490.
[13] DAMAS, L., AND MILNER, R. Principal type-schemes for functional programs. In Proceedings of the 9th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY, USA, 1982), POPL '82, ACM, pp. 207–212.
[14] DEROSE, L., GALLIVAN, K., GALLOPOULOS, E., MARSOLF, B., AND PADUA, D. FALCON: A MATLAB interactive restructuring compiler. In Languages and Compilers for Parallel Computing (1995), Springer-Verlag, pp. 269–288.
[15] JONES, S. P., TOLMACH, A., AND HOARE, T. Playing by the rules: Rewriting as a practical optimisation technique in GHC. In Proceedings of the 2001 Haskell Workshop (2001).
[16] KENNEDY, K., AND MCKINLEY, K. S. Typed fusion with applications to parallel and sequential code generation. Tech. rep., 1993.
[17] LATTNER, C. LLVM: An infrastructure for multi-stage optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002. See http://llvm.cs.uiuc.edu.
[18] LENGYEL, J., REICHERT, M., DONALD, B. R., AND GREENBERG, D. P. Real-time robot motion planning using rasterizing computer graphics hardware. In Proc. SIGGRAPH (1990), pp. 327–335.
[19] MOLER, C. B. MATLAB — an interactive matrix laboratory. Technical Report 369, University of New Mexico, Dept. of Computer Science, 1980.
[20] NVIDIA. CUDA Zone. http://www.nvidia.com/cuda.
[21] NVIDIA. NVIDIA CUDA SDK 3.2. http://www.nvidia.com/cuda.
[22] OLIPHANT, T. E. Python for scientific computing. Computing in Science and Engineering 9 (2007), 10–20.
[23] PRECHELT, L. Are scripting languages any good? A validation of Perl, Python, Rexx, and Tcl against C, C++, and Java. Advances in Computers 57 (2003), 205–270.
[24] TARDITI, D., PURI, S., AND OGLESBY, J. Accelerator: Using data parallelism to program GPUs for general-purpose uses. In ASPLOS '06: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (November 2006).
[25] VAN DYKE, E. J. A dynamic incremental compiler for an interpretive language. Hewlett-Packard Journal (July 1977), 17–24.