The Impact of Paravirtualized Memory Hierarchy on Linear
Algebra Computational Kernels and Software
Dept. of Computer Science, University of California, Santa Barbara
Dept. of Electrical Engineering and Computer Science, University of Tennessee
ABSTRACT

Previous studies have revealed that paravirtualization imposes minimal performance overhead on High Performance Computing (HPC) workloads, while exposing numerous benefits for this field. In this study, we investigate the memory hierarchy characteristics of paravirtualized systems and their impact on automatically-tuned software systems. We present an accurate characterization of memory attributes using hardware counters and user-process accounting. To that end, we examine the proficiency of ATLAS, a quintessential example of an autotuning software system, in tuning the BLAS library routines for paravirtualized systems. In addition, we examine the effects of paravirtualization on the performance boundary. Our results show that the combination of ATLAS and Xen paravirtualization delivers native execution performance and nearly identical memory hierarchy performance profiles. Our research thus exposes new benefits to memory-intensive applications arising from the ability to slim down the guest OS without influencing system performance. In addition, our findings support a novel and very attractive deployment scenario for computational science and engineering codes on virtual clusters and computational clouds.
Categories and Subject Descriptors
D.4.8 [Operating Systems]: Performance; C.4 [Performance
of Systems]: Performance attributes
Keywords
Virtual Machine Monitors, Paravirtualization, Autotuning, BLAS, high performance, linear algebra, cloud computing
1. INTRODUCTION

Virtualization has historically offered numerous benefits for high performance computing. It was, however, ignored in computationally intensive settings because of its potential for performance degradation. A recent approach to virtualization, dubbed paravirtualization, has emerged as a possible alternative. Originally developed to help consolidate servers in commercial settings, this OS-based approach to virtualization has grown rapidly in popularity. To provide the greatest flexibility, portability, and instrumentation possibilities, it comprises both software and hardware techniques. Originally explored by Denali and Xen in 2003, this technique, in which the guest and the host OSs are both strategically modified to provide optimized performance, is used today in several virtualization systems.
To this end, several studies [17, 26, 27] have measured the
performance ramifications of running general HPC bench-
marks on paravirtualized systems and they report near-native
performance for different HPC workloads. At the same time,
other studies have focused on the flexibility and functional-
ity benefits of using modern virtualization techniques. For
example, OS-level checkpointing, fault tolerance, and load balancing are very attractive possibilities in HPC
environments. In addition, other researchers [22, 19] have
looked into dynamic adaptation of the guest OS for perfor-
mance and application-customized guest OSs for scientific
parallel codes [5, 14, 28].
Although previous studies of paravirtualization have ad-
dressed the general-case performance characteristics of HPC
benchmarks, they did not investigate the performance boundary or the performance consequences under scarce memory conditions. This is of particular importance in
HPC, because of the performance sensitivity of memory-
intensive codes and autotuning Linear Algebra (LA) pack-
ages to the memory characteristics. In this vein, we provide
a detailed study of the impact of the paravirtualized execu-
tion on LA codes that use autotuning software for portable
performance. Autotuning has become an important technol-
ogy for “core” libraries that are performance-critical since
these systems often require complex and otherwise labor-
intensive configuration to achieve maximal performance lev-
els. Thus, our goals are two fold. First, we wish to un-
derstand how autotuning is affected by paravirtualization.
For example, we wish to know whether the autotuning soft-
ware can “sense” the presence of paravirtualization during
the tuning process. Second, we wish to explore the po-
tential impact paravirtualization may have on highly tuned
numerical software. While it may be that a vanilla installa-
tion is unaffected, as has been shown in previous studies,
in this work we investigate the effects of paravirtualization
on the performance boundary.
In particular, we study the efficacy of Automatically Tuned
Linear Algebra Software (ATLAS) in detecting system con-
figurations for paravirtualized systems. We then use DGEMM
matrix-matrix multiplication as a memory-intensive code to
measure performance in Mflops for double precision arith-
metic and compare the performance of several OS kernels
with varying main memory allocations. With this in mind,
we investigate the different attributes of the memory hier-
archy of the paravirtualized system.
The main contribution of this paper is an exposition of
how paravirtualization (as implemented by Xen) impacts the
performance of LA codes, particularly with respect to its use
of the memory hierarchy. In addition, our findings, along with our previous studies [27, 26], complete the investigation of
the feasibility of utilizing clusters of virtual machines to ex-
ecute scientific codes without sacrificing performance. Our research therefore demonstrates the strong potential of deploying scientific codes on virtual clusters and computing clouds.
In turn, this presents new and very attractive deployment
scenarios for computational sciences and engineering codes.
The novel deployment scenarios are not the only appealing implication of our research; the savings in computing expenditure are another very desirable advantage. As the cost of virtual clusters is a fraction of the cost of computing hardware acquisition and maintenance, our results have the potential to reduce the total cost of the inquiry process in computational science and engineering.
The paper is structured as follows. We present a short
survey on paravirtualization in the next section, as well as
the terminology we will use. We detail our experimental
settings in the following section. Section 4 presents the im-
pact of paravirtualization on ATLAS system detection and
the performance of the generated and hand-tuned routines.
Next, Section 5 investigates the effect of paravirtualization on the memory hierarchy of the system by describing its impact on the performance of a memory-intensive dense matrix-matrix multiplication and by characterizing its RSS (Resident Set Size), swap, and TLB activity. Finally, we discuss the impli-
cations of our work in Section 6 and present our conclusions
in Section 7.
2. BACKGROUND

Historically, virtualization has been a technique for providing secure and flexible hosting of user and system programs spanning several trust levels. In an Internet computing setting, this hosting capability has relied on language semantics (e.g., Java) to permit the importation of untrusted code for execution by a software virtual machine. While the approach has been successful, incompatibilities between the virtual machine mandated by the language and the typical organization of actual computational resources (processors, memory, I/O subsystems, etc.) impose a performance penalty on virtualized program execution. Many powerful and elegant optimization techniques have been developed to minimize this penalty, but at present language-virtualized systems still do not achieve native execution speeds.

[Figure 1: The two software stacks in our experimentation settings: (i) the stack on the left is the traditional (native) software stack; (ii) the stack on the right shows the virtualized software stack.]
A recent approach to virtualization, dubbed paravirtualization, has emerged as a possible alternative. Paravirtual-
ization is a software virtualization technique which allows
the virtual machines to achieve near native performance.
In paravirtualized systems, for example Xen, the sys-
tem software stack is augmented, as illustrated in Figure 1.
The stack on the left shows the traditional OS deployment
stack, while the right stack portrays the paravirtualized de-
ployment stack. In the latter, the hypervisor (1) occupies a
small part of the main memory, and acts as a moderator
layer between the hardware and the guest OS kernels. On
top of the hypervisor, two types of guest OSs are run. The
first type is regarded as a privileged services guest OS, which
provides OS services to the other less-privileged OS and has
more direct access to memory, devices, and the hardware in
general. There must be at least one privileged virtual ma-
chine per physical machine. The other kind of guest OS is
a less privileged OS kernel, which uses paravirtualized de-
vice drivers and has moderated access to the hardware. The
privileged guest OS is responsible for running virtualization
software tools that manage, start, monitor and even migrate
the other less-privileged domains.
In order to measure the performance impact of paravir-
tualization on autotuning software, we used Automatically
Tuned Linear Algebra Software (ATLAS) [24, 9]. ATLAS
focuses on applying empirical search techniques to provide
highly tunable performance for linear algebra libraries. It
empirically explores the search spaces for the values of the
different parameters for Basic Linear Algebra Subprograms
(BLAS) [15, 7, 10] and Linear Algebra Package (LAPACK) 
routines for matrix operations. Those kinds of matrix ker-
nels are among the most widely studied and optimized rou-
tines in computational science due to their influence on the
overall performance of many applications.
(1) The hypervisor is a small piece of software that runs directly on the hardware and acts as a slim layer between the guest OSs and the hardware. It is also referred to as a virtual machine monitor (VMM). Accordingly, the OS kernels that run on the hypervisor are termed virtual machines, or guest OSs.
Table 1: Mathematical notations for the routines in the BLAS and LAPACK libraries.
  C = αAB + βC        General dense non-transpose/non-transpose matrix-matrix multiplication
  C = αAB^T + βC      General dense non-transpose/transpose matrix-matrix multiplication
  C = αA^T B + βC     General dense transpose/non-transpose matrix-matrix multiplication
  C = αA^T B^T + βC   General dense transpose/transpose matrix-matrix multiplication
  y = αAx + βy        General matrix-vector multiplication
  y = αA^T x + βy     General transpose matrix-vector multiplication
  A = αxy^T + A       General rank-one update
Traditionally, developers either carefully optimized these algorithms by hand or they relied on compiler optimizations to improve the per-
formance. Hand-tuning requires a lot of expertise and quite
a bit of effort from the developer. Even if the hand-tuning
is successful, it is not often portable to other architectures,
so the developer has to repeat the process multiple times
to support a useful set of platforms. On the other hand,
using compiler optimizations requires almost no effort from
the developer, but may only give modest results, especially
compared to the theoretical peak performance of the ma-
chine. Many compiler optimization techniques such as loop
blocking, loop unrolling, loop permutation, fusion and distri-
bution have been developed to transform programs written
in high-level languages to run efficiently on modern archi-
tectures [1, 21]. Commonly referred to as model-driven op-
timization, most compilers select the optimization parame-
ters such as block size, unrolling factor, and loop order with
analytical models. The models may be based on real ar-
chitectural attributes or other heuristics, but compilers are
burdened with the task of handling a wide variety of code,
so the built-in optimizers usually cannot compete with ex-
perienced hand-tuners. In contrast, empirical optimization
techniques generate a large number of code variants for a
particular algorithm (e.g. matrix multiplication) using dif-
ferent optimization parameter values or different code trans-
formations. All these candidates run on the target machine
and the one that gives the best performance is picked. Using
this empirical optimization approach, projects like ATLAS,
PHiPAC, OSKI, and FFTW can successfully
generate highly optimized libraries for dense and sparse lin-
ear algebra kernels and FFT respectively. This empirical
approach has been recognized as an alternative approach
to traditional compiler optimizations and machine specific
hand-tuning of linear algebra libraries, since it normally gen-
erates faster libraries than the other approaches and can
adapt to many different machine architectures. In a recent
report from Berkeley on the future of parallel computing,
software autotuners were regarded as a way to enable efficient optimizations and were expected to become more widely adopted in translating parallel programs and generating code. Towards
this end, we expect autotuners will be more embraced in
the near future, and will run on virtualized machines such
as the computing clouds. Hence, we are investigating in this
paper the impact of paravirtualization on the operation of such autotuning systems.
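To make the empirical approach concrete, the following sketch (our own illustration, not ATLAS code) times a blocked matrix-multiplication kernel for a few candidate blocking factors and keeps the fastest; the candidate values and problem size are arbitrary assumptions:

/* Minimal sketch of an empirical blocking-factor search (illustration
 * only, not ATLAS source). Times a blocked DGEMM-like kernel for
 * several candidate block sizes NB and reports the fastest. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512  /* problem size used for the probe (assumption) */

static void blocked_mm(int n, int nb, const double *A,
                       const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += nb)
        for (int kk = 0; kk < n; kk += nb)
            for (int jj = 0; jj < n; jj += nb)
                /* multiply one nb x nb tile, keeping operands in cache */
                for (int i = ii; i < ii + nb && i < n; i++)
                    for (int k = kk; k < kk + nb && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + nb && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main(void)
{
    int candidates[] = { 16, 32, 64, 128 };  /* hypothetical NB values */
    double *A = malloc(N * N * sizeof *A);
    double *B = malloc(N * N * sizeof *B);
    double *C = malloc(N * N * sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int best_nb = 0;
    double best_time = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof *candidates; c++) {
        for (int i = 0; i < N * N; i++) C[i] = 0.0;
        clock_t t0 = clock();
        blocked_mm(N, candidates[c], A, B, C);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB = %3d: %.3f s\n", candidates[c], secs);
        if (secs < best_time) { best_time = secs; best_nb = candidates[c]; }
    }
    printf("best blocking factor: %d\n", best_nb);
    free(A); free(B); free(C);
    return 0;
}

ATLAS performs this kind of timing-driven selection over a much larger parameter space (blocking, unrolling, scheduling), but the pick-the-fastest-variant principle is the same.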
With this in mind, we used the performance and the pa-
rameter values of the autotuned BLAS library as an in-
dication of the efficiency of the autotuning process in par-
avirtualized environments. ATLAS is convenient to use for
these experiments because of its widespread use for gener-
ating tuned LA libraries. In addition, the detected charac-
teristics of the system can be easily examined in the log files and compared among the different OS kernel configurations.

Table 2: Description of the parameters used in tuning the BLAS routines.
  NB                 - L1 blocking factor
  MULADD             - Boolean flag to indicate whether the MULADD is done as one operation or not
  LAT                - Latency between floating point operations
  NB (per routine)   - Blocking factor used in each specific routine
  mu, nu, ku         - Unrolling factors for M, N, K
  Xunroll, Yunroll   - Unrolling factors for X, Y
Also, since ATLAS typically achieves 75-90% of peak per-
formance in the native configuration, it should give a good
indication of whether the various OS kernel configurations
are capable of high sustained floating point performance.
Notice that ATLAS essentially performs a “parameter-
sweep” search of the performance space so that it can iden-
tify the values of the specific parameters that yield the best
performance (among those tested). The resulting library configuration typically achieves better performance than
a generic installation. Because applications running near
peak machine speed can be more performance sensitive to
effects introduced by their OS environment (e.g. OS noise),
we wish to examine the degree to which paravirtualization
interferes with an optimized installation. Notice also that
the set of parameters identified by ATLAS are conveniently
logged making it possible to use them to detect specific per-
formance differences between native and paravirtualized exe-
cution. That is, by comparing ATLAS tuning logs for native
and virtualized optimization, we should be able to identify
immediately how paravirtualization is affecting the execu-
tion of optimized LAPACK libraries.
To help understand some of the parameters and the sub-
routine names mentioned in this paper, we will briefly de-
scribe the BLAS and LAPACK naming convention (for full
details, see the BLAS and LAPACK documentation). Subroutines are named XYYZZ or XYYZZZ,
where X indicates the data type (S for single precision, D for
double precision, etc.), YY indicates the matrix type (GE for
general, GB for banded, etc.), and the final ZZ or ZZZ indi-
cates the computation performed (MM for matrix-matrix mul-
tiply, MV for matrix-vector multiply, etc.). Therefore, DGEMM
would be a double-precision general matrix-matrix multi-
plication. In this matrix-matrix multiplication routine, an M × K matrix A multiplies a K × N matrix B, resulting in the M × N matrix C. ATLAS finds the best blocking and loop unrolling factors for the on-chip multiply using timings, i.e., it examines the search space by trying
different values for blocking and loop unrolling. In Table 1,
we outline the BLAS routines that ATLAS optimizes and
in Table 2, brief descriptions of the different parameters for
the routines are outlined.
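For concreteness, a minimal (untuned) reference implementation of the DGEMM semantics described above might look as follows; this is our own illustration of the operation C = αAB + βC, not ATLAS's optimized kernel:

/* Reference (untuned) semantics of DGEMM, non-transpose case:
 * C = alpha*A*B + beta*C, with A (M x K), B (K x N), C (M x N),
 * all stored in row-major order. Illustration only. */
void dgemm_ref(int M, int N, int K, double alpha,
               const double *A, const double *B,
               double beta, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * sum + beta * C[i * N + j];
        }
}

ATLAS replaces this naive triple loop with blocked, unrolled, and (where available) hand-written kernels, but the computed result is the same.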
3. EXPERIMENTAL SETTINGS
We ran our experiments on a Pentium D dual-core ma-
chine, where each core is a 2.8-GHz Pentium with an 800-
MHz processor bus, 16KB of L1 cache and 1024KB L2 cache.
The machine memory system uses a 533-MHz bus with 1 GB
of dual interleaved DDR2 SDRAM memory.
In order to find the performance ramifications of paravir-
tualization, we compare the performance of three types of
OS kernels. Furthermore, we test two configurations that
differ in the main memory size allocated at boot time for
each OS kernel. The first kernel is a Fedora Core Linux
2.6.19 kernel, which we used as a base performance kernel,
and henceforth referred to as “native”. For this kernel, the
device drivers, applications and BLAS libraries run directly
in the OS (without virtualization), which is the common
software stack used nowadays in HPC clusters.
On the other hand, the paravirtualized software stack is
different, as we described in the previous section. We used
Xen as our paravirtualizing software, with the hypervisor
in the first 32MB of the main memory. Furthermore, in
Xen terminology, the privileged guest OS is dubbed Dom0
(for Domain 0) while the less privileged guest OS is dubbed
DomU (for Domain Unprivileged). We adopt this termi-
nology for the rest of our paper.
For each of the three OS kernels (native, Dom0, DomU), we test the performance
with two main memory configurations: 256MB and 756MB.
The reason for changing the total memory assigned to the
systems was to test the performance of the system under
limited memory conditions and to generate near-boundary
memory cases for virtualized systems. We disabled the balloon driver (2) in Xen domains in order to isolate the impact
of the balloon driver on memory performance and to build
a fair comparison between the different systems. In our ex-
perimentation, we used Linux kernel 2.6.16 as the guest OS
for both dom0 and domU, patched with Xen 3.0.4. All the
OS kernels were built with SMP support.
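As an aside, the boot-time allocation seen by each kernel can be confirmed from inside the (native or guest) OS; the sketch below is our own Linux-specific illustration using sysinfo(2), not part of our benchmark harness:

/* Linux-specific sketch: report total and free RAM as seen by the
 * (native or guest) kernel, to confirm the boot-time allocation. */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;
    if (sysinfo(&si) != 0) {
        perror("sysinfo");
        return 1;
    }
    /* mem_unit converts the counters to bytes */
    double total_mb = (double)si.totalram * si.mem_unit / (1024.0 * 1024.0);
    double free_mb  = (double)si.freeram  * si.mem_unit / (1024.0 * 1024.0);
    printf("total RAM: %.1f MB, free RAM: %.1f MB\n", total_mb, free_mb);
    return 0;
}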
We use ATLAS 3.7.35, the latest version available for autotuning the BLAS routines. We compare the performance achieved using the ATLAS-generated code (with and
without SSE2 support). We also compare the performance
achieved by the DGEMM routine for different matrix sizes.
In addition, threading was enabled in all ATLAS builds to
allow ATLAS to build parallel libraries.
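As an illustration of how the tuned library is exercised, a benchmark can call DGEMM through the CBLAS interface that ATLAS builds. The sketch below is our own; the matrix size and link flags are assumptions that depend on the installation:

/* Sketch of a DGEMM call through the CBLAS interface (as built by
 * ATLAS). Compile/link flags depend on the install, e.g.:
 *   cc dgemm_bench.c -lcblas -latlas */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    int n = 1000;                      /* example matrix dimension */
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 0.5; }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);       /* expect n * 1.0 * 0.5 = 500.0 */
    free(A); free(B); free(C);
    return 0;
}

Timing such a call for varying n and converting elapsed time to Mflops (2n^3 floating point operations per multiply) yields the performance curves we report.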
4. AUTOTUNING SOFTWARE SYSTEMS
ATLAS is one of the earliest autotuning software sys-
tems for performance optimization. We use ATLAS in our
research as a quintessential example of an autotuner for pro-
cessors with deep memory hierarchies and pipelined func-
tional units. As we described in Section 2, ATLAS uses
(2) The balloon driver in Xen allows the domains to grow and shrink dynamically in their total main memory allocation, according to their runtime memory workloads.
[Figure 2: Performance of the register-to-register FPU for double precision numbers, as detected by ATLAS for the OS kernels (y-axis: performance in Mflops; bars for Native, Dom0, and DomU at 256MB and 756MB).]
an empirical search methodology to optimize the different
routines for BLAS and LAPACK. This search process is
composed of three key phases. In the first phase, ATLAS
focuses on detecting the system characteristics. Through a
probe process, ATLAS collects information about cache size
of the system, the floating point unit (FPU), the number of
registers, and other architectural information. The second
phase is concerned with determining the best values of pa-
rameters to be used in generating the BLAS and LAPACK
routines based on the detected system characteristics and
the results of the empirical search. After tuning, ATLAS
runs cache benchmarks to determine the optimal value for
CacheEdge, which represents the value of the cache size for
blocking the matrix-matrix multiplication routines. Finally,
ATLAS uses all the information it gathered to generate the
optimized BLAS library. In the next three subsections, we
detail the performance of ATLAS in each of the three phases
respectively. For those results, we found that the precision of the multiplication (i.e., single versus double) does not impact the difference in performance between the OS kernels. Therefore, we only detail the double precision performance.

4.1 System Characteristics Detection
In order for ATLAS to autotune the BLAS libraries, it
starts its operation by probing the system characteristics.
Table 3 shows the output of ATLAS for each of these pa-
rameters. The first row shows that ATLAS detected L1
cache to be of size 16KB for all the OS kernels. The second
row illustrates the number of registers detected in each of
the systems, for which ATLAS detected 7 registers for all
three OS kernels. The length of the floating point pipeline
(in cycles) is presented in the third row, while the number
of FPU registers is presented in the fourth row.
Furthermore, Figure 2 shows the floating point unit (FPU)
register-to-register performance (i.e., with no memory la-
tency) as measured by ATLAS. For each of these perfor-
mance numbers, we present an average of 20 runs with er-
ror bars reflecting the margin of error for a 95% confidence
level of the mean. In this figure, the Y -axis represents the
performance in Mflops.

Table 3: System characterization as detected by ATLAS for the OS kernels.
  Parameter               Native   Dom0   DomU
  L1 cache size           16KB     16KB   16KB
  Sys info nreg           7        7      7
  FPU: pipeline cycles    -        -      -
  FPU: registers num.     15       15     15

Therefore, we concluded from these measurements that there is no significant performance difference between the OS kernels for FPU operation. Overall, we
concluded from these results that the paravirtualization does
not alter the system characterization, nor does it impose any
performance overhead in register-to-register performance for
floating point operations.
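For reference, the margin of error we report is the usual t-based half-width of the confidence interval, t·s/√n with n = 20 runs. A minimal sketch (ours, with hypothetical measurements) follows:

/* Margin of error for a 95% confidence interval of the mean,
 * computed as t * s / sqrt(n). For n = 20 (19 degrees of freedom)
 * the two-sided critical value is t = 2.093. Illustration only. */
#include <math.h>
#include <stdio.h>

double margin_of_error_95(const double *x, int n, double t_crit)
{
    double mean = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
    double s = sqrt(ss / (n - 1));     /* sample standard deviation */
    return t_crit * s / sqrt((double)n);
}

int main(void)
{
    /* hypothetical Mflops measurements from 20 runs */
    double mflops[20] = { 2510, 2498, 2505, 2490, 2512, 2500, 2495, 2508,
                          2502, 2497, 2511, 2493, 2506, 2499, 2504, 2501,
                          2496, 2509, 2494, 2503 };
    printf("+/- %.2f Mflops\n", margin_of_error_95(mflops, 20, 2.093));
    return 0;
}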
4.2 Cache Blocking Size Configuration
Tuning the CacheEdge (i.e., cache blocking parameter)
can help increase performance and reduce the memory us-
age of BLAS routines. In this phase, ATLAS attempts to
determine the optimal cache size for blocking the matrix-
matrix multiplication routines. It first tests the blocking
performance using only L1 cache, then uses different values
of L2 cache.
We compared the performance achieved by each OS ker-
nel for L1 cache and each value of L2 blocking. Figure 3
depicts the performance in Mflops of a double precision
matrix-matrix multiplication of dimension 2500 using only
L1 cache blocking, while Figure 4 represents the perfor-
mance of using L2 blocking. All the numbers reported here
are the average of 20 runs. For the latter figure, the x-axis represents the size of L2 cache in KB used in blocking,
while the y-axis represents the corresponding performance.
The error bars reflect the margin of error for 95% confi-
dence level. Note that we extended the ATLAS subpro-
gram which does the CacheEdge measurements to evaluate
the performance beyond the physical L2 cache size in order
to monitor any difference. However, no performance differ-
ence was detected between the native and paravirtualized
kernels. Figure 5 shows a histogram of the final CacheEdge
selected by ATLAS for the 20 runs, after disregarding the
runs where ATLAS chose only L1 blocking. The reason
ATLAS does not choose the same CacheEdge size for L2
blocking every time is that the code is sensitive to the slight
performance difference for cache sizes between 512KB and
2048KB. Therefore, a small variability in the performance
impacts the chosen CacheEdge but does not impact the over-
all performance as Figures 3 and 4 show. That is, a small
difference in performance will cause ATLAS to choose a dif-
ferent power of 2 for the cache block size (a relatively large
change). The histogram in Figure 5 reflects the variation of
cache-block size which ATLAS selected over different runs,
but Figures 3 and 4 show that this variation does not ulti-
mately affect the overall performance (note the small error
bars in the figures).
In addition, Table 4 outlines the median values of the
CacheEdge selected. The reason we chose to report the me-
dian rather than the mean is that ATLAS chooses among
different categorical values of L2, i.e., the median was more
representative of the optimal CacheEdge’s choice. From Ta-
ble 4 and Figure 5, we gather that the selection of the optimal CacheEdge is similar across all the OS kernels.
[Figure 3: Performance of the 2500-dimension matrix-matrix multiplication for L1 cache blocking (y-axis: performance in Mflops; bars for Native, Dom0, and DomU at 256MB and 756MB).]

[Table 4: Median of the CacheEdge selected by ATLAS for the OS kernels.]
This shows that ATLAS finds minimal or no difference between the different OS kernels in choosing their optimal L2 cache blocking size.

4.3 Routines Generation and Tuning
In order for ATLAS to obtain the best performance from
the system, it runs different routines and measures their per-
formance to choose the most efficient optimization for the
BLAS library customization. Some of the computational
kernels that come with ATLAS are hand-written assembly
routines, while others are autogenerated based on the output
of the system probe phase. In many cases and especially for
the popular architectures, the hand-written computational
kernels perform much better than the generated routines,
since the former kernels make use of special architectural features that the ATLAS code generator does not currently support. In this section, we compare the performance
of the ATLAS generated codes and the hand-written codes
for the different OS kernels.
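To give a flavor of such architectural features, the fragment below (our own illustration, not an ATLAS kernel) uses SSE2 intrinsics to perform a multiply-add on two double-precision values per instruction, the kind of vectorization available to the hand-written kernels:

/* Illustration of SSE2 double-precision SIMD, the kind of feature
 * hand-written kernels exploit (not an actual ATLAS kernel).
 * Computes c[i] += a[i] * b[i] two doubles at a time. */
#include <emmintrin.h>
#include <stdio.h>

void madd_sse2(const double *a, const double *b, double *c, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);          /* load 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        vc = _mm_add_pd(vc, _mm_mul_pd(va, vb));   /* c += a * b */
        _mm_storeu_pd(c + i, vc);
    }
    for (; i < n; i++)                             /* scalar remainder */
        c[i] += a[i] * b[i];
}

int main(void)
{
    double a[4] = { 1, 2, 3, 4 }, b[4] = { 5, 6, 7, 8 }, c[4] = { 0 };
    madd_sse2(a, b, c, 4);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 5 12 21 32 */
    return 0;
}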