Workload acceleration with the IBM POWER vector-scalar architecture
M. Gschwind
In this paper, we describe the history and development of the
IBM POWER® vector-scalar architecture, as well as how the
design goals of hardware efficiency and software interoperability
are achieved by integrating existing floating-point and vector
functions into a new unified architecture and function unit. The
vector-scalar instructions were defined with an emphasis on
out-of-the-box performance and consumability, while accelerating
a broad set of enterprise server workloads. Vector-scalar
instructions were first introduced in the IBM POWER7®
architecture to accelerate high-performance computing
applications. With the introduction of the POWER8® processor,
the vector-scalar architecture expanded to accelerate a diverse set
of enterprise workloads including unstructured text and string
processing, business analytics, in-memory databases, big data,
and stream coding. We conclude this paper with a description of
workload performance and application acceleration to
demonstrate the effectiveness of the new vector-scalar
architecture.
Introduction
In this paper, we describe the evolution and performance
characteristics of the IBM POWER* vector-scalar
architecture, an integrated vector and floating-point
architecture first introduced in the POWER7* processor [1].
As its name suggests, the vector-scalar architecture was
developed to accelerate both vector and scalar applications.
It restructures the core microarchitecture by integrating
and sharing register files and execution units of the
previously separate scalar floating-point and AltiVec** [2]
units, while maintaining full backwards compatibility.
The new architecture enables a design with reduced area
and power dissipation to increase overall core performance.
Applications can use the resources of this integrated unit
and allocate them to either scalar computations, vector
computations, or a mix of scalar and vector computations,
yielding better performance at lower cost than the
previously distinct units. In addition, the integrated
architecture simplifies data sharing between scalar and
vector execution and thus enables more efficient
vectorization of application algorithms by programmers
and compilers, further improving processor performance.
The vector-scalar architecture more than doubles the
floating-point performance of POWER processors for many
application domains and represents the most significant
architectural and design innovation in the POWER
architecture [3] since the introduction of the 64-bit Power
Architecture* extension [4, 5] more than a decade earlier.
Since its introduction in the POWER7 processor,
the vector-scalar architecture and design have been the
subject of ongoing improvements to increase both its
effectiveness and applicability to a steadily increasing set
of application domains. Thus, while the inaugural
specification of the vector-scalar architecture was focused
on extending the floating-point capabilities of enterprise
servers for traditional high-performance computing (HPC)
and numeric analytics workloads, subsequent generations
are expanding the workload domains addressed by the
vector-scalar architecture to enterprise server application
domains as diverse as text analytics and database
acceleration.
In this paper, we describe the history leading to the
development of the vector-scalar unit and the influence of
previous generations of IBM SIMD (single-instruction,
multiple-data) architectures on the genesis of the POWER
vector-scalar architecture. We also describe design criteria
for defining a SIMD architecture for server processors
and provide an overview of the vector-scalar architecture.
Finally, we describe workload optimization and performance
characteristics of the new vector-scalar architecture for
benchmarks of the SPECfp benchmark suite and for data
streaming from memory, linear algebra, stream coding, and
business analytics with in-memory databases in columnar
storage format, as well as the software ecosystem framework
supporting the vector-scalar architecture.
The evolution of POWER SIMD architecture
When SIMD vector processing was originally introduced
for the Power architecture with the AltiVec SIMD
instructions, it was dened in the context of a desktop
environment to support the Macintosh** graphical user
interface and many digital content-creation applications.
The AltiVec instruction set was tremendously successful
in spawning a number of new digital content applications
and an exciting new desktop environment that offered
differentiation for PowerPC*-based personal computer
applications [2].
While this offered a great advantage to the PowerPC-based
personal computers of the Apple Macintosh computers,
the focus of the IBM enterprise server business was
different. Consequently, the initial introduction of AltiVec
occurred in Power processors targeting the Apple
Macintosh computers, such as the IBM PPC970 processor.
The AltiVec instructions also offered attractive
characteristics for the game console market, and both the
Sony PlayStation** 3 and Microsoft Xbox 360** systems
built on these high-performance graphics attributes
offered by AltiVec. These systems would also serve to
explore new design concepts, which later influenced the
evolution of the Power SIMD architecture.
The Microsoft Xbox 360 exploits a symmetric
multiprocessor architecture with three Power processors as
the main application engine [6]. To provide the necessary
compute performance for graphics and content creation
on the central processing unit (CPU), the Power
architecture is paired with a deeply pipelined media unit
combining floating-point (FP) and SIMD-based graphics
processing in a shared unit. The Xbox 360 SIMD
architecture is an extension of the AltiVec architecture,
known as VMX128. The VMX128 architecture provides a
large register le with 128 vector registers and additional
graphics functions targeted at media processing [7].
The Sony PlayStation 3 is based on the heterogeneous
Cell Broadband Engine** (Cell/B.E.**) architecture [8].
The Cell/B.E. architecture introduces an innovative
accelerator-based programming model combining a Power
architecture-based CPU with eight accelerators based on
the Synergistic Processor Element (SPE) with a SIMD
instruction set architecture (ISA) [9]. Similar to AltiVec,
the SPE ISA offers 128-bit-wide registers to support
processing of integer and floating-point data types.
However, the SPE ISA integrates scalar and vector
processing in a single dataflow to build a more
area-efficient and power-efficient design with a unified
128-entry register file. The register file size reflects the
multiple requirements of a deeply pipelined high-frequency
design, the statically scheduled in-order
processor design, and the need to store all data types in a
single register file [10].
The large unified register file offers two advantages.
First, from a hardware resource perspective, it reduces the
area of the CPU by eliminating the duplication between
scalar and SIMD data paths and register files. Second, from
a performance perspective, it also offers the opportunity to
generate faster code by simplifying vectorization and
eliminating unnecessary, expensive data transfers between
scalar and vector register files, as described below.
A number of server designs were based on the AltiVec
and Cell SPE architectures. In traditional IBM enterprise
servers, AltiVec instructions first became available in
the PPC970-based blade servers [11], and then as an integral
part of POWER6* servers under IBM's vector and media
extension (VMX) name [12]. The bandwidth and wide
register file offered some attractive attributes for server
designs in key library functions but were of limited
applicability to many other traditional workloads due to the
focus on short integer and single-precision data types
commonly used in media processing and digital content
creation. Also, mixing scalar and vector processing in
a single computation proved difficult because vector and
scalar types were stored in separate register files.
Due to its efficient accelerator architecture, the
Cell/B.E. architecture not only proved to be a highly
effective processor for game consoles, but also offered
high performance and high power/performance efficiency
to many other highly compute-bound workloads [13, 14].
Based on these increasingly important attributes,
Cell/B.E. was used in a number of high-performance
embedded applications, in blade servers, and in the
Roadrunner system, the first heterogeneous supercomputer
consisting of general-purpose microprocessors and
programmable accelerators. In November 2008, the
Roadrunner system [15] became the first supercomputer
to reach petascale performance (i.e., over 10^15 floating-point
operations per second) based on the TOP500 supercomputer
performance ranking [16] and, propelled by its efficient
vector accelerator architecture, also became the world's most
power-efficient ("greenest") supercomputer listed in
the Green500 supercomputer ranking [17].
Soon after the success of Cell/B.E. and the introduction
of the AltiVec architecture into IBM's server line,
IBM initiated a new effort to create a server bringing the
advantages of the Cell SPE architecture to IBM's
mainstream Power server processors. This became even
more important with IBM's participation in the
High-Performance Computing Systems (HPCS) program
initiated by the U.S. Defense Advanced Research Projects
Agency (DARPA) to develop a next generation of
supercomputers based on commercial server technology.
Defining a server-optimized vector
SIMD framework
With IBM's participation in DARPA's HPCS program,
the PERCS (Productive, Easy-to-use, Reliable Computing
System) project focused on creating a commercial
processor core with HPC attributes. In particular, the goal
was to ensure that the new high-performance functions are
easy to use by co-designing hardware and software
technologies, such as microarchitecture, architecture,
compilers, and libraries. The programming environment
for exploiting accelerators in general-purpose processors
was still in flux, making the use of general-purpose
programmable accelerator cores unattractive for
general-purpose commercial applications.
Thus, we set out to extend the Power architecture to
deliver a highly efficient system capable of delivering high
performance "out of the box" without requiring extensive
application tuning. Our goal was twofold: to
deliver an architecture with the power/performance
efficiency of a SIMD vector design as demonstrated with
the Cell/B.E., and to design a robust set of functions that
may be exploited by compilers without extensive
application modifications and profiling. At the same time,
it was critical to not increase the processor core size.
To offer supercomputer-class performance, we
projected a need to deliver approximately 28 GFLOPS
(billion floating-point operations per second) per core,
i.e., 4 floating-point operations per cycle at 7 GHz,
or 8 floating-point operations per cycle at about 3.5 GHz.
Data parallelism clearly offered better
power/performance efficiency, given the high power
dissipation and diminishing performance gains achievable
from higher-frequency designs [18, 19].
Eight floating-point operations per cycle were
achievable with four fused-multiply-add units, which
could be arranged either as four scalar processing units,
as two vector fused-multiply-add (FMA) units operating on
2-element vectors, or as a single unit operating on a
4-element vector. Having independence between execution units
is preferable, because it gives programmers more
flexibility to use the units based on workload
requirements, resulting in an overall higher utilization of
execution units. Supporting the dispatch of four scalar
FMA operations per cycle would require building wide
instruction decode, dispatch, and issue data paths beyond
four compute and two branch instructions. This design
choice would increase the power dissipation of the processor
without commensurate benefits to commercial applications,
making the processor overall less attractive for such
applications. On the other hand, a single instruction
operating on a single 4-element vector offers less
flexibility than two separate instructions, each working on
a 2-element vector, which fits in the traditional Power
server instruction flow.
To avoid an increase in chip area due to the added
vector functionality, we decided to integrate the vector unit
and scalar floating-point units as previously demonstrated
by the SPE and VMX128 designs. This enables sharing a
single set of floating-point units for scalar and vector
processing, and also enables sharing instruction queues
and instruction issue logic. The traditional server designs,
capable of simultaneously issuing two floating-point
instructions, map well onto the preferred vector
configuration issuing two 2-wide vector instructions.
However, increasing the parallelism of vector
computations by making better use of the issue ports and
issuing floating-point instructions (on the POWER7
processor [1]) and all instructions (on the POWER8*
processor [20]) to both execution clusters was bound to
increase the register pressure seen during code generation.
(Informally speaking, register pressure refers to the
number of free registers available for use at a given point
in the execution of a program.) To keep twice
as many vector execution units busy as in the past, more
operands would need to be maintained in processor registers.
At the same time, the use of deep pipelining to increase
processor clock speed had also increased register pressure
for floating-point operations. Thus, we decided to explore
expanding the floating-point and vector register files
while avoiding any associated area increase.
While the SPE and VMX128 provided 128 architectural
registers for vector processing, they did not provide
register renaming. Thus, all parallelism had to be achieved
in the compiler by code scheduling using the architectural
registers. In comparison, the Power server designs
typically provide register renaming to exploit
instruction-level parallelism using out-of-order execution,
thereby requiring fewer architecturally defined registers. At
the same time, compilation techniques to exploit parallelism
in vectorized code, such as data-parallel if-conversion, as
well as instruction scheduling for wide instruction issue
and deep pipelines, increase register pressure. Thus,
our goal was to increase the number of architecturally
defined scalar and vector registers while also
significantly increasing the number of rename registers.
Physical integration of the scalar FP and vector register
files at the design level by building a single register file
offers savings by sharing decoders and other logic.
However, the efficiencies to be gained are comparatively
small, and such integration complicates the register renaming
logic. Thus, the vector-scalar architecture architecturally
integrates the floating-point and vector register files to obtain
additional improvements by enabling applications to use the
64-entry unified register file more efficiently. This enables
programmers and compilers to allocate a larger number of
registers from the unified architectural register file
depending on application need.
Code generation for the Cell SPE demonstrated that a
unified register file offers several advantages: many
applications use either scalar floating-point or vector types,
but not both. These applications experience a factor-of-two
increase in usable registers by being able to allocate all
registers to their dominant data types. Vectorized code
with a unified register file is also more efficient when
it reads or writes scalar operands, compared
to designs that require possibly long-latency transfers
between distinct register files and processing units.
In such environments, many opportunities for improving
performance with SIMD vectorization are lost when the
cost (e.g., in latency and performance) of transferring
between scalar and vector register files is added. Finally,
the exact transfer costs are difficult to predict for
compilers, forcing them to make conservative decisions
and avoid vectorization even in cases where such
vectorization may be advantageous.
The new unified register file subsumes and overlays the
legacy scalar floating-point and vector register files, both
to enable vector-scalar code to integrate with legacy code
modules and to realize area savings [21]. The legacy
scalar floating-point registers are extended to 128 bits to
create a register file with registers of uniform length
capable of storing either scalar or vector data. We call this
new unified register file the vector-scalar (VS) register file,
and we call the instructions using the new unified register
file vector-scalar instructions.
Figure 1 shows the vector-scalar register file that we
introduced. The new vector-scalar instructions use all
64 vector-scalar registers as operands. Legacy
floating-point instructions operate on the first 32
vector-scalar registers. AltiVec instructions operate on
VS registers 32 to 63, corresponding to AltiVec vector
registers 0 to 31.
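The overlay can be pictured with a small C model. The following is purely an illustrative sketch (the structure and helper names are invented for exposition), not a hardware or operating-system interface:

#include <stdint.h>

/* Illustrative model of the VS register file overlay of Figure 1. */
typedef struct { uint64_t dword[2]; } vsr_t;     /* one 128-bit VSR   */
typedef struct { vsr_t vsr[64]; } vs_regfile_t;  /* VSR0..VSR63       */

/* Legacy views of the unified file:
 *   FPR n maps to doubleword 0 of VSR n   (n = 0..31)
 *   VR  n maps to VSR (n + 32)            (n = 0..31) */
static inline uint64_t read_fpr(const vs_regfile_t *rf, int n) {
    return rf->vsr[n].dword[0];
}
static inline vsr_t read_vr(const vs_regfile_t *rf, int n) {
    return rf->vsr[n + 32];
}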
A new instruction set for vector and
scalar processing
To use the full vector-scalar register file, we introduce a
new set of instructions, the vector-scalar instructions,
addressing the entire 64-entry register file using 6-bit
register specifiers. Instructions with 6-bit register specifiers
are new to the architecture, and traditional instruction
encodings make it difficult to efficiently integrate them
with existing 5-bit specifiers in a design. To simplify
decoding of floating-point, vector, and new vector-scalar
register operands, we implement the new VS register specifiers
using a non-contiguous specifier format. Thus, the
low-order 5 bits of a register specifier are always in the
same position, regardless of whether instructions use
a 5-bit or a 6-bit register specifier. The new vector-scalar
instructions specify an additional sixth bit in a separate
instruction field. Thus, when decoding instructions directed
at execution in the vector-scalar unit, the sixth bit is
either set to zero (0), for legacy scalar floating-point
instructions; to one (1), for AltiVec instructions; or to the
extended specifier bit, for new vector-scalar instructions
addressing all 64 VS registers (VSRs) [7].
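As a hedged illustration of this split-field decode (the function and parameter names are assumptions for exposition):

/* Sketch of the VS register specifier decode described above. */
static inline unsigned decode_vsr(unsigned low5, unsigned ext_bit,
                                  int is_legacy_altivec)
{
    /* AltiVec instructions implicitly address VSR 32..63 */
    if (is_legacy_altivec)
        return 32 + low5;
    /* Legacy FP instructions behave as if ext_bit were 0;
     * new VS instructions supply ext_bit explicitly. */
    return (ext_bit << 5) | low5;
}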
New VS instructions consume more encoding space
than other Power architecture instructions. Each register
operand extension bit doubles the encoding cost relative
to the 2^32 possible unique instruction encodings. Thus,
a 3-operand instruction requires eight times more encoding
space, and a 4-operand instruction requires 16 times more
encoding space. This is a particular challenge for
fixed-width reduced instruction set computing (RISC)
ISAs, which exploit their decoding simplicity
to decode many more instructions in parallel than a
complex instruction set computing (CISC) architecture.
On the other hand, adding a new, longer instruction
Figure 1
The 64-entry vector-scalar register file. The new VS register file
with 64 vector-scalar registers (VSRs) is created by extending the
32 floating-point registers (FPRs) to 128 bits and combining them
with 32 AltiVec registers (VRs).
format to a CISC architecture is a small incremental cost.
Transitioning from a fixed-width RISC encoding to a
variable-width encoding would incur a much larger
initial penalty for the first such added new
instruction format.
To preserve the efficient fixed-width RISC ISA
encoding, it was critical to conserve opcode space to avoid
"depleting" any remaining encoding space for the Power
architecture, and we decided to provide new instruction
forms for addressing all VS registers only for those
instructions that demonstrate substantial benefit from the
increased register file size. For example, the low-latency
integer vector instructions derive little benefit from a
large register file.
Based on this framework, we retain the traditional
AltiVec encoding for integer operations with the ability to
address 32 vector registers, and add new integer-oriented
instructions following the established AltiVec instruction
formats. We provide new VS instructions for four
instruction categories: scalar floating-point instructions,
vector floating-point instructions, data reorganization
instructions, and memory access instructions.
Floating-point vector-scalar instructions
A full set of scalar double-precision instructions operating
on 64 VS registers was added in the POWER7
architecture, and single-precision instructions were added
in the POWER8 architecture. As in the "classic" Power
floating-point architecture, single-precision numbers are
represented in 64-bit double-precision encoding, but
rounded to single precision.
The vector-scalar instruction set also introduces full
support for mixed-precision computations as defined by
the revised IEEE 754 floating-point standard introduced in
2008 [22]. In the Power architecture, scalar double-precision
floating-point instructions have always been
able to process a mix of single- and double-precision
inputs. With the new vector-scalar instructions, when
scalar single-precision floating-point instructions
process any mix of single- and double-precision inputs,
the result is the single-precision number closest to the
"infinitely precise" mathematical result of the operation
on the inputs.
The VS instructions introduce instructions for
performing floating-point vector operations on vectors of
two 64-bit double-precision values per register. In addition,
the VS instructions include a new set of vector single-precision
FP instructions operating on four 32-bit single-precision
FP values per VS register, matching the format
used by the AltiVec graphics floating-point instructions.
The instruction set includes the usual arithmetic
operations, such as fused-multiply-add/subtract
instructions. To reduce encoding space, FMA instructions
are encoded with one read/write operand, where either
the product is accumulated into an addend that serves as
input and output (common in linear algebra), or
one of the multiplicands is overwritten (inter alia, used in
evaluating polynomial series). The new instructions
include a flexible set of scalar and vector conversion
instructions between integer, single-precision FP,
and double-precision FP formats.
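As a hedged illustration of the two destructive FMA operand patterns (a C sketch of the operand reuse, not the instruction encodings; the compiler chooses between the "accumulate" and "multiply" encodings depending on which source operand the target overwrites):

#include <altivec.h>

/* "accumulate" pattern: the target doubles as the addend */
vector double fma_accumulate(vector double acc,
                             vector double x, vector double y)
{
    return vec_madd(x, y, acc);   /* acc <- x*y + acc */
}

/* "multiply" pattern: the target doubles as a multiplicand */
vector double fma_multiply(vector double m,
                           vector double x, vector double b)
{
    return vec_madd(x, m, b);     /* m <- x*m + b */
}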
In addition to the traditional computational operations,
the classic FP instructions and the VS scalar and vector
instructions contain support for accurate, fully
IEEE-compliant iterative approximation of divide and
square root computations in software: new test divide
and test square root instructions test the inputs of
divide and square root to determine whether a divide or
square root software approximation is expected to
require special handling due to the presence of NaN
inputs, result overflow, or other conditions requiring
special handling.
The new VS instructions provide fully IEEE-compliant
operation, and the VS vector instructions are controlled
by rounding modes and provide IEEE exception condition
indication to report underflow, overflow, invalid operations,
and NaN operands. When floating-point exceptions
are enabled, no result is written to the target register upon
an exception, and control is passed to software.
VS scalar compare instructions set a condition register,
and VS vector compare instructions set a vector register
with a Boolean value for each element, indicating
whether a specified condition is true or false. The Boolean
values in the vector register file can be used in conjunction
with data-parallel if-conversion to vectorize code with
control flow, as described below. Vector compare
instructions can also set a summary in a condition register
indicating whether any, all, or none of the elements
meet a condition.
Vector-scalar data reorganization instructions
The VS instruction set adds several new data
reorganization instructions to the already robust
reorganization instruction set provided by AltiVec. These
include merging word and double-word vectors and
replicating word and double-word elements, corresponding
to the higher-precision element sizes targeted by the
vector-scalar architecture. If more fine-grained control is
needed, these functions continue to be available for
operands in VS registers 32 to 63 by using AltiVec
instructions, exploiting the register file overlay that
integrates the vector-scalar and AltiVec registers.
The VS instructions include a 4-operand data-parallel
select instruction that can select a result on a per-element
basis from one of two input operands under control of a
third control operand. The control operand takes the form
of a bit mask generated by vector compare instructions.
Complementing the VS select instructions, the VS
Boolean logical instructions compute arbitrary conditional
expressions involving vectorized comparisons. Together,
these instructions enable programmers and compilers to
vectorize conditional operations using data-parallel
if-conversion and avoid the cost of performing sequential,
difficult-to-predict, data-driven branches [9].
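To make the idiom concrete, the following hedged C sketch vectorizes the conditional z[i] = (x[i] > y[i]) ? x[i] : y[i] with a vector compare and a data-parallel select (function name and loop framing are illustrative; it assumes an even trip count):

#include <altivec.h>

void vmax(const double *x, const double *y, double *z, long n)
{
    for (long i = 0; i < n; i += 2) {
        vector double vx = vec_xl(i * sizeof(double), x);
        vector double vy = vec_xl(i * sizeof(double), y);
        /* per-element Boolean mask: all 1s where vx > vy */
        vector bool long long m = vec_cmpgt(vx, vy);
        /* select vx where the mask is set, vy elsewhere */
        vec_xst(vec_sel(vy, vx, m), i * sizeof(double), z);
    }
}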
Vector-scalar memory access and data
movement instructions
The VS instruction set introduces new memory
access instructions for scalar and vector accesses. For
scalar accesses, the memory access instructions support
loading single- and double-precision floating-point values
and work similarly to the "classic" floating-point
instructions operating on the floating-point registers.
The new VS vector memory instructions support
unaligned memory accesses directly and can load or store
any vector register at arbitrary alignment. For the
POWER7 processor, the primary workload focus was on
providing support for high-performance numerical
algorithms such as linear algebra and big data applications.
To support such applications, the POWER7 processor
introduced a new mechanism for implementing
high-performance unaligned accesses for vectors starting
at any naturally aligned double-precision boundary:
unaligned load instructions that extend
across a cache segment boundary and require two cache
accesses are completed by returning the data up to the
boundary into the target register, but notification of
data availability for the target register is suppressed.
Simultaneously, the load/store unit initiates re-issuance of
the unaligned memory access instruction, modified to
perform an access to any remaining data; the re-issued
access writes the remaining data into the result register
and indicates availability of the result
to make any dependent instructions ready to execute.
High-performance unaligned accesses depend on separate
write controls for each doubleword of the register file to
write partial results on a doubleword boundary into
target registers.
As shown in Table 1, in the POWER7 processor,
unaligned accesses that are not aligned on a double word
boundary are executed by aborting the access and
emulating the access using microcode (for memory
accesses aligned at least at a word boundary) or the
operating system (for alignment below a word boundary)
when they extend beyond a cache subline boundary at
32 bytes. On the POWER8 processor, the new
high-performance alignment microarchitecture is
implemented for all alignment boundaries.
In addition to normal VS vector load instructions, the
new VS instruction set includes a new vector load and
splat instruction that loads a single scalar value and
replicates that value across all elements of the target
vector register. Finally, the VS instruction set includes
instructions to move data between general purpose and
VS registers, starting with the POWER8 processor.
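As a usage illustration (a hedged sketch with standard intrinsics; the function name is invented), the unaligned vector load and the load-and-splat pattern are exposed to C programmers roughly as follows:

#include <altivec.h>

/* vec_xl loads 16 bytes from any byte alignment; vec_splats
 * replicates a scalar across all vector elements (compilers can map
 * a splat of a loaded scalar onto a load-and-splat instruction). */
vector double axpy_step(const double *p, double s, vector double acc)
{
    vector double v  = vec_xl(0, p);   /* unaligned 16-byte load */
    vector double vs = vec_splats(s);  /* replicate s into both elements */
    return vec_madd(vs, v, acc);       /* acc + s * p[0..1] */
}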
New vector integer instructions
With the POWER8 architecture, we introduced a variety of
new integer-oriented vector instructions. These new vector
integer instructions retain the AltiVec instruction encoding
with 32 vector register operands, selected from vector-scalar
registers 32 to 63, corresponding to the AltiVec registers.
These new vector integer instructions include operations
on vectors of 64-bit integers, and instructions operating on
128-bit integers.
This category also includes instructions supporting
Advanced Encryption Standard (AES) coding and secure
hashing as well as various coding schemes, such as
binary polynomial multiply-sum, the secure hash algorithm
Table 1 Execution of an exemplary (big-endian) vector/scalar memory load instruction on the POWER7 and
POWER8 processors. On both POWER7 and POWER8 processors, naturally aligned vectors are loaded with a single
memory subsystem access. On POWER7, operating system emulation is used to perform memory accesses that are
not aligned at a word (4 byte) boundary and are best avoided by code generation. On POWER8, all accesses within
a cache subline (i.e., those not crossing a 32 byte boundary) are performed using a single access, and all accesses
crossing a cache subline boundary are performed using the high-performance memory access capability by double
issuing the memory access instruction. Memory accesses crossing page boundaries require additional processing.
(SHA2), permute-xor, bit permute, and bit transpose,
as well as other enhancements to existing integer
vector instructions.
Workload optimization
An initial design goal for the vector-scalar unit was to
enhance the performance of numeric applications by
increasing both the number of function units and available
registers. For many linear algebra applications, the
performance improvements scale directly with the
increased parallelism.
To explore the performance impact on a broader set of
floating-point applications, Figure 2 shows normalized
runtimes (lower is better) for the SPECfp benchmark
suite to compare the performance characteristics of the
Power ISA across three generations: the instruction set
available with legacy floating-point and AltiVec support
prior to the introduction of the vector-scalar instructions
(for the POWER6 processor), and the first and second
generations with the vector-scalar instructions
(for the POWER7 processor and the POWER8 processor).
The performance results were collected to quantify the
impact of the instruction set. Thus, while each of the three
measurements was compiled with a different instruction
set generation, the code was generated for the POWER8
machine model describing POWER8 latencies, functional
units, memory hierarchy, and so forth, with a development
release of the XL C/C++ and XL Fortran compilers
with interprocedural optimization and SIMD
autovectorization, but no benchmark-specific optimizations.
All workloads are executed on the same IBM p824L
system in single-thread mode in order to explore the
impact of the instruction set enhancements while factoring
out compiler capabilities, processor design, and system
design. As these results are "out of the box" results,
i.e., the source code is unmodified and not tuned to take
advantage of new architectural capabilities, this figure also
measures the "consumability" of the new architecture.
Enablement of unmodified code purely based on compiler
technology was a key requirement to make extensions
immediately accessible to software developers and system
users without lengthy and expensive code enablement
and tuning.
Without interprocedural optimizations (not shown),
the peak performance improvement across the SPECcpu2006
floating-point benchmarks reaches a factor of two
(i.e., reducing the normalized runtime to 0.50) with the
vector-scalar instructions on the POWER7 and POWER8
architectures, with a geometric mean runtime speedup of
1.07 and 1.09 (corresponding to normalized runtimes
of 0.93 and 0.92) for the POWER7 and POWER8
instruction sets. With interprocedural optimizations as
used to obtain the results shown in Figure 2, the
performance improvement reaches a peak speedup
of 1.57 for the cactusADM SPECcpu2006 benchmark,
and a geometric mean runtime improvement of 1.12 and
1.15 for the SPECfp benchmarks with the POWER7 and
POWER8 instruction sets.
In addition to optimizing numeric performance, the
vector-scalar instructions were also targeted at new
workload optimizations for key enterprise workloads in
Figure 2
Performance comparison of the three recent generations of IBM vector and floating-point architectures. Reported runtimes are normalized relative to POWER6 runtime.
big data and business analytics applications, including
numeric modeling (linear algebra), in-memory databases
using columnar store organization, unstructured text
processing, and coding applications. The following
sections explore the performance characteristics of key
enterprise workloads considered during the design of the
first and second generations of vector-scalar instructions.
Optimizing memory accesses
The new VS vector memory instructions support
unaligned memory accesses directly. In contrast, the
AltiVec architecture relies on programmers to expand
unaligned memory accesses in application code into
a sequence of aligned loads and use permute instructions
to extract unaligned vectors from successive aligned
accesses. Thus, this sequence shifts the unaligned vector
value spanning two aligned memory quadwords loaded
into two vector registers into an aligned position in a
register. For unaligned access sequences of n vector
registers, this leads to n+1 vector accesses, n permutes,
and one lvsl (load vector for shift left) instruction to
compute the shift factor. Asymptotically, this corresponds
to one vector load and one compute operation to access
unaligned values in sufficiently long, unaligned streams.
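The classic AltiVec idiom looks roughly like the following hedged C sketch (shown for single precision, since AltiVec predates double-precision vectors; the function name is illustrative, and like the classic idiom it may read up to 15 bytes past the addressed data, so buffers are assumed padded):

#include <altivec.h>

/* Software alignment of one unaligned vector: two aligned loads,
 * a permute pattern from lvsl, and a permute to extract the data. */
vector float load_unaligned(const float *p)
{
    vector unsigned char pat = vec_lvsl(0, p); /* shift pattern from p's alignment */
    vector float v0 = vec_ld(0, p);            /* aligned quadword covering start  */
    vector float v1 = vec_ld(16, p);           /* next aligned quadword            */
    return vec_perm(v0, v1, pat);              /* extract the unaligned vector     */
}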
With the shift from long media streams in desktop client
systems to server workloads such as linear algebra,
big data applications, and business analytics, and the
need for processing unstructured text as input to analytics
processing, the characteristics of memory streams are
changing. With the use of middleware and libraries,
applications cannot be generated with an expectation of
receiving aligned or unaligned streams, because the
alignment properties are established by the caller of the
library function. Thus, the library has to be coded to
"expect" the worst-case alignment of its input arrays,
even when these arrays will frequently be aligned. Thus,
even users providing aligned input parameters would incur
the overhead of executing the code to load and align data
extracted from unaligned data streams.
In addition, many server applications operate
on shorter access sequences: linear algebra functions
may operate on blocked array data (e.g., 4 x 8 matrix tiles
based on tiles with 32 floating-point values) to enable data
reuse from operations on small sub-matrices. Also,
commercial applications often operate on single records,
small memory objects, or small arrays; and string
processing often involves processing of short strings.
While the application-based processing of unaligned
accesses used in AltiVec is efficient for long unaligned
streams, vector access sequences for many server
workloads are too short to amortize the startup overhead. In
addition, this process results in the need to fetch, decode,
and issue additional instructions to perform data alignment
and increases register pressure. Application-based
processing of unaligned vectors requires two aligned vector
registers each containing a portion of the unaligned vector,
one vector register to hold a control word used to identify
the unaligned bytes to be extracted, and one vector register
receiving the extracted unaligned vector. The sequence for
performing unaligned stores is even more complex, with
particularly complex processing at the beginning and
end of an unaligned stream, or for single, isolated
unaligned vector stores. This complexity increases
further when multithreading, atomicity requirements, or
memory protection boundaries must be considered.
The increased register pressure and instruction
bandwidth necessary for software-based unalignment
handling conflict directly with optimizations such as
register allocation to capture reference locality and
scheduling instructions to issue two vector compute
instructions per cycle. As shown in Listing 1, support for
high-performance unaligned loads leads to a significant
improvement in code size and resource use even for a
simple example such as the well-known double-precision
z_i = a*x_i + y_i (DAXPY) loop:
vector double *x, *y;
vector double *z ALIGNED (16);
vector double a_vec = vec_splats (a);
for (i = 0; i < MAX; i++)
    *z++ = vec_madd (a_vec, *x++, *y++);
Here, vec_madd returns a vector containing the results of
performing a fused multiply-add operation for each
corresponding set of elements of the given vectors. The
corresponding AltiVec code of Listing 1(a) assumes
vector double *z to be aligned for simplicity and space,
due to the significant complexity of handling unaligned
stream stores. With the vector-scalar unaligned support
(see Listing 1(b)), this sequence requires eight instructions
instead of twelve AltiVec instructions per loop iteration, and
one instruction instead of five instructions for loop setup.
Reducing the setup overhead becomes particularly important
for short streams or singleton vector accesses. In addition,
the vector-scalar code with unaligned support for up to three
misaligned streams requires only four vector registers,
whereas the AltiVec code requires nine vector registers to
support two unaligned and one aligned stream. For more
complex examples, compilers can easily run out of registers
when generating code for the AltiVec architecture and
must generate code to spill and reload registers to and from
memory, further degrading performance.
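Listing 1(b) appears as an image in the original; as a rough stand-in, a hedged intrinsics-level version of the same loop (assuming an even trip count n and valid arrays; vec_xl and vec_xst handle the unaligned accesses directly):

#include <altivec.h>

void daxpy(const double *x, const double *y, double *z,
           double a, long n)
{
    vector double va = vec_splats(a);
    for (long i = 0; i < n; i += 2) {
        vector double vx = vec_xl(i * sizeof(double), x);
        vector double vy = vec_xl(i * sizeof(double), y);
        /* z[i..i+1] = a * x[i..i+1] + y[i..i+1], any alignment */
        vec_xst(vec_madd(va, vx, vy), i * sizeof(double), z);
    }
}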
Numeric applications
Numeric processing performance is critical in many
application domains, including enterprise server domains.
The use of numeric applications, and, in particular, linear
algebra in scientic and engineering applications, is
well-known and has uses in areas that include modeling
high-energy physics, environmental processes (such as
climate and weather forecasting), and biological processes,
as well as civil and mechanical engineering and material
sciences. With the increase of data volumes in business
processes and an increasingly instrumented world,
numerically intensive applications have become instrumental
in many other application areas, such as machine learning,
deep learning, search (e.g., the PageRank algorithm used in
Internet search), data mining, and analytics.
The new VS instruction implementations offer several
improvements for numeric applications to reach the
peak issue rate of two two-element double-precision
floating-point SIMD vector instructions per cycle and to
achieve a sustained throughput of 4 fused-multiply-add
operations per cycle. These improvements include
an expanded register set that enables more efficient tiling;
support for unaligned data access that enables the most
efficient execution of compute-intensive loops on aligned
and unaligned data with shared code, without requiring
expensive loop versioning; and load-and-splat instructions
that integrate data reorganization into the memory access
without additional overhead, e.g., to efficiently
multiply a vector with a scalar value.
A typical application example is the commonly used
DGEMM function in linear algebra libraries, a general
matrix-matrix multiplication of double-precision
floating-point values. Matrix multiplication uses a row
of a matrix and multiplies it by a column of a matrix in
order to produce one value in the product matrix. At any
one point in time, the entire product matrix and at least
one row and one column of the input matrices
need to be in registers. The building block of
optimized matrix multiplication functions is a dense
matrix-multiply tile exploiting a processor's
register file to capture the reference locality of a tiled
matrix multiplication. The matrix-multiply kernel of this
tile executes the entire matrix multiply without further
memory accesses until all partial results for the tile have
been computed.
In order to take advantage of the full performance
potential offered by two 2-way double-precision vector
floating-point execution units, independent operations are
necessary to cover the pipeline latency of the vector
execution units. The example of Listing 2 shows a kernel
of a linear algebra double-precision matrix-matrix
multiplication c = a * b (DGEMM). The kernel
represents the computation on a 4 x 8 matrix tile,
computing c[0:3][0:7] = a[0:3][0:3] * b[0:3][0:7].
At the entry to this kernel, general-purpose registers r3
and r4 point to the beginning of the input range of the
input parameters a and b in memory (i.e., r3 = &a[0][0]
and r4 = &b[0][0]), and the intermediate results for the
tile are held in vector-scalar registers vs32 to vs47.
As seen in Listing 2, this basic building block requires
4 floating-point vector registers to hold an 8-element
slice of a matrix row, 4 floating-point vector
registers to store a 4-element slice of one matrix column
(with each element replicated in both vector elements
of the vector register to implement a scalar-vector
product), and 16 vector registers to hold all 32 elements
of the 4 x 8 product tile computed using 16 floating-point
vector instructions.
Listing 1 Accessing unaligned data using AltiVec and VSX. Shown here is the DAXPY benchmark using (a) AltiVec instructions and (b) vector-scalar instructions to access input streams of unknown alignment.
To take advantage of the performance potential of
POWER servers implementing the vector-scalar facility,
it is of critical importance to use the available architectural
resources by ensuring enough independent vector-scalar
instructions are available to exploit the hardware resources.
This can be achieved by a combination of loop unrolling and
memory latency hiding techniques (such as software
pipelining and data prefetching) in conjunction with the
large vector-scalar register file. We use single-precision
vector-matrix multiplication as an example to
demonstrate the gains possible by increasing the
vector-scalar instruction content using different tile sizes
(with code very similar to the matrix-matrix multiply
shown above) and unrolling the kernel for computing
the result tile. Table 2 gives program statistics for a
single-precision vector-matrix multiplication kernel as a
function of tile size. Increasing the tile size, e.g.,
using code transformations such as loop unrolling,
makes better use of the available registers, and creates
more vector instructions as a fraction of the total instruction
count that may be decoded, issued, and executed in
parallel. Figure 3 plots the normalized execution time
and SIMD content of the dynamic instruction mix as a
function of unrolling factor and tile size. Figure 4 plots
normalized execution time as a function of the SIMD content
of the dynamic instruction mix for the same code,
demonstrating the acceleration obtainable with SIMD
vectorization.
Listing 2 The linear algebra double-precision matrix multiply (DGEMM) kernel using the vector-scalar
architecture.
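Listing 2 itself is reproduced as an image in the original. As a rough stand-in, the following hedged C sketch shows the same pattern on a reduced 2 x 4 tile (layouts, names, and loop bounds are illustrative assumptions, not the paper's actual kernel): each a element is replicated across a vector (load-and-splat), each b row supplies two 2-wide vectors, and all partial sums stay in register accumulators.

#include <altivec.h>

/* Reduced 2x4 DGEMM micro-tile: c[2][4] = a[2][K] * b[K][4],
 * with a and b stored row-major. */
void dgemm_2x4_tile(const double *a, const double *b, double *c, long K)
{
    vector double acc00 = vec_splats(0.0), acc01 = vec_splats(0.0);
    vector double acc10 = vec_splats(0.0), acc11 = vec_splats(0.0);
    for (long k = 0; k < K; k++) {
        /* replicate one element of each a row across a vector */
        vector double a0 = vec_splats(a[0 * K + k]);
        vector double a1 = vec_splats(a[1 * K + k]);
        /* one row of the b tile: 4 doubles = two 2-wide vectors */
        vector double b0 = vec_xl(0,  &b[k * 4]);
        vector double b1 = vec_xl(16, &b[k * 4]);
        acc00 = vec_madd(a0, b0, acc00);
        acc01 = vec_madd(a0, b1, acc01);
        acc10 = vec_madd(a1, b0, acc10);
        acc11 = vec_madd(a1, b1, acc11);
    }
    vec_xst(acc00, 0, &c[0 * 4]);  vec_xst(acc01, 16, &c[0 * 4]);
    vec_xst(acc10, 0, &c[1 * 4]);  vec_xst(acc11, 16, &c[1 * 4]);
}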
Coding support
The vector-scalar instruction set introduces support for
accelerating a number of coding techniques in the
Power architecture, with the goals of handling data
streams at the speed of data feeds and supporting a variety
of data coding techniques, such as compression and
decompression, and encryption and decryption of data
feeds. Expanding on the hardware coding support first
introduced with memory-bus-attached accelerators in
POWER7+* systems, the POWER8 architecture
introduces in-core coding support: AES encryption, secure
hash algorithm (SHA) hashes and signatures, cyclic
redundancy codes (CRCs) with application-specified CRC
functions via flexible bit-multiply-sum instructions, and
redundant array of independent disks (RAID) codes.
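As a hedged illustration of the bit-multiply-sum primitive (using the GCC crypto builtin for the POWER8 vpmsumd instruction; the framing as a CRC folding step is illustrative, and real CRC code would supply precomputed folding constants):

#include <altivec.h>

typedef vector unsigned long long v2u64;

/* One carry-less multiply step: each 64-bit doubleword of `data` is
 * multiplied by the corresponding doubleword of `consts` over GF(2),
 * and the two 128-bit products are XOR-summed. CRC folding loops are
 * built from repeated applications of this primitive. */
v2u64 fold_step(v2u64 data, v2u64 consts)
{
    return __builtin_crypto_vpmsumd(data, consts);
}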
Responding to the need for increased data security for
business processes and in electronic commerce, these
functions have applicability for protecting online
commerce transactions, data security for disk encryption,
client caching, and securing server-side data, network
protection, digital rights management, and data
authentication and signatures with the Secure Hashing
Algorithm (SHA). In addition, protection codes, such as a
variety of different CRCs, are critical in many data
integrity applications, for file serving (e.g., with the
T10 DIF [Data Integrity Field] standard) and other
network protocols, and for in-storage data protection in
IBM's General Parallel File System (GPFS*) and for
software RAID applications.
The first exploiters of these functions include libraries
implementing encryption standards, the Hadoop big data
platform with its use of in-core CRC functions
implementing the Hadoop Distributed File System (HDFS)
CRC32 in AIX* and Linux**, a Galois Counter Mode
library, and Encrypted File Systems.
Table 2 Performance and program attributes of a single precision matrix-vector multiply linear algebra kernel, as a
function of tile size and SIMD computation content of the instruction mix.
Figure 3
Performance and fraction of SIMD instructions of the dynamic instruction mix as a function of the loop unrolling factor, for a single-precision matrix-vector multiply linear algebra kernel.
Figure 4
Performance as a function of the fraction of vector-scalar instructions in the dynamic instruction mix, for a single-precision matrix-vector multiply linear algebra kernel.
Business analytics optimization with column store
database support
Enterprises are increasingly turning to information
technology not only to automate operations and record
transactions, but also to understand and optimize their
business and business processes. In this context, the
transaction data collected during business operations
provide a powerful source for gaining new insights from
large data repositories of otherwise dormant data by
applying business analytics to understand the raw data.
One goal for business analytics applications is to detect
patterns in data and build models of empirically observed
events. Business analytics is often classified into three
domains of increasingly abstracted and reduced data with
increasing business value. The first domain, descriptive
analytics, summarizes existing data as a number of metrics
extracted from raw data. The second domain, predictive
analytics, uses a model extracted from observed business
data to probabilistically extrapolate to unavailable data
(both past and future data). A third domain, prescriptive
analytics, uses a predictive model in conjunction with a
business goal to recommend one of multiple possible
actions.
Much of today's transactional business information is
stored in databases. Today, transactional records are
accessed only a few times over the entire life span of the
record: a transaction record is typically created at the
inception of a business transaction. The record is accessed
a few times during the delivery of the transaction to
fulfill any undertakings made as part of the transaction
(this can often be a single access, e.g., when an order is
fulfilled). Finally, the record may be accessed a few times
for billing and accounts settlement transactions. In some
cases, all three phases are completed in a single
transaction (e.g., at a point of sale in stores).
After the completion of a transaction, records most
frequently are dormant and often captive in data silos
distributed across many business units, even though these
transactional records represent a unique source to gain
real-time insights into a business and optimize many
business aspects, including correlation of purchases,
seasonal and regional sales trends, margin analysis, store
profitability, and fraud detection and prevention. However,
if made properly accessible, information contained in
these records can be used either directly as descriptive
business analytics results, or as input to predictive
and prescriptive business analytics. In order to extract this
information from transactional databases, transactional
records need to be aggregated across their operational silos
(e.g., purchasing, warehouse inventory, and store sales
data), and fast and efficient data selection and grouping of
transactional records are necessary.
To provide data accessibility, businesses are
increasingly turning to data warehouses to store business
data across the entire enterprise. While traditional
transactional queries often access a few records at a time,
business analytics operates on large sets of records,
that are either aggregated (GROUP BY) or selected
(SELECT WHERE) based on common characteristics that
may prompt further actions.
To optimize for these frequent large-scale database
accesses, businesses are increasingly turning to analytic
databases, read-only systems that store historical data on
business metrics, often across multiple business aspects.
Decision makers in a business can use these data to
perform queries, generate reports, and develop models
based on the data in the database, often referred to as a
"data mart." Uses include ordering information based on
predicted regional and seasonal sales and developing
targeted advertising campaigns.
In-memory columnar, compressed databases have been
gaining wide acceptance for these data marts. Data marts
are most often created from, or as a replica of, OLTP
(online transaction processing) databases while minimizing
redundancy. As shown in Figure 5, data marts are
created by a distillation process, transforming OLTP
records into short, highly normalized and compressed
data, often omitting data records, and fields (columns)
from retained data records, based on the intended use of a
data mart. In this representation, most or all columns
are compressed by assigning each observed field value a
numeric value, which is also an index into a newly
created secondary dictionary table that contains the full
field representation. As an example, when compressing a
database containing addresses that are located either
in Los Angeles or New York, the city name string field
may be encoded in a single bit, and a newly created
secondary database table specifies that the bit values
0 and 1 correspond to Los Angeles and New York,
respectively. The goal of this transformation is two-fold.
On one hand, it increases the efficiency of search
operations by reducing all accesses to simple bit field
comparisons regardless of the original data type, making
them more uniform and efficient. On the other hand, the
massive compression enables large datasets to be
stored in main memory and offers even more dramatic
speedups by transforming queries from an I/O-bound to a
compute- and memory-bandwidth-bound problem in a
data mart.
While some of the first data mart implementations
relied on special-purpose acceleration, the POWER8
architecture introduces workload optimization features in
the vector-scalar unit to create data marts efficiently
using general-purpose Power processors. In particular,
the POWER8 architecture introduces support for
software-defined variable-width bit field arithmetic to
accelerate data mart queries, as well as further
optimizing the memory subsystem with increased
memory bandwidth and enhanced hardware and
software prefetching.
Because traditional fixed-width field arithmetic is
limited to vectors of homogeneous fields with relatively
large increments (SIMD architectures typically
support vectors of 8-bit, 16-bit, 32-bit, or 64-bit fields),
data marts use software-defined variable-width bit fields
based on "columnar algebra," which can be implemented
with vector-scalar bit-logical operations and 128-bit add
and subtract. Columnar algebra on records where the
aggregate index size exceeds 128 bits can be implemented
by representing them as a multiple of the base 128-bit
operation using 128-bit add and subtract with carry-in and
carry-out operations. Like vector data values, carry bits
are stored in vector registers, to enable efficient scheduling
and renaming of operations.
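A hedged sketch of this multi-quadword pattern, using the GCC/XLC vector __int128 extension and the standard vec_addc/vec_adde intrinsics (which map to the POWER8 vaddcuq and vaddeuqm quadword-carry instructions; the 256-bit framing is an illustrative assumption):

#include <altivec.h>

typedef vector unsigned __int128 v1u128;

/* 256-bit add built from 128-bit quadword operations; the carry
 * travels in a vector register between the two halves. */
void add256(v1u128 a_hi, v1u128 a_lo, v1u128 b_hi, v1u128 b_lo,
            v1u128 *s_hi, v1u128 *s_lo)
{
    v1u128 carry = vec_addc(a_lo, b_lo);  /* carry-out of the low quadword */
    *s_lo = vec_add(a_lo, b_lo);          /* low 128-bit sum */
    *s_hi = vec_adde(a_hi, b_hi, carry);  /* high sum with carry-in */
}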
Because data marts primarily concern selection and
aggregation, the key column algebra operations are
comparisons for equality, for inequality, and for range
check, i.e., a comparison against a lower and an upper
bound. Using VSX instructions, multiple variable-width
bit fields corresponding to database columns may be
tested for equality and for range conditions, as first proposed
by Johnson et al. [23]. The key insight is to break addition or
subtraction of long bit strings into operations on arbitrary,
software-defined subfields using logical operations and
a mask register (variable M below, which contains a single
1 bit in the most significant bit of each field to mark
field boundaries) to break the carry bit propagation
between fields. In addition, overflow conditions in each
field are used to determine the result of comparison
operations.
A check for equality in any field of a record stored as
variable-width fields of a bit string may be performed with
an XOR of the field being tested with the comparison
value, yielding a 1 bit for each bit position where the bit
strings are unequal. By adding a mask ~M (= NOT M),
a bit inequality at any of the low-order bit positions can
be propagated to the most significant bit (MSB)
within each field with the expression
(((T XOR ComparisonValue) & ~M) + ~M). By
combining that result with a result for the MSB,
a bit value of 0 in the MSB of each bit field in the expression
(((T XOR ComparisonValue) & ~M) + ~M) OR
(T XOR ComparisonValue) indicates that T = eqVal for
the corresponding bit field.
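To make the expression concrete, a hedged scalar demonstration on a 64-bit word holding four 16-bit fields (the field width and values are illustrative; the same logic applies to 128-bit vector registers):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t M    = 0x8000800080008000ULL; /* MSB of each 16-bit field */
    uint64_t t    = 0x0005000700050009ULL; /* fields: 5, 7, 5, 9 */
    uint64_t eq   = 0x0005000500050005ULL; /* compare each field with 5 */
    uint64_t diff = t ^ eq;                /* 1s where bits differ */
    /* propagate any low-order difference into each field's MSB,
     * then fold in the MSB difference itself */
    uint64_t r = ((((diff & ~M) + ~M) | diff) & M);
    /* prints 0000800000008000: MSB 0 marks the fields equal to 5 */
    printf("%016llx\n", (unsigned long long)r);
    return 0;
}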
A range check with a per-field Boolean result may be
computed by subtracting the lower bound from the
field value, and the field value from the upper bound.
If either subtraction results in an overflow, the
compared field is outside the tested bounds. All fields
of a bit string can be tested simultaneously by breaking
propagation of overflow using Boolean operations.
Johnson et al. [23] give an informal description of the
Figure 5
In-memory database processing of compressed records in data marts. Data marts are created by transforming records into a highly normalized form: individual record fields are distilled into compressed indices into a secondary table. Indices are sufficiently short to be represented by bit fields in a word, enabling large databases to be stored in system memory. Secondary tables are commonly used only during query setup or post-processing.
logic and show that logic minimization results in the
following expression:

C = UpperBound_CAT - LowerBound_CAT
L = (T_CAT | M_CAT) - (LowerBound_CAT & ~M_CAT)
U = (C & ~M_CAT) - (L & ~M_CAT)
R = L XOR U XOR (UpperBound_CAT XOR C)

where M_CAT is a bitmask having a 1 only in the most
significant bit of each field, and LowerBound_CAT and
UpperBound_CAT are the concatenated values of the lower
and upper ranges to compare against the compressed
record bitstring T_CAT consisting of multiple concatenated
columns T_i. The result for the range predicate for each
field i can be determined as the value of the most
significant bit in that field, MSB_i, which is 0 if the record
field is within the specified bounds for the field:
LowerBound_i <= T_i <= UpperBound_i.
While variable-width columnar algebra expressions require several instructions to compute a result, they facilitate a massive compression of records in order to enable data marts to be maintained in memory, and they are more efficient than unpacking the fields for comparison. They also enable the simultaneous evaluation of multiple range predicates on fields compressed into a word.
Because other conditions may all be represented as range conditions, e.g., x > gtVal ⇔ gtVal + 1 ≤ x ≤ MAX, and x = eqVal ⇔ eqVal ≤ x ≤ eqVal, the above equation can also be used to generate code that simultaneously checks an arbitrary logical condition on each field of a data mart record with the code in Listing 3, using the new quadword (128-bit) arithmetic in the vector-scalar instruction set. In this example code for the concurrent evaluation of multiple variable-width bit-field range predicates on a compressed database record using VSX instructions, the database record T_CAT (consisting of multiple concatenated bit fields T) is in register vrt, and the LowerBound_CAT and UpperBound_CAT values (consisting of multiple concatenated variable-width bit fields storing the lower and upper bounds for each record bit field, respectively) are stored in registers vrl and vru. Register vrM contains a mask corresponding to the multiple concatenated bit fields, where the MSB of each bit field is set to 1. The result of the concurrent range predicate evaluation is computed in vector register vr1. For each of the multiple variable-width bit fields, the MSB indicates whether the corresponding field meets the predicate (indicated by a 0 bit).
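Listing 3 itself appears only as a figure and is not reproduced here. As a hedged stand-in, the sketch below shows how rewriting other predicates into range form drives the range_check helper from the earlier sketch; all names are ours, and 8-bit fields in 64-bit words again stand in for the 128-bit VSX registers of the text.

```c
#include <stdint.h>

/* Reuses range_check() and the field-MSB mask M from the earlier
 * sketch; each argument holds the same bound replicated in every
 * 8-bit field ("splatted"). */

#define FIELD_MAX 0x7FULL   /* largest value of a 7-bit payload */

/* x > gtVal  becomes  gtVal + 1 <= x <= MAX. */
static uint64_t field_gt(uint64_t t, uint64_t gtval_splat)
{
    /* Incrementing every field: payloads are below FIELD_MAX here,
     * so the +1 cannot carry across a field boundary. */
    uint64_t lo = gtval_splat + 0x0101010101010101ULL;
    uint64_t up = FIELD_MAX * 0x0101010101010101ULL; /* MAX per field */
    return range_check(t, lo, up);
}

/* x == eqVal  becomes  eqVal <= x <= eqVal. */
static uint64_t field_eq_as_range(uint64_t t, uint64_t eqval_splat)
{
    return range_check(t, eqval_splat, eqval_splat);
}
```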
Vector-scalar software exploitation
A key design goal for the new vector-scalar instructions is consumability and "out-of-the-box" enablement of the Power ecosystem. We achieved this by enabling applications to consist of a mixture of unmodified applications and libraries and of modules exploiting the new vector-scalar functions during system operation, and by enabling developers to recompile applications and libraries to take advantage of the new vector-scalar
functions without requiring code modification during software development.

Listing 3. Concurrent evaluation of multiple variable-width bit-field range predicates on a compressed database record using VSX instructions. Range predicates on multiple concatenated columns T_CAT may be evaluated by operating on the concatenated bounds (concatenating the lower and upper bounds of the range predicates, LowerBound_CAT and UpperBound_CAT).

To achieve this level of
interoperability, the new functions must integrate and interoperate seamlessly with legacy code developed for the existing FPU (floating-point unit) and AltiVec instructions at the application level. In particular, existing code developed, compiled, and shipped (based on the existing FPU and AltiVec functionality and prior to the introduction of the new vector-scalar functions) must be interoperable with newly developed code at the function call level. Legacy applications need to be able to call new system libraries that have been enhanced to exploit the new vector-scalar instructions, and new applications must be able to call libraries developed prior to the introduction of the new vector-scalar instructions. In particular, this excludes expecting any accommodation of vector-scalar software or hardware requirements by code not using vector-scalar functionality. Consequently, ensuring interoperability is the sole responsibility of the hardware definitions and software conventions for the new vector-scalar instructions.
From a hardware perspective, seamless data sharing is accomplished because the new vector-scalar registers are a superset of the established floating-point and vector register files, and the vector-scalar instructions operate on the same data formats as traditional FPU and vector instructions. While FPU, AltiVec, and vector-scalar instructions are controlled by separate machine-state register bits, the Power application binary interface (ABI) specifies that environments that support and enable vector-scalar instructions also support classic FPU and AltiVec instructions. This ensures software interoperability with existing program modules and libraries, and allows functions exploiting the new vector-scalar instructions to be called from, and to call, functions built with the instruction set and ABIs prior to the introduction of the new vector-scalar instructions.

Thus, scalar parameters are passed in vector-scalar registers vs3 to vs10, corresponding to the traditional f3 to f10 parameter registers for scalar floating-point parameters. Similarly, vector parameters are passed in vector-scalar registers vs34 to vs45, corresponding to the traditional vr2 to vr13 parameter registers for parameters of vector type. Likewise, other rules on the preservation of registers by callee functions remain unmodified for state corresponding to floating-point and AltiVec registers. As shown in Figure 6, the newly added register bits 64 to 127 of vector-scalar registers 0 to 31 are volatile, because code compiled with the classic FPU instructions cannot preserve the corresponding "right halves" of these registers.
The integration of vector-scalar instructions with the classic FPU and AltiVec offers additional benefits: programs and header files including legacy FPU and AltiVec inline assembly code (e.g., with the GCC inline assembler command asm) can be compiled to exploit the new vector-scalar instruction set without source code changes. The shared rounding control between FPU and vector-scalar instructions allows code that modifies rounding modes to be recompiled without requiring users to locate and change code that modifies rounding mode controls.
Many existing applications may be recompiled with a compiler enabled to generate code making use of the new vector-scalar instructions and deliver immediate performance improvements "out of the box," without application tuning or source code modification. Optimizing compilers may recompile pre-existing code to take advantage of the larger register set to accommodate memory and pipeline latencies and to increase instruction-level parallelism. Vectorizing compilers may take advantage of the broad new vector instruction repertoire to vectorize applications, accelerating them through data-level parallelism.
Figure 6. ABI conventions governing the software use of the vector-scalar register file. The unified vector-scalar register file enables interoperation between legacy floating-point and AltiVec vector code and new vector-scalar code by providing data sharing within individual functions and across functions. Parameter conventions for vector-scalar code correspond to those of legacy floating-point and AltiVec conventions to enable function interlinkage.

In addition, the Power SIMD programming model based on vector data types and built-in functions is extended in support of the newly introduced data types, adding support for the new 64-bit and 128-bit integer vector data types and the 64-bit double-precision floating-point vector data type with the vector long, vector __int128, and vector double data types in C and C++, respectively. Similarly, the SIMD vector programming environment is adding support for new data types to existing built-in functions, as well as adding built-in functions in support of new instructions.
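As a brief illustration of this programming model, the sketch below computes a fused multiply-add over double-precision arrays with the vector double type. It assumes the vec_xl/vec_xst unaligned load/store and vec_madd built-ins of the Power vector intrinsics (compiled with, e.g., -mcpu=power8); the function name is ours.

```c
#include <altivec.h>

/* c[i] += a[i] * b[i] over double arrays, two elements per 128-bit
 * vector register; vec_xl/vec_xst tolerate unaligned addresses, so no
 * alignment prologue is needed. */
void vec_fma_arrays(double *c, double *a, double *b, long n)
{
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        vector double va = vec_xl(0, a + i);
        vector double vb = vec_xl(0, b + i);
        vector double vc = vec_xl(0, c + i);
        vec_xst(vec_madd(va, vb, vc), 0, c + i);   /* va*vb + vc */
    }
    for (; i < n; i++)          /* scalar cleanup for an odd element */
        c[i] += a[i] * b[i];
}
```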
At IBM, SIMD vector architecture has been developed with a view to efficient exploitation by compilers from its very beginnings [24-29]. In defining the vector-scalar architecture, providing an efficient target architecture for automatic SIMD vectorization was an important goal, to enable the independent software vendor community to take advantage of the new functions and performance enhancements in the Power architecture. Critical architecture decisions, such as support for high-performance unaligned memory access, came directly both from workload analysis and from the constraints of code generation. For example, compilers are able to align individual array elements at their respective natural boundaries, but algorithms may require vector accesses to start at an arbitrary element, making naturally aligned vector accesses impossible to guarantee. This helped motivate the decision to emphasize high-performance unaligned accesses for doubleword-aligned vectors, i.e., vectors starting at any naturally aligned double-precision floating-point array element (in the POWER7 processor, based on the first-generation vector-scalar architecture target of optimizing HPC applications). In addition, we found that handling of alignment constraints led to significant schedule-length increases as well as increased register pressure, as discussed earlier.
Both the proprietary XL C/C++ and XL FORTRAN
compilers, as well as the community-based GCC and
LLVM (formerly, Low Level Virtual Machine) compilers,
have been enabled to take advantage of the new
functionality with support for the new Power SIMD vector
programming API, code generation improvements, and
enhanced autovectorization capabilities.
Simultaneously, IBM has enhanced existing libraries
and middleware to make use of the new Power
architecture facilities. For example, the IBM
Mathematical Acceleration Subsystem (MASS) library
and the IBM Engineering and Scientific Subroutine
Library (ESSL) have been enhanced with processor-tuned
versions for the POWER7 and POWER8 processors
taking advantage of the new vector-scalar functions [30].
Finally, IBM continues to work with the software
developer community to enable the Power applications
and middleware portfolio to take advantage of the new
facilities.
In-core vector acceleration and heterogeneous
accelerators
While heterogeneous accelerators, external to a processor core, can offer significant benefit in terms of optimizing execution units for a specific workload, they also introduce significant communication latency and synchronization overhead [31]. Thus, the relative efficiency of external accelerators depends significantly on workload characteristics, such as problem size and granularity.
Figure 7, adapted from Salapura et al. [32], explores the speedup obtainable for a set of workload acceleration options, such as inline SIMD vector processing and external accelerators at the memory-bus, I/O-bus, and network level. Salapura et al. compare acceleration choices as a function of data granularity and attachment point for a variety of implementation choices. Thus, while out-of-core network-attached, I/O-bus-attached, and memory-bus-attached accelerators offer the ability to offload operations on large data streams, in-core instruction-set support achieves high coding performance at much smaller data granularity. Similarly, significant transfer costs may be incurred when the working set size exceeds the accelerator memory. Datta et al. [33] estimate that this penalty represents a factor of about 24 compared to the use of shared memory between the central processor and an attached accelerator. Thus, while numeric application accelerators have shown impressive peak speeds, PCI (Peripheral Component Interconnect)-based accelerators suffer from high transfer overhead both for small working-set sizes and for large working sets that exceed the accelerator memory.
Reflecting the importance of workload acceleration across a broad spectrum of workload characteristics, the POWER8 processor includes support both for in-line SIMD vector acceleration and for a new modular accelerator interface, the coherent accelerator-processor interface (CAPI) [34]. By relying on coherent integration of accelerators into the system architecture [8, 20], such as with the Coherent Accelerator Processor Interface introduced with POWER8, system architects can reduce the overhead of accelerator access and make accelerators attractive for a broader range of applications.
Conclusion
The vector-scalar instruction set integrates the previously independent floating-point and AltiVec architectures in a common vector-scalar facility. Integrating the floating-point and vector designs, while maintaining compatibility with legacy workloads, is a key enabler both for design efficiency and for improved flexibility as a target for parallelizing compilers. The vector-scalar facility builds on the previous designs and expands them into a new high-performance enterprise architecture optimized for the acceleration of floating-point, integer, and structured and unstructured text workloads, ranging from engineering and scientific functions to big data and business analytics workloads.
Acknowledgments
Many people worked on the features of the successive
IBM SIMD vector design generations described in this
paper. We acknowledge their efforts in developing these
innovative features that enable efficient and flexible
computing, and we thank all of these IBM technologists
for their dedication to the evolving Power Architecture.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Freescale Semiconductor, Inc., Apple, Inc., Sony Computer Entertainment Corporation, Microsoft Corporation, or Linus Torvalds in the United States, other countries, or both.
References
1. "IBM POWER7 technology and systems," IBM J. Res. & Dev., vol. 55, no. 3, May/Jun. 2011.
2. K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, "AltiVec extension to PowerPC accelerates media processing," IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar./Apr. 2000.
3. R. R. Oehler and R. D. Groves, "IBM RISC System/6000 processor," IBM J. Res. & Dev., vol. 34, no. 1, pp. 1-23, Jan. 1990.
4. J. M. Borkenhagen, G. H. Handlogten, J. D. Irish, and S. B. Levenstein, "AS/400 64-bit PowerPC-compatible processor implementation," in Proc. IEEE Int. Conf. Comput. Des., 1994, pp. 192-196.
5. J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel, "A multithreaded PowerPC processor for commercial servers," IBM J. Res. & Dev., vol. 44, no. 6, pp. 1-11, Nov. 2000.
6. J. Andrews and N. Baker, "Xbox 360 system architecture," IEEE Micro, vol. 26, no. 2, pp. 25-37, Mar./Apr. 2006.
7. M. Gschwind, R. Montoye, B. Olsson, and J.-D. Wellman, "Implementing instruction set architectures with non-contiguous register file specifiers," US Patent 7,421,566, Aug. 12, 2005.
8. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM J. Res. & Dev., vol. 49, no. 4/5, pp. 589-604, 2005.
9. M. Gschwind, H. P. Hofstee, B. K. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki, "Synergistic processing in Cell's multicore architecture," IEEE Micro, vol. 26, no. 2, pp. 10-24, Mar./Apr. 2006.
10. M. Gschwind, H. P. Hofstee, B. K. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki, "A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor," Hot Chips 17, Aug. 2005. [Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf
Figure 7. Speedup obtainable for different types of accelerators, based on Salapura et al. [32]. This figure explores the achievable speedup for a variety of stream coding acceleration architectures as a function of data granularity, based on in-core and attached stream coding accelerator architectures. While attached accelerators enable the offloading of data coding, in-core stream coding accelerators achieve excellent speedups even at small data granularity.
11. D. Citron, H. Inoue, T. Moriyama, M. Kawahito, H. Komatsu, and T. Nakatani, "Exploiting the AltiVec unit for commercial applications," in Proc. 9th Workshop Comput. Archit. Eval. Commercial Workloads, Austin, TX, USA, 2006, pp. 1-7.
12. L. Eisen, J. J. W. Ward, III, H. Tast, N. Mäding, J. Leenstra, S. M. Mueller, C. Jacobi, J. Preiss, E. M. Schwarz, and S. R. Carlough, "IBM POWER6 accelerators: VMX and DFU," IBM J. Res. & Dev., vol. 51, no. 7, pp. 663-684, 2007.
13. "Hybrid computing systems," IBM J. Res. & Dev., vol. 53, no. 5, pp. 1-2, Sep. 2009.
14. M. Gschwind, F. Gustavson, and J. F. Prins, "High performance computing with the Cell Broadband Engine," Sci. Program., vol. 17, no. 1/2, pp. 1-2, 2009.
15. D. Grice, H. Brandt, C. Wright, P. McCarthy, A. Emerich, T. Schimke, C. Archer, J. Carey, P. Sanders, J. A. Fritzjunker, S. Lewis, and P. Germann, "Breaking the petaflops barrier," IBM J. Res. & Dev., vol. 53, no. 5, pp. 1:1-1:16, Sep. 2009.
16. The TOP500 List, Nov. 2008. [Online]. Available: http://www.top500.org/lists/2008/11/
17. The Green500 List, Jun. 2008. [Online]. Available: http://www.green500.org/lists/green200806
18. V. Salapura, R. Bickford, M. A. Blumrich, A. A. Bright, D. Chen, P. Coteus, A. Gara, M. Giampapa, M. Gschwind, M. Gupta, S. Hall, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, M. Ohmacht, R. A. Rand, T. Takken, and P. Vranas, "Power and performance optimization at the system level," in Proc. Conf. Comput. Frontiers, 2005, pp. 125-132.
19. V. Salapura, R. Walkup, and A. Gara, "Exploiting workload parallelism for performance and power optimization in Blue Gene," IEEE Micro, vol. 26, no. 5, pp. 67-81, Sep./Oct. 2006.
20. B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen, B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky, J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis, and K. M. Fernsler, "IBM POWER8 processor core microarchitecture," IBM J. Res. & Dev., vol. 59, no. 1, Paper 2, pp. 2:1-2:21, 2015.
21. M. Gschwind and B. Olsson, "Multi-addressable register file," US Patent 7,877,582, Jan. 31, 2008.
22. IEEE Standard for Floating-Point Arithmetic, IEEE 754-2008, Aug. 2008.
23. R. Johnson, V. Raman, R. Sidle, and G. Swart, "Row-wise parallel predicate evaluation," Proc. VLDB Endow. (PVLDB), 2008, pp. 622-634.
24. A. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD architectures with alignment constraints," in Proc. ACM SIGPLAN Conf. Program. Language Des. Implement. (ACM SIGPLAN Notices, vol. 39, no. 6), pp. 82-93, May 2004.
25. P. Wu, A. Eichenberger, and A. Wang, "Efficient SIMD code generation for runtime alignment and length conversion," in Proc. IEEE Int. Symp. CGO, 2005, pp. 153-164.
26. A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo, "Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture," IBM Syst. J., vol. 45, no. 1, pp. 59-84, 2006.
27. D. Nuzman and R. Henderson, "Multi-platform auto-vectorization," in Proc. 4th Annu. Int. Symp. CGO, Mar. 2006, pp. 281-294.
28. M. Gschwind, "Method and apparatus for generating data parallel select operations in a pervasively data parallel system," US Patent 8,201,159, Aug. 4, 2006.
29. M. Gschwind, D. Erb, S. Manning, and M. Nutter, "An open source environment for Cell Broadband Engine system software," IEEE Comput., vol. 40, no. 6, pp. 37-47, Jun. 2007.
30. A. Eichenberger, M. Gschwind, J. Gunnels, and V. Salapura, "Matrix multiplication operations using pair-wise load and splat operations," US Patent Appl. 2012/0011348, Jan. 12, 2010.
31. M. Gschwind, "Optimizing data sharing and address translation for the Cell BE heterogeneous chip multiprocessor," in Proc. IEEE 26th Int. Conf. Comput. Des., Lake Tahoe, CA, USA, 2008, pp. 478-485.
32. V. Salapura, T. Karkhanis, P. Nagpurkar, and J. E. Moreira, "Accelerating business analytics applications," in Proc. Conf. High Performance Comput. Archit., 2012, pp. 413-422.
33. K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures," in Proc. ACM/IEEE Conf. Supercomput., Austin, TX, USA, 2008, pp. 1-12.
34. J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, "CAPI: A coherent accelerator processor interface," IBM J. Res. & Dev., vol. 59, no. 1, pp. 7:1-7:7, 2015.
Received January 14, 2015; accepted for publication
February 25, 2015
Michael Gschwind IBM Systems, Poughkeepsie, NY 12601 (mkg@us.ibm.com). Dr. Gschwind leads mainframe and Power architecture development in the IBM Systems division, where he is a Senior Technical Staff Member and Senior Manager of the Systems Architecture team. Dr. Gschwind defined the vector-scalar architecture as the architecture lead for the PERCS (productive, easy-to-use, reliable computing system) project defining the future POWER7 processor, and continues to lead its development as IBM Systems Architecture Chief Architect. Prior to his current role, he was Floating-Point Chief Architect and Unit Lead, and Technical Lead for core reliability for Blue Gene*/Q. In addition to his hardware development roles, Dr. Gschwind has also held key software leadership roles; he developed the first Cell compiler and served as technical lead for the Cell software environment. Most recently, he also served as Chief Architect for the OpenPOWER software environment, where he led the definition of ABIs and APIs. Dr. Gschwind received his Ph.D. degree from Technische Universität Wien in Vienna, Austria, in 1996. He has published numerous articles and received over 100 patents in the area of computer architecture. In 2006, Dr. Gschwind was recognized as IT Innovator and Influencer by ComputerWeek. He is a member of the ACM (Association for Computing Machinery) SIGMICRO (Special Interest Group on Microarchitecture) Executive Board, a Member of the IBM Academy of Technology, an IBM Master Inventor, an ACM Distinguished Speaker, and an IEEE (Institute of Electrical and Electronics Engineers) Fellow.