IBM POWER9 system software

The IBM POWER9 architecture offers a substantial set of novel and performance-improvement features that are made available to both scale-up and scale-out applications via system software. These features provide significant performance improvements for cognitive, cloud, and virtualization workloads, many of which use dynamic scripting languages. In this paper, we describe some of the key features.
J. Jann, P. Mackerras, J. Ludden, M. Gschwind, W. Ouren, S. Jacobs, B. F. Veale, D. Edelsohn
Introduction
Since the introduction of digital computers, system software
has played the role of a gearbox between the system
hardware and the use of the hardware by the applications. In
addition, in hardware systems that concurrently execute two
or more applications, the system software maintains
application isolation and provides fair and efficient resource
allocation using policies set by the administrator, the
operating system (OS), and the hypervisor. When a system
problem occurs, the system software will attempt to save
the system state, perform failure analysis, and attempt
recovery.
To better serve important and essential applications,
computer hardware continues to evolve to serve changing
deployment preferences, application logic, and
languages; and system software must change in turn to enable
and facilitate these hardware changes. To this end, for this
generation of IBM POWER9 hardware, the POWER9
system software has introduced new methodologies to
efficiently and effectively serve these new application
behaviors, while considering the growing presence of
Linux, cognitive computing, cluster computing, clouds, and
higher level languages such as the modern dynamic
scripting languages.
In this paper, we describe the novel POWER9
architectural enhancements made accessible to applications
via system software, including radix tree address
translation; interrupt architecture and routing
improvements; GZIP acceleration and nest accelerator
architecture; system capacities, scalability, and
system-software compatibility mode; and POWER9
new architectural features—enhancing performance and
its software ecosystem.
Radix tree address translation in POWER9
The POWER9 processor is the first processor to implement
the new OpenPOWER Power ISA Version 3.0B
instruction set architecture [1], which provides both the traditional
POWER Hashed Page Table (HPT) supported by previous
generations of POWER processors, and the newly
architected radix tree address translation modes [2, 3].
Radix tree address translation offers improved performance
for most Linux workloads, while HPT address translation
supports IBM’s proprietary operating systems (OSs) AIX
and IBM i, as well as Linux.
The HPT has the advantage that a translation can be
performed (i.e., a translation lookaside buffer [TLB] miss
can be serviced) by reading up to two cache lines from
memory [4]. This characteristic should enable the HPT to
provide good performance for applications with very large
memory footprints and low locality of references (i.e., with
essentially random or quasi-random access patterns).
The disadvantage of the HPT structure is that it does not
cache well. The hashing algorithm used in the POWER
Memory Management Unit (MMU) tends to put each page
table entry (PTE) for a process into a separate cache line.
Thus, most TLB misses cause a cache miss and need to read
main memory, particularly in processes with large working
sets.
Radix tree translation, as defined in Power ISA 3.0B,
includes support for virtualization. Each virtual
machine (VM) (i.e., each logical partition [LPAR] or
Kernel Virtual Machine [KVM] guest) has a “host” radix
tree in hypervisor memory, which translates the VM’s view
of its physical address space into the real physical address
space of the machine. This radix tree is called the partition-
scoped tree. It is managed by the hardware, and it translates
guest real addresses into host real addresses. In addition, the
OS kernel inside the VM creates and manages one radix
tree for its own use, plus one radix tree for each process.
These “guest” radix trees are called “process-scoped” trees,
and their entries contain guest real addresses. These two
levels of translation scope significantly benefit cloud
virtualization.
Translating an effective address (EA) generated by a
program running inside a VM conceptually involves two
steps: (1) translating the EA to a guest real address using a
process-scoped radix tree, followed by (2) translating that
guest real address to a physical address using the VM’s
partition-scoped tree. However, because the process-scoped
tree is expressed in terms of guest real addresses, this
process can involve up to 24 reads from memory (i.e., 5
lookups of the partition-scoped tree, each requiring 4 reads,
plus 4 reads to the process-scoped tree). Thus, it is essential
to use an effective caching strategy to avoid having to do 24
memory reads for every TLB miss. The POWER9
processor uses a page-walk cache (PWC) similar to that
described by Barr et al. [5]. Using the PWC, the average
number of memory accesses required to service a TLB miss
is expected to be less than 1.
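
To make the arithmetic concrete, the following C fragment (an illustrative sketch, not code from any POWER9 component) computes this worst-case read count for a nested walk with g process-scoped levels and h partition-scoped levels:

    /* Every guest-real pointer in the process-scoped walk -- the g
     * node addresses plus the final guest real address -- must itself
     * be translated by the h-level partition-scoped tree, giving
     * (g + 1) partition-scoped lookups of h reads each, plus the g
     * reads of the process-scoped nodes themselves. */
    static unsigned nested_walk_reads(unsigned g, unsigned h)
    {
        return (g + 1) * h + g;   /* g = h = 4 yields 5*4 + 4 = 24 */
    }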
Because radix page tables place information about
adjacent addresses into adjacent doublewords in memory, a
program with high locality of references in its address
access pattern will also have high locality of references in
the pattern of accesses performed by the MMU to service
TLB misses. Thus, the CPU caches work efficiently to
cache the page-table entries (PTEs), and so radix tree
translation is expected to be more efficient than HPT
translation for workloads with high locality of references.
All POWER OSs—IBM AIX, Linux, and IBM i—support
the HPT format, as the HPT format provides good
performance for traditional POWER applications such as
mission-critical enterprise online transaction processing
(OLTP) workloads. Many Linux applications, such as most
artificial intelligence (AI) and high-performance computing
(HPC) applications, exhibit high locality of references in
their data sets. Consequently, Linux OSs are introducing
support for the new radix page table formats to provide
better performance on these emerging workloads.
The Linux kernel manages memory using a radix tree
structure to store per-page information for each process.
This data structure is used and manipulated by architecture-
independent code, but each architecture can define the
layout in memory of the tables at each level of the tree. For
the POWER architecture, the layout of the tree has recently
been modified to match the format expected by the MMU
hardware, including storing the entries in big-endian
format, even if the OS is using the little-endian ABI. This
makes it possible to have the MMU hardware access the
Linux page tables directly.
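
The endianness convention can be illustrated with a minimal user-space sketch (hypothetical code, not taken from the kernel): a radix page-table entry is always stored big-endian, so a store on a little-endian kernel must byte-swap.

    #include <endian.h>
    #include <stdint.h>

    /* Store a radix page-table entry in the byte order the MMU
     * expects. htobe64() is a no-op when the CPU runs big-endian
     * and a byte swap when it runs little-endian, so the in-memory
     * table is identical either way. */
    static void radix_pte_store(uint64_t *slot, uint64_t pte)
    {
        *slot = htobe64(pte);
    }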
Providing a common format for page translation by
hardware and software provides more efficient memory
management for Linux systems. Not only is system memory
usage reduced by storing only a single representation of the
address space, but also removing the need to translate
between software and hardware address translation formats
offers significant path-length reduction in the kernel,
reducing overall time spent servicing page faults. In
addition, guest OSs avoid the need to use hypercalls to
update their page tables, which they are required to do when
using HPT translation. (Because the HPT contains physical
addresses, it is managed by the hypervisor and is not
directly accessible to the OS.)
Power ISA 3.0B defines additional structures
in memory that allow any virtual address to be translated,
where a virtual address is defined by an effective address, a
process ID (PID), and a logical partition ID (LPID). This
allows off-core resources such as accelerators to work
directly with effective addresses relating to user-mode
processes. The top-level structure is the partition table,
which is an array indexed by LPID. Each entry in the
partition table corresponds to a VM, except for entry 0,
which is reserved to describe the translation environment
for the hypervisor. Each entry indicates whether the
corresponding partition uses HPT or radix translation, and
each entry contains two pointers, one pointing either to the
HPT or to the partition-scoped radix tree for the partition,
and the other pointing to the process table for the partition.
A hypervisor-privileged special-purpose register (SPR)
called PTCR points to the base of the partition table, and
another called LPIDR contains the currently selected
LPID value.
The process table is an array in guest memory that is
indexed by PID. For radix translation, each entry points to
the process-scoped radix tree for the corresponding process.
For HPT translation, each entry points to a segment table
describing effective-to-virtual translations used in the HPT
translation process (see [1] for details). The current PID
value is stored in the PIDR SPR. Note that the PID as used
here (referred to as effective PID, and effPID for short) is
not necessarily identical to the OS’s notion of process ID; it
generally corresponds more closely to the notion of an
address-space ID.
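
The resulting lookup chain can be sketched in C as follows. The field layouts and masks are illustrative assumptions only; the precise bit assignments of partition-table and process-table entries are defined in [1].

    #include <stdint.h>

    /* Both tables hold two doublewords per entry. For radix, a
     * partition-table entry locates the partition-scoped tree and
     * the process table; a process-table entry locates the
     * process-scoped tree. (Illustrative layout, not the ISA's.) */
    struct patb_entry { uint64_t dw0, dw1; };  /* indexed by LPID */
    struct prtb_entry { uint64_t dw0, dw1; };  /* indexed by PID  */

    /* PTCR points at the partition table; LPIDR and PIDR select
     * the entries. The mask below is a placeholder. */
    static struct prtb_entry *
    process_entry(struct patb_entry *patb, uint32_t lpid, uint32_t pid)
    {
        uint64_t prtb_base = patb[lpid].dw1 & ~0xfffULL;
        return (struct prtb_entry *)(uintptr_t)prtb_base + pid;
    }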
The left half of Figure 1 presents a conceptual view of
guest real memory. Each guest application (PID > 0) and
the OS kernel itself (PID = 0) is translated by a
process-scoped radix tree (shown in red). Each guest
effective address (gEA) is translated via this radix tree
into a guest real address (gRA). The right half of Figure 1
illustrates how host real memory can be allocated for a
hypervisor and several guests in a radix-based system. The
hypervisor region is mapped by a blue partition-scoped
radix tree pointed to by the partition table entry indexed by
LPID = 0. Each guest region of host real memory is mapped
by an additional blue partition-scoped radix tree pointed to
by the partition table entry indexed by its LPID value. As a
result, each guest real address created by the
VM is translated a second time by the blue partition-scoped
radix tree based on its LPID value.
Finally, to enable efficient transition between application
and OS, and OS and hypervisor, the radix page table mode
divides the 64-bit effective address space into four
“quadrants” and uses different radix trees in each quadrant.
The most significant 2 bits of the address select the
quadrant. Generally, quadrant 0 (addresses starting at 0) is
used for application processes and quadrant 3 (addresses
starting at 0xC000_0000_0000_0000) is used for the OS
kernel.
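
Quadrant selection itself is trivial, which is part of its appeal; a sketch:

    #include <stdint.h>

    /* The quadrant is the top two bits of the 64-bit effective
     * address: addresses from 0 fall in quadrant 0 (applications);
     * 0xC000_0000_0000_0000 and up fall in quadrant 3 (OS kernel). */
    static unsigned quadrant_of(uint64_t ea)
    {
        return (unsigned)(ea >> 62);
    }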
Figure 2 illustrates which values of PID and LPID are
used, depending on the privilege state of the processor and
the quadrant being accessed. The areas with the same color
are meant to illustrate how the same translation path can be
accessed by different quadrants running in different
privilege states. Note that in the “Guest” column, the page
protection mechanism is commonly used to prevent
applications from accessing the kernel (OS) address region.
The entries labeled “Segment Interrupt” identify the
quadrants for which access is not permitted, and attempted
access will result in an exception. The labels “HV” and
“PR” are bits in the Machine State Register (MSR) that
correspond to hypervisor state and problem state,
respectively.
When translating addresses in quadrant 0, the MMU uses
the process-scoped tree corresponding to the PID value in
the PIDR register, whereas for quadrant 3, it uses the
process-scoped tree corresponding to PID 0. That is, the
first entry in the process table describes the OS kernel’s own
address space, which has a separate radix tree of its own,
distinct from those used to describe application process
address spaces.
Quadrants 0 and 3 are used in this way both inside a VM
and in a host OS such as Linux running non-virtualized (i.e.,
without an underlying hypervisor). A non-virtualized Linux
system has the kernel running in hypervisor mode, with full
access to the machine’s hardware. Such a system can run
one or more VMs using the KVM facilities in the Linux
kernel. In this configuration, the non-virtualized Linux
system is referred to as the host and the VMs as the guests.
To facilitate transitions from the guest to the host and
vice-versa, the MMU uses an effective LPID value of 0 for
all accesses to quadrants 0 and 3 while the machine is in
hypervisor mode [3, 6]. This allows the hypervisor to load a
guest LPID value into LPIDR without needing to first go
into real mode (address translation disabled). In addition,
the host's process-scoped radix trees contain physical
addresses (i.e., the host uses a one-step translation process,
without involving a partition-scoped radix tree).

Figure 1: Nested radix memory map.
When the CPU is in hypervisor mode and using radix tree
translation, the MMU provides access to the virtual address
space of the current guest VM (the one selected by the
LPID value in the LPIDR register) using quadrants 1 and 2.
Accesses to quadrant 1 access the application space of the
current guest; that is, they use the process-scoped tree
selected by the values in LPIDR and PIDR, combined with
guest-real to physical translation specified by the guest’s
partition-scoped radix tree. Similarly, accesses to quadrant
2 access the kernel space of the current guest, as for
quadrant 1 but using a PID value of 0. This facility is
particularly useful when KVM needs to fetch an instruction
from the guest to emulate it in software, as happens, for
example, when the guest accesses an emulated I/O device
(one that does not physically exist and instead is made
to appear to exist to the guest by software emulation in
the host).
The POWER architecture-specific part of the KVM code
has recently been extended to support the POWER9
processor. This code can be found in the Linux kernel
source repository [7] in the arch/powerpc/kvm directory.
The host can run with either HPT or radix translation,
selected by a boot-time parameter. At the time of this
writing, KVM requires the guest OS to use the same
translation mode as the host. In the future, this restriction is
expected to be removed, to the extent that a host using radix
mode will be able to run a guest using HPT translation.
POWER9 interrupt architecture and routing
improvements
The POWER9 processor introduces changes in the interrupt
architecture to improve interrupt routing and the
presentation of events (Figure 3). Prior to POWER9, when
presenting interrupts, the hardware was not aware of which
OS thread was running on a given hardware thread. For
shared processor partitions, this resulted in interrupts being
distributed across all the processors in the shared processor
pool. Therefore, OSs would receive interrupts that are not
destined for them (also known as phantom interrupts).
Prior to POWER9, OS control and handling of interrupts
required hypervisor calls on virtualized systems. This
included determining the reason for an interrupt, changing a
thread’s interrupt acceptance priority, signaling an end of
interrupt (EOI), and enabling/disabling interrupts.
Hypervisor calls were also required for triggering inter-
processor interrupts (IPIs). Requiring hypervisor calls
introduces additional path-length, hence increasing overall
interrupt latency. The interrupt architecture in the POWER9
processor improves the overall system performance and
scalability.
With the POWER9 processor, an OS can have direct
control over its interrupt sources. Interrupt sources can
originate from hardware I/O devices, virtual devices via the
hypervisor, and OS triggered IPIs. The OS can set up an
interrupt source to target a specific OS thread or a group of
threads. Interrupt sources can be enabled/disabled, and IPIs
can be triggered directly by the OS without hypervisor
involvement, reducing path-length. Interrupt sources owned
by the hypervisor are presented directly to the hypervisor
without any OS-induced latency or OS involvement.

Figure 2: Nested radix translation quadrants (LPIDR ≠ 0). The quadrants apply only to radix-on-radix translation when address translation is on. Color coding in this table illustrates where the same translation path is used.
When an interrupt source is triggered, the POWER9
interrupt hardware inspects interrupt routing control
structures in system memory to determine the target for that
interrupt source. Following these control structures, the
hardware places the associated event on the appropriate
OS-controlled event queue (EQ).
After the interrupt hardware places the event in the EQ, it
then determines whether an interrupt is required for the
event. The OS can configure the interrupt routing control
structures to present a single interrupt for multiple events. If
an interrupt is required, the interrupt hardware determines
whether there is a hardware thread that can accept the
interrupt. A thread context associated with each hardware
thread contains information about which OS thread is
currently executing on the hardware thread. The thread
context also contains the thread’s interrupt-acceptance
priority, which is controlled directly by the OS, without
hypervisor involvement.
The interrupt is presented to a hardware thread if its
thread context matches the specific target OS thread or a
member of a group of target threads, and is allowed by the
interrupt acceptance priority. For interrupts targeted to a
group of threads, any thread that is a member of the group
can be chosen. The interrupt hardware maintains historical
aging information to evenly distribute interrupts amongst
the members of the group [8].
A hypervisor interrupt is triggered if no matching thread
context is found [9]. Once the hypervisor receives the
interrupt, it determines which OS thread needs to be
dispatched (assigned to a hardware thread) to receive the
interrupt. Upon dispatch, the interrupt hardware matches on
the thread context, and the interrupt will be presented to the
OS thread.
Once the interrupt is presented to the OS, no hypervisor
involvement is required to fully service the associated
event. The OS directly removes events from the EQ,
services the events, and issues EOIs. Servicing multiple
events under a single interrupt reduces the interruptions to
the OS and the overall system interrupt traffic. This
interrupt architecture is described in detail in another paper
in this issue [10].
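
A sketch of the resulting "drain the queue, then signal end of interrupt" pattern follows. The queue-entry format and the EOI hook shown here are hypothetical placeholders for illustration; the actual event-queue layout and presentation interfaces are specified by the XIVE architecture [10].

    #include <stdint.h>

    struct event_queue {
        volatile uint32_t *ring;  /* EQ in system memory */
        uint32_t head, nr_entries;
        uint32_t gen;             /* expected generation bit */
    };

    extern void service_event(uint32_t data);  /* OS-specific handler */
    extern void send_eoi(void);                /* placeholder EOI hook */

    /* One interrupt may cover many queued events: drain everything
     * that is valid, then issue a single end-of-interrupt. */
    static void handle_eq_interrupt(struct event_queue *q)
    {
        while ((q->ring[q->head] >> 31) == q->gen) {
            service_event(q->ring[q->head] & 0x7fffffff);
            if (++q->head == q->nr_entries) {
                q->head = 0;
                q->gen ^= 1;   /* generation flips on each wrap */
            }
        }
        send_eoi();
    }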
POWER9 Gzip acceleration and nest accelerator
architecture
The IBM POWER7+ processor introduced the nest
accelerator (NX) unit, which allows for on-chip
acceleration of compression- and encryption-related
operations. With the POWER9 processor, the NX unit now
includes a Gzip accelerator engine. The Gzip accelerator
engine features industry-standard RFC 1951 (deflate)
compliance and RFC 1950 (zlib) and RFC 1952 (Gzip) file
format support, as well as fixed and dynamic Huffman-table
support and assist for dynamic Huffman-table creation [11].

Figure 3: Benefits of the IBM POWER9 interrupt architecture.
Prior to POWER9, each OS interaction with the NX unit
on virtualized systems required the OS kernel to execute a
hypervisor call. The hypervisor performed necessary
address translation and validation and then issued the
accelerator operation on behalf of the OS. Though effective,
accessing the NX unit through the hypervisor incurs a
performance penalty due to the extra path length required
by the hypervisor call.
The POWER9 processor includes hardware support to
allow OS user-level software access to the NX unit. The
NX-unit accelerators are invoked using the copy-paste
facility [1] through the virtual accelerator switchboard
(VAS). Address translations are handled by the nest MMU
(NMMU). For each NX-accelerator type, there is an
initialization phase that is set up by the OS with hypervisor
calls. After the initialization phase, user-mode memory
accesses to an NX-accelerator are directed to the NX unit
[11], bypassing the hypervisor, hence decreasing the overall
time required to complete the operations.
In more detail, as part of the initial system boot, the
hypervisor configures VAS receive-window contexts for
each of the accelerator types that will be accessed directly
via user-level processes. The VAS receive-window contexts
point to the request FIFOs in system memory. When an OS
wants to allow a user-level process access to an NX-unit
accelerator, it initially sets up a VAS send-window context
via hypervisor calls. The hypervisor controls direct access
to the NX unit via the send-windows. The number of
requests that can be simultaneously submitted is also
controlled by the hypervisor through a hardware credit-
based system. Once the appropriate VAS windows have
been configured, an authorized channel for the OS to the
requested NX unit accelerator is established.
The OS user-level process is then able to configure an
accelerator request structure in its process memory and use
the copy-paste facility to copy the contents of that request to
the accelerator’s receive FIFO. The NX-unit accelerator
receives notification of the request and pulls it from the
FIFO to be processed. During processing, the accelerator
uses the NMMU for address translation services. When the
operation is complete, the process is notified via an
interrupt, or it detects completion via polling, as configured
by the process.
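
Applications typically reach the Gzip engine through a zlib-compatible library rather than programming the copy-paste facility by hand. The sequence below is ordinary zlib code with nothing POWER9-specific in it; the point is that an NX-enabled zlib implementation can service exactly this call pattern with the accelerator, transparently to the application.

    #include <string.h>
    #include <zlib.h>

    /* Compress src into dst with the standard zlib deflate API
     * (RFC 1950/1951 framing). Returns 0 on success. */
    static int compress_buffer(const unsigned char *src, uInt src_len,
                               unsigned char *dst, uInt *dst_len)
    {
        z_stream zs;
        int rc;

        memset(&zs, 0, sizeof zs);
        if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
            return -1;

        zs.next_in   = (unsigned char *)src;
        zs.avail_in  = src_len;
        zs.next_out  = dst;
        zs.avail_out = *dst_len;

        rc = deflate(&zs, Z_FINISH);   /* single-shot compression */
        *dst_len = zs.total_out;
        deflateEnd(&zs);
        return rc == Z_STREAM_END ? 0 : -1;
    }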
System capacities, scalability, and system-software compatibility-mode architecture
Chip, core, and thread capacities
POWER9 offers two chip versions aimed at better serving
both scale-up and scale-out system designs that are
optimized for emerging analytics, AI, cognitive workloads,
HPC, cloud, hyperscale data centers, and enterprise
workloads [12].
The scale-out version of the POWER9 chip targets
smaller symmetric multiprocessing (SMP) configurations
supporting up to four sockets in a system. This chip
contains up to 24 cores, where each contains four hardware
threads. This scale-out chip supports industry-standard
DDR4 memory DIMMs [13].
The scale-up version of the POWER9 chip is optimized
for larger multi-socket SMP configurations. This scale-up
chip contains up to 12 cores, each with eight hardware-
threads per core [12, 13]. Like POWER8, the POWER9
processors support up to four different simultaneous
multithreading (SMT) modes: ST/SMT1 with one thread,
SMT2 with two threads, SMT4 with four threads, and
SMT8 with eight threads; SMT8 is available in the scale-up
chip only [12–14]. The SMT8 core provides higher per-core
capacity and can reduce per-core software licensing. A
POWER9 system supports assigning all its hardware
threads (1,536 or more) to a single LPAR or distributing
them across many LPARs. Table 1 provides a history of the
SMT modes supported by various generations of the
POWER processor.
POWER9 memory capacity
The scale-up version of the POWER9 chip supports
extreme memory capacity with support for up to 8 TB
(terabytes) of memory per socket [12, 13]. System software
may grow to support an increasing amount of system-wide
physical memory, e.g., 64 TB or more. In a virtualized
system, besides a small amount of physical memory
consumed by the hypervisor, all the remaining physical
memory can be assigned to the partitions or VMs (i.e., from
1 to 1,000 or more LPARs or guest VMs).
Compatibility-mode architecture
The compatibility-mode architecture defines what
instructions and system features are available to OSs and
user applications, based on processor architecture. By
providing compatibility modes for software designed for
older generations of the Power ISA (instruction set
architecture), and the ability to limit what system features
are available, this compatibility architecture allows OSs
that do not support the newer processor architectures and
system features to run without risk on the newer POWER
Architecture systems.

Table 1: Supported SMT modes by POWER processor architecture generations.
The architecture also allows live migrations of LPARs
running older OS versions to POWER9 systems, as well as
for live migration of LPARs back and forth between
different generations of systems, e.g., from IBM POWER7
to POWER9, and then back to POWER7 at a later time. In
this scenario, the LPAR must run in the lowest
compatibility mode supported by all the systems used for
running the LPAR (i.e., in the above example, the LPAR
must be run in POWER7-compatibility mode).
POWER9 extends the compatibility-mode architecture
utilized for previous POWER-based platforms with the
addition of a new POWER9 mode that allows for the
exploitation of new POWER9 processor and platform
features, as well as features introduced in previous POWER
architectures [14]. POWER9 mode supports new
instructions for the VSX category [15], string, video encode,
quad floating-point, atomic memory operations, and user-
mode accelerator access. Hypervisors typically limit other
POWER9 platform features based on the compatibility
mode, such as direct control over interrupt sources as
described previously in the section “POWER9 interrupt
architecture and routing improvements.” The POWER9
processor can be run in POWER6, POWER7,
POWER8, or POWER9 compatibility mode, whereas
PowerVM will support the new POWER9 mode and the
POWER7 and POWER8 compatibility modes.
POWER9 new architectural features: Enhancing
performance and its software ecosystem
The POWER9 architecture adds many new features that
greatly enhance the performance of important components
of the software ecosystem for IBM Power Systems. The
features are utilized by the latest compilers and libraries
developed for POWER9 systems. Ten of them are described
in this section.
Branch prediction
The POWER9 processor design greatly enhances branch
prediction capabilities. Branch mispredictions are very
detrimental to performance, as a pipeline flush wastes the
work that has been expended on instructions in flight and
requires an expensive and time-consuming fetch of
instructions from a completely different stream. The
advanced branch predictor in POWER9 significantly
increases branch prediction accuracy to avoid instruction
pipeline hiccups, allowing the processor’s computational
resources to produce faster results for the customer
workloads.
Pipeline scheduling
The POWER9 pipeline is shorter and more adaptive. The
fetch-to-compute portion of the pipeline has been compressed
by five stages, which allows more rapid production of results
from unexpected redirections of the instruction stream. The
processor deploys instructions to the function units more
flexibly, to allow more optimal utilization of resources. These
features combine to allow the processor to perform optimally
on a wider variety of instruction streams that have not been
pre-generated with the ideal optimization for the POWER9
processor. This greatly enhances the "out-of-the-box"
performance of applications.
IEEE 128-bit floating point
The IBM POWER9 processor provides a third binary
floating-point format in addition to the standard 32-bit and
64-bit floating-point types. This format type implements the
128-bit binary floating-point format of the IEEE 754-2008
standard for floating point [16], consisting of 1 bit for the
sign, 15 bits for the exponent, and 113 bits for the mantissa,
including an implied hidden bit.
Linux-on-POWER compilers and libraries are
implementing support for this new floating-point type based
on the OpenPOWER ELF v2 application binary interface
(ABI) [17]. It is anticipated that when all the support is in
place, POWER9 IEEE 128-bit floating point will serve
code that needs a larger exponent or mantissa range than the
traditional 64-bit type can provide. This new support
provides full hardware performance for numerically
intensive computing applications that require extended
precision.
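
As a small usage sketch (assuming GCC's __float128 support and libquadmath, enabled with -mfloat128 -mcpu=power9 and linked with -lquadmath):

    #include <quadmath.h>
    #include <stdio.h>

    int main(void)
    {
        /* 1/3 in IEEE binary128: about 33 significant decimal digits. */
        __float128 third = 1.0Q / 3.0Q;
        char buf[64];

        quadmath_snprintf(buf, sizeof buf, "%.33Qg", third);
        puts(buf);
        return 0;
    }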
Half-precision conversion instructions
POWER9 adds instructions to convert between scalar 16-bit
floating point and 32-bit floating point. POWER9 also adds
instructions to aid in converting vector 16-bit floating point
and vector 32-bit floating point. Support for half precision
provides special capabilities to enhance the performance of
new generations of reduced precision computation utilized
in the cognitive computing space.
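
For reference, the bit manipulation that a single POWER9 scalar widening conversion replaces is shown below; this portable routine converts an IEEE binary16 value to binary32 entirely in software and is purely illustrative.

    #include <stdint.h>
    #include <string.h>

    /* Portable IEEE binary16 -> binary32 conversion, shown only to
     * illustrate the work that one POWER9 conversion instruction
     * performs in hardware. */
    static float half_to_float(uint16_t h)
    {
        uint32_t sign = (uint32_t)(h >> 15) << 31;
        uint32_t exp  = (h >> 10) & 0x1f;
        uint32_t frac = h & 0x3ff;
        uint32_t bits;
        float f;

        if (exp == 0x1f) {                    /* infinity or NaN */
            bits = sign | 0x7f800000u | (frac << 13);
        } else if (exp != 0) {                /* normal number */
            bits = sign | ((exp + 127 - 15) << 23) | (frac << 13);
        } else if (frac == 0) {               /* signed zero */
            bits = sign;
        } else {                              /* subnormal: renormalize */
            exp = 127 - 15 + 1;
            while (!(frac & 0x400)) {
                frac <<= 1;
                exp--;
            }
            bits = sign | (exp << 23) | ((frac & 0x3ff) << 13);
        }
        memcpy(&f, &bits, sizeof f);
        return f;
    }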
Vector strings
New instructions were added to speed up string
operations in vector registers. First demonstrated with the
introduction of the SIMD accelerator for business
analytics for z Systems [18], string processing using the
vector facility provides significant improvements in
throughput by enabling the use of wide data paths for
variable-length data types. The new instructions are vector
load string with variable number of bytes, vector store
string with variable number of bytes, and vector string
compare [19–22].
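
With GCC's ISA 3.0 vector built-ins (compiled with -mcpu=power9), the variable-length loads are reachable from C; a sketch:

    #include <altivec.h>
    #include <stddef.h>

    /* Load up to 16 bytes of a string with one variable-length
     * vector load; bytes past len are never touched, so the load
     * cannot fault on an unmapped page just beyond the string. */
    vector unsigned char load_string_prefix(unsigned char *s, size_t len)
    {
        return vec_xl_len(s, len < 16 ? len : 16);
    }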
Multi-precision arithmetic
POWER9 adds new capabilities to utilize the overflow
flag as a carry bit for multiply-and-add arithmetic
operations by providing a new set of instructions.
These operations are important to the performance of
high-performance technical computing, public-key
cryptography, blockchain implementations, elliptic-curve
cryptography, and TLS (transport layer security) support
provided by web servers. In addition, this feature
provides POWER9 users with enhanced performance for
large-integer arithmetic by allowing algorithms to utilize
a second, parallel set of computations that leverage the
rich set of computational resources available in the
POWER9 processor.
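
The inner step these instructions accelerate is the multiply-and-add of schoolbook big-number arithmetic. A portable C sketch of one limb step follows, which a POWER9 compiler can map onto the new fused multiply-add and carry instructions:

    #include <stdint.h>

    /* One limb of multi-precision multiply-accumulate: returns the
     * low 64 bits and propagates the high 64 bits through *carry.
     * The 128-bit intermediate cannot overflow, since
     * (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
    static uint64_t mul_add_limb(uint64_t a, uint64_t b,
                                 uint64_t acc, uint64_t *carry)
    {
        unsigned __int128 t = (unsigned __int128)a * b + acc + *carry;
        *carry = (uint64_t)(t >> 64);
        return (uint64_t)t;
    }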
32-bit carry and overflow
POWER9 now provides information about 32-bit carry and
overflow when running a 64-bit application [23]. This new
infrastructure provides rapid access to the 32-bit carry and
overflow state to allow higher efficiency for operations on
32-bit data types, with direct performance benefits for
dynamic programming languages, such as PHP, Python,
and Ruby.
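
A typical beneficiary is the overflow-checked small-integer addition at the heart of dynamic-language interpreters; with the new state, the compiler can implement a check such as the following without widening to 64 bits (illustrative sketch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Add two boxed 32-bit integers; the interpreter falls back to a
     * big-integer path only when the 32-bit add actually overflows. */
    static bool add32_checked(int32_t a, int32_t b, int32_t *sum)
    {
        return !__builtin_add_overflow(a, b, sum);
    }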
Lightweight mffs (move from FPSCR) set of instructions
POWER9 adds instructions to modify a subset of the
FPSCR (floating-point status and control register) in a
very efficient manner. Recent additions and extensions to
the floating-point behavior defined in language standards
provide much greater user control over rounding behavior
and rounding modes. This enhanced set of mffs instructions
allows compilers and libraries to adjust the processor
floating-point behavior, providing users with full language
conformance at the demanding performance levels that
users expect from the IBM POWER processors.
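
A representative use is the standard C fenv.h rounding-mode change, which compilers and math libraries can now implement with lightweight FPSCR moves instead of a full read-modify-write of the whole register. The code below is ordinary portable C; the instruction selection is the compiler's and libm's concern.

    #include <fenv.h>

    /* Perform one addition under directed (toward -infinity) rounding,
     * then restore the caller's rounding mode. */
    double add_rounding_down(double a, double b)
    {
        int old_mode = fegetround();
        double r;

        fesetround(FE_DOWNWARD);
        r = a + b;
        fesetround(old_mode);
        return r;
    }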
PC-relative addressing with the addpcis instruction
POWER9 takes a great leap forward with an entirely new
addressing mode. The addpcis instruction can be used to
load the next instruction address into a register, allowing
the instruction address to easily and seamlessly contribute
to the computation of an address for storage access. This
directly addresses the evolution of programming languages
and application design that leverage much more abstraction
and modularity, enabling the POWER architecture to keep
pace with modern languages.
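
A minimal illustration (GNU inline assembly, assembled with -mcpu=power9; in practice the compiler emits addpcis itself where profitable):

    #include <stdint.h>

    /* addpcis rt,0 writes the address of the next instruction into
     * rt; nonzero immediates add a shifted displacement, letting
     * nearby data be addressed relative to the program counter. */
    static uintptr_t next_instruction_address(void)
    {
        uintptr_t nia;
        __asm__ volatile("addpcis %0, 0" : "=r"(nia));
        return nia;
    }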
ISEL (integer select) instruction performance
enhancements
The IBM POWER9 processor design enhances the performance
of the ISEL instruction, allowing more efficient
implementations of operations using condition registers.
Dynamic programming languages frequently utilize idioms
that require this computation, which now can leverage ISEL
for a high-performance implementation.
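
The idiom in question is the branch-free conditional select; given code such as the sketch below, the compiler can emit a single isel in place of a conditional branch that interpreter dispatch paths would frequently mispredict.

    #include <stdint.h>

    /* A data-dependent select: with isel this compiles to straight-
     * line code, so an unpredictable cond costs no pipeline flush. */
    static int64_t select64(int64_t cond, int64_t a, int64_t b)
    {
        return cond ? a : b;
    }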
Conclusion
The IBM POWER9 architecture and system software
innovations offer great performance improvements for both
scale-out and scale-up systems, via significant features that
support cloud and dynamic scripting languages, both of
which are increasingly being used in modern cognitive
computing workloads.
The radix tree address translation in POWER9 offers
improved performance for most Linux workloads. A new
interrupt architecture allows direct control over interrupt
sources by the guest OSs, improving overall system
performance and scalability. POWER9 systems also include
a new Gzip accelerator that allows user-level software to
directly access the NX unit, eliminating overhead in
accessing accelerators, thereby providing a performance
boost to applications utilizing them.
The POWER9 platform and system software continue to
support enterprise workloads that require scalability and
compatibility between hardware, system software, and
existing applications, plus the ability to easily migrate
workloads and partitions or VMs between systems.
Additionally, the software ecosystem for POWER9 is
enhanced with support for features such as a new 128-bit
binary floating-point format, providing full hardware
performance for numerically intensive computing
workloads. The new POWER9 half-precision conversion
instructions enhance cognitive computing. Finally, support
for new multi-precision arithmetic enhances high-
performance technical computing and public-key
cryptography, directly improving the performance of
blockchain implementations, elliptic curve cryptography,
and transport layer security.
With this set of new features, the POWER9 platform is
well suited for the workloads of this new era, including big-
data analytics, AI and cognitive workloads, HPC, private/
public/hybrid clouds, as well as enterprise workloads.
Acknowledgment
We are grateful to all the IBMers who have contributed to
POWER9 system software (OSs, hypervisors, etc.),
architecture, and the POWER ecosystem, but the list is too
long to provide here. Nevertheless, we would like to thank
R. Kalla and P. Pattnaik for their leadership.
References
1. IBM Power ISA Version 3.0B (Section 5.7: Storage Addressing), 2017. [Online]. Available: https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
2. A. Bybell, B. Frey, and M. Gschwind, "Selectable address translation mechanisms," U.S. Patent 9,600,419, 2012. [Online]. Available: https://patents.google.com/patent/US9600419
3. A. Bybell, B. Frey, M. Gschwind et al., "Managing translation of a same address across multiple contexts using a same entry in a translation lookaside buffer," U.S. Patent 9,311,249, 2012. [Online]. Available: https://patents.google.com/patent/US9311249
4. I. Yaniv and D. Tsafrir, "Hash, don't cache (the page table)," in Proc. ACM SIGMETRICS Int. Conf. Meas. Model. Comput. Sci., Jun. 2016, pp. 337–350. [Online]. Available: http://www.cs.technion.ac.il/~dan/papers/hvsr-sigmetrics-2016.pdf
5. T. W. Barr, A. L. Cox, and S. Rixner, "Translation caching: Skip, don't walk (the page table)," in Proc. 37th Annu. Int. Symp. Comput. Archit., Jun. 2010, pp. 48–59. [Online]. Available: https://www.cs.rice.edu/CS/Architecture/docs/barr-isca10.pdf
6. A. Bybell, B. Frey, M. Gschwind et al., "Managing translations across multiple contexts using a TLB with entries directed to multiple privilege levels and to multiple types of address spaces," U.S. Patent 9,317,443, 2014. [Online]. Available: https://patents.google.com/patent/US9317443
7. Linux kernel source code repository. [Online]. Available: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
8. R. Arndt, F. Auernhammer, S. Jacobs et al., "Techniques for indicating a preferred virtual processor thread to service an interrupt in a data processing system," U.S. Patent 9,678,901 B2. [Online]. Available: https://patents.google.com/patent/US9678901B2/en
9. R. Arndt, F. Auernhammer, and B. Mealey, "Techniques for escalating interrupts in a data processing system to a higher software stack level," U.S. Patent Appl. 20170139860 A1. [Online]. Available: https://patents.google.com/patent/US20170139860A1/en
10. F. Auernhammer and R. L. Arndt, "XIVE: External interrupt virtualization for the cloud infrastructure," IBM J. Res. & Dev., vol. 62, nos. 4/5, 2018 (this issue).
11. Linux on Power Architecture Platform Reference. [Online]. Available: https://openpowerfoundation.org/?resource_lib=linux-on-power-architecture-platform-reference
12. J. Stuecheli, "POWER9 chip technology," in Proc. OpenPOWER Summit Europe, Barcelona, Spain, 2016. [Online]. Available: https://openpowerfoundation.org/wp-content/uploads/2016/11/Jeff-Stuecheli-POWER9-chip-technology.pdf
13. B. Thompto, "POWER9: Processor for the cognitive era," in Proc. IEEE Hot Chips 28 Symp., Cupertino, CA, USA, 2016, pp. 1–19. [Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.921-.POWER9-Thompto-IBM-final.pdf
14. POWER9 Processor User's Manual: OpenPOWER, version 1.0. [Online]. Available: https://www-355.ibm.com/systems/power/openpower/posting.xhtml?postingId=FC2A3168C5821FA78525803D00719802
15. M. Gschwind, "Workload acceleration with the IBM POWER vector-scalar architecture," IBM J. Res. & Dev., vol. 60, nos. 2/3, Paper 14, pp. 14:1–14:18, 2016.
16. IEEE Standard for Floating-Point Arithmetic, IEEE 754-2008, 2008. [Online]. Available: http://ieeexplore.ieee.org/document/4610935/
17. Power Architecture 64-Bit ELF V2 ABI Specification, 2015. [Online]. Available: https://members.openpowerfoundation.org/document/dl/576
18. E. Schwarz, R. Krishnamurthy, C. Parris et al., "The SIMD accelerator for business analytics on the IBM z13," IBM J. Res. & Dev., vol. 59, nos. 4/5, 2015.
19. M. Gschwind, B. Olsson, and R. Silvera, "Computer instructions for limiting access violation reporting when accessing strings and similar data structures," U.S. Patent 9,690,509, 2014. [Online]. Available: https://patents.google.com/patent/US9690509
20. M. Gschwind and B. Olsson, "Processing page fault exceptions in supervisory software when accessing strings and similar data structures using normal load instructions," U.S. Patent 9,678,886, 2014. [Online]. Available: https://patents.google.com/patent/US9678886
21. M. Gschwind and B. Olsson, "Exception preserving parallel data processing of string and unstructured text," U.S. Patent Appl. US20170091124A1, 2015. [Online]. Available: https://patents.google.com/patent/US20170091124A1
22. M. Gschwind and B. Olsson, "Vector load with instruction-specified byte count less than a vector size for big and little endian processing," U.S. Patent Appl. US20170139709A1. [Online]. Available: https://patents.google.com/patent/US20170139709A1
23. M. Gschwind and B. Olsson, "Simultaneously capturing status information for multiple operating modes," U.S. Patent 9,727,353. [Online]. Available: https://patents.google.com/patent/US9727353
Received March 11, 2018; accepted for publication April 17,
2018
Joefon Jann IBM Research, Thomas J. Watson Research Center,
Yorktown Heights, NY 10598 USA (joefon@us.ibm.com). Ms. Jann has
been an IBM Distinguished Engineer since 2007. She leads the POWER
system software research work in the worldwide IBM Research labs.
She selects and facilitates their research toward a smooth transition to
productization by the Systems Business Unit. She has half a dozen or so
Corporate Outstanding Technical or Innovation awards in the areas of
virtualization (dynamic logical partitioning), robustness, security, and
autonomic performance in the OS. Her interests include whole stack
performance, cognitive analytics, graphs and matrix groups, and
software-defined networking for POWER systems. She is a member of
the IBM Academy of Technology, a Senior Member of the IEEE, and a
member of IEEE Women In Engineering, ACM, and Society of Women
Engineers.
She has an M.S. degree in computer science from Columbia
University, an M.A. degree in mathematics from City University of
New York (CUNY), and a B.A. degree in mathematics with honors
from Wellesley College-MIT.
Paul Mackerras IBM Systems, Canberra, ACT, Australia
(pmac@au1.ibm.com). Dr. Mackerras is an IBM Senior Technical Staff
Member in the IBM Linux Technology Center. He has contributed to the
Linux kernel for more than 20 years, particularly in the area of support
for the POWER (formerly PowerPC) architecture, including serving as
the architecture maintainer for the PowerPC architecture for 7 years. He
currently works on the Linux Kernel Virtual Machine code supporting
IBM Power processors.
John Ludden IBM Systems, Essex Junction, VT 05452 USA
(ludden@us.ibm.com). Mr. Ludden received a B.S. degree in electrical
engineering from the Rochester Institute of Technology. He joined IBM
in 1990 and is a Senior Technical Staff Member. He served as the
architecture verification lead engineer for the IBM POWER3 through
POWER9 processors. He has worked with the IBM Research - Haifa
Laboratory in Israel for 20 years to advance the instruction-level test
generation tools used to verify all current processors within IBM. He
holds several patents, has published multiple papers in the field of
processor verification, and pioneered the simultaneous multithreading
and transactional memory verification methodologies currently
employed by IBM. He owned the radix translation verification plan for
the POWER9 processor.
Michael Gschwind IBM Transformation and Operations, Armonk,
NY 10504 USA (mkg@us.ibm.com). Dr. Gschwind is Chief Engineer for
Machine Learning and Deep Learning in the IBM Global Chief Data
Office, where he leads the development of IBM's AI platform.
Dr. Gschwind has been a technical leader for IBM's key
transformational initiatives, creating and leading the development of
PowerAI and IBM's AI hardware and software products, leading the
development of the OpenPOWER Hardware Architecture, as well as the
software interfaces for the new little-endian OpenPOWER Linux and its
Software Ecosystem. In previous assignments, he was a chief architect
for Blue Gene, the little-endian OpenPOWER Linux software
architecture, OpenPOWER hardware architecture, the first-in-industry
numeric accelerator architecture Cell, POWER9, POWER8,
POWER7, the Microsoft Xbox360 media accelerator engine, and
mainframe architecture. As a faculty member at the Technical
University of Vienna, Dr. Gschwind invented inference and training
accelerators for neural networks that have become the technology base
for the AI revolution. He is an author of more than 100 reviewed papers
spanning all fields of computing from chip design and design
automation to software architecture and AI platforms, and an inventor
of over 500 issued patents. Dr. Gschwind is a Fellow of the IEEE, an
ACM Distinguished Speaker, Chair of the ACM SIGMICRO Executive
Board, an IBM Master Inventor, and a Member of the IBM Academy
of Technology.
Wade Ouren IBM Cognitive Systems, Rochester, MN 55901 USA
(wadeo@us.ibm.com). Mr. Ouren is an IBM Senior Software Engineer
in the PowerVM hypervisor development team. He has worked on the
design and implementation of PowerVM memory management and
virtualization for POWER5, POWER6, POWER7, POWER8, and
POWER9 with 14 patents issued. In previous assignments, he has
worked on memory management and PASE in IBM i.
Mr. Ouren has a bachelor's degree in computer science from Minot
State University and a master's degree in computer science from the
University of North Dakota.
Stuart Jacobs IBM Cognitive Systems, Rochester, MN 55901 USA
(sjacobs@us.ibm.com). Mr. Jacobs is an IBM Senior Software Engineer
in the PowerVM hypervisor development team. He has worked on the
design and implementation of PowerVM memory management and
virtualization for POWER5, POWER6, POWER7, POWER8, and
POWER9 with 35 patents issued. In previous assignments, he has
worked on memory management in IBM i.
Mr. Jacobs received a bachelor's degree in computer engineering
and a master's degree in software engineering from the University of
Minnesota.
Brian F. Veale IBM Cognitive Systems, Austin, TX 78758 USA
(bfveale@us.ibm.com). Dr. Veale is a Senior Software Engineer with
IBM Cognitive Systems. After beginning his career in distributed
systems at Southwest Research Institute in San Antonio, Texas, where
he developed components of a training simulator for the U.S. Air Force,
he was a GAANN Fellow and a Lecturer teaching courses in computer
science and electrical and computer engineering at the University of
Oklahoma, where he received a Ph.D. degree in computer science in
2005. He joined IBM in 2006 and has spent most of his career as a lead
architect for AIX focusing on support and exploitation of new processor
features and hardware systems. In 2015, he received an IBM
Outstanding Technical Achievement Award for his work on POWER8
and Transactional Memory. Dr. Veale is a senior member of the Institute
of Electrical and Electronics Engineers and a lifetime professional
member of the Association of Computing Machinery.
David Edelsohn IBM Research, Thomas J. Watson Research
Center, Yorktown Heights, NY 10598 USA (edelsohn@us.ibm.com).
Dr. Edelsohn is a Senior Technical Staff Member, Open Source
Ecosystem. He is a founding member of the GCC Steering Committee,
Trustee of the GNU Toolchain Fund, and maintainer of the PowerPC
port of GCC. He was one of the core innovators who helped establish
Linux at IBM and continues to drive IBM's engagement with Open
Source communities.
He has a Ph.D. degree in computational physics from Syracuse
University, an M.Sc. degree in astronomy from California Institute of
Technology, and an A.B. degree in physics and astronomy from
University of California, Berkeley.