The Impact of Diverse Memory Architectures on Multicore Consumer Software
An industrial perspective from the video games domain

George Russell, Colin Riley, Neil Henning, Uwe Dolinsky, Andrew Richards
Codeplay Software Ltd.
{george,colin,neil,uwe,andrew}@codeplay.com

Alastair F. Donaldson
Computer Science Department, University of Oxford
alad@comlab.ox.ac.uk

Alexander S. van Amesfoort
Dept. of Software Technology, Delft University of Technology
a.s.vanamesfoort@tudelft.nl
Abstract
Memory architectures need to adapt in order for performance and scalability to be achieved in software for multicore systems. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working in the area of compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game.
Categories and Subject Descriptors D.1.3 [Concurrent Programming]: Parallel programming

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 248481 (PEPPHER Project, www.peppher.eu). Supported by EPSRC grant EP/G051100/1. Contributions to this work were carried out while the author was an intern at Codeplay Software Ltd., supported in part by the European Union NoE HiPEAC-2 (FP7/ICT 217068).
MSPC'11, June 5, 2011, San Jose, California, USA. Copyright © 2011 ACM 978-1-4503-0794-9/11/06.
General Terms Performance, Design, Languages, Experimentation

Keywords memory architecture, performance, multi-core, accessor class, offload
1. Introduction
Video game development drives computer hardware advances, and is an effective method of subsidising or recouping R&D costs. Large AAA¹ titles rival the opening-weekend and total revenue figures of big Hollywood productions.

AAA commercial games are large, complex, high-performance, real-time software systems targeting multiple hardware platforms. Due to the need for high performance, as well as object-oriented structuring mechanisms, they are almost universally implemented using C++. Codebases are large and evolve rapidly due to changing design requirements. Tight bounds exist on acceptable resource usage, both for processing time and storage capacity.
In response to requests from game companies for tools to support the development of portable games that can run on architectures with multiple, disjoint memory spaces, we have designed Offload C++ [2], a compiler and runtime system for offloading large portions of C++ applications to run on accelerator cores in heterogeneous multicore systems. Since early 2009, we have worked with a number of AAA game providers on applying Offload C++ to their code bases. During these large video game development projects, we have observed that a large proportion of effort is spent attempting to achieve high performance from the memory architecture of the target systems. Development costs are directly related to this effort. This is compounded by the requirement to accommodate highly different memory architectures (e.g. the radically different memory architectures of the PlayStation 3 and Xbox 360 games consoles) within portable software, without compromising on performance. Catering to individual memory architectures can present a significant obstacle to code portability, reuse, and future proofing. This is undesirable, given the costs associated with development of software.

¹ In the video games industry, an AAA, or Triple-A, title refers to a high-budget ((tens of) millions of dollars) video game title, with 100+ people actively involved at any point in time for 1–3 years.
Contributions of this paper. After providing an overview of the challenges posed by modern memory architecture and surveying related work (Section 2), we provide a summary of Offload C++ (Section 3). We then discuss the impact of scalable memory architectures on the design of consumer C++ software, describing problems we have faced when applying Offload C++ to large games code bases, outlining techniques we have designed to partially solve these problems, and discussing where these solutions fall short (Section 4). We then discuss an orthogonal issue related to memory systems, that of indexed addressing (Section 5), and describe how Codeplay's compiler tools help developers write efficient code for architectures with indexed addressing.
2. Scalable Memory Architectures
Processor performance has outpaced memory performance, leading to the memory wall [11], which limits scalable performance in multicore architectures. Recent memory architectures aim to accommodate multicore scalability by adopting custom features such as explicitly managed, local scratch-pad memory and DMA transfer mechanisms (e.g. the Cell BE processor [6], as found in the Sony PlayStation 3 console), or on-chip networks (as employed by Intel's 48-core Single-Chip Cloud Computer (SCC) processor [9]). Scalability of memory systems has also been improved by measures such as dropping formerly standard features like cache coherency and byte (store) addressing, and by increasing alignment restrictions.
GPU memory architectures are geared towards providing massive bandwidth. This only works well if memory accesses can be combined into large, vector requests that are equally divided over the available memory banks. Instead of huge cache hierarchies, GPU designs spend resources on computational and latency-tolerating logic. Some fixed-function hardware is present to accelerate texturing (fetching, 2D cache locality and (de)compression). External accelerators like GPUs have the common disadvantage that significant overhead is required to transfer data between host and device memory. Recently, CPU and GPU architectures have begun to converge somewhat: CPUs integrate GPU cores, while the memory systems of GPUs start to cache more aggressively.
A software view of memory. The traditional software view of memory is as a single, flat address space with a uniform access cost and large capacity. Highly optimised software may take care to ensure optimal cache behaviour, but details of cache architectures are not exposed directly to software. Changes to memory architectures are subverting this fundamental assumption made by software, which is in turn reflected in programming languages and programming styles.

GameEntity e1, e2; // Allocated in local store
// Fetch game entities associated with collision
dma_get(&e1, collisionPair->first, sizeof(GameEntity), t);
dma_get(&e2, collisionPair->second, sizeof(GameEntity), t);
dma_wait(t); // Block until data arrives
do_collision_response(&e1, &e2);
// Write back updated entities
dma_put(&e1, collisionPair->first, sizeof(GameEntity), t);
dma_put(&e2, collisionPair->second, sizeof(GameEntity), t);

Figure 1. Example illustrating the use of explicit DMA for data movement in games code
If a platform has special mechanisms to manage data, such as explicit DMA transfers or other local storage manipulation functions, then at the lowest level these can be exposed via intrinsics. When writing SPE code in high performance games for the PlayStation 3 console, developers are forced to write low-level DMA operations using intrinsics. For example, Figure 1 shows what the programmer might write to pull two game entities involved in a collision pair into local store, update their state according to the collision, and write the results back to main memory. The dma_get and dma_put operations are non-blocking, and are passed an associated tag t, hence the two game entities are fetched in parallel; dma_wait(t) blocks until all pending operations associated with tag t have completed. Correct synchronization of DMA operations is essential for software correctness, but difficult to achieve in practice. The difficulty of DMA programming has prompted design of both static [3] and dynamic [7] analysis tools to detect DMA races.
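As a hedged illustration of the kind of race these tools detect (the addresses addrA and addrB and the call to process are hypothetical; the dma_* operations follow the style of Figure 1), consider reusing a local buffer before its pending transfer has completed:

GameEntity e; // Buffer in local store
dma_get(&e, addrA, sizeof(GameEntity), t); // Transfer 1 in flight
// Missing dma_wait(t) here: transfer 1 may still be writing into e
dma_get(&e, addrB, sizeof(GameEntity), t); // Transfer 2 also targets e
dma_wait(t); // Both transfers complete, in an unspecified order
process(&e); // Contents of e are nondeterministic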
Memory hierarchy-aware languages and tools. Intrinsic-based programming is low-level and non-portable. Many recent programming languages expose a hierarchical view of memory, enabling programmers to exploit data locality directly, yet attempting to minimise manual effort on the part of the programmer. OpenCL [8] and CUDA [12] expose a three-level hierarchy of 'global', 'local', and 'private' memory to programs, Offload C++ [2] (see Section 3) exposes a two-level hierarchy of 'outer' and 'local' memory, while Sequoia [5] exposes a multi-level model.

Alternatively, a compiler or run-time system may provide an application with the illusion of a flat address space, using underlying mechanisms and predetermined policies to move code and data transparently to where in the memory hierarchy it may be most efficiently accessed. This is the approach taken by the XL C/C++ and OpenMP compiler [4] and the Hera-JVM [10] for the Cell BE.
3. Offload C++
In Section 4, we describe problems we faced when adapting AAA games code to run on systems with multiple memory spaces, using Offload C++. We first give a brief overview of Offload C++; for full details please see [2].
void GameWorld::doFrame(...) {
  __offload_handle_t h = __offload {
    // Offload to accelerator
    this->calculateStrategy(...);
  };
  this->detectCollisions(); // Executed in parallel by host
  __offload_join(h); // Wait for accelerator to complete
  this->updateEntities();
  this->renderFrame();
}

Figure 2. Simplified example showing how an offload block could be used in a games context
Offload C++ extends the C++ language with a new keyword, __offload, which is used to prefix a lexical scope to indicate that the code within the scope should begin executing asynchronously in a separate thread. The lexical scope is called an offload block. Offload C++ is geared towards architectures like the Cell BE, consisting of a host core and a number of accelerators, where the accelerator and host instruction sets are different, and where each accelerator is equipped with its own private, scratch-pad memory. In the Cell architecture, the host is the Power Processor Element (PPE), and the accelerators are the Synergistic Processor Elements (SPEs), each of which has 256K of scratch-pad memory. In this context, wrapping a portion of code in an offload block indicates that:

1. The code within the block should be offloaded to an accelerator core (SPE)
2. The host should continue to execute, so that the offloaded accelerator thread runs in parallel
3. Data declared inside the offload block should be allocated in scratch-pad memory
4. Access from within the offload block to data declared outside the block should result in automatically generated data-movement code (see the sketch after this list)
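The following minimal sketch illustrates points (2) to (4) using the syntax of Figure 2; the variables and the computation are ours, not from a shipping title:

int hostTable[64]; // Declared outside: resides in host memory
__offload_handle_t h = __offload {
  int localSum = 0; // Point (3): allocated in SPE scratch-pad memory
  for (int i = 0; i < 64; i++)
    localSum += hostTable[i]; // Point (4): reads become data transfers
  // ... use localSum ...
};
// Point (2): the host continues executing here while the SPE runs
__offload_join(h);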
Making this work for full C++ involves solving some challenging issues related to (1) and (4). For (1), it is necessary to statically identify all code invoked (directly, or indirectly through chains of possibly virtual function calls) from the offload block and compile it separately for the accelerator cores. For (4), to compile pointer dereferences it must be possible to determine whether a pointer refers to local memory, in which case a standard load/store can be issued, or to host memory, in which case an inter-memory-space data transfer must be initiated.

Problem (1) is solved by equipping the compiler with techniques for automatic function duplication. There are two cases where manual annotations are required to help the compiler: one is when a call graph rooted in an offload block calls functions in separate compilation units, which are not immediately available for compilation. The other is that the programmer must specify which methods or functions may be called virtually or via function pointer inside an offload block. This list of methods, called a domain, and techniques for dynamic dispatch, are discussed in Section 4.1.
Problem (4) is solved via an extended type system. Pointers and references declared inside an offload block scope are automatically type qualified with a new __outer qualifier if they reside on the accelerator but reference host memory. Offload C++ maintains strong type checking to refuse erroneous pointer manipulations such as assignments between pointers into different memory spaces.
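A sketch of the intent, under the syntax introduced above (the variable names are ours, and the exact inference and diagnostics are the compiler's; we write the __outer qualifier explicitly for illustration):

int g; // Host memory
__offload {
  __outer int *ph = &g; // Accelerator-resident pointer into host memory
  int x = 0;
  int *pl = &x; // Pointer into local scratch-pad memory
  x = *ph; // Legal: the dereference becomes an inter-space transfer
  pl = ph; // Rejected by the type checker: mixes memory spaces
};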
Example. Figure 2 shows a simplified version of a video game loop, where AI computation is performed for game entities (calculateStrategy), collisions are detected (detectCollisions), and the results of these are used to update and render the game world (updateEntities and renderFrame). Suppose that strategy calculation and collision detection can be safely performed in parallel, and we wish to offload strategy calculation to an accelerator. This is achieved by wrapping this call in an offload block, using the __offload keyword as shown in the figure. (In practice, some additional syntax is used to pass parameters to the block, which we do not discuss due to space limitations.) When execution reaches the offload block, the Offload C++ runtime launches an accelerator thread to execute calculateStrategy. To do so, an accelerator version of this method, and all methods it may call, must be pre-compiled. The runtime immediately returns control to the host thread, which executes detectCollisions in parallel with the accelerator. The __offload_join library function is used to synchronize with the offloaded thread when its results are required. During execution of calculateStrategy by the accelerator, any accesses to host memory are automatically compiled into data transfers that go through a software cache.
4. The Impact of Scalable Memory Architectures on Consumer Software
We now discuss the impact of changes to memory architectures on the design of video games, challenges these changes have posed when applying Offload C++ to offload large fragments of game code in systems with multiple memory spaces, and solutions we have used to solve these problems.

Game code is typically structured such that computation is specified as parallel, distinct tasks with well defined synchronisation points, executing in a pre-defined and fixed schedule each frame. Tasks perform complex processing on relatively small numbers of objects (hundreds to thousands), for purposes including animation, AI, collision detection, physics, and rendering.
4.1 Virtual Methods
Object-oriented design and programming relies heavily on virtual methods, or dynamic dispatch. These are ubiquitous in games code, yet efficiently implementing virtual methods for systems with multiple memory spaces is a challenge.

Consider a C++ method call, obj->f(...). In a traditional system with a single memory space, the obj pointer is dereferenced to obtain a pointer to the virtual table (vtable). The virtual table pointer is dereferenced with an offset to obtain the address of the particular implementation of method f to call. Consider now a system equipped with accelerator cores, where each accelerator has a private local memory, and suppose that obj refers to an object stored in the local memory of a particular accelerator. Consider the invocation of obj->f(...) by a thread running on this accelerator core. It seems natural that we would wish the method call to be executed by the accelerator, operating on fast local memory. Conversely, if obj refers to an object in main memory, and the call is issued by the host core, we would expect the appropriate implementation of f to be executed by the host. To make this work in practice, for systems where different types of cores have incompatible instruction set architectures, we must move away from the setting where each object type has a single associated vtable.

Figure 3. Virtual calls with multiple memory spaces. (Diagram: an outer domain of pointers to functions in global store; a matching index selects an inner domain entry holding a count and (ID, function address) pairs.)
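For reference, on a traditional single-memory-space system the call obj->f(...) lowers to roughly the following (the slot constant F_SLOT, the casts and the signature are illustrative):

void **vtable = *(void ***)obj; // First word of *obj: the vtable pointer
typedef void (*Method)(GameEntity *); // Illustrative signature for f
Method fimpl = (Method)vtable[F_SLOT]; // Offset into the vtable
fimpl(obj); // Indirect call through the fetched address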
We now describe the implementation of the system used for virtual method dispatch within Offload C++ for the Cell BE. First, as mentioned in Section 3, the Offload C++ language allows the programmer to annotate offloaded code with details of which functions and methods are expected to be invoked through dynamic dispatch. These will be pre-compiled to execute on an accelerator core using local memory. When compiling an offload block for an SPE core, the compiler tracks the context of call sites, to determine whether a call requires dynamic dispatch (either because it is a virtual call, or through a function pointer). In this case, a system of inner and outer domains is used, as illustrated in Figure 3. Instead of a normal vtable lookup and call, a domain lookup is performed after vtable lookup to determine if an implementation of the routine is present in the local memory space. This lookup is a two stage process. First, a search over an array of known virtual method addresses, the outer domain, determines whether the routine is present in local store. If a potential match is found in the outer domain, the index of the matching pointer in the outer domain is used to index into the inner domain. Within the inner domain, we obtain details of the function duplicates present: distinct combinations of memory spaces in arguments require distinct duplicates to be made, with the appropriate data transfer code. Overloads may be selectively compiled, so there is no guarantee that a full set is present. The inner domain details the number of duplicates present, in a sequence of (identifier, function address) pairs. The identifier is compiler-generated meta-data identifying the signature of the routine with respect to combinations of memory spaces. Once a function address is obtained with an id matching the desired duplicate, the address of the desired routine in local store is known, and the call can be performed.
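A hedged sketch of this two-stage lookup (the data structure names, the sizes, and the helper raise_missing_duplicate are ours, not Codeplay's runtime API):

enum { MAX_DUPS = 8, N_ROUTINES = 64 }; // Illustrative sizes
struct DuplicateEntry { int id; void *localAddr; }; // (identifier, address) pair
struct InnerDomain { int count; DuplicateEntry dups[MAX_DUPS]; };
extern void *outerDomain[N_ROUTINES]; // Known virtual method addresses
extern InnerDomain innerDomains[N_ROUTINES]; // Parallel to outerDomain
extern void raise_missing_duplicate(void *addr, int id);

void *domain_lookup(void *hostFnAddr, int spaceId) {
  for (int i = 0; i < N_ROUTINES; i++) { // Stage 1: search the outer domain
    if (outerDomain[i] == hostFnAddr) {
      InnerDomain *d = &innerDomains[i]; // Stage 2: matching inner domain
      for (int j = 0; j < d->count; j++)
        if (d->dups[j].id == spaceId) // Signature w.r.t. memory spaces
          return d->dups[j].localAddr; // Duplicate present in local store
      break; // Routine known, but no duplicate for this combination
    }
  }
  raise_missing_duplicate(hostFnAddr, spaceId); // Exception, as described below
  return 0;
}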
At present, if a dynamically dispatched function does not provide a match in the inner domain, an exception is generated, providing information which the programmer can use to tell the compiler which methods should be pre-compiled for local dynamic dispatch. Elaborations on this technique could implement alternative behaviours, such as on-demand code loading for functions not present in local memory.
Practical experience offloading virtual methods in games. In the games domain, there is usually only a small selection of virtual functions that can be called at a given point in an algorithm. For example, in physics computations, virtual calls to collision callback methods are invoked (one method for each combination of object types that might collide), while during game AI, specific checks used in decision making involve virtual invocations. Thus annotating a given portion of code to be offloaded with a set of methods to be called virtually is often not too onerous.
However, we have found practical cases where virtual method annotations start to explode. Recently, in applying Offload C++ to a AAA title, we found that, without performing code restructuring, it was necessary to annotate a portion of offloaded code with upwards of 100 virtual functions. The problem was that the game used an abstract component system, performing more than 1300 virtual calls per frame, which we tried to offload in its entirety. We realised that, for a given task, only a selection of types and virtual methods were actually used when the offloaded portion of code was executed. We therefore restructured the component system to be type specialised, in one day, and without loss of generality. We wrote a separate offload for each task, one per component, instead of a single offload for all the distinct components, resulting in 13 separate type-specialised offloads (sketched below). After the restructuring, the maximum number of virtual functions associated with a portion of offloaded code being shipped in this particular game is 40.
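A sketch of the restructuring (the component names and loop bounds are illustrative): rather than one offload iterating over the abstract component list, which forces every possible update override into the domain, each concrete component type gets its own offload, in which the calls bind statically:

// Before: one offload over Component *components[]; every override of
// Component::update() must be annotated for dynamic dispatch.
// After: one type-specialised offload per concrete component type.
__offload_handle_t h1 = __offload {
  for (int i = 0; i < nPhysics; i++) physics[i].update(); // Direct call
};
__offload_handle_t h2 = __offload {
  for (int i = 0; i < nAnimation; i++) animation[i].update(); // Direct call
};
// ... 13 such offloads in total for this title ...
__offload_join(h1);
__offload_join(h2);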
For performance critical regions of code, avoiding vir-
tual methods is important when offloading to systems with
multiple memory spaces. The uniform abstraction of a vir-
tual call such as move() hides the specific type, and hence
size, of the object on which the function is invoked. Con-
sequently, the object data cannot be prefetched into fast lo-
cal store. We have found that it is possible to achieve suffi-
cient performance when all commonly accessed data can-
not be prefetched. However, processing objects in groups
of uniform type permits prefetching and double buffered
transfers, for further performance increases. Thus, despite
the support for virtual methods in the presence of multiple
memory spaces provided by the Offload compiler, develop-
41
ers may still need to spend time rewriting portions of code
to avoid virtual calls. Nevertheless, the support provided by
Offload C++ provides an important stepping stone between
no offloading whatsoever, and an offloaded and fully opti-
mized routine. It took 1 developer 2 months to offload the
very complex existing AI code of a AAA game to SPU, with
200 lines of additional code resulting in a 50% perfor-
mance increase. We note that the restructuring to avoid vir-
tual method calls additionally improved performance across
the range of target platforms, including targets with tradi-
tional memory architectures.
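As referenced above, a hedged sketch of double-buffered processing over a group of uniform type, in the DMA style of Figure 1 (PhysicsEntity, hostAddrs, the update method, and the two tags are illustrative):

PhysicsEntity buf[2]; // Two local-store slots for double buffering
dma_get(&buf[0], hostAddrs[0], sizeof(PhysicsEntity), tags[0]);
for (int i = 0; i < n; i++) {
  int cur = i & 1, nxt = cur ^ 1;
  if (i + 1 < n) {
    dma_wait(tags[nxt]); // This slot's previous write-back must finish
    dma_get(&buf[nxt], hostAddrs[i + 1], sizeof(PhysicsEntity), tags[nxt]);
  }
  dma_wait(tags[cur]); // Current object has arrived
  buf[cur].update(); // Uniform, statically known type: no virtual call
  dma_put(&buf[cur], hostAddrs[i], sizeof(PhysicsEntity), tags[cur]);
}
dma_wait(tags[0]); dma_wait(tags[1]); // Drain the final write-backs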
4.2 Data Locality Optimisations
Data locality and virtual calls. Virtual calls can be expensive when performed repeatedly in a tight loop, with little computation per invocation. Unfortunately, this use case is common in games designed in an OO style. Consider the loop in the following example:

GameObject *objects[N_OBJECTS];
...
GameObject **current = &objects[0];
for (int i = 0; i < N_OBJECTS; i++)
{ (*current)->move(); current++; }

which iterates over a collection of pointers to objects, invoking the virtual method move on each. There is a negative performance impact when both the collection of pointers (objects) and the referenced objects (targets of pointers in objects) are allocated in a non-local memory space where access is via a high-latency channel. On each iteration, the current pointer indexing into the container is dereferenced, triggering a transfer between memory spaces, to obtain a pointer to an object, which is dereferenced again to locate a function address in the vtable to perform the virtual call. Each iteration therefore incurs the latency of two dependent memory transfer operations, a significant overhead, especially in the case where the per-object computation does not take substantial time.
Software Cache Systems. Cache systems have been implemented in software for diverse memory architectures to mitigate transfer overhead [1, 15]. Software cache lookup introduces some overhead, but this is typically outweighed by the performance gained from avoiding repeated accesses to data via inter-memory transfers, a problem common in adapting existing code to support multiple memory space architectures. A compiler can identify and alter each access to use a cache when appropriate during compilation. Offload C++ provides this facility through its type system, which differentiates between pointers to local and host data from within an offload block (see Section 3). To support this, we have developed several software caches, favouring different types of application behaviour. The programmer must decide, based on profiling, which cache is most suitable for a given offload; this is beyond the scope of the compiler.
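A hedged sketch of what a software cache lookup can look like for reads (a direct-mapped design; the geometry, the table layout and the dma_* usage are illustrative, not one of Codeplay's shipping caches, and write-back of dirty lines is omitted):

#include <stdint.h>
enum { LINE = 128, LINES = 64 }; // Illustrative geometry
struct CacheLine { uintptr_t tag; char data[LINE]; };
static CacheLine cache[LINES];

void *cached_read(uintptr_t hostAddr) {
  uintptr_t base = hostAddr & ~(uintptr_t)(LINE - 1); // Containing line
  CacheLine *l = &cache[(base / LINE) % LINES]; // Direct-mapped slot
  if (l->tag != base) { // Miss: one DMA fetch replaces the line
    dma_get(l->data, base, LINE, t); // t: a DMA tag, as in Figure 1
    dma_wait(t);
    l->tag = base;
  }
  return &l->data[hostAddr - base]; // Hit path: a few ALU operations only
}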
Accessor Classes and Source Level Portability.
Programmers can use portable accessor classes (efficient data access abstractions) and knowledge of their application's access patterns to achieve high performance. Offload provides a number of accessor classes, written using C++ templates and operator overloading [14].

To minimize transfers, and to better exploit a fast local store, we use an Array accessor class as follows:

GameObject *objects[N_OBJECTS];
...
Array<GameObject *, N_OBJECTS> local_objects(objects);
GameObject **current = &local_objects[0];
for (int i = 0; i < N_OBJECTS; i++)
{ (*current)->move(); current++; }

We have interposed an Array data accessor between the original array and the code that accesses it. This accessor class is part of the Offload C++ library, and will perform a single, efficient bulk transfer of the array of pointers into fast local store. Subsequently, it acts like an array, allowing indexing operations. This removes a high-latency transfer operation from each iteration, increasing performance. On a shared memory system, an Array implementation provides direct access to the data. We have not explicitly stated how the array is to be transferred: this can be factored out in the implementation of Array, permitting the use of this technique in portable code.
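A minimal sketch of how such an accessor can be written with templates and operator overloading; the shipping Offload C++ Array differs in detail, and on a shared-memory target the class could simply alias the original array rather than copy it:

template <typename T, int N>
class Array {
  T local[N]; // Local-store copy of the wrapped array
public:
  Array(__outer T *src) { // One bulk transfer on construction
    dma_get(local, src, N * sizeof(T), t); // t: a DMA tag, as in Figure 1
    dma_wait(t);
  }
  T &operator[](int i) { return local[i]; } // Plain local indexing thereafter
};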
We have added such data transfer code to optimise offloaded game code compiled using function duplication when executing on multiple-address-space systems. Similar source changes will arise in systems with multiple memories in a single address space, where consideration must be given to inter-memory transfers.
5. Indexed Addressing
Memory systems are usually based on addresses that refer to bytes. Some addressing systems, however, are word-oriented (e.g. TigerSHARC) or vector-oriented (e.g. the PlayStation 2 vector unit). In such systems, using an assembler instruction to add 1 to an address causes the address to refer to the next word or vector, instead of the next byte. This allows a much simpler memory architecture, and such a system can be used to index into registers instead of memory. However, this can cause problems with software, as most modern software assumes byte addressing. The problems are normally only visible when defining pointers to data that is smaller than the addressing unit size, or when accessing elements of data structures which are smaller than the addressing unit size.

Solutions to this problem are: use a language that uses word addressing, such as BCPL [13]; keep all pointers as byte-pointers and convert when dereferencing; or use a hybrid approach that allows most existing byte-pointer code to work on word-addressed systems. BCPL uses a system whereby all pointers are word pointers. When processing byte pointers (e.g. for strings), special library routines are used. Such languages are rare on modern systems, creating a code portability problem. Keeping pointers as byte-pointers and converting on dereference gives the greatest level of portability, but at the expense of an often unacceptable performance hit. Compiling a pointer dereference (a common operation in C++ code) may require several shifts and some logical operations. The following example works when keeping pointers as byte-pointers, but may be inefficient on word-addressed architectures:

for (int i = 0; i < N; i++) { *string++ = (char)i; }
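To see why, a hedged sketch of what each byte store above expands to on a 4-byte word-addressed machine (load_word and store_word are hypothetical primitives standing in for the generated instructions):

unsigned word = load_word(byteAddr >> 2); // Fetch the containing word
unsigned shift = (byteAddr & 3) * 8; // Byte position within the word
word &= ~(0xFFu << shift); // Clear the target byte
word |= (unsigned)(unsigned char)value << shift; // Insert the new byte
store_word(byteAddr >> 2, word); // Read-modify-write complete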
In designing compilers for game vendors, we have devised a hybrid approach that largely maintains software portability. The novelty of our approach is that the compiler statically generates errors when applied to code that is inefficient for the device. We define an extra attribute for each pointer data type: the addressing unit size. So we can have two types of char pointers:

char __word *p; // word-addressed pointer to a char
char __byte *p; // byte-addressed pointer to a char

By default, a non-annotated pointer is treated as word-addressed. This means that, by default, any char pointer p can only point to word-aligned bytes. If we do arithmetic on the pointer, e.g. computing p+1, then we ensure that the type is now byte-addressed, but the value contains a word-addressed element (in this case p) and a byte-addressed offset (in this case 1). This makes the operation of dereferencing the pointer quite efficient: we know that we can load a word at the address pointed to by p, and that we then extract the second byte from that word, which we can compile efficiently, because we know the offset is a constant value of 1. If we attempt to add an integer variable x to p, then we know that we will get a variable byte-pointer, which cannot be dereferenced efficiently, so we raise a compilation error. This forces the user to think about the pointer arithmetic they are performing, and rewrite it in a more efficient manner.
An extended type-checker allows pointer expressions derived from word-addressed pointers to be assigned to byte-addressed pointers, but prohibits non-word-addressed values from being assigned to word-addressed pointers. This is illustrated by the following, where p is word-addressed:

char *q = p+4; // this is legal, if the word size is 4
char *q = p+1; // this is illegal
char __byte *q = p+1; // this is legal
This allows us to define and operate on structures containing byte elements, which is the most common use case for byte-addressing on word-addressed memory architectures:

struct T {
  char a, b, c, d;
} *p; // this is a word-pointer to data of type T
// This works, using the constant offsets of 'a' and 'b'
p->a = p->b;

Our technique does not easily support operations on arrays of characters (i.e. strings). This is rarely a problem in practice: processors equipped with word-addressed memory are not usually designed to operate on text, hence the decision to use word addressing in the first place. We have found that game developers prefer the hybrid technique when they want to be alerted to inefficient code generation.
6. Conclusions
The issue of how to improve consumer software, such as games, to make better use of memory architectures is a multi-faceted problem. We have illustrated some problematic hardware/software interactions that occur at present, and suggested techniques for dealing with these instances, with illustrative examples from Offload C++, a programming language and system for offloading large portions of C++ code to run on accelerator cores with local stores.

Exploiting the full performance of architectures with multiple memory spaces in an existing code base can be a complex, costly process. Yet, the alternative of rewriting software from scratch for each new architecture is even less appealing. We believe that this issue has to be attacked from two directions: software developers require better compilers and portable libraries to support new memory architectures, while common patterns for memory-aware application refactoring must be identified.
References
[1] J. Balart et al. A novel asynchronous software cache implementation for the Cell-BE processor. LCPC, 2008.
[2] P. Cooper, U. Dolinsky, A. F. Donaldson, A. Richards, C. Riley, and G. Russell. Offload - automating code migration to heterogeneous multicore systems. HiPEAC, 2010.
[3] A. F. Donaldson, D. Kroening, and P. Ruemmer. Automatic analysis of scratch-pad memory code for heterogeneous multicore processors. TACAS, 2010.
[4] A. E. Eichenberger et al. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Systems Journal, 45:59-84, 2006.
[5] K. Fatahalian et al. Sequoia: programming the memory hierarchy. SC, 2006.
[6] H. P. Hofstee. Power efficient processor architecture and the Cell processor. HPCA, 2005.
[7] IBM. Cell BE Race Check Library, July 2008. Described in Example Library API Reference, version 3.1.
[8] Khronos Group. The OpenCL specification. http://www.khronos.org/opencl.
[9] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe. The 48-core SCC processor: the programmer's view. SC, 2010.
[10] R. McIlroy and J. Sventek. Hera-JVM: a runtime system for heterogeneous multi-core architectures. OOPSLA, 2010.
[11] S. A. McKee. Reflections on the memory wall. Computing Frontiers, 2004.
[12] NVIDIA. CUDA zone. http://www.nvidia.com/cuda/.
[13] M. Richards. BCPL: a tool for compiler writing and system programming. AFIPS Spring Joint Computer Conference, 1969.
[14] G. Russell, P. Keir, A. Donaldson, U. Dolinsky, A. Richards, and C. Riley. Programming heterogeneous multicore systems using threading building blocks. HPPC, 2010.
[15] M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache. ICS, 2008.