An Adaptive Game Loop Architecture with Automatic Distribution of Tasks between CPU and GPU
Mark Joselli
UFF, MediaLab
Marcelo Zamith
UFF, MediaLab
Esteban Clua
UFF, MediaLab
Anselmo Montenegro
UFF, MediaLab
Regina Leal-Toledo
UFF, IC
Aura Conci
UFF, MediaLab
Paulo Pagliosa
UFMS, DCT
Luis Valente
PUC-Rio, VisionLab
Bruno Feijó
PUC-Rio, VisionLab
Abstract
This paper presents a new architecture for implementing any game loop model for games and real-time applications that use the GPU as a mathematics and physics co-processor, working in parallel with the CPU. The model applies automatic task distribution concepts. The architecture can apply a set of heuristics, defined in Lua scripts, in order to determine the best processor for handling a given task. The model applies the GPGPU (general-purpose computation on GPUs) paradigm. The proposed architecture acquires knowledge about the hardware by running tasks on each processor and, by studying their performance over time, learns which processor is best for a group of tasks.
Keywords: Game loops, GPGPU, task distribution
Authors' Contact:
{mjoselli, mzamith, esteban, anselmo, leal, aconci}@ic.uff.br
pagliosa@dct.ufms.br
{lvalente, bruno}@inf.puc-rio.br
1 Introduction
Real-time systems, such as games, are defined as systems that have time constraints on running their tasks: if for any reason the system is not able to execute its work within some time threshold, it fails. Game loops are implemented in order to meet such constraints.
Computer games are multimedia applications that employ knowledge from many different fields, such as computer graphics, artificial intelligence, physics, and networking. Moreover, computer games are interactive applications that exhibit three general classes of tasks: data acquisition, data processing, and data presentation. Data acquisition in games is related to gathering data from input devices such as keyboards, mice and joysticks. Data processing tasks consist of applying game rules, responding to user commands, and simulating physics and artificial intelligence behaviors. Data presentation tasks provide feedback to the player about the current game state, usually through images and audio.
Many computer games offer experiences where many actions seem to happen at once. However, computers have limited resources, so it is necessary to gather the results of all processes involved in a game and present them to the player. If the application is not able to perform this work on time, the user will not receive continuous feedback, and the interactivity that the game should provide will not be acceptable; hence one of the main requirements of a game will not be fulfilled. This issue characterizes computer games as heavy real-time applications.
A common parameter for measuring computer game performance is the number of frames per second (FPS) displayed on the screen, where a frame is one image displayed on the screen. A commonly accepted lower bound for interactive rates is 16 frames per second; a frame rate from 50 to 60 FPS is usually considered optimal.
Nowadays, computers and the new video game consoles (such as the Xbox 360 and the PlayStation 3) feature multicore processors. For this reason, game loops that take advantage of these resources are likely to become important in the near future, and parallelizing game tasks with multiple threads is a natural step. In order to take advantage of different hardware, this work presents a generic architecture for game loops and a multithreaded game loop built with this architecture.
The development of programmable GPUs (graphics processing units) has enabled new possibilities for general-purpose computation (GPGPU), which can now be used to process some of the common tasks of the game loop, such as the data processing tasks. This is good news for games, because the parallel architecture of the latest GPUs gives them more processing power than CPUs. GPUs perform better than CPUs when large amounts of data are involved, but to take advantage of this power it is necessary to adopt a different approach than the traditional sequential CPU model. Due to their architectural characteristics, CPUs are more suitable for processing small amounts of data, while GPUs are more suitable for large amounts. In order to achieve good performance in both cases (small and large amounts of data), it is necessary to implement an automatic method to distribute tasks between the CPU and the GPU. For these heuristics to work well with different hardware architectures, they are implemented in a scripting language.
The main objective of this work is to present a new game loop architecture that can be used to implement any game loop model and can take advantage of automatic dynamic distribution of tasks between the CPU and the GPU. This distribution is based on heuristics that are defined in Lua [Lua] scripts. This work presents concepts that could also be applied to other hardware, such as the PlayStation 3 with the Cell processor [Hofstee 2005].
This paper is organized as follows. Section 2 presents GPGPU concepts. Section 3 presents related work on game loops and task distribution between CPU and GPU. Section 4 presents the generic architecture for game loops. Section 5 presents the test case with the test game loop and the test heuristics, and Section 6 presents the results. Finally, Section 7 presents the conclusions.
2 GPGPU
Graphics processing units, or simply GPUs, are processors dedicated to the mathematical processing in the graphics pipeline. The evolution of these processors allows them to be used for other mathematical tasks as well.
GPUs have been evolving constantly, and faster than CPUs, acquiring superior computational power. An nVidia 8800 Ultra [NVIDIA 2006], for instance, can sustain a measured 384 GFLOPS against 35.3 GFLOPS for a 2.6 GHz dual-core Intel Xeon 5150 [NVIDIA 2008]. This is attributed to the parallel SIMD architecture of GPUs (the nVidia GeForce 9800 GX2, for example, has 256 unified stream processors). Because of their parallel architecture, GPUs are very good at processing applications that require high arithmetic rates and data bandwidth.
nVidia and AMD/ATI have implemented unified architectures in their GPUs, and each architecture is associated with a specific language: nVidia has developed CUDA (Compute Unified Device Architecture) [nVidia 2007b] and AMD has developed CAL (Compute Abstraction Layer) [AMD 2007]. One main advantage of these languages is that they allow the GPU to be used in a more flexible way (both are based on the C language), without some of the traditional shader language limitations, such as the lack of "scatter" memory operations (i.e., indexed write array operations) and of operations on integer data, like the bit-wise logical operations AND, OR, XOR, NOT and bit shifts [Owens et al. 2007]. Nevertheless, the disadvantage of these architectures is that
SBC - Proceedings of SBGames'08: Computing Track - Full Papers
Belo Horizonte - MG, November 10 - 12
VII SBGames - ISBN: 85-766-9217-1
115
they are vendor-specific: CUDA only works on nVidia cards and CAL only works on AMD/ATI cards. In order to have GPGPU programs that work on both vendors' GPUs, it is necessary to implement them in shader languages such as GLSL (OpenGL Shading Language), HLSL (High Level Shading Language) or Cg (C for Graphics), with all the vertex and pixel shader limitations and idiosyncrasies.
In addition, Intel has recently presented a new architecture for GPUs, called Larrabee [Seiler et al. 2008]. It is made up of several x86 processors working in parallel, which can be used to process both graphics and non-graphics data. The advantage of this architecture is that it does not need a special language, just plain C. Nevertheless, it will only be available in late 2009.
There are many areas that apply GPGPU: weather forecasting [Michalakes and Vachharajani 2008], chemistry [Ufimtsev and Martínez 2008] and, of course, graphics. Games use GPGPU mainly in two areas: physics and AI.
PhysX [Ageia 2008] and Havok [Intel 2008] are examples of physics engines that have used the GPU to accelerate their physics loop (an eightfold speedup in the case of Havok [Green 2007]). The book GPU Gems 3 [Nguyen 2007] also presents a full section dedicated to physics using GPGPU. In the field of AI, there are implementations of state machines [Rudomín et al. 2005] and flocking boids [Erra et al. 2004] that use the GPU to process their data.
3 Related Works
The game loop can be divided into three general classes of tasks:

- Data acquisition tasks are responsible for getting user commands from the various input devices;
- Data processing tasks, also referred to as the update stage, are responsible for updating the game state, for example: character animation, physics simulation, artificial intelligence, game logic, and network data acquisition;
- Data presentation tasks are responsible for presenting the results to the user. In games, this usually corresponds to rendering graphics and playing audio.
The main objective of the real-time game loop models in the literature is to arrange the execution of those tasks in order to simulate parallelism. The work by Valente et al. [Valente et al. 2005] provides a survey of real-time game loop models for single-player computer games, but it does not cover the use of GPGPU as an update stage of the game loop.
The simplest implementation of a game loop with a GPGPU update stage executes it sequentially, as shown in Figure 1. Several GPGPU implementations use this game loop, such as the CUDA particles demo [nVidia 2007a].
Another game loop with GPGPU presented in the literature is the multithread architecture with the GPGPU stage uncoupled from the main loop [Joselli et al. 2008a; Joselli et al. 2008b]. This architecture is composed of two threads: one gathers user input, executes rendering, and updates the game state; the other runs the GPGPU. Figure 2 illustrates this game loop model.
The multithread loop uncoupled with a GPGPU stage [Zamith et al. 2007] is the other game loop with GPGPU available in the literature. This game loop consists of three threads: the first deals with gathering user input and updating the game state; the second is responsible for rendering the scene; the third runs the GPGPU. Figure 3 illustrates this game loop.
The literature on task distribution between CPU and GPU is scarce. The work by [Zamith et al. 2007] implements a semi-automatic task scheduling distribution between CPU and GPU via a script file. Joselli et al. [Joselli et al. 2008a; Joselli et al. 2008b] implement some heuristics for automatic task distribution between CPU and GPU, using a physics engine that has some methods implemented on both the CPU and the GPU.
Figure 1: Single coupled loop
Figure 2: Multithread architecture with the GPGPU stage uncoupled from the main loop
All of these game loops can be implemented in the adaptive game loop architecture presented in the next section.
4 The Adaptive Game Loop Architecture

This paper presents a new game loop architecture, named the Adaptive Game Loop Architecture. This architecture is able to:

- implement the game loop in multi-threaded or single-threaded mode;
- use coupled and uncoupled tasks;
- use pixel shaders or CUDA.
This architecture is based on the concept of tasks. A task corresponds to some work that the application should execute, for example: reading player input, rendering, and updating application objects. In the proposed architecture, a task can be anything that the application needs to process. However, not all tasks can be processed by all processors, so the application has three groups of tasks. The first consists of tasks that can be modeled only to run on the CPU, such as reading player input, file handling, and managing other tasks. The second consists of tasks that can only run on the GPU, such as the presentation of the scene. The third group can be modeled to run on either processor; these tasks are responsible for updating the state of some objects that belong to the application, such as AI and physics.
Figure 3: Multithread loop uncoupled with a GPGPU stage
The task concept is modeled as an abstract class that different threads are able to load. Figure 4 illustrates the UML class diagram for the Task class and its subclasses.
The Task class is the virtual base class and has four subclasses: Input Task, Update Task, Presentation Task, and Automatic Update Task. The first three are also abstract classes; the latter is a special class whose work is to perform the automatic dynamic distribution between the CPU and the GPU. This distribution consists of choosing the processor that is going to run a task according to some heuristic specified in a script file. In addition, a special class, the Task Manager, is responsible for creating and keeping all the tasks of the game loop (discussed in Subsection 4.1).
The Input Task class and its subclasses handle user input. The Update Task class and its subclasses are responsible for updating the game state: the CPU Update class should be used for tasks that run on the CPU, and the GPU Update class corresponds to tasks that run on the GPU. The Presentation Task class and its subclasses are responsible for presenting information to the user, which can be visual (Render Task) or sonic (Sound Task).
4.1 The Task Manager
The Task Manager (TM) is the core component of the proposed ar-
chitecture. It is responsible for instancing, managing, synchroniz-
ing, and finalizing task threads. Each thread is responsible for tasks
that run either on the CPU or on the GPU. In order to configure the
execution of the tasks, each task has control variables described as
follows:
- THREADID: the id of the thread that the task is going to use. When the TM creates a new thread, it creates a THREADID for the thread and assigns the same id to every task that executes in that thread;
- UNIQUEID: a unique id for the task, used to identify it;
- TASKTYPE: the task type. The following types are available: input, update, presentation, and manage;
- DEPENDENCY: a list of the ids of the tasks that this task depends on to execute.
With that information, the TM creates the task and configures how it is going to execute. A task manager can also hold another task manager, in order to manage a distinct group of tasks. An example of this case is the automatic update task that Subsection 4.2 presents.
The Task Manager acts as a server and the tasks act as its clients: every time a task ends, it sends a message to the Task Manager, which then checks which task it should execute next in that thread.
When the Task Manager uses a multithreaded game loop, it is necessary to apply a parallel programming model in order to identify the shared and non-shared sections of the application, because they must be treated differently. The independent sections compose tasks that are processed in parallel, like the rendering task. The shared sections, like the update tasks, need to be synchronized to guarantee mutually exclusive access to shared data and to preserve task execution ordering.
Although the threads run independently from each other, it is necessary to ensure the execution order of tasks that have processing dependencies. The architecture accomplishes this with the DEPENDENCY variable list, which the Task Manager checks to determine the task execution ordering.
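The dependency check can be sketched as follows. This is an illustrative Python sketch, not the paper's C++ code; the ready_tasks helper and the task ids are hypothetical.

```python
# Sketch of how a task manager can use each task's DEPENDENCY list
# to pick the tasks that are ready to run.
def ready_tasks(tasks, finished):
    """Ids of unfinished tasks whose dependencies have all finished."""
    return [tid for tid, deps in tasks.items()
            if tid not in finished and all(d in finished for d in deps)]

tasks = {            # task id -> DEPENDENCY list
    "input": [],
    "update": ["input"],
    "render": ["update"],
}
order, done = [], set()
while len(done) < len(tasks):
    for tid in ready_tasks(tasks, done):
        order.append(tid)    # in the real engine: signal that task's thread
        done.add(tid)
# order == ["input", "update", "render"]
```

Each pass releases only the tasks whose dependencies have completed, so the dependency lists alone are enough to derive a valid execution order.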
The processing dependence of shared objects requires a synchronization object, as in any application that uses many threads. Multithreaded programming is a complex subject, because the tasks in the application run alternately or simultaneously, not linearly. Hence, synchronization objects are tools for handling task dependence and execution ordering; they must be carefully applied in order to avoid thread starvation and deadlocks. The TM uses semaphores as its synchronization objects.
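A minimal sketch of semaphore-based ordering between two task threads follows; the thread bodies and the trace list are illustrative assumptions, not the paper's code.

```python
import threading

# A semaphore initialized to 0 forces the render thread to wait
# until the update thread has finished, regardless of scheduling.
update_done = threading.Semaphore(0)
trace = []

def update_task():
    trace.append("update")   # modify the shared game state
    update_done.release()    # signal: the state is ready

def render_task():
    update_done.acquire()    # block until the update task has finished
    trace.append("render")   # safe to read the shared state

threads = [threading.Thread(target=render_task),
           threading.Thread(target=update_task)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# trace == ["update", "render"] regardless of thread scheduling
```

Starting the render thread first makes the point: even then, the semaphore guarantees the update runs before the render step reads shared state.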
4.2 The Automatic Update Task
The purpose of this class is to define which processor will run the
task. The class may change the task’s processor during the applica-
tion execution, which characterizes a dynamic distribution.
One of the major features of this new architecture is to allow dynamic and automatic task allocation between the CPU and GPU; to do so, it uses the Automatic Update Task class. This task can be configured to execute in three modes: CPU only, GPU only, and automatic distribution between CPU and GPU. To execute on the CPU, a CPU implementation must be provided; to execute on the GPU, a GPU implementation must be provided; and to make use of the automatic distribution, both implementations must be provided. The scheduling is done by a heuristic defined in a script file. A configuration of how the heuristic behaves is also needed; for that, a configuration script file is presented in Subsection 4.2.1. The script files are implemented in Lua [Ierusalimschy et al. 2006] (Subsection 4.2.2).
The Automatic Update Task acts as a server and its tasks act as clients. The role of the automatic update task is to execute a heuristic that automatically determines on which processor the task will be executed. The automatic update task executes the heuristic, determines which client will execute the next task, and sends a message to the chosen client, allowing it to execute. Every time a client finishes a task, it sends a message to the server to report that it has finished. Figure 5 illustrates this process.
Figure 5: The Automatic Update Task class and messages
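The start/finish message exchange can be sketched as follows. This is an illustrative Python sketch: the run_frames helper and the client callables are assumptions standing in for the CPU and GPU worker threads.

```python
# Each frame: run the heuristic, send "start" to the chosen client,
# and record the choice once the client reports "finish".
def run_frames(heuristic, clients, n_frames):
    """Dispatch each frame's update to the processor the heuristic picks."""
    choices = []
    for _ in range(n_frames):
        target = heuristic()     # heuristic result: "CPU" or "GPU"
        clients[target]()        # "start" message: the chosen client executes
        choices.append(target)   # client sent "finish"; record the choice
    return choices

executed = []
clients = {"CPU": lambda: executed.append("cpu step"),
           "GPU": lambda: executed.append("gpu step")}
# An always-GPU heuristic, in the spirit of Script 3's main().
choices = run_frames(lambda: "GPU", clients, 3)
# choices == ["GPU", "GPU", "GPU"]
```

Swapping in a heuristic that compares accumulated CPU and GPU times would reproduce the behavior of Script 4 below.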
4.2.1 The Configuration Script
The configuration script is used to configure how the automatic update task will execute the heuristic. This script defines four variables:

- INITFRAMES: sets how many frames the heuristic uses for its initial tests. These exist because the user may want the heuristic to perform the initial tests differently from the normal tests;
- DISCARDFRAME: discards the first DISCARDFRAME frame results, because the main thread may still be loading images or models, which can affect the tests;
Figure 4: UML class diagram of the Task class, its subclasses, and the Task Manager
- LOOPFRAMES: sets how frequently the heuristic will be executed. If this value is set to -1, the heuristic is executed only once;
- EXECUTEFRAMES: sets how many frames are measured on each processor before deciding which processor will execute the next tasks.

An example of the configuration script file can be seen in Script 1.
Script 1 Configuration Script
INITFRAMES = 20
DISCARDFRAME = 5
LOOPFRAMES = 50
EXECUTEFRAMES = 5
Thus, the automatic update task begins executing after the first DISCARDFRAME frames: it executes INITFRAMES frames on the CPU and the next INITFRAMES frames on the GPU, and then decides where the next LOOPFRAMES frames will be executed. If LOOPFRAMES is greater than -1, it then executes EXECUTEFRAMES frames on the CPU and EXECUTEFRAMES frames on the GPU, decides where the next LOOPFRAMES frames will be executed, and keeps repeating this cycle until the application terminates.
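The resulting frame schedule can be simulated as follows. This Python sketch assumes the semantics described above with the values from Script 1; the schedule helper is hypothetical, and the LOOPFRAMES = -1 case (heuristic executed only once) is not modeled.

```python
# Simulate the measurement schedule: discard, probe CPU, probe GPU,
# commit to the winner for LOOPFRAMES frames, then repeat with short probes.
def schedule(total, INITFRAMES=20, DISCARDFRAME=5,
             LOOPFRAMES=50, EXECUTEFRAMES=5):
    plan = ["discard"] * DISCARDFRAME   # results affected by asset loading
    probe = INITFRAMES                  # first cycle uses the longer probe
    while len(plan) < total:
        plan += ["probe CPU"] * probe + ["probe GPU"] * probe
        plan += ["run winner"] * LOOPFRAMES
        probe = EXECUTEFRAMES           # later cycles use short probes
    return plan[:total]

plan = schedule(100)
# frames 0-4 discarded, 5-24 probe the CPU, 25-44 probe the GPU,
# 45-94 run on the chosen processor, then a 5+5 probe cycle begins
```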
4.2.2 The Heuristic Script
The heuristic script is used to distribute the tasks automatically between the CPU and the GPU. This script defines three functions:

- reset(): resets all the variables that the script uses to decide which processor will execute the task. This function is called after the LOOPFRAMES frames are executed. The variables normally used by the heuristic are:
  - CPUtime: the sum of all the elapsed times the task has spent on the CPU;
  - GPUtime: the sum of all the elapsed times the task has spent on the GPU;
  - bestCPUFPS: the best frame rate achieved by the CPU;
  - bestGPUFPS: the best frame rate achieved by the GPU;
- setVariable(elapsedTime, processor): the function where all the variables used by the heuristic are set. It is called after the EXECUTEFRAMES frames on each processor. This function can be seen in Script 2.
Script 2 setVariable Script
function setVariable(elapsedTime, processor)
FPS = 1 / elapsedTime
if (processor == CPU) then
CPUtime = CPUtime + elapsedTime
if (FPS > bestCPUFPS) then
bestCPUFPS = FPS
end
else
GPUtime = GPUtime + elapsedTime
if (FPS > bestGPUFPS) then
bestGPUFPS = FPS
end
end
end
- main(): the function that executes the heuristic and decides which processor will execute the task. It is called just before the LOOPFRAMES frames are executed. A script with this function implemented with the decision of always executing on the GPU can be seen in Script 3.
Script 3 Main Script
function main()
return GPU;
end
5 Test Case
The test case corresponds to the n-bodies sample [Nyland et al. 2007] from GPU Gems 3 [Nguyen 2007]. The authors implemented this example only to validate the game loop model proposed in this work, because the problem involves intense mathematical processing.
The n-bodies demo is an approximation of the evolution of a system of bodies in which every body interacts with every other body. The same computation applies to different simulations, such as protein folding, turbulent fluids, global illumination and astrophysics. In this case, the n-bodies demo is an astrophysics simulation in which each body represents a galaxy or an individual star, and the bodies attract or repel each other through gravitational forces.
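The core of this computation is an all-pairs O(n^2) force evaluation, which is what makes the problem so arithmetic-heavy. The following Python sketch shows that kind of computation in 2D; it is not the paper's CUDA kernel, and the gravitational constant and softening factor are illustrative values.

```python
# Simplified all-pairs gravity step: for each body, accumulate the
# acceleration contributed by every other body.
def accelerations(pos, mass, G=1.0, eps=1e-9):
    acc = []
    for i, (xi, yi) in enumerate(pos):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r2 = dx * dx + dy * dy + eps   # softened squared distance
            inv_r3 = r2 ** -1.5
            ax += G * mass[j] * dx * inv_r3
            ay += G * mass[j] * dy * inv_r3
        acc.append((ax, ay))
    return acc

a = accelerations([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0])
# the two bodies accelerate toward each other along the x axis
```

The doubly nested loop is exactly the structure the GPU parallelizes: each body's accumulation is independent, so one thread per body maps naturally onto the SIMD hardware.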
This sample was implemented on both the CPU and the GPU; the GPU version uses CUDA. It is important to remark that even though the demo uses CUDA, the game loop implementation could use CAL or shader languages (GLSL, HLSL or Cg) without major modifications in the framework layer. Figure 6 illustrates a set of frames from the simulation.
Figure 6: N-bodies sample
The authors do not emphasize the n-bodies problem itself, because it is not the aim of this work; it is used only as an example to validate the proposed method.
5.1 The Tested Heuristic
The demo uses a very simple heuristic: it checks which processor was the fastest in running the task and selects that processor to run the next frames. Script 4 lists the heuristic.
Script 4 Tested Heuristic
function main()
if (CPUtime < GPUtime) then
return CPU;
else
return GPU;
end
end
The heuristic is configured in a script, and any kind of heuristic can be implemented. The heuristics developed by the authors work in two different ways:

- The first, called initial, is configured to execute 20 frames on the CPU and 20 frames on the GPU, and then it decides for the fastest processor.
- The second, called looped, is configured to repeat the following cycle: execute 5 frames on the CPU and 5 frames on the GPU, then choose the fastest processor to execute the next 200 frames.
5.2 The Tested Game Loop
To test the architecture, the demo implemented a game loop with an input task and a render task in the main loop, and an automatic update task with CPU/GPU implementations in another thread (uncoupled). Figure 7 illustrates this game loop.
6 Results
The tests were based on the fast n-bodies simulation with CUDA, as described before. There are two groups of tests.
Figure 7: The multithread loop with an automatic update task uncoupled from the main loop
The first group uses the initial heuristic, where the fastest processor is selected at the beginning of the task execution. The second group uses the other heuristic (looped), that is, the heuristic is invoked for each cycle of frames. The CPU tests were made with an Intel quad-core 2.4 GHz, and the GPU tests were made with three different GPUs: an nVidia GeForce 8800 GTS, an nVidia GeForce 8400 GS and an nVidia GeForce 8200M G.
For both groups, the example executed the application together with the heuristic's work of choosing the processor. Table 1 shows the performance of the application on all processors. The initial number of bodies is 4, and it is increased up to 8192 bodies. Figure 8 shows a comparison between the CPU and the nVidia 8800 GTS GPU for small numbers of bodies.
Figure 8: Comparison between the CPU (Intel Core 2 Quad 2.40 GHz) and the GPU (8800 GTS): elapsed time in milliseconds versus number of bodies
Up to approximately 25 bodies, the CPU performs better and the heuristic chooses it; beyond that point, the GPU is chosen. Thus, the CPU is faster with fewer bodies, while the GPU, in this example, is more efficient with larger numbers of bodies. Figure 9 shows a comparison among the tested GPUs.
7 Conclusion
Multicore hardware architectures are a trend, and both CPUs and GPUs have evolved greatly in this respect: quad-core processors are present in the latest CPUs, and unified architectures in the latest GPUs. This trend lies not only in increasing processing power but also in increasing the number of available cores, which makes parallel processing on these architectures a reality. With this hardware evolution, games will become much more sophisticated, and multicore game loops that use the GPU will become more common.
Table 1: Elapsed time of the processors over 100 iterations, measured in milliseconds

bodies | CPU        | 8800 GTS | 8400 GS   | 8200M G   | chosen processor
4      | 0.404      | 7.792    | 4.251     | 4.666     | CPU
8      | 1.010      | 9.170    | 4.417     | 5.105     | CPU
16     | 3.279      | 9.265    | 4.756     | 5.116     | CPU
32     | 12.243     | 10.600   | 5.700     | 10.343    | GPU
64     | 48.001     | 11.250   | 10.931    | 26.496    | GPU
128    | 190.745    | 15.182   | 30.658    | 89.628    | GPU
256    | 773.152    | 29.244   | 107.664   | 318.036   | GPU
512    | 3124.517   | 75.188   | 410.865   | 1186.663  | GPU
1024   | 12155.210  | 282.648  | 1619.403  | 4584.704  | GPU
2048   | 48627.184  | 989.581  | 6526.119  | 18097.682 | GPU
4096   | 195216.563 | 3835.552 | 25815.580 | 71998.977 | GPU
Figure 9: Comparison between the GPUs (8800 GTS, 8400 GS, 8200M G): elapsed time in milliseconds versus number of bodies
References

AGEIA, 2008. PhysX. Available at: http://www.ageia.com. Accessed 20/02/2008.

AMD, 2007. AMD stream computing. Available at: http://ati.amd.com/technology/streamcomputing/firestream-sdk-whitepaper.pdf. Accessed 20/02/2008.

ERRA, U., CHIARA, R. D., SCARANO, V., AND TATAFIORE, M. 2004. Massive simulation using GPU of a distributed behavioral model of a flock with obstacle avoidance. Vision, Modeling, and Visualization, 233-240.

GREEN, S., 2007. GPGPU physics. Siggraph 07 GPGPU Tutorial.

HOFSTEE, H. P. 2005. Power efficient processor architecture and the Cell processor. IEEE Proceedings of the 11th International Symposium on High-Performance Architecture.

IERUSALIMSCHY, R., DE FIGUEIREDO, L. H., AND CELES, W. 2006. Lua 5.1 Reference Manual. Lua.org.

INTEL, 2008. Havok. Available at: http://www.havok.com. Accessed 20/02/2008.

JOSELLI, M., ZAMITH, M., VALENTE, L., CLUA, E. W. G., MONTENEGRO, A., CONCI, A., FEIJÓ, B., DORNELLAS, M., LEAL, R., AND POZZER, C. 2008. Automatic dynamic task distribution between CPU and GPU for real-time systems. IEEE Proceedings of the 11th International Conference on Computational Science and Engineering, 48-55.

JOSELLI, M., CLUA, E., MONTENEGRO, A., CONCI, A., AND PAGLIOSA, P. 2008. A new physics engine with automatic process distribution between CPU-GPU. Sandbox '08: Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games, 149-156.

LUA. The programming language Lua. Available at: http://www.lua.org/manual/. Accessed 20/12/2007.

MICHALAKES, J., AND VACHHARAJANI, M. 2008. GPU acceleration of numerical weather prediction. IEEE International Symposium on Parallel and Distributed Processing, 1-7.

NGUYEN, H. 2007. GPU Gems 3 - Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley.

NVIDIA. 2006. GeForce 8800 GPU architecture overview. TB-02787-001 v0.9. Technical report, NVIDIA.

NVIDIA, 2007. CUDA particles. Available at: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/particles/doc/particles.pdf. Accessed 20/02/2008.

NVIDIA, 2007. NVIDIA CUDA Compute Unified Device Architecture documentation, version 1.1. Available at: http://developer.nvidia.com/object/cuda.html. Accessed 20/12/2007.

NVIDIA. 2008. NVIDIA CUDA Compute Unified Device Architecture. Programming guide, NVIDIA.

NYLAND, L., HARRIS, M., AND PRINS, J. 2007. Fast n-body simulation with CUDA. GPU Gems 3, Chapter 31, 677-695.

OWENS, J. D., LUEBKE, D., GOVINDARAJU, N., HARRIS, M., KRÜGER, J., LEFOHN, A. E., AND PURCELL, T. J. 2007. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26(1), 80-113.

RUDOMÍN, T., MILLÁN, E., AND HERNÁNDEZ, B. 2005. Fragment shaders for agent animation using finite state machines. Simulation Modelling Practice and Theory 13(8), 741-751.

SEILER, L., CARMEAN, D., SPRANGLE, E., FORSYTH, T., ABRASH, M., DUBEY, P., JUNKINS, S., LAKE, A., SUGERMAN, J., CAVIN, R., ESPASA, R., GROCHOWSKI, E., JUAN, T., AND HANRAHAN, P. 2008. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27, 3.

UFIMTSEV, I. S., AND MARTÍNEZ, T. J. 2008. Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation. Journal of Chemical Theory and Computation 4(2), 222-231.

VALENTE, L., CONCI, A., AND FEIJÓ, B. 2005. Real time game loop models for single-player computer games. In Proceedings of the IV Brazilian Symposium on Computer Games and Digital Entertainment, 89-99.

ZAMITH, M., CLUA, E., PAGLIOSA, P., CONCI, A., MONTENEGRO, A., AND VALENTE, L. 2007. The GPU used as a math co-processor in real time applications. Proceedings of the VI Brazilian Symposium on Computer Games and Digital Entertainment, 37-43.
... In the same line, Tayyub & Khan [55] use a CPU-GPU architecture to perform collision detection. Joselli et al. [56] present an adaptive game loop that is able to dynamically allocate tasks in different cores of multiple CPUs and GPUs, choosing the best device for the task by evaluating performance in each device over time. When the amount of resources is limited, Cloud Gaming increases the capabilities of the devices by using a computing outsourcing architecture. ...
Article
Video games have evolved into a key part of modern culture and a major economic force, with the global market projected to reach $282.30 billion in 2024. As technology advances, video games increasingly demand high computing power, often requiring specialized hardware for optimal performance. Real-time strategy games, in particular, are computationally intensive, with complex artificial intelligence algorithms that simulate numerous units and behaviors in real-time. Specialized gaming PCs use a dedicated GPU to run video games. Due to the usefulness of GPUs beyond gaming, modern processors usually include an integrated GPU, especially in the laptop market. We propose a hybrid architecture that utilizes both the dedicated GPU and the integrated GPU simultaneously, to accelerate AI and physics simulations in video games. The hybrid approach aims to maximize the utilization of all available resources. The AI and physics computations are offloaded from the dedicated GPU to the integrated GPU. Therefore, the dedicated GPU can be used exclusively for rendering, resulting in improved performance. We implemented this architecture in a custom-built game engine using OpenGL for graphics rendering and OpenCL for general-purpose GPU computations. Experimental results highlight the performance characteristics of the hybrid architecture, including the challenges of working with the two devices and multi-tenant GPU interference.
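The overlap this hybrid architecture exploits — rendering on one device while AI and physics run on the other — can be mimicked in miniature with a worker thread standing in for the integrated GPU. The `simulate` and `render` stubs and the one-frame handoff are illustrative assumptions, not the paper's engine:

```python
from concurrent.futures import ThreadPoolExecutor

def frame(sim_worker, state):
    """Kick AI/physics off to the co-processor, render the current frame
    on the main device, then collect the simulation result."""
    future = sim_worker.submit(simulate, state)   # offloaded work
    image = render(state)                         # main device keeps rendering
    return future.result(), image

def simulate(state):      # stand-in for AI + physics on the integrated GPU
    return {"t": state["t"] + 1}

def render(state):        # stand-in for rendering on the dedicated GPU
    return f"frame@{state['t']}"
```

The point of the sketch is only the structure: rendering and simulation of the same frame proceed concurrently, and the main loop blocks on `future.result()` no earlier than necessary.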
... Although welcome, performance improvements might not be sufficient when one has to implement complex decision-making processes. Recall that a videogame is typically executed by running the so-called game loop routine [22]. The game loop consists of a single-threaded repeated execution of update operations on the current game scene: within each step of this cycle, user input is processed, AI decision-making operations and physics simulations are made, and the game scene is modified accordingly. ...
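The cycle described in the quoted passage — read input, update AI/physics, render, repeat — can be written down as a minimal fixed-timestep loop. The stage functions below are stubs, and the frame cap exists only so the sketch terminates:

```python
def game_loop(frames, dt=1.0 / 60.0):
    """One iteration = input, simulation update, render, in that order."""
    state = {"t": 0.0, "events": [], "rendered": 0}

    def read_input(s):        # poll player input (stubbed)
        s["events"].append("tick")

    def update(s, step):      # AI decision making + physics step (stubbed)
        s["t"] += step

    def render(s):            # draw the current scene (stubbed)
        s["rendered"] += 1

    for _ in range(frames):   # a real loop runs until the player quits
        read_input(state)
        update(state, dt)
        render(state)
    return state
```

With `dt` fixed at 1/60 s, simulated time advances deterministically regardless of how long each real frame takes; variable-timestep and multithreaded variants relax exactly this assumption.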
... From Figure 5, it can be seen that the game FPS varies from 50 to 30 FPS during the game. The achieved performance is considered optimal in a game [33]. The proposed game prototype has been tested without the architecture in two environments: using the Android framework and the native code. ...
Conference Paper
Nowadays mobile phones, especially smartphones, are equipped with advanced computing capabilities. Most of these devices have multicore processors such as dual-core CPUs and many-core GPUs. These processors are designed for both low power consumption and high performance computation. Moreover, most devices still lack libraries for generic multicore computing, such as CUDA or OpenCL. However, computing certain kinds of tasks on these mobile GPUs, and on other available multicore processors, may be faster and much more efficient than their single-threaded CPU counterparts. This advantage can be used in game development to optimize some aspects of a game loop and also to include new features. This work presents an architecture designed to process most of the game loop inside a mobile GPU using the Android Renderscript API. As an illustrative test case for the proposed architecture, this work presents a game prototype called “MobileWars”, completely developed using the proposed architecture.
... Joselli et al. [14] claimed that a typical game loop can be divided into three general classes of tasks: ...
Conference Paper
Over the past 30 years, software developers have been conveniently taking advantage of hardware performance increases, giving little consideration to internal architecture changes of the hardware. However, architectural changes in hardware, such as multi-core central processing units, affect software architectures and can no longer be ignored. This is especially true for real-time applications, including computer games, which tend to push the limits of hardware and take the most advantage of available resources. By applying the concepts of concurrency, multithreading and multi-core Central Processing Unit (CPU) technology, this paper redefines the existing linear architecture of game engines as a generic concurrent and multi-core friendly architecture. Major game engine modules and their inter-dependencies are identified in order to design the new architecture. A sample game was developed to evaluate the performance of the proposed architecture. The comparison of the test results provided in this paper indicates noticeable improvements (5.1% to 61.2%) in the concurrent architecture over the conventional linear approach. User acceptance evaluation with several industry experts also showed encouraging feedback.
... The data decomposition approach allows these components to work in a pipelined manner, where each component processes a batch of data and then passes it to the next component running concurrently on another core. The approach of Joselli et al. [12] separates game loop tasks for parallel execution on the CPU and GPU by employing the GPGPU (general-purpose GPU processing) paradigm. Kulkarni et al. [13] present the "Galois" platform, which exploits parallelism in complex data structures with data dependencies and pointer-based relations between data objects, such as trees and graphs. ...
Article
Fine-grained multithreaded applications are becoming more vital as new processing hardware moves towards larger numbers of processing cores per CPU. The increased number of cores facilitates performance enhancement of real-time applications, including computer games. In this paper, we present a new design for a multithreaded game engine which multithreads each game engine component separately using data decomposition. Our approach suggests maintaining a sequential game loop to avoid major changes to current single-threaded game engines. Experimental results have shown a maximum relative speedup of 3.36 and a maximum relative efficiency of 84%, achieved on a 4-core CPU, in addition to component-level enhancements, which reflects high utilization of the multi-core platform.
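The pipelined arrangement these abstracts describe — each component consumes a batch, transforms it, and hands it to the next component running concurrently on another core — can be sketched with threads and bounded queues. The stage callables are hypothetical stand-ins for engine components such as animation or collision:

```python
import threading
import queue

def pipeline(batches, stages):
    """Run each stage in its own thread; batches flow through bounded queues."""
    qs = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]
    out = []

    def worker(stage, src, dst):
        while True:
            item = src.get()
            if item is None:          # poison pill: shut down and propagate
                dst.put(None)
                return
            dst.put(stage(item))

    threads = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for b in batches:
        qs[0].put(b)
    qs[0].put(None)                   # signal end of input
    while (item := qs[-1].get()) is not None:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Because each queue is FIFO and each stage is a single thread, batch order is preserved end to end while different batches occupy different stages at the same time.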
Chapter
Learning analytics have become a solidly established line of research, leveraging the large amounts of data that students generate while interacting with online materials such as those stored in a learning management system (LMS). On the other hand, serious games are also experiencing a growing acceptance as a powerful medium for learning. But at the crossroads of learning analytics and serious games, there is great potential for intelligent analytics models that leverage the vast amounts of data that a single gameplay session can generate, which may be orders of magnitude larger than the data generated in an online study session with an LMS. In this chapter, we explore this potential, and outline relevant advances using serious games as a source for rich learning analytics.
Article
In everyday life, human beings use accessories whose design relates to their physical characteristics and dimensions, such as chairs, seats and workplaces. It follows that the population's comfort and well-being can be influenced by the degree to which such accessories fit people. For the development of these tools, it is important to have anthropometric profiles established by statistical analysis of previously acquired measurements. Anthropometry is the science that studies the dimensions of the human body, including linear dimensions, weight, volume and types of movement; it involves the use of carefully defined body reference marks, the specific positioning of subjects for data collection, the use of appropriate instruments and the corresponding statistical treatment. This project, a continuation of "Development of an application of anthropometric measurement based on Kinect", aims to modify the methodology by updating the set of variables (measures) in use, so that the characterization of the anthropometric profile converges precisely to its applicability in the ergonomic design of workplaces, taking into account the differentiation between genders and the use of the Kinect sensor as the means of measuring, structuring and virtualizing the anthropometric profile.
Article
Constant population growth over time has forced emerging markets to make major changes to adapt to globalized markets in the retail sector. For this reason, technology comes into play as an appropriate means for the emergence of this type of retail market, considering the user as an important part of the process of continuous improvement, both in the trade itself and in the use of technological tools. Such tools afford user convenience with respect to the ergonomic postures users should adopt when standing in front of a rack of products associated with their preferences, and also improve product acquisition for Latin American anthropometric measures. In this paper, software is developed with the Kinect sensor technology as a low-cost tool for identifying anthropometric measurements of the user in order to build a profile for their own retail operations.
Chapter
Simulation systems are becoming common in different knowledge fields, such as aeronautics, defense, and industrial applications, among many others. While in the past these systems were mostly based on typical Virtual Reality Environments, with the advance of the game industry, simulators are now developed using typical game engines and gaming software architectures. Distributed computing is being used in several fields to solve many computation-intensive problems. Due to the complexity of simulation systems, this architecture can also be used here, devoting host processing to rendering, which is usually the task on which simulators spend most of their processing time. By using distributed computing, simulators could get by with lighter system requirements, since the main loop would be distributed. This work presents concepts of simulator software based on the main loop technique. After describing state-of-the-art concepts, we present an efficient scheme for automatic load balancing and for distributing logic computation among several computers for simulators.
Conference Paper
Games are simulations of the physical and imaginary worlds. Games nowadays run on commodity platforms that include different categories of powerful computing elements with varying capabilities. To benefit from this variety, suitable mapping of work to computing elements is essential for optimal performance. Arbiter Work Stealing (AWS) is a new scheduler addressing this requirement. The AWS scheduler builds on the classical work stealing algorithm by adding an upper layer that "manages" multiple running instances of the work stealing algorithm. AWS automatically schedules the dynamically generated game application tasks to appropriate processors using a cost model that takes into account current work load, execution times, data locality, and data transfer rates. Experimental results show that incorporating AWS to schedule tasks of a parallel game application yields superior performance through better utilization of the available resources and through better use of data locality in a heterogeneous computing environment.
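A completion-time cost model of the kind AWS is described as using — current load, execution time, data locality and transfer rate — might look like the following sketch. The field names and the example devices in the usage below are illustrative assumptions, not taken from the paper:

```python
def pick_device(task, devices):
    """Choose the device minimizing estimated completion time:
    queued work + expected execution time + time to move the task's data."""
    def cost(dev):
        # Data-locality term: moving data costs nothing if it is already there.
        transfer = (0.0 if task["data_at"] == dev["name"]
                    else task["bytes"] / dev["transfer_bps"])
        return dev["queued_s"] + task["flops"] / dev["flops_per_s"] + transfer
    return min(devices, key=cost)["name"]
```

Under this model a small task whose data sits in CPU memory stays on the CPU (the transfer term dominates), while a large compute-bound task migrates to the faster device despite the copy.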
Article
There are few academic works on game loop models and algorithms in the literature. Also, the literature lacks a comprehensive conceptual framework for real time game loops. This paper proposes a general classification for real time game loop models and presents an algorithm that improves one of the most common models for single-player computer games.
Article
The Graphics Processing Units or simply GPUs have evolved into extremely powerful and flexible processors. This flexibility and power have allowed new concepts in general purpose computation to emerge. This paper presents a new architecture for physics engines focusing on the simulation of rigid bodies with some of its methods implemented on the GPU. Sending physics computation to the GPU enables the unloading of the required computations from the CPU, allowing it to process other tasks and optimizations. Another important reason for using the GPU is to allow physics engines to process a higher number of bodies in the simulation. It also presents an automatic process distribution scheme between CPU and GPU. The importance of the automatic distribution for physics simulation arises from the fact that, sometimes, the simulated scene characteristics may change during the simulation and by using an automatic distribution scheme the system may obtain the best performance of both processors (CPU and GPU). Also, with an automatic distribution mode, the developer does not have to decide which processor will do the work allowing the system to choose between CPU and GPU. This paper also presents an uncoupled multithread game loop used by the physics engine.
Article
The increase of computational power of the programmable GPU (Graphics Processing Unit) brings new concepts for using these devices for generic processing. Hence, with the use of the CPU and the GPU for data processing come new ideas that deal with the distribution of tasks between the CPU and GPU, such as automatic distribution. The importance of the automatic distribution of tasks between CPU and GPU lies in three facts. First, automatic task distribution enables applications to use the best of both processors. Second, the developer does not have to decide which processor will do the work, allowing the automatic task distribution system to choose the best option for the moment. And third, sometimes the application can be slowed down by other processes if the CPU or GPU is already overloaded. Based on these facts, this paper presents new schemes for efficient automatic task distribution between CPU and GPU. This paper also includes tests and results of implementing those schemes with a test case and with a real-time system.
Article
We present mathematical sketching, a novel, pen-based, modeless gestural interaction paradigm for mathematics problem solving. Mathematical sketching derives from the familiar pencil-and-paper process of drawing supporting diagrams to facilitate the formulation of mathematical expressions; however, with a mathematical sketch, users can also leverage their physical intuition by watching their hand-drawn diagrams animate in response to continuous or discrete parameter changes in their written formulas. Diagram animation is driven by implicit associations that are inferred, either automatically or with gestural guidance, from mathematical expressions, diagram labels, and drawing elements. The modeless nature of mathematical sketching enables users to switch freely between modifying diagrams or expressions and viewing animations. Mathematical sketching can also support computational tools for graphing, manipulating and solving equations; initial feedback from a small user group of our mathematical sketching prototype application, MathPad², suggests that it has the potential to be a powerful tool for mathematical problem solving and visualization.

In Valve's Source graphics engine, bump mapping is combined with precomputed radiosity lighting to provide realistic surface illumination. When bump map data is derived from geometric descriptions of surface detail (such as height maps), only the lighting effects caused by the surface orientation are preserved. The significant lighting cues due to lighting occlusion by surface details are lost. While it is common to use another texture channel to hold an "ambient occlusion" field, this only provides a darkening effect which is independent of the direction from which the surface is being lit and requires an auxiliary channel of data.
Article
Modern videogames place increasing demands on the computational and graphical hardware, leading to novel architectures that have great potential in the context of high performance computing and molecular simulation. We demonstrate that Graphical Processing Units (GPUs) can be used very efficiently to calculate two-electron repulsion integrals over Gaussian basis functionsthe first step in most quantum chemistry calculations. A benchmark test performed for the evaluation of approximately 106 (ss|ss) integrals over contracted s-orbitals showed that a naïve algorithm implemented on the GPU achieves up to 130-fold speedup over a traditional CPU implementation on an AMD Opteron. Subsequent calculations of the Coulomb operator for a 256-atom DNA strand show that the GPU advantage is maintained for basis sets including higher angular momentum functions.
Article
Computing and presenting emergent crowd simulations in real-time is a computationally intensive task. This intensity mostly comes from the O(n²) complexity of the traversal algorithm needed for the interactions of all elements against each other based on a proximity query. By using special data structures such as grids, and the parallel nature of graphics hardware, relevant previous works reduce this complexity by considerable factors, making it possible to achieve interactive frame rates. However, existing proposals tend to be heavily bound by the maximum density of such grids, which is usually high, yet leading to arguably inefficient algorithms. In this paper we propose the use of a fine-grained grid and accompanying data manipulation that leads to scalable algorithmic complexity. We also implement a representative flocking boids case study from which we run benchmarks with more than 1 million simulated and rendered boids at nearly 30 fps. We remark that previous works achieved no more than 15,000 boids with interactive frame rates.
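The grid idea this abstract builds on — bin each boid into a cell sized by the interaction radius and test only the surrounding 3×3 block of cells instead of all n² pairs — can be sketched in 2-D pure Python. The paper's implementation is GPU-based; this shows only the data-structure idea:

```python
from collections import defaultdict

def grid_neighbors(points, radius):
    """Return, per point index, the indices of points within `radius`,
    testing only the 3x3 block of grid cells around each point."""
    cell = lambda p: (int(p[0] // radius), int(p[1] // radius))
    grid = defaultdict(list)
    for i, p in enumerate(points):       # bin every point into its cell
        grid[cell(p)].append(i)
    result = []
    for i, p in enumerate(points):
        cx, cy = cell(p)
        near = []
        for dx in (-1, 0, 1):            # candidates come only from the
            for dy in (-1, 0, 1):        # 3x3 neighborhood of cells
                for j in grid.get((cx + dx, cy + dy), []):
                    d2 = (p[0] - points[j][0]) ** 2 + (p[1] - points[j][1]) ** 2
                    if j != i and d2 <= radius * radius:
                        near.append(j)
        result.append(sorted(near))
    return result
```

With roughly uniform density, each point examines a constant number of candidates, so the overall cost scales linearly in the number of boids rather than quadratically.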
Article
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general‐purpose computation to graphics hardware. We begin with the technical motivations that underlie general‐purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general‐purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general‐purpose application development on graphics hardware.