
Heterogeneous Computing (CPU + GPU Power)

Abstract

Basic concepts about your CUDA GPU. Friendly explanations with a brief historical introduction. Simple code samples to illustrate and compare CPU and GPU performance.
Joaquín Obregón-Cobo
joaquin.obregon@gmail.com
NB: Written in 2011
Yesterday
We are so used to “revolutions” in computer technology that we see them as steps on the
evolutionary road. Well, the latest step is heterogeneous computing. This computing model
consists of using two different architectures in the same computer.
We should not confuse heterogeneous computing with parallel computing: parallel computing has
been done from the very beginning of the computer era. It was done with several central
processing units (CPUs) and software able to run several tasks on them simultaneously. Actually,
much of the work we do today is based on algorithms from the sixties (of the last century).
Those old machines used to be SISD (one task on one datum), also known as the Von Neumann
architecture.
By the early seventies the first vector computers were released, with a CPU able to process
multiple data simultaneously. Those were the first computers with a SIMD (one task on many
data) architecture. This type of architecture has persisted over time, with ups and downs,
mainly in supercomputers and in specialized units for graphics and sound processing.
The least used architecture is MISD (many tasks on one datum), which is seldom or never used
in regular computers. It is, or was, used in fault-tolerant computers for redundancy checking.
Now the most frequent architecture is MIMD (many tasks on many data). This is the one found in
the multiple-node supercomputers now at the top of the computing capability rankings, and it is
also the internal architecture of the latest generation of multicore processors living in our
PCs and workstations. Those multicore processors (duo, 3x, quad, 4x, …) are internally a set of
SISD cores working together with independent data access, so we can say that internally the
CPU is MIMD.
None of these architectures can be considered heterogeneous computing (as we use the term),
since all of them are homogeneous systems: they use the same type of CPU throughout.
Let’s now look at GPU (Graphics Processing Unit, a term created by NVIDIA in the nineties)
systems, and then we will define what we understand by heterogeneous computing.
Graphics systems have played a main role in the subject. Speaking about graphics, we always
have to start with Evans & Sutherland, from which some other key players emerged. One of them
is Silicon Graphics (SGI), which led the development of vector graphics processors (Reality
Engine, based on the Intel i860) for its 3D workstations. The OpenGL standard was also born at
SGI and has played a key role in the development of computer graphics.
Fig. 1. SISD (one task on one datum).
Fig. 2. SIMD (one task on many data).
Fig. 3. MISD (many tasks on one datum).
Fig. 4. MIMD (many tasks on many data).
Today
Two companies now lead the GPU market and hold the biggest part of it: NVIDIA and AMD
(formerly ATI Technologies). Specific details are not important here, except that the SIMT
architecture of NVIDIA cards is “slightly” different from the SIMD architecture of AMD cards.
The relevant thing is design convergence: both companies started with GPUs focused on
optimizing the processing of bits, pixels, textures… to the extreme. They consequently arrived
at extremely complex systems, from the point of view of both designers and programmers. They
finally moved to a more generic and open architecture, an enormous conceptual jump that has
allowed them to multiply performance without multiplying complexity.
The important difference between them is that NVIDIA uses a SIMT (one task on many threads)
architecture and offers it to programmers under the name CUDA. The difference is as simple as
the difference between Fig. 2 and Fig. 5. Simplifying, we can say that SIMT can skip the task
on some of the cores. This small difference has very deep consequences for programmers: it
lets them manage execution divergence in a more “traditional” or “serial” way, which makes
their life much easier. This simplicity of programming and a higher performance (as of today)
make NVIDIA CUDA the best bet for heterogeneous computing.
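The idea that SIMT can skip the task on some cores can be pictured with a small serial emulation in plain C. This is only a sketch under stated assumptions: the function name `simt_add`, the `active` mask and the lane count are hypothetical names chosen for illustration and are not part of CUDA, where the hardware handles the masking itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Emulates one SIMT step: every lane receives the same instruction (an add),
   but lanes whose flag in `active` is false skip it and keep their old value.
   Hypothetical helper for illustration only; a real GPU does this masking
   in hardware. */
void simt_add(const int *a, const int *b, int *out,
              const bool *active, size_t lanes)
{
    for (size_t lane = 0; lane < lanes; lane++) {
        if (active[lane])
            out[lane] = a[lane] + b[lane];
    }
}
```

With three lanes holding the operands (1, 2), (1, 0) and (1, 7), and the middle lane switched off, the first and last lanes store 3 and 8 while the middle lane leaves its output untouched.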
Now we know that there are several different computer architectures, and that computers
usually have two of them inside: one for general tasks and another one for graphics (and some
more, but they are not relevant here). But where is the heterogeneous computing revolution? It
lies in using our GPU for general computation too. We have done parallel computing before with
CPUs of the same type; now we make completely different processors work together, some serial
(CPU) and some parallel (GPU), taking “the best of both worlds”.
We still consider a multicore CPU a serial processor, even knowing that it has several cores
inside, because of some restrictions on memory access and its “traditional” serial programming
model. It gets parallelism by executing several (up to 12 today) tasks or threads
simultaneously and synchronizing their serial execution paths.
The huge processing power of the GPU shines on big problems. The GPU loves long series of
data, whether integer or floating-point numbers. Working with an appropriate amount and type
of data we can get speedups as high as 400x (this is not possible for all problems).
Heterogeneous computing gives all computer users the power of supercomputing. This is
possible thanks to:
Good (better) price/performance and (electric) power/performance ratios.
Programming simplicity, thanks to the smart and generic CUDA model.
Now
We think two main concepts are needed to understand what heterogeneous computing is and what
it offers us:
A. The difference between parallel and serial programming. As we consider CUDA the best
option, we will use it to illustrate the concept.
B. The performance we can get from our GPU. We will use a sample to illustrate how the gain
improves as the problem size increases. We will do it with a parallel programming classic:
sum reduction.
Fig. 5. SIMT (one task on many threads).
A. The best way may be to do the same work in both manners. A simple task is perfect to
illustrate it: adding two four-element vectors.
A.1. Serial. We write a C program whose relevant part is:
/* Serial version: one loop iteration per element. */
void vectorAdd(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
The C compiler will prepare the instructions our CPU needs to do the task. We have made a
simplification (Fig. 6) in which we see 16 steps (8 reads, 4 additions, 4 writes) to do the work.
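The step count can be checked with a small variant of the serial function that tallies the simplified steps as it goes. This is a sketch, not the author's program: the name `vector_add_counted` is hypothetical, and "step" here means the simplified read/add/write units used in the text, not real machine instructions.

```c
#include <stddef.h>

/* Serial vector addition that also counts the simplified steps:
   two reads, one addition and one write per element.
   Hypothetical helper for illustration. */
size_t vector_add_counted(const int *a, const int *b, int *c, size_t n)
{
    size_t steps = 0;
    for (size_t i = 0; i < n; i++) {
        int x = a[i];       /* read  */
        int y = b[i];       /* read  */
        int sum = x + y;    /* add   */
        c[i] = sum;         /* write */
        steps += 4;
    }
    return steps;
}
```

For example, adding (0, 1, 2, 3) and (4, 5, 6, 7) gives (4, 6, 8, 10) and a count of 16 steps, that is, 4 steps for each of the 4 elements.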
Fig. 6. Serial vector addition in 16 steps (8 reads, 4 additions, 4 writes): (0, 1, 2, 3) plus (4, 5, 6, 7) gives (4, 6, 8, 10).
A.2. Parallel. Our equivalent CUDA C program is:
/* Parallel version: each thread adds the single element its index selects. */
__global__ void vectorAdd(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
Our GPU gives its maximum performance when every core executes its own copy of the program.
Luckily, the compiler does the hard work. Every running thread can identify itself with the
variable threadIdx.x, which gives ordered access to the vector.
As we can see in Fig. 7, we need 4 steps (2 reads, 1 addition, 1 write) to finish.
We see that the room for performance improvement is huge, and that it grows with the problem
size. The better our problem fits the parallel model, the greater the improvement.
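The "one copy of the program per thread" idea can be emulated in plain C, which may help readers without a CUDA toolchain at hand. This is a sketch under assumptions: `vector_add_thread` and `emulate_launch` are hypothetical names, the parameter `thread_idx` stands in for `threadIdx.x`, and the loop runs the threads one after another where a GPU would run them at the same time.

```c
#include <stddef.h>

/* Body that each emulated thread runs. `thread_idx` plays the role of
   threadIdx.x: it is the only thing that differs between threads. */
static void vector_add_thread(const int *a, const int *b, int *c,
                              size_t thread_idx)
{
    c[thread_idx] = a[thread_idx] + b[thread_idx];
}

/* Emulated launch of `threads` threads, one per vector element.
   The serial loop gives the same result as a simultaneous run because
   no thread reads or writes an element that belongs to another thread. */
void emulate_launch(const int *a, const int *b, int *c, size_t threads)
{
    for (size_t t = 0; t < threads; t++)
        vector_add_thread(a, b, c, t);
}
```

Calling `emulate_launch(a, b, c, 4)` on two four-element vectors fills `c` with the four sums. The point of the design is that the thread body contains no loop: the number of elements is decided by how many threads are launched.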
B. GPU performance will be shown with a vector sum: the addition of all the elements of a
vector of size N. This is done in a wide spectrum of computations, for instance in matrix
operations.
With a serial approach this task needs a number of steps proportional to the number of
elements N (after optimization this will be 1 read, 1 addition and 1 write per element, say 3N
steps), which we call order N, O(N).
With a parallel approach we will use a specific algorithm: sum reduction. Fig. 8 shows a
graphic description of the process: we start by adding each odd element to its neighbour, then
we add adjacent sums from the previous step, and so on until we get the final result.
The total number of sums is the same, but we do them in a number of steps that is proportional
to the logarithm of N (base 2); in other words, logarithmic order, O(log N).
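The pairing scheme just described can be written as a short serial emulation in C. This is a sketch, not the code that was timed: the name `reduce_sum` is hypothetical, and the in-place layout with a doubling stride is one common way to organize the pairs, assumed here for simplicity.

```c
#include <stddef.h>

/* In-place pairwise sum reduction, emulated serially.
   Within one pass of the outer loop, the additions of the inner loop are
   independent of each other, so a GPU could perform all of them at once.
   The total ends up in data[0]; the return value is the number of passes
   (steps), not the number of additions. */
unsigned reduce_sum(long long *data, size_t n)
{
    unsigned steps = 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];
        steps++;
    }
    return steps;
}
```

For a vector holding 0, 1, …, 15 the function performs 8, 4, 2 and 1 additions in its four passes, 15 sums in total, and leaves 120 in `data[0]`. Note that the input vector is overwritten with partial sums.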
Fig. 7. Parallel vector addition in 4 steps (2 reads, 1 addition, 1 write).
Fig. 8. Sum reduction of 16 elements: 15 sums in 4 steps.
Knowing that 16 is the base-2 logarithm of 65536, we can see the potential performance
improvement. In practice that speedup ratio cannot be sustained “to infinity”, because of
limitations in memory management and in the number of processing cores.
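If the serial work grows with N and the parallel work with log2(N), an idealized speedup is N / log2(N), which appears to be how the theoretical ratio in the results table is obtained. The sketch below computes it with integer arithmetic; the names `log2_pow2` and `theoretical_ratio` are hypothetical, and N is assumed to be a power of two, as in the measurements.

```c
#include <stdint.h>

/* Base-2 logarithm of a power of two, found by counting right shifts. */
unsigned log2_pow2(uint64_t n)
{
    unsigned log = 0;
    while (n > 1) {
        n >>= 1;
        log++;
    }
    return log;
}

/* Idealized speedup N / log2(N), rounded to the nearest integer.
   Requires n >= 2 so that the logarithm is not zero. */
uint64_t theoretical_ratio(uint64_t n)
{
    uint64_t steps = log2_pow2(n);
    return (n + steps / 2) / steps;
}
```

For N = 65536 the logarithm is 16 and the idealized ratio is 4096, far above what memory access allows in practice.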
In the table you can see what actually happens with our sum reduction.
We have to state that this is a sample problem selected for simplicity, and it is not the best
performing one. We do one sum for every three memory accesses, which is a very low
calculation/access ratio. And the code is not completely optimized.
In the next graph the number of elements N is on the horizontal axis and the execution time
(in milliseconds) on the vertical axis. Note that the horizontal axis has a logarithmic scale,
so a linear series of data takes an exponential shape.
With this simple sample we can see some interesting aspects of heterogeneous computing:
a. Execution speedup needs a minimum size. In this case the GPU shows an advantage from about
N=100 elements.
b. In a detailed view with N ranging from 4 to 512 we see a serrated line for GPU time. It is
caused by the internal design of the GPU. It is not so simple, but knowing that 32 is the
number of threads that run simultaneously gives a first explanation.
c. As the problem size increases the speedup ratio increases too, but it levels off at about
44x; in other words, 44 times faster, which is not too bad really.
d. The improvement should have been much higher according to those theoretical logarithm
numbers. Time for the GPU grows much more slowly than for the CPU, but still linearly, not
logarithmically as predicted. This indicates that it is limited not by processing performance
but by memory access, which is faster on the GPU than on the CPU but still linear. Well,
luckily memory bandwidth is one of the strengths of the GPU.
N          CPU (serial) [ms]  GPU (parallel) [ms]  Actual ratio  Log2(N)  Theoretical ratio          Sum result
2                    0.01233              0.01220           1.0        1                  2                   1
4                    0.00035              0.01254           0.0        2                  2                   6
8                    0.00060              0.01290           0.0        3                  3                  28
16                   0.00116              0.01357           0.1        4                  4                 120
32                   0.00221              0.01466           0.2        5                  6                 496
64                   0.00983              0.01502           0.7        6                 11               2,016
128                  0.02707              0.01582           1.7        7                 18               8,128
256                  0.04558              0.01729           2.6        8                 32              32,640
512                  0.11940              0.02003           6.0        9                 57             130,816
1,024                0.27119              0.02526          10.7       10                102             523,776
2,048                0.64000              0.02639          24.3       11                186           2,096,128
4,096                0.29091              0.02908          10.0       12                341           8,386,560
8,192                0.51613              0.03451          15.0       13                630          33,550,336
16,384               0.64000              0.04532          14.1       14              1,170         134,209,536
32,768               1.76000              0.06690          26.3       15              2,185         536,854,528
65,536               3.84000              0.10991          34.9       16              4,096       2,147,450,880
131,072              7.52000              0.19606          38.4       17              7,710       8,589,869,056
262,144             15.52000              0.36826          42.1       18             14,564      34,359,607,296
524,288             30.56000              0.71263          42.9       19             27,594     137,438,691,328
1,048,576           61.12000              1.40201          43.6       20             52,429     549,755,289,600
2,097,152          122.24000              2.77936          44.0       21             99,864   2,199,022,206,976
4,194,304          244.00000              5.53608          44.1       22            190,650   8,796,090,925,056
8,388,608          492.32001             11.09849          44.4       23            364,722  35,184,367,894,528
Table. Heterogeneous Computing - Vector Sum - Data
Graph. Heterogeneous Computing - Vector Sum: execution time in ms (vertical axis, 0 to 500) against N (horizontal axis, logarithmic, 2 to 8,388,608) for CPU (serial) and GPU (parallel).
Detail graph: execution time from 0.00 to 0.05 ms for N from 4 to 512.
e. The total sum is a big number. The calculation is correct; it has been checked. We needed
double precision numbers (64 bits) to do it, whether floating point or integer. We state this
because the latest GPUs released are IEEE 754 compliant, which is a guarantee of precision (to
the extent of the standard).
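One way to check the totals, and to see why 64 bits are needed, is the closed form for the sum of consecutive integers. The sketch below assumes the test vector holds the values 0, 1, …, N-1; this is inferred from the reported totals (for instance 120 for N = 16), not stated in the text. The names `expected_sum` and `needs_64_bits` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sum of 0 + 1 + ... + (n - 1), computed in 64-bit arithmetic.
   Assumes the vector holds the values 0 .. n-1 (an inference from the
   reported totals, not a fact given in the text). */
uint64_t expected_sum(uint64_t n)
{
    /* One of n and n - 1 is even, so halve that one first to keep the
       division exact and the intermediate product small. */
    if (n % 2 == 0)
        return (n / 2) * (n - 1);
    return n * ((n - 1) / 2);
}

/* True when the total no longer fits in an unsigned 32-bit integer. */
bool needs_64_bits(uint64_t n)
{
    return expected_sum(n) > UINT32_MAX;
}
```

Under that assumption the total for N = 8,388,608 is 35,184,367,894,528, and an unsigned 32-bit integer stops being enough between N = 65,536 and N = 131,072.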
Tomorrow
Some years ago the performance progression of “traditional” processors (CPUs) slowed down. The
cause was that limits were reached in clock speed and/or in heat dissipation from the chip.
That was the reason why manufacturers started to make multicore processors, which still
improve performance but more slowly.
Parallel chips (GPUs) have arrived to complement the CPU, allowing the progression of
performance to continue, but they are not the alternative, not as a standalone solution.
They are here to cooperate; their value comes from the “combination of the two”.
The weak points are known, mostly by the technology forerunners, and they are working on them:
1. Sharing memory. The CPU and the GPU need to communicate, and they do it through RAM.
This can be the system bottleneck, so reducing and accelerating this traffic is one of
the current lines of work. Some aspects of the new AMD design, which includes both a CPU
and a GPU on the same chip (they call it APU), focus on this. But accessing memory from a
CPU and from a GPU is quite different, so they have to find a good solution to the
challenge of making serial and parallel memory access protocols work together.
2. Locality. In the mid and long term (in computers I would not try to guess how many
years this is) the challenge is energy: the energy to make chips work, without melting
them of course. In this sense the distance between data and processing units is the most
significant parameter. So the target is not just adding more processing cores, it is doing
it in such a way that data are at the minimum distance from the cores. NVIDIA is now
working on it, searching for strategies to place cores and data optimally, and probably
to structure a data hierarchy across internal (on-chip) memories to progressively
minimize data accesses from the cores.
I have surmised those two points, sure to be far from knowing the future. But I am sure that
reality always exceeds fiction, so we will be surprised, I hope pleasantly.
Conclusions
Even for non-optimal problems the speedup achieved is very significant.
We have easy access to the tools we need, most at an affordable price, and your computer
probably has the hardware inside already.
We have the opportunity to optimize our programs to achieve two targets:
I. Execution time reduction. “Imagine” you are calculating the structure of a building or a
bridge and you now need 44 minutes to do it: that is enough to have lunch, but it is also
enough to miss your flight. We prefer 1 minute.
II. Interactivity. In our opinion this is the great opportunity. If we have code that runs for
44 seconds and we can bring it down to 1 or even ½ second… we can offer the user an
interactive response. While users move the mouse, drag, click… they get an “instant”
interaction, a responsive interface.
We have the opportunity to use this technology to give users a better experience, to face
bigger or more complex problems, to solve problems with higher precision… to gain an
advantage over our competitors.
Glossary
Bit Minimum unit of information, a contraction of BInary digiT. It can be one or zero.
Chip Piece of semiconductor material on which electronic circuits are implemented to store
and/or process data. Microprocessors, CPUs, GPUs, RAM… are chips.
Core Primary part of the processor. It does the work. Each one is able to do one task on one
datum; it is the minimum processing unit. Each one is SISD.
CPU Central Processing Unit. Commonly our Intel, AMD or other microprocessor.
GPU Graphics Processing Unit. Our graphics card or the main chip inside it.
Hardware Material or physical parts that compose a computer.
IEEE Institute of Electrical and Electronics Engineers. Its standards are usually a universal reference.
MIMD Multiple Instruction Multiple Data. Simultaneous execution of several different tasks on
different data. See Fig. 4.
MISD Multiple Instruction Single Data. Simultaneous execution of several different tasks on one
datum. See Fig. 3.
Manycore Chips with a large number (up to a thousand) of processing units (cores). They have been
designed from scratch for parallel computing, therefore carrying parallelism over to data access
as much as possible. Their architecture is SIMD, or SIMT on modern devices.
Multicore Name commonly used for processors (CPUs) with several (up to 12) processing units
(cores) and a MIMD internal architecture.
OpenGL Open Graphics Library. Cross-platform standard for graphics processing. It defines the
generation and handling of graphic elements (lines, surfaces…) and also their visual output
(light, perspective…).
RAM, Memory Random Access Memory. A type of memory that allows writing and reading at any
location.
SIMD Single Instruction Multiple Data. Simultaneous execution of one task on several different
data. See Fig. 2.
SIMT Single Instruction Multiple Threads. Simultaneous execution of one task on several
different threads. See Fig. 5.
SISD Single Instruction Single Data. Execution of one task on one datum. See Fig. 1.
Software Immaterial components of a computer system.
Thread Minimal sequence of tasks processed independently by a core. One independent task.
In figures 6 and 7, “lee” means read and “Esc” means write.
Trademarks registered by their owners:
CUDA, Nvidia, AMD, ATI, Fusion, SGI, Silicon Graphics, Reality Engine, Open GL, Evans & Sutherland, Intel, i860.
Maybe “the best of both worlds”, “combination of the two” and “imagine” are registered too.