Secret Key Cryptography Using Graphics Cards
Debra L. Cook
Angelos D. Keromytis
Technical Report, January 14, 2004
One frequently cited reason for the lack of wide deployment of cryptographic protocols is the (perceived)
poor performance of the algorithms they employ and their impact on the rest of the system. Although
high-performance dedicated cryptographic accelerator cards have been commercially available
for some time, market penetration remains low. We take a different approach, seeking to exploit existing
system resources, such as Graphics Processing Units (GPUs), to accelerate cryptographic processing.
We exploit the ability of GPUs to simultaneously process large quantities of pixels to offload
cryptographic processing from the main processor. We demonstrate the use of GPUs for stream ciphers,
which can achieve 75% of the performance of a fast CPU. We also investigate the use of GPUs for block
ciphers, discuss operations that make certain ciphers unsuitable for use with a GPU, and compare the
In addition to offloading system resources, the ability to perform encryption and decryption within the
GPU has potential applications in image processing by limiting exposure of the plaintext to within the
Keywords: Graphics Processing Unit, Block Ciphers, Stream Ciphers, AES.
1 Introduction
In a large-scale distributed environment such as the Internet, cryptographic protocols and mechanisms play
an important role in ensuring the safety and integrity of the interconnected systems and the resources that
are available through them. The fundamental building blocks such protocols depend on are cryptographic
primitives, whose algorithmic complexity often turns them into a (real or perceived) performance bottleneck
for the systems that employ them. To address this issue, vendors have been marketing hardware
cryptographic accelerators that implement such algorithms [8, 9, 11, 12, 14]. Others have experimented
with taking advantage of special functions, such as MMX instructions.
While the performance improvement that can be derived from accelerators is significant, only a
small number of systems employ such dedicated hardware. Unless the economics of security change drastically,
it is not clear why users would invest in such hardware. Thus, our approach is to exploit resources
typically available in most systems. We observe that the large majority of systems, in particular workstations
and laptops (but also servers), include a high-performance Graphics Processing Unit (GPU), also known as
a graphics accelerator. Due to intense competition and considerable demand (primarily from the gaming
community) for high-performance graphics, such GPUs pack more transistors than the CPUs found in the
same PC enclosure, at a lower price.
Furthermore, we believe that most users do not use their GPUs at full capacity when browsing or otherwise
requiring secure communications; conversely, the need for secure communications is perhaps diminished
while playing a graphics-intensive game. Likewise, GPUs are underutilized by server machines. Thus,
there exists the potential for utilizing such widely available, high-performance, special-purpose hardware for
offloading suitable computationally expensive tasks. Our initial intent is to determine the potential use of
typical GPUs and configurations for cryptographic applications, as opposed to requiring enhancements to
GPUs, their drivers, or other system components. Avoiding specialized requirements is necessary to provide
a benefit to generalized environments in which the GPU is otherwise underutilized.
GPUs provide parallel processing of large quantities of data relative to what can be provided by a general
CPU. Performance levels equivalent to the processing speed of a 10 GHz Pentium processor have been reached,
and GPUs from Nvidia and ATI are functioning as co-processors to CPUs in various graphics subsystems.
GPUs are already being used for non-graphics applications, but presently none are oriented towards
security. Utilizing GPUs for encryption has potential benefits for both graphics and non-graphics
applications. In general, moving encryption and decryption into GPUs will offload work from other system
resources. Beyond simply improving system performance, implementing ciphers within the GPU allows images
to be encrypted and decrypted without writing the image temporarily as plaintext to system memory, limiting
exposure of the plaintext to within the GPU.
Our work consists of several related experiments regarding the use of GPUs for symmetric-key ciphers.
First, we experiment with the use of GPUs for stream ciphers, leveraging the parallel processing to quickly
apply the key stream to large segments of data. Second, we determine if AES can be implemented to utilize
a GPU in a manner that allows for offloading work from other system resources. Our work illustrates why
algorithms involving certain byte-level operations and substantial byte-level manipulation are unsuitable for
use with GPUs given current APIs. Third, we investigate the potential for implementing ciphers in GPUs
for image processing to avoid the image being written to system memory as plaintext.
1.1 Paper Organization
The remainder of the paper is organized as follows. We provide background on OpenGL commands and
pixel processing used in our implementations in Section 2. Section 3 explains how GPUs can be utilized for
stream ciphers in certain applications, and gives some preliminary performance results. Section 4 describes
the representation of AES which was implemented in OpenGL and includes a general discussion of why
certain block ciphers are not suitable candidates for use with a GPU given the existing APIs. Section 5
provides an overview of the implementation of AES that utilizes a GPU and provides performance results.
We discuss the potential use of GPU-embedded versions of symmetric-key ciphers in image processing in
Section 6. Our conclusions and future areas of work are covered in Section 7. Appendix A describes
the experimental environments, including the minimal required specifications for the GPUs. Appendix B
contains pseudo-code for our AES encryption routine.
2 OpenGL and GPU Background
Before describing our implementations of symmetric key ciphers, we give a brief overview of the OpenGL
pipeline, modeled after the way modern GPUs operate, and the OpenGL commands relevant to our experiments.
Our implementations process data as pixels treated as floating point values, with one byte of data
stored in each pixel component; we use neither the processing of pixels as color and stencil indices nor the
vertex processing in OpenGL. Refer to  and  for a complete description. We used OpenGL version
1.4 in all experiments.
Figure 1 shows the components of the OpenGL pipeline that are relevant to pixel processing when
pixels are treated as floating point values. While implementations are not required to adhere to the pipeline,
it serves as a general guideline for how data is processed. We also point out that OpenGL does not require
support for the Alpha pixel component in the back buffer.
Figure 1: OpenGL Pipeline
A data format indicating such items as number of bits per pixel and the ordering of color components
specifies how the GPU interprets and packs/unpacks the bits when reading data to and from system memory.
The data format may indicate that the pixels are to be treated as floating point numbers, color indices, or
stencil indices. The following description concerns the floating point interpretation. When reading data from
system memory, the data is unpacked and converted into floating point values in the range [0, 1]. Then
scaling and bias are applied per color component. The next step is to apply the color map, which we describe
later in more detail. The values of the color components are then clamped to be within the range [0, 1].
Rasterization is the conversion of data into fragments, with each fragment corresponding to one pixel in
the framebuffer. In our work this step has no impact. The fragment operations relevant to pixel processing
include dithering, threshold based tests, such as discarding pixels based on alpha value and stencils, and
blending and logical operations that combine pixels being drawn into the framebuffer with those already in
the destination area of the framebuffer. Dithering, which is enabled by default, must be turned off in our
applications to prevent pixels from being averaged with their neighbors.
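The unpack, scale-and-bias, and clamp steps of the floating point interpretation can be illustrated with a short Python sketch. It models only the arithmetic on a single 8-bit component and is not OpenGL code. The function names, the default scale of 1 and bias of 0, and the rounding used when packing back to a byte are assumptions made for illustration. The color map step is omitted here because it is described later.

```python
def unpack_component(byte_value):
    """Convert an 8-bit component (0..255) to a float in [0, 1]."""
    return byte_value / 255.0


def clamp01(value):
    """Clamp a component to the range [0, 1]."""
    return min(1.0, max(0.0, value))


def transfer_component(byte_value, scale=1.0, bias=0.0):
    """Unpack one component, apply scale and bias, then clamp.

    Scale and bias defaults are illustrative assumptions that leave
    the value unchanged.
    """
    value = unpack_component(byte_value)
    value = value * scale + bias
    return clamp01(value)


def pack_component(value):
    """Convert a float in [0, 1] back to an 8-bit component.

    Rounding to nearest is an assumption of this sketch.
    """
    return int(round(value * 255.0))
```

With the default scale and bias, every byte value survives the round trip unchanged, which is the property the implementations rely on when storing one byte of data per pixel component.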
When reading data from the framebuffer to system memory, the pixel values are mapped to the range
[0, 1]. Scaling, bias, and color maps are applied to each of the RGBA components and the result clamped to
[0, 1]. The components or luminance are then packed into system memory according to the format
specified. When copying pixels between areas of the framebuffer, the processing occurs as if the pixels
were being read back to system memory except the data is written to the new location in the framebuffer
according to the format specified for reading pixels from system memory to the GPU.
Aside from reading the input from system memory and writing the result to system memory, the OpenGL
commands in our implementations consist of copying pixels between coordinates, with color mapping and
a logical operation of XOR enabled or disabled as needed. Unfortunately, copying pixels and applying color
maps are two of the slowest operations to perform. The logical operation of XOR produces a bitwise
XOR between the pixel being copied and the pixel currently in the destination of the copy, with the result
being written to the destination of the copy.
A color map is applied to a particular component of a pixel when the pixel is copied from one coordinate
to another. A color map can be enabled individually for each of the RGBA components. The color map
is a static table of floating point numbers between 0 and 1. Internal to the GPU, the value of the pixel
component being mapped is converted to an integer value which is used as the index into the table and the
pixel component is replaced with the value from the table. For example, if the table consists of 256 entries,
as in our AES implementation, and the map is being applied to the red component of a pixel, the 8 bits of the
red value are treated as an integer between 0 and 255, and the red value updated with the corresponding entry
from the table. In order to implement the tables of item (III) in Section 4 as color maps, the tables must be
converted to tables of floating point numbers between 0 and 1, and hard-coded in the program as constants.
The table entries, which would vary from 0 to 255 if the bytes were in integer format, are converted to
floating point values by dividing by 255. Because pixels are stored as floating point numbers and the values
are truncated when they are converted to integers to index into a color map, 0.000001 is added to the result
(except to 0 and 1) to prevent errors due to truncation.
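The table conversion and the truncating lookup can be sketched in Python as follows. This is a model of the arithmetic only, assuming the index is obtained by multiplying the component by 255 and truncating. The function names are hypothetical, and `example_table` is an arbitrary byte permutation chosen for illustration, not one of the AES tables.

```python
EPSILON = 0.000001  # offset quoted in the text


def to_color_map(byte_table):
    """Convert a table of byte values (0..255) to floats in [0, 1].

    EPSILON is added to every entry except those equal to 0 or 1,
    so that a later truncation recovers the intended byte.
    """
    entries = []
    for value in byte_table:
        entry = value / 255.0
        if 0 < value < 255:
            entry += EPSILON
        entries.append(entry)
    return entries


def component_to_index(component, table_size=256):
    """Truncate a float component in [0, 1] to an integer table index."""
    return int(component * (table_size - 1))


def apply_color_map(component, color_map):
    """Replace a component with the table entry it indexes."""
    return color_map[component_to_index(component, len(color_map))]


# Hypothetical 256-entry table: a permutation of the byte values.
example_table = [(7 * i + 3) % 256 for i in range(256)]
color_map = to_color_map(example_table)
```

In this model, a component that encodes byte `b` is replaced by a float that truncates back to `example_table[b]`, so successive lookups can be chained without the index drifting down by one.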
3 Use of Graphics Cards for Stream Ciphers
As a first step in evaluating the usefulness of GPUs for implementing cryptographic primitives, we imple-
mented the mixing component of a stream cipher (the XOR operation) inside the GPU. GPUs have the
ability to XOR large quantities of pixels simultaneously, which can be beneficial in stream cipher imple-
mentations. For applications that pre-compute segments of key streams, a segment can be stored in an array
of bytes which is then read into the GPU’s memory and treated as a collection of pixels. The data to be
encrypted or decrypted are also stored in an array of bytes which is read into the same area of the GPU’s
memory as the key stream segment, with the logical operation of XOR enabled during the read. The result
is then written to system memory. Overall, XORing the data with the key-stream requires two reads of data
into the GPU from system memory and one read from the GPU to system memory regardless of how many
bytes are being encrypted.
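The sequence of two reads into the GPU followed by one read back can be modeled in Python as below. This is a CPU-side sketch of the data flow only, with a `bytes` object standing in for the framebuffer area. The function names are hypothetical and no OpenGL calls are made.

```python
def xor_logic_op(destination, source):
    """Model drawing `source` over `destination` with the XOR logical operation."""
    if len(destination) != len(source):
        raise ValueError("source and destination must cover the same area")
    return bytes(d ^ s for d, s in zip(destination, source))


def gpu_style_xor(key_stream_segment, data):
    """Apply a precomputed key stream segment to data.

    Step 1: read the key stream segment into the (modeled) framebuffer.
    Step 2: read the data into the same area with XOR enabled.
    Step 3: read the result back to system memory.
    """
    framebuffer = bytes(key_stream_segment)
    framebuffer = xor_logic_op(framebuffer, data)
    return framebuffer
```

Because XOR is its own inverse, calling `gpu_style_xor` again with the same key stream segment and the ciphertext returns the original data, so the same routine serves for both encryption and decryption.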
The number of bytes can be at most three times the number of pixels supported if the data is processed in
a back buffer utilizing only RGB components. The number of bytes can be four times the number of pixels
if the front buffer can be used or the back buffer supports the Alpha component. If the key stream is not
computed in the GPU, the cost of computing the key stream and temporarily storing it in an array is the same
as in an implementation not utilizing a GPU. Certain stream ciphers, such as RC4, can be implemented
such that the key stream is generated within the GPU (see footnote 1). Others involve operations which make it difficult
or impossible to implement in the GPU given current APIs; for example, SEAL  which requires 9-bit
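The limit on bytes per pass described above (three bytes per pixel with RGB only, four when an Alpha component is available) can be expressed as a small Python helper. The function names and the idea of splitting a longer input into successive passes are assumptions made for illustration.

```python
def max_bytes_per_pass(width, height, alpha_available):
    """Bytes that fit in a width x height pixel area at one byte per component."""
    components = 4 if alpha_available else 3  # RGBA versus RGB
    return width * height * components


def split_into_passes(data, width, height, alpha_available):
    """Split data into chunks that each fit in one pixel area."""
    limit = max_bytes_per_pass(width, height, alpha_available)
    return [data[i:i + limit] for i in range(0, len(data), limit)]
```

For example, a 200x200 area holds 120,000 bytes with RGB components and 160,000 bytes with RGBA components.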
1 The modular additions required in RC4 can be performed in OpenGL with blending.
We compared the rate at which data can be XOR’ed with a key stream in an OpenGL implementation
to that of a C implementation (Visual C++ 6.0). We conducted the tests using a PC with a 1.8 GHz Pentium
IV processor and an Nvidia GeForce3 graphics card, a laptop with a 1.3 GHz Pentium Centrino processor and
an ATI Mobility Radeon graphics card, and a PC with an 800 MHz Pentium III processor and an Nvidia TNT2
graphics card. Refer to Appendix A for additional details on the test environments. We give the results
from the C implementation in Table 1. We tested several data sizes to determine the ranges for which the
OpenGL implementation would be useful. As expected, the benefit of the GPU’s simultaneous processing
is diminished if the processed data is too small. Table 2 indicates the average encryption rates over 10 trials
of encrypting 1000 data segments for each pixel area tested, using RGBA and RGB components.
56MB/s XOR Rate
Table 1: XOR Rate Using System Resources (CPU)
Table 2: XOR Rate Using GPUs - RGB and RGBA Pixel Components
Notice that the encryption rate was fairly constant for all data sizes on the slowest processor with the
oldest GPU (Nvidia TNT2). Possible explanations include a slow memory controller, memory bus, or GPU,
although we have not investigated this further. With the GeForce3 Ti200 card the efficiency increased as
more bytes were XOR’ed simultaneously. On the laptop the peak rates were obtained with 200x200 to
400x400 square pixel areas.
When using the RGB components, the highest rate obtained by the GPUs compared to the C program
is 58% for the Nvidia GeForce3 Ti200 card, 48.5% for the ATI Mobility Radeon card, and 51.4% for the
Nvidia TNT2 card. With both the GeForce3 Ti200 and the ATI Radeon cards, results with the 50x50 pixel
area were significantly slower than with larger areas because the time to read data to and from system memory
represents a larger portion of the total time. In both cases the rate is approximately 25% of that of the C
program. When using the RGBA components, the highest rates on the Nvidia GeForce3 Ti200, ATI Radeon,
and Nvidia TNT2 cards are 75.5%, 52%, and 68% of the C program's rate, respectively.
4 Representation of Block Ciphers
We now turn our attention to the use of GPUs for implementing block ciphers. The first step in our work
is to determine if AES can be represented in a manner which allows it to be implemented within a GPU.
We describe the derivation of the OpenGL version of AES and its implementation in some detail in order
to illustrate the difficulties that arise when utilizing GPUs for algorithms performing byte-level operations.