The Salsa20 family of stream ciphers

Daniel J. Bernstein*

Department of Mathematics, Statistics, and Computer Science (M/C 249)

The University of Illinois at Chicago

Chicago, IL 60607–7045

snuffle6@box.cr.yp.to

Abstract. Salsa20 is a family of 256-bit stream ciphers designed in 2005

and submitted to eSTREAM, the ECRYPT Stream Cipher Project.

Salsa20 has progressed to the third round of eSTREAM without any

changes. The 20-round stream cipher Salsa20/20 is consistently faster

than AES and is recommended by the designer for typical cryptographic

applications. The reduced-round ciphers Salsa20/12 and Salsa20/8 are

among the fastest 256-bit stream ciphers available and are recommended

for applications where speed is more important than confidence. The

fastest known attacks use ≈ 2^153 simple operations against Salsa20/7,

≈ 2^249 simple operations against Salsa20/8, and ≈ 2^255 simple operations

against Salsa20/9, Salsa20/10, etc. In this paper, the Salsa20 designer

presents Salsa20 and discusses the decisions made in the Salsa20 design.

1 Introduction

A sender and receiver share a short secret key. They use the secret key to encrypt

a series of messages. A message could be short, just a few bytes, but it could be

much longer, perhaps gigabytes. The series of messages could be short, just one

message, but it could be much longer, perhaps billions of messages.

The sender and receiver encrypt messages using an encryption function: a

function that produces the first ciphertext from the key and the first plaintext,

that produces the second ciphertext from the key and the second plaintext, etc.

An encryption function has to be fast. Many senders have to encrypt large

volumes of data in very little time using limited resources. Many receivers are

faced with even larger volumes of data—not just the legitimate messages but

also a flood of forgery attempts. A slow encryption function can satisfy some

senders and receivers, but my focus is on encryption functions suitable for a

wider range of applications.

An encryption function also has to be secure. Many users are facing, or at

least think that they are facing, years of cryptanalytic computations by

well-funded attackers equipped with millions of fast parallel processors. Some users

* Permanent ID of this document: 31364286077dcdff8e4509f9ff3139ad. Date of this

document: 2007.12.25. This work was supported by the National Science Foundation

under grants CCR–9983950 and ITR–0716498, and by the Alfred P. Sloan Foundation.

                                                Cycles/byte
                                                  Salsa20/8    Salsa20/12    Salsa20/20
Arch   MHz  Machine                 Software      long   576    long   576    long   576
amd64 3000  Xeon 5160 (6f6)         amd64-xmm6    1.88  2.07    2.80  3.25    3.93  4.25
amd64 2137  Core 2 Duo (6f6)        amd64-xmm6    1.88  2.07    2.57  2.80    3.91  4.33
ppc32  533  PowerPC G4 7410         ppc-altivec   1.99  2.14    2.74  2.88    4.24  4.39
x86   2137  Core 2 Duo (6f6)        x86-xmm5      2.06  2.28    2.80  3.15    4.32  4.70
amd64 2000  Athlon 64 X2 (15,75,2)  amd64-3       3.47  3.65    4.86  5.04    7.64  7.84
ppc64 2000  PowerPC G5 970          ppc-altivec   3.28  3.48    4.83  4.87    7.82  8.04
amd64 2391  Opteron (f5a)           amd64-3       3.78  3.96    5.33  5.51    8.42  8.62
amd64 2192  Opteron (f58)           amd64-3       3.82  4.18    5.35  5.73    8.42  8.78
x86   2000  Athlon 64 X2 (15,75,2)  x86-1         4.50  4.78    6.27  6.55    9.80 10.07
x86    900  Athlon (622)            x86-athlon    4.61  4.84    6.44  6.65   10.04 10.24
ppc64 1452  POWER4                  merged        6.83  7.00    8.35  8.51   11.29 11.47
hppa  1000  PA-RISC 8900            merged        5.82  5.97    7.68  7.85   11.39 11.56
amd64 3000  Pentium D (f64)         amd64-xmm6    5.38  5.87    7.19  7.84   10.69 11.73
x86   1300  Pentium M (695)         x86-xmm5      5.30  5.53    7.44  7.70   11.70 11.98
x86   3000  Xeon (f26)              x86-xmm5      5.30  5.86    7.41  8.21   11.64 12.55
x86   3200  Xeon (f25)              x86-xmm5      5.30  5.84    7.40  8.15   11.63 12.59
x86   2800  Xeon (f29)              x86-xmm5      5.33  5.95    7.44  8.20   11.67 12.65
x86   3000  Pentium 4 (f41)         x86-xmm5      5.76  6.92    8.12  9.33   11.84 13.40
x86   1400  Pentium III (6b1)       x86-mmx       6.37  6.79    8.88  9.29   13.88 14.29
sparc 1050  UltraSPARC IV           sparc         6.65  6.76    9.21  9.33   14.34 14.45
x86   3200  Pentium D (f47)         x86-athlon    7.13  7.66    9.90 10.31   15.29 15.94
ia64  1500  Itanium II              merged        8.49  8.87   12.42 12.62   18.07 18.27
ia64  1400  Itanium II              merged        8.28  8.65   12.56 12.76   18.21 18.40

Table 1.1. Salsa20 software speeds; measured by the official eSTREAM benchmarking

framework; sorted by final column. “576” means single-core cycles/byte to encrypt a

576-byte packet; “long” means single-core cycles/byte to encrypt a long stream.

are satisfied with lower levels of security, but again my focus is on encryption

functions suitable for a wider range of applications.

There is a conflict between these desiderata. One can reasonably conjecture,

for example, that every function that encrypts data in 0.5 Core-2 cycles/byte

is breakable. One can also conjecture that almost every function that encrypts

data in 5 Core-2 cycles/byte is breakable. On the other hand, several unbroken

submissions to eSTREAM, the ECRYPT Stream Cipher Project, encrypt data

in fewer than 5 Core-2 cycles/byte.

In particular, my 20-round stream cipher Salsa20/20 encrypts data in 3.93

Core-2 cycles/byte. (For comparison: Matsui and Nakajima recently reported 9.2

Core-2 cycles/byte for 10-round AES using a pre-expanded 128-bit key. See [18].)

The fastest known attack against Salsa20/20 is a 256-bit brute-force search. I

recommend Salsa20/20 for encryption in typical cryptographic applications.
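
To put the cycles/byte figures in more familiar units, the following sketch converts them to single-core throughput. The 3000 MHz clock is an assumption taken from the Xeon 5160 row of Table 1.1, whose long-stream figures match those quoted above; applying the same clock to the AES figure from [18] is also an assumption, made only so that the two numbers are comparable. The function name is hypothetical.

```python
def megabytes_per_second(clock_mhz: float, cycles_per_byte: float) -> float:
    """Single-core throughput in units of 10^6 bytes per second.

    clock_mhz * 10^6 cycles/second divided by cycles/byte gives bytes/second;
    the factor 10^6 cancels when the result is expressed in 10^6 bytes/second.
    """
    return clock_mhz / cycles_per_byte

# Assumed clock: 3000 MHz (Xeon 5160 row of Table 1.1).
salsa20_20 = megabytes_per_second(3000, 3.93)  # Salsa20/20, long stream
aes_10 = megabytes_per_second(3000, 9.2)       # 10-round AES figure quoted from [18]

print(f"Salsa20/20: {salsa20_20:.0f} MB/s, AES: {aes_10:.0f} MB/s")
```

The ratio of the two throughputs is independent of the assumed clock speed: it is simply 9.2/3.93, a little over 2.3.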

Reduced-round ciphers in the Salsa20 family are attractive options for users

who value speed more highly than confidence. The 12-round stream cipher

Salsa20/12 encrypts data in 2.80 Core-2 cycles/byte; the fastest known attack

against Salsa20/12 is a 256-bit brute-force search. The 8-round stream cipher

Salsa20/8 encrypts data in 1.88 Core-2 cycles/byte; as discussed in Section 5,

papers by several cryptanalysts have culminated in an attack against Salsa20/8

taking “only” 2^249 operations, but this is far beyond any computation that will

be carried out in the foreseeable future. Perhaps better attacks will be developed,

but competing ciphers at similar speeds seem to be much more easily broken!

I hadn’t heard of the Core 2 when I designed Salsa20. I was aiming for high

speed on a wide variety of platforms; I don’t find it surprising that Salsa20 is

able to take advantage of a new platform. Table 1.1 shows Salsa20’s software

speeds on various CPUs.

This paper defines Salsa20 and explains the decisions that I made in the

Salsa20 design. Section 2 discusses the selection of low-level operations used

in Salsa20—a deliberately limited set, in particular with no S-boxes. Section 3

discusses the high-level data flow in Salsa20—again quite limited, in particular

with no communication across blocks aside from a simple block counter. Section

4 discusses the middle-level structure of Salsa20. Section 5 reviews known attacks

on Salsa20.

2 Low level: Which operations are used?

2.1 What does Salsa20 do?

The Salsa20 encryption function is a long chain of three simple operations on

32-bit words:

• 32-bit addition, producing the sum a + b mod 2^32 of two 32-bit words a,b;

• 32-bit exclusive-or, producing the xor a ⊕ b of two 32-bit words a,b; and

• constant-distance 32-bit rotation, producing the rotation a <<< b of a 32-bit

word a by b bits to the left, where b is constant.
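
The three operations listed above can be sketched in Python as follows. This is a minimal illustration rather than code from the Salsa20 specification: the helper names `add32`, `xor32`, and `rotl32` are hypothetical, and the explicit masking is needed only because Python integers are unbounded.

```python
MASK32 = 0xFFFFFFFF  # keeps every result within 32 bits

def add32(a: int, b: int) -> int:
    """32-bit addition: (a + b) mod 2^32."""
    return (a + b) & MASK32

def xor32(a: int, b: int) -> int:
    """32-bit exclusive-or of two 32-bit words."""
    return (a ^ b) & MASK32

def rotl32(a: int, b: int) -> int:
    """Rotate the 32-bit word a left by the constant distance b, 0 < b < 32."""
    return ((a << b) | (a >> (32 - b))) & MASK32
```

For example, `add32(0xFFFFFFFF, 1)` wraps around to 0, and `rotl32(0x80000001, 1)` moves the top bit around to the bottom, giving 0x00000003.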

On occasion I encounter the superstitious notion that these operations are

“too simple.” In fact, these operations can easily simulate any circuit, and are

therefore capable of reaching the same security level as any other selection of

operations. The real question for the cipher designer is whether a different mix

of operations could achieve the same security level at higher speed.

2.2 Should there be integer multiplications?

Some popular CPUs can quickly compute xy mod 2^64, given x,y. Some ciphers

are designed to take advantage of this operation. Sometimes one of x,y is a

constant; sometimes x,y are both variables.

The basic argument for integer multiplication is that the output bits are

complicated functions of the input bits, mixing the inputs more thoroughly than

a few simple integer operations.

The basic counterargument is that integer multiplication takes several cycles

on the fastest CPUs, and many more cycles on other CPUs. For comparison, a

comparably complex series of simple integer operations is always reasonably fast.

Multiplication might be slightly faster on some CPUs but it is not consistently

fast.

I do like the amount of mixing provided by multiplication, and I’m impressed

with the fast multiplication circuits included (generally for non-cryptographic

reasons) in many CPUs, but the potential speed benefits don’t seem big enough

to outweigh the massive speed penalty on other CPUs. Similar comments apply

to 64-bit additions, to 32-bit multiplications, and to variable-distance

(“data-dependent”) rotations.

A further argument against integer multiplication is that it increases the risk

of timing leaks. What really matters is not the speed of integer multiplication,

but the speed of constant-time integer multiplication, which is often much slower.

Example: On the Motorola PowerPC 7450 (G4e), a fairly common

general-purpose CPU, the mull multiplication instruction usually takes 2 cycles (with

4-cycle latency), but it takes only 1 cycle (with 3-cycle latency) if “the 15 msbs

of the B operand are either all set or all cleared.” See [1, page 6.45]. The same is

true for the 8641D, the newest CPU in the same family. It is possible to eliminate

the timing leak on these CPUs by, e.g., using the floating-point multiplier, but

moving data back and forth to floating-point registers costs CPU cycles, not to

mention extra programming effort.

2.3 Should there be S-box lookups?

An S-box lookup is an array lookup using an input-dependent index. Most

ciphers are designed to take advantage of this operation. For example, typical

high-speed AES software has several 1024-byte S-boxes, each of which converts

8-bit inputs to 32-bit outputs.

The basic argument for S-boxes is that a single table lookup can mangle its

input quite thoroughly—more thoroughly than a chain of a few simple integer

operations taking the same amount of time.

The basic counterargument is that a simple integer operation takes one or

two 32-bit inputs rather than one 8-bit input, so it effectively mangles several

8-bit inputs at once. It is not obvious that a series of S-box lookups—even with

rather large S-boxes, as in AES, increasing L1 cache pressure on large CPUs

and forcing different implementation techniques for small CPUs—is faster than

a comparably complex series of integer operations.

A further argument against S-box lookups is that, on most platforms, they are

vulnerable to timing attacks. NIST’s statement to the contrary in [19, Section

3.6.2] (table lookup is “not vulnerable to timing attacks”) is erroneous. It is

extremely difficult to work around this problem without sacrificing a tremendous

amount of speed. See my paper [5] for much more information on this topic,

including an example of successful remote extraction of a complete AES key.

For me, the timing-attack problem is decisive. For any particular security

level, I’m not sure whether adding S-box lookups would gain speed, but I’m sure

that adding constant-time S-box lookups would not gain speed.
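
To illustrate why a constant-time S-box lookup is costly, the sketch below contrasts an ordinary lookup with one well-known workaround: reading every table entry and selecting the wanted one with a mask. This workaround is an assumption chosen for illustration, not a technique described in this paper; the table contents and function names are hypothetical. Python itself offers no timing guarantees, so the code shows only the memory-access pattern and the operation count.

```python
from typing import Sequence

def lookup_direct(table: Sequence[int], index: int) -> int:
    """Ordinary S-box lookup: the address read depends on the secret index."""
    return table[index]

def lookup_scan(table: Sequence[int], index: int) -> int:
    """Read every entry and keep only the wanted one.

    The access pattern no longer depends on index, at the cost of
    len(table) reads and several integer operations per read.
    """
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ index
        # mask is 0xFFFFFFFF when i == index and 0 otherwise, with no branch
        mask = ((diff - 1) >> 32) & 0xFFFFFFFF
        result |= entry & mask
    return result

# Hypothetical 1024-byte table: 256 entries of 32 bits, indexed by an 8-bit input.
table = [(i * 0x01010101) & 0xFFFFFFFF for i in range(256)]
```

The scan returns the same values as the direct lookup but performs 256 reads per 8-bit input, which is the kind of speed sacrifice discussed above.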

Salsa20 is certainly not the first cipher without S-boxes. The Tiny Encryption

Algorithm, published by Wheeler and Needham in [23], is a classic example of a

reduced-instruction-set cipher: it is a long chain of 32-bit shifts, 32-bit xors, and

32-bit additions. IDEA, published by Lai, Massey, and Murphy in [17], is even

older and almost as simple: it is a long chain of 16-bit additions, 16-bit xors, and

multiplications modulo 2^16 + 1.

2.4 Should there be fewer rotations?

Rotations account for about 1/3 of the integer operations in Salsa20. If rotations

are simulated by shift-shift-xor (as they are on the UltraSPARC and with XMM

instructions) then they account for about 1/2 of the integer operations in Salsa20.

Replacing some of the rotations with a comparable number of additions might

achieve comparable diffusion in less time.

The reader may be wondering why I used rotations rather than shifts. The

basic argument for rotations is that one xor of a rotated quantity provides as

much diffusion as two xors of shifted quantities. There does not appear to be

a counterargument. Rotate-xor is faster than shift-shift-xor-xor on many CPUs

and is never slower.
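
The comparison between rotate-xor and shift-shift-xor-xor can be sketched as follows. The function names are hypothetical, and taking the second shift distance to be 32 - b is an assumption made for illustration; with that choice the two forms compute the same value, because the two shifted parts occupy disjoint bit positions.

```python
MASK32 = 0xFFFFFFFF

def rotl32(a: int, b: int) -> int:
    """Rotation simulated by shift-shift-xor, for 0 < b < 32."""
    left = (a << b) & MASK32   # the top b bits fall off the high end
    right = a >> (32 - b)      # and reappear at the low end
    return left ^ right        # disjoint bit positions, so xor equals or

def rotate_xor(x: int, a: int, b: int) -> int:
    """One xor of a rotated quantity: every bit of a affects x."""
    return x ^ rotl32(a, b)

def shift_shift_xor_xor(x: int, a: int, b: int) -> int:
    """Two xors of shifted quantities, needed for every bit of a to affect x."""
    x ^= (a << b) & MASK32
    x ^= a >> (32 - b)
    return x
```

On a CPU with a rotate instruction, `rotate_xor` costs two operations where `shift_shift_xor_xor` costs four; on a CPU without one, both cost four.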

3 High level: How do blocks interact?

3.1 What does Salsa20 do?

Salsa20 expands a 256-bit key and a 64-bit nonce (unique message number) into

a 2^70-byte stream. It encrypts a b-byte plaintext by xor’ing the plaintext with

the first b bytes of the stream and discarding the rest of the stream. It decrypts

a b-byte ciphertext by xor’ing the ciphertext with the first b bytes of the stream.

There is no feedback from the plaintext or ciphertext into the stream.
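
The xor step can be sketched as follows. The stream here is a stand-in made of random bytes shared by both sides, not Salsa20 output, and the function name is hypothetical; the point is only that encryption and decryption are the same operation.

```python
import os

def xor_with_stream(data: bytes, stream: bytes) -> bytes:
    """Xor data with the first len(data) bytes of stream; the rest is discarded."""
    if len(stream) < len(data):
        raise ValueError("stream is shorter than the message")
    return bytes(d ^ s for d, s in zip(data, stream))

# Stand-in stream: random bytes, NOT produced by Salsa20.
stream = os.urandom(64)
plaintext = b"attack at dawn"
ciphertext = xor_with_stream(plaintext, stream)  # encryption
recovered = xor_with_stream(ciphertext, stream)  # decryption: the same operation
```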

Salsa20 generates the stream in 64-byte (512-bit) blocks. Each block is an

independent hash of the key, the nonce, and a 64-bit block number; there is no

chaining from one block to the next. The Salsa20 output stream can therefore

be accessed randomly, and any number of blocks can be computed in parallel.
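
The following sketch illustrates random access to a stream built from independent blocks. The block function is a loudly hypothetical stand-in: SHA-512 is used only because it also produces 64 bytes, and it is not the Salsa20 hash. The little-endian encoding of the block number and the all-zero key and nonce are likewise arbitrary choices for this sketch.

```python
import hashlib

BLOCK_BYTES = 64  # the stream is generated in 64-byte blocks

def stand_in_block(key: bytes, nonce: bytes, block_number: int) -> bytes:
    """Hypothetical stand-in for the block function (NOT Salsa20)."""
    counter = block_number.to_bytes(8, "little")  # 64-bit block number
    return hashlib.sha512(key + nonce + counter).digest()

def stream_slice(key: bytes, nonce: bytes, offset: int, length: int) -> bytes:
    """Return stream bytes [offset, offset + length) without computing earlier blocks."""
    first = offset // BLOCK_BYTES
    last = (offset + length - 1) // BLOCK_BYTES
    blocks = b"".join(stand_in_block(key, nonce, i) for i in range(first, last + 1))
    start = offset - first * BLOCK_BYTES
    return blocks[start:start + length]

key, nonce = bytes(32), bytes(8)  # 256-bit key, 64-bit nonce (all zero, for illustration)
# Sequential generation of the first four blocks, for comparison.
prefix = b"".join(stand_in_block(key, nonce, i) for i in range(4))
```

Here `stream_slice(key, nonce, 100, 50)` equals `prefix[100:150]` but computes only blocks 1 and 2. The stream length also follows from the block structure: 2^64 block numbers times 64 bytes per block gives 2^70 bytes.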

There are no hidden preprocessing costs in Salsa20. In particular, Salsa20

does not preprocess the key before generating a block; each block uses the key

directly as input. Salsa20 also does not preprocess the nonce before generating

a block; each block uses the nonce directly as input.

3.2 Should encryption and decryption be different?

The most common model of a stream cipher is that each ciphertext block is the

xor of the plaintext block and the stream block at the same position. Each stream

block is determined by its position, the nonce, the key, and the previous blocks

of plaintext—equivalently, the previous blocks of ciphertext. Salsa20 follows this

model, as does any block cipher in counter mode, OFB mode, CFB mode, etc.