The Salsa20 family of stream ciphers
Daniel J. Bernstein *
Department of Mathematics, Statistics, and Computer Science (M/C 249)
The University of Illinois at Chicago
Chicago, IL 60607–7045
Abstract. Salsa20 is a family of 256-bit stream ciphers designed in 2005
and submitted to eSTREAM, the ECRYPT Stream Cipher Project.
Salsa20 has progressed to the third round of eSTREAM without any
changes. The 20-round stream cipher Salsa20/20 is consistently faster
than AES and is recommended by the designer for typical cryptographic
applications. The reduced-round ciphers Salsa20/12 and Salsa20/8 are
among the fastest 256-bit stream ciphers available and are recommended
for applications where speed is more important than confidence. The
fastest known attacks use ≈ 2^153 simple operations against Salsa20/7,
≈ 2^249 simple operations against Salsa20/8, and ≈ 2^255 simple operations
against Salsa20/9, Salsa20/10, etc. In this paper, the Salsa20 designer
presents Salsa20 and discusses the decisions made in the Salsa20 design.
1 Introduction
A sender and receiver share a short secret key. They use the secret key to encrypt
a series of messages. A message could be short, just a few bytes, but it could be
much longer, perhaps gigabytes. The series of messages could be short, just one
message, but it could be much longer, perhaps billions of messages.
The sender and receiver encrypt messages using an encryption function: a
function that produces the first ciphertext from the key and the first plaintext,
that produces the second ciphertext from the key and the second plaintext, etc.
An encryption function has to be fast. Many senders have to encrypt large
volumes of data in very little time using limited resources. Many receivers are
faced with even larger volumes of data—not just the legitimate messages but
also a flood of forgery attempts. A slow encryption function can satisfy some
senders and receivers, but my focus is on encryption functions suitable for a
wider range of applications.
An encryption function also has to be secure. Many users are facing, or at
least think that they are facing, years of cryptanalytic computations by
well-funded attackers equipped with millions of fast parallel processors. Some
users are satisfied with lower levels of security, but again my focus is on
encryption functions suitable for a wider range of applications.

* Permanent ID of this document: 31364286077dcdff8e4509f9ff3139ad. Date of this
document: 2007.12.25. This work was supported by the National Science Foundation
under grants CCR–9983950 and ITR–0716498, and by the Alfred P. Sloan Foundation.

                                                      Salsa20/8    Salsa20/12    Salsa20/20
Arch   MHz   Machine                 Software       long    576   long    576   long    576
amd64  3000  Xeon 5160 (6f6)         amd64-xmm6     1.88   2.07   2.80   3.25   3.93   4.25
amd64  2137  Core 2 Duo (6f6)        amd64-xmm6     1.88   2.07   2.57   2.80   3.91   4.33
ppc32  533   PowerPC G4 7410         ppc-altivec    1.99   2.14   2.74   2.88   4.24   4.39
x86    2137  Core 2 Duo (6f6)        x86-xmm5       2.06   2.28   2.80   3.15   4.32   4.70
amd64  2000  Athlon 64 X2 (15,75,2)  amd64-3        3.47   3.65   4.86   5.04   7.64   7.84
ppc64  2000  PowerPC G5 970          ppc-altivec    3.28   3.48   4.83   4.87   7.82   8.04
amd64  2391  Opteron (f5a)           amd64-3        3.78   3.96   5.33   5.51   8.42   8.62
amd64  2192  Opteron (f58)           amd64-3        3.82   4.18   5.35   5.73   8.42   8.78
x86    2000  Athlon 64 X2 (15,75,2)  x86-1          4.50   4.78   6.27   6.55   9.80  10.07
x86    900   Athlon (622)            x86-athlon     4.61   4.84   6.44   6.65  10.04  10.24
ppc64  1452  POWER4                  merged         6.83   7.00   8.35   8.51  11.29  11.47
hppa   1000  PA-RISC 8900            merged         5.82   5.97   7.68   7.85  11.39  11.56
amd64  3000  Pentium D (f64)         amd64-xmm6     5.38   5.87   7.19   7.84  10.69  11.73
x86    1300  Pentium M (695)         x86-xmm5       5.30   5.53   7.44   7.70  11.70  11.98
x86    3000  Xeon (f26)              x86-xmm5       5.30   5.86   7.41   8.21  11.64  12.55
x86    3200  Xeon (f25)              x86-xmm5       5.30   5.84   7.40   8.15  11.63  12.59
x86    2800  Xeon (f29)              x86-xmm5       5.33   5.95   7.44   8.20  11.67  12.65
x86    3000  Pentium 4 (f41)         x86-xmm5       5.76   6.92   8.12   9.33  11.84  13.40
x86    1400  Pentium III (6b1)       x86-mmx        6.37   6.79   8.88   9.29  13.88  14.29
sparc  1050  UltraSPARC IV           sparc          6.65   6.76   9.21   9.33  14.34  14.45
x86    3200  Pentium D (f47)         x86-athlon     7.13   7.66   9.90  10.31  15.29  15.94
ia64   1500  Itanium II              merged         8.49   8.87  12.42  12.62  18.07  18.27
ia64   1400  Itanium II              merged         8.28   8.65  12.56  12.76  18.21  18.40

Table 1.1. Salsa20 software speeds; measured by the official eSTREAM benchmarking
framework; sorted by final column. “576” means single-core cycles/byte to encrypt a
576-byte packet; “long” means single-core cycles/byte to encrypt a long stream.
There is a conflict between these desiderata. One can reasonably conjecture,
for example, that every function that encrypts data in 0.5 Core-2 cycles/byte
is breakable. One can also conjecture that almost every function that encrypts
data in 5 Core-2 cycles/byte is breakable. On the other hand, several unbroken
submissions to eSTREAM, the ECRYPT Stream Cipher Project, encrypt data
in fewer than 5 Core-2 cycles/byte.
In particular, my 20-round stream cipher Salsa20/20 encrypts data in 3.93
Core-2 cycles/byte. (For comparison: Matsui and Nakajima recently reported 9.2
Core-2 cycles/byte for 10-round AES using a pre-expanded 128-bit key. See .)
The fastest known attack against Salsa20/20 is a 256-bit brute-force search. I
recommend Salsa20/20 for encryption in typical cryptographic applications.
Reduced-round ciphers in the Salsa20 family are attractive options for users
who value speed more highly than confidence. The 12-round stream cipher
Salsa20/12 encrypts data in 2.80 Core-2 cycles/byte; the fastest known attack
against Salsa20/12 is a 256-bit brute-force search. The 8-round stream cipher
Salsa20/8 encrypts data in 1.88 Core-2 cycles/byte; as discussed in Section 5,
papers by several cryptanalysts have culminated in an attack against Salsa20/8
taking “only” 2^249 operations, but this is far beyond any computation that will
be carried out in the foreseeable future. Perhaps better attacks will be developed,
but competing ciphers at similar speeds seem to be much more easily broken!
I hadn’t heard of the Core 2 when I designed Salsa20. I was aiming for high
speed on a wide variety of platforms; I don’t find it surprising that Salsa20 is
able to take advantage of a new platform. Table 1.1 shows Salsa20’s software
speeds on various CPUs.
This paper defines Salsa20 and explains the decisions that I made in the
Salsa20 design. Section 2 discusses the selection of low-level operations used
in Salsa20—a deliberately limited set, in particular with no S-boxes. Section 3
discusses the high-level data flow in Salsa20—again quite limited, in particular
with no communication across blocks aside from a simple block counter. Section
4 discusses the middle-level structure of Salsa20. Section 5 reviews known attacks
2 Low level: Which operations are used?
2.1 What does Salsa20 do?
The Salsa20 encryption function is a long chain of three simple operations on
32-bit words:
• 32-bit addition, producing the sum a + b mod 2^32 of two 32-bit words a,b;
• 32-bit exclusive-or, producing the xor a ⊕ b of two 32-bit words a,b; and
• constant-distance 32-bit rotation, producing the rotation a <<< b of a 32-bit
word a by b bits to the left, where b is constant.
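As an illustration only, the three operations listed above can be sketched in Python. The function names add32, xor32, and rotl32 are hypothetical labels chosen for this sketch; they do not come from the Salsa20 specification. Python integers are unbounded, so each result is masked back to 32 bits.

```python
MASK32 = 0xFFFFFFFF  # keeps every result inside a 32-bit word


def add32(a: int, b: int) -> int:
    """32-bit addition: (a + b) mod 2^32."""
    return (a + b) & MASK32


def xor32(a: int, b: int) -> int:
    """32-bit exclusive-or of two words."""
    return (a ^ b) & MASK32


def rotl32(a: int, b: int) -> int:
    """Rotate the 32-bit word a left by a constant distance b (0 <= b < 32)."""
    a &= MASK32
    return ((a << b) | (a >> (32 - b))) & MASK32
```

In this sketch a carry out of the top bit is simply dropped by add32, while rotl32 moves the top bits around to the bottom of the word instead of discarding them.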
On occasion I encounter the superstitious notion that these operations are
“too simple.” In fact, these operations can easily simulate any circuit, and are
therefore capable of reaching the same security level as any other selection of
operations. The real question for the cipher designer is whether a different mix
of operations could achieve the same security level at higher speed.
2.2 Should there be integer multiplications?
Some popular CPUs can quickly compute xy mod 2^64, given x,y. Some ciphers
are designed to take advantage of this operation. Sometimes one of x,y is a
constant; sometimes x,y are both variables.
The basic argument for integer multiplication is that the output bits are
complicated functions of the input bits, mixing the inputs more thoroughly than
a few simple integer operations.
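A small Python sketch may make the mixing claim concrete. It computes xy mod 2^64 and counts how many output bits change when one input bit of x is flipped. The names mul64 and changed_bits are hypothetical helpers introduced for this illustration, not part of any cipher discussed here.

```python
MASK64 = (1 << 64) - 1


def mul64(x: int, y: int) -> int:
    """Integer multiplication reduced modulo 2^64."""
    return (x * y) & MASK64


def changed_bits(x: int, y: int, bit: int) -> int:
    """Count the output bits of mul64 that change when one bit of x flips."""
    before = mul64(x, y)
    after = mul64(x ^ (1 << bit), y)
    return bin(before ^ after).count("1")
```

Flipping a low bit of x can change many output bits, while flipping the top bit of x changes at most the top output bit, since the product is reduced modulo 2^64.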
The basic counterargument is that integer multiplication takes several cycles
on the fastest CPUs, and many more cycles on other CPUs. For comparison, a
comparably complex series of simple integer operations is always reasonably fast.
Multiplication might be slightly faster on some CPUs but it is not consistently
faster.
I do like the amount of mixing provided by multiplication, and I’m impressed
with the fast multiplication circuits included (generally for non-cryptographic
reasons) in many CPUs, but the potential speed benefits don’t seem big enough
to outweigh the massive speed penalty on other CPUs. Similar comments apply
to 64-bit additions, to 32-bit multiplications, and to variable-distance
(“data-dependent”) rotations.
A further argument against integer multiplication is that it increases the risk
of timing leaks. What really matters is not the speed of integer multiplication,
but the speed of constant-time integer multiplication, which is often much slower.
Example: On the Motorola PowerPC 7450 (G4e), a fairly common general-
purpose CPU, the mull multiplication instruction usually takes 2 cycles (with
4-cycle latency), but it takes only 1 cycle (with 3-cycle latency) if “the 15 msbs
of the B operand are either all set or all cleared.” See [1, page 6.45]. The same is
true for the 8641D, the newest CPU in the same family. It is possible to eliminate
the timing leak on these CPUs by, e.g., using the floating-point multiplier, but
moving data back and forth to floating-point registers costs CPU cycles, not to
mention extra programming effort.
2.3 Should there be S-box lookups?
An S-box lookup is an array lookup using an input-dependent index. Most
ciphers are designed to take advantage of this operation. For example, typical
high-speed AES software has several 1024-byte S-boxes, each of which converts
8-bit inputs to 32-bit outputs.
The basic argument for S-boxes is that a single table lookup can mangle its
input quite thoroughly—more thoroughly than a chain of a few simple integer
operations taking the same amount of time.
The basic counterargument is that a simple integer operation takes one or
two 32-bit inputs rather than one 8-bit input, so it effectively mangles several
8-bit inputs at once. It is not obvious that a series of S-box lookups—even with
rather large S-boxes, as in AES, increasing L1 cache pressure on large CPUs
and forcing different implementation techniques for small CPUs—is faster than
a comparably complex series of integer operations.
A further argument against S-box lookups is that, on most platforms, they are
vulnerable to timing attacks. NIST’s statement to the contrary in [19, Section
3.6.2] (table lookup is “not vulnerable to timing attacks”) is erroneous. It is
extremely difficult to work around this problem without sacrificing a tremendous
amount of speed. See my paper  for much more information on this topic,
including an example of successful remote extraction of a complete AES key.
For me, the timing-attack problem is decisive. For any particular security
level, I’m not sure whether adding S-box lookups would gain speed, but I’m sure
that adding constant-time S-box lookups would not gain speed.
Salsa20 is certainly not the first cipher without S-boxes. The Tiny Encryption
Algorithm, published by Wheeler and Needham in , is a classic example of a
reduced-instruction-set cipher: it is a long chain of 32-bit shifts, 32-bit xors, and
32-bit additions. IDEA, published by Lai, Massey, and Murphy in , is even
older and almost as simple: it is a long chain of 16-bit additions, 16-bit xors, and
multiplications modulo 2^16 + 1.
2.4 Should there be fewer rotations?
Rotations account for about 1/3 of the integer operations in Salsa20. If rotations
are simulated by shift-shift-xor (as they are on the UltraSPARC and with XMM
instructions) then they account for about 1/2 of the integer operations in Salsa20.
Replacing some of the rotations with a comparable number of additions might
achieve comparable diffusion in less time.
The reader may be wondering why I used rotations rather than shifts. The
basic argument for rotations is that one xor of a rotated quantity provides as
much diffusion as two xors of shifted quantities. There does not appear to be
a counterargument. Rotate-xor is faster than shift-shift-xor-xor on many CPUs
and is never slower.
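The comparison between rotate-xor and shift-shift-xor-xor can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration. Both functions produce the same word, because the two shifted pieces of y occupy disjoint bit positions; the first form simply needs fewer operations where a rotation instruction is available.

```python
MASK32 = 0xFFFFFFFF


def rotl32(a: int, k: int) -> int:
    """Rotate the 32-bit word a left by k bits (0 < k < 32)."""
    return ((a << k) | (a >> (32 - k))) & MASK32


def rotate_xor(x: int, y: int, k: int) -> int:
    """One rotation followed by one xor."""
    return x ^ rotl32(y, k)


def shift_shift_xor_xor(x: int, y: int, k: int) -> int:
    """Two shifts and two xors computing the same result without a rotation."""
    left = (y << k) & MASK32   # low bits of y moved up
    right = y >> (32 - k)      # high bits of y moved down
    return x ^ left ^ right
```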
3 High level: How do blocks interact?
3.1 What does Salsa20 do?
Salsa20 expands a 256-bit key and a 64-bit nonce (unique message number) into
a 2^70-byte stream. It encrypts a b-byte plaintext by xor’ing the plaintext with
the first b bytes of the stream and discarding the rest of the stream. It decrypts
a b-byte ciphertext by xor’ing the ciphertext with the first b bytes of the stream.
There is no feedback from the plaintext or ciphertext into the stream.
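The xor step described above can be sketched in Python as follows. The stream here is a fixed placeholder byte string, which is an assumption of this sketch and offers no security; a real stream would come from the Salsa20 expansion of the key and nonce. The name xor_with_stream is hypothetical.

```python
def xor_with_stream(data: bytes, stream: bytes) -> bytes:
    """Xor data with the first len(data) bytes of stream."""
    if len(stream) < len(data):
        raise ValueError("stream is shorter than the message")
    # zip stops after len(data) bytes, so the rest of the stream is discarded.
    return bytes(d ^ s for d, s in zip(data, stream))


# Placeholder stream for illustration only; NOT a secure keystream.
placeholder_stream = bytes(range(64))

plaintext = b"attack at dawn"
ciphertext = xor_with_stream(plaintext, placeholder_stream)
recovered = xor_with_stream(ciphertext, placeholder_stream)
```

Encryption and decryption are the same function in this sketch, since xor’ing twice with the same stream bytes restores the original data.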
Salsa20 generates the stream in 64-byte (512-bit) blocks. Each block is an
independent hash of the key, the nonce, and a 64-bit block number; there is no
chaining from one block to the next. The Salsa20 output stream can therefore
be accessed randomly, and any number of blocks can be computed in parallel.
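The random-access property can be illustrated with a Python sketch in which each 64-byte block depends only on the key, the nonce, and the block number. The block function below is a stand-in built from SHA-512 (chosen only because its output happens to be 64 bytes); it is not the Salsa20 hash, and the little-endian counter encoding and the names stand_in_block and stream_bytes are assumptions of this sketch.

```python
import hashlib

BLOCK_BYTES = 64  # Salsa20 generates the stream in 64-byte blocks


def stand_in_block(key: bytes, nonce: bytes, block_number: int) -> bytes:
    """Placeholder block function: NOT Salsa20, just a 64-byte hash."""
    counter = block_number.to_bytes(8, "little")  # assumed 64-bit encoding
    return hashlib.sha512(key + nonce + counter).digest()


def stream_bytes(key: bytes, nonce: bytes, offset: int, length: int) -> bytes:
    """Return stream[offset : offset + length], computing only the blocks needed."""
    if length <= 0:
        return b""
    first = offset // BLOCK_BYTES
    last = (offset + length - 1) // BLOCK_BYTES
    # Each block is independent, so these could be computed in any order
    # or in parallel; no earlier block is required.
    blocks = b"".join(stand_in_block(key, nonce, n) for n in range(first, last + 1))
    start = offset - first * BLOCK_BYTES
    return blocks[start:start + length]
```

Reading 50 bytes at offset 100 in this sketch touches only blocks 1 and 2; block 0 is never computed.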
There are no hidden preprocessing costs in Salsa20. In particular, Salsa20
does not preprocess the key before generating a block; each block uses the key
directly as input. Salsa20 also does not preprocess the nonce before generating
a block; each block uses the nonce directly as input.
3.2 Should encryption and decryption be different?
The most common model of a stream cipher is that each ciphertext block is the
xor of the plaintext block and the stream block at the same position. Each stream
block is determined by its position, the nonce, the key, and the previous blocks
of plaintext—equivalently, the previous blocks of ciphertext. Salsa20 follows this
model, as does any block cipher in counter mode, OFB mode, CFB mode, et al.