Content uploaded by Herman Te Riele

Author content

All content in this area was uploaded by Herman Te Riele

Content may be subject to copyright.

Factorization of a 768-bit RSA modulus

version 1.3, January 24, 2010

Thorsten Kleinjung1,

Kazumaro Aoki2, Jens Franke3, Arjen K. Lenstra1, Emmanuel Thomé4,

Joppe W. Bos1, Pierrick Gaudry4, Alexander Kruppa4, Peter L. Montgomery5,6,

Dag Arne Osvik1, Herman te Riele6, Andrey Timofeev6, and Paul Zimmermann4

1EPFL IC LACAL, Station 14, CH-1015 Lausanne, Switzerland

2NTT, 3-9-11 Midori-cho, Musashino-shi, Tokyo, 180-8585 Japan

3University of Bonn, Department of Mathematics, Beringstraße 1, D-53115 Bonn, Germany

4INRIA CNRS LORIA, Équipe CARAMEL - bâtiment A, 615 rue du jardin botanique,

F-54602 Villers-lès-Nancy Cedex, France

5Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

6CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands

Abstract. This paper reports on the factorization of the 768-bit number RSA-768 by the num-

ber ﬁeld sieve factoring method and discusses some implications for RSA.

Keywords: RSA, number ﬁeld sieve

1 Introduction

On December 12, 2009, we factored the 768-bit, 232-digit number RSA-768 by the number

ﬁeld sieve (NFS, [19]). The number RSA-768 was taken from the now obsolete RSA Challenge

list [37] as a representative 768-bit RSA modulus (cf. [36]). This result is a record for factoring

general integers. Factoring a 1024-bit RSA modulus would be about a thousand times harder,

and a 768-bit RSA modulus is several thousands times harder to factor than a 512-bit one.

Because the ﬁrst factorization of a 512-bit RSA modulus was reported only a decade ago

(cf. [7]) it is not unreasonable to expect that 1024-bit RSA moduli can be factored well within

the next decade by an academic eﬀort such as ours or the one in [7]. Thus, it would be prudent

to phase out usage of 1024-bit RSA within the next three to four years.

The previous record NFS factorization was that of the 663-bit, 200-digit number RSA-200

(cf. [4]), established on May 9, 2005. That 663-bit and the current 768-bit record NFS fac-

torizations should not be confused with record special NFS (SNFS) factorizations, the most

recent of which was the complete factorization of the 1039-bit number 21039 −1, obtained in

the spring of 2007 (cf. [2]). Although 21039 −1is much bigger than RSA-768, its special form

made it an order of magnitude easier to factor than RSA-768.

The following eﬀort was involved. We spent half a year on 80 processors on polynomial selec-

tion. This was about 3% of the main task, the sieving, which was done on many hundreds of

machines and took almost two years. On a single core 2.2 GHz AMD Opteron processor with

2 GB RAM per core, sieving would have taken about ﬁfteen hundred years. This included a

generous amount of oversieving, to make the most cumbersome step, the matrix step, more

manageable. Preparing the sieving data for the matrix step took a couple of weeks on a few

processors, the ﬁnal step after the matrix step took less than half a day of computing, but

took about four days of intensive labor because a few bugs had to be ﬁxed.

It turned out that we had done about twice the sieving strictly necessary to obtain a usable

matrix, and that the extra data allowed generation of a matrix that was quite a bit easier than

anticipated at the outset of the project. Although we spent more computer time on the sieving

than required, sieving is a rather laid back process that, once running, does not require much

care beyond occasionally restarting a machine. The matrix step, on the other hand, is a more

subtle aﬀair where a slight disturbance easily causes major trouble, in particular if the problem

is by its sheer size stretching the available resources. Thus, our approach to overspend on an

easygoing part of the computation led to a matrix that could be handled relatively smoothly,

thereby saving us considerable headaches. More importantly, and another reason behind the

oversieving, the extra sieving data allow us to conduct various experiments aimed at getting

a better understanding about the relation between sieving and matrix eﬀorts and the eﬀect

on NFS feasibility and overall performance. This is ongoing research, the results of which will

be reported elsewhere. All in all, the extra sieving cycles were well spent.

For 21039 −1the matrix step was, for the ﬁrst time, performed on a number of diﬀerent clusters:

as reported in [2], the main part of the calculation was run in parallel at two locations on four

clusters. This is possible due to our usage of the block Wiedemann algorithm for the matrix

step (cf. [9]), where the main calculation consists of two consecutive sequences of matrix times

vector multiplications. As discussed in [2], a greater degree of independent parallelization leads

to a variety of challenges that need to be overcome before larger problems can be tackled

(such as, for instance, 1024-bit moduli). Here we set a ﬁrst step toward a scalable solution by

showing how greater ﬂexibility can be achieved in the number and contributions of the various

independent clusters. As a result a nine times harder (than for 21039 −1) matrix step was solved

in less than four months running up to eight independent jobs on clusters located in France,

Japan, and Switzerland. During one of the substeps about a terabyte of memory was needed.

These ﬁgures imply that much larger matrices are already within reach, leaving preciously

little doubt about the feasibility by the year 2020 of a matrix required for a 1024-bit NFS

factorization. As part of the experiments mentioned above we also intend to study if a single

large cluster would be able to handle such matrices using the block Lanczos algorithm (cf. [8]).

Compared to block Wiedemann this has advantages (a shorter, single sequence of iterations

and no tedious and memory-hungry central Berlekamp-Massey step [40]) but disadvantages

as well (it cannot be run on separate clusters and each iteration consists of a multiplication

by both a matrix and its transpose).

The factorization problem and the steps taken to solve it using NFS are described in Section 2.

The factorization is presented in Section 2.5. Implications for moduli larger than RSA-768 are

brieﬂy discussed in the concluding Section 3. Appendix A presents some of the details of the

sieving approach that we used, and Appendix B describes a new twist of the block Wiedemann

algorithm that makes it easier to share large calculations among diﬀerent parties.

There are many aspects to an eﬀort as described in this paper that are not interesting from a

scientiﬁc or research point of view, but that are nevertheless crucial for its success. Although

NFS can be run as a BOINC project (cf. [30]), we chose not to do so. Instead, we gathered

a small but knowledgeable group of contributors who upfront committed themselves to ﬁnish

the project, while dedicating at least a known minimum of computational resources for a

substantial period of time. This allowed us to target a reasonable completion date, a date

2

that we managed to meet easily despite our ample oversieving. Although diﬀerent sieving-

clients do not need to communicate, each client needs to communicate a fair amount of data

to the central storage location. This required a bit more organizational eﬀorts than expected,

occasional recovery from mishaps such as unplugged network cables, switched oﬀ servers, or

faulty raids, and a constantly growing farm of backup drives. We do not further comment

on these more managerial issues in this article, but note that larger eﬀorts of this sort would

beneﬁt from full-time professional supervision.

2 Factoring RSA-768

2.1 Factoring using the Morrison-Brillhart approach

A composite integer ncan be factored by ﬁnding integer solutions x, y of the congruence

of squares x2≡y2mod n, and by hoping that nis factored by writing it as a product

gcd(x−y, n)·gcd(x+y, n). For a random such pair the probability is at least 1

2that a

non-trivial factor of nis found in this way. The Morrison-Brillhart approach [25] solves the

equation x2≡y2mod nby combining congruences of smooth squares. In this section this is

explained and, very rudimentarily, how NFS applies the same basic idea.

A non-zero integer uis b-smooth if the prime factors of |u|are at most b. Each b-smooth

integer ucorresponds to the (π(b) + 1)-dimensional integer vector v(u)of exponents of the

primes ≤bin its factorization, where π(b)is the number of primes ≤band the “+1” accounts

for inclusion of the exponent of −1. The factor base consists of the primes at most equal to

the smoothness bound.

Let nbe a composite integer, ba smoothness bound, and ta positive integer. Suppose that

π(b) + 1 + tdiﬀerent integers vhave been found such that the least absolute remainders

r(v) = v2mod nare b-smooth. Because the corresponding (π(b) + 1)-dimensional vectors

v(r(v)) are linearly dependent, at least tindependent subsets Sof the v’s can be found using

linear algebra such that, for each of these subsets, Pv∈Sv(r(v)) is an all-even vector. Each

subset Sthen leads to a solution x=Qv∈Svand y=pQv∈Sr(v)to x2≡y2mod nand

thus, overall, tchances of at least 1

2to factor n.

If v’s as above are generated by picking random integers and hoping that the least absolute

remainder modulo nof their square is smooth, the search for the proper v’s depends on the

smoothness probability of integers of the same order of magnitude as n. That is Dixon’s

random squares method (cf. [11]), which has the advantage that its expected runtime can

rigorously be proved. It was preceded by Morrison and Brillhart (cf. [25]) who had already

obtained a much higher smoothness probability, and thus a much faster – but heuristic –

factoring method, by using continued fractions to generate quadratic residues modulo nof

roughly half the size of n, i.e., of order n1/2. Richard Schroeppel with his linear sieve was the

ﬁrst, in about 1976, to combine essentially the same high smoothness probability with a fast

sieving-based method to recognize the smooth numbers (cf. [33, Section 6]). He was also the

ﬁrst to analyze the resulting expected runtime (cf. [17, Section 4.2.6]). Later this culminated

in Carl Pomerance’s quadratic sieve (cf. [33,34]) which, like Schroeppel’s method, relies on

fast sieving based recognition of smooth quadratic residues of rough order n1/2.

3

Factoring methods of this sort that rely on smoothness of residues of order nθ(1) can be shown

to be stuck at expected runtimes of the form

e(c+o(1))(ln n)1/2(ln ln n)1/2,

for positive constants cand asymptotically for n→ ∞. The number ﬁeld sieve [19] was the

ﬁrst, and so far the only, practical factoring method to break through the barrier of the

ln n-exponent of 1

2. This is achieved by looking at more contrived congruences that involve

smoothness of numbers of order no(1), for n→ ∞, that can, as usual, be combined into a

congruence of squares x2≡y2mod n. As a result, the number ﬁeld sieve factors a composite

integer nin heuristic expected time

e((64/9)1/3+o(1))(ln n)1/3(ln ln n)2/3,

asymptotically for n→ ∞. It is currently the algorithm of choice to factor numbers without

special properties, such as those on the RSA Challenge List [37]. Of this list, we concentrate

on the number RSA-768, a 768-bit RSA modulus with 232-digit decimal representation

123018668453011775513049495838496272077285356959533479219732245215172640050726

365751874520219978646938995647494277406384592519255732630345373154826850791702

6122142913461670429214311602221240479274737794080665351419597459856902143413.

Similar to Schroeppel’s linear sieve, the most important steps of NFS are sieving and the matrix

step. In the former relations are collected, congruences involving smooth values similar to the

smooth r(v)-values above. In the latter linear dependencies are found among the exponent

vectors of the smooth values. Unlike Schroeppel’s method, however, NFS requires two non-

trivial additional steps: a pre-processing polynomial selection step before the sieving can start,

and a post-processing square root step to convert the linear dependencies into congruences of

squares. A rough operational description of these steps as applied to RSA-768 is given in the

remainder of this section. For an explanation why these steps work, the reader may consult

one of the many expositions on NFS in the literature (cf. [19,20,35]).

2.2 Polynomial selection

Let nbe a composite integer to be factored. In NFS relations are given by coprime pairs of

integers (a, b)with b > 0such that two integer values v1(a, b)and v2(a, b)that depend on a, b

are simultaneously smooth, v1(a, b)with respect to some bound b1and v2(a, b)with respect to

some bound b2. The values v1(a, b)and v2(a, b)are deﬁned as follows. Let f1(X), f2(X)∈Z[X]

be two irreducible integer polynomials of degrees d1and d2, respectively, with a common

root mmodulo n, i.e., f1(m)≡f2(m)≡0 mod n. For our simpliﬁed exposition we assume

that f1and f2are monic, despite the fact that the actual f1and f2will not be monic. Then

vk(a, b) = bdkfk(a/b)∈Z. Suﬃciently many more than π(b1) + π(b2)+2 relations lead to

enough chances to factor n(cf. [19]), as sketched in the next two paragraphs.

Let Q(αk) = Q[X]/(fk(X)) for k= 1,2be two algebraic number ﬁelds. The elements a−bαk∈

Z[αk]have norm vk(a, b)and belong to the ﬁrst degree prime ideals in Q(αk)of (prime) norms

equal to the prime factors of vk(a, b). These prime ideals in Q(αk)correspond bijectively to the

4

pairs (p, r mod p)where pis prime and fk(r)≡0 mod p: excluding factors of fk’s discriminant,

a ﬁrst degree prime ideal corresponding to such a pair (p, r mod p)has norm pand is generated

by pand r−αk.

Because fk(m)≡0 mod n, there are two natural ring homomorphisms φk:Z[αk]→Z/nZfor

k= 1,2that map Pdk−1

i=0 aiαi

k(with ai∈Z) to Pdk−1

i=0 aimimod nand for which φ1(a−bα1)≡

φ2(a−bα2) mod n. Finding a linear dependency modulo 2 among the vectors consisting of

the exponents of the primes in the b1-smooth v1(a, b),b2-smooth v2(a, b)pairs (and after,

at some stage, slightly extending the vectors to make sure that the square norms lead to

squares of algebraic numbers, cf. [1]), subsets Sof the set of relations are constructed such

that Q(a,b)∈S(a−bαk)is a square σkin Q(αk), both for k= 1 and for k= 2. With φ1(σ1)≡

φ2(σ2) mod nit then remains to compute square roots τk=σ1/2

k∈Q(αk)for k= 1,2to ﬁnd

a solution x=φ1(τ1)and y=φ2(τ2)to x2≡y2mod n.

It is easy to ﬁnd polynomials f1and f2that lead to smoothness of numbers of order no(1), for

n→ ∞. Let d1∈Nbe of order (3 ln n

ln ln n)1/3, let d2= 1, let mbe an integer slightly smaller

than n1/d1, and let n=Pd1

i=0 nimiwith 0≤ni< m be the radix mrepresentation of n. Then

f1(X) = Pd1

i=0 niXiand f2(X) = X−mhave coeﬃcients that are no(1) for n→ ∞, they

have mas common root modulo n, and for the particular choice of d1the values a,bthat suﬃce

to generate enough relations are small enough to keep bd1f1(a/b)and bd2f2(a/b)of order no(1)

as well. Finally, if f1is not irreducible, it can be used to directly factor nor, if that fails, one of

its factors can be used instead of f1. If d1>1and d2= 1 we refer to “k= 1” as the algebraic

side and “k= 2” as the rational side. With d2= 1 the algebraic number ﬁeld Q(α2)is simply

Q, the ﬁrst degree prime ideals in Qare the regular primes and, with f2(X) = X−m, the

element a−bα2of Z[α2]is a−bm =v2(a, b)∈Zwith φ2(a−bα2) = a−bm mod n.

Although with the above polynomials NFS achieves the asymptotic runtime given in the

previous section, the amount of freedom in the choice of m,f1, and f2means that from many

diﬀerent choices the best may be selected. Here we keep interpretation of the term “best”

intuitive and make no attempt to deﬁne it precisely. When comparing diﬀerent possibilities,

however, it is usually obvious which one is better. Most often this is strongly correlated with

the smoothness probability over the relevant range of a, b pairs, and thus with the size of

the coeﬃcients and the number of roots modulo small primes, smoothness properties of the

leading coeﬃcients, and the number of real roots. Sieving experiments may be carried out to

break a tie.

Given this somewhat inadequate explanation of what we are looking for, the way one should

be looking for good candidates is the subject of active research. Current best methods involve

extensive searches, are guided by experience, helped by luck, and proﬁt from patience. Only

one method is known that produces two good polynomials of degrees greater than one (namely,

twice degree two, cf. [5]), but its results are not competitive with the current best d1>1,

d2= 1 methods which are all based on reﬁnements described in [15] of the approach from [24]

and [26] as summarized in [7, Section 3.1]. A search of three months on a cluster of 80 Opteron

cores (i.e., 3

12 ·80 = 20 core years), conducted at BSI in 2005 already and thus not including

the idea from [16], produced three pairs of polynomials of comparable quality. We used the

5

pair

f1(X) = 265482057982680X6

+ 1276509360768321888X5

−5006815697800138351796828X4

−46477854471727854271772677450X3

+ 6525437261935989397109667371894785X2

−18185779352088594356726018862434803054X

−277565266791543881995216199713801103343120,

f2(X) = 34661003550492501851445829X−1291187456580021223163547791574810881

from which the common root mfollows as the negative of the ratio modulo RSA-768 of the

constant and leading coeﬃcients of f2. The leading coeﬃcients factor as 23·32·5·72·11 ·

17 ·23 ·31 ·112 877 and 13 ·37 ·79 ·97 ·103 ·331 ·601 ·619 ·769 ·907 ·1 063, respectively.

The discriminant of f1equals 212 ·32·52·13 ·17 ·17 722 398 737 ·c273, where c273 denotes a

273-digit composite number that probably does not have a factor of 30 digits or less and that

is expected to be square-free. The discriminant of f2equals one. A renewed search at EPFL in

the spring of 2007 (also not using the idea from [16]), after we had decided to factor RSA-768,

produced a couple of candidates of similar quality, again after spending about 20 core years.

Following the algorithm from [15], polynomials with the following properties were considered:

the leading coeﬃcient of f2allowed 11 (ﬁrst search at BSI) or 10 (second search at EPFL)

prime factors equal to 1 mod 6 with at most one other factor <215.5, and the leading coeﬃcient

of f1was a multiple of 258 060 = 22·3·5·11·17·23. Overall, at least 2·1018 pairs of polynomials

were considered.

Given the polynomials, smoothness bounds have to be selected. Also, we need to specify a

sieving region Sof Z×Z>0where the search for relations will be conducted. We conclude this

section with a discussion of this ﬁnal part of the parameter selection process.

During any practical search for smooth numbers, one always encounters near misses: candidates

that are smooth with the exception of a small number of prime factors that are somewhat larger

than the smoothness bound. Not only can these large primes, as these factors are commonly

referred to, be recognized at no or little extra cost, they also turn out to be useful in the

factorization process. As a result the smoothness bounds can be chosen substantially lower

than they would have to be chosen otherwise and relation collection goes faster while requiring

less memory. On the negative side, it is somewhat harder to decide whether enough relations

have been found – as the simple criterion that more than π(b1) + π(b2) + 2 are needed is no

longer adequate – but that is an issue of relatively minor concern. In principle the decision

requires duplicate removal, repeated singleton removal, and counting, all brieﬂy touched upon

at the end of Section 2.3. In practice it is a simple matter of experience.

Let b`denote the upper bound on the large primes that we want to keep, which we assume

to be the same for the algebraic and rational sides. Thus, b`≥max(b1, b2). We say that a

non-zero integer xis (bk, b`)-smooth if with the exception of, say, four prime factors between

bkand b`, all remaining prime factors of |x|are at most bk. Consequently, we change the

deﬁnition of a relation into a coprime pair of integers a, b with b > 0such that bd1f1(a/b)is

(b1, b`)-smooth and bd2f2(a/b)is (b2, b`)-smooth.

6

Selection of the smoothness bounds b1,b2, and b`, of the search area Sfor a, b, and how

many primes between b1or b2and b`will be permitted and looked for is also mostly guided

by experience, and supported by sieving experiments to check if the yield is adequate. For

RSA-768 we used smoothness bounds b1= 11 ·108,b2= 2 ·108and b`= 240 on cores with

at least 2 GB RAM (which turned out to be the majority). On cores with less memory (but

at least a GB RAM, since otherwise it would not be worth our trouble), b1= 4.5·108and

b2= 108were used instead (with the same b`). For either choice it was expected that it would

suﬃce to use as sieving region the bounded subset Sof Z×Z>0containing about 11 ·1018

coprime pairs a, b with |a| ≤ 3·109·κ1/2≈6.3·1011 and 0< b ≤3·109/κ1/2≈1.4·107.

Here κ= 44 000 approximates the skewness of the polynomial f1, and serves to approximately

minimize the largest norm v1(a, b)that can be encountered in the sieving region. Although

prime ideal norms up to b`= 240 were accepted, the parameters were optimized for norms

up to 237. Most jobs attempted to split after the sieving algebraic and rational cofactors up

to 2140 and 2110, respectively, only considering the most promising candidates (cf. [14]). As

far as we know, this was the ﬁrst NFS factorization where more than three large primes were

allowed on the algebraic side.

2.3 Sieving

With the parameters as chosen above, we need to ﬁnd relations: coprime pairs of inte-

gers (a, b)∈Ssuch that bd1f1(a/b)is (b1, b`)-smooth and bd2f2(a/b)is (b2, b`)-smooth. In

this section we describe this process.

A prime pdividing fk(r)is equivalent to (rmod p)being a root of fkmodulo p(as already

noted above with the ﬁrst degree prime ideals, and disregarding prime factors of fk’s discrimi-

nant). Because d2= 1, the polynomial f2has one root modulo pfor each prime pnot dividing

the leading coeﬃcient of f2, and each such prime pdivides f2(j)once every pconsecutive val-

ues of j. For f1the number of roots modulo pmay vary from 0(if f1has no roots modulo p)

to d1(if f1has d1linear factors modulo p). Thus, some primes do not divide f1(j)for all j,

whereas other primes pmay divide f1(j)a total of d1times for every pconsecutive j-values.

All (prime,root) pairs with prime at most b1for f1and at most b2for f2can be precomputed.

Given a (prime,root) pair (p, r)for fk, the prime pdivides bdkfk(a/b)if a/b ≡rmod p, i.e.,

for a ﬁxed bwhenever a=rb mod p. This leads to the line sieving approach where per b-

value for each (p, r)pair for f1(or f2) one marks with “ p” all relevant a-values (of the form

rb +ip for i∈Z). After this sieving step, the good locations are those that have been marked

for many diﬀerent p’s. These locations are remembered, after which the process is repeated

for f2(or f1), and the intersection of the new good locations with the old ones is further

studied, since those are the locations where relations may be found. With 1.4·107b-values

to be processed (the lines) and about 2·κtimes that many a-values per b, the lines can be

partitioned among almost any number of diﬀerent processors a factoring eﬀort of this limited

scale can get access to. This is how the earliest implementations of distributed NFS sieving

worked.

A more eﬃcient but also more complicated approach has gained popularity since the mid 1990s:

the lattice sieve as described in [32]. For a (prime,root) pair q= (q, s)deﬁne Lqas the lattice

of integer linear combinations of the 2-dimensional integer (row-)vectors (q, 0),(s, 1) ∈Z2,

and let Sqbe the intersection of Sand Lq. Fix a (prime,root) pair q= (q, s)for, say, f1. The

7

special prime q(as this lattice deﬁning large prime was referred to in [32]) is typically chosen

of the same order of magnitude as b1, possibly bigger – but smaller than b`. It follows that

qdivides bd1f1(a/b)for each (a, b)∈Sq. Lattice sieving consists of marking, for each other

(prime,root) pair pfor f1for which the prime is bounded by b1, the points in the intersection

Lp∩Sq. The good locations are remembered, after which the process is repeated for all the

(prime,root) pairs for f2with prime bounded by b2, and the intersection of the new good

locations with the old ones is further studied. In this way relations in Sqare found. Thus, for

each of these relations it is the case that qdivides v1(a, b), and the lattice sieve is repeated

for other (prime,root) pairs quntil enough relations have been found.

A relation thus found and involving some prime q06=qdividing v1(a, b)may also be found

when lattice sieving with q0= (q0, s0). Thus, unlike line sieving, when lattice sieving duplicates

will be found: these have to be removed (while, obviously, keeping one copy).

In practice one ﬁxes bounds Iand Jindependent of qand deﬁnes Sq={iu +jv :i, j ∈

Z,−I/2≤i < I/2,0< j < J }, where u, v form a basis for Lqthat minimizes the norms

v1(a, b)for (a, b)∈Sq. Such a basis is found by partially reducing the initial basis (q, 0),(s, 1)

for Lqsuch that, roughly speaking, the ﬁrst coordinate is about κtimes bigger than the second,

thereby minimizing the norms v1(a, b)for (a, b)∈Sq, cf. skewness of the sieving region S.

Actual sieving is carried out over the set {(i, j)∈Z×Z>0:−I/2≤i < I/2,0< j < J},

interpreted as Sqin the above manner.

For RSA-768 we used I= 216 and J= 215 which implies that we did lattice sieving over

an area of size roughly 231 ≈2·109(we did not use any line sieving). With b1= 11 ·108

and b2= 2 ·108, the majority of the primes that are being sieved with can be expected to

hit Sqonly a few times. Thus, for any pthat one will be sieving with, only a few of the

j-values (the lines) will be hit, unlike the situation in plain line sieving where the lines are

so long that each line will be hit several times by each prime. It implies that, when lattice

sieving, one cannot aﬀord to look at all lines 0< j < J per p, and that a sieving method

more sophisticated than line sieving must be used. This sieving by vectors, as it was referred

to in [32], was ﬁrst implemented in [13] and used for many factorizations in the 1990s, such

as those reported in [10] and [7]. We used a particularly eﬃcient implementation of sieving by

vectors, the details of which have never appeared in print (cf. [12]). They are summarized in

Appendix A.

Lattice sieving with sieving by vectors as above was done for most of the about 480 million

(prime,root) pairs (q, s)for special qs between 450 million and 11.1 billion (and some spe-

cial qs below 450 million, with a smaller b1-value) by eight contributing parties during the

period August 2007 until April 2009. Table 1 lists the percentages contributed by the eight

participants. Scaled to a 2.2 GHz Opteron core with 2 GB RAM, a single (prime,root) pair

(q, s)was processed in less than a hundred seconds on average and produced about 134 rela-

tions, for an average of about four relations every three seconds. This average rate varies by

a factor of about two between both ends of the special qrange that we considered.

In total 64 334 489 730 relations were collected, each requiring about 150 bytes. Compressed

they occupied about 5 terabytes of disk space, backed up at various locations. Of these rela-

tions, 27.4% were duplicates. Duplicate removal can simply be done using bucket sorting (on

b-values, for instance) followed by application of UNIX commands sort and uniq, but may

be complicated by odd data formats, and may be done in any way one deems convenient. We

8

name of contributor relations matrix

contribution stage 1 stage 2 stage 3 total

Bonn (University and BSI) 8.14%

CWI 3.44%

EPFL 29.78% 34.3% 100% 78.2% 51.9%

INRIA LORIA (ALADDIN-G5K) 37.97% 46.8% 17.3% 35.0%

NTT 15.01% 18.9% 4.5% 13.1%

Scott Contini (Australia) 0.43%

Paul Leyland (UK) 0.69%

Heinz Stockinger (Enabling Grids for E-sciencE) 4.54%

Table 1: Percentages contributed (with matrix stages 1, 2, and 3 contributing 3

5th, 0, and 2

5th to total).

used hashing. Most uniqueing was done during the sieving. Overall it took less than 10 days

on a 2.66 GHz Core2 processor with ten 1TB hard disks. After inclusion of 57 223 462 free

relations (cf. [19]) the uniqueing resulted in 47 762 243 404 relations involving 35 288 334 017

prime ideals.

Given the set of unique relations, those that involve a prime ideal that does not occur in

any other relation, the singletons, cannot be part of a dependency and can thus be removed.

Recognizing singletons can be done using a simple hash table with a count (counting “0”,

“1”, “many”). Doing this once reduced the set of relations to 28 984 986 047 elements, with

14 498 007 183 prime ideals. However, removal of singletons usually leads to new singletons,

and singleton removal must be repeated until no more relations are removed. Note that the ﬁrst

(usually high volume) runs can be made to accept diﬀerent prime ideals with the same hash to

lower memory requirements for the hash table. Once singletons have been removed, we have

to make sure that the number of remaining relations is larger than the total number of prime

ideals occurring in them, so that there will indeed be dependencies among the exponent vectors.

After a few more singleton removals 24 615 168 385 relations involving at most 9 976 671 468

prime ideals were left.

At this point further singleton removal iterations were combined with clique removal, i.e.,

search of combinations with matching ﬁrst degree prime ideals of norms larger than bk. Re-

moval of almost all singletons together with clique removal (cf. [6]) led to 2 458 248 361 relations

involving a total of 1 697 618 199 prime ideals, and still containing an almost insigniﬁcant num-

ber 604 423 of free relations. This not only clearly indicated that we had done enough sieving,

but also gave us lots of ﬂexibility while creating the matrix. In the next section it is described

how dependencies were found. Overall singleton removal and clique ﬁnding took less than 10

days on the same platform as used for the uniqueing.

2.4 The matrix step

Finding dependencies among the rows of a sparse matrix using block Wiedemann or block

Lanzcos, currently the two favorite methods, takes time proportional to the product of the

dimension and the weight (i.e., number of non-zero entries) of the matrix. Merging is a generic

term for the set of strategies developed to build a matrix for which close to optimal performance

can be expected for the dependency search of one’s choice. It is described in [6]. We ran about

10 merging jobs, aiming for various optimizations (low dimension, low weight, best-of-both,

etc.), which each took a couple of days on a single core per node of a 37 node 2.66 GHz

9

Core2 cluster with 16 GB RAM per node, and a not particularly fast interconnection network.

The matrix that we ended up using was generated by a 5-day process running on two to

three cores on the same 37-node cluster, where the long duration was probably due to the

larger communication overhead. It produced a 192 796 550 ×192 795 550-matrix of total weight

27 797 115 920 (thus, on average 144 non-zeros per row). Storage of the matrix required about

105 gigabytes. When we started the project, we expected dimension about a quarter billion

and density about 150, which would have been about 7

4times harder than what we managed

to ﬁnd (thanks to our large amount of oversieving).

We obtained various other matrices that would all lead to worse performance for the next step,

determining dependencies. For instance, for the full collection of relations we found a matrix

with about 181 million rows and average weight about 171 per row, which is only slightly

worse. Restriction to the set of 1 296 488 663 relations (involving fewer than 1 321 104 619

prime ideals) satisfying the much smaller b`-value of 234 still led, after singleton removal, to

more relations (905 325 141) than prime ideals (894 248 046), as required, and ultimately to a

252 735 215×252 734 215-matrix of weight 37 268 770 998. For b`= 235 there were 3 169 194 001

relations on at most 2 620 011 949 prime ideals, of which 2 487 635 021 relations on 1 892 034 766

prime ideals remained after singleton removal, and which led to a 208 065 007 ×208 064 007-

matrix of weight 31 440 035 830. Neither matrix was competitive.

We used the block Wiedemann algorithm [9] with block width 8·64 to ﬁnd dependencies

modulo 2 among the approximately 193 million rows resulting from the ﬁltering. The details

of this algorithm can be found in [9,40] and [2, Section 5.1]. Below we give a higher level

explanation of the three basic steps of the Wiedemann algorithm (cf. [18, Section 2.19]).

Given a non-singular d×dmatrix Mover the ﬁnite ﬁeld F2of two elements and b∈Fd

2,

we wish to solve the system Mx =b. The minimal polynomial Fof Mon the vector space

spanned by b,Mb,M2b,. . . has degree at most d, so that

F(M)b=

d

X

i=0

FiMib= 0.

From F0= 1 it follows that x=Pd

i=1 FiMi−1b, so it suﬃces to ﬁnd the Fi’s.

Denoting by mi,j the jth coordinate of the vector Mib, it follows from the above equation

that for each jwith 1≤j≤dthe sequence (mi,j)∞

i=0 satisﬁes a linear recurrence relation of

order at most ddeﬁned by the coeﬃcients Fi: for any t≥0and 1≤j≤dwe have that

d

X

i=0

Fimi+t,j = 0.

It is well known that 2d+ 1 consecutive terms of an order dlinear recurrence suﬃce to

compute the coeﬃcients generating the recurrence, and that this can be done eﬃciently using

the Berlekamp-Massey method (cf. [22,40]). Each particular jmay lead to a polynomial of

smaller degree than F, but taking, if necessary, the least common multiple of the polynomials

found for a small number of diﬀerent indices j, the correct minimal polynomial will be found.

Summarizing the above, there are three major stages: a ﬁrst iteration consisting of 2dmatrix

times vector multiplication steps to generate 2d+ 1 consecutive terms of the linear recurrence,

the Berlekamp-Massey stage to calculate the Fi’s, and ﬁnally the second iteration consisting

10

of dmatrix times vector multiplication steps to calculate the solution using the Fi’s. For large

matrices the two iterations, i.e., the ﬁrst and the ﬁnal stage, are the most time consuming.

In practice it is common to use blocking, taking advantage of the fact that on 64-bit machines

64 diﬀerent vectors bover F2can be processed simultaneously, at little or no extra cost

compared to a single vector (cf. [9]), while using the same three main stages. If the vector ¯

b

is 64 bits wide and in the ﬁrst stage, the ﬁrst 64 coordinates of each of the generated 64 bits

wide vectors Mi¯

bare kept, the number of matrix (M) times vector (¯

b) multiplication steps

in both the ﬁrst and the ﬁnal stage is reduced by a factor of 64 compared to the number of

Mtimes bmultiplication steps, while making the central Berlekamp-Massey stage a bit more

cumbersome. It is less common to take the blocking a step further and run both iteration

stages spread over a small number n0of diﬀerent sequences, possibly run on diﬀerent clusters

at diﬀerent locations; in [2] this was done with n0= 4 sequences run on three clusters at two

locations. If for each sequence one keeps the ﬁrst 64 ·n0coordinates of each of the 64 bits

wide vectors they generate during the ﬁrst stage, the number of steps to be carried out (per

sequence) is further reduced by a factor of n0, while allowing independent and simultaneous

execution on possibly n0diﬀerent clusters. After the ﬁrst stage the data generated for the n0

sequences have to be gathered at a central location where the Berlekamp-Massey stage will

be carried out.

While keeping the ﬁrst 64 ·n0coordinates per step for each sequence results in a reduction

of the number of steps per sequence by a factor of 64 ·n0, keeping a diﬀerent number of

coordinates while using n0sequences results in another reduction in the number of steps for

the ﬁrst stage. Following [2, Section 5.1], if the ﬁrst 64 ·m0coordinates are kept of the 64

bits wide vectors for n0sequences, the numbers of steps become d

64·m0+d

64·n0= ( n0

m0+ 1) d

64·n0

and d

64·n0for the ﬁrst and third stage, respectively and for each of the n0sequences. The

choices of m0and n0should be weighed oﬀ against the cost of the Berlekamp-Massey step

with time and space complexities proportional to (m0+n0)3

n0d1+o(1) and (m0+n0)2

n0d, respectively

and for d→ ∞, and where the exponent “3” may be replaced by the matrix multiplication

exponent (our implementation uses “3”).

When actually running the ﬁrst stage of block Wiedemann in this way using n0diﬀerent

sequences, the eﬀect of non-identical resources used for the diﬀerent sequences quickly becomes

apparent: some locations ﬁnish their allotted work faster than others (illustrated in Fig. 1). To

keep the fast contributors busy and to reduce the work of the slower ones (thereby reducing

the wall-clock time), it is possible to let a quickly processed ﬁrst stage sequence continue

for ssteps beyond (n0

m0+ 1) d

64·n0while reducing the number of steps in another ﬁrst stage

sequence by the same s. As described in Appendix B, this can be done in a very ﬂexible

way, as long as the overall number of steps over all ﬁrst stage sequences adds up to n0·

(n0

m0+ 1) d

64·n0. The termination points of the sequences in the third stage need to be adapted

accordingly. This is easily arranged for, since the third stage allows much easier and much

wider parallelization anyhow (assuming checkpoints from the ﬁrst stage are kept). Another

way to keep all parties busy is swapping jobs, thus requiring data exchanges, synchronization,

and more human interaction, making it a less attractive option altogether.

For our matrix with d≈193·106we used, as in [2], m0= 2n0. But where n0= 4 was used in [2],

we used n0= 8, thereby quadrupling the Berlekamp-Massey runtime and doubling its memory

compared to the matrix from [2], on top of the increased runtime and memory demands caused

by the larger dimension of the matrix. On the other hand, the compute intensive ﬁrst and third

11

(a) First stage contributions. (b) Final shot of third stage bookkeeping tool.

Fig. 1: Contributions to sequences 0-7 with blue indicating INRIA, orange EPFL, and pink NTT.

stages could be split up into twice as many independent jobs as before. For the ﬁrst stage on

average (8

16 + 1)193·106

64·8≈565 000 steps needed to be taken per sequence (for 8 sequences), for

the second stage the average was about 193·106

64·8≈380 000 steps. The actual numbers of steps

varied, approximately, between 490 000 and 650 000 for the ﬁrst stage and between 300 000

and 430 000 for the third stage. The calculation of these stages was carried out on a wide

variety of clusters accessed from three locations: a 56-node cluster of 2.2GHz dual hex-core

AMD processors with Inﬁniband at EPFL (installed while the ﬁrst stage was in progress),

a variety of ALADDIN-G5K clusters in France accessed from INRIA LORIA, and a cluster

of 110 3.0GHz Pentium-D processors on a Gb Ethernet at NTT. A comprehensive overview

of clusters and timings is given in Appendix D. It follows from those timings that doing the

entire ﬁrst and third stage would have taken 98 days on 48 nodes (576 cores) of the 56-node

EPFL cluster.

The ﬁrst stage was split up into eight independent jobs run in parallel on those clusters, with

each of the eight sequences check-pointing once every 214 steps. Running a ﬁrst (or third)

stage sequence required 180 GB RAM, a single 64 bits wide ¯

btook 1.5 gigabytes, and a single

mimatrix 8 kilobytes, of which 565 000 were kept, on average, per ﬁrst stage sequence. Each

partial sum during the third stage evaluation required 12 gigabytes.

The central Berlekamp-Massey stage was done in 17 hours and 20 minutes on the 56-node

EPFL cluster (with 16 GB RAM per node) mentioned above, while using just 4 of the 12

available cores per node. Most of the time the available 896 GB RAM suﬃced, but during a

central part of the calculation more memory was needed (up to about 1 TB) and some swapping

occurred. The third stage started right after completion of the second stage, running as many

jobs in parallel as possible. The actual bookkeeping sheet used is pictured in Fig. 1b. Fig. 1a

pictures the ﬁrst stage contributions apocryphally but accurately. Calendar time for the entire

block Wiedemann step was 119 days, ﬁnishing on December 8, 2009. The percentages of the

contributions to the three stages are given more precisely in Table 1.

2.5 That’s a bingo7

There is nothing new to be reported for the square root step, except for the resulting factor-

ization of RSA-768. Nevertheless, and for the record, we present some of the details.

7“Is that the way you say it? “That’s a bingo?” ”

“You just say “bingo”.” (cf. [39])

12

As expected the matrix step resulted in 512 = 64 ·8linear dependencies modulo 2 among the

exponent vectors. This was more than enough to include the quadratic characters, which were

not included in the matrix, at this stage (cf. [1]). As as result, the solution space was reduced

from 512 to 460 elements, giving us 460 independent chances of about 1

2to factor RSA-768.

In the 52 = 512 −460 diﬀerence, a dimension of 46 can be attributed to prime ideals dividing

the leading coeﬃcients or the discriminant that were not included in the matrix.

The square roots of the algebraic numbers were calculated by means of the method from [23]

(see also [31]), which uses the known factorization of the algebraic numbers into small prime

ideals of known norms. The implementation based on [3] turned out to have a bug when

computing the valuations for the free relations of the prime ideals lying above the divisor

17 722 398 737 >232 of the discriminant of f1. Along with a bug in the quadratic character

calculation, this delayed completion of the square root step by a few (harrowing) days. Another

approach would be to write down the algebraic number as a product over the a−bα1in a

dependency and to compute the square root using a mathematics software package. Although

the latter method is attractively simple to program and has been used for small scale NFS

factorizations, it was not considered for the present calculation due to the less appealing size

of the numbers involved. We estimate that writing down the algebraic number would have

required about 36 gigabytes, and that the calculation would ﬁt in 50 to 100 GB RAM. We

did the calculation on the rational side: on a single core of a 2.83GHz Xeon E5440 computing

the six gigabyte rational number took two hours, extracting the squareroot half an hour.

Once the bugs were located and ﬁxed, it took two hours using the hard disk and one core on

each of twelve dual hex-core 2.2GHz AMD processors to compute the exponents of all prime

ideals for eight solutions simultaneously. Computing a square root using the implementation

from [3] took one hour and forty minutes on such a dual hex-core processor. The ﬁrst one (and

four of the other seven) led to the factorization, found at 20:16 GMT on December 12, 2009:

RSA-768 = 3347807169895689878604416984821269081770479498371376856891

2431388982883793878002287614711652531743087737814467999489 ·

3674604366679959042824463379962795263227915816434308764267

6032283815739666511279233373417143396810270092798736308917.

Both factors have 384 bits and 116 decimal digits. The factorizations of the factors ±1can be

found in Appendix C.

3 Concluding remarks

It is customary to conclude a paper reporting a new factoring record with a preview of coming

attractions. For the present case it suﬃces to remark that our main conclusion has already

been reported in [2, Section 7], and was summarized in the ﬁrst paragraph of the introduction

to this paper: at this point factoring a 1024-bit RSA modulus looks more than ﬁve times

easier than a 768-bit RSA modulus looked back in 1999, when we achieved the ﬁrst public

factorization of a 512-bit RSA modulus. Nevertheless, a 1024-bit RSA modulus is still about

one thousand times harder to factor than a 768-bit one. If we are optimistic, it may be possible

to factor a 1024-bit RSA modulus within the next decade by means of an academic eﬀort on

13

the same limited scale as the eﬀort presented here. From a practical security point of view

this is not a big deal, given that standards recommend phasing out such moduli by the end

of the year 2010 (cf. [27,28]). See also [21].

Another conclusion from our work is that we can quite conﬁdently say that if we restrict our-

selves to an open community, academic eﬀort as ours and unless something dramatic happens

in factoring, we will not be able to factor a 1024-bit RSA modulus within the next ﬁve years

(cf. [29]). After that, all bets are oﬀ.

The ratio between sieving and matrix time was almost 10. This is probably not optimal if

one wants to minimize the overall runtime. But, as already alluded to in the introduction,

minimization of runtime may not be the most important criterion. Sieving is easy, and doing

more sieving may be a good investment if it leads to a less painful matrix step. We expect

that the relations collected for RSA-768 will enable us to get a better insight in the trade-oﬀ

between sieving and matrix eﬀorts, where also the choice of the bound b`may play a role. This

is a subject for further study that may be expected to lead, ultimately, to a recommendation

for close to optimal parameter choices – depending on what one wants to optimize – for NFS

factorizations of numbers of 700 to 800 bits.

Our computation required more than 1020 operations. With the equivalent of almost 2000

years of computing on a single core 2.2GHz AMD Opteron, on the order of 267 instructions

were carried out. The overall eﬀort is suﬃciently low that even for short-term protection of

data of little value, 768-bit RSA moduli can no longer be recommended. This conclusion is

the opposite of the one arrived at on [38], which is based on a hypothetical factoring eﬀort of

six months on 100 000 workstations, i.e., about two orders of magnitude more than we spent.

Acknowledgements

This work was supported by the Swiss National Science Foundation under grant numbers

200021-119776 and 206021-128727 and by the Netherlands Organization for Scientiﬁc Research

(NWO) as project 617.023.613. Experiments presented in this paper were carried out using the

Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development

action with support from CNRS, RENATER and several universities as well as other funding

bodies (see https://www.grid5000.fr). Condor middleware was used on EPFL’s Greedy net-

work. We gratefully acknowledge sieving contributions by BSI, Scott Contini (using resources

provided by AC3, the Australian Centre for Advanced Computing and Communications), Paul

Leyland (using teaching lab machines at the Genetics Department of Cambridge University),

and Heinz Stockinger (using EGEE, Enabling Grids for E-sciencE). Part of this paper was

inspired by Col. Hans Landa.

References

1. L.M. Adleman, Factoring numbers using singular integers, Proceedings 23rd Annual ACM Symposium on

Theory of Computing (STOC) (1991) 64–71.

2. K. Aoki, J. Franke, T. Kleinjung, A.K. Lenstra, D.A. Osvik, A kilobit special number ﬁeld sieve factor-

ization, Proceedings Asiacrypt 2007, Springer-Verlag, LNCS 4833 (2007) 1–12.

3. F. Bahr, Liniensieben und Quadratwurzelberechnung für das Zahlkörpersieb, Diplomarbeit, University of

Bonn, 2005.

14

4. F. Bahr, M. Böhm, J. Franke, T. Kleinjung, Factorization of RSA-200, May 2005, http://www.loria.fr/

~zimmerma/records/rsa200.

5. J. Buhler, P.L. Montgomery, R. Robson, R. Ruby, Technical report implementing the number ﬁeld sieve,

Oregon State University, Corvallis, OR, 1994.

6. S. Cavallar, Strategies for ﬁltering in the number ﬁeld sieve, Proceedings ANTS IV, Springer-Verlag, LNCS

1838 (2000) 209–231.

7. S. Cavallar, B. Dodson, A. K. Lenstra, P. Leyland, P. L. Montgomery, B. Murphy, H. te Riele, P. Zim-

mermann, et al., Factoring a 512-bit RSA modulus, Proceedings Eurocrypt 2000, Springer-Verlag, LNCS

1807 (2000) 1–18.

8. D. Coppersmith, Solving linear equations over GF(2): block Lanczos algorithm, Linear algebra and its

applications 192 (1993) 33–60.

9. D. Coppersmith, Solving homogeneous linear equations over GF(2) via block Wiedemann algorithm, Math.

of Comp. 62 (1994) 333–350.

10. J. Cowie, B. Dodson. R.M. Elkenbracht-Huizing, A.K. Lenstra, P.L. Montgomery, J. Zayer, A world wide

number ﬁeld sieve factoring record: on to 512 bits, Proceedings Asiacrypt’96, Springer-Verlag, LNCS 1163

(1996), 382–394.

11. J.D. Dixon, Asymptotically fast factorization of integers, Math. Comp. 36 (1981) 255–260.

12. J. Franke, T. Kleinjung, Continued fractions and lattice sieving; Proceedings SHARCS 2005; http://www.

ruhr-uni-bochum.de/itsc/tanja/SHARCS/talks/FrankeKleinjung.pdf.

13. R. Golliver, A.K. Lenstra, K. McCurley, Lattice sieving and trial division, Proceedings ANTS’94, Springer-

Verlag, LNCS 877 (1994) 18–27.

14. T. Kleinjung, Cofactorisation strategies for the number ﬁeld sieve and an estimate for the sieving step

for factoring 1024 bit integers, http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf,

2005.

15. T. Kleinjung, On polynomial selection for the general number ﬁeld sieve. Math. Comp. 75 (2006) 2037–

2047.

16. T. Kleinjung, Polynomial selection, talk presented at the CADO workshop on integer factorization,

http://cado.gforge.inria.fr/workshop/slides/kleinjung.pdf, 2008.

17. A.K. Lenstra, Computational methods in public key cryptology, pp. 175–238 in: H. Niederreiter, Coding

theory and cryptology, Singapore University Press, 2002.

18. A.K. Lenstra, H.W. Lenstra, Jr., Algorithms in number theory, chapter 12 in Handbook of theoretical

computer science, Volume A, algorithms and complexity (J. van Leeuwen, ed.), Elsevier, Amsterdam (1990).

19. A.K. Lenstra, H.W. Lenstra, Jr. (editors), The development of the number ﬁeld sieve, Springer-Verlag,

LNM 1554, August 1993.

20. A.K. Lenstra, H.W. Lenstra, Jr., M.S. Manasse, J. Pollard, The factorization of the ninth Fermat number,

Math. Comp. 61 (1994) 319–349.

21. A.K. Lenstra, E. Tromer, A. Shamir, W. Kortsmit, B. Dodson, J. Hughes, P. Leyland, Factoring estimates

for a 1024-bit RSA modulus, Proceedings Asiacrypt 2003, Springer-Verlag, LNCS 2894 (2003) 55–74.

22. J. Massey, Shift-register synthesis and BCH decoding, IEEE Trans. Inf. Th., 15 (1969) 122–127.

23. P.L. Montgomery, Square roots of products of algebraic numbers, http://ftp.cwi.nl/pub/pmontgom/

sqrt.ps.gz.

24. P.L. Montgomery, B. Murphy, Improved polynomial selection for the number ﬁeld sieve, extended abstract

for the conference on the mathematics of public-key cryptography, June 13-17, 1999, the Fields institute,

Toronto, Ontario, Canada.

25. M.A. Morrison, J. Brillhart, A method of factorization and the factorization of F7, Math. Comp. 29 (1975)

183–205.

26. B.A. Murphy, Modelling the Yield of Number Field Sieve Polynomials, Proceedings ANTS III, Springer-

Verlag, LNCS 1423 (1998) 137–147.

27. National Institute of Standards and Technology, Suite B Cryptography: http://csrc.nist.gov/groups/

SMA/ispab/documents/minutes/2006-03/E_Barker-March2006-ISPAB.pdf.

28. National Institute of Standards and Technology, Special Publication 800-57: Recommendation for Key

Management Part 1: General (Revised),

http://csrc.nist.gov/publications/nistpubs/800-57/sp800-57-Part1-revised2_Mar08-2007.pdf.

29. National Institute of Standards and Technology, Discussion paper: the transitioning of cryptographic

algorithms and key sizes,

http://csrc.nist.gov/groups/ST/key_mgmt/documents/Transitioning_CryptoAlgos_070209.pdf.

30. NFS@home, http://escatter11.fullerton.edu/nfs.

15

31. P. Nguyen, A Montgomery-like square root for the number ﬁeld sieve, Proceedings ANTS III, Springer-

Verlag, LNCS 1423 (1998) 151–168.

32. J.M. Pollard, The lattice sieve, 43–49 in [19].

33. C. Pomerance, Analysis and comparison of some integer factoring algorithms, in Computational methods

in number theory (H.W. Lenstra, Jr., R. Tijdeman, eds.), Math. Centre Tracts 154,155, Mathematisch

Centrum, Amsterdam (1983) 89-139.

34. C. Pomerance, The quadratic sieve factoring algorithm, Proceedings Eurocrypt’84, Springer-Verlag, LNCS

209 (1985) 169-182.

35. C. Pomerance, A tale of two sieves, http://www.ams.org/notices/199612/pomerance.pdf.

36. R. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public key cryptosystems,

Commun. of the ACM, 21 (1978) 120–126.

37. The RSA challenge numbers, formerly on http://www.rsa.com/rsalabs/node.asp?id=2093, now on for

instance http://en.wikipedia.org/wiki/RSA_numbers.

38. The RSA factoring challenge FAQ, http://www.rsa.com/rsalabs/node.asp?id=2094.

39. Q. Tarantino, http://www.youtube.com/watch?v=WtHTc8wIo4Q,

http://www.imdb.com/title/tt0361748/quotes.

40. E. Thomé, Subquadratic computation of vector generating polynomials and improvement of the block

Wiedemann algorithm, Journal of symbolic computation 33 (2002) 757–775.

A Sieving by vectors

The aim of this section is to give a short description of the implementation of lattice sieving

described in [12] and used for most of the number ﬁeld sieve factorization records of the

previous decade.

Let vk(a, b) = bdkfk(a/b). Recall that the idea of the method of lattice sieving, which was

introduced by Pollard [32], is to increase the probability of smoothness of v1(a, b)by only

looking at (a, b)-pairs for which v1(a, b)is divisible by some large prime q, called the special q.

Let smod qbe a residue class such that this is the case for a≡sb mod q. One constructs a

reduced base (u, v)of the lattice of all (a, b)∈Z2with a≡sb mod q. A scalar product adapted

to the skewness of the polynomial pair is used for this reduction. The problem is then to ﬁnd

all pairs (i, j),−I /2≤i < I /2,0< j < J , without common divisor such that v1(a, b)/q and

v2(a, b)are both smooth, with (a, b) = iu +jv. For the sake of simplicity we assume Ito be

even. As mentioned in Section 2.3, for practical values of the parameters, Itends to be much

smaller than the smoothness bounds b1and b2, and it is non-trivial to eﬃciently sieve such

regions.

Pollard proposed to do this by using, for each (prime,root) pair pwith prime pbounded by

the relevant smoothness bound bk, a reduced base of the lattice Γpof pairs (i, j)for which

vk(a, b)for the corresponding (a, b)-pair is divisible by p. In [13] that approach was used for p

larger than a small multiple of I, while avoiding storage of “even, even” sieve locations (and

using line sieving for the other primes). Our approach is similar, but instead uses a truncated

continued fraction expansion to determine a basis B=(α, β),(γ, δ)of Γpwith the following

properties:

aThe numbers βand δare positive.

bWe have −I < α ≤0≤γ < I and γ−α≥I.

Let us assume that Γpconsists of all (i, j)for which i≡ρj mod p, where 0< ρ < p. The case

ρ= 0 and the case where Γpconsists of all (i, j)for which pdivides jare not treated, because

16

they produce just (0,1) and (1,0), respectively, as only coprime pairs. We also assume p≥I, as

smaller primes are better treated by line sieving. To construct a basis with the above properties,

one takes (i0, j0)=(−p, 0),(i1, j1) = (ρ, 1) and puts (i`+1, j`+1 )=(i`−1, j`−1) + r(i`, j`)with

r=−i`−1

i`. Note that (−1)`+1i`≥0, that ris positive and that the j`thus form an increasing

sequence of non-negative numbers. The process is stopped at the ﬁrst `with |i`|< I . If this

number `is odd, we put (α, β)=(i`−1, j`−1) + r(i`, j`), where ris the smallest integer for

which α > −I. If `is even, we put (γ , δ) = (i`−1, j`−1) + r(i`, j`), where ris the smallest

integer such that γ < I. In both cases, the element of B=(α, β),(γ , δ)not yet described is

given by (i`, j`).

To explain how to eﬃciently sieve using a basis with these properties, let (i, j)∈Γpsuch

that −I/2≤i < I/2. We want to ﬁnd the (uniquely determined) (i0, j0)∈Γpsuch that

−I/2≤i0< I/2,j0> j, and j0is as small as possible. As Bis a basis of Γp, there are integers

dand ewith

(i0, j0)−(i, j ) = d(α, β) + e(γ, δ ).

If the numbers dand ewere both diﬀerent from zero, with opposite signs, then condition bon

Bwould force the ﬁrst component of the right hand side to have absolute value ≥I, whereas

our constraints on iand i0force it to have absolute value < I. Since j0−j,β, and δare all

positive, we have d≥0and e≥0. It is now easy to see that the solution to our problem is:

(d, e) =

(0,1) if i < I/2−γ

(1,1) if I/2−γ≤i < −I/2−α

(1,0) if i≥ −I/2−α.

To convince oneself of the minimality of j0, one notes that d= 0 leads to a violation of i0< I/2

unless i < I/2−γ(i.e., save for the ﬁrst of the above cases) and that e= 0 leads to i0<−I/2

unless i≥ −I/2−α(i.e., save for the third of the above cases).

To implement this process on a modern CPU, it seems best to take I= 2ιfor some natural

number ι. It is possible to identify pairs (i, j)of integers with −I/2≤i < I/2with integers

xby putting x=j·I+i+I/2. If x0=j0·I+i0+I /2with (i0, j 0)as above, then x0=x+C,

x0=x+A+Cand x0=x+Ain the three cases above, with A=α+I·βand C=γ+I·δ.

The ﬁrst component of a pair (i, j),(α, β )or (γ, δ)is extracted from these numbers by using

a bitwise logical operation, and the selection of the appropriate one of the above three cases

is best done using conditional move instructions.

For cache eﬃciency, the sieving region Sqwas split into areas At,0≤t<T, of size equal to

the L1-cache size. For primes plarger than that size (or a small multiple thereof), sieving is

not done directly. Instead, the numbers xcorresponding to elements of Sq∩Γpwere calculated

ahead of the sieving process, and their oﬀsets into the appropriate region Atstored in the cor-

responding element of an array Sof Tstacks. To implement the trial division sieve eﬃciently,

the corresponding factor base index was also stored. Of course, this approach may also be

used for line sieving, and in fact was used in [3]. An approach which seems to be somewhat

similar has been described by T. Oliveira e Silva in connection with his implementation of the

Odlyzko-Lagarias-Lehmer-Meissel method.

17

Parallelization is possible in several diﬀerent ways. A topology for splitting the sieving region

among several nodes connected by a network is described in [12]. If one wants to split the task

among several cores sharing their main memory, it seems best to distribute the regions Atand

also the large factor base primes among them. Each core ﬁrst calculates its part of S, for its

assigned part of the large factor base elements, and then uses the information generated by

all cores to treat its share of regions At. A lattice siever parallelized that way was used for

a small part of the RSA-576 sieving tasks, but the code fell out of use and was not used for

the current project. The approach may be more useful today, with many cores per processor

being a standard.

B Unbalanced sequences in block Wiedemann

Before describing the modiﬁcation for unbalanced sequence lengths we give an overview of

Coppersmith’s block version of the Berlekamp-Massey algorithm. To avoid a too technical

description we simplify the presentation of Coppersmith’s algorithm and refer to [9] for details.

The modiﬁcation can also be applied to Thomé’s subquadratic algorithm (cf. [40]) which is

what we did and used. In the following the variables mand ndo not denote the common root

and a number to be factored, but have the same meaning as in Coppersmith’s article. As in

that article, the terms +O(1) are constants depending on mand n. We assume that mand n,

which play the role of 64 ·m0and 64 ·n0in Section 2.4, are much smaller than d.

Let Mbe a d×dmatrix over F2,m≥n,xk∈Fd

2,1≤k≤mand yj∈Fd

2,1≤j≤n

satisfying certain conditions. Set a(i)

j,k =xT

kMiyjand

A=X

i

(a(i)

j,k)Xi∈Matn,m [X].

In the ﬁrst step of block Wiedemann we calculate the coeﬃcients of Aup to degree d

m+d

n+O(1).

The goal of the Berlekamp-Massey algorithm is to ﬁnd polynomials of matrices F∈Matn,n[X]

and G∈Matn,m[X]satisfying deg(F)≤d

n+O(1),deg(G)≤d

n+O(1) and

F A ≡G(mod Xd

m+d

n+O(1)).

Intuitively, we want to produce at least dzero rows in the higher coeﬃcients of F A up to

degree d

m+d

n+O(1). Writing F=PdF

i=0(f(i)

j,k)Xi,dF= deg(F)the jth row of coeﬃcient

dF+bof Gbeing zero corresponds to

(Mbxh)Tvj= 0 for 1≤h≤m, 0≤b < d

m+O(1) where

vj=

n

X

k=1

dF

X

i=0

f(dF−i)

j,k ·Miyk.

Coppersmith’s algorithm produces a sequence of matrices (of m+nrows) Ft∈Matm+n,n[X]

and Gt∈Matm+n,m[X]for t=t0,..., d

m+d

n+O(1) (where t0=O(1)) such that

FtA≡Gt(mod Xt)

18

holds and the degrees of Ftand Gtare roughly m

m+nt. In a ﬁrst step t0and Ft0are chosen such

that certain conditions are satisﬁed, in particular we have deg(Ft0) = O(1) and deg(Gt0) =

O(1). To go from tto t+ 1 a polynomial of degree 1 of matrices Pt∈Matm+n,m+n[X]is

constructed and we set Ft+1 =PtFtand Gt+1 =PtGt. This construction is done as follows.

We have FtA≡Gt+EtXt(mod Xt+1)for some matrix Et. Respecting a restriction involving

the degrees of the rows of Gt(essentially we avoid destroying previously constructed zero rows

in the G’s) we perform a Gaussian elimination on Et, i.e., we obtain ˜

Ptsuch that

˜

PtEt=0

1m.

Then we set

Pt=1n0

01mX·˜

Pt.

In this way the degrees of at most mrows are increased when passing from Gtto Gt+1 (due to

the restriction mentioned above ˜

Ptdoes not increase the degrees), so the total number of zero

rows in the coeﬃcients is increased by n. Due to the restriction mentioned above the degrees

of the rows of Ftand of Gtgrow almost uniformly, i.e., they grow on average by m

m+nwhen

going from tto t+ 1.

After t=d

m+d

n+O(1) steps the total number of zero rows in the coeﬃcients of Gtis

m+n

md+O(1) such that we can select mrows that produce at least dzero rows in the coeﬃcients.

These mrows form Fand G.

We now consider unbalanced sequence lengths. Let `jbe the length of sequence j, i.e., a(i)

j,k

has been computed for all kand 0≤i≤`j. Without loss of generality we can assume

`1≤`2≤. . . ≤`n=`. The sum of the lengths of all sequences has to satisfy again Pj`j≥

d·(1 + n

m) + O(1). Moreover we can assume that `1≥d

m, otherwise we could drop sequence 1

completely, thus facilitating our task.

In this setting our goal is to achieve

F A ≡G(mod X`+O(1))

with dF= deg(F)≤`−d

m,deg(G)≤`−d

mand

X`−`k|F·,k (this denotes the kth column of F).

The latter condition will compensate our ignorance of some rows of the higher coeﬃcients

of A. Indeed, setting for simplicity dF=`−d

m, the vectors

vj=

n

X

k=1

`k−d

m

X

i=0

f(dF−i)

j,k ·Miyk

satisfy for 1≤h≤m, 0≤b < d

m

(Mbxh)Tvj=

n

X

k=1

`k−d

m

X

i=0

f(dF−i)

j,k a(i+b)

k,h =g(dF+b)

j,h = 0.

19

If i+b > `k(thus a(i+b)

k,h not being computed), we have dF−i < dF+b−`k≤`−`k, so

f(dF−i)

j,k = 0 and the sum computes g(dF+b)

j,h .

Our new goal is achieved as before, but we will need `steps and the construction of Pthas

to be modiﬁed as follows. In step twe have FtA≡Gt+EtXt(mod Xt+1). Let a≤nbe

maximal such that a−1

X

i=1

(m+i)(`n−i+1 −`n−i)≤mt

(awill increase during the computation). In the Gaussian elimination of Etwe do not use the

ﬁrst n−arows for elimination. As a consequence, ˜

Pthas the form

˜

Pt=1n−a∗

0∗.

Then we set

Pt=

1n−aX0 0

01a0

0 0 1mX

·˜

Pt.

Therefore the sum of the degrees of Ftwill be increased by m+n−aand the number of zero

rows in Gtwill be increased by awhen passing from tto t+1. For a ﬁxed a,(m+a)(`n−a+1−`n−a)

m

steps will increase the average degree of the last m+arows from `−`n−a+1 to `−`n−a. At

this point awill be increased.

To see why X`−`k|F·,k holds we have to describe the choice of Ft0(and t0). Let cbe the

number of maximal `j, i.e., `n−c< `n−c+1 =`n. Then Ft0will be of the form

Ft0=1n−cXt00

0∗.

The last m+crows will be chosen such that they are of degree at most t0−1and such that

the conditions in Coppersmith’s algorithm are satisﬁed. This construction will lead to a value

of t0near m

cinstead of the lower value near m

nin the original algorithm.

Let kbe such that `k< ` and consider the kth column of Ft. As long as n−a≥kthis column

will have the only non-zero entry at row kand this will be Xt. Since n−a≥kholds for

t≤`−`kthis column will be divisible by X`−`kfor all t≥`−`k.

For RSA-768 we used the algorithm as described above in the subquadratic version of Thomé.

However, a variant of the algorithm might be useful in certain situations (e.g., if one of the

sequences is much longer than the others) which we will sketch brieﬂy.

Suppose that `n−1< `n. Then for t < (m+1)(`n−`n−1)

mwe have a= 1 and Ptis of the form

Pt=1n−1X∗

0∗.

A product of several of these Ptwill have a similar form, namely an (n−1) ×(n−1) unit

matrix times a power of Xin the upper left corner and zeros below it.

The basic operations in Thomé’s subquadratic version are building a binary product tree of

these Ptand doing truncated multiplications of intermediate products with Ft0Aor similar

20

polynomials. If we split the computation in this algorithm into two stages, ﬁrst computing

the product of all Ptfor t < (m+1)(`n−`n−1)

mand then the remaining product, the matrix

multiplications in the ﬁrst stage become easier due to the special form of the Ptand its

products.

Obviously this can be done in as many stages as there are diﬀerent values among the `j.

C Factorizations of p±1and q±1

Referring to the smallest factor of RSA-768 as pand to its cofactor as q, the prime factor-

izations of p±1and q±1are as follows, where “pk” denotes a prime number of kdecimal

digits:

p−1 = 28·112·13 ·7193 ·160378082551 ·7721565388263419219 ·

111103163449484882484711393053 ·p47,

p+ 1 = 2 ·3·5·31932122749553372262005491861630345183416467 ·p71,

q−1 = 22·359 ·p113,

q+ 1 = 2 ·3·23 ·41 ·47 ·239875144072757917901 ·p90.

D Clusters used and block Wiedemann timings

21

Cluster number of CPU type clock speed cores GB RAM interconnect nodes cores seconds per iteration communication

location nodes (GHz) per node per node per job per job stage 1 stage 3

Lausanne 56 2×AMD 2427 2.2 12 16 ib20g 12 144 4.3-4.5 4.8 40%

Tokyo 110 2×Pentium 4 3.0 2 5 eth1g 110 220 5.8†, 6.4 7.8 33%†, 44%

Grenoble 34 2×Xeon E5420 2.5 8 8 ib20g 24 144 3.7 n/a 30%

Lille 46 2×Xeon E5440 2.8 8 8 mx10g 36 144 3.1 3.3 31%

32 256 3.8 n/a 38%

24 144 4.4 n/a 33%

Nancy 92 2×Xeon L5420 2.5 8 16 ib20g 64 256 2.2 2.4 41%

36 144 3.0 3.2 31%

24 144 3.5 4.2 30%

18 144 n/a 5.0 31%

16 64 n/a 6.5 19%

Orsay 120 2×AMD 250 2.4 2 2 mx10g 98 196 2.8 3.9 32%

Rennes 96 2×Xeon 5148 2.3 4 4 mx10g 64 256 2.5 2.7 37%

49 196 2.9 3.5 33%

Rennes 64 2×Xeon L5420 2.5 8 32 eth1g 49 196 6.2 n/a 67%

24 144 8.4 n/a 67%

18 144 10.0 n/a 68%

8 64 n/a 18.0 56%

Table 2: Data and ﬁrst and third stage block Wiedemann timings for all clusters used.

“n/a” means that the job conﬁguration was not used.

†: ﬁgure per iteration per sequence when two sequences are processed in parallel, in which case a part of the communication time is hidden in the local

computation time (the communication number shows the pure communication percentage),

for all other ﬁgures just a single sequence is processed.

22