Block Recombination Approach for Subquadratic Space Complexity Binary Field Multiplication Based on Toeplitz Matrix-Vector Product
ABSTRACT In this paper, we present a new method for parallel binary finite field multiplication which results in subquadratic space complexity. The method is based on decomposing the building blocks of the Fan-Hasan subquadratic Toeplitz matrix-vector multiplier. We reduce the space complexity of their architecture by recombining the building blocks. In comparison to other similar schemes available in the literature, our proposal presents a better space complexity while having the same time complexity. We also show that block recombination can be used for efficient implementation of the GHASH function of Galois Counter Mode (GCM).
M. A. Hasan¹, N. Méloni¹, A. H. Namin¹ and C. Negre¹,²
¹ ECE Department and CACR, University of Waterloo, Waterloo, Ontario, Canada.
² Team DALI/ELIAUS, Université de Perpignan, Perpignan, France.
Keywords. Binary field, subquadratic space complexity multiplier, Toeplitz matrix, block recombination.
Finite fields have a wide range of applications in number theory, coding theory, and cryptography. Binary fields are especially attractive for high-speed cryptographic applications since they are inherently suitable for hardware implementation. A binary field $\mathbb{F}_{2^n}$ is generally constructed as the set of binary polynomials modulo an irreducible polynomial $P$ of degree $n$. An element of $\mathbb{F}_{2^n}$ is then a binary polynomial of degree smaller than $n$. Addition and multiplication in a binary field consist of polynomial addition and multiplication modulo $P$.
In general, field multipliers fall into three categories: bit-level, digit-level, and bit-parallel. Bit-parallel multipliers are the fastest of these designs. Two main types of parallel multipliers exist in the literature. The first type comprises the quadratic complexity architectures, which require
May 27, 2010DRAFT
quadratic space complexity ($O(n^2)$ gates for their implementation), cf. , . The second type comprises the subquadratic space complexity designs, which require a smaller number of gates for their implementation ($O(n^\delta)$, $\delta < 2$), cf. , , , . The latter are practical architectures for hardware implementation of large fields, especially those used in elliptic curve cryptographic applications ,  ($163 < n < 571$).
Only a limited number of algorithms are available for designing subquadratic space complexity multipliers. The best known is based on the integer Karatsuba multiplication scheme, which has been widely applied to create subquadratic space complexity multipliers . The architectures proposed in ,  are based on this method. Another well-known algorithm is the Winograd short convolution algorithm , used for the same purpose in . The Chinese Remainder Theorem (CRT) is another example of such an algorithm that results in subquadratic complexity multipliers .
Recently, Fan and Hasan proposed a new scheme for designing subquadratic space complexity multipliers using the Toeplitz matrix-vector product . Their scheme represents field elements in a shifted polynomial basis, so that the multiplication can be modeled as the product of a Toeplitz matrix by a vector. Their method can also be used to design subquadratic space complexity polynomial, dual, weakly-dual and triangular basis multipliers. They then proposed two-way and three-way split approaches that break each matrix-vector product down into a number of smaller matrix-vector products. In , recursive use of this approach results in subquadratic space complexity multipliers which outperform those using Karatsuba, Winograd, and CRT.
In this paper, we propose an extension of the work of Fan and Hasan. We first decompose the Fan-Hasan multiplier based on the Toeplitz matrix-vector product (TMVP) into a number of different blocks. Each block performs an independent computation (component matrix formation, component vector formation, component multiplication, or reconstruction). We then propose to recombine these blocks in order to reduce the space complexity. We first apply this recombination to a two-TMVPs-and-add architecture: we reorder the blocks and replace one reconstruction block by a smaller bitwise addition block, which reduces the space complexity. After that, we use this recombination in a single TMVP multiplier by expressing the product in terms of two-TMVPs-and-add. We obtain a multiplier with lower space and time complexity.
The rest of this work is organized as follows. In Section 2 we briefly review binary field multiplication, recall the multiplier of Fan and Hasan, and decompose it into four different blocks. In Section 3 we present our block recombination method in the special case of the two-TMVPs-and-add architecture. We then apply this method to reduce the space complexity of a single TMVP architecture and compare the results with previous similar approaches. We continue, in Section 4, by presenting an application of block recombination to the design of a space-efficient architecture for GHASH computation. Finally, Section 5 presents some concluding remarks.
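BRIEF REVIEW OF PARALLEL BINARY FIELD MULTIPLICATION
Assume that $A = \sum_{i=0}^{n-1} a_i X^i$ and $B = \sum_{i=0}^{n-1} b_i X^i$ are two elements of $\mathbb{F}_{2^n}$. We can compute the product $C = A \times B \bmod P$ as
$$C = \sum_{i=0}^{n-1} b_i A^{(i)},$$
by expanding the expression of $B$ and considering that $A^{(i)} = A X^i \bmod P$. This can be written as a matrix-vector product
$$C = \left[ A^{(0)}, A^{(1)}, \ldots, A^{(n-1)} \right] \cdot B. \qquad (1)$$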
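The column formulation above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware architecture: polynomials are packed into integers (bit $i$ holds the coefficient of $X^i$), and the function names and the example field $\mathbb{F}_{2^4}$ with $P = X^4 + X + 1$ are assumptions chosen for the illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of two bit-packed binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c: int, p: int, n: int) -> int:
    """Reduce c modulo p, where p has degree n (bit n of p is set)."""
    for d in range(c.bit_length() - 1, n - 1, -1):
        if (c >> d) & 1:
            c ^= p << (d - n)
    return c


def columns(a: int, p: int, n: int) -> list[int]:
    """Return [A^(0), ..., A^(n-1)] with A^(i) = A * X^i mod P (deg A < n)."""
    cols = []
    for _ in range(n):
        cols.append(a)
        a <<= 1              # multiply by X
        if (a >> n) & 1:     # degree reached n: reduce by XORing P
            a ^= p
    return cols


def mul_by_columns(a: int, b: int, p: int, n: int) -> int:
    """C = sum_i b_i * A^(i), i.e. the product [A^(0) ... A^(n-1)] . B."""
    c = 0
    for i, col in enumerate(columns(a, p, n)):
        if (b >> i) & 1:
            c ^= col
    return c


# Hypothetical example field: F_{2^4} with P = X^4 + X + 1.
N, P = 4, 0b10011
assert mul_by_columns(0b0010, 0b1000, P, N) == 0b0011  # X * X^3 = X + 1
```

Each column costs one shift and at most one conditional XOR of $P$, which is why the sparsity of $P$ matters in the first strategy discussed next.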
Direct hardware implementation of this matrix-vector product results in a circuit of quadratic area complexity (i.e., it requires $O(n^2)$ gates). The two commonly used strategies to design an efficient hardware multiplier via the above matrix-vector product are the following:
1) The choice of the polynomial $P$ must allow an efficient computation of the columns $A^{(i)}$. So far, all-one polynomials (AOP) and trinomials seem to be the best possible choices , . However, neither exists for every degree $n$. Consequently, other types of irreducible polynomials have been considered, specifically pentanomials of the form 
$$P = X^n + X^{k+2} + X^{k+1} + X^k + 1.$$
2) The second strategy consists of modifying the matrix in (1) in order to obtain a Toeplitz matrix. Recall that an $n \times n$ Toeplitz matrix is a matrix $[t_{i,j}]_{0 \le i,j \le n-1}$ such that $t_{i,j} = t_{i-1,j-1}$ for $1 \le i,j \le n-1$. We will see in Subsection 2.1 that a Toeplitz matrix-vector product can be computed efficiently by a circuit of subquadratic complexity (i.e., it requires $O(n^\delta)$ gates where $\delta < 2$). Generally, we can obtain the Toeplitz form of the matrix in (1) by performing some row or column operations. In other words, we get this Toeplitz structure by using different bases of representation, as was shown by Hasan and Bhargava in . This can be performed efficiently on fields defined by trinomials or specific pentanomials .
For the remainder of this paper we assume that the binary field multiplication $C = A \times B$ has already been expressed as a Toeplitz matrix-vector product
$$C = T_A \cdot B.$$
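Since a Toeplitz matrix satisfies $t_{i,j} = t_{i-1,j-1}$, it is fully determined by its $2n-1$ diagonals. The sketch below shows one possible software representation and the direct (quadratic) product that serves as a baseline. The storage convention `diag[i - j + n - 1]` and all function names are assumptions made for this illustration, not notation from the paper.

```python
def toeplitz_dense(diag, n):
    """Expand the 2n-1 diagonal values into a full n x n matrix.

    Assumed convention: entry (i, j) is diag[i - j + n - 1], so diag[0] is
    the top-right corner and diag[2n - 2] the bottom-left corner.
    """
    return [[diag[i - j + n - 1] for j in range(n)] for i in range(n)]


def is_toeplitz(m):
    """Check the defining property t[i][j] == t[i-1][j-1] for 1 <= i, j <= n-1."""
    n = len(m)
    return all(m[i][j] == m[i - 1][j - 1]
               for i in range(1, n) for j in range(1, n))


def tmvp_schoolbook(diag, v):
    """Direct Toeplitz matrix-vector product over GF(2): n*n AND operations."""
    n = len(v)
    out = []
    for i in range(n):
        acc = 0
        for j in range(n):
            acc ^= diag[i - j + n - 1] & v[j]   # AND = multiply, XOR = add
        out.append(acc)
    return out


# Small illustrative instance with n = 3 (values chosen arbitrarily).
diag = [1, 0, 1, 1, 0]
assert toeplitz_dense(diag, 3) == [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
assert tmvp_schoolbook(diag, [1, 1, 0]) == [1, 0, 1]
```

Storing only the diagonals is what makes the split formulas of the next subsection cheap to apply: sums of Toeplitz blocks are again Toeplitz and need only $2n-1$ entries each.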
2.1 Fan-Hasan subquadratic Toeplitz matrix-vector multiplier
In this section we recall the method used to build a subquadratic circuit which computes a Toeplitz
matrix-vector product (TMVP) .
If $2 \mid n$, Fan and Hasan proposed to use the two-way split approach shown in Table 1 to compute a matrix-vector product $T \cdot V$, where $T$ is an $n \times n$ Toeplitz matrix and $V$ is a vector of size $n$. The two-way split approach breaks a Toeplitz matrix-vector product of size $n$ down into three Toeplitz matrix-vector products of size $n/2$. If $3 \mid n$, they proposed to use the three-way split approach, which is also shown in Table 1. The three-way split approach results in six Toeplitz matrix-vector products of size $n/3$.
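Table 1. Fan and Hasan two-way and three-way split formulas for TMVP , 

Two-way split:
$$T \cdot V = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \end{bmatrix} = \begin{bmatrix} P_0 + P_1 \\ P_1 + P_2 \end{bmatrix}$$
where
$$P_0 = (T_0 + T_1) \cdot V_1, \quad P_1 = T_1 \cdot (V_0 + V_1), \quad P_2 = (T_1 + T_2) \cdot V_0.$$

Three-way split:
$$T \cdot V = \begin{bmatrix} T_2 & T_1 & T_0 \\ T_3 & T_2 & T_1 \\ T_4 & T_3 & T_2 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} P_0 + P_3 + P_4 \\ P_1 + P_3 + P_5 \\ P_2 + P_4 + P_5 \end{bmatrix}$$
where
$$P_0 = (T_0 + T_1 + T_2) \cdot V_2, \quad P_1 = (T_1 + T_2 + T_3) \cdot V_1, \quad P_2 = (T_2 + T_3 + T_4) \cdot V_0,$$
$$P_3 = T_1 \cdot (V_1 + V_2), \quad P_4 = T_2 \cdot (V_0 + V_2), \quad P_5 = T_3 \cdot (V_0 + V_1).$$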
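The two split formulas can be checked with a short recursive sketch. It assumes a Toeplitz matrix stored by its $2n-1$ diagonals with entry $(i,j)$ at `diag[i - j + n - 1]`, so that block $T_k$ of size $h$ is the slice `diag[k*h : k*h + 2*h - 1]`. This convention, the function names, and the fallback to a direct product when $n$ is divisible by neither 2 nor 3 are choices made for the illustration.

```python
def xor(a, b):
    """Component-wise addition over GF(2)."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_schoolbook(diag, v):
    """Direct product, used as reference and as fallback."""
    n = len(v)
    return [sum(diag[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


def tmvp_split(diag, v):
    """Recursive TMVP using the two-way or three-way split when n allows it."""
    n = len(v)
    if n == 1:
        return [diag[0] & v[0]]
    if n % 2 == 0:                        # two-way split: 3 products of size n/2
        h = n // 2
        t = [diag[k * h: k * h + 2 * h - 1] for k in range(3)]
        v0, v1 = v[:h], v[h:]
        p0 = tmvp_split(xor(t[0], t[1]), v1)
        p1 = tmvp_split(t[1], xor(v0, v1))
        p2 = tmvp_split(xor(t[1], t[2]), v0)
        return xor(p0, p1) + xor(p1, p2)
    if n % 3 == 0:                        # three-way split: 6 products of size n/3
        h = n // 3
        t = [diag[k * h: k * h + 2 * h - 1] for k in range(5)]
        v0, v1, v2 = v[:h], v[h:2 * h], v[2 * h:]
        p0 = tmvp_split(xor(xor(t[0], t[1]), t[2]), v2)
        p1 = tmvp_split(xor(xor(t[1], t[2]), t[3]), v1)
        p2 = tmvp_split(xor(xor(t[2], t[3]), t[4]), v0)
        p3 = tmvp_split(t[1], xor(v1, v2))
        p4 = tmvp_split(t[2], xor(v0, v2))
        p5 = tmvp_split(t[3], xor(v0, v1))
        return (xor(xor(p0, p3), p4)
                + xor(xor(p1, p3), p5)
                + xor(xor(p2, p4), p5))
    return tmvp_schoolbook(diag, v)       # assumption: no split applies


# n = 2 with T = [[0, 1], [1, 0]] (diagonals [1, 0, 1]) swaps the two entries.
assert tmvp_split([1, 0, 1], [1, 0]) == [0, 1]
```

The software recursion is sequential, whereas the hardware multiplier evaluates all sub-products in parallel; the sketch only illustrates that the formulas reproduce $T \cdot V$.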
If $n$ is a power of 2 or a power of 3, Fan and Hasan also proposed to apply the formulas of Table 1 recursively to compute $T \cdot V$. When this recursive process is carried out through parallel computation, the resulting multiplier has the complexities given in Table 2 (cf. ). In this table, $D_A$ represents the delay of a two-input AND gate and $D_X$ the delay of a two-input XOR gate. It is also possible to design subquadratic TMVP multipliers for sizes $n = 3^i 2^j$ by combining the two-way and three-way split approaches.
Table 2. Asymptotic complexities of a single Fan-Hasan TMVP

                 Two-way split method               Three-way split method
# XOR gates      $5.5\,n^{\log_2 3} - 6n + 0.5$     $\frac{24}{5}\,n^{\log_3 6} - 5n + \frac{1}{5}$
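As a sanity check on the XOR counts, the following sketch compares the closed forms $5.5\,n^{\log_2 3} - 6n + 0.5$ and $\frac{24}{5}\,n^{\log_3 6} - 5n + \frac{1}{5}$ with simple recurrences. The recurrences $X(n) = 3X(n/2) + 3n - 1$ and $X(n) = 6X(n/3) + 5n - 1$ with $X(1) = 0$ are not stated in the text above; they are assumptions back-derived from the closed forms and from counting the additions in Table 1, and the function names are hypothetical.

```python
from fractions import Fraction


def xor_two_way(k: int) -> int:
    """XOR count for n = 2**k from the assumed recurrence X(n) = 3 X(n/2) + 3n - 1."""
    n, x = 1, 0
    for _ in range(k):
        n *= 2
        x = 3 * x + 3 * n - 1
    return x


def xor_three_way(k: int) -> int:
    """XOR count for n = 3**k from the assumed recurrence X(n) = 6 X(n/3) + 5n - 1."""
    n, x = 1, 0
    for _ in range(k):
        n *= 3
        x = 6 * x + 5 * n - 1
    return x


def closed_two_way(k: int) -> Fraction:
    """5.5 n^{log2 3} - 6n + 0.5, using n^{log2 3} = 3**k exactly for n = 2**k."""
    return Fraction(11, 2) * 3 ** k - 6 * 2 ** k + Fraction(1, 2)


def closed_three_way(k: int) -> Fraction:
    """(24/5) n^{log3 6} - 5n + 1/5, using n^{log3 6} = 6**k exactly for n = 3**k."""
    return Fraction(24, 5) * 6 ** k - 5 * 3 ** k + Fraction(1, 5)


for k in range(8):
    assert xor_two_way(k) == closed_two_way(k)
    assert xor_three_way(k) == closed_three_way(k)
```

Exact rational arithmetic is used instead of floating-point powers so that the comparison does not depend on rounding.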
2.2 Block decomposition of Fan-Hasan multiplier
In this subsection we decompose the Fan-Hasan multiplier into a number of independent blocks. We will then evaluate the complexity of each block. We decompose the recursive formulas of Table 1 into four blocks:
• Component matrix formation (CMF). We call component matrix formation the recursive matrix computation of Table 1. For example, for the two-way split approach, the first recursion on $T$ computes $T_0 + T_1$, $T_1$, $T_1 + T_2$. This means that the component matrix formation can be expressed as $\mathrm{CMF}(T) = (\mathrm{CMF}(T_0 + T_1), \mathrm{CMF}(T_1), \mathrm{CMF}(T_1 + T_2))$. In the sequel we will often refer to $\mathrm{CMF}(T)$ as the component representation of the Toeplitz matrix $T$.
• Component vector formation (CVF). This corresponds to the recursive computation on the vector $V$ in Table 1. For example, for the two-way split approach, we see that recursion is applied to $V_1$, $V_0 + V_1$, $V_0$. Consequently the component vector formation is expressed as $\mathrm{CVF}(V) = (\mathrm{CVF}(V_1), \mathrm{CVF}(V_0 + V_1), \mathrm{CVF}(V_0))$. In the sequel we will often refer to $\mathrm{CVF}(V)$ as the component representation of the vector $V$.
• Component multiplication (CM). In Table 1 we see that $\mathrm{CMF}(T)$ and $\mathrm{CVF}(V)$ are multiplied component by component at the end of the recursion (this corresponds to the recursive multiplications).
• Reconstruction (R). The last operation is the reconstruction of the product $W = T \cdot V$ from the component multiplication of $\mathrm{CMF}(T)$ and $\mathrm{CVF}(V)$. For example, for the two-way split case, let $\hat{W}$ be the component multiplication of $\mathrm{CMF}(T)$ and $\mathrm{CVF}(V)$. If we split $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$, then the formula in Table 1 states that $W = R(\hat{W}) = (R(\hat{W}_0) + R(\hat{W}_1), R(\hat{W}_1) + R(\hat{W}_2))$.
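The four blocks can be modeled as four separate functions, here for the two-way split with $n$ a power of 2. This is a minimal sketch under the assumption that $T$ is stored by its $2n-1$ diagonals with entry $(i,j)$ at `diag[i - j + n - 1]`; the function names are hypothetical and the recursion stands in for what would be parallel hardware blocks.

```python
def xor(a, b):
    """Component-wise addition over GF(2)."""
    return [x ^ y for x, y in zip(a, b)]


def cmf(diag):
    """Component matrix formation: (CMF(T0+T1), CMF(T1), CMF(T1+T2))."""
    n = (len(diag) + 1) // 2
    if n == 1:
        return list(diag)
    h = n // 2
    t0, t1, t2 = diag[:2 * h - 1], diag[h:3 * h - 1], diag[2 * h:4 * h - 1]
    return cmf(xor(t0, t1)) + cmf(t1) + cmf(xor(t1, t2))


def cvf(v):
    """Component vector formation: (CVF(V1), CVF(V0+V1), CVF(V0))."""
    n = len(v)
    if n == 1:
        return list(v)
    h = n // 2
    v0, v1 = v[:h], v[h:]
    return cvf(v1) + cvf(xor(v0, v1)) + cvf(v0)


def cm(ct, cv):
    """Component multiplication: one AND per pair of components."""
    return [a & b for a, b in zip(ct, cv)]


def reconstruct(w_hat):
    """Reconstruction: (R(W0) + R(W1), R(W1) + R(W2))."""
    m = len(w_hat)
    if m == 1:
        return list(w_hat)
    third = m // 3
    r0, r1, r2 = (reconstruct(w_hat[i * third:(i + 1) * third])
                  for i in range(3))
    return xor(r0, r1) + xor(r1, r2)


def tmvp_blocks(diag, v):
    """W = R(CM(CMF(T), CVF(V)))."""
    return reconstruct(cm(cmf(diag), cvf(v)))


def tmvp_schoolbook(diag, v):
    """Direct product, used only to cross-check the block pipeline."""
    n = len(v)
    return [sum(diag[i - j + n - 1] & v[j] for j in range(n)) & 1
            for i in range(n)]


# n = 2 with T = [[0, 1], [1, 0]]: the product swaps the two entries of V.
assert tmvp_blocks([1, 0, 1], [1, 0]) == [0, 1]
```

Writing the multiplier as a composition of `cmf`, `cvf`, `cm` and `reconstruct` makes explicit that CMF depends only on $T$ and CVF only on $V$, which is the property the recombination in the following sections relies on.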
We can thus split the Fan-Hasan multiplier into four distinct blocks (CMF, CVF, CM and R), as shown in Fig. 1.
[Fig. 1. Block decomposition of the Fan-Hasan multiplier. Block labels: component matrix formation, component vector formation.]