All content in this area was uploaded by Dawood Alnajjar on Sep 24, 2015
Content may be subject to copyright.
A Comprehensive Guide for CRC Hardware
Implementation
Dawood Alnajjar and Mauricio Suguiy
IDEA! Electronic Systems
Av. Romeu Tórtima, 446
Campinas - SP - Brazil
Email: dawood.alnajjar@idea-ip.com, mauricio.suguiy@idea-ip.com
Abstract—Cyclic Redundancy Check (CRC) is an essential
component in many integrated circuits across the electronics
industry. This paper is a comprehensive CRC guide that explores
various approaches to implementing CRC in hardware and
presents synthesis estimates to show their impact on area and
timing. Finally, it helps designers customize and optimize
a CRC implementation to meet different project requirements.
I. INTRODUCTION
A CRC is computed by the polynomial division of a binary data
block by a divisor [1]. It is used in communication protocols
to detect errors in data sent over a medium. After the remainder
(checksum) of the CRC division is calculated, it is appended
to the data stream and transmitted. On the receiver side,
the polynomial division is performed again, and the
calculated checksum is compared to the received CRC. If they
are identical, the data is taken to have been received
as sent, with no errors introduced by the transmission
medium. Design engineers constantly encounter design
issues when implementing a CRC, such as:
• maximum operating frequency of the CRC circuit,
• available area for the CRC circuit,
• latency of the CRC circuit,
• complexity introduced by widening the datapath of
  the CRC circuit (parallelism),
• processing a partial amount of the data in the datapath
  in the first and last cycles,
• out-of-order processing of packet data.
These issues result from the constant advancement of
process technology, circuit speed, design complexity and
project requirements. Throughout this paper, various solutions
from the literature that address the aforementioned
issues are presented. Additionally, area and timing
results were estimated for each solution and are
reported. The main objective is to help engineers
understand the impact of each implementation and
select the one that best fits their requirements.
The paper is organized as follows: Section II will discuss
the theory behind the CRC algorithms; Section III will discuss
various CRC implementations; Section IV will demonstrate the
obtained results and discuss them. Finally, Section V concludes
the paper.
II. CRC ALGORITHMS
There are two algorithms for evaluating the CRC checksum:
• The CRC direct implementation (DI): $(x^{g} \times data) \bmod G(x)$,
  where data is the data stream to be processed for CRC
  calculation, $G(x)$ is the CRC polynomial (divisor), $g$ is the
  order of the CRC polynomial $G(x)$, and finally, multiplying by
  $x^{g}$ shifts the data to the left by a number of 0s equal to $g$.
• The CRC indirect implementation (II): $K \bmod G(x)$,
  where $K$ is the data already shifted by a number of 0s equal
  to $g$.
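The two definitions can be sketched in a few lines of Python by treating integers as GF(2) polynomials (bit i holds the coefficient of x^i). This is a minimal sketch, not the paper's implementation: the names `poly_mod`, `crc_direct` and `crc_indirect` are illustrative, and it assumes a zero initial register value, no bit reflection and no final XOR.

```python
G = 0x104C11DB7  # CRC-32 generator G(x), including the x^32 term
g = 32           # order of G(x)

def poly_mod(a: int, m: int = G) -> int:
    """Remainder of a(x) divided by m(x) over GF(2)."""
    while a.bit_length() >= m.bit_length():
        a ^= m << (a.bit_length() - m.bit_length())
    return a

def crc_direct(data: int) -> int:
    """DI: (x^g * data) mod G(x); the shift by g is part of the algorithm."""
    return poly_mod(data << g)

def crc_indirect(k: int) -> int:
    """II: K mod G(x); the caller has already appended g zeros to form K."""
    return poly_mod(k)

msg = int.from_bytes(b"123456789", "big")
print(hex(crc_direct(msg)), hex(crc_indirect(msg << g)))
```

Both calls print the same value, which illustrates that the two forms only differ in where the g zeros are introduced.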
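There are two relevant CRC linearity properties [2] that
will be used throughout this paper:
1) Given $A(x) = A_1(x) \oplus A_2(x) \oplus \dots \oplus A_n(x)$, then:
   $$\mathrm{CRC}(A(x)) = \mathrm{CRC}(A_1(x)) \oplus \dots \oplus \mathrm{CRC}(A_n(x))$$
2) When the CRC of a chunk of data is calculated, the
   CRC of that same chunk shifted by a number of zeros
   equal to $k$ is:
   a) In direct CRC algorithm implementation
      \begin{align*}
      \mathrm{CRC}(x^{k}A(x)) &= x^{k}A(x)\,x^{g} \bmod G(x) \\
                              &= x^{k}\,\mathrm{CRC}(A(x)) \bmod G(x) \\
                              &= x^{k-g}\,\mathrm{CRC}(A(x))\,x^{g} \bmod G(x) \\
                              &= \mathrm{CRC}(x^{k-g}\,\mathrm{CRC}(A(x)))
      \end{align*}
   b) In indirect CRC algorithm implementation
      \begin{align*}
      \mathrm{CRC}(x^{k}A(x)) &= x^{k}A(x) \bmod G(x) \\
                              &= x^{k}\,\mathrm{CRC}(A(x)) \bmod G(x) \\
                              &= (x^{k}\,\mathrm{CRC}(A(x))) \bmod G(x) \\
                              &= \mathrm{CRC}(x^{k}\,\mathrm{CRC}(A(x)))
      \end{align*}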
Property 1 states that if a message is divided into
sub-messages with respect to the place value of each sub-message,
the CRC of the message is the XOR of the CRCs of the
sub-messages (with the necessary appended zeros to align each
sub-message with its place value). Property 2(a) means that the
CRC of the data chunk shifted by $k$ bits is equal to the CRC
of the CRC of that data chunk followed by $k-g$ bits of
zeros. CRC property 2(b) states that the CRC of the data chunk
shifted by $k$ bits is equal to the CRC of the CRC of that data
chunk followed by $k$ bits of zeros.
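The properties above can be checked numerically. The following is a small sketch under the same assumptions as the definitions in this section (zero initial value, no reflection); the operand values and the names `crc_di` and `crc_ii` are arbitrary illustrations.

```python
G, g = 0x104C11DB7, 32  # CRC-32 generator with x^32 term, and its order

def poly_mod(a: int) -> int:
    """Remainder of a(x) divided by G(x) over GF(2)."""
    while a.bit_length() > g:
        a ^= G << (a.bit_length() - (g + 1))
    return a

def crc_di(a: int) -> int:
    return poly_mod(a << g)   # direct form: (x^g * a) mod G

def crc_ii(k: int) -> int:
    return poly_mod(k)        # indirect form: K mod G

# Property 1: sub-messages aligned to their place value, CRCs XORed together.
a1 = 0xDEADBEEF << 64
a2 = 0x0123456789ABCDEF
prop1_holds = crc_di(a1 ^ a2) == crc_di(a1) ^ crc_di(a2)

# Property 2(a), direct: CRC(x^k A) = CRC(x^(k-g) CRC(A)), for k >= g.
A, k = 0xCAFEBABE12345678, 128
prop2a_holds = crc_di(A << k) == crc_di(crc_di(A) << (k - g))

# Property 2(b), indirect: CRC(x^k A) = CRC(x^k CRC(A)).
prop2b_holds = crc_ii(A << k) == crc_ii(crc_ii(A) << k)

print(prop1_holds, prop2a_holds, prop2b_holds)
```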
Note that the two algorithms are identical if the data is
treated as one message. However, differences become
apparent if the data is divided into sub-messages and
each sub-message is processed independently. When the data is
divided into sub-messages, in the direct implementation each
sub-message is appended with $g$ zeros, while in the indirect
implementation only the last sub-message is appended with $g$
zeros.
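The difference can be made concrete with a short sketch that accumulates a CRC over 128-bit sub-messages in both styles. The function names and the 128-bit sub-message width `W` are assumptions chosen to match the datapath used later in the paper; `poly_mod` stands in for whatever circuit performs the division.

```python
G, g, W = 0x104C11DB7, 32, 128  # generator, its order, sub-message width

def poly_mod(a: int) -> int:
    """Remainder of a(x) divided by G(x) over GF(2)."""
    while a.bit_length() > g:
        a ^= G << (a.bit_length() - (g + 1))
    return a

def crc_direct_chunks(chunks):
    """DI: every sub-message is appended with g zeros as it is processed."""
    crc = 0
    for c in chunks:
        # Property 2(a) with k = W: advance the old CRC by W - g zeros, then CRC it.
        advanced = poly_mod((crc << (W - g)) << g)
        crc = advanced ^ poly_mod(c << g)
    return crc

def crc_indirect_chunks(chunks):
    """II: the g zeros are appended only once, after the last sub-message."""
    acc = 0
    for c in chunks:
        # Property 2(b) with k = W: advance the old remainder by W zeros.
        acc = poly_mod(acc << W) ^ poly_mod(c)
    return poly_mod(acc << g)

chunks = [0x00112233445566778899AABBCCDDEEFF, 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0]
whole = (chunks[0] << W) | chunks[1]
print(crc_direct_chunks(chunks) == crc_indirect_chunks(chunks) == poly_mod(whole << g))
```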
Fig. 1. CRC32 serial implementation
III. CRC IMPLEMENTATIONS
The CRC polynomial used throughout the
implementations in this paper is CRC-32 (0x04C11DB7),
which is used in many communication protocols, such as
Ethernet and PCI Express. The datapath width of the
implementations is 128 bits. The area and timing results reported
in the following sections were obtained using a GlobalFoundries
65 nm library and Cadence Encounter RTL Compiler.
A. Serial CRC Direct Implementation
The most traditional CRC design is a serial CRC DI
implemented as a linear feedback shift register (LFSR) and an
XOR tree. The serial CRC32 implementation is shown in Figure 1. This
circuit can run at a maximum of 2.39 GHz (GF 65 nm); since it
processes one bit per cycle, it is not a possible solution for
PCIe 3.0, which has a transfer rate of 8 Gbps. Hence, parallel
implementations become a desirable alternative. Unfortunately,
one cannot depend on the advanced RTL programming capabilities
of Verilog/VHDL as described in [3], where a serial CRC generator
function is used to generate a parallel CRC function. The maximum
frequency obtained this way for a 128-bit parallel CRC32
implementation with a byte enable was as low as 76 MHz, and the
circuit occupies an area as high as 28044 µm². Note that the byte
enable is necessary to process only part of the datapath
(e.g. the partially filled datapath word at the end
of a packet).
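As a behavioural illustration of the serial DI, the sketch below models one LFSR clock per input bit. It is a software model only, with illustrative names (`lfsr_step`, `crc32_serial`), MSB-first bit order and no reflection or final XOR assumed.

```python
POLY = 0x04C11DB7  # CRC-32 polynomial without the x^32 term

def lfsr_step(state: int, bit: int) -> int:
    """One clock of the serial direct-form LFSR with a single input bit."""
    feedback = ((state >> 31) & 1) ^ bit      # register MSB XOR incoming bit
    state = (state << 1) & 0xFFFFFFFF         # shift the 32-bit register
    return state ^ POLY if feedback else state

def crc32_serial(data: bytes, init: int = 0) -> int:
    """Clock the LFSR once per message bit, most significant bit first."""
    state = init
    for byte in data:
        for i in range(7, -1, -1):
            state = lfsr_step(state, (byte >> i) & 1)
    return state

print(hex(crc32_serial(b"123456789")))
print(hex(crc32_serial(b"123456789", init=0xFFFFFFFF)))
```

Because each call to `lfsr_step` corresponds to one clock cycle, the throughput of this structure in bits per second equals its clock frequency, which is why a 2.39 GHz serial circuit cannot sustain 8 Gbps.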
B. Parallel CRC direct implementations
The CRC DI can also be implemented structurally (i.e. without
depending on the advanced RTL programming capabilities of
Verilog/VHDL) as a parallel-input CRC algorithm with a wider
datapath (e.g. a multiple of 8 bits). Like the serial design,
parallel designs are composed of an LFSR and an XOR tree;
however, the complexity of the XOR tree increases as the
datapath width increases.
Parallel structural implementations can be generated using the
tool in [4] or as explained in [5]. Both yield the same results.
This case study is shown in Figure 2. The design uses
parallel CRCs of various input lengths with a granularity of 1
byte in order to process the last sub-message of data
(a multiple of 1 byte). If the message size is known to be a
multiple of 128 bits, then only the 128-bit CRC32 generator
is necessary. The X-bit input CRC calculators used
were generated from [4]. Note that all of the CRC
generators share the same CRC checksum register. A parallel
CRC DI with a double word enable (4 bytes) was found to
have an area of 16577 µm² and a maximum frequency of
974 MHz. With a 128-bit datapath, this corresponds to a
throughput of about 124.7 Gbps, well above the 8 Gbps rate.
To analyze the contribution of each X-bit input CRC module,
the area and delay were estimated separately, as shown in Figure 3.
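One common way to derive the XOR network of a parallel CRC is to exploit the linearity of the serial LFSR: the response to each single state or data bit gives one column of the update matrix. The sketch below follows that idea as an assumption; it is not claimed to reproduce the output of the tool in [4], and all names are illustrative.

```python
POLY, W = 0x04C11DB7, 128  # CRC-32 polynomial, datapath width in bits

def serial_word(state: int, word: int, width: int = W) -> int:
    """Reference model: clock the serial DI LFSR `width` times, MSB first."""
    for i in range(width - 1, -1, -1):
        feedback = ((state >> 31) ^ (word >> i)) & 1
        state = ((state << 1) & 0xFFFFFFFF) ^ (POLY if feedback else 0)
    return state

# The update is linear over GF(2), so single-bit responses describe it fully.
STATE_COLS = [serial_word(1 << j, 0) for j in range(32)]
DATA_COLS = [serial_word(0, 1 << j) for j in range(W)]

def parallel_step(state: int, word: int) -> int:
    """One clock of the W-bit parallel CRC: XOR the columns of all set bits."""
    nxt = 0
    for j in range(32):
        if (state >> j) & 1:
            nxt ^= STATE_COLS[j]
    for j in range(W):
        if (word >> j) & 1:
            nxt ^= DATA_COLS[j]
    return nxt

def xor_fan_in(out_bit: int) -> int:
    """Number of state and data inputs feeding one next-state bit's XOR tree."""
    return sum((c >> out_bit) & 1 for c in STATE_COLS + DATA_COLS)

s = 0xDEADBEEF
w = int.from_bytes(bytes(range(16)), "big")
print(parallel_step(s, w) == serial_word(s, w), xor_fan_in(0))
```

The `xor_fan_in` helper gives a rough feel for why the XOR tree grows with the datapath width: every additional data bit adds a column that may contribute to each output bit.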
C. CRC incremental calculation indirect implementations
The following case studies are inspired by [6] and [7].
Taking advantage of the CRC properties in Section II,
the implementations in Figure 4 can be produced. The direct
implementation will not be considered in this case study for
reasons that will be apparent later on when this case study
is demonstrated. Note that Func A and Func B can be either
XOR trees, or lookup tables (LUTs) that contain pre-calculated
values of the CRC as demonstrated by [7] or a combination of
both. According to [7], using LUTs with pre-calculated CRC
values will reduce the critical path of the CRC implementation.
1) LUT-based indirect CRC incremental implementation:
This design, shown in Figure 4(b), is based on property 1 and
property 2(b) of the CRC and is an II of CRC. Func A and Func
B are implemented using LUTs (LUTA and LUTB) and/or
XOR trees. Only the last sub-message is appended
with 32 zeros, meaning that after all the data has been processed,
the CRC circuit still needs to process 32 zeros. In this
implementation the circuit can be optimized: the 4 least
significant bytes (32 bits) of the sub-message do not need to be
processed, because they form a polynomial of lower order than
the CRC polynomial, so the remainder of their division
is those same 32 bits. That leaves an LUTA function with
96-bit inputs and an LUTB function with 32-bit inputs. LUTA
performs the CRC equation given in Section II based
on property 1 of the CRC.
LUTA and LUTB can be divided again, using property 1
of the CRC, into multiple smaller LUTs, ranging from 96 LUTs
with 1-bit input and 32-bit output to a single large LUT with
96-bit input and 32-bit output. The trade-offs are shown in Table I.
Func B performs the CRC incremental calculation
shown in CRC property 2(b), where the CRC calculated in
the current cycle is XORed with the previously calculated and
shifted CRC.
Fig. 2. CRC32 parallel implementation with variable width processing
LUT input width    Required LUTs    Required LUT memory
       1                96                6 Kbit
       2                48                6 Kbit
       4                24               12 Kbit
       8                12               96 Kbit
      16                 6               12 Mbit
      32                 3              384 Gbit
TABLE I. ESTIMATED REQUIRED MEMORY FOR DESIGNS OF
DIFFERENT LUT INPUT LENGTH
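The figures in Table I can be reproduced with simple arithmetic, assuming that the 96 LUTA inputs are split evenly, that each LUT has 2^w entries of 32 bits for an input width w, and that memory is counted in bits. These assumptions are inferred from the numbers and are not stated in the text.

```python
def lut_memory_bits(lut_width: int, total_inputs: int = 96, out_bits: int = 32):
    """Return (number of LUTs, total memory in bits) for a given LUT input width."""
    n_luts = total_inputs // lut_width
    bits = n_luts * (2 ** lut_width) * out_bits
    return n_luts, bits

for width in (1, 2, 4, 8, 16, 32):
    n_luts, bits = lut_memory_bits(width)
    print(f"{width:2d}-bit inputs: {n_luts:2d} LUTs, {bits / 1024:.0f} Kbit")
```

The exponential term 2^w dominates quickly, which is why the memory stays flat for very narrow LUTs and becomes impractical beyond 16-bit inputs.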
The function of the last bit in LUTA is
$(x^{127}\,data) \bmod G(x)$, while the function of the first bit
of LUTB is $(x^{128}\,data) \bmod G(x)$. This means that LUTA
and LUTB can be combined into one large LUT, since their
inputs are aligned. This does not apply to the CRC DI,
which is what makes the DI more complex to implement with
no extra latency. In the CRC DI, the function of the last bit in
LUTA is $(x^{127}\,data) \bmod G(x)$, while the function of the first
bit of LUTB is $(x^{96}\,data) \bmod G(x)$, based on property 2(a).
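A software sketch of the LUT-based II with 8-bit LUT inputs is given below. It assumes the table contents are precomputed remainders of each byte lane at its place value; the table layout, the names `LUTA`, `LUTB`, `step` and `crc_ii`, and the zero initial value are illustrative choices rather than the paper's exact circuit.

```python
G = 0x104C11DB7  # CRC-32 generator including the x^32 term

def poly_mod(a: int) -> int:
    """Remainder of a(x) divided by G(x) over GF(2)."""
    while a.bit_length() > 32:
        a ^= G << (a.bit_length() - 33)
    return a

# One 8-bit-input, 32-bit-output table per byte lane.
# LUTA covers data bits 32..127; LUTB covers the old CRC at place values 128..159.
LUTA = [[poly_mod(v << (32 + 8 * p)) for v in range(256)] for p in range(12)]
LUTB = [[poly_mod(v << (128 + 8 * p)) for v in range(256)] for p in range(4)]

def step(crc: int, word: int) -> int:
    """Process one 128-bit sub-message; no zeros are appended here."""
    nxt = word & 0xFFFFFFFF                      # low 32 bits pass through unchanged
    for p in range(12):
        nxt ^= LUTA[p][(word >> (32 + 8 * p)) & 0xFF]
    for p in range(4):
        nxt ^= LUTB[p][(crc >> (8 * p)) & 0xFF]  # shifted previous CRC (property 2(b))
    return nxt

def crc_ii(words) -> int:
    crc = 0
    for word in words:
        crc = step(crc, word)
    return poly_mod(crc << 32)                   # the 32 zeros appended at the end

msg = bytes(range(32))
words = [int.from_bytes(msg[i:i + 16], "big") for i in (0, 16)]
print(hex(crc_ii(words)))
```

Since the LUTB lanes continue the place values of the LUTA lanes without a gap, the sixteen tables behave like one aligned bank, which mirrors the observation that LUTA and LUTB can be combined in the II.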
Now, in the last iteration, three points need to be considered:
• the appended 32-bit zeros,
• the size of the last sub-message,
• the processing of the accumulated CRC, since the
  last sub-message may not be 128 bits (according to
  property 2(b)).
If the number of data bits in the last sub-message to be
processed is less than 128 and greater than 96, then extra
circuitry is necessary to process the 32 zeros that
are appended, or the 32 zeros can be processed in
the next cycle, incurring one extra cycle of latency. If
the CRC circuit is a CRC checker, the 32 zeros take the place of
the 32-bit CRC result that was attached to the message at the
transmitter side, so the circuit can be arranged in a way
where the 32 zeros do not incur one extra cycle of latency.
The accumulated CRC uses the same LUTB function.
On the other hand, if the number of bits to be processed is
less than 96 (say $M$), then the CRC must be prepended to the
last sub-message (its function is $(x^{M}\,data) \bmod G(x)$,
as shown in property 2(b)). After the CRC is shifted to the
right by $k-M$ positions, the leftmost $k-M$ bits are
filled with zeros.
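The short-tail case can be sketched as follows. This is an interpretation under stated assumptions: the accumulated CRC is placed directly above an M-bit tail (M at most 96) so that the combined word still fits the 128-bit datapath, and `poly_mod` stands in for the Func A datapath. The name `last_step` is hypothetical.

```python
G = 0x104C11DB7  # CRC-32 generator including the x^32 term

def poly_mod(a: int) -> int:
    """Remainder of a(x) divided by G(x) over GF(2)."""
    while a.bit_length() > 32:
        a ^= G << (a.bit_length() - 33)
    return a

def last_step(crc: int, tail: int, m: int) -> int:
    """Fold an M-bit final sub-message (M <= 96) into the accumulated CRC."""
    assert 0 < m <= 96 and tail < (1 << m)
    word = (crc << m) | tail      # CRC prepended; the unused upper bits stay zero
    assert word < (1 << 128)      # the combined word fits the 128-bit datapath
    return poly_mod(poly_mod(word) << 32)  # finally, the 32 appended zeros

head = int.from_bytes(bytes(range(32)), "big")   # two full 128-bit sub-messages
tail, m = 0xA1B2C3D4E5, 40                        # 40-bit final sub-message
acc = poly_mod(head)                              # accumulated II remainder so far
print(hex(last_step(acc, tail, m)))
```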
Fig. 3. Area and timing results pertaining to X-bit CRC DI obtained from [4]
Fig. 4. Implementations deduced from the explained properties of CRC
Accommodating the initial state of the LFSR: When the
initial value of the CRC register must be set to anything
other than 0 (due to a design requirement), extra care must be
taken if the DI is transformed to an II, since the
initial value is different. In the DI, the initial value is simply
stored in the LFSR of the direct CRC implementation. For
the II, there are various ways to calculate the equivalent initial
value:
• Multiply the initial value of the DI by the generator
  matrix G [8] 32 times, i.e. multiply by $G^{32}$.
  Attention must be paid to the bit order of the LFSR
  when multiplying by G [8]. Removing $P$ zeros
  from a message A has the same effect on CRC(A) as
  multiplying CRC(A) by $G^{P}$. Comparing the II and
  the DI functions above when the message is divided
  into sub-messages, one can see that the difference
  between the two is that the DI is advanced by 32
  cycles with data bits equal to zero, while in the II the
  32 zeros are added at the end of the message. (E.g.
  an initial value of 0xFFFFFFFF in the DI domain
  produces the same CRC32 result as the II with its
  initial value set to 0x46AF6449.)
• Use online tools such as [9].
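The conversion between the two initial values can be sketched by stepping the LFSR backwards (removing zeros) or forwards (appending zeros) 32 times. The function names are illustrative, and the bit order assumed here is MSB first with no reflection.

```python
POLY = 0x04C11DB7  # CRC-32 polynomial without the x^32 term

def to_indirect_init(direct_init: int) -> int:
    """Undo 32 zero-input LFSR clocks, i.e. divide by x^32 modulo G(x)."""
    crc = direct_init
    for _ in range(32):
        lsb = crc & 1
        if lsb:
            crc ^= POLY          # add G(x) so the value becomes divisible by x
        crc >>= 1
        if lsb:
            crc |= 0x80000000    # the x^32 term of G(x) lands in bit 31
    return crc

def to_direct_init(indirect_init: int) -> int:
    """Apply 32 zero-input LFSR clocks, i.e. multiply by x^32 modulo G(x)."""
    crc = indirect_init
    for _ in range(32):
        msb = crc & 0x80000000
        crc = (crc << 1) & 0xFFFFFFFF
        if msb:
            crc ^= POLY
    return crc

print(hex(to_indirect_init(0xFFFFFFFF)))
```

For the DI initial value 0xFFFFFFFF, this yields the II initial value quoted in the text, and `to_direct_init` maps it back.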
2) Function-based indirect CRC incremental implementation:
This design is an optimized implementation of the
LUT-based one described in Section III-C1 where the LUT
input width is 1. The block diagram is shown in Figure 4(b),
where Func A is a 96-bit input parallel DI CRC obtained
from [4]. Taking advantage of the LFSR characteristics of the
CRC, Func B is replaced with a multiplier that multiplies the
CRC register by the H matrix 128 times, i.e. by $H^{128}$. The H
matrix is the matrix associated with the LFSR (parity matrix) of
the direct implementation of the CRC [8]. Appending $P$ zeros to
a message A has the same effect on CRC(A) as multiplying
CRC(A) by $H^{P}$. As with the LUT-based design, this is an
indirect implementation, so the initial value in the direct
implementation domain needs to be translated to this domain
before use.
As for the final variable-length sub-message, the same
circuit that was implemented for the LUT-based II in Section
III-C1 is used here.
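The role of the H matrix can be illustrated with a small GF(2) matrix model. In this sketch, a matrix is stored as 32 column integers, and H is built from one zero-input clock of the LFSR; the representation and the helper names are assumptions made for illustration.

```python
POLY = 0x04C11DB7  # CRC-32 polynomial without the x^32 term

def step_zero(state: int) -> int:
    """One LFSR clock with a zero data bit (appending a single zero)."""
    msb = state >> 31
    state = (state << 1) & 0xFFFFFFFF
    return state ^ POLY if msb else state

# Column j of H is the image of the unit vector with only bit j set.
H = [step_zero(1 << j) for j in range(32)]

def mat_vec(m, v: int) -> int:
    """Multiply a 32x32 GF(2) matrix (list of columns) by a 32-bit vector."""
    out = 0
    for j in range(32):
        if (v >> j) & 1:
            out ^= m[j]
    return out

def mat_mul(a, b):
    """Matrix product a*b: column j of the result is a applied to column j of b."""
    return [mat_vec(a, col) for col in b]

def mat_pow(m, p: int):
    """m raised to the power p by repeated squaring."""
    result = [1 << j for j in range(32)]  # identity matrix
    while p:
        if p & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        p >>= 1
    return result

H128 = mat_pow(H, 128)
print(hex(mat_vec(H128, 0x12345678)))
```

Multiplying the CRC register by `H128` has the same effect as clocking the LFSR 128 times with zero input, which is the job Func B performs between consecutive 128-bit sub-messages.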
Replacing the multiplexer and shifter with H matrices: In
order to reduce the critical path by removing the shifter and
multiplexer that prepend the CRC to the last sub-message,
as mentioned in Section III-C1, it might seem reasonable to
create a Func B that encompasses the whole task of evaluating
the next-cycle CRC. If we assume that the granularity of the
datapath is double words, then there is a need for four
H matrix multiplication functions: $H^{32}$, $H^{64}$, $H^{96}$ and
$H^{128}$, as shown in Figure 5(a). While processing sub-messages
other than the last sub-message, $H^{128}$ is used. When the last
sub-message is processed, the accumulated CRC is multiplied by one
of the four $H^{x}$ functions depending on the size of the last
sub-message. Note that two $H^{32}$ circuits can be cascaded
vertically to generate a slower and larger $H^{64}$. This can be
taken advantage of if the path in Func A is the critical one
and there is some positive slack in Func B. Then there is
margin to implement a slightly more area-optimized Func
B, as shown in Figure 5(b). Four $H^{32}$ blocks are used to generate
any of the four H functions $H^{32}$, $H^{64}$, $H^{96}$ and $H^{128}$.
This results in a smaller area but a longer critical path. Our
findings show that by adopting the design in Figure 5(b) instead
of the one in Figure 5(a), the area of Func B is decreased by 6.13%
at the expense of an increase in delay that is as low as 0.73%.
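The cascading idea can be modelled by chaining one $H^{32}$ block up to four times and tapping the chain after the required number of double words. The sketch below is a behavioural model only; in hardware all four stages would exist and a multiplexer would select the tap. The names `h32` and `func_b_cascaded` are hypothetical.

```python
POLY = 0x04C11DB7  # CRC-32 polynomial without the x^32 term

def step_zero(state: int) -> int:
    """One LFSR clock with a zero data bit."""
    msb = state >> 31
    state = (state << 1) & 0xFFFFFFFF
    return state ^ POLY if msb else state

def advance(state: int, clocks: int) -> int:
    """Reference model: clock the LFSR `clocks` times with zero input."""
    for _ in range(clocks):
        state = step_zero(state)
    return state

# Columns of H^32: response of each single register bit after 32 zero clocks.
H32_COLS = [advance(1 << j, 32) for j in range(32)]

def h32(crc: int) -> int:
    """One H^32 block: a fixed XOR network on the 32-bit CRC register."""
    out = 0
    for j in range(32):
        if (crc >> j) & 1:
            out ^= H32_COLS[j]
    return out

def func_b_cascaded(crc: int, dwords: int) -> int:
    """Tap a chain of four H^32 blocks after `dwords` stages (1 to 4)."""
    assert 1 <= dwords <= 4
    for _ in range(dwords):
        crc = h32(crc)
    return crc

print(hex(func_b_cascaded(0xCAFEBABE, 4)))
```

Each extra stage in the chain adds logic depth, which is the delay cost traded against the area saved by not instantiating separate flattened matrices.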
Fig. 5. Cascading H multiplication functions to achieve area-delay trade-off
IV. RESULTS AND DISCUSSION
The area and timing results for all the previously mentioned
implementations are shown in Table II. The LUT-based
implementations have similar areas and maximum frequencies,
although they have different XOR tree sizes combining the
outputs of the LUTs. It was initially
expected that the 16-input LUT would have a larger area
and a higher maximum frequency; however, this result may
have been affected by the synthesis tool, since the Cadence
RTL Compiler synthesizes the LUTs as an unstructured mass
of standard cells and applies optimization techniques to it,
resulting in similar area and maximum frequency numbers
for all LUT implementations and the function-based II. The
implementation based on advanced RTL programming capabilities
(ARPC) mentioned in Section III-A achieves a low maximum
frequency and a large area.
In summary, one still cannot depend on the advanced RTL
programming capabilities of Verilog and VHDL, because
the results are strongly dependent on the synthesis tool.
As for the parallel direct implementation, both the maximum
frequency and the area are significantly higher than those of the
LUT-based and function-based designs. For high-speed
implementations this design might be favorable.
Several implementations and techniques have been
demonstrated, and depending on the project requirements, a CRC
design that best fits the needs can be derived.
Implementation        Area (µm²)    Max. Frequency (MHz)
Serial ARPC-based       28044              76
Parallel DI             16577             974
1-input LUT              8898             571
2-input LUT              8968             567
4-input LUT              8904             572
8-input LUT              8835             567
16-input LUT             9164             571
Function-based          10558             591
TABLE II. RESULT SUMMARY
V. CONCLUSION
This paper proposes some design ideas for CRC hardware
implementations and compiles various ideas from the
literature. Synthesis results were included to give a rough
estimate of the area and delay impact that each design
incurs. The paper serves as a comprehensive guide for design
engineers to understand CRC hardware implementations, and as a
starting point for any CRC design.
ACKNOWLEDGMENT
The authors would like to thank Davi Castro, Thiago
Crespo, Roberto Borgognoni, Carlos Castro and the remaining
members of IDEA! Electronic Systems for their feedback and
discussions. The authors would also like to acknowledge the
support of CNPq (Conselho Nacional de Desenvolvimento
Científico e Tecnológico), SEPIN (Secretaria de Política de
Informática), MCTI (Ministério da Ciência, Tecnologia e
Inovação) and Eldorado Research Institute.
REFERENCES
[1] W. Peterson and D. Brown, “Cyclic codes for error detection”,
Proceedings of the IRE, vol. 49, no. 1, pp. 228–235, Jan. 1961.
[2] I. S. Satran, D. Sheinwald, “Out of Order Incremental CRC
Computation”, IEEE Trans. Comp., vol. 54, pp. 1178–1181, 2005.
[3] A. Simionescu, “CRC tool, Computing in Parallel for Ethernet”, 2001.
Online: http://outputlogic.com/my-stuff/parallel crc byte enable.pdf
[4] Easics, “CRC generation tool”,
Online: http://www.easics.com/webtools/crctool
[5] E. Stavinov, “A practical Parallel CRC Generation Method”, Circuit
Cellar, 2010.
[6] IBM labs, “Out of order Incremental CRC computation”, 2003.
[7] Y. Sun and M. Kim, “High Performance Table-based algorithm for
Pipelined CRC calculation”, Journal of Communications, 2012.
[8] M. Walma, Intel Corp., “Pipelined Cyclic Redundancy Check
Calculation”, Proceedings of 16th International Conference on Computer
Communications and Networks, 2007.
[9] Tool to generate initial values for direct and indirect CRC
implementations, Online: http://www.zorc.breitbandkatze.de/crc.html