Low-Latency LDPC Decoding Achieved by Code and Architecture Co-Design


Abstract

A novel low-density parity-check decoder architecture is presented that can achieve a high data throughput while retaining the flexibility to decode a wide range of quasi-cyclic codes. The proposed architecture allows multiple message-update schedules to be combined, providing an additional degree of freedom to jointly optimize the code and decoder architecture. Protograph-based code constructions are introduced that exploit this added degree of freedom in order to maximize data throughput, and that are also optimized to reduce the complexity of the required parallel data accesses. For some examples and under an ideal pipeline speedup assumption, the proposed architecture and code designs reduce decoding latency by a factor of 3.2 compared to a decoder using a strict sequential schedule.
Elsa Dupraz*, François Leduc-Primeau*, and François Gagnon†
*IMT Atlantique, Lab-STICC, UBL, Brest, France
†École de Technologie Supérieure, Montréal, Canada
A desirable feature of low-density parity-check (LDPC)
decoders is the ability to support a wide range of code
characteristics, in order to allow code rate and length adap-
tation, or to handle multiple communication standards with a
single decoder. In this paper, we are interested in designing
highly parallel LDPC decoder architectures that retain the
flexibility to decode any quasi-cyclic (QC) code that satisfies
basic constraints (maximum node degrees, maximum lifting
factor, etc.). Highly parallel architectures are interesting for
applications that demand large data throughputs. They are also
useful for low-power operation, since the latency reduction
obtained from parallel execution can be traded off to tolerate
an increase in propagation delays resulting from the low-
voltage operation of the circuit.
Three key strategies are widely used to achieve high-
throughput LDPC decoders. The first consists in generating
the decoder messages by following a sequential (also known as
serial) update schedule, which reduces the number of decoding
iterations approximately by a factor of two [1]. The other two
are standard circuit design strategies: implementing several
processing units in parallel, and using circuit pipelining to split
up each unit into several stages and thus increase the clock
frequency. Unfortunately, it is not possible to use the three
techniques simultaneously, because the sequential schedule in
general prevents the overlap of computations belonging to
different layers.
We propose a novel decoder architecture that can simulta-
neously use a large number of processing units together with
pipelining while also taking advantage of an efficient message-
passing schedule. This architecture uses a mechanism called
“Δ-updates” to maintain the correctness of the computation
irrespective of the message-update schedule. As a result, the
schedule can be chosen on a node-by-node basis, providing an
additional design parameter that can be optimized. We show
that this makes it possible to combine parallel processing with
pipelining at only a small penalty in the average number of
iterations, resulting in a decoder with a faster convergence time.
Because of the highly parallel nature of the proposed
architecture, it becomes challenging to ensure that the re-
quired data is always accessible, while maintaining a low
complexity for the memory management circuits. We discuss
how to optimize the data management for general QC codes,
and furthermore propose an optimized code construction that
reduces the decoder complexity.
In our construction, the code degree distributions are de-
scribed by protographs [2], which make it possible to obtain
very efficient QC codes [3]. The standard approach for constructing
QC codes from protographs consists of a two-step lifting [3]
that aims to improve the minimum distance and girth proper-
ties of the code. The first lifting step produces a base matrix
from a given protograph by means of a Progressive Edge-
Growth (PEG) algorithm [4] that seeks to maximize the girth
of the base matrix. The second step is realized with a circulant-
PEG algorithm [5] and consists of replacing all the non-zero
components of the base matrix by circulant matrices. We
propose a modified PEG algorithm for constructing the base
matrix at the first lifting step. As shown in our simulation
results, the modified construction enables efficient data man-
agement at the price of a slight performance degradation.
The remainder of this paper is organized as follows. Sec-
tion II reviews the standard protograph-based code construc-
tion approach. Section III briefly reviews some decoder archi-
tectures available in the literature and describes the proposed
architecture. Then, Section IV presents our architecture-aware
optimized code constructions. Finally, Section V evaluates the
error-correction and throughput performance of the proposed
codes and architecture.
A. Parity-check matrix
The parity-check matrix $H$ of size $M \times N$ of an LDPC
code can be represented by a bipartite Tanner graph. In this
Tanner graph, the set of vertices is composed of $N$ Variable
Nodes (VNs) $\mathcal{V} = \{v_1, \cdots, v_N\}$ and $M$ Check Nodes (CNs)
$\mathcal{C} = \{c_1, \cdots, c_M\}$. There is an edge between a VN $v_n$ and a
CN $c_m$ if $H_{m,n} = 1$. We denote by $d_c$ the degree of a CN, and
by $d_{c,\max}$ the largest CN degree in the Tanner graph. We also
denote by $\mathcal{C}_v \subseteq \mathcal{C}$ the set of CNs that are connected to VN $v$,
and by $\mathcal{V}_c \subseteq \mathcal{V}$ the set of VNs that are connected to CN $c$. We
now describe the standard method for constructing a parity-
check matrix $H$ that ensures good decoding performance.
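As a small illustration of this notation, the following Python sketch builds the neighbor sets of each VN and CN and the largest CN degree from a 0/1 matrix. The function name `tanner_sets`, the toy matrix `H_toy`, and the zero-based indexing are assumptions made for this example only.

```python
def tanner_sets(H):
    """Return (C_v, V_c, d_c_max) for a 0/1 parity-check matrix H.

    H is a list of M rows of length N. Indices are zero-based here,
    whereas the text numbers nodes from 1.
    """
    M, N = len(H), len(H[0])
    # V_c[m]: set of VNs connected to CN m (ones in row m).
    V_c = [{n for n in range(N) if H[m][n] == 1} for m in range(M)]
    # C_v[n]: set of CNs connected to VN n (ones in column n).
    C_v = [{m for m in range(M) if H[m][n] == 1} for n in range(N)]
    # Largest CN degree in the Tanner graph.
    d_c_max = max(len(vs) for vs in V_c)
    return C_v, V_c, d_c_max


# Hypothetical 3 x 4 toy matrix, not taken from the paper.
H_toy = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
]
C_v, V_c, d_c_max = tanner_sets(H_toy)
```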
B. Protographs
A protograph [2] is a small Tanner graph that describes the
connections between CNs and VNs in the full Tanner graph of
the code. We denote by $M_S \times N_S$ the size of the protograph,
and the matrix representation $S$ of the protograph is given by
$$S = \begin{bmatrix} S_{1,1} & \cdots & S_{1,N_S} \\ \vdots & \ddots & \vdots \\ S_{M_S,1} & \cdots & S_{M_S,N_S} \end{bmatrix}, \tag{1}$$
where the coefficients $S_{i,j}$ are positive integers. A protograph
describes the connections between $M_S$ types of CNs and
$N_S$ types of VNs. In any LDPC code constructed from the
protograph $S$, any CN of type $i$ will be connected to $S_{i,j}$ VNs
of type $j$. The coefficients $S_{i,j}$ can be greater than 1, which
will give parallel edges in the Tanner graph representation of
the protograph.
The final code performance highly depends on its under-
lying protograph $S$. For an AWGN channel, Density Evolution
[6] evaluates the protograph threshold as the minimum
SNR that can be tolerated by the decoder to reconstruct the
original codeword without error, when the codeword length
tends to infinity. For a given rate, the protograph can be
optimized by Differential Evolution [7], which aims at finding
the protograph with the smallest threshold.
From a given protograph, we can construct a QC parity-
check matrix $H$ of the desired size by applying the two-step
lifting procedure of [3]. This two-step lifting will not only
allow us to improve the minimum distance and girth properties
of the code, but also to address the constraints of the decoder
implementation by proposing novel code constructions that
only modify one of the two steps of the lifting.
C. Two-step lifting
The first lifting step aims to construct a base matrix $B$ of
size $M_B \times N_B$ from the protograph, where $M_B = Z_1 M_S$,
$N_B = Z_1 N_S$, and $Z_1$ is called the first lifting factor. A base
matrix constructed from a given protograph $S$ will contain
$Z_1$ VNs of each of the $N_S$ types and $Z_1$ CNs of each of
the $M_S$ types. In the following, the VNs (resp. CNs) of the
base matrix are referred to as B-VN (resp. B-CN). The first
lifting is realized by means of a copy-and-permute procedure
that first consists of duplicating the protograph $S$ $Z_1$ times.
The edges of the obtained Tanner graph are then interleaved so
that the protograph degrees $S_{i,j}$ are fulfilled, the Tanner graph
of $B$ is connected, and there are no remaining parallel edges.
Edge interleaving is realized by using a PEG algorithm [4] that
reduces the number of short cycles in $B$, since short cycles
could degrade the final code performance.
Fig. 1. High-level view of the decoder architecture.
The second lifting aims to construct a QC parity-check
matrix $H$ of size $M \times N$ from the base matrix $B$, where
$M = Z_2 M_B$, $N = Z_2 N_B$, and $Z_2$ is called the second lifting
factor. The second lifting is done by replacing all the non-
zero components of the matrix $B$ by circulant matrices of
size $Z_2 \times Z_2$. This replacement is realized by a circulant PEG
algorithm [5] that again aims at reducing the number of short
cycles in the final parity-check matrix $H$.
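The mechanics of the second lifting can be sketched in a few lines of Python. This is only an illustration of the block structure: the shifts are drawn at random here, whereas the paper selects them with a circulant PEG algorithm [5] to limit short cycles. The function name `lift_qc` and the toy base matrix are assumptions.

```python
import random


def lift_qc(B, Z2, seed=0):
    """Expand a 0/1 base matrix B into a quasi-cyclic matrix H.

    Each non-zero entry of B becomes a Z2 x Z2 circulant permutation
    matrix (identity cyclically shifted by s columns); each zero entry
    becomes a Z2 x Z2 all-zero block. Shifts are random in this sketch,
    which stands in for the circulant PEG selection of the paper.
    """
    rng = random.Random(seed)
    MB, NB = len(B), len(B[0])
    H = [[0] * (NB * Z2) for _ in range(MB * Z2)]
    shifts = {}
    for i in range(MB):
        for j in range(NB):
            if B[i][j]:
                s = rng.randrange(Z2)
                shifts[(i, j)] = s
                for r in range(Z2):
                    # Row r of the block has its single 1 in column (r + s) mod Z2.
                    H[i * Z2 + r][j * Z2 + (r + s) % Z2] = 1
    return H, shifts


# Hypothetical 2 x 3 base matrix lifted with Z2 = 4.
B_toy = [[1, 0, 1],
         [1, 1, 0]]
H_qc, shifts = lift_qc(B_toy, Z2=4)
```

Because every block is a permutation matrix, each row and column of the lifted matrix inherits the weight of the corresponding base-matrix row or column, so the degree distribution is preserved.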
A. Review of state-of-the-art architectures
Most architectures in the literature targeted at QC codes
perform the processing row-wise and implement the well-
known Offset Min-Sum (OMS) algorithm. Two main ap-
proaches allow combining the use of a strict row-layered
message schedule with parallel computations. The first consists
in implementing one [8], [9] or two [10] small processing units
that in each clock cycle accept as input the messages from
one B-VN to one B-CN and output the messages from one
B-CN to one B-VN. The processing of one row layer (i.e. all
messages to/from one B-CN) then requires at least $d_c/U$ clock
cycles, where $U$ is the number of processing units. Because of
the relatively large number of cycles required per layer, it is
possible to order the computations in such a way that a deep
pipeline can be used while keeping the number of stall cycles
at a minimum. A second approach consists in implementing
large processing units that process one row layer per clock
cycle. In general, such an architecture cannot use pipelining
because the layered message schedule requires the processing
of the current layer to be completed before the next layer can
start. Exceptionally, if the parity-check matrix is designed to
ensure that consecutive layers never share a variable node, then
a two-stage pipeline can be used [11].
B. The Δ-update architecture
Our proposed architecture is similar to the second approach
described above. However, unlike the solution of [11], our
architecture admits the use of a moderately deep pipeline
by introducing the possibility of ignoring some of the data
dependencies of the layered schedule. The architecture, shown
in Fig. 1, can be split into two parts. The top part is com-
posed of memory management units that store the belief¹
sums associated with each variable node. The bottom part is
composed of at least $Z_2$ processing units called CNPEs, each
one responsible for evaluating all the messages sent to and
from a particular check node.
¹ We call belief a log-likelihood ratio scaled by a constant.
Typically, the processing units of a parallel row-layered
architecture would take as input a vector $\Lambda$ of VN belief
sums, and output an updated vector $\Lambda'$. The first novelty of
the proposed architecture is that the processing units, rather
than generating updated VN sums, compute the difference
$\Delta = \Lambda' - \Lambda$. Once the processing completes, this difference
is used to update the VN sums. As a result, the architecture
seamlessly supports any kind of message schedule. If a par-
ticular B-VN is involved in multiple concurrent check-node
computations, its message-update schedule is simply altered,
while other B-VNs can still benefit from sequential updates.
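The following Python sketch illustrates why writing back differences keeps concurrent updates consistent. It assumes a textbook Offset Min-Sum check-node update with stored check-to-variable messages `R_old`; the function name, the message memory, and the toy values are assumptions and do not describe the internal design of the CNPE.

```python
def oms_cn_deltas(Lambda, R_old, offset=1):
    """Offset Min-Sum update of one check node, returned as differences.

    Lambda: belief sums read for the VNs of this CN (possibly stale).
    R_old:  previous CN-to-VN messages of this CN.
    Returns (delta, R_new) where delta[k] = Lambda'[k] - Lambda[k],
    which equals R_new[k] - R_old[k].
    """
    # VN-to-CN messages: remove this CN's previous contribution.
    v2c = [l - r for l, r in zip(Lambda, R_old)]
    R_new = []
    for k in range(len(v2c)):
        others = v2c[:k] + v2c[k + 1:]
        sign = 1
        for x in others:
            if x < 0:
                sign = -sign
        mag = max(min(abs(x) for x in others) - offset, 0)
        R_new.append(sign * mag)
    delta = [rn - ro for rn, ro in zip(R_new, R_old)]
    return delta, R_new


# Two CNs in flight at the same time share VNs 0 and 2 (toy values).
Lam = [6, -4, 5, 8]
cn_a, cn_b = [0, 1, 2], [0, 2, 3]
snapshot = list(Lam)  # both CNs read the same, not yet updated, sums
d_a, _ = oms_cn_deltas([snapshot[v] for v in cn_a], [0, 0, 0])
d_b, _ = oms_cn_deltas([snapshot[v] for v in cn_b], [0, 0, 0])
# Write-back adds each difference, so neither contribution is lost.
for v, d in zip(cn_a, d_a):
    Lam[v] += d
for v, d in zip(cn_b, d_b):
    Lam[v] += d
```

Had each check node written back a full updated vector instead, the second write to VNs 0 and 2 would have overwritten the first. With differences, both contributions accumulate, and only the message-update schedule of the shared VNs changes.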
Compared to a standard row-layered architecture, this mod-
ified architecture has one minor drawback. Most state-of-the-
art architectures only require one shifting unit per check-node
input, by allowing the position of each VN in memory to
change throughout the decoding operation. In this architec-
ture, since a particular B-VN might be involved in multiple
concurrent check-node computations, the position of VNs in
memory must remain fixed, and a second write-side shifting
unit is required, as shown in Fig. 1. Note that this shifting unit
is smaller than the read-side one, since it routes $\Delta$ vectors
that require fewer bits per element than $\Lambda$ vectors. Also, the
additional delay introduced by this shifting unit is not a major
concern since the Δ-update architecture enables the use of a
deeper pipeline.
C. Memory access
At each cycle, the processing units must access the belief
sums associated with $d_c$ B-VNs, where $d_c$ is the degree of
the B-CN currently being processed. Since the architecture
is intended to support any quasi-cyclic code, it must support
parallel data access to a $\Lambda$ vector corresponding to any B-VN
subset of size $d_{c,\max}$. To avoid requiring the costly routing
logic that would be necessary to select such arbitrary subsets,
we propose to group all B-VNs into $K \geq d_{c,\max}$ memory
banks such that no two B-VNs placed in the same bank need
to be accessed simultaneously. We then design the CNPE units
so they accommodate up to $K$ inputs. Since the computation
involves finding minimum values, unused inputs can easily be
disabled by setting their value to the maximum representable
value. With this strategy, the complexity of the architecture
depends on K. In the following section, we propose a B-VN
grouping method that minimizes K.
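A minimal sketch of the bank-based access, assuming a hypothetical mapping `bank_of` from B-VN to memory bank and a 6-bit saturation value of 31: each CNPE input slot is fed by one bank, and slots with no neighbor in that bank are filled with the maximum value so they never win the minimum search.

```python
Q_MAX = 31  # assumed largest representable magnitude (6-bit beliefs)


def bank_inputs(cn_neighbors, bank_of, beliefs, K):
    """Build the K-wide input vector of a CNPE.

    Slot k holds the belief of the neighbor stored in bank k, or Q_MAX
    when the check node has no neighbor in bank k (disabled input).
    """
    slots = [Q_MAX] * K
    used = [False] * K
    for v in cn_neighbors:
        k = bank_of[v]
        if used[k]:
            # Would violate the grouping rule: one access per bank.
            raise ValueError("two neighbors stored in the same bank")
        used[k] = True
        slots[k] = beliefs[v]
    return slots


# Toy example: 4 B-VNs spread over K = 3 banks (hypothetical values).
bank_of = {0: 0, 1: 1, 2: 2, 3: 0}
beliefs = {0: 5, 1: -7, 2: 12, 3: -2}
slots = bank_inputs([1, 2], bank_of, beliefs, K=3)
```

The padding value is positive, so it changes neither the minimum magnitude nor the sign parity seen by the check node.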
A. Memory layout optimization
To optimize the proposed architecture, we would like to
group into the same memory bank only B-VNs that do not
share any B-CNs as neighbors. More formally, consider $K$
memory banks and denote by $\mathcal{M}_k$, $k \in \{1, \cdots, K\}$, the set
of B-VNs that are allocated to the $k$-th memory bank. For all
$k \in \{1, \cdots, K\}$, the set $\mathcal{M}_k$ is constructed such that for all
$v, v' \in \mathcal{M}_k$ such that $v \neq v'$,
$$\mathcal{C}_v \cap \mathcal{C}_{v'} = \emptyset. \tag{2}$$
This condition ensures that B-VNs allocated to the same
memory bank are never needed by the same B-CN, so that there
is no conflict in memory access. In order to dimension the
memory and to allocate each B-VN to a memory bank, we
want to partition the set of B-VNs into $K$ sets $\mathcal{M}_k$ that satisfy
condition (2). This partitioning problem can be solved as a
graph coloring problem applied on a VN-only graph. The VN-
only graph contains all the B-VNs as vertices, and there is an
edge between two B-VNs if they are connected to at least one
common B-CN. A standard graph coloring algorithm [12] is
then applied on the VN-only graph in order to construct the
sets $\mathcal{M}_k$.
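The partitioning can be sketched as follows. Note that this uses a simple largest-degree-first greedy coloring as a stand-in, not the algorithm of [12]; the function name `color_b_vns` and the toy graph are assumptions.

```python
def color_b_vns(V_c, num_vns):
    """Greedy coloring of the VN-only graph.

    V_c[c] is the set of B-VNs connected to B-CN c. Two B-VNs are
    adjacent if they share at least one B-CN. Returns one color index
    (memory bank) per B-VN, zero-based.
    """
    adj = [set() for _ in range(num_vns)]
    for vs in V_c:
        for v in vs:
            adj[v] |= vs - {v}
    # Color the most constrained vertices first.
    order = sorted(range(num_vns), key=lambda v: -len(adj[v]))
    color = [None] * num_vns
    for v in order:
        taken = {color[u] for u in adj[v] if color[u] is not None}
        k = 0
        while k in taken:
            k += 1
        color[v] = k
    return color


# Toy base graph: 3 B-CNs of degree 3 over 6 B-VNs (hypothetical).
V_c_toy = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
colors = color_b_vns(V_c_toy, num_vns=6)
num_banks = len(set(colors))
```

On this toy graph the greedy pass happens to reach $K = d_{c,\max} = 3$ banks; in general a greedy coloring gives no such guarantee.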
The graph coloring algorithm aims to partition the graph
into the minimum possible number of colors. However, with
the above approach, this minimum number is determined by
the structure of the Tanner graph and some Tanner graphs may
not allow for a small number of colors. This is why we would
like to minimize the number of colors directly during the code
construction. For this, we propose a modified PEG algorithm
which we now describe.
B. Modified PEG algorithm
Our modified PEG algorithm replaces the standard PEG
algorithm used for the first lifting in the code construction
of Section II. This first lifting constructs the base matrix B
from a given protograph S. In this section, for simplicity “VN”
refers to “B-VN” and “CN” refers to “B-CN”.
The proposed algorithm takes the maximum number of
colors $K \geq d_{c,\max}$ as input, which gives a set of colors
$\{1, 2, \cdots, K\}$. Each CN $c$ maintains a list of colors $\mathcal{L}_c$
containing the colors of its VN neighbors. Each VN $v$ also
maintains a list $\mathcal{L}_v$ of the colors of all the VNs with which it
shares a common CN. At the beginning of the algorithm, all
the lists of colors $\mathcal{L}_c$ and $\mathcal{L}_v$ are initialized to $\emptyset$.
When our modified algorithm needs to add new edges to
the Tanner graph, it first selects a VN $v$ at random in $\mathcal{V}$,
starting with VNs of highest degrees. Once a VN is selected,
the algorithm chooses all its connections in succession instead
of just one at random as in the standard PEG. This will allow
the algorithm to assign a color to VN $v$ once all its connections
are established. When $v$ is selected, its list of colors is given
by $\mathcal{L}_v = \emptyset$, since it has no connection yet with any CN. For
every edge it wants to assign, the algorithm computes all the
distances $d(v, c)$ between this VN and all the CNs $c \in \mathcal{C}$,
where $d(v, c)$ is the length of the shortest path between $v$ and
$c$. If there is no path between $v$ and $c$, then $d(v, c) = +\infty$.
In order to add one edge, the algorithm verifies the following
saturation and color conditions.
1) Saturation condition: the algorithm first constitutes a set
that contains, for all $j \in \{1, \cdots, M_S\}$, all CNs of type
$j$ such that VN $v$ has strictly fewer than $S_{j,i}$ connections
with CNs of type $j$, where $i$ is the type of $v$. From this set,
it constitutes $\mathcal{S}$ by retaining for all $j \in \{1, \cdots, M_S\}$ only
the CNs of type $j$ that have strictly fewer than $S_{j,i}$
connections with VNs of type $i$.
2) Color condition: for the current VN $v$, the algorithm
computes, for each CN $c \in \mathcal{S}$, the union of its list of
colors $\mathcal{L}_v$ and the list of colors $\mathcal{L}_c$. The set $\mathcal{D}$ is then
composed of the CNs that satisfy the color condition,
i.e. for which the size of the union is strictly lower than
the maximum number of colors.
At this step, if $\mathcal{D} = \emptyset$, then the algorithm is restarted. If
after a given number of restarts, the algorithm is not able to
construct the code, the maximum number of colors must be
increased. If $\mathcal{D} \neq \emptyset$, the algorithm selects at random a CN $\hat{c}$
among the CNs of $\mathcal{D}$ that have maximum distance to
VN $v$. To finish, it adds an edge between $v$ and $\hat{c}$,
and it updates the list of colors $\mathcal{L}_v$ of $v$ as $\mathcal{L}_v = \mathcal{L}_v \cup \mathcal{L}_{\hat{c}}$.
Once it has added a new edge, the algorithm moves to the next
one, until all the connections of VN $v$ have been assigned. It
then attributes a color $f_v$ to VN $v$. This color is selected at
random over the set $\{1, 2, \cdots, K\} \setminus \mathcal{L}_v$. The color condition
guarantees that this set is not empty. The algorithm also
updates the lists of colors $\mathcal{L}_c$ of all the CNs $c \in \mathcal{C}_v$ as
$\mathcal{L}_c = \mathcal{L}_c \cup \{f_v\}$. The algorithm may also update all the lists
of colors of all the VNs that are connected to CNs $c \in \mathcal{C}_v$, but
this is not useful since the edges of these VNs have already
been assigned by the algorithm.
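The color bookkeeping for one VN can be sketched as below. This is a simplified illustration, not the full modified PEG: the sets that passed the saturation condition and the distances $d(v, c)$ are assumed to be given, the distances are not recomputed after each added edge, and already chosen CNs are excluded explicitly to avoid parallel edges. The function name `place_vn` and its arguments are hypothetical.

```python
import random


def place_vn(S_per_edge, dist, L_c, K, rng):
    """Choose the CNs and the color of one VN.

    S_per_edge: for each edge to add, the CNs that passed the
                saturation condition (assumed precomputed).
    dist:       dict CN -> d(v, c), float('inf') if unreachable
                (kept fixed here, a simplification).
    L_c:        dict CN -> set of colors of its VN neighbors,
                updated in place once the VN is colored.
    Returns (chosen_cns, color), or None when D is empty for some
    edge, in which case the construction would be restarted.
    """
    L_v, chosen = set(), []
    for S in S_per_edge:
        # Color condition: a free color must remain after the union.
        D = [c for c in S if c not in chosen and len(L_v | L_c[c]) < K]
        if not D:
            return None
        d_max = max(dist[c] for c in D)
        c_hat = rng.choice([c for c in D if dist[c] == d_max])
        chosen.append(c_hat)
        L_v |= L_c[c_hat]
    # Pick a color that none of the VNs sharing a CN with v uses.
    color = rng.choice(sorted(set(range(1, K + 1)) - L_v))
    for c in chosen:
        L_c[c].add(color)
    return chosen, color


# Toy state with K = 3 colors and four candidate CNs (hypothetical).
L_c_toy = {0: {1, 2}, 1: {1}, 2: set(), 3: {1, 2, 3}}
dist_toy = {0: float("inf"), 1: 3, 2: float("inf"), 3: float("inf")}
result = place_vn([[0, 1, 2, 3], [0, 1, 2, 3]], dist_toy, L_c_toy,
                  K=3, rng=random.Random(0))
```

In this toy state CN 3 is rejected by the color condition because its neighbors already use all three colors, and the VN ends up with the only color left unused by its two chosen CNs.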
When adding a new edge, our algorithm must verify the
color condition, which is an additional condition compared to
the standard PEG. In the simulation results section, we discuss
the influence of this condition on the code performance.
C. Message schedule optimization
Since the decoder architecture is pipelined and processes
one B-CN per cycle, $T$ B-CNs are processed concurrently,
where $T$ is the number of pipeline stages (we assume that
$T \leq M_B$). For a pair of B-CNs present at the same time
in the pipeline, some data dependencies of the sequential
message-update schedule will be ignored for any B-VN that
is connected to both B-CNs. To speed up the convergence of
the decoder, we wish to optimize the order in which the B-
CNs are processed to minimize the number of such ignored
dependencies.
Let us define a weight $w_{i,j}$ that represents the number of
dependencies between B-CNs $c_i$ and $c_j$, i.e., for $i \neq j$,
$w_{i,j} = |\mathcal{V}_{c_i} \cap \mathcal{V}_{c_j}|$. We wish to find an ordering of the B-CNs that
minimizes
$$\sum_{i=1}^{M_B} \sum_{d=1}^{T-1} w_{i,i_d}, \tag{3}$$
where $i_d = ((i - 1 + d) \bmod M_B) + 1$.
Since the number of base-row permutations $M_B!$ is usually
too large to be explored exhaustively, we rely on the following
randomized greedy algorithm. This algorithm takes as input
the set of B-CNs $\mathcal{C}$, and iteratively outputs an ordering $\sigma(t)$,
$t \in \{1, 2, \cdots, M_B\}$. As the algorithm iterates, it keeps track
of the content of the processing pipeline as a vector $P$, which
contains up to $T - 1$ indices.
1) Initialization: The first element $\sigma(1)$ is chosen randomly
from the set of B-CNs having the smallest total weight.
Formally, let $w_{c_i} = \sum_{j=1}^{M_B} w_{i,j}$. Then $\sigma(1)$ is chosen
randomly from the set $S_{\text{init}} = \{i : w_{c_i} = \min_{c \in \mathcal{C}} (w_c)\}$
and added as the first element of $P$.
2) Iteration $t > 1$: Subsequent B-CNs are chosen to
minimize their dependencies with other nodes in the
pipeline. We define $w_{i,P} = \sum_{j \in P} w_{i,j}$. The next ele-
ment $\sigma(t)$ is chosen randomly from the set
$S = \{i \in U : w_{i,P} = \min_{j \in U} (w_{j,P})\}$, where
$U = \{1, \cdots, M_B\} \setminus \{\sigma(1), \cdots, \sigma(t-1)\}$ is the set of unassigned indices.
3) Pipeline update: After each iteration, the new element
$\sigma(t)$ is added at the end of $P$. After this, if $P$ contains
more than $T - 1$ elements, $P(0)$ is discarded and all
other elements are moved to the next lower index.
This randomized algorithm can be invoked multiple times to
try to improve the global score given by (3).
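The three steps above can be sketched in Python as follows. The score function assumes that (3) sums the weights between each B-CN and the $T - 1$ B-CNs that follow it cyclically in the processing order; the function names, the zero-based indexing, and the toy graph are assumptions made for this example.

```python
import random


def weights(V_c):
    """w[i][j] = number of B-VNs shared by B-CNs i and j (0 on the diagonal)."""
    M = len(V_c)
    return [[len(V_c[i] & V_c[j]) if i != j else 0 for j in range(M)]
            for i in range(M)]


def score(order, w, T):
    """Ignored dependencies for a processing order (assumed form of (3))."""
    M = len(order)
    return sum(w[order[i]][order[(i + d) % M]]
               for i in range(M) for d in range(1, T))


def greedy_order(w, T, rng):
    """One run of the randomized greedy ordering."""
    M = len(w)
    total = [sum(row) for row in w]
    # Initialization: random B-CN among those with the smallest total weight.
    first = rng.choice([i for i in range(M) if total[i] == min(total)])
    order, P = [first], [first]
    unassigned = set(range(M)) - {first}
    while unassigned:
        # Weight of each candidate against the current pipeline content.
        w_P = {i: sum(w[i][j] for j in P) for i in unassigned}
        best = min(w_P.values())
        nxt = rng.choice(sorted(i for i in unassigned if w_P[i] == best))
        order.append(nxt)
        unassigned.remove(nxt)
        # Pipeline update: keep at most T - 1 indices.
        P.append(nxt)
        if len(P) > T - 1:
            P.pop(0)
    return order


def best_of(w, T, tries, seed=0):
    """Invoke the randomized algorithm several times and keep the best order."""
    rng = random.Random(seed)
    return min((greedy_order(w, T, rng) for _ in range(tries)),
               key=lambda o: score(o, w, T))


# Toy example: B-CNs 0/1 share a B-VN, and so do B-CNs 2/3.
V_c_toy = [{0, 1}, {1, 2}, {3, 4}, {4, 5}]
w_toy = weights(V_c_toy)
order = best_of(w_toy, T=2, tries=5)
```

With $T = 2$ the natural order 0, 1, 2, 3 ignores two dependencies on this toy graph, while any order that interleaves the two pairs ignores none.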
To evaluate the performance obtained using the proposed
QC codes and decoder architecture, we consider a binary-input
additive white Gaussian noise channel. The channel output is
given by $y = x + w$, where $x \in \{-1, 1\}$ and $w$ is a Gaussian
random variable with mean 0 and variance $\sigma^2$.
All codes were constructed from the same protograph,
which was optimized by differential evolution. In order to
increase the sparsity of the base matrices obtained from this
protograph, we set $M_S = 2$, $N_S = 4$, and we imposed a
maximum value of 3 for the coefficients $S_{i,j}$. The optimization
procedure yielded the protograph with $d_{c,\max} = 7$:
From this protograph, we applied the two-step lifting
introduced in Section II. We first used the modified PEG
algorithm introduced in Section IV with lifting factor $Z_1 = 36$
and three different maximum numbers of colors $K = 7, 8, 9$.
This provided three base matrices of size $72 \times 144$. We also
constructed a fourth base matrix of size $72 \times 144$ by applying
the standard PEG algorithm without any color restriction. In
the following, the codes obtained from $K = 7, 8, 9$ are called
$C_7$, $C_8$, $C_9$, respectively, and the code constructed without a
color restriction is called $C_{NR}$.
We evaluated that the base matrix of $C_7$ has girth 4, while
the three other base matrices have girth 6. This girth difference
can be explained by the fact that for $C_7$, $K = d_{c,\max} = 7$,
which places a difficult constraint on the code construction.
We further observed that the base matrices of $C_8$, $C_9$, and
$C_{NR}$ have approximately the same number of length-6 cycles.
Although the code performance does not only depend on
cycle distribution, this means that there is a good chance that
the final decoding performance of $C_8$, $C_9$, and $C_{NR}$ will be
similar. For the second lifting step, we considered $Z_2 = 18$
and we applied the standard circulant PEG algorithm to the
four base matrices in order to obtain QC matrices of size
$1296 \times 2592$. It is worth noting that all four obtained QC-
codes have girth 8.
Fig. 2. Performance comparison of the four constructed QC-codes (horizontal axis: SNR in dB, from 1.4 to 2.6; curves: $C_7$ with $K = 7$, $C_8$ with $K = 8$, $C_9$ with $K = 9$, and $C_{NR}$ with no color restriction).
The bit-error rate (BER) performance of the four codes is
obtained with an OMS decoder implemented according to the
architecture described in Section III. For each codeword bit,
the decoder takes as input a belief value $\mu = \alpha y/\sigma^2$, where
$\alpha$ is set to 4 and $\mu$ is quantized on 6 bits by rounding it
to the nearest integer and saturating it within $[-31, 31]$. The
maximum number of iterations is set to 25 and the OMS offset
parameter is set to 1. The constructed codes have base matrices
with a relatively low density (4.5% of non-zero elements).
As a result, it is in fact possible to use the algorithm of
Section IV-C to find a row ordering that is compatible with
a strict row-layered message schedule (i.e. for which (3) is
zero) up to a pipeline depth of $T = 5$. The BER results for
this case are shown in Figure 2. Each BER point was obtained
from 100 frames in error. We first observe that $C_7$ shows
degraded performance compared to the three other codes. This
result was expected since this code is the only one for which
the base matrix has girth 4. On the other hand, we observe
that $C_8$ and $C_9$ have similar performance. $C_8$ shows a slight
performance degradation in the error floor compared to $C_9$, but
it interestingly reduces the architecture memory requirements.
Surprisingly, $C_{NR}$ also shows a small performance degradation
compared to $C_9$. The modified PEG algorithm constructs the
edges in a different order than the standard PEG, which may
explain the performance improvement.
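The input quantization described at the start of this paragraph can be sketched as follows. The function name `quantize_belief` is an assumption, and Python's built-in `round` resolves exact ties to the nearest even integer, which the text does not specify.

```python
def quantize_belief(y, sigma2, alpha=4.0, bits=6):
    """Quantize the decoder input mu = alpha * y / sigma^2.

    The value is rounded to the nearest integer and saturated to the
    symmetric range [-(2^(bits-1) - 1), 2^(bits-1) - 1], that is
    [-31, 31] for 6 bits.
    """
    q_max = 2 ** (bits - 1) - 1
    mu = alpha * y / sigma2
    return max(-q_max, min(q_max, round(mu)))


# Example channel outputs with an assumed noise variance of 0.5.
q_small = quantize_belief(0.3, 0.5)    # 2.4 rounds to 2
q_mid = quantize_belief(1.0, 0.5)      # 8.0 stays 8
q_sat = quantize_belief(-9.0, 0.5)     # -72 saturates to -31
```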
The proposed architecture makes it possible to increase the pipeline
depth by ignoring some data dependencies of the row-layered
message schedule. To illustrate the impact of this approach, let
us assume that the pipeline is ideal, that is, it permits a clock
period of $\tau/T$, where $\tau$ is the clock period without pipelining.
We take code $C_8$ and consider increasing the pipeline depth to
$T = 20$. After optimizing the row ordering using the algorithm
of Section IV-C, we obtain the BER through Monte-Carlo
simulation with a limit of 25 iterations. We find
that this BER is approximately equal to the BER obtained
using a strict row-layered schedule with a limit of 20 iterations.
Therefore, under the ideal pipelining assumption, the deeper
pipeline combined with the use of a relaxed schedule decreases
latency by a factor of $(20/5) \cdot (20/25) = 3.2$.
The proposed approach can also be applied to existing
codes. For instance, we consider the rate-1/2 code defined in the
IEEE 802.11n (WiFi) standard, which has a base matrix den-
sity of 30%, and cannot be pipelined under a strict row-layered
schedule. We evaluated the BER performance of a pipelined
decoder with $T = 4$ stages, optimized B-CN ordering, and
a maximum of 25 iterations. We find that a decoder using a
strict schedule requires 20 iterations to achieve approximately
the same BER. Therefore, the proposed pipelined decoder also
reduces latency by a factor of $4 \cdot 20/25 = 3.2$ on this code.
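Both latency figures follow from the same arithmetic, written out below under the ideal pipelining assumption: latency is proportional to the iteration limit divided by the pipeline depth. The function name `latency_reduction` is an assumption; the numbers are those quoted in the text, with a depth of 1 standing for the unpipelined 802.11n baseline.

```python
from fractions import Fraction


def latency_reduction(T_relaxed, T_strict, iters_relaxed, iters_strict):
    """Latency ratio (strict / relaxed) under ideal pipelining.

    The clock period scales as 1/T, so latency is proportional to
    iterations / T for each decoder.
    """
    return Fraction(T_relaxed, T_strict) * Fraction(iters_strict, iters_relaxed)


# Code C8: depth 20 with 25 iterations versus depth 5 with 20 iterations.
gain_c8 = latency_reduction(20, 5, 25, 20)
# IEEE 802.11n rate-1/2 code: depth 4 with 25 iterations versus no pipeline.
gain_wifi = latency_reduction(4, 1, 25, 20)
```

The deeper pipeline alone would give a factor of 4 in both cases; the relaxed schedule gives back a factor of 25/20 in extra iterations, which leaves 16/5 = 3.2.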
This paper introduced a novel LDPC decoder architecture
that greatly reduces the decoding latency by carefully combin-
ing parallel processing and pipelining. It also proposed new
QC code constructions that further improve this throughput
and lower the memory requirements of the architecture. Future
work will be dedicated to the optimization of the code and de-
coder parameters for improved latency, decoding performance,
and energy consumption.
The authors were supported by the grant ANR-17-CE40-
0020 of the French National Research Agency ANR (project
[1] E. Sharon, S. Litsyn, and J. Goldberger, “Efficient serial message-passing
schedules for LDPC decoding,” IEEE Trans. on Information Theory,
vol. 53, no. 11, pp. 4076–4091, Nov 2007.
[2] J. Thorpe, “Low-density parity-check (LDPC) codes constructed from
protographs,” IPN progress report, vol. 42, no. 154, pp. 42–154, 2003.
[3] D. G. Mitchell, R. Smarandache, and D. J. Costello, “Quasi-cyclic
LDPC codes based on pre-lifted protographs,” IEEE Transactions on
Information Theory, vol. 60, no. 10, pp. 5856–5874, 2014.
[4] X.-Y. Hu, E. Eleftheriou, and D.-M. Arnold, “Regular and irregular pro-
gressive edge-growth Tanner graphs,” IEEE Transactions on Information
Theory, vol. 51, no. 1, pp. 386–398, 2005.
[5] J. Thorpe, K. Andrews, and S. Dolinar, “Methodologies for designing
LDPC codes using protographs and circulants,” in Intl. Symp. on
Information Theory (ISIT), 2004, p. 238.
[6] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of
capacity-approaching irregular low-density parity-check codes,” IEEE
Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.
[7] R. Storn and K. Price, “Differential evolution–a simple and efficient
heuristic for global optimization over continuous spaces,” Journal of
Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[8] C. Studer, N. Preyss, C. Roth, and A. Burg, “Configurable high-
throughput decoder architecture for quasi-cyclic LDPC codes,” in 2008
42nd Asilomar Conference on Signals, Systems and Computers, Oct
2008, pp. 1137–1142.
[9] C. Marchand, L. Conde-Canencia, and E. Boutillon, “Architecture and
finite precision optimization for layered LDPC decoders,” in 2010 IEEE
Workshop On Signal Processing Systems, Oct 2010, pp. 350–355.
[10] A. Balatsoukas-Stimming, N. Preyss, A. Cevrero, A. Burg, and C. Roth,
“A parallelized layered QC-LDPC decoder for IEEE 802.11ad,” in 11th
Intl. New Circuits and Systems Conf. (NEWCAS), June 2013, pp. 1–4.
[11] T. T. Nguyen-Ly, V. Savin, K. Le, D. Declercq, F. Ghaffari, and
O. Boncalo, “Analysis and design of cost-effective, high-throughput LDPC
decoders,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 26, no. 3, pp. 508–521, March 2018.
[12] F. T. Leighton, “A graph coloring algorithm for large scheduling prob-
lems,” Journal of research of the national bureau of standards, vol. 84,
no. 6, pp. 489–506, 1979.
... In this paper, we consider a quantized offset Min-Sum decoder [6], [8], [9], implemented with the architecture proposed in [16]. For simplicity, no pipeline stages are considered, which corresponds to a row-layered scheduling. ...
... where α is a scaling parameter. In the architecture proposed in [16], the VN messages β ...
... In addition, the DE equations were derived by considering that the memory faults are introducing after computation of the check-to-variable messages β (ℓ) i→j , as in [8]. This slightly differs from the hardware decoder of [16] described in Section II, where the faults are introduced when the VN messages β (ℓ) i are read. However, we observed through simulations a negligible difference on the obtained error probabilities. ...
Full-text available
The objective of this paper is to minimize the energy consumption of a quantized Min-Sum LDPC decoder, by considering aggressive voltage downscaling of the decoder circuit. Since low power supply may introduce faults in the memories used by the decoder architecture, this paper proposes to optimize the energy consumption of the faulty Min-Sum decoder while satisfying a given performance criterion. The proposed optimization method relies on a coordinate descent algorithm that optimizes code and decoder parameters which have a strong influence on the decoder energy consumption: codeword length, number of quantization bits, and failure probability of the memories. Optimal parameter values are provided for several codes defined by their protographs, and significant energy gains are observed compared to non-optimized setups.
... In this work, we consider a quantized version of the Min-Sum decoder for which a specific architecture was developed in the EF-FECtive project [1] [57]. We then develop high-level energy models for the considered architecture, and we introduce methods to optimize the code and decoder parameters in order to minimize the decoder energy consumption under given performance constraints. ...
... In the above section, we have seen that the energy consumption depends on the considered decoder. Therefore, before providing the energy models, we first introduce the Min-sum architecture of [1] [57] which we consider in this work. This architecture uses a large number of processing units in parallel, and relies on datapath pipelining, which is a technique for increasing parallelism at a small cost in the circuit. ...
... In the architecture proposed in [1], messages β ( ) i calculated in variable nodes at iteration are given by: ...
Full-text available
There are different types of error correction codes (CCE), each of which gives different trade-offs interms of decoding performanceand energy consumption. We propose to deal with this problem for Low-Density Parity Check (LDPC) codes. In this work, we considered LDPC codes constructed from protographs together with a quantized Min-Sum decoder, for their good performance and efficient hardware implementation. We used a method based on Density Evolution to evaluate the finite-length performance of the decoder for a given protograph.Then, we introduced two models to estimate the energy consumption of the quantized Min-Sum decoder. From these models, we developed an optimization method in order to select protographs that minimize the decoder energy consumption while satisfying a given performance criterion. The proposed optimization method was based on a genetic algorithm called differential evolution. In the second part of the thesis, we considered a faulty LDPC decoder, and we assumed that the circuit introduces some faults in the memory units used by the decoder. We then updated the memory energy model so as to take into account the noise in the decoder. Therefore, we proposed an alternate method in order to optimize the model parameters so as to minimize the decoder energy consumption for a given protograph.
... However, [6] considers hard-decision Gallager B decoders with poor performance, and [4] considers infinite precision sum-product decoding algorithms which cannot be implemented directly on hardware. To circumvent these issues, in [5], we proposed two models to evaluate the energy consumption of a more practical quantized Min-Sum decoder [7]. The first model evaluates the energy consumption from the number of operations realized in the decoder, and the second model counts the number of memory writes in the decoder. ...
... In this paper, we also study the quantized Min-Sum decoder of [7], and consider Quasi-Cyclic (QC) codes constructed from protographs, for their easy hardware implementation. We rely on the two energy models introduced in [5]. ...
... In this paper, we consider the Min-Sum decoder implementation proposed in [7], as this implementation allows for a high degree of parallelism and reduced decoding latency. In the decoder, messages are quantized on q bits and take values between −Q and +Q, where Q = 2^(q−1) − 1. ...
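The symmetric quantization range described in this excerpt can be illustrated with a minimal sketch. The function names `quant_max` and `saturate` are hypothetical, and rounding to the nearest integer before clipping is an assumption made here for illustration; the excerpt only states the range −Q to +Q with Q = 2^(q−1) − 1.

```python
def quant_max(q: int) -> int:
    """Largest magnitude representable on q bits: Q = 2**(q-1) - 1."""
    if q < 2:
        raise ValueError("q must be at least 2 to represent a signed value")
    return 2 ** (q - 1) - 1


def saturate(value: float, q: int) -> int:
    """Map a real-valued message to an integer in [-Q, +Q].

    Assumption (not from the source): round to the nearest integer,
    then clip symmetrically so the range is [-Q, +Q].
    """
    big_q = quant_max(q)
    rounded = int(round(value))
    return max(-big_q, min(big_q, rounded))


if __name__ == "__main__":
    # With q = 4 bits, Q = 7, so large inputs are clipped to +/-7.
    print(quant_max(4), saturate(100.0, 4), saturate(-100.0, 4))
```

The range is symmetric around zero, so one of the 2^q possible codewords is left unused; this keeps negation of a message exact.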
This paper considers protograph-based LDPC codes, and proposes an optimization method to select protographs that minimize the energy consumption of quantized min-sum decoders. This method first estimates the average number of iterations required by the decoder, and includes this estimate into two high-level models that evaluate the decoder energy consumption. The optimization problem is then formulated as minimizing the energy consumption of the decoder while satisfying a performance criterion on the frame error rate. Finally, an optimization algorithm based on differential evolution is introduced. A protograph optimized for energy consumption shows an energy gain of approximately 15% compared with a baseline protograph optimized for performance only.
Conference Paper
Critical communication requirements included in Factory Automation applications are complex to implement due to the difficulties encountered in guaranteeing high reliability and ultra-low latencies at the same time. In this work-in-progress, a technical solution for the physical layer is proposed: the Quasi-Cyclic LDPC of the Progressive Edge Growth family (QC-PEG-LDPC). This coding scheme is considered as a promising candidate due to two main factors: the good decoding performance for short packet transmissions and the low latency that can be obtained by using full parallel decoding architectures. The obtained results are compared with the 5G New Radio coding scheme, which includes LDPCs as part of the solution for Ultra Reliable Low Latency (URLLC) use cases. In these first results, QC-PEG-LDPC shows a performance improvement of 1 dB when compared with the 5G LDPC codes for a message length of 128 bits. Latency analysis indicates that QC-PEG-LDPC could allow decoding latencies of 0.13 μs provided that the full parallel decoding architecture is enabled.
This paper introduces a new approach to cost-effective, high-throughput hardware designs for Low Density Parity Check (LDPC) decoders. The proposed approach, called Non-Surjective Finite Alphabet Iterative Decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different trade-offs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post synthesis implementation results on 65nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
Quasi-cyclic low-density parity-check (QC-LDPC) codes based on protographs are of great interest to code designers because analysis and implementation are facilitated by the protograph structure and the use of circulant permutation matrices for protograph lifting. However, these restrictions impose undesirable fixed upper limits on important code parameters, such as minimum distance and girth. In this paper, we consider an approach to constructing QC-LDPC codes that uses a two-step lifting procedure based on a protograph, and, by following this method instead of the usual one-step procedure, we obtain improved minimum distance and girth properties. We also present two new design rules for constructing good QC-LDPC codes using this two-step lifting procedure, and in each case we obtain a significant increase in minimum distance and achieve a certain guaranteed girth compared to one-step circulant-based liftings. The expected performance improvement is verified by simulation results.
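As a point of reference for the lifting procedure mentioned in this abstract, the following is a minimal sketch of the usual one-step lifting, in which each entry of a base matrix is replaced by a circulant permutation matrix or by an all-zero block. It does not implement the two-step procedure of the cited paper. The function names, the convention that a shift of −1 marks an empty block, and the example shift values are all assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[int]]


def circulant_permutation(size: int, shift: int) -> Matrix:
    """size x size identity matrix with its columns cyclically shifted by `shift`."""
    return [
        [1 if (row + shift) % size == col else 0 for col in range(size)]
        for row in range(size)
    ]


def lift(shifts: Matrix, size: int) -> Matrix:
    """Expand a matrix of shift values into a binary parity-check matrix.

    Assumed convention: an entry s >= 0 becomes a circulant permutation
    with shift s, and an entry of -1 becomes an all-zero block.
    """
    lifted: Matrix = []
    for shift_row in shifts:
        blocks = [
            circulant_permutation(size, s) if s >= 0
            else [[0] * size for _ in range(size)]
            for s in shift_row
        ]
        # Concatenate the blocks of this block-row, one lifted row at a time.
        for r in range(size):
            lifted.append([bit for block in blocks for bit in block[r]])
    return lifted


if __name__ == "__main__":
    # Hypothetical 2 x 2 shift matrix lifted with 3 x 3 circulants.
    for row in lift([[0, 1], [-1, 2]], 3):
        print(row)
```

Because every non-empty block is a permutation matrix, each lifted row has as many ones as its block-row has non-empty entries, which is what preserves the degree structure of the protograph.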
Conference Paper
We present a doubly parallelized layered quasi-cyclic low-density parity-check decoder for the emerging IEEE 802.11ad multigigabit wireless standard. The decoding algorithm is equivalent to a non-parallelized layered decoder and, thus, retains its favorable convergence characteristics, which are known to be superior to those of flooding schedule based decoders. The proposed architecture was synthesized using a TSMC 40 nm CMOS technology, resulting in a cell area of 0.18 mm² and a clock frequency of 850 MHz. At this clock frequency, the decoder achieves a coded throughput of 3.12 Gbps, thus meeting the throughput requirements when using both the mandatory BPSK modulation and the optional QPSK modulation.
Layered decoding is known to provide efficient and high-throughput implementation of LDPC decoders. However, two main issues affect performance and area of practical implementations: quantization and memory. Quantization can strongly degrade performance and memory area can constitute up to 70% of the total area of the decoder implementation. This is the case of the DVB-S2, -T2 and -C2 decoders when considering long frames. This paper is then dedicated to the optimization of these decoders. We first focus on the reduction of the number of quantization bits and propose solutions based on the efficient saturation of the channel values, the extrinsic messages and the a posteriori probabilities (APP). We reduce from 6 to 5 the number of quantization bits for the channel and the extrinsic messages and from 8 to 6 the APPs, without introducing any performance loss. We then consider the optimization of the size of the extrinsic memory considering a multiple code rates decoder. The paper finally presents an optimized fixed-point architecture of a DVB-S2 layered decoder and its implementation on an FPGA device.
A new heuristic approach for minimizing possibly nonlinear and non-differentiable continuous space functions is presented. By means of an extensive testbed it is demonstrated that the new method converges faster and with more certainty than many other acclaimed global optimization methods. The new method requires few control variables, is robust, easy to use, and lends itself very well to parallel computation.
Conference Paper
We describe a fully reconfigurable low-density parity check (LDPC) decoder for quasi-cyclic (QC) codes. The proposed hardware architecture is able to decode virtually any QC-LDPC code that fits into the allocated memories while achieving high decoding throughput. Our VLSI implementation has been optimized for the IEEE 802.11n standard and achieves a throughput of 780 Mbit/s with a core area of 3.39 mm² in 0.18 μm CMOS technology.
A new graph coloring algorithm is presented and compared to a wide variety of known algorithms. The algorithm is shown to exhibit O(n²) time behavior for most sparse graphs and is found to be particularly well suited for use with large-scale scheduling problems. In addition, a procedure for generating large random test graphs with known chromatic number is presented and is used to evaluate heuristically the capabilities of the algorithms discussed.
We introduce a new class of low-density parity-check (LDPC) codes constructed from a template called a protograph. The protograph serves as a blueprint for constructing LDPC codes of arbitrary size whose performance can be predicted by analyzing the protograph. We apply standard density evolution techniques to predict the performance of large protograph codes. Finally, we use a randomized search algorithm to find good protographs.
Conference Paper
A method is presented for constructing LDPC codes with excellent performance, simple hardware implementation, low encoder complexity, and which can be concisely documented. The simple code structure is achieved by using a base graph, expanded with circulants. The base graph is chosen by computer search using simulated annealing, driven by density evolution's decoding threshold as determined by the reciprocal channel approximation. To build a full parity check matrix, each edge of the base graph is replaced by a circulant permutation, chosen to maximize loop length by using a Viterbi-like algorithm.