Conference Paper

Transport Triggered Polar Decoders


... This is why AFF3CT comes with a large database of pre-simulated performance curves with all the required parameters. Some research projects have been using AFF3CT as a reference [24][25][26][27][28][29][30]. All pre-computed simulation results are available at a glance on the online comparator, with corresponding command lines to reproduce them. ...
Article
AFF3CT is an open source toolbox dedicated to Forward Error Correction (FEC or channel coding). It supports a broad range of codes: from widespread turbo codes and Low-Density Parity-Check (LDPC) codes to more recent polar codes. The toolbox is written in C++ and can be used either as a simulator to quickly evaluate algorithm characteristics, or as a library in Software Defined Radio (SDR) systems or for other specific needs. Most of the decoding algorithm implementations aim at low latency and high throughput, targeting multiple Gb/s on modern CPUs. This is crucial in both simulation and SDR use cases: Monte Carlo simulations require high-performance implementations, as they commonly target the estimation of approximately $10^{12}$ bits, while implementations in real systems have to be very efficient to be competitive against dedicated hardware ones. Finally, AFF3CT emphasizes the reproducibility of state-of-the-art results by providing public references and open, modular source code. Keywords: Communication chain, Channel coding, Monte Carlo simulation, Forward error correction library, Digital modulation, Reproducible science, Multi-node, Multi-thread, Vectorization
... In other works, AFF3CT has been enriched to support new features. In [Léo+18b], the P-EDGE generator tool (see Section 2.5.3.2) has been modified to generate Transport Triggered Architecture (TTA ≈ VLIW) instructions, while in [TB20] a new LDPC code construction method is proposed and directly implemented in the AFF3CT simulator. In some cases AFF3CT is used as a library from which some sub-parts of the toolbox are reused or ... [Footnote 6: As AFF3CT is open-source, some of the previous works have been integrated inside the toolbox.] ...
Thesis
A software-defined radio is a radio communication system where components traditionally implemented in hardware are instead implemented in software. With the growing number of complex digital communication standards and the increasing power of general-purpose processors, it becomes attractive to trade the energy efficiency of dedicated architectures for the flexibility and reduced time to market of general-purpose processors. Even when a signal processing function is ultimately implemented on an application-specific integrated circuit, a software version of this processing is necessary to evaluate and verify its functional correctness. This is generally the role of simulation. Simulations are often expensive in terms of computation time: evaluating the global performance of a communication system can take from a few days to a few weeks. In this context, this thesis studies the most time-consuming algorithms in today's digital communication chains. These algorithms are often the channel decoders located in the receivers. The role of channel coding is to improve the error resilience of the system, since errors can occur at the channel level during transmission between the transmitter and the receiver. Three main channel coding families are then presented: LDPC codes, polar codes and turbo codes. These three code families are used in most current digital communication standards, such as Wi-Fi, Ethernet, 3G, 4G and 5G mobile networks, digital television, etc. The resulting decoders offer the best compromise between error resistance and decoding speed known to date. Each of these families comes with specific decoding algorithms. One of the main challenges of this thesis is to propose optimized software implementations for each of them. Specific efficient implementations are proposed, as well as more general optimization strategies.
The idea is to extract the generic optimization strategies from a representative subset of decoders. The last part of the thesis focuses on the implementation of a complete digital communication system in software. Thanks to the efficient decoding implementations proposed earlier, a full transceiver compatible with the DVB-S2 standard is implemented. This standard is typically used for broadcasting multimedia content via satellite. For this purpose, an embedded domain-specific language targeting software-defined radio is introduced. The main objective of this language is to take advantage of the parallel architecture of current general-purpose processors. The results show that the system achieves sufficient throughput to be deployed in real-world conditions. These contributions were made in a spirit of openness, sharing and reusability, resulting in an open source library named AFF3CT, for A Fast Forward Error Correction Toolbox. Thus, all the results proposed in this thesis can easily be reproduced and extended. This philosophy is detailed in a specific chapter of the thesis manuscript.
... The work presented in this chapter was published at the ISTC 2018 conference [83]. ...
Thesis
Polar codes are a recently invented class of error-correcting codes that has attracted interest from both researchers and industry, as shown by their selection for coding the control channels of the next generation of mobile telephony (5G). One of the challenges of future mobile networks is the virtualization of digital signal processing, and in particular of encoding and decoding algorithms. To improve network flexibility, these algorithms must be described in software and deployed on programmable architectures. Such a network infrastructure makes it possible to better distribute the computational effort across all nodes and to improve cooperation between cells. These techniques aim to reduce energy consumption, increase throughput, and decrease communication latency. The work presented in this manuscript addresses the software implementation of polar code decoding algorithms and the design of specialized programmable architectures for their execution. One of the main characteristics of a mobile communication chain is the instability of the communication channel. To cope with this instability, adaptive modulation and coding techniques are used in communication standards. These techniques imply that decoders must support a wide range of codes: they must be generic. The first contribution of this work is the software implementation of generic decoders for "List" decoding algorithms on general-purpose processors. In addition to being generic, the proposed decoders are also flexible: they allow trade-offs between error-correction capability, throughput, and decoding latency through fine-grained parameterization of the algorithms.
Moreover, the throughputs of the proposed decoders match state-of-the-art performance and, in some cases, exceed it. The second contribution of this work is a new high-performance programmable architecture specialized in polar code decoding. It belongs to the family of application-specific instruction-set processors. A low-power RISC processor forms its base. This base is then configured, its instruction set is extended, and dedicated hardware units are added to it. Simulations show that this architecture achieves throughputs and latencies close to those of state-of-the-art software implementations on general-purpose processors. Energy consumption is reduced by one order of magnitude: for successive cancellation decoding of a (1024,512) polar code, the energy required per decoded bit is on the order of 10 nJ on general-purpose processors versus 1 nJ on the proposed processors. The third contribution of this work is also an application-specific instruction-set processor architecture. It differs from the previous one in its use of an alternative design methodology. Instead of being based on a RISC architecture, the proposed processor belongs to the class of transport triggered architectures. It is characterized by greater modularity, which very significantly improves the efficiency of the processor. The measured throughputs are then higher than those obtained on general-purpose processors. Energy consumption is reduced to about 0.1 nJ per decoded bit for a (1024,512) polar code with the successive cancellation decoding algorithm. This corresponds to a reduction of two orders of magnitude compared with the consumption measured on general-purpose processors.
Conference Paper
Cloud Radio Access Network is foreseen as one of the key features of the future 5G mobile communication standard. In this context, all the baseband processing is intended to be performed on CPUs in order to keep a high level of flexibility. The challenge is then to propose efficient software implementations of baseband processing algorithms that guarantee a sufficient throughput while limiting the energy consumption. In this paper, as an alternative to general purpose processors, we propose an implementation of an Application Specific Instruction set Processor customized for the Successive Cancellation decoding of polar codes. The resulting software decoder achieves throughput similar to state-of-the-art ARM processor implementations while reducing the energy consumption by a factor of 10.
Poster
This demonstration intends to present AFF3CT (A Fast Forward 3rror Correction Tool). The main objective of AFF3CT is to provide a portable, open source, fast and flexible software to the channel coding community in such a way that researchers can spend more time on channel coding / algorithmic problems instead of software development issues. It is also intended to facilitate the process of hardware verification and debug with the objective of fast prototyping.
Article
Polar codes are a recently proposed class of block codes that provably achieve the capacity of various communication channels. They received a lot of attention as they can do so with low-complexity encoding and decoding algorithms, and they have an explicit construction. Their recent inclusion in a 5G communication standard will only spur more research. However, only a couple of ASICs featuring decoders for polar codes were fabricated, and none of them implements a list-based decoding algorithm. In this paper, we present ASIC measurement results for a fabricated 28 nm CMOS chip that implements two different decoders. The first decoder is tailored toward error-correction performance and flexibility. It supports any code rate as well as three different decoding algorithms: successive cancellation (SC), SC flip and SC list (SCL). The flexible decoder can also decode both non-systematic and systematic polar codes. The second decoder targets speed and energy efficiency. We present measurement results for the first silicon-proven SCL decoder, whose coded throughput is 306.8 Mbps with a latency of 3.34 µs and an energy per bit of 418.3 pJ/bit at a clock frequency of 721 MHz for a supply of 1.3 V. The energy per bit drops to 178.1 pJ/bit with a more modest clock frequency of 308 MHz, a lower throughput of 130.9 Mbps and a reduced supply voltage of 0.9 V. For the other two operating modes, the energy per bit is approximately 95 pJ/bit. The less flexible high-throughput unrolled decoder can achieve a coded throughput of 9.2 Gbps and a latency of 628 ns for a measured energy per bit of 1.15 pJ/bit at 451 MHz.
Chapter
Customized processors are an interesting option for implementing software defined radios; they bring benefits of tailored fixed function hardware while adding new advantages such as reduced implementation verification effort and increased post-fabrication flexibility. To reduce the engineering costs and the time-to-market of platforms with new computing devices, the processor customization process should be supported with automated design flows that include tools like retargeting compilers, instruction-set simulators, and RTL generators. This chapter presents an open source processor co-design toolset that is based on a computation resource oriented design methodology where the primary design choices are the set of resources to include in the processor at hand, instead of focusing on instruction encoding details. The toolset is based on a retargetable high-level language compiler and a scalable exposed datapath template which support different styles of parallelism available in applications. In addition to various published academic processor design examples for SDR algorithms, the tools have been used to design and program processors that have been implemented down to silicon layout level and integrated in commercial grade chips.
Article
We analyze interleaved concatenation schemes of polar codes with outer binary BCH codes and convolutional codes. We show that both BCH-polar and Conv-polar codes can have a frame error rate that decays exponentially with the code length for all rates up to capacity, which is a substantial improvement in the error exponent over stand-alone polar codes. Interleaved concatenation with long constraint length convolutional codes is an effective way to leverage the fact that polarization increases the cutoff rate of the channel. Simulation results show that Conv-polar codes when decoded with the proposed soft-output multistage iterative decoding algorithm can outperform stand-alone polar codes decoded with successive cancellation or belief propagation decoding. It may be comparable to stand-alone polar codes with list decoding in the high SNR regime. In addition to this, we show that the proposed concatenation scheme requires lower memory and decoding complexity in comparison to belief propagation and list decoding of polar codes. Practically, the scheme enables rate compatible outer codes which ease hardware implementation. Our results suggest that the proposed method may strike a better balance between performance and complexity compared to existing methods in the finite-length regime.
Article
Polar decoders are well suited for high-speed software implementations. In this work, we present a framework for generating fully-unrolled software polar decoders with branchless data flow. We discuss the memory layout of data in these decoders and show the optimization techniques used. At 335 Mbps, when decoding a (2048, 1707) polar code, the resulting decoder has more than twice the speed of the state of the art floating-point software polar decoder.
Article
Error Correction Code decoding algorithms for consumer products such as Internet of Things (IoT) devices are usually implemented as dedicated hardware circuits. As processors are becoming increasingly powerful and energy efficient, there is now a strong desire to perform this processing in software to reduce production costs and time to market. The recently introduced family of Successive Cancellation decoders for Polar codes has been shown in several research works to efficiently leverage the ubiquitous SIMD units in modern CPUs, while offering strong potential for a wide range of optimizations. The P-EDGE environment introduced in this paper combines a specialized skeleton generator and a library of building-block routines to provide a generic, extensible Polar code exploration workbench. It enables ECC code designers to easily experiment with combinations of existing and new optimizations, while delivering performance close to state-of-the-art decoders.
Conference Paper
Polar Codes can provably achieve the capacity of discrete memoryless channels. In order to make them practical, it is necessary to propose efficient hardware decoder architectures. In this paper, the first hardware decoder architecture implementing the Soft-output CANcellation (SCAN) decoding algorithm is presented. This decoder was implemented on Field Programmable Gate Array (FPGA) devices. The proposed architecture is parametrizable for any number of iterations without adding hardware complexity. The SCAN decoder architecture is compared to another soft-output decoder that implements a Belief Propagation (BP) algorithm. The SCAN decoder can reach a higher throughput than a BP decoder, with a lower memory footprint. Moreover, only one iteration with the SCAN algorithm leads to better decoding performance than 50 iterations of the BP algorithm.
Article
This paper presents an optimized software implementation of a Successive Cancellation (SC) decoder for polar codes. Despite the strong data dependencies in SC decoding, a highly parallel software polar decoder is devised for x86 processor targets. A high level of performance is achieved by exploiting the parallelism inherent in today's processor architectures (SIMD, multicore, etc.). Some optimizations that were originally intended for hardware implementations (memory reduction techniques and algorithmic simplifications) were also applied to enhance the throughput of the software implementation. Finally, some low-level optimizations such as explicit assembly description or data packing are used to improve the throughput even more. The resulting decoder description is implemented on different x86 processor targets. An analysis of the decoder in terms of latency and throughput is proposed. The influence of several parameters on the throughput and the latency is investigated: the selected target, the code rate, the code length, the SIMD mode (SSE/AVX), the multithreading mode, etc. The energy per decoded bit is also estimated. The proposed software decoder compares favorably with state-of-the-art software polar decoders. Extensive experiments demonstrate that the proposed software polar decoder exceeds 1 Gb/s for code lengths $N \le 2^{17}$ on a single core and reaches multi-Gb/s throughputs when using four cores in parallel in AVX mode.
Conference Paper
Turbo coding is commonly used in the current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed datapath processor design tailored for turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low level assembly programming, the turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations. Thus, the processor design can be used as a general-purpose OpenCL accelerator for arbitrary integer workloads as well. The proposed processor design was evaluated both by implementing it using a Xilinx Virtex 6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps Turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz / 1050 MHz (130 nm / 40 nm).
Article
Polar codes provably achieve the symmetric capacity of a memoryless channel while having an explicit construction. The adoption of polar codes however, has been hampered by the low throughput of their decoding algorithm. This work aims to increase the throughput of polar decoding hardware by an order of magnitude relative to successive-cancellation decoders and is more than 8 times faster than the current fastest polar decoder. We present an algorithm, architecture, and FPGA implementation of a flexible, gigabit-per-second polar decoder.
Article
Among modern error-correcting codes, the newly discovered polar codes are the first ones to provably achieve channel capacity with an explicit construction. While their error-correction performance at moderate block lengths is mediocre, polar decoders can be implemented with a throughput that scales well with length and rate. We present a software implementation of the RSM-SSC decoding algorithm that leverages the SIMD units available in most general-purpose CPUs. The throughput per kilohertz per core (TNDC) of this design is shown to be well suited for software-defined-radio applications. We also show that, for a similar error-correction performance, the TNDC of a systematic polar code surpasses that of state-of-the-art software LDPC decoders.
Conference Paper
A popular way to exploit high level programming languages in FPGA designs is to use a soft-core with accompanying software development tools. However, a common shortcoming with the current soft-core offerings is their limited software execution capability: the required performance for the implementation can be often reached only with instruction set extensions. In this paper, we propose and evaluate an application-specific processor design toolset that uses a multi-issue exposed data path processor architecture template. The main benefit of the architecture is scalability with respect to instruction-level parallelism (ILP). The design flow allows the designer to freely customize the data path resources in the core to exploit the ILP available in computation intensive kernels. The design toolset includes a retargetable C compiler and an architecture simulator, making design space exploration feasible. The experiments show that a relatively small soft-core tailored with the toolset provides significant speedups on software execution without using any instruction set extensions. The best measured speedup in comparison to the major commercial soft-cores was fourfold in applications from the CHStone benchmark suite, while the amount of consumed FPGA resources remained moderate.
Conference Paper
We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Conference Paper
Error Correction Code decoding algorithms for consumer products such as Internet of Things (IoT) devices are usually implemented as dedicated hardware circuits. As processors are becoming increasingly powerful and energy efficient, there is now a strong desire to perform this processing in software to reduce production costs and time to market. The recently introduced family of Successive Cancellation decoders for Polar codes has been shown in several research works to efficiently leverage the ubiquitous SIMD units in modern CPUs, while offering strong potential for a wide range of optimizations. The P-EDGE environment introduced in this paper combines a specialized skeleton generator and a library of building-block routines to provide a generic, extensible Polar code exploration workbench. It enables ECC code designers to easily experiment with combinations of existing and new optimizations, while delivering performance close to state-of-the-art decoders.
Article
The state-of-the-art soft-output decoder for polar codes is a message-passing algorithm based on belief propagation, which performs well at the cost of high processing and storage requirements. In this paper, we propose a low-complexity alternative for soft-output decoding of polar codes that offers better performance but with significantly reduced processing and storage requirements. In particular we show that the complexity of the proposed decoder is only 4% of the total complexity of the belief propagation decoder for a rate one-half polar code of dimension 4096 in the dicode channel, while achieving comparable error-rate performance. Furthermore, we show that the proposed decoder requires about 39% of the memory required by the belief propagation decoder for a block length of 32768.
Conference Paper
This paper presents the first ASIC implementation of a successive cancellation (SC) decoder for polar codes. The implemented ASIC relies on a semi-parallel architecture where processing resources are reused to achieve good hardware efficiency. A speculative decoding technique is employed to increase the throughput by 25% at the cost of very limited added complexity. The resulting architecture is implemented in a 180 nm technology. The fabricated chip can be clocked at 150 MHz and uses 183k gates. It was verified using an FPGA testing setup and provides a reference for the true silicon complexity of SC decoders for polar codes.
Article
A method is proposed, called channel polarization, to construct code sequences that achieve the symmetric capacity $I(W)$ of any given binary-input discrete memoryless channel (B-DMC) $W$. The symmetric capacity is the highest rate achievable subject to using the input letters of the channel with equal probability. Channel polarization refers to the fact that it is possible to synthesize, out of $N$ independent copies of a given B-DMC $W$, a second set of $N$ binary-input channels $\{W_N^{(i)} : 1 \le i \le N\}$ such that, as $N$ becomes large, the fraction of indices $i$ for which $I(W_N^{(i)})$ is near 1 approaches $I(W)$ and the fraction for which $I(W_N^{(i)})$ is near 0 approaches $1 - I(W)$. The polarized channels $\{W_N^{(i)}\}$ are well-conditioned for channel coding: one need only send data at rate 1 through those with capacity near 1 and at rate 0 through the remaining. Codes constructed on the basis of this idea are called polar codes. The paper proves that, given any B-DMC $W$ with $I(W) > 0$ and any target rate $R < I(W)$, there exists a sequence of polar codes $\{\mathfrak{C}_n; n \ge 1\}$ such that $\mathfrak{C}_n$ has block-length $N = 2^n$, rate $\ge R$, and probability of block error under successive cancellation decoding bounded as $P_e(N, R) \le O(N^{-1/4})$ independently of the code rate. This performance is achievable by encoders and decoders with complexity $O(N \log N)$ for each.
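The polarization statement in this abstract can be illustrated numerically. The Python sketch below tracks the special case of a binary erasure channel (BEC), which is an assumption chosen for convenience and is not mentioned in the abstract. For a BEC with erasure probability e, one polarization step yields two BECs with erasure probabilities 2e - e^2 and e^2, so the capacities 1 - e can be followed exactly. The function names and the threshold `delta` are hypothetical.

```python
def polarize_bec(eps, n):
    """Erasure probabilities of the 2**n channels synthesized from a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        # Each channel with erasure probability e splits into a worse one
        # (2e - e^2) and a better one (e^2).
        z = [v for e in z for v in (2 * e - e * e, e * e)]
    return z


def fractions(eps, n, delta=0.01):
    """Fractions of synthesized channels with capacity near 1 and near 0."""
    z = polarize_bec(eps, n)
    good = sum(1 for e in z if e < delta) / len(z)      # capacity 1 - e > 1 - delta
    bad = sum(1 for e in z if e > 1 - delta) / len(z)   # capacity 1 - e < delta
    return good, bad


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        good, bad = fractions(0.5, n)
        print(n, round(good, 3), round(bad, 3))
```

Calling `fractions(0.5, n)` for growing n reports the share of channels whose erasure probability lies within 0.01 of 0 or of 1. In the limit these shares approach I(W) = 0.5 and 1 - I(W) = 0.5, while the average erasure probability stays equal to the original one at every step.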
Article
A modification is introduced of the successive-cancellation decoder for polar codes, in which local decoders for rate-one constituent codes are simplified. This modification reduces the decoding latency and algorithmic complexity of the conventional decoder, while preserving the bit and block error rate. Significant latency and complexity reductions are achieved over a wide range of code rates.
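The rate-one simplification described here can be checked on a small example. The Python sketch below makes several assumptions that are not in the abstract: LLR-domain decoding with the min-sum `f` update, natural bit ordering, and hypothetical function names. It compares a conventional recursive successive-cancellation pass over a node with no frozen bits against a direct hard decision on the node's input LLRs.

```python
import random


def f(a, b):
    """Min-sum check-node update (an approximation of the exact rule)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, u):
    """Variable-node update given the partial-sum bit u."""
    return b - a if u else b + a


def hard(llr):
    """Hard decision: a non-negative LLR maps to bit 0."""
    return 0 if llr >= 0 else 1


def sc_rate_one(llr):
    """Conventional recursive SC on a node whose leaves are all information bits.

    Returns the estimated codeword at this node.
    """
    n = len(llr)
    if n == 1:
        return [hard(llr[0])]
    h = n // 2
    left = sc_rate_one([f(llr[i], llr[i + h]) for i in range(h)])
    right = sc_rate_one([g(llr[i], llr[i + h], left[i]) for i in range(h)])
    return [left[i] ^ right[i] for i in range(h)] + right


def simplified_rate_one(llr):
    """Simplified local decoder: threshold the input LLRs directly."""
    return [hard(x) for x in llr]


if __name__ == "__main__":
    random.seed(1)
    llr = [random.gauss(0.0, 2.0) for _ in range(16)]
    print(sc_rate_one(llr) == simplified_rate_one(llr))
```

Both functions return the same node codeword whenever no LLR is exactly zero, but the simplified version needs no recursion, which is where the latency saving comes from.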
The recently-discovered polar codes are widely seen as a major breakthrough in coding theory. These codes achieve the capacity of many important channels under successive cancellation decoding. Motivated by the rapid progress in the theory of polar codes, we propose a family of architectures for efficient hardware implementation of successive cancellation decoders. We show that such decoders can be implemented with O(n) processing elements and O(n) memory elements, while providing constant throughput. We also propose a technique for overlapping the decoding of several consecutive codewords, thereby achieving a significant speed-up factor. We furthermore show that successive cancellation decoding can be implemented in the logarithmic domain, thereby eliminating the multiplication and division operations and greatly reducing the complexity of each processing element.
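To make the log-domain formulation concrete, the following Python sketch shows a recursive successive-cancellation decoder operating on log-likelihood ratios. It is a minimal illustration under stated assumptions, not the architecture of the paper: the min-sum form of `f`, the natural (non bit-reversed) ordering, the function names, and the (8,4) frozen set used in the demo are all choices made here. Note that `f` and `g` use only comparisons, sign changes and additions.

```python
def f(a, b):
    """Min-sum check-node update: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, u):
    """Variable-node update: b + (1 - 2u) * a."""
    return b - a if u else b + a


def polar_encode(u):
    """Polar transform over GF(2) in natural order."""
    n = len(u)
    if n == 1:
        return list(u)
    h = n // 2
    left = polar_encode(u[:h])
    right = polar_encode(u[h:])
    return [left[i] ^ right[i] for i in range(h)] + right


def sc_decode(llr, frozen):
    """Return (u_hat, x_hat) for channel LLRs and a per-bit frozen mask.

    Frozen bits are assumed to be zero; positive LLRs favor bit 0.
    """
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    h = n // 2
    u_l, x_l = sc_decode([f(llr[i], llr[i + h]) for i in range(h)], frozen[:h])
    u_r, x_r = sc_decode(
        [g(llr[i], llr[i + h], x_l[i]) for i in range(h)], frozen[h:]
    )
    return u_l + u_r, [x_l[i] ^ x_r[i] for i in range(h)] + x_r


if __name__ == "__main__":
    # Illustrative (8,4) code: information bits at indices 3, 5, 6, 7.
    frozen = [True, True, True, False, True, False, False, False]
    u = [0, 0, 0, 1, 0, 0, 1, 1]
    x = polar_encode(u)
    llr = [4.0 if bit == 0 else -4.0 for bit in x]  # noiseless BPSK-like LLRs
    print(sc_decode(llr, frozen)[0] == u)
```

The recursion visits the decoding tree depth-first, so each `g` call depends on the decisions of the left subtree. This is the data dependency that makes parallelizing successive cancellation difficult.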
Article
The conditional MOVE processor (CMOVE) has been proposed as a replacement for logic-table-driven sequencers such as traffic light controllers and microcomputer I/O processors, in order to take better advantage of hardware-software tradeoffs. Herein the architecture of the CMOVE processor is sketched, and its application to traditional numerical control problems is studied. Two basic types of controllers, of potential use in industrial process control, are taken into account: the digital filter type, expressed as a ratio of two z-transform polynomials (the proportional-integral-differential (PID) controller is a particular case of the above), and the matrix multiplication type, which produces a control vector in response to a state vector input. A detailed program for a CMOVE realization of the digital filter is presented. A number of alternative realizations of the matrix controller are discussed in detail and evaluated.
Parallel Programming of a Symmetric Transport-Triggered Architecture with Applications in Flexible LDPC Encoding