Article

SCALES: SCALable and Area-Efficient Systolic Accelerator for Ternary Polynomial Multiplication

Authors:

Abstract

Polynomial multiplication is a key component in many post-quantum cryptography and homomorphic encryption schemes. One recurring variant, ternary polynomial multiplication over the ring $\mathbb{Z}_q[x]/(x^n+1)$, in which one input polynomial has ternary coefficients in $\{-1, 0, 1\}$ and the other has large integer coefficients in $\{0, \ldots, q-1\}$, has recently drawn significant attention from various communities. Following this trend, this paper presents a novel SCALable and area-Efficient Systolic (SCALES) accelerator for ternary polynomial multiplication. The work comprises three coherent, interdependent efforts. First, we rigorously derive a novel block-processing strategy and algorithm based on the schoolbook method for polynomial multiplication. Then, we implement the proposed algorithm as the SCALES accelerator using a number of field-programmable gate array (FPGA)-oriented optimization techniques. Lastly, we conduct a thorough implementation analysis to demonstrate the efficiency of the proposed accelerator. The comparison shows that the SCALES accelerator has at least 19.0% and 23.8% lower equivalent area-time product (eATP) than the state-of-the-art designs. We hope this work can stimulate continued research in the field.
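As a point of reference for the schoolbook baseline, the sketch below multiplies a large-coefficient polynomial by a ternary one in $\mathbb{Z}_q[x]/(x^n+1)$. Because each ternary coefficient is $-1$, $0$ or $1$, every partial product collapses to an addition, a subtraction, or no work at all, and terms that wrap past degree $n-1$ change sign because $x^n = -1$. This is a minimal software sketch assuming coefficient lists in ascending degree order; it is not the SCALES block-processing algorithm, and the function name `ternary_polymul` is hypothetical.

```python
def ternary_polymul(a, s, q):
    """Hypothetical helper: multiply a (coefficients in [0, q-1]) by ternary s
    (coefficients in {-1, 0, 1}) in Z_q[x]/(x^n + 1), schoolbook style,
    using only additions and subtractions."""
    n = len(a)
    if len(s) != n:
        raise ValueError("both operands must have n coefficients")
    c = [0] * n
    for j, sj in enumerate(s):
        if sj == 0:
            continue  # zero ternary coefficient: no work at all
        for i, ai in enumerate(a):
            k = i + j
            # x^(i+j) = -x^(i+j-n) once i + j >= n, because x^n = -1
            sign = sj if k < n else -sj
            idx = k % n
            if sign == 1:
                c[idx] = (c[idx] + ai) % q
            else:
                c[idx] = (c[idx] - ai) % q
    return c


if __name__ == "__main__":
    print(ternary_polymul([3, 5, 7, 11], [1, 0, -1, 1], 17))  # [5, 9, 10, 9]
```

Skipping zero coefficients and replacing multiplications by add/subtract is the property that makes the ternary case cheaper than general polynomial multiplication.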

Article
With recent improvements in internet speed and reliability, cloud computing has become an increasingly dependable solution for users. Where data privacy is critical, fully homomorphic encryption (FHE) can secure cloud computing: because FHE enables computation on encrypted data, it preserves data privacy in the cloud. One popular FHE scheme is the BFV homomorphic encryption scheme, which is based on ring learning with errors (RLWE). BFV uses ring polynomials as its main objects, so its encryption, decryption, and evaluation require high-degree polynomial multiplication. In this paper, we present the comprehensive design and implementation of a hardware architecture that accelerates encryption and decryption in the BFV scheme. Our accelerator computes polynomial multiplication with a convolution approach: a systolic array calculates the polynomial convolution, and a simple delayed subtraction then performs the polynomial modular reduction inside the accelerator's core. Moreover, a built-in Gaussian pseudo-random number generator (PRNG) generates the Gaussian noise for the encryption operations. Finally, we implement the degree-1024 BFV accelerator on the Xilinx PYNQ-Z1 board and compare its encryption and decryption performance with other methods and with a software implementation on an Intel Core i7 with 8 GB of memory. Experimental results show that, for the same polynomial degree of 1024, our accelerator requires up to 22× fewer clock cycles than other methods. Our proposed Gaussian PRNG also achieves 2× better correlation than a rotation-only PRNG. In addition, our accelerator achieves speedups of up to 9× for encryption, 3.5× for decryption, and 6.8× overall compared to Microsoft SEAL on the Intel Core i7 processor with 8 GB of memory.
The proposed design scales to higher-degree polynomial multiplication and is useful for security technologies such as high-speed secure cloud computing, blind computing, and secure communication.
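The two-step structure described above, a full linear convolution followed by a subtraction that folds the upper half back, can be sketched in software as follows. This assumes reduction modulo $x^n + 1$ (the BFV ring) and plain integer coefficients reduced mod $q$ at the end; the systolic scheduling and the Gaussian PRNG are not modeled, and all function names are hypothetical.

```python
def linear_convolution(a, b):
    """Full (unreduced) product of two coefficient lists: len(a) + len(b) - 1 terms."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def reduce_mod_xn_plus_1(c, n, q):
    """Fold coefficients of degree >= n back with a subtraction, since x^n = -1."""
    low = c[:n]
    high = c[n:] + [0] * (2 * n - len(c))  # pad the high part to n entries
    return [(lo - hi) % q for lo, hi in zip(low, high)]


def polymul_mod(a, b, q):
    """Hypothetical helper: convolution first, then the 'delayed' reduction."""
    n = len(a)
    return reduce_mod_xn_plus_1(linear_convolution(a, b), n, q)
```

For example, `polymul_mod([1, 2], [3, 4], 100)` returns `[95, 10]`, because $(1+2x)(3+4x) = 3 + 10x + 8x^2 \equiv -5 + 10x \pmod{x^2+1}$. Separating the two steps keeps the convolution core uniform, at the price of producing $2n-1$ intermediate coefficients before reduction.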
Article
Efficient implementation of finite field multipliers based on a reordered normal basis (RNB) is highly desirable in current and emerging cryptosystems, since RNB offers almost free realization of the squaring operation. In this paper, we therefore propose novel bit-parallel and digit-serial finite field multipliers over $GF(2^m)$ based on RNB. By efficiently transforming the core multiplication algorithm using a unique circular-shifting feature, we derive an efficient algorithm for low-complexity systolic mapping. Both bit-parallel and digit-serial structures of the multipliers are then obtained and optimized to enhance area-time efficiency. We also use this unique feature of the proposed multiplication algorithm to obtain systolic multipliers through a Karatsuba-like decomposition. Detailed analysis and comparison show the superior performance of the proposed implementation. For example, the proposed regular and Karatsuba-based bit-parallel designs involve at least 48.4% less area-delay product (ADP) and 42.2% less power-delay product (PDP) than the best existing designs (37.7% and 55.3% less ADP and PDP on a field-programmable gate array platform), respectively. Because of their lower area-time complexity, the proposed multipliers can be used for efficient realization of cryptographic applications.
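The Karatsuba-like decomposition mentioned above can be illustrated, with one important substitution, in a polynomial basis rather than the RNB used in the paper. Field elements of $GF(2^m)$ are stored as integers whose bits are polynomial coefficients; one Karatsuba level replaces four half-size carry-less products with three, and the result is then reduced by an irreducible polynomial. The choice $m = 8$ with $f(x) = x^8 + x^4 + x^3 + x + 1$ (the AES polynomial) is an assumption for the example, the helper names are hypothetical, and the sketch does not capture the RNB circular-shift structure or the systolic mapping.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) product of bit-vector polynomials a and b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_clmul(a, b, k):
    """One Karatsuba level: split each operand at bit k and use 3 half products."""
    mask = (1 << k) - 1
    a0, a1 = a & mask, a >> k
    b0, b1 = b & mask, b >> k
    p0 = clmul(a0, b0)
    p2 = clmul(a1, b1)
    # In characteristic 2, subtraction is XOR: middle = (a0+a1)(b0+b1) - p0 - p2
    p1 = clmul(a0 ^ a1, b0 ^ b1) ^ p0 ^ p2
    return (p2 << (2 * k)) ^ (p1 << k) ^ p0


def gf2m_reduce(c, f, m):
    """Reduce c modulo the degree-m irreducible polynomial f (bit m of f is set)."""
    for deg in range(c.bit_length() - 1, m - 1, -1):
        if (c >> deg) & 1:
            c ^= f << (deg - m)  # clears bit deg, touches only lower bits
    return c


# Polynomial-basis GF(2^8) with the AES polynomial (assumed, not RNB)
M, F = 8, 0x11B


def gf_mul(a, b):
    return gf2m_reduce(karatsuba_clmul(a, b, M // 2), F, M)
```

With these assumptions, `gf_mul(0x57, 0x83)` gives `0xC1`, the worked example from the AES specification. Hardware designs apply the same three-product identity recursively to trade multiplier area for a few extra XOR gates.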
Article
Fully homomorphic encryption has become a key technique for resolving the conflict between cloud services and privacy preservation. The most time-consuming step in homomorphic schemes is ring polynomial multiplication (RPM). Number theoretic transform (NTT) and Karatsuba algorithms efficiently accelerate RPM, yet they are limited by the modulus operations and the polynomial degree. Systolic arrays have recently been adopted for RPM processing; however, they require a modular reduction as a post-processing step, which increases the overall delay. This brief proposes a cyclic systolic array architecture without a dedicated reduction unit: the output of the systolic array is re-routed for reuse, saving 50% of the clock cycles in processing time. The corresponding FPGA implementation reduces the equivalent area-time product (eATP) by 72.9% for n = 256 and by 33.8% for n = 1024, thereby achieving an improved trade-off between performance and resource consumption.
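The idea of folding the reduction into the array can be modeled at the cycle level. Instead of emitting all $2n-1$ convolution outputs and reducing afterwards, the $b$ operand is multiplied by $x$ modulo $x^n + 1$ at every step, which is a one-position rotation whose wrapped coefficient is negated. After $n$ steps the accumulators already hold the reduced product, which illustrates how a reduction-free schedule can need roughly half the cycles. This is a simplified software model assuming one broadcast coefficient of $a$ per cycle; it is not the brief's actual processing-element design, and the function names are hypothetical.

```python
def negacyclic_shift(b, q):
    """Multiply b(x) by x modulo x^n + 1: rotate up one place, negate the wrapped term."""
    return [(-b[-1]) % q] + b[:-1]


def cyclic_systolic_polymul(a, b, q):
    """Hypothetical cycle-level model of a reduction-free (negacyclic) array."""
    n = len(a)
    acc = [0] * n        # one accumulator per modeled processing element
    shifted = list(b)    # holds x^cycle * b(x) mod (x^n + 1)
    for cycle in range(n):  # n cycles instead of 2n - 1 outputs plus reduction
        ai = a[cycle]       # coefficient broadcast to every PE in this cycle
        acc = [(acc_k + ai * s_k) % q for acc_k, s_k in zip(acc, shifted)]
        shifted = negacyclic_shift(shifted, q)
    return acc
```

For instance, `cyclic_systolic_polymul([1, 2], [3, 4], 100)` returns `[95, 10]`, the same result as convolving first and then subtracting the upper half.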
Article
Multiplication of large-degree polynomials is, in terms of execution time, the predominant operation in lattice-based cryptosystems. This motivates the study of fast and efficient hardware implementations. In addition, applications such as those using homomorphic encryption need to operate on polynomials with different parameter sets, which calls for configurable hardware architectures that support multiplication of polynomials of various degrees and coefficient sizes. In this work, we present the design and FPGA implementation of a run-time configurable, highly parallelized NTT-based polynomial multiplication architecture, which proves effective as an accelerator for lattice-based cryptosystems. The proposed polynomial multiplier can also perform Number Theoretic Transform (NTT) and inverse NTT (INTT) operations. It supports 6 different parameter sets used in lattice-based homomorphic encryption and/or post-quantum cryptosystems. We also present a hardware/software co-design framework that provides high-speed communication between the CPU and the FPGA, connected through the standard PCIe interface provided by the RIFFA driver [1]. As a proof of concept, the proposed polynomial multiplier is deployed in this framework to accelerate the decryption operation of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme implemented in the Simple Encrypted Arithmetic Library (SEAL) by the Cryptography Research Group at Microsoft Research [2]. In the proposed framework, the polynomial multiplication in BFV decryption is offloaded to the FPGA accelerator via the PCIe bus, while the remaining decryption operations execute in software on an off-the-shelf desktop computer.
The hardware part of the proposed framework targets a Xilinx Virtex-7 FPGA device, and the framework achieves a latency speedup of almost 7× for the offloaded operations compared with their pure software implementations, excluding I/O overhead.
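NTT-based multiplication, which the configurable architecture above implements in hardware, can be illustrated with toy parameters: $q = 17$, $n = 8$, and $\psi = 3$, a primitive 16th root of unity modulo 17, so that $\omega = \psi^2$ is a primitive 8th root. Pre-twisting the inputs by powers of $\psi$ turns the cyclic transform into one whose pointwise products correspond to multiplication in $\mathbb{Z}_q[x]/(x^n+1)$. For readability the transform is written as a direct $O(n^2)$ sum rather than a fast butterfly network; the parameters are illustrative assumptions, far smaller than any of the parameter sets the accelerator supports, and the function names are hypothetical. `pow(x, -1, m)` needs Python 3.8 or later.

```python
# Toy parameters (assumed): PSI has multiplicative order 2N modulo Q
Q, N, PSI = 17, 8, 3
OMEGA = PSI * PSI % Q  # primitive N-th root of unity


def ntt_negacyclic(a):
    """Forward transform: twist by PSI^j, then a direct O(N^2) DFT over Z_Q."""
    return [sum(a[j] * pow(PSI, j, Q) * pow(OMEGA, i * j, Q) for j in range(N)) % Q
            for i in range(N)]


def intt_negacyclic(A):
    """Inverse transform: inverse DFT, scale by N^-1, then undo the PSI twist."""
    n_inv = pow(N, -1, Q)
    psi_inv = pow(PSI, -1, Q)
    omega_inv = pow(OMEGA, -1, Q)
    return [n_inv * pow(psi_inv, j, Q)
            * sum(A[i] * pow(omega_inv, i * j, Q) for i in range(N)) % Q
            for j in range(N)]


def ntt_polymul(a, b):
    """Product in Z_Q[x]/(x^N + 1): transform, multiply pointwise, inverse transform."""
    A, B = ntt_negacyclic(a), ntt_negacyclic(b)
    return intt_negacyclic([x * y % Q for x, y in zip(A, B)])
```

For example, multiplying $1 + x$ by $x^7$ gives $x^7 + x^8 = x^7 - 1$, and `ntt_polymul` returns `[16, 0, 0, 0, 0, 0, 0, 1]` accordingly. The pointwise step is what makes NTT hardware attractive: after the transforms, the product costs only $n$ modular multiplications.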
Article
Fully homomorphic encryption (FHE) is a technique that allows computation on encrypted data without decryption, providing privacy in applications such as privacy-preserving cloud computing. In this article, we present two hardware architectures, built around high-performance polynomial multipliers, that are optimized for accelerating the encryption and decryption operations of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme. For proof of concept, we use our architectures in a hardware/software co-design accelerator framework in which encryption and decryption are offloaded to an FPGA device, while the remaining BFV operations execute in software on an off-the-shelf desktop computer. Specifically, our accelerator framework is optimized to accelerate the Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research. The hardware part of the framework targets a Xilinx Virtex-7 FPGA device, which communicates with the software part over a Peripheral Component Interconnect Express (PCIe) connection. We implemented our designs for degree-1024 polynomials with 8-bit plaintext coefficients and 32-bit ciphertext coefficients. The proposed framework achieves latency speedups of almost 12× and 7×, including I/O operations, for the offloaded encryption and decryption operations, respectively, compared with their pure software implementations.
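The encryption and decryption operations that such a framework offloads are dominated by polynomial products in $\mathbb{Z}_q[x]/(x^n+1)$. A toy, insecure, textbook-style BFV public-key round trip makes this concrete. The parameters ($n = 16$, $q = 2^{20}$, $t = 2^8$ for 8-bit plaintext coefficients) and the use of ternary secret and error polynomials in place of discrete Gaussian noise are assumptions chosen so the arithmetic is easy to follow; they are not the paper's parameters, this is not SEAL's API, and all names are hypothetical.

```python
import random

# Toy, insecure parameters (assumed for illustration only)
N, Q, T = 16, 1 << 20, 1 << 8
DELTA = Q // T  # scaling factor floor(q / t); here T * DELTA == Q exactly


def polymul(a, b):
    """Schoolbook product in Z_Q[x]/(x^N + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]  # x^N = -1
    return [x % Q for x in c]


def add(*polys):
    """Coefficient-wise sum of polynomials, reduced mod Q."""
    return [sum(cs) % Q for cs in zip(*polys)]


def small(rng):
    """Ternary polynomial; stands in for proper secret/error sampling (assumption)."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]


def keygen(rng):
    s, e = small(rng), small(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    b = [(-x) % Q for x in add(polymul(a, s), e)]  # b = -(a*s + e)
    return s, (b, a)


def encrypt(pk, m, rng):
    """Encrypt plaintext coefficients m (each in [0, T))."""
    b, a = pk
    u, e1, e2 = small(rng), small(rng), small(rng)
    c0 = add(polymul(b, u), e1, [DELTA * mi for mi in m])
    c1 = add(polymul(a, u), e2)
    return c0, c1


def decrypt(s, ct):
    c0, c1 = ct
    x = add(c0, polymul(c1, s))  # DELTA*m - e*u + e1 + e2*s  (mod Q)
    return [((T * xi + Q // 2) // Q) % T for xi in x]  # round(T*x/Q) mod T


if __name__ == "__main__":
    rng = random.Random(2024)
    s, pk = keygen(rng)
    msg = [rng.randrange(T) for _ in range(N)]
    assert decrypt(s, encrypt(pk, msg, rng)) == msg
```

With every small polynomial ternary, each noise coefficient is at most $2N + 1 = 33$ in absolute value, well below $\Delta/2 = 2048$, so decryption is exact for these parameters. Three of the four calls to `polymul` multiply a full-size polynomial by a ternary one, which is the workload the ternary accelerators above target.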
Article
A transform analogous to the discrete Fourier transform may be defined in a finite field and calculated efficiently by the fast Fourier transform algorithm. The transform may be applied to the problem of calculating convolutions of long integer sequences by means of integer arithmetic.
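The idea can be sketched as follows: choose a prime $p$ with a primitive $N$-th root of unity $\omega$, apply a radix-2 FFT over $\mathbb{Z}_p$, multiply pointwise, and invert. The result equals the ordinary integer convolution whenever every true output coefficient lies in $[0, p)$. The prime $p = 257$ and generator 3 are assumptions chosen for a small example (3 is a primitive root modulo 257, so $3^{256/N}$ has order exactly $N$ for any power of two $N \le 256$), and the function names are hypothetical.

```python
# Prime and primitive root (assumed toy choice): P - 1 = 2^8
P, G = 257, 3


def fft_mod(a, omega):
    """Recursive radix-2 FFT over Z_P; len(a) must be a power of two and
    omega a primitive len(a)-th root of unity modulo P."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft_mod(a[0::2], omega * omega % P)
    odd = fft_mod(a[1::2], omega * omega % P)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P  # uses omega^(n/2) = -1
        w = w * omega % P
    return out


def convolve(x, y):
    """Integer convolution of nonnegative sequences via the finite-field transform.
    Exact only if every output coefficient is smaller than P."""
    size = len(x) + len(y) - 1
    n = 1
    while n < size:
        n *= 2
    if (P - 1) % n:
        raise ValueError("transform length must divide P - 1")
    omega = pow(G, (P - 1) // n, P)
    X = fft_mod(x + [0] * (n - len(x)), omega)
    Y = fft_mod(y + [0] * (n - len(y)), omega)
    Z = [a * b % P for a, b in zip(X, Y)]
    z = fft_mod(Z, pow(omega, -1, P))  # inverse transform uses omega^-1 ...
    n_inv = pow(n, -1, P)              # ... followed by a 1/n scaling
    return [v * n_inv % P for v in z[:size]]
```

For example, `convolve([1, 2, 3], [4, 5, 6])` returns `[4, 13, 28, 27, 18]`. Because every step is exact integer arithmetic modulo $p$, there is no rounding error, which is what allows the same idea to underpin NTT-based polynomial multipliers.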
Article
We present a new tensoring technique for LWE-based fully homomorphic encryption. Whereas in all previous works the ciphertext noise grows quadratically ($B \to B^2 \cdot \mathrm{poly}(n)$) with every multiplication (before “refreshing”), our noise grows only linearly ($B \to B \cdot \mathrm{poly}(n)$). We use this technique to construct a scale-invariant fully homomorphic encryption scheme, whose properties depend only on the ratio between the modulus q and the initial noise level B, and not on their absolute values. Our scheme has a number of advantages over previous candidates: it uses the same modulus throughout the evaluation process (no need for “modulus switching”), and this modulus can take an arbitrary form. In addition, security can be classically reduced from the worst-case hardness of the GapSVP problem (with a quasi-polynomial approximation factor), whereas previous constructions could only exhibit a quantum reduction from GapSVP.
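To see why linear growth matters, a short calculation helps. Assume, as a simplification, that every multiplication applies the same factor $p = \mathrm{poly}(n)$ and that $L$ sequential multiplications start from noise $B$. Unrolling the two recurrences gives:

```latex
\begin{aligned}
\text{quadratic:}\quad & B_{i+1} = B_i^{2}\, p
  \;\Longrightarrow\; B_L = B^{2^L}\, p^{\,2^L - 1},\\
\text{linear:}\quad & B_{i+1} = B_i\, p
  \;\Longrightarrow\; B_L = B\, p^{L}.
\end{aligned}
```

Since decryption stays correct only while $B_L$ remains below $q$ (up to constant factors), the linear case supports $L < \log(q/B)/\log p$ multiplications, a bound that depends on $q$ and $B$ only through their ratio, consistent with the scale-invariance described above. In the quadratic case the exponent $2^L$ forces $\log q$ to grow exponentially with the depth.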