Publications (5)

Article: An RSA Encryption Hardware Algorithm using a Single DSP Block and a Single Block RAM on the FPGA.
ABSTRACT: The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM, and a few logic blocks (slices) in a Xilinx Virtex-6 family FPGA. The implementation results show that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26 ms. Quite surprisingly, the multiplier in the DSP48E1 used to compute Montgomery multiplication is active in more than 97% of all clock cycles. Hence, our implementation is close to optimal in the sense that it has less than 3% overhead in multiplication, and no further improvement is possible as long as a Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one block RAM, a number of RSA modules can be implemented in one FPGA and work in parallel to attain high-throughput RSA encryption/decryption.
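As a rough software illustration of the Montgomery-multiplication-based exponentiation this abstract refers to, the sketch below works on whole Python integers. It is a textbook formulation, not the paper's DSP48E1 datapath; the function names (`montgomery_setup`, `mont_mul`, `mont_pow`) and the `bits` parameter are hypothetical, and `pow(x, -1, m)` assumes Python 3.8 or later.

```python
def montgomery_setup(modulus: int, bits: int):
    """Precompute Montgomery constants for an odd modulus below R = 2**bits."""
    if modulus % 2 == 0 or modulus >= (1 << bits):
        raise ValueError("modulus must be odd and smaller than 2**bits")
    r = 1 << bits
    # n_prime satisfies modulus * n_prime == -1 (mod R).
    n_prime = (-pow(modulus, -1, r)) % r
    r2 = (r * r) % modulus  # used to convert into Montgomery form
    return r, n_prime, r2


def mont_mul(a: int, b: int, modulus: int, bits: int, n_prime: int) -> int:
    """Return a * b * R^-1 mod modulus using only shifts, masks, and multiplies."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * modulus) >> bits  # exact division by R
    return u - modulus if u >= modulus else u


def mont_pow(base: int, exponent: int, modulus: int, bits: int) -> int:
    """Left-to-right square-and-multiply carried out in Montgomery form."""
    r, n_prime, r2 = montgomery_setup(modulus, bits)
    x = mont_mul(base % modulus, r2, modulus, bits, n_prime)  # base * R mod N
    acc = r % modulus                                         # 1 * R mod N
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, modulus, bits, n_prime)      # square
        if bit == "1":
            acc = mont_mul(acc, x, modulus, bits, n_prime)    # multiply
    return mont_mul(acc, 1, modulus, bits, n_prime)           # leave Montgomery form
```

For example, `mont_pow(65, 17, 3233, 12)` agrees with Python's built-in `pow(65, 17, 3233)`. The point of the technique is that the reduction step needs no division by the modulus, which is why a single hardware multiplier can stay busy for most of the computation.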
ABSTRACT: The main contribution of this paper is to present efficient hardware algorithms for the modulo exponentiation P<sup>E</sup> mod M used in RSA encryption and decryption, and to implement them on the FPGA. The key ideas to accelerate the modulo exponentiation are to use Montgomery modulo multiplication on the redundant radix-64K number system in the FPGA, and to use embedded 18 × 18-bit multipliers and embedded 18 kbit block RAMs effectively. Our hardware algorithms for the modulo exponentiation for R-bit numbers P, E, and M can run in fewer than (2R + 4)(R/16 + 1) clock cycles, and in an expected (1.5R + 4)(R/16 + 1) clock cycles. We have implemented our modulo exponentiation hardware algorithms on a Xilinx Virtex-II Pro family FPGA XC2VP30-6. The implementation results show that our hardware algorithm for 1024-bit modulo exponentiation runs in less than 2.521 ms and in an expected 1.892 ms.
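The cycle-count formulas in this abstract can be checked numerically. The sketch below evaluates them for R = 1024 and derives the clock frequency implied by the reported run times. That frequency is an inference from the abstract's figures, not a number the abstract states, and the function names are hypothetical. One plausible reading, also an assumption, is that the 2R term counts R squarings plus up to R multiplications, while 1.5R reflects an exponent in which about half the bits are 1.

```python
def worst_case_cycles(r_bits: int) -> int:
    """Upper bound from the abstract: (2R + 4)(R/16 + 1)."""
    return (2 * r_bits + 4) * (r_bits // 16 + 1)


def expected_cycles(r_bits: int) -> int:
    """Expected count from the abstract: (1.5R + 4)(R/16 + 1)."""
    return (3 * r_bits // 2 + 4) * (r_bits // 16 + 1)


def implied_clock_hz(cycles: int, seconds: float) -> float:
    """Clock frequency that would make `cycles` take `seconds` (inferred)."""
    return cycles / seconds


if __name__ == "__main__":
    worst = worst_case_cycles(1024)      # 2052 * 65 = 133380
    expected = expected_cycles(1024)     # 1540 * 65 = 100100
    print(worst, expected)
    # Both reported times point to roughly the same clock, about 52.9 MHz.
    print(implied_clock_hz(worst, 2.521e-3) / 1e6)
    print(implied_clock_hz(expected, 1.892e-3) / 1e6)
```

That both reported times imply nearly the same frequency suggests the formulas and the measured figures are mutually consistent.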
Conference Paper: Accelerating Montgomery Modulo Multiplication for Redundant Radix-64k Number System on the FPGA Using Dual-Port Block RAMs
ABSTRACT: The main contribution of this paper is to present hardware algorithms for the redundant radix-2<sup>r</sup> number system in the FPGA to accelerate Montgomery modulo multiplication of numbers with many bits, which has applications in security systems such as RSA encryption and decryption. Quite surprisingly, our hardware algorithm for Montgomery modulo multiplication of two dr-bit numbers completes in only d+1 clock cycles. Since most FPGAs have 18-bit multipliers and 18 kbit block RAMs, it makes sense to let r=16. Our hardware algorithm for Montgomery modulo multiplication of 256-bit numbers runs in only 17 clock cycles using the redundant radix-64k (i.e., radix-2<sup>16</sup>) number system. The experimental results for the Xilinx Virtex-II Pro family FPGA XC2VP100-6 show that the clock frequency of our circuit is independent of d. Further, the hardware algorithm for 1024-bit Montgomery modulo multiplication using the redundant number system is 3 times faster than that using the conventional number system. Also, for 256-bit Montgomery modulo multiplication, our hardware algorithm runs in 0.322 μs, while a previously known implementation runs in 1.22 μs, although our implementation uses fewer than half as many slices.
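To make the "d digits of r bits" structure concrete, here is a minimal digit-serial Montgomery multiplication in radix 2^r, processing one r-bit digit of the first operand per loop iteration. It is a standard textbook variant, assumed here for illustration. It does not model the redundant digit representation or the block RAM usage described in the abstract, and its d iterations only loosely correspond to the d+1 clock cycles reported. The function name `mont_mul_digit_serial` is hypothetical.

```python
def mont_mul_digit_serial(x: int, y: int, m: int, d: int, r: int = 16) -> int:
    """Return x * y * 2^(-r*d) mod m for odd m, with x, y < m < 2^(r*d)."""
    if m % 2 == 0 or not (0 <= x < m and 0 <= y < m and m < (1 << (r * d))):
        raise ValueError("need odd m and operands below m < 2**(r*d)")
    base = 1 << r
    mask = base - 1
    m_inv = (-pow(m, -1, base)) % base   # -m^-1 mod 2^r (Python 3.8+)
    acc = 0
    for i in range(d):                   # one r-bit digit of x per iteration
        x_i = (x >> (r * i)) & mask
        acc += x_i * y
        q = ((acc & mask) * m_inv) & mask  # makes the low digit of acc + q*m zero
        acc = (acc + q * m) >> r           # exact division by 2^r
    # acc stays below 2m throughout, so one conditional subtraction suffices.
    return acc - m if acc >= m else acc
```

With r = 16 and 256-bit operands, d = 16, so the loop body runs 16 times. Each iteration needs only r-bit by many-bit products and a shift, which matches the abstract's reasoning for choosing r = 16 on FPGAs with 18-bit multipliers.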
ABSTRACT: The main contribution of this paper is to present a simple, scalable, and portable tiny processing system which can be implemented in various FPGAs. Our processing system includes a 16-bit processor, a cross assembler, and a cross compiler. The 16-bit processor runs at 89 MHz on the Xilinx Spartan-3A family FPGA XC3S700A using 336 out of 5888 slices (5.7%) and at 76 MHz on the Altera Cyclone III family EP3C25F324 using 569 out of 24624 logic elements (2.3%). Every instruction executes in only one clock cycle, that is, CPI=1. Using the cross assembler and cross compiler that we have developed, a program in a C-based language can be translated into machine language object code, which can be executed on the 16-bit processor. The source code of our processing system is very simple and compact. The 16-bit processor is written in 268 lines of Verilog HDL, and the cross assembler is written in 38 lines of Perl. The cross compiler has 23 lines of Flex grammar file for lexical analysis and 90 lines of Bison grammar file for context analysis and code generation. Hence, our tiny processing system is portable and easy to understand, and extending its functions is not difficult. In fact, the tiny processing system has been used as course material in an embedded systems course for graduate students. Further, the 16-bit processor is scalable, that is, the word size can be changed from the standard 16 bits. As real-life applications, we have developed a PONG-like mini game and an RSA encryption/decryption system based on the tiny processing system. Therefore, our tiny processing system benefits computer system education and small embedded system development.
Conference Paper: Redundant Radix-2<sup>r</sup> Number System for Accelerating Arithmetic Operations on the FPGAs
ABSTRACT: The main contribution of this paper is to present hardware algorithms for the redundant radix-2<sup>r</sup> number system in the FPGA to speed up arithmetic operations on numbers with many bits, which have applications in security systems such as RSA encryption and decryption. Our hardware algorithms accelerate arithmetic operations including addition, multiplication, and Montgomery modulo multiplication. Quite surprisingly, our hardware algorithms for multiplication and Montgomery multiplication of two 1024-bit numbers run in only 64 clock cycles using the redundant radix-2<sup>16</sup> number system. Also, the experimental results for the Xilinx Virtex-II Pro family FPGA XC2VP100-6 show that the clock frequency of our circuit is independent of the number of bits. The speedup factors of our hardware algorithms using the redundant number system over those using the conventional number system are 8.3 for 1024-bit addition, 3.4 for 1024-bit multiplication, and 2.5 for 1024-bit Montgomery modulo multiplication. Further, for 256-bit Montgomery modulo multiplication, our hardware algorithm runs in 0.38 μs, while a previously known implementation runs in 1.22 μs. Thus, our approach using the redundant number system for arithmetic operations is very efficient.
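The abstract does not spell out its digit representation, so the following is only one common way to realize a redundant radix-2^16 number system, assumed for illustration. Each digit may temporarily exceed 2^16 - 1, so an addition hands each carry to the neighbouring digit exactly once instead of rippling it across all 64 digits. All names (`to_digits`, `from_digits`, `add_redundant`) are hypothetical.

```python
R = 16                 # bits per digit (radix 2^16)
MASK = (1 << R) - 1


def to_digits(value: int, d: int) -> list:
    """Split a non-negative integer into d little-endian radix-2^16 digits."""
    return [(value >> (R * i)) & MASK for i in range(d)]


def from_digits(digits: list) -> int:
    """Evaluate a (possibly redundant) digit vector back to an integer."""
    return sum(digit << (R * i) for i, digit in enumerate(digits))


def add_redundant(a: list, b: list) -> list:
    """Add two equal-length digit vectors without carry propagation.

    Each position is summed independently; its overflow moves one position
    up and is simply added there, so a result digit may be larger than MASK.
    Inputs with digits below 2^(R+1) give outputs below 2^(R+1) as well.
    """
    sums = [x + y for x, y in zip(a, b)]
    lows = [s & MASK for s in sums] + [0]        # low R bits of each position
    carries = [0] + [s >> R for s in sums]       # overflow, shifted up one digit
    return [lo + c for lo, c in zip(lows, carries)]


if __name__ == "__main__":
    a = to_digits((1 << 1024) - 1, 64)
    b = to_digits(1, 64)
    total = add_redundant(a, b)
    print(from_digits(total) == 1 << 1024)  # True
    print(total[1])                         # 65536, a digit above MASK
```

Because every digit position is handled independently, the work per addition does not depend on a long carry chain. This is in the spirit of the abstract's observation that the clock frequency is independent of the number of bits.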
Publication Stats
25  Citations  
Institutions

2008-2009

Hiroshima University
 Department of Information Engineering
Hiroshima, Hiroshima, Japan
