[Show abstract][Hide abstract] ABSTRACT: The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM and few logic blocks (slices) in the Xilinx Virtex-6 family FPGA. The implementation results showed that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26ms. Quite surprisingly, the multiplier in DSP48E1 used to compute Montgomery multiplication works in more than 97% clock cycles over all clock cycles. Hence, our implementation is close to optimal in the sense that it has only less than 3% overhead in multiplication and no further improvement is possible as long as Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, we can implement a number of RSA modules in an FPGA that can work in parallel to attain high throughput RSA encryption/decryption.
[Show abstract][Hide abstract] ABSTRACT: The main contribution of this paper is to present efficient hardware algorithms for the modulo exponentiation P<sup>E</sup> mod M used in RSA encryption and decryption, and implement them on the FPGA. The key ideas to accelerate the modulo exponentiation are to use the Montgomery modulo multiplication on the redundant radix-64 K number system in the FPGA, and to use embedded 18 times 18-bit multipliers and embedded 18 k-bit block RAMs in effective way. Our hardware algorithms for the modulo exponentiation for R-bit numbers P, E, and M can run in less than (2R + 4)(R/16 + 1) clock cycles and in expected (1.5R + 4)(R/16 +1) clock cycles. We have implemented our modulo exponentiation hardware algorithms on Xilinx VirtexII Pro family FPGA XC2VP30-6. The implementation results shows that our hardware algorithm for 1024-bit modulo exponentiation can be implemented to run in less than 2.521 ms and in expected 1.892 ms.
[Show abstract][Hide abstract] ABSTRACT: The main contribution of this paper is to present hardware algorithms for redundant radix-2<sup>r</sup> number system in the FPGA to accelerate Montgomery modulo multiplication with many bits, which have applications in security systems such as RSA encryption and decryption. Quite surprisingly, our hardware algorithm for Montgomery modulo multiplication of two dr-bit numbers can be completed in only d+1 clock cycles. Since most FPGAs have 18-bit multipliers and 18 k-bit block RAMs, it makes sense to let r=16. Our hardware algorithm for Montgomery modulo multiplication for 256-bit numbers runs only 17 clock cycles using redundant radix-64 k (i.e.radix-2<sup>16</sup>) number system. The experimental results for Xilinx Virtex-II Pro Family FPGA XC2VP100-6 show that the clock frequency of our circuit is independent of d. Further, the hardware algorithm for 1024-bit Montgomery modulo multiplication using the redundant number system is 3 times faster than that using the conventional number system. Also, for 256-bit Montgomery modulo multiplication, our hardware algorithm runs in 0.322 mus, while a previously known implementation runs in 1.22 mus although our implementation uses less than a half slices.
Embedded and Ubiquitous Computing, 2008. EUC '08. IEEE/IFIP International Conference on; 01/2009
[Show abstract][Hide abstract] ABSTRACT: The main contribution of this paper is to present a simple, scalable, and portable tiny processing system which can be implemented in various FPGAs. Our processing system includes a 16-bit processor, a cross assembler, and a cross compiler. The 16-bit processor runs in 89 MHz on the Xilinx Spartan-3A family FPGAXC3S700A using 336 out of 5888 slices (5.7%)and in 76 MHz on the Altera Cyclon III family EP3C25F324 using 569 out of 24624 logic elements (2.3%). Every instruction can be executed in only one clock cycle, that is, CPI=1. Using a cross assembler and a cross compiler that we have developed, a C-based language program can be translated into a machine language object code, which can be executed on the 16-bit processor. The source codes of our processing system are very simple and compact. The 16-bit processor is designed by Verilog~HDL using 268 lines, and the cross assembler is written in 38 lines using Perl language. The cross compiler has 23 lines of Flex grammar file for lexical analysis, and 90 lines of Bison grammar file for context analysis and code generation. Hence, our tiny processing system is portable and easy to understand and the function expansion is not difficult. Actually, the tiny processing system has been used for the embedded system course of graduate students as a course material. Further, the 16-bit processor is scalable, that is, the word size can be changed from standard 16 bits. As real-life applications, we have developed a PONG-like mini game and an RSA encryption/decryption system based on the tiny processing system. Therefore, our tiny processing system benefits computer system education and small embedded system development.
2008 IEEE/IPIP International Conference on Embedded and Ubiquitous Computing (EUC 2008), Shanghai, China, December 17-20, 2008, Volume II: Workshops; 01/2008
[Show abstract][Hide abstract] ABSTRACT: The main contribution of this paper is to present hardware algorithms for redundant radix-2r number system in the FPGA to speed the arithmetic operations for numbers with many bits, which have applications in security systems such as RSA encryption and decryption. Our hardware algorithms accelerate arithmetic operations including addition, multiplication, and Montgomery modulo multiplication.Quite surprisingly, our hardware algorithms of the multiplication and Montgomery multiplication for two 1024-bit numbers runs only 64 clock cycles using redundant radix-216 number system. Also, the experimental results for Xilinx Virtex-II Pro Family FPGA XC2VP100-6 show that the clock frequency of our circuit is independent of the number of bits. The speed up factors of our hardware algorithm using the redundant number system over those using the conventional number system are 8.3 for 1024-bit addition, 3.4 for 1024-bit multiplication, and 2.5 for 1024-bit Montgomery modulo multiplication. Further, for 256-bit Montgomery modulo multiplication, our hardware algorithm runs in 0.38 mus, while a previously known implementation runs in 1.22 mus. Thus, our approach using redundant number system for arithmetic operations is very efficient.
Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2008, Dunedin, Otago, New Zealand, 1-4 December 2008; 01/2008