ABSTRACT: Modular processing of large numbers requires high-speed computing resources. In particular, modular exponentiation is an operation that heavily slows the whole computation. A previous method reduces the computation of $|x^e|_m$ to at most n simpler modular exponentiations $|x_i^{\eta_i}|_{m_i}$, where $m_i$ is an element of the factorization of m, $x_i = |x|_{m_i}$ and $\eta_i < m_i$. A technique for speeding up $|x_i^{\eta_i}|_{m_i}$ is presented. The idea is to extend the discrete logarithm, reducing the computation to a single fixed-base exponentiation, which can be tabulated. More precisely, depending on the modulus value, specific expressions are devised, leading to the definition of an extension of the discrete logarithm. Using these expressions, a table-based method is presented and evaluated in terms of table size. An implementation of the method is also presented and its response time is derived; for moduli up to about 28 bits it amounts to 120–150 ns.
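As an illustration of the table-based idea, the following is a minimal sketch of the classical case only: a prime modulus p with a primitive root g, where $|x^e|_p = |g^{(e \cdot \log_g x) \bmod (p-1)}|_p$. It does not reproduce the extension of the discrete logarithm described in the abstract, and the names `build_tables` and `modexp_by_tables` are hypothetical.

```python
def build_tables(p, g):
    """Build discrete-log and antilog tables for a prime p and base g.

    Assumption of this sketch: p is prime and g is a primitive root mod p.
    """
    log = {}                  # log[v] = k such that g**k % p == v
    antilog = [0] * (p - 1)   # antilog[k] = g**k % p
    v = 1
    for k in range(p - 1):
        antilog[k] = v
        log[v] = k
        v = (v * g) % p
    if len(log) != p - 1:
        raise ValueError("g is not a primitive root modulo p")
    return log, antilog


def modexp_by_tables(x, e, p, log, antilog):
    """Compute x**e mod p with two table look-ups and one multiplication mod p-1."""
    x %= p
    if x == 0:
        # Zero has no discrete logarithm; handle it separately.
        return 0 if e > 0 else 1
    return antilog[(log[x] * e) % (p - 1)]
```

With the tables built once per modulus, each exponentiation costs a constant number of look-ups, which is why table size becomes the main figure to evaluate.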
ABSTRACT: In this work, a fast digital device customized to implement an artificial neuron is defined. Its high computational speed is obtained by mapping data from floating-point to integer residue representation, and by computing neuron functions through residue arithmetic operations with the use of table look-up techniques. Specifically, the logic design of a residue neuron is described, and complexity figures for the area occupancy and time consumption of the proposed device are derived. The approach was applied to the logic design of a residue neuron with 12 inputs and with a Residue Number System defined so as to attain an accuracy better than or equal to that of a 20-bit floating-point system. The proposed design (NEUROM) exploits the RNS carry independence property to speed up computations; in addition, it is very well suited to look-up tables. The response time of our device is about 8 x T(ACC), where T(ACC) is the ROM access time. With a value of T(ACC) close to the 10 ns allowed by current ROM technology, the proposed neuron responds within 80 ns. NEUROM is therefore, among the neuron devices proposed in the literature, the one that allows the maximum throughput. Moreover, when a pipeline mode of operation is adopted, the pipeline delay can be as low as about 14 ns. In the case study considered, the total amount of ROM is about 5.55 Mbits. Thus, using current technology, it is possible to integrate several residue neurons into a single VLSI chip, thereby enhancing chip throughput. The paper also discusses how this amount of memory could be reduced, at the expense of the response time.
ABSTRACT: In many problems, modular exponentiation $|x^b|_m$ is a basic computation, often responsible for the overall time performance, as in some cryptosystems, since its implementation requires a large number of multiplications. It is known that $|x^b|_m = |x^{|b|_{\phi(m)}}|_m$ for any x in [1, m−1] if m is prime; in this case the number of multiplications depends on $\phi(m)$ instead of b. It was also stated that this relation holds in the case m = pq, with p and q prime; this case occurs in the RSA method. In this paper it is proved that the relation holds in general for any x in [1, m−1] when m is a product of any number n of distinct primes, and that in all other cases it does not hold over the whole range [1, m−1]. Moreover, a general method is given to compute $|x^b|_m$ without any hypothesis on m, for any x in [1, m−1], with a number of modular multiplications not exceeding those required when m is a product of primes. Next, it is shown that representing x in a residue number system (RNS) with proper moduli $m_i$ allows $|x^b|_m$ to be computed by n modular exponentiations $|x_i^b|_{m_i}$ in parallel and, in turn, b to be replaced by $|b|_{\phi(m_i)}$ in the worst case, thus executing a very low number of multiplications, namely $\lceil \log_2 m_i \rceil$ for each residue digit. A general architecture is also proposed and evaluated as a possible implementation of the proposed method for modular exponentiation.
Journal of Systems Architecture 08/2002; 47(14-15):1079-1088. DOI:10.1016/S1383-7621(02)00058-9 · 0.44 Impact Factor
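The exponent-reduction relation in the preceding abstract can be checked numerically with a small sketch. Two points are assumptions of this example rather than statements from the paper: a reduced exponent of zero is mapped to $\phi(m)$ (so that the identity also covers x not coprime to m), and the recombination of residue digits uses the textbook Chinese remainder theorem. All function names are hypothetical.

```python
from math import gcd, prod


def phi(m):
    """Euler's totient, computed by brute force (fine for small m)."""
    return sum(1 for k in range(1, m + 1) if gcd(k, m) == 1)


def reduced_exponent(b, m):
    """Reduce b >= 1 modulo phi(m), mapping a zero remainder to phi(m)."""
    t = phi(m)
    r = b % t
    return r if r != 0 else t


def holds_for_all_x(m, max_b=40):
    """Check x**b == x**reduced_exponent(b, m) (mod m) for all x in [1, m-1]."""
    return all(
        pow(x, b, m) == pow(x, reduced_exponent(b, m), m)
        for x in range(1, m)
        for b in range(1, max_b + 1)
    )


def modexp_rns(x, b, primes):
    """Compute x**b mod prod(primes) digit by digit, for distinct primes and b >= 1."""
    m = prod(primes)
    # One small exponentiation per residue digit; these are independent of each other.
    digits = [pow(x % p, reduced_exponent(b, p), p) for p in primes]
    # Recombine the digits with the Chinese remainder theorem.
    total = 0
    for p, d in zip(primes, digits):
        partial = m // p
        total += d * partial * pow(partial, -1, p)
    return total % m
```

Under these assumptions, `holds_for_all_x(30)` is true (30 = 2·3·5 is a product of distinct primes), while `holds_for_all_x(12)` is false: for x = 2 and b = 5 the reduced exponent is 1, but $|2^5|_{12} = 8 \neq 2$.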
ABSTRACT: The parallelism of computation that characterizes some operations in residue number systems (RNS) is heavily reduced in operations such as division, magnitude and sign detection, since numbers must be converted to the weighted system, which reduces efficiency in spite of the efforts to speed up the conversion. In this work the problem of detecting the sign of numbers represented in RNS is considered, and a procedure is devised which keeps numbers in residue notation and requires a redundant modulus $m_{p+1} \geq 2$. A sign-detecting circuit is also designed that, merely to speed up the operation, exploits a further redundant modulus $m_r \geq p$ in the signed number representation. Circuit response time is evaluated both from the complexity point of view and in a finite case, where 50 gate delays are estimated for the range $[-2^{64}, 2^{64}-1]$.
Journal of Systems Architecture 11/1998; 45(3):251-258. DOI:10.1016/S1383-7621(97)00085-4 · 0.44 Impact Factor
ABSTRACT: In this study a logic design, based on RNS units, to perform the N-point FFT on a continuous data stream is proposed, and its performance is evaluated in terms of asymptotic VLSI complexity. The structure is based on quadratic residue number systems (QRNS), which allow a simplified processing of complex numbers. It is known in the literature that a lower bound on complexity for a single instance of the FFT problem is $A(N)\,T^2(N) = \Omega(N^2 \log^2 N)$, and that optimal constructive designs were proposed with $T(N) = \vartheta(\log^2 N)$. The architecture proposed here is based on a very high degree of processing parallelism and on a communication parallelism tailored to the response time of the adders and multipliers used in the design; furthermore, by pipelining data, it performs as an optimal design for a single FFT instance and features an upper bound for the mean service time $T_m(N) = \vartheta(\log \log \log N)$ for each FFT instance. The approach has also been applied to the design of a structure performing a 1024-point FFT with 23-bit data.
ABSTRACT: Repeated modular additions and overflow detection are possible in redundant hybrid number systems (RHNS). In this paper a circuit is proposed that implements the overflow-detecting procedure in such systems and allows a mean addition time of about 10.5 gate delays for numbers having a magnitude order normally distributed in the range $[-2^{33}, 2^{33}-1]$, versus the 14 gate delays required by 32-bit CLA adders.
ABSTRACT: The problem of designing efficient multioperand modular adders is
approached. A carry-save adder tree is used, allowing a response time
lower than values in the literature, without restrictions regarding
the moduli. Moreover, it is shown that a structure such as the proposed one
is suitable for evaluating $|X|_m$ within a logarithmic time
ABSTRACT: A lower bound $AT^2 = \Omega(n^2)$ for the conversion from positional to residue representation is
derived according to VLSI complexity theory, and existing solutions for
the same problem are briefly reviewed in the light of such a bound. A
VLSI system is proposed, one that operates according to a pipeline
scheme and works asymptotically emulating an optimal structure,
independently of residue number system parameters. This solution has
been applied to a design of specific size (64-b input stream), and it
has been found that a single CMOS custom chip can implement the design
with a throughput of one residue representation every 30-40 ns
IEEE Transactions on Computers 09/1993; DOI:10.1109/12.238486 · 1.66 Impact Factor
ABSTRACT: A novel method to compute the exact digits of the modulo m product of integers is proposed, and a modulo m multiply structure is defined. Such a structure can be implemented by means of a few fast VLSI binary multipliers, and a response time of about 150-200 ns to perform modular multiplications with moduli up to 32767 can be reached. A comparison to ROM-based structures is also provided. The modular multiplier has been evaluated asymptotically, according to the VLSI complexity theory, and it turned out to be an optimal design. This structure can be used to implement a residue multiplier in arithmetic structures using residue number systems (RNSs). The complexity of this residue multiplier has been evaluated and lower complexity figures than for ROM-based multiply structures have been obtained under several hypotheses on RNS parameters
ABSTRACT: High computing speed and modularity have long made RNS-based arithmetic processors attractive, especially in signal processing, where additions and multiplications are very frequent. VLSI technology has renewed this interest because RNS-based circuits are becoming more feasible; however, intermodular operations degrade their performance, and considerable effort has been devoted to this topic. We deal with the problem of performing the basic operation X (mod m), that is, the remainder of the integer division X/m, for large values of the integer X, following an approximating and correcting approach which guarantees the correctness of the result. We also define a structure to compute X (mod m) by means of a few fast VLSI binary multipliers, which is exemplified for 32-bit numbers, obtaining a total response time lower than 200 ns. Furthermore, this structure is evaluated in terms of VLSI complexity, and area and time figures $A = \vartheta(n^2 / T_M^2)$ and $T = \vartheta(T_M)$ are derived for the parameter $T_M$ in $[\log n, n]$. Finally, a simple positional-to-residue converter based on this structure is presented; it improves some complexity results previously obtained by the authors.
Journal of VLSI Signal Processing 03/1990; 1(4). DOI:10.1007/BF00929920 · 0.73 Impact Factor
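The approximate-then-correct idea in the preceding abstract can be illustrated with a Barrett-style reduction: a precomputed reciprocal gives a quotient estimate using multiplications and shifts only, and a final subtraction fixes the remainder. This is a swapped-in, well-known technique offered as a sketch; it is not claimed to be the structure defined in the paper, and `make_reducer` is a hypothetical name.

```python
def make_reducer(m, n_bits=32):
    """Return a function computing x mod m for 0 <= x < 2**n_bits.

    Assumes 1 <= m < 2**n_bits. The reciprocal is computed once per modulus.
    """
    k = n_bits
    mu = (1 << k) // m  # floor(2**k / m), the precomputed approximate reciprocal

    def reduce(x):
        if not 0 <= x < (1 << k):
            raise ValueError("x out of range")
        # Approximation step: q never exceeds the true quotient floor(x / m)
        # and, for x < 2**k, is at most 1 below it.
        q = (x * mu) >> k
        r = x - q * m
        # Correction step: bring the remainder into [0, m).
        while r >= m:
            r -= m
        return r

    return reduce
```

Because the estimate is off by at most one for inputs below $2^k$, the correction loop runs at most once here, so the cost is dominated by two multiplications.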
ABSTRACT: Residue Number Systems (RNS) are suited for high-speed applications and VLSI implementations, because of the modular and parallel nature of their arithmetic. In this paper a new solution is provided to the problem of designing VLSI structures for converting integers to and from residue number systems in the area of pipelined applications, with the constraint that the layout width is comparable with the data stream width. The proposed structure is suited for both direct and reverse conversion and has complexity figures better than previously known results, evaluated under several hypotheses on the RNS parameters.
ABSTRACT: A method to interpolate ordered sets of points is presented. This method is based upon a new spline function (the angular spline) which has been derived on the basis of geometrical considerations and whose parametric expressions use a properly chosen angle as a parameter. The method has several important features, e.g., smoothness, continuity, flexibility, locality and invariance under affine transformations. Moreover, the only information required is the set of data points to be fitted. The main analytical and geometrical properties are proved, and several examples are given to show the graphical behaviour of the interpolation method and its ability to reproduce a vast variety of shapes.
Computer Vision Graphics and Image Processing 07/1987; 39(1):56–72. DOI:10.1016/S0734-189X(87)80202-5
ABSTRACT: Special-purpose hardware devices devoted to displaying families of curves are very attractive when high-performance graphic tools are to be designed. In this paper an architecture well suited for fast hardware curve generators is proposed, mainly based on the use of vector generators and ROMs. Curve graphs are approximated by polygonal lines, the extremes of which and a selected subset of vertices can be obtained with the required precision. The output rate is shown to be very close to the rate of available vector generators. As an example, a device adopting this architecture has been designed for the generation of conic and exponential curves. Precision figures have been obtained under the hypothesis that the generator hardware complexity allows a single-chip implementation. The architecture is easily extensible to three-dimensional curves.
ABSTRACT: Residue Number Systems (RNS) have proved useful in many applications, for example in signal processing. In this paper, a VLSI computing architecture is proposed for converting an integer N from the weighted binary representation into and out of a residue code based on s moduli. For this architecture a possible layout is given and its complexity is evaluated in terms of area and time. Under several hypotheses on RNS parameters, constructive upper bounds ranging from $O(n^2 \log n)$ to $O(n^2 \log \log n)$ and from $O(\log^2 n)$ to $O(\log n)$ for area and time, respectively, have been obtained for the direct conversion. For the reverse conversion, by contrast, constructive upper bounds $A = O(n^2 \log n)$ and $T = O(\log^2 n)$ have been found, independent of the former hypotheses.
IEEE Transactions on Circuits and Systems 01/1985; DOI:10.1109/TCS.1984.1085465
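For readers unfamiliar with the two conversions discussed in the preceding abstract, here is a minimal software sketch. It assumes pairwise coprime moduli, uses a table of bit weights $|2^i|_{m_j}$ for the direct conversion and the textbook Chinese remainder theorem for the reverse one; it says nothing about the VLSI layout or the complexity bounds of the paper, and the function names are hypothetical.

```python
from math import prod


def to_rns(n, moduli, n_bits=64):
    """Direct conversion: binary integer 0 <= n < 2**n_bits to residue digits."""
    residues = []
    for m in moduli:
        # Table of the bit weights 2**i mod m (would be a ROM in hardware).
        weights = [pow(2, i, m) for i in range(n_bits)]
        # Sum the weights of the bits that are set, then reduce once.
        total = sum(w for i, w in enumerate(weights) if (n >> i) & 1)
        residues.append(total % m)
    return residues


def from_rns(residues, moduli):
    """Reverse conversion via the Chinese remainder theorem.

    Assumes the moduli are pairwise coprime; the result lies in [0, prod(moduli)).
    """
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m
```

For example, with the moduli 7, 11, 13 and 15 the integer 100 has residue digits [2, 1, 9, 10], and `from_rns` recovers 100 from them.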
ABSTRACT: Many FFT processor designs have been proposed, most of which have been limited by hardware costs when a large number of points is to be processed. In recent years, VLSI technology has changed design methodology and brought a general reduction of costs. The aim of this work is to present a fast, near-optimum VLSI architecture for solving an N-point FFT which exhibits $T = \vartheta(\log \log N)$ and $AT^2 = \vartheta(N^2 \log^2 N \log \log N)$. Its main features are very high parallelism, proper communication parallelism, residue arithmetic, table look-up techniques and data pipelining. Moreover, it is shown that design performance does not depend on the input and output data representation (residue or weighted notation).
ABSTRACT: The advent of VLSI technology has deeply modified the design of digital systems. The structure of special algorithms is now close to the structure of communication and computing resources on the silicon chip. Modular and regular structures allow parallel VLSI algorithms with good complexity figures in terms of speed and size. In this paper systolic arrays of processors are used to define two new, faster VLSI algorithms for multiplying two band matrices. The proposed algorithms are based on different area-time trade-offs: they exhibit $w_A \cdot w_B$ processors and n steps, and $n^2$ processors and $\min(w_A, w_B)$ steps, respectively, compared with the $w_A \cdot w_B$ processors and 3n steps of the previously known VLSI algorithm.
ABSTRACT: First developed for telephone applications, Circuit Switching Networks (CSNs) are being used more and more extensively in computer architecture as well, but a good understanding of their features is required in order to take advantage of them. Several types of functionalities have been defined in the literature to describe the logical behavior of CSNs. In this work the relationships among the main CSN functionalities are investigated and a domain of definition is identified. In this environment, relations such as inclusion, composition and specularity of functionalities are considered, and for each functionality upper and lower bounds on complexity are derived or reported from the literature.
International Journal of Parallel Programming 01/1982; 11(2):123-138. DOI:10.1007/BF00995527 · 0.49 Impact Factor
ABSTRACT: This paper deals with the problem of describing the behaviour of LSI components for a three-valued functional simulation. The proposed functional description uses a set of predefined functional modules, named primitives, for handling data signals, and test blocks for handling control signals. Some primitives with relative algorithms are described and a procedure for test blocks management with three-valued control signals is proposed.