
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 3, MARCH 1995, p. 251

A 4.4 ns CMOS 54 × 54-b Multiplier Using Pass-Transistor Multiplexer

Norio Ohkubo, Makoto Suzuki, Member, IEEE, Toshinobu Shinbo, Toshiaki Yamanaka, Member, IEEE,

Akihiro Shimizu, Katsuro Sasaki, Member, IEEE, and Yoshinobu Nakagome, Member, IEEE

Abstract—A 54 × 54-b multiplier using pass-transistor multiplexers has been fabricated by 0.25-µm CMOS technology.

To enhance the speed performance, a new 4-2 compressor and

a carry lookahead adder (CLA), both featuring pass-transistor

multiplexers, have been developed. The new circuits have a speed

advantage over conventional CMOS circuits because the number

of critical-path gate stages is minimized due to the high logic

functionality of pass-transistor multiplexers. The active size of

the 54 × 54-b multiplier is 3.77 × 3.41 mm. The multiplication

time is 4.4 ns at a 2.5-V power supply.

I. INTRODUCTION

ENHANCING the performance of floating-point operation is indispensable for current high-performance microprocessors. In particular, high-speed multiplication is becoming increasingly important in RISC’s, DSP’s, graphics accelerators, and so on, because of increasing demand for multimedia applications. Recent high-end microprocessors call for an operating frequency of 200 MHz or over. Furthermore, a multiplier will be required for single-clock-cycle operation. However, no CMOS 54 × 54-b multiplier with a delay time less than 5 ns has yet been reported [1], [2].

This paper describes a 54 × 54-b multiplier macro developed for mantissa multiplication of two double-precision numbers, as outlined in the IEEE standard [3]. The target multiplication time is less than 5 ns. To reduce the multiplication time, a new 4-2 compressor and a carry lookahead adder (CLA), both featuring pass-transistor multiplexers, have been developed. The new circuits provide a speed advantage over conventional CMOS circuits because the number of critical-path gate stages is minimized due to the high logic functionality of pass-transistor multiplexers. In addition, power reduction is important in attaining such high performance. For this purpose, we employed 0.25-µm CMOS technology and reduced the supply voltage to 2.5 V.

The architecture of the 54 × 54-b multiplier is described in Section II. In Section III, the circuit design based on Booth’s algorithm, as well as the design of the pass-transistor multiplexer, 4-2 compressor, and carry lookahead adder are discussed. Section IV describes the fabrication of a test chip. Some experimental results are shown in Section V, and the conclusions are summarized in Section VI.

Manuscript received August 1, 1994; revised October 21, 1994.

N. Ohkubo, M. Suzuki, T. Yamanaka, and Y. Nakagome are with the Central Research Laboratory, Hitachi, Ltd., Kokubunji, Tokyo 185, Japan.

T. Shinbo and A. Shimizu are with the Hitachi VLSI Engineering Corporation, Kodaira, Tokyo 187, Japan.

K. Sasaki is with the R&D Division, Hitachi America, Ltd., Brisbane, CA 94005 USA.

IEEE Log Number 9408745.

Fig. 1. Block diagram of the 54 × 54-b multiplier using pass-transistor multiplexers.

II. ARCHITECTURE

The block diagram of the 54 × 54-b multiplier is shown in Fig. 1. It employs Booth’s algorithm [4], Wallace’s tree [5], and a conditional carry-selection (CCS) adder [6]. The number of partial products is halved by Booth’s algorithm. The partial products are summed by Wallace’s tree, without carry propagation. The summed results are then added by the CCS adder with high-speed carry propagation.

Reducing the delay of Wallace’s tree is important in reducing multiplication time, so we used a 4-2 compressor, which has five inputs and three outputs. The carry-out (Co) is connected to the next higher bit 4-2 compressor’s carry-in (Ci), as shown in Fig. 1. Without propagating the carry to a higher bit, the 4-2 compressor can add four partial products (I1–I4) because the carry-out (Co) does not depend on the carry-in (Ci). By using this 4-2 compressor, only four addition stages are needed for Wallace’s tree, as shown in Fig. 1. It is known that the 4-2 compressor has a speed advantage over full-adder-based designs, because of the reduced number of addition stages [1], [2]. For further improvement, we have developed a new 4-2 compressor that reduces the critical-path gate stages by exploiting the high logic functionality of the pass-transistor multiplexer.
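The flow described in this section (generate partial products, sum them without carry propagation, then perform one carry-propagating addition) can be illustrated with a word-level behavioral sketch. This is not the circuit of Fig. 1: it uses plain AND-array partial products rather than Booth’s algorithm, models each 4-2 compression on whole Python integers, and all function names are hypothetical.

```python
def majority(a, b, c):
    """Bitwise majority: the carry bits of a 3-input carry-save adder."""
    return (a & b) | (b & c) | (a & c)

def compress_4_2(a, b, c, d):
    """Reduce four rows to a (sum, carry) pair with a + b + c + d == sum + carry.

    Each output bit depends only on nearby input bits, so no carry
    ripples along the full word width.
    """
    s1 = a ^ b ^ c
    c1 = majority(a, b, c) << 1   # plays the role of carry-out feeding the next bit's carry-in
    s2 = s1 ^ d ^ c1
    c2 = majority(s1, d, c1) << 1
    return s2, c2

def reduce_rows(rows):
    """Apply 4-2 compression stage by stage until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows), 4):
            group = rows[i:i + 4]
            if len(group) <= 2:
                nxt.extend(group)                 # nothing to compress in this group
            else:
                group += [0] * (4 - len(group))   # pad a 3-row group with a zero row
                nxt.extend(compress_4_2(*group))
        rows = nxt
    return rows

def multiply(x, y, width=54):
    """Unsigned multiply: partial products, tree reduction, one final addition."""
    partial_products = [(x << j) if (y >> j) & 1 else 0 for j in range(width)]
    return sum(reduce_rows(partial_products))     # final carry-propagate add

print(multiply(3, 5) == 15)
```

Only the last line of `multiply` propagates carries across the word, which corresponds to the role the final adder plays in the architecture above.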

0018–9200/95$04.00 © 1995 IEEE


Fig. 2. Booth’s algorithm. (a) Booth’s encoder. (b) Partial-product generator.

Furthermore, we have developed a high-speed 108-b CLA, which is another important component of the high-speed

multiplier. We have already reported a 32-b ALU with a 4-b

lookahead carry scheme called conditional carry-selection [6].

To apply this scheme to the final adder of the multiplier, the

4-b CLA was modified into an 8-b CLA.

III. CIRCUIT DESIGN

A. Booth’s Algorithm

The purpose of using Booth’s algorithm in this design is

not to reduce delay time, but to reduce the chip area. If the

full-adder-based design is applied in Wallace’s tree, the delay

time can be shortened by using Booth’s algorithm because

it reduces the number of addition stages. However, because

one addition stage of the 4-2 compressor halves the number

of partial products, the extra delay time without Booth’s

algorithm is only that of one 4-2 compressor. Since the delay

time in generating the partial products without Booth’s encoder

is shorter than that with Booth’s encoder, the multiplication times with and without Booth’s algorithm are almost the same.
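The stage-count argument can be checked with a few lines of arithmetic. The sketch below assumes the idealized model stated in the text, in which each 4-2 compressor stage halves the number of partial-product rows (rounding up); the function name is hypothetical.

```python
def compressor_stages(n_rows):
    """Count 4-2 compressor stages needed to reduce n_rows rows to two."""
    stages = 0
    while n_rows > 2:
        n_rows = (n_rows + 1) // 2   # one stage halves the row count, rounding up
        stages += 1
    return stages

print(compressor_stages(27))  # with Booth's algorithm: 27 -> 14 -> 7 -> 4 -> 2, prints 4
print(compressor_stages(54))  # without Booth's algorithm: 54 -> 27 -> ..., prints 5
```

Under this model, dropping Booth’s algorithm costs exactly one additional 4-2 compressor stage, which matches the statement above.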

Booth’s encoder is shown in Fig. 2(a). Three multiplicand bits are encoded by this circuit. Encoding the data halves the number of partial products. The simulated propagation delay time is 0.50 ns. The partial-product generator is shown in Fig. 2(b). One of two multiplier bits is selected, depending on which of the two encoded select signals is high, and is inverted by the encoded signal NEG. The simulated propagation delay time is 0.56 ns.
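For readers unfamiliar with the encoding, the following is a minimal sketch of textbook radix-4 (modified) Booth recoding, not a transcription of Fig. 2. The signal names `sel_1x` and `sel_2x` are hypothetical; only NEG is named in the text. The sketch assumes an even operand width whose top bit is zero, as is the case for a 53-b mantissa held in 54 b.

```python
def booth_radix4_digits(y, width=54):
    """Recode y into width // 2 digits, each in {-2, -1, 0, 1, 2}.

    Assumes width is even and bit (width - 1) of y is 0.
    """
    digits = []
    prev = 0                                  # implicit bit below the LSB
    for j in range(0, width, 2):
        b_mid = (y >> j) & 1
        b_hi = (y >> (j + 1)) & 1
        sel_1x = b_mid ^ prev                 # select the operand once
        sel_2x = int((b_hi, b_mid, prev) in ((0, 1, 1), (1, 0, 0)))  # select it doubled
        neg = b_hi                            # negate the selected value
        magnitude = sel_1x + 2 * sel_2x
        digits.append(-magnitude if neg else magnitude)
        prev = b_hi
    return digits

def booth_multiply(x, y, width=54):
    """Sum the signed partial products d_j * x * 4**j."""
    digits = booth_radix4_digits(y, width)
    return sum((d * x) << (2 * j) for j, d in enumerate(digits))

print(len(booth_radix4_digits(5)))  # 27 partial products for a 54-b operand
```

Each digit stands for one partial product, so a 54-b operand yields 27 partial products instead of 54.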

B. Pass-Transistor Multiplexer

The pass-transistor multiplexer used in the 4-2 compressor and the 108-b CLA is shown in Fig. 3. When the control signal is low, one data input is selected, and when the control signal is high, the other data input is selected. The output is used as the control signal input for the next-stage multiplexer. Thus, the multiplexer provides both positive and negative outputs, which reduces the propagation delay by eliminating an inverter.

Fig. 3. Pass-transistor multiplexer circuit.

Several pass-transistor logic circuits have been proposed to improve the performance of CMOS circuits. The NMOS pass-transistor logic circuit [7] is one example. It has been shown to result in high speed due to its low input capacitance and high logic functionality. However, particularly in reduced supply voltage designs, it is important to take into account the problems of noise margins and speed degradation. These are caused by mismatches between the input signal levels and the logic threshold voltage of the CMOS gates, which fluctuates with process variations. To avoid these problems, the multiplexer in this design consists of both NMOS and PMOS pass transistors.

The delay time of the pass-transistor multiplexer is shorter than that of a CMOS gate, because in the pass-transistor-based design both the NMOS and PMOS transistors are turned on. The CMOS tristate inverter and the pass-transistor tristate inverter, which are used in CMOS and pass-transistor multiplexers, respectively, are shown in Fig. 4(a) and (b). The number of transistors in both circuits is the same, and both have equal input capacitance. A simulated comparison is shown in Fig. 4(c), showing the dependence of the delay time from “in” to “out” on the output load capacitance. The low driving-source impedance attained by using the pass transistor makes the delay time of the pass-transistor circuit shorter than that of a CMOS gate.

Fig. 4. Comparison of CMOS and pass-transistor logic circuits. (a) CMOS tristate inverter. (b) Pass-transistor tristate inverter. (c) Comparison of delay time.

Fig. 5. 4-2 compressor circuits using pass-transistor multiplexers. (a) Full-adder-based construction. (b) Proposed construction.
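The “high logic functionality” of a two-input multiplexer with complementary outputs can be seen in a small behavioral model. This sketch covers logic values only and says nothing about delay or signal levels. The input names `d0` and `d1`, and the choice that `d0` is the input selected by a low control signal, are assumptions made for illustration.

```python
from itertools import product

def mux(sel, d0, d1):
    """2:1 multiplexer returning (out, out_bar); d0 is selected when sel is 0."""
    out = d1 if sel else d0
    return out, out ^ 1

def xor_gate(a, b):
    return mux(a, b, b ^ 1)[0]   # one signal and its inverse as the data inputs

def and_gate(a, b):
    return mux(a, 0, b)[0]       # a constant and a signal as the data inputs

def or_gate(a, b):
    return mux(a, b, 1)[0]

for a, b in product((0, 1), repeat=2):
    print(a, b, xor_gate(a, b), and_gate(a, b), or_gate(a, b))
```

Because both output polarities are available, a following stage can take this output as its control signal, or use either polarity as a data input, without an extra inverter.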

C. 4-2 Compressor Circuit

The 4-2 compressor circuits using pass-transistor multiplexers are shown in Fig. 5. The signal lines in the figure represent positive and negative signals. The inputs of the multiplexers are either two different signals, or one signal and its logical inverse. All outputs (the sum and the two carries) have buffers to enhance driving ability. The 4-2 compressor circuits add four partial products (I1–I4) and generate a sum signal and two carry signals, one of which is the carry-out (Co).

Since the pass-transistor multiplexer circuit shown in Fig. 3

has high logic functionality, a full-adder circuit is constructed

from three pass-transistor multiplexers. The 4-2 compressor is

constructed from 2 full adders, such that there are 4 critical-

path gate stages, as shown in Fig. 5(a). This circuit is faster

than conventional CMOS circuits due to the use of pass-

transistor multiplexers.
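One way to build a full adder from three two-input multiplexers is sketched below. It is a logically valid arrangement, but whether it matches the exact wiring of Fig. 5(a) is an assumption. Function names are hypothetical.

```python
from itertools import product

def mux(sel, d0, d1):
    """2:1 multiplexer: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0

def full_adder(a, b, c):
    """Full adder from three multiplexers; returns (sum, carry)."""
    x = mux(a, b, b ^ 1)      # multiplexer 1: x = a XOR b
    s = mux(x, c, c ^ 1)      # multiplexer 2: sum = x XOR c
    carry = mux(x, a, c)      # multiplexer 3: a when a == b, otherwise c
    return s, carry

for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    print(a, b, c, "->", s, carry)
```

In this arrangement the sum passes through two multiplexers in series, so cascading two such full adders gives the four critical-path stages mentioned above.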

For further speed improvement, we developed a new 4-2 compressor. Though the number of multiplexers is the same, the number of critical-path gate stages in this circuit is reduced to three by exploiting parallelism, as shown in Fig. 5(b). In this new configuration, the carry-out (Co) does not depend on the carry-in (Ci), so the advantage of the 4-2 compressor is maintained. The simulated delay comparison for these 4-2 compressor circuits is shown in Fig. 6. The proposed circuit reduces the propagation delay time by 18% from that of a full-adder-based circuit.

Fig. 6. Simulated comparison of 4-2 compressor circuits.

Fig. 7. Construction of Wallace’s tree.
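The two constructions can be compared at the logic level with the sketch below. The second function is a standard six-multiplexer arrangement of the 4-2 compressor whose sum path is three multiplexers deep. It is offered as an illustration consistent with the counts given in the text, not as a confirmed transcription of Fig. 5(b). Names are hypothetical.

```python
from itertools import product

def mux(sel, d0, d1):
    """2:1 multiplexer: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0

def compressor_full_adder_based(i1, i2, i3, i4, ci):
    """Two cascaded full adders: six multiplexers, four stages to the sum."""
    x1 = mux(i1, i2, i2 ^ 1)   # stage 1
    s1 = mux(x1, i3, i3 ^ 1)   # stage 2: sum of the first full adder
    co = mux(x1, i1, i3)       # carry-out, independent of ci
    x2 = mux(s1, i4, i4 ^ 1)   # stage 3
    s = mux(x2, ci, ci ^ 1)    # stage 4
    c = mux(x2, s1, ci)
    return s, c, co

def compressor_parallel(i1, i2, i3, i4, ci):
    """Same six multiplexers rearranged: three stages to the sum."""
    x1 = mux(i1, i2, i2 ^ 1)   # stage 1
    x2 = mux(i3, i4, i4 ^ 1)   # stage 1, evaluated in parallel with x1
    x = mux(x1, x2, x2 ^ 1)    # stage 2
    s = mux(x, ci, ci ^ 1)     # stage 3
    c = mux(x, i4, ci)
    co = mux(x1, i1, i3)       # carry-out, independent of ci
    return s, c, co

# Exhaustive check: both versions agree and preserve the arithmetic sum.
for bits in product((0, 1), repeat=5):
    s, c, co = compressor_parallel(*bits)
    assert (s, c, co) == compressor_full_adder_based(*bits)
    assert sum(bits) == s + 2 * (c + co)
```

The rearrangement only changes which signals are combined first, so the multiplexer count stays at six while the longest path to the sum drops from four stages to three.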

Wallace’s tree, shown in Fig. 7, is constructed from partial-product generators, 4-2 compressors, full adders, and half adders. Using the 4-2 compressor simplifies the construction of Wallace’s tree, which consists of only five kinds of blocks and four kinds of wire shifters. The five kinds of blocks are an 8 × 4 partial-product generator, eight 4-2 compressors, six half adders, 15 half adders, and 21 half adders with a full adder. The four kinds of wire shifters are an 8-b right shifter, an 8-b left shifter, a 16-b right shifter, and a 16-b left shifter. The sum and carry signals of the 4-2 compressors are shifted by the wire shifters. In addition, the carry signals are shifted to the left by one more bit.

D. Conditional Carry-Selection Circuit

A number of fast-adder architectures have been proposed [8]–[14], some of which use pass-transistor logic circuits for carry propagation [13], [14]. These circuits gain their speed advantage over CMOS circuits due to their high logic functionality. However, a problem with this architecture is the serial connection of the pass transistors in the carry propagation path.

We have already reported a new lookahead carry scheme called conditional carry-selection (CCS) [6]. The 4-b CLA is constructed so as to have three critical-path gate stages, as shown in Fig. 8(a). In the CCS architecture, conditional carry signals for each bit, $C_i(0)$ (assuming an incoming group-carry of “0”) or $C_i(1)$ (assuming an incoming group-carry of “1”), are selected by the multiplexers depending on the conditional carry signals of the previous bit, $C_{i-1}(0)$ or $C_{i-1}(1)$, as expressed by

$$C_i(0) = G_i + P_i \cdot C_{i-1}(0) \tag{1}$$

$$C_i(1) = G_i + P_i \cdot C_{i-1}(1) \tag{2}$$

where $G_i$ is carry generation and $P_i$ is carry propagation. The CCS adder is faster than the conventional pass-transistor-based design because the pass-transistors are not serially connected in the carry propagation path. This is because the multiplexer’s output is directly connected to the next multiplexer’s control signal input.

Fig. 8. Carry lookahead circuits. (a) 4-b conditional carry-select (CCS) circuit. (b) 8-b modified conditional carry-select circuit.

Fig. 9. Construction of 108-b adder.
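The selection idea can be modeled behaviorally as two carry chains per group, one for each assumed incoming group carry, in which every carry is chosen by a multiplexer controlled by the previous conditional carry. The sketch below assumes generate G = A AND B and propagate P = A OR B, so that a carry equals G when the previous carry is 0 and P when it is 1. These definitions and all names are illustrative assumptions, not taken from Fig. 8.

```python
def conditional_carries(a_bits, b_bits):
    """Return (chain0, chain1): the carry out of each bit (LSB first),
    assuming an incoming group carry of 0 and of 1 respectively."""
    c0, c1 = 0, 1
    chain0, chain1 = [], []
    for a, b in zip(a_bits, b_bits):
        g = a & b                 # carry generation
        p = a | b                 # carry propagation (OR form, assumed)
        c0 = p if c0 else g       # multiplexer controlled by the previous conditional carry
        c1 = p if c1 else g
        chain0.append(c0)
        chain1.append(c1)
    return chain0, chain1

def select_carries(chain0, chain1, group_carry_in):
    """Pick the chain that matches the true incoming group carry."""
    return chain1 if group_carry_in else chain0

# 4-b example: a = 0b0110, b = 0b0011, bits listed LSB first.
chain0, chain1 = conditional_carries([0, 1, 1, 0], [1, 1, 0, 0])
print(select_carries(chain0, chain1, 0))
print(select_carries(chain0, chain1, 1))
```

Both chains are ready before the true group carry arrives, so the late signal only has to drive the final selection.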

To apply the CCS scheme to the final 108-b adder, the 4-b CLA is modified into an 8-b CLA, as shown in Fig. 8(b). In the modified CCS architecture, conditional carry signals for each bit, $C_i(0)$ or $C_i(1)$, are selected by the multiplexers depending on the conditional carry signals of the two previous bits, $C_{i-2}(0)$ or $C_{i-2}(1)$, as expressed by

$$C_i(0) = G_i + P_i \cdot G_{i-1} \quad \text{if } C_{i-2}(0) = 0 \tag{3}$$

$$C_i(0) = G_i + P_i \cdot (G_{i-1} + P_{i-1}) \quad \text{if } C_{i-2}(0) = 1 \tag{4}$$

$$C_i(1) = G_i + P_i \cdot G_{i-1} \quad \text{if } C_{i-2}(1) = 0 \tag{5}$$

$$C_i(1) = G_i + P_i \cdot (G_{i-1} + P_{i-1}) \quad \text{if } C_{i-2}(1) = 1 \tag{6}$$

In this configuration, there are several series-connected pass-transistors, as shown in Fig. 8(b). However, the critical path of addition in the multiplier has no series-connected pass-transistors. This critical path is created because the signal in the critical path of the multiplier arrives later than the other bits. The new 8-b CLA has only four critical-path gate stages because it makes use of parallelism. It reduces the 108-b addition time from 1.82 ns to 1.52 ns.

Fig. 9 shows the construction of the 108-b adder based on the carry-select architecture [10]. DPL gates [6] are used for the half-adder (HA) circuits. The conditional-sum selection (CSS) circuit consists of a multiplexer which selects the conditional sums according to the incoming block carry signal. The CCS architecture is applied not only to the 8-b carry lookahead circuit CLA1, but also to the block carry lookahead circuit CLA2, as shown in Fig. 10. The block carry signals are generated by CLA2 assuming the carry from the lower block to be 0 or 1, and are then selected by the multiplexer according to the incoming true carry. To shorten the carry propagation delay, the multiplexer of the critical carry propagation path is separated from the other multiplexers, as shown in Fig. 10. This architecture enhances parallelism and results in fast operation, because the carry signals of the upper 32 b are calculated in parallel with those of the lower 32 b, and the carry signals of the upper 32 b are generated after the delay time of a single multiplexer.

Fig. 10. Block carry lookahead circuit CLA2.

TABLE I
PROCESS TECHNOLOGY

TABLE II
CHARACTERISTICS OF THE 54 × 54-BIT MULTIPLIER
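The conditional-sum selection can be illustrated with a behavioral carry-select adder: each block precomputes its result for an incoming carry of 0 and of 1, and the true carry merely selects between them. The 8-b block width and the sequential loop are assumptions chosen for clarity. The real circuit evaluates all blocks in parallel and resolves the block carries with CLA2, which this sketch does not model.

```python
def carry_select_add(a, b, width=108, block=8):
    """Add two unsigned width-bit integers block by block.

    Returns (sum modulo 2**width, carry out of the top bit).
    """
    result = 0
    carry = 0
    for lo in range(0, width, block):
        n = min(block, width - lo)        # the last block may be narrower
        mask = (1 << n) - 1
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        sum0 = x + y                      # conditional result assuming carry-in 0
        sum1 = x + y + 1                  # conditional result assuming carry-in 1
        chosen = sum1 if carry else sum0  # multiplexer driven by the incoming block carry
        result |= (chosen & mask) << lo
        carry = chosen >> n
    return result, carry

total, carry_out = carry_select_add(2**107 + 5, 2**107 + 9)
print(total, carry_out)
```

Because both conditional results exist before the carry arrives, the delay contributed per block on the carry path is that of the selecting multiplexer, which is the property the text relies on.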

Fig. 11. Micrograph of the 54 × 54-b multiplier.

Fig. 12. Multiplication time of the 54 × 54-b multiplier.

IV. FABRICATION

A 54 × 54-b multiplier test chip was fabricated by triple-metal 0.25-µm CMOS technology. The major process parameters are summarized in Table I. The first metal is tungsten, and the second and third metals are aluminum. The chip operates on a supply voltage of 2.5 V. Fig. 11 shows a micrograph of the chip. The top and right-hand side are the input sections; Booth’s encoder is also positioned on the right-hand side. The final adder and the output are at the bottom. A total of 100 200 transistors are integrated in an active area of 3.77 × 3.41 mm. The area is mostly occupied by the 4-2 compressors, the partial-product generators, and wire shifters.

V. EVALUATION

The simulated multiplication time of the 54 × 54-b multiplier is shown in Fig. 12. It is only 4.4 ns with a 2.5-V power supply and typical process at room temperature. It