A Low-Power Double-Edge-Triggered Address Pointer Circuit for FIFO Memory Design

Saravanan Ramamoorthy*        Haibo Wang†
Sarma Vrudhula‡

*Southern Illinois University Carbondale
†Southern Illinois University Carbondale, haibo@engr.siu.edu
‡Arizona State University at the Tempe Campus

2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

This paper is posted at OpenSIUC.
http://opensiuc.lib.siu.edu/ece_confs/51

Saravanan Ramamoorthy and Haibo Wang
Dept. of Electrical and Computer Engineering
Southern Illinois University
Carbondale, IL 62901

Sarma Vrudhula
Dept. of Computer Science and Engineering
Arizona State University
Tempe, AZ 85281

Abstract

This paper presents a novel address pointer design for FIFO memory circuits. Advantages of the proposed design include a reduced capacitive load on the pointer clock path, the use of a true single-phase clock, and a double-edge-triggering clock scheme. The circuit has low power consumption, is immune to circuit racing conditions, and is suitable for high-speed operation. Techniques to implement clock gating in pointer circuit design to further reduce power consumption are also discussed. The proposed circuit is implemented in a 65nm CMOS technology and its performance is compared with previous pointer circuits.

1 Introduction

First-in first-out (FIFO) memories have been widely used in modern electronic systems. A high-speed FIFO is usually implemented using a two-port RAM array (one port for read operation and the other for write operation) and two address pointers for tracing the read and write memory accesses [1, 2, 3, 4]. An address pointer functions as a token-passing circuit which passes a logic 1 (the token) along its outputs, which control the word-line drivers or column selection circuits of the RAM array. A straightforward implementation of the address pointer is a cyclic shift register chain. Since the number of flip-flops in the shift register increases with the size of the RAM array, address pointers designed for large RAM arrays normally occupy large silicon areas and have heavy cumulative capacitive load on the clock signal paths, resulting in large power consumption and degraded circuit speed.

Previously, several circuit techniques have been proposed to reduce pointer circuit area and the capacitive load on the clock paths of pointer circuits. D flip-flops (DFFs) without clear inputs are used in the address pointer design of [4], which results in a 17% layout area saving. However, due to the lack of a global clear input, the address pointer must go through a lengthy multi-cycle reset operation for initialization. To reduce the pointer clock load, the pointer circuit presented in [1] uses pass-transistors, instead of complementary transmission gates, in the DFF circuit. While resulting in significant power reduction, this approach potentially suffers from notable speed degradation when a low supply voltage is used. D latches are used in the address pointer circuit presented in [5]. The latches are classified as odd latches and even latches according to their positions in the pointer circuit. Odd and even latches fetch data at different clock phases. Thus, a double-edge-triggering (DET) clock scheme is achieved. Since pass-transistors are also used in the latch circuits of that design, it is not suitable for low-voltage applications either.

In this work, we present a novel address pointer design. At each stage of the proposed pointer circuit, only one transistor is connected to the pointer clock. Thus, the clock load is dramatically reduced in the proposed circuit. Unlike most of the previous pointer circuits, which use both the clock (\(clk\)) and its complementary signal (\(\overline{clk}\)), the proposed circuit needs only a true single-phase clock signal and hence is immune to circuit racing conditions caused by clock skew between \(clk\) and \(\overline{clk}\) [6]. In addition, the proposed design uses a DET clock scheme to accommodate the double data rate technique, which is now widely used in high-throughput system design. Finally, clock gating techniques are presented in the paper to further reduce the power consumption of the address pointer circuit. Experimental results are presented to compare the performance of the proposed circuit with other designs.

The rest of the paper is organized as follows. Section 2 describes the proposed circuit. Clock gating techniques for pointer circuits are discussed in Section 3. Experimental results are presented in Section 4 and the paper is concluded in Section 5.

2 Proposed Design

The proposed address pointer circuit consists of two types of basic cells, referred to as \(n\)-cell and \(p\)-cell, respectively. The circuit structures of the \(n\)-cell and \(p\)-cell are shown in Figure 1. Each type of cell has three inputs: clock (CK), data (D), and clear (CLR). In addition, each cell has an output port Q and a complementary output port QB. For an n-cell, the output Q is set to 1 only when the clock and the data input D are both high. Q is reset to 0 when CLR=0. To ensure proper circuit operation and avoid a large DC current, the pull-up and pull-down networks of the n-cell must never be activated simultaneously. The operation of a p-cell is exactly the reverse of the operation of the n-cell. The output port Q of the p-cell is set to 1 when both the clock and the data input D are low, and Q is reset to 0 when CLR=1. Similarly, the pull-up and pull-down networks of the p-cell are never turned on at the same time. The inverters in the feedback paths of the n-cells and p-cells are weak inverters (implemented by devices with small W/L ratios), which prevent circuit nodes from floating when both the pull-up and pull-down networks of an n- or p-cell are off. Because of the sporadic nature of FIFO write and read operations, it is important to avoid floating circuit nodes (dynamic circuit behavior) in address pointers. Otherwise, leakage current may corrupt the logic value on a floating node and cause circuit malfunction.
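The set, clear, and hold behavior described above can be summarized in a small behavioral model. This is only a logic-level sketch, not the transistor circuit of Figure 1: the function names `n_cell_next` and `p_cell_next` are hypothetical, and treating the two networks as a "set path" and a "clear path" is an assumption made for illustration.

```python
def n_cell_next(q, ck, d, clr):
    """Next value of Q for an n-cell (behavioral sketch; CLR is active low)."""
    set_on = (ck == 1 and d == 1)    # set path: clock and data both high
    clear_on = (clr == 0)            # clear path: CLR low
    # The text requires that both networks never conduct at the same time.
    assert not (set_on and clear_on), "set and clear paths active together"
    if set_on:
        return 1
    if clear_on:
        return 0
    return q                         # weak feedback inverter holds the old value


def p_cell_next(q, ck, d, clr):
    """Next value of Q for a p-cell (mirror image; CLR is active high)."""
    set_on = (ck == 0 and d == 0)    # set path: clock and data both low
    clear_on = (clr == 1)            # clear path: CLR high
    assert not (set_on and clear_on), "set and clear paths active together"
    if set_on:
        return 1
    if clear_on:
        return 0
    return q                         # weak feedback inverter holds the old value


def complement(q):
    """QB, the complementary output of a cell."""
    return 1 - q
```

When neither path is active, both functions return the previous value, which corresponds to the role of the weak feedback inverter in keeping the node from floating.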

Figure 1. Schematic of the proposed n- and p-cells.

The connections of the n- and p-cells, as well as the overall circuit structure including the starting circuit, are shown in Figure 2. The key points of this structure are:

1. All the cells are initialized to 0 before starting operation. (This can be done by a multi-cycle reset operation similar to the one in [4] or adding a global reset input to all the cells.)

2. A starting circuit provides the 1 to be injected into the pointer circuit in the first shifting operation. When the 1 reaches the second cell, the SR latch is reset and from that point onwards, the D input of the first cell is logically connected to the output of the last cell.

3. The data input of a p-cell is connected to the complementary output of the previous n-cell so that the p-cell is set to 1 on the falling edge of the clock when the previous cell output is 1.

4. The data input of an n-cell is connected to the non-inverting output of the previous p-cell so that the n-cell is set to 1 on the rising edge of the clock when the previous cell output is 1.

5. Whenever a cell output is set to 1, its complementary output will turn off the previous pointer output.

6. The CLR (clear) input of the n-cell in position j is connected to the complementary output of the n-cell in position j + 2. If the n-cell in position j receives a 1 on the jth clock transition, then on the next rising clock transition the 1 appears in cell j + 2. Hence the complementary output of cell j + 2 (which is then 0) resets cell j to 0. Similarly, the CLR input of the p-cell in position i is connected to the non-inverting output of the p-cell in position i + 2.
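The six points above can be exercised with a logic-level simulation of the cell chain. The sketch below rests on several assumptions that are not taken from Figure 2: the first cell is an n-cell, the chain has eight cells, the starting circuit is modeled as a flag `start` that is OR-ed with the last cell output at the D input of the first cell, and only the cell outputs Q are tracked. The names `settle` and `run_pointer` are hypothetical.

```python
def settle(q, ck, start):
    """Re-evaluate all cells at the current clock level until nothing changes."""
    n = len(q)
    while True:
        nxt = list(q)
        for k in range(n):
            is_n_cell = (k % 2 == 0)                 # assumption: cell 0 is an n-cell
            active = (ck == 1) if is_n_cell else (ck == 0)
            # q[k - 1] wraps to the last cell for k == 0 (cyclic structure).
            prev_high = q[k - 1] == 1 or (k == 0 and start == 1)
            clear_on = q[(k + 2) % n] == 1           # point 6: cleared by cell k + 2
            set_on = active and prev_high            # points 3 and 4
            assert not (set_on and clear_on)         # the two paths never fight
            if clear_on:
                nxt[k] = 0
            elif set_on:
                nxt[k] = 1
        if nxt == q:
            return q
        q = nxt


def run_pointer(n_cells=8, n_transitions=10):
    """Return the cell outputs after each clock transition (rising edge first)."""
    q = [0] * n_cells                                # point 1: all cells start at 0
    start, ck = 1, 0                                 # point 2: starting circuit holds a 1
    trace = []
    for _ in range(n_transitions):
        ck ^= 1                                      # every edge is a shifting event
        q = settle(q, ck, start)
        if q[1] == 1:                                # the 1 reached the second cell
            start = 0
        trace.append("".join(map(str, q)))
    return trace


if __name__ == "__main__":
    for step, state in enumerate(run_pointer(), start=1):
        print(step, state)
```

Under these assumptions the first three transitions give `10000000`, `11000000`, and `01100000`: the 1 advances by one cell on every clock edge, and each cell stays high until the cell two positions ahead is set. With this clearing rule the chain needs at least six cells, since in a four-cell ring the cell that clears position j would still be high when position j is to be set again.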

Figure 2. The proposed pointer circuit.

The proposed address pointer circuit has a number of advantages. First, each cell in the proposed design contributes only one gate capacitance to the clock net. Also, the proposed design uses fewer transistors than most of the previous designs. This results in a smaller layout area and, consequently, a shorter clock routing path with smaller interconnect parasitic capacitance. These factors dramatically decrease the capacitive load on the pointer clock path. Second, the pointer circuit needs only a true single-phase clock. Thus, it is immune to racing conditions, which makes it particularly suitable for high-speed design. Third, the circuit avoids the use of single-channel pass-transistors and, hence, no threshold voltage loss occurs during signal propagation, making it attractive in designs with reduced power supply voltage.

3 Clock Gating Technique in Pointer Design

Further power reduction for the pointer circuit can be achieved by partitioning the whole circuit into several blocks...
and the clock signal is connected only to the block in which the logic 1 is shifted. This clock gating technique can be easily implemented by using RS-latches and AND gates as shown in Figure 3. The clock is fed into a block only when the output of the corresponding latch is logic 1. The operation of the clock gating circuit is explained by the following example. If the output of the $M$th RS-latch is high, the clock signal is connected to the $M$th block and the logic 1 is being shifted within this block. When the last pointer output in this block is set to 1, the $M$th RS-latch is reset and the clock signal is disconnected from this block. Meanwhile, the $(M + 1)$th RS-latch is set and the clock signal is connected to the $(M + 1)$th block. Therefore, after the next clock transition, the logic 1 is transferred from the $M$th block to the $(M + 1)$th block. As the pointer circuit is a cyclic structure, the set port of the first RS-latch is connected to the last pointer output. Before starting operation, all the RS-latches are reset to 0, except for the first latch, which is set to logic 1. Since both a positive and a negative edge can trigger a shifting operation, the circuit should be designed carefully to prevent additional transitions from being generated at the outputs of the AND gates. Thus, switching the clock signal from one block to another must always be scheduled during the low period of the clock. This implies that the first and last cells of each block should be an n-cell and a p-cell, respectively, resulting in an even number of cells in each block.
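The hand-over between blocks can be checked with a logic-level sketch that adds one RS-latch and one AND gate per block to a simple cell-chain model. The model is an illustration under assumptions that are not taken from Figure 3: two blocks of four cells, each block beginning with an n-cell, the starting circuit represented by a flag `start`, and the latches updated after the cells have settled. The names `settle` and `simulate_gated_pointer` are hypothetical.

```python
def settle(q, gck, block_size, start):
    """Re-evaluate all cells for the current gated-clock levels until stable."""
    n = len(q)
    while True:
        nxt = list(q)
        for k in range(n):
            ck = gck[k // block_size]                # a cell sees only its block clock
            is_n_cell = (k % 2 == 0)                 # blocks start with an n-cell
            active = (ck == 1) if is_n_cell else (ck == 0)
            prev_high = q[k - 1] == 1 or (k == 0 and start == 1)
            if q[(k + 2) % n] == 1:                  # cleared by the cell two ahead
                nxt[k] = 0
            elif active and prev_high:               # token handed to this cell
                nxt[k] = 1
        if nxt == q:
            return q
        q = nxt


def simulate_gated_pointer(n_blocks=2, block_size=4, n_transitions=16):
    """Return (cell states per transition, gated-clock toggle count per block)."""
    q = [0] * (n_blocks * block_size)
    enable = [1] + [0] * (n_blocks - 1)              # only the first RS-latch starts set
    start, ck = 1, 0
    prev_gck = [0] * n_blocks
    toggles = [0] * n_blocks
    trace = []
    for _ in range(n_transitions):
        ck ^= 1
        gck = [ck & en for en in enable]             # AND gate of each block
        toggles = [t + (g != p) for t, g, p in zip(toggles, gck, prev_gck)]
        prev_gck = gck
        q = settle(q, gck, block_size, start)
        if q[1] == 1:
            start = 0
        new_enable = list(enable)
        for m in range(n_blocks):
            if q[(m + 1) * block_size - 1] == 1:     # last output of block m is high
                new_enable[m] = 0                    # reset this block's latch
                new_enable[(m + 1) % n_blocks] = 1   # set the next block's latch
        # The clock may only be switched between blocks while it is low.
        assert new_enable == enable or ck == 0
        enable = new_enable
        trace.append("".join(map(str, q)))
    return trace, toggles


if __name__ == "__main__":
    states, toggles = simulate_gated_pointer()
    print(states)
    print(toggles)
```

In this model the latches change state only while the clock is low, because the last cell of each block is a p-cell that is set on a falling edge. The gated clocks therefore carry no extra transitions, and over 16 transitions of the main clock each of the two block clocks toggles 8 times.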

Figure 3. Clock gating technique in pointer circuit.

4 Experimental Results

To compare the proposed design with other pointer implementations, four 256-bit DET pointer circuits, referred to as Proposed design, Ref. design 1, Ref. design 2, and Ref. design 3, are implemented using a 65nm CMOS technology. The proposed design uses the technique discussed in this paper. Ref. design 1 is based on the technique presented in [5]. Ref. designs 2 and 3 are shift-register-based implementations using the DET DFFs presented in [7] and [8], respectively. Transistor sizes used in all the designs are selected according to the following principle: the equivalent resistance of every pull-up or pull-down path in the circuits is approximately the same as the equivalent resistance of a minimum-sized NMOS device of the given technology. Circuit simulations are performed to verify the function of the proposed circuit and compare its performance with the reference designs. A 1V power supply and a 500MHz clock signal are used in the simulation. Figure 4 shows the waveforms of the clock signal and the first three outputs of the proposed pointer circuit. It clearly shows that the DET pointer function is realized by the proposed circuit structure.

Figure 4. Simulated pointer outputs.

The power consumption, clock-to-output delay, and power-delay product of the four implementations are also compared through circuit simulations. For a more accurate comparison, the parasitic capacitances on the clock routing paths are estimated and included in the circuit simulations. The procedure to estimate the wire load capacitance is briefly discussed as follows. First, according to circuit stick diagrams and the design rules of the selected technology, we estimate the width of each stage of the four pointer circuits. Second, we use the clock routing scheme shown in Figure 5. We partition the 256 cells into 16 groups of 16 cells each. Cells within a group share a single group clock buffer, and a global clock buffer drives the 16 group clock buffers. Third, we assume that the clock interconnects are twice the minimum wire width and located in low-k trenches. To account for congested routing channels, we assume there are metal layers over and beneath the clock routing layer. With the above assumptions, the wire load capacitances are estimated according to the capacitor parameters of the given technology. The estimated capacitor values are listed in Columns 2 and 3 of Table 1. Ref. design 1 uses the fewest transistors and has the smallest area. Thus, it has the smallest wire load on its clock path. In contrast, Ref. design 3 has the largest wire load capacitance due to the use of complicated DET DFFs. The power, delay, and power-delay product are obtained from simulation and listed in the fourth, fifth, and sixth columns of Table 1. The table shows that the proposed circuit has the smallest power consumption and clock-to-output delay, thanks to the reduced overall clock load and the avoidance of single-channel pass-transistors. Compared to Ref. designs 1, 2, and 3, the proposed circuit reduces power consumption by 15.6%, 65.6%, and 86.3%, respectively. The clock-to-output delay is also improved in the proposed design by 6%, 31%, and 44.7%. The power-delay product is improved by 19.6%, 76.4%, and 92.4% compared to the reference designs.

Table 1. Circuit performance comparison with estimated wire load.

<table>
<thead>
<tr>
<th>Circuits</th>
<th>$C_1$ (fF)</th>
<th>$C_2$ (fF)</th>
<th>Power ($\mu$W)</th>
<th>Delay (ps)</th>
<th>PDP ($\mu$W-ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prop. Design</td>
<td>96</td>
<td>6</td>
<td>151</td>
<td>240</td>
<td>36,240</td>
</tr>
<tr>
<td>Ref. Design 1</td>
<td>89.6</td>
<td>5.6</td>
<td>179</td>
<td>256</td>
<td>43,056</td>
</tr>
<tr>
<td>Ref. Design 2</td>
<td>166.4</td>
<td>10.4</td>
<td>439</td>
<td>350</td>
<td>153,650</td>
</tr>
<tr>
<td>Ref. Design 3</td>
<td>244.8</td>
<td>15.3</td>
<td>1102</td>
<td>434</td>
<td>478,268</td>
</tr>
</tbody>
</table>
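The power and delay percentages quoted in the text follow from the Power and Delay columns of Table 1 as (reference − proposed) / reference. The short sketch below reproduces that arithmetic; the dictionary layout and the function name `improvement_percent` are illustrative choices, while the numbers are copied from the table.

```python
# Power (uW) and clock-to-output delay (ps) copied from Table 1.
TABLE_1 = {
    "Prop. Design": {"power_uW": 151, "delay_ps": 240},
    "Ref. Design 1": {"power_uW": 179, "delay_ps": 256},
    "Ref. Design 2": {"power_uW": 439, "delay_ps": 350},
    "Ref. Design 3": {"power_uW": 1102, "delay_ps": 434},
}


def improvement_percent(reference, proposed):
    """Relative reduction of the proposed value with respect to the reference."""
    return 100.0 * (reference - proposed) / reference


def compare_to_proposed(metric):
    """Improvement of the proposed design over each reference design."""
    proposed = TABLE_1["Prop. Design"][metric]
    return {
        name: improvement_percent(row[metric], proposed)
        for name, row in TABLE_1.items()
        if name != "Prop. Design"
    }


if __name__ == "__main__":
    for metric in ("power_uW", "delay_ps"):
        for name, value in compare_to_proposed(metric).items():
            print(f"{metric:9s} {name}: {value:.2f}%")
```

For power this gives about 15.6%, 65.6%, and 86.3%; for delay it gives 6.25%, about 31.4%, and about 44.7%, which the text reports as 6%, 31%, and 44.7%.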

Simulations are also performed to study circuit performance when a reduced power supply voltage is used. The clock-to-output delays obtained at 0.9V and 0.8V power supply voltages are listed in Table 2. The second and fourth columns of the table show the delays of the four designs. The third and fifth columns list the delay improvement obtained by using the proposed circuit. For example, the number 9.4%, in the third column and the row corresponding to Ref. design 1, means that a 9.4% delay improvement is obtained by using the proposed design compared to using Ref. design 1. At a 0.8V power supply, Ref. design 3 fails to operate with a 500MHz clock. This is primarily due to the large parasitic capacitance on its clock path, which severely degrades the clock outputs of the clock buffers. The simulation results demonstrate that the proposed circuit is more suitable for low-voltage applications than the three reference designs.

Table 2. Circuit Delay with reduced supply voltage.

<table>
<thead>
<tr>
<th rowspan="2">Circuits</th>
<th colspan="2">$V_{DD} = 0.9V$</th>
<th colspan="2">$V_{DD} = 0.8V$</th>
</tr>
<tr>
<th>Delay (ps)</th>
<th>Improv.</th>
<th>Delay (ps)</th>
<th>Improv.</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prop. Design</td>
<td>290</td>
<td>-</td>
</tr>
<tr>
<td>Ref. Design 1</td>
<td>320</td>
<td>9.4%</td>
</tr>
<tr>
<td>Ref. Design 2</td>
<td>425</td>
<td>31.8%</td>
</tr>
<tr>
<td>Ref. Design 3</td>
<td>546</td>
<td>46.9%</td>
</tr>
</tbody>
</table>

Finally, circuit simulations are performed to demonstrate the proposed clock gating technique. Figure 6 shows a snapshot of the clock waveforms obtained from simulation. The top waveform is the main clock before the clock gating logic. The second and third waveforms are the clock signals going to the first and second partitioned pointer blocks. The fourth waveform is the last output of the first pointer block and the fifth waveform is the first output of the second pointer block. Clearly, the pointer function is not affected by the clock gating scheme. Simulation results show that 51% power reduction can be achieved by the clock gating technique.

5 Conclusions

A novel double-edge-triggered address pointer is developed for FIFO memory design. The proposed design significantly reduces the cumulative capacitive load on the pointer clock path and hence consumes less power than previous designs. It uses a true single-phase clock and is immune to circuit racing conditions. The proposed design is suitable for low-voltage and high-speed applications.

References