IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Published by Institute of Electrical and Electronics Engineers
Print ISSN: 1063-8210
The proposed temperature sensor is based on CMOS ring oscillators and a frequency-to-digital converter capable of simple and efficient temperature conversion to digital value. The proposed temperature sensor consumes 400 uW at a conversion rate of 366 kS/s and performs the fastest temperature-to-digital conversion among those introduced in previous work. The whole block occupies 0.0066 mm<sup>2</sup> (0.0013 mm<sup>2</sup> for temperature sensor). Four multiphase clocks were utilized to enhance the resolution of the sensor 8 times better. After one point calibration, the chip-tochip measurement spread was +2.748degC ~ -2.899degC over the temperature range of -40degC to 110degC.
This paper presents a very low power/area design for the advanced encryption standard (AES) based on an 8-bit data path. The average measured core power on a 0.13-μm CMOS using a 100-kHz clock and a core voltage of 0.75 V is 692 nW. The core area is 21 000 μm<sup>2</sup> and the latency is 356 cycles. This design further challenges the low-resource end of the design space and is the first reported submicrowatt design for the AES; it has significant power-latency-area performance improvements over the previous state-of-the-art application-specific IC (ASIC) implementations.
A between-pair skew compensator for parallel data communications is presented. It can detect time skew between two independent data sequences using continuous-time correlations and then automatically align the two using a voltage controlled wide-bandwidth data delay line. A 5Gb/s sub-bit between-pair skew compensator in 0.13μm CMOS occupies 0.03mm<sup>2</sup> active die area and dissipates 22.5mW.
A digital delay-locked loop (DLL) suitable for generation of multiphase clocks in applications such as time-interleaved and pipelined analog-to-digital converters (ADCs) locks in a very wide (40×) frequency range. The DLL provides 12 uniformly delayed phases, free of false harmonic locking. A two-stage digital split-control loop is implemented: a fast-locking coarse acquisition is achieved in four cycles using binary search; a fine linear loop achieves low jitter (9 ps rms @ 600 MHz) and tracks process, voltage, and temperature (PVT) variations. The false harmonic locking detector, the frequency range and the jitter performance among other design considerations are analyzed in detail. The DLL consumes 20 mW and occupies a 470 μm × 800 μm in 0.13 μm CMOS.
This paper investigates the efficient design of the PHY layer architecture for wireless body area networks (WBAN), which targets on ultra-low power consumption with reliable quality of service (QoS). A low cost baseband transceiver specification and a data processing flow are proposed with a comparatively low-complexity control state machine. A multifunctional digital timing synchronization scheme is also proposed, which can achieve packet synchronization and data recovery. The proposed baseband transceiver is fabricated in an 0.18- μm CMOS process. With a 1.1 V supply and 4 MHz system clock, this baseband chip only consumes 34 μW in transmitter (TX) mode and 39.6 μW in receiver (RX) mode. To demonstrate and to optimize the reliability of the proposed design, the dedicated bit-error-rate and packet-error-rate analysis is reported.
A 10-Gb/s 90-dBOmega optical receiver analog front-end (AFE), including a transimpedance amplifier (TIA), an automatic gain control circuit, and a postamplifier (PA), is fabricated using a 0.18-mum CMOS technology. In contrast with a conventional limiting amplifier architecture, the PA is consisted of a voltage amplifier followed by a slicer. By means of the TIA and the PA codesign, the receiver front-end provides a -3-dB bandwidth of 7.86 GHz and a gain bandwidth product (GBW) of 248.5 THz-Omega. The tiny photocurrent received by the AFE is amplified to a differential voltage swing of 900 mV<sub>pp</sub> when driving 50-Omega output loads. The measured input sensitivity of the optical receiver is -13 dBm at a bit-error rate of 10<sup>-12</sup> with a 2<sup>31</sup>-1 pseudorandom test pattern. The optical receiver AFE dissipates a total power of 199 mW from a 1.8-V supply, among which 35 mW is consumed by the output buffer. The chip size is 1300 mumtimes1796 mum
The general objective of our work is to investigate the area and power-delay performances of low-voltage full adder cells in different CMOS logic styles for the predominating tree structured arithmetic circuits. A new hybrid style full adder circuit is also presented. The sum and carry generation circuits of the proposed full adder are designed with hybrid logic styles. To operate at ultra-low supply voltage, the pass logic circuit that cogenerates the intermediate XOR and XNOR outputs has been improved to overcome the switching delay problem. As full adders are frequently employed in a tree structured configuration for high-performance arithmetic circuits, a cascaded simulation structure is introduced to evaluate the full adders in a realistic application environment. A systematic and elegant procedure to scale the transistor for minimal power-delay product is proposed. The circuits being studied are optimized for energy efficiency at 0.18-/spl mu/m CMOS process technology. With the proposed simulation environment, it is shown that some survival cells in stand alone operation at low voltage may fail when cascaded in a larger circuit, either due to the lack of drivability or unsatisfactory speed of operation. The proposed hybrid full adder exhibits not only the full swing logic and balanced outputs but also strong output drivability. The increase in the transistor count of its complementary CMOS output stage is compensated by its area efficient layout. Therefore, it remains one of the best contenders for designing large tree structured arithmetic circuits with reduced energy consumption while keeping the increase in area to a minimum.
This paper presents a synchronous 50% duty-cycle clock generator (DCCG). The proposed DCCG circuit comprises of a clock generator and a phase error integrator. The clock generator is edge-triggered by an input signal to produce an output whose pulse width is determined by a delay line. The delay line is controlled by the phase error integrator which detects the phase difference between the input and output signals. The proposed DCCG is designed such that when the phase error is zeroed, i.e., the input and output signals are synchronized, the delay is properly adjusted and the output signal duty cycle converges to 50%. The proposed DCCG is implemented in a 0.35-μm CMOS process. The circuit can operate from 70 to 500 MHz, and accommodates a wide range of input duty cycle ranging from 5% to 95%. The duty-cycle error of the output signal is less than 1.5%. Operated from a 3.3-V supply voltage, this circuit dissipates 7 mA at 500 MHz.
This paper proposes a 0.5 V/100 MHz/sub-5 mW-operated 1-Mbit SRAM cell architecture which uses a boosted and offset-grounded data storage (BOGS) scheme. The key target of BOGS is to minimize the charge amount supplied from the embedded charge pump circuits, which are required to boost the effective gate to source voltage (V/sub 0/=V/sub GS/-V/sub T/) up to 0.8 V necessary to achieve 100 MHz operation even at 0.5 V single power supply. Thus, the key low-power strategy of BOGS is "putting the right (higher efficiency) boosted power-supply from charge pump circuit into the right position (less power consumed transistor) in a SRAM cell." This paper is focused on why BOGS can realize a greater savings of the charge amount supplied from the boosted power-line and can reduce the power dissipation to /spl les/1/30.4 and /spl les/1/3.9 compared to the previously reported negative source-line drive (NSD) scheme and negative word-line drive (NWD) scheme, respectively, while achieving a 0.5 V/100 MHz operation.
An all-digital on-chip clock skew measurement system via subsampling is presented. The clock nodes are subsampled with a near-frequency asynchronous sampling clock to result in beat signals which are themselves skewed in the same proportion but on a larger time scale. The beat signals are then suitably masked to extract only the skews of the rising edges of the clock signals. We propose a histogram of the arithmetic difference of the beat signals which decouples the relationship of clock jitter to the minimum measurable skew, and allows skews arbitrarily close to zero to be measured with a precision limited largely by measurement time, unlike the conventional XOR based histogram approach. We also analytically show that the proposed approach leads to an unbiased estimate of skew. The measured results from a 65 nm delay measurement front-end indicate that for an input skew range of ±1 fan-out-of-4 (FO4) delay, ±3σ resolution of 0.84 ps can be obtained with an integral error of 0.65 ps. We also experimentally demonstrate that a frequency modulation on a sampling clock maintains precision, indicating the robustness of the technique to jitter. We also show how FM modulation helps in restoring precision in case of rationally related clocks.
A CMOS local oscillator using a programmable delayed-lock loop based frequency multiplier is present in this paper. The maximum measured output frequency is 1.2 GHz. The frequency of the output clock is 8/spl times/ to 10/spl times/ of an input reference clock between 100 to 150 MHz at simulation. No LC-tank is used in the proposed design such that the power dissipation as well as the active area is drastically reduced. The design is carried out by TSMC 1P5M 0.25 /spl mu/m CMOS process at 2.5 V power supply. The average lock time is optimally shortened by initializing the start-up voltage of the voltage-controlled delay tap line at the midway of the working range. Meanwhile, the power dissipation is 52.5 mW at 1.2 GHz output.
In this paper, we present the design of a 32-b arithmetic and log unit (ALU) that allows low-power operation while supporting a design-for-test (DFT) scheme for delay-fault testability. The low-power techniques allow for 18% reduction in ALU total energy for 180-nm bulk CMOS technology with minimal performance degradation. In addition, there is a 22% reduction in standby mode leakage power and 23% lower peak current demand. In the test mode, we employ a built-in DFT scheme that can detect delay faults while reducing the test-mode automatic test equipment clock frequency.
This paper presents a design of a wide-range transceiver without an external reference clock. The self-biased and multi-band PLL with self-initialization technique is used for the wide-operating range of 140 Mb/s to 1.96 Gb/s and fast frequency acquisition time of 7.2 μs. A linear phase detector which has no dead-zone problem is proposed for a phase adjustment with a low-jitter performance. The RMS jitter of the recovered clock is 11.4 ps at 70 MHz operation. The overall transceiver consumes 388 mW at 2.5 V supply and occupies 3.41 mm<sup>2</sup> in a 0.25-μm 1P5M CMOS technology.
H.264/AVC intra-frame encoding contains several computation-intensive coding tools that form a long data dependency loop that is difficult to speed up. In this paper, we present a low-power and high-performance H.264/AVC intra-frame encoder. We propose several novel approaches to alleviate the performance bottleneck caused by the long data dependency loop among 4 × 4 luma blocks, integrate an efficient CABAC entropy encoder, and apply a clock-gating technique to reduce power consumption. Synthesized into a TSMC 0.13 μm CMOS cell library, our design requires 265.3 K gates at 114 MHz and consumes 23.56 mW to encode 1080pHD (1920 × 1088) video sequences at 30 frames per second (fps). It also delivers the same video quality as the H.264/AVC reference software. Compared with all state-of-the-art designs, our design has a lower working frequency and achieves both better bit-rate saving and lower power consumption.
In contrast to the prior arts, this paper proposes a 14-band CMOS I/Q frequency synthesizer based on a single-PLL architecture. With proper frequency planning, only divide-by-2 dividers are needed in the feedback path of the PLL. Thus, more precise in- phase and quadrature phase sub-harmonics can be derived from the divider chain for SSB frequency mixing. On the other hand, the number of mixer stages in cascade is reduced to 2 for full- band carrier generation. Using a sub-harmonic I/Q calibration, the image spurs are suppressed below -45dBc and more than 33dB SFDR is achieved for the full band generation.
An IEEE 1149.5 module test and maintenance (MTM) bus slave module interface core is presented, which is used for direct access from the system bus to the IEEE 1149.1 chip-level or on-chip buses to facilitate hierarchical system test and diagnosis. The hierarchical test methodology also is presented, which is applicable to the system-on-chip environment, All the standard 1149.1 instructions, such as SAMPLE/PRELOAD, EXTEST, BYPASS, and even RUNBIST, can be performed within three 1149.5 read/write-data message cycles. The messages are transmitted between the MTM-bus master module (Ill-module) and the slave module (S-module). We adopt the full test access port control method to activate the 1149.1 boundary-scan paths via the 1149.5 MTM-bus. Our S-module interface circuit implements 16 CORE commands and one read/write-data command. It has been prototyped using a field-programmable gate array chip and implemented by a full-custom chip. Hierarchical test of multiple 1149.1 compatible boards has been experimentally verified.
Capacitive crosstalk between adjacent signal wires has significant effect on performance and delay uncertainty of point-to-point on-chip buses in deep submicrometer (DSM) VLSI technologies. We propose a hybrid polarity repeater insertion technique that combines inverting and non-inverting repeater insertion to achieve constant average effective coupling capacitance per wire transition for all possible switching patterns. Theoretical analysis shows the superiority of the proposed method in terms of performance and delay uncertainty compared to conventional and staggered repeater insertion methods. Simulations at the 90-nm node on semi-global METAL5 layer show around 25% reduction in worst case delay and around 86% delay uncertainty minimization compared to standard bus with optimal repeater configuration. The reduction in worst case capacitive coupling reduces peak energy which is a critical factor for thermal regulation and packaging. Isodelay comparisons with standard bus show that the proposed technique achieves considerable reduction in total buffers area, which in turn reduces average energy and peak current. Comparisons with staggered repeater which is one of the simplest and most effective crosstalk reduction techniques in the literature show that hybrid polarity repeater offers higher performance, less delay uncertainty, and reduced sensitivity to repeater placement variation.
A ROM-less direct digital frequency synthesizer employing trigonometric quadruple angle formula is present in this paper. The worse case spectral purity is better than -130 dBc. The amplitude resolution is up to 13 bits, while the phase resolution is 12 bits. Neither any scaling table nor error correction tables are required. The maximum error is mathematically analyzed. The word length of each multiplier is carefully selected in the digital implementation such that the error range is circumscribed and the resolution is preserved.
In this brief, the first- and second-order approximation of the quadruple angle formula (QAF) interpolation methods introduced in the paper by Wang et al. in 2004, are revisited. The limitations of those methods are completely overlooked in the paper. One of the limitations is maximum achievable spurious-free dynamic range (SFDR) of the generated sinusoidal signals, which are significantly overestimated. In this paper, it is mathematically proven that the best achievable spurious-free dynamic ranges using QAF interpolation methods are significantly less than the values given in the paper by Wang et al. Moreover, the corrected and complete digital implementation of the second-order approximation is introduced.
Predicting the residual energy of the battery source that powers a portable electronic device is imperative in designing and applying an effective dynamic power management policy for the device. This paper starts up by showing that a 30% error in predicting the battery capacity of a lithium-ion battery can result in up to 20% performance degradation for a dynamic voltage and frequency scaling algorithm. Next, this paper presents a closed form analytical expression for predicting the remaining capacity of a lithium-ion battery. The proposed high-level model, which relies on online current and voltage measurements, correctly accounts for the temperature and cycle aging effects. The accuracy of the high-level model is validated by comparing it with DUALFOIL simulation results, demonstrating a maximum of 5% error between simulated and predicted data.
An 8:1 multiplexer (MUX) and a 1:8 demultiplexer (DEMUX) for 2.4-Gb/s optical communication systems have been developed using 0.35-/spl mu/m GaAs heterojunction field-effect transistors (FETs). To ensure timing margins, a new timing generator with latches and new clock buffers with cross-coupled inverters have been developed. These large-scale integrations (LSIs) operate at over 2.4 Gb/s with a power consumption of 150 mW (MUX) and 170 mW (DEMUX) at a supply voltage of 0.7 V, and at over 5 Gb/s with power consumption of 200 mW at a supply voltage of 0.8 V.
Rapid advances in semiconductor technology have made timing-related defects increasingly crucial in core-based system-on-chip designs. Currently, modular test strategies based on IEEE standard 1500 are applied to test the functionality of each embedded core in system-on-chip (SoC) designs but fail to verify the corresponding timing specifications. In this paper, to achieve high quality of delay tests, hardware implementation of an embedded delay test framework including the modified test wrappers and the embedded delay test mechanism is presented to build an entirely embedded delay test environment where at-speed clock is applied inside the chip to increase test accuracy. Additionally, the proposed delay test framework is capable of supporting all current solutions of core-based delay test. The experimental results successfully demonstrate the delay testing application using the proposed framework to a crypto processor with satisfying test quality and effectiveness.
IEEE 1500 Standard defines a standard test interface for embedded cores of a system-on-a-chip (SOC) to simplify the test problems. In this paper we present a systematic method to employ this standard in a SOC test platform so as to carry out on-chip at-speed testing for embedded SOC cores without using expensive external automatic test equipment. The cores that can be handled include scan-based logic cores, BIST-based memory cores, BIST-based mixed-signal devices, and hierarchical cores. All required test control signals for these cores can be generated on-chip by a single centralized test access mechanism (TAM) controller. These control signals along with test data formatted in a single buffer are transferred to the cores via a dedicated test bus, which facilitates parallel core testing. A number of design techniques, including on-chip comparison, direct memory access, hierarchical core test architecture, and hierarchical test bus design, are also employed to enhance the efficiency of the test platform. A sample SOC equipped with the test platform has been designed. Experimental results on both FPGA prototyping and real chip implementation confirm that the test platform can efficiently execute all test procedures and effectively identify potential defect(s) in the target circuit(s).
Core-based design and reuse are the two key elements for an efficient system-on-chip (SoC) development. Unfortunately, they also introduce new challenges in SoC testing, such as core test reuse and the need of a common test infrastructure working with cores originating from different vendors. The IEEE 1500 Standard for Embedded Core Testing addresses these issues by proposing a flexible hardware test wrapper architecture for embedded cores, together with a core test language (CTL) used to describe the implemented wrapper functionalities. Several intellectual property providers have already announced IEEE Standard 1500 compliance in both existing and future design blocks. In this paper, we address the problem of guaranteeing the compliance of a wrapper architecture and its CTL description to the IEEE Standard 1500. This step is mandatory to fully trust the wrapper functionalities in applying the test sequences to the core. We present a systematic methodology to build a verification framework for IEEE Standard 1500 compliant cores, allowing core providers and/or integrators to verify the compliance of their products (sold or purchased) to the standard.
The architecture and implementation of a programmable video signal processor dedicated as building block of a multiple instruction multiple data (MIMD)-based bus-connected multiprocessor system is presented. This system can either be constructed from several single processor chips, or it can be integrated on a large area integrated circuit containing several processors. The processor allows an efficient implementation of different video coding standards like H.261, H.263, MPEG-1 and MPEG-2. It consists of a RISC processor supplemented by a coprocessor for computation intensive convolution-like tasks, which provides a peak performance of more than 1 giga-arithmetic operations per second (GOPS). A large area integrated circuit integrating 9 processor elements (PE's) on an area of 16.6 cm/sup 2/ has been designed. Due to yield considerations redundancy concepts have been implemented, that-even in the presence of production defects-result in working chips utilizing a lower number of PE's. Each PE has built-in self-test (BIST) capabilities, which allow for an independent test of itself under the control of its integrated fault-tolerant BIST controller. Defective PE's are switched off. Only the PE's passing the BIST are used for video processing tasks. Prototypes have been fabricated in a 0.8 /spl mu/m complementary metal-oxide-semiconductor (CMOS) process structured by masks using wafer stepping with overlapping exposures. Employing redundancy, up to 6 PE's per chip were functional at 66 MHz, thus providing a peak arithmetic performance of up to 6 GOPS.
In this paper, a power efficient vertex processor for mobile graphics applications is presented. A four-threaded and four-issue expanded VLIW datapath with a quad-float vertex texture fetcher is proposed by exploiting graphics specific characteristics after evaluation of several candidate architectures. Instruction-level power control methods such as operand sharing and writeback re-allocation along with operand isolations and gated clocks result in 40.4% and 82% reduction in energy dissipation and energy delay product compared to the most widely used single threaded SIMD. The proposed processor with the optimized datapath and vertex caches implemented in a 0.18- mum 1P4M CMOS process achieves 186-Mvertices/s geometry performance which is the best result among the processors that are IEEE-754 compliant.
We propose a read-disturb-free, 1-read/1-write port, 8-transistor (8T) bitcell utilizing differential sensing. The conflicting design requirement of read versus write operation in a conventional 6T SRAM bitcell is eliminated using separate read/write access transistors. A distributed read-access transistor shared across the bitcells of every row enables read-disturb-free differential sensing operation with eight transistors per bitcell. Write-access transistors are upsized to form a diffusion-notch-free layout which would result in improved manufacturability. 1R/1W port nature of the proposed 8T bitcell makes it an attractive choice for the high speed, dense register file (RF) designs. Bitcell failure measurements on 20 test-chips fabricated in 90-nm CMOS technology demonstrate that the proposed differential 8T bitcell shows 220 mV lower read- V <sub>min</sub>, 40 mV lower hold- V <sub>min</sub>, 25 mV higher weak-write voltage compared to the iso-area 6T bitcell at iso-performance. At 600 mV, the proposed 8T bitcell array operates up to 67.2 MHz.
As industry moves towards multicore chips, networks-on-chip (NoCs) are emerging as the scalable fabric for interconnecting the cores. With power now the first-order design constraint, early-stage estimation of NoC power has become crucially important. In this work, we present ORION 2.0, an enhanced NoC power and area simulator, which offers significant accuracy improvement relative to its predecessor, ORION 1.0.
We present a system-on-chip (SoC) that integrates a TMS320C54x digital signal processor (DSP), which is commonly used in cellular phones, with a multigigahertz digital RF transmitter that meets the Bluetooth specifications. The RF transmitter is tightly coupled with the DSP and is directly mapped to its address space. The transmitter architecture is based on an all-digital phase-locked loop (ADPLL), which is built from the ground up using digital techniques and digital creation flow that exploit high speed and high density of a deep-submicrometer CMOS process while avoiding its weaker handling of voltage. The frequency synthesizer features a wideband frequency modulation capability. As part of the digital flow, the digitally controlled oscillator (DCO) and a class-E power-amplifier are created as ASIC cells with digital I/Os. All digital blocks, including the 2.4-GHz logic, are synthesized from VHDL and auto routed. The use of VHDL allows for a tight and seamless integration of RF with the DSP. To take advantage of the direct DSP-RF coupling and to demonstrate a software-defined radio (SDR) capability, a DSP program is written to perform modulation of the GSM standard. The chip is fabricated in a baseline 130-nm CMOS process with no analog extensions and features high logic gate density of 150 kgates per mm/sup 2/. The RF transmitter area occupies only 0.54 mm/sup 2/, and the current consumption (including the companion DSP) is 49 mA at 1.5-V supply and 4 mW of RF output. This proves attractiveness and competitiveness of the "digital RF" approach, whose goal is to replace RF functions with high-speed digital logic gates.
For ultra low power application, digital sub-threshold logic design has been explored. Extremely low power supply (VDD) of sub-threshold logic results in significant power reduction. However, it is difficult to convert signals from core logic to input/output (I/O) circuits since core VDD is vastly different from high I/O supply voltage. In this work, we propose a level converter based on dynamic logic style for sub-threshold I/O part, having a large dynamic range of conversion. For the level converter, high voltage clock signal needs to be delivered through separate clock path from core logic, leading to clock synchronization problem between high voltage and low voltage clocks. To overcome this issue, we employed a Clock Synchronizer. A test chip is fabricated in 130-nm CMOS technology in order to verify the proposed technique. Hardware measurement results show that the level converter successfully converts 0.3 V 8 MHz pulse to 2.5 V signal.
The design and implementation of two application specific integrated circuits used to build an ATM switch are described. The chip set is composed of the CMC which is an input/output processor of ATM cells implemented on a BICMOS 0.7 /spl mu/m technology and the ICM, a 0.7 /spl mu/m CMOS IC, that performs cell switching at 68 MHz. The ATM switch exploits parallelism and segmentation to perform 2.5 Gb/s switching per input/output. The main advantage of the high-speed link rates in the range of Gb/s, is the exploitation of statistical gain with bursty high peak rate sources. Another feature of the high speed ATM switches is that the number of interface devices and stages is reduced on an ATM network. To demonstrate the usefulness of the switch, an evaluation of the network efficiency improvement by using statistical gain is presented in the paper.
The excessive interconnection delay and fast increasing development cost, as well as complexity of the single-chip integration of different technologies, are likely to become the major stumbling blocks for the success of monolithic system-on-chips. To address the above problems, this paper investigates a new VLSI integration paradigm, the so-called 2.5-dimensional (2.5-D) integration scheme. Using this scheme, a VLSI system is implemented as a three-dimensional stacking of monolithic chips. A cost analysis framework was developed to justify the 2.5-D integration scheme from an economic point of view. Enabling technologies for the new integration scheme are also reviewed.
A 64-Mb chain ferroelectric RAM (chainFeRAM) is fabricated using 130-nm 3-metal CMOS technology. A newly developed quad bitline architecture, which combines folded bitline configuration with shield bitline scheme, eliminates bitline-bitline (BL-BL) coupling noise. The quad bitline architecture also reduces the number of sense amplifiers and activated bitlines, resulting in the reduction of die size by 6.5% and cell array power consumption by 28%. Fast read/write of 60-ns cycle time as well as reliability improvement are realized by two high-speed error checking and correcting (ECC) techniques: 1) fast pre-parity calculation ECC sequence and 2) all-“0”-write-before-data-write scheme. Moreover, among nonvolatile memories reported so far, the 64 Mb chain FeRAM has achieved the highest read/write bandwidth of 200 MB/s with ECC. The chip size is 87.5 mm<sup>2</sup> with average cell size of 0.7191 μm<sup>2</sup>.
A read-channel chip set for rewritable 3.5 in 230 Mbytes magneto-optical disk drives (MOD) is presented. The front-end chip includes an automatic gain control (AGC) circuit, a programmable six-pole two-zero equiripple filter/equalizer, a DC restore circuit, and pulse detectors. The back-end contains a frequency synthesizer phase-locked loop (PLL) and a data separator PLL with 3:1 operating range to support a constant density recording with 8-24 Mb/s data rate (or code rate of 16 to 48 Mb/s) in (2, 7) run-length limited (RLL) encoding format. The architecture of the chip provides high degree of programmability through a serial microprocessor interface, fast switching (<1 /spl mu/s) between sector mark and data detector modes, and four levels of power management in a 1.5 /spl mu/m 4 GHz BiCMOS process. With a nominal power supply of 5 V, the chip set dissipates 600 mW during normal operation and 1 mW during sleep mode.
An 8-bit*8-bit signed two's complement pipelined multiplier megacell implemented in 1.6- mu m single-poly, double-metal N-well CMOS is described. It is capable of throughputs of 230,000,000 multiplications/s at a clock frequency of 230 MHz, with a latency of 12 clock cycles. A half-bit level pipelined architecture, and the use of true single-phase clocked circuitry are the key features of this design. Simulation studies indicate that the multiplier dissipates 540 mW at 230 MHz. The multiplier cell has 5176 transistors, with dimensions of 1.5 mm*1.4 mm. This multiplier satisfies the need for very high-throughput multiplier cores required in DSP architectures.< >
A variable gain amplifier (VGA) is designed for a GSM sub-sampling receiver. The VGA is implemented in a 0.35-mum CMOS process and approximately occupies 0.64 mm(2). It operates at an IF frequency of 246 MHz. The VGA provides a 60-dB digitally controlled gain range in 2-dB steps. The overall gain accuracy is less than 0.3 dB. The current is 9 mA at 3 V supply. The noise figure at maximum gain is 8.7 dB. The IIP3 is -4 dBm at minimum gain, while the OIP3 is -1 dBm at maximum gain. The group delay is 1.5 ns across 5-MHz bandwidth.
This paper describes an application in high-performance signal processing using reconfigurable computing engines: a 250 MHz cross correlator for radio astronomy. Experimental results indicate that complementary metal-oxide-semiconductor (CMOS) field programmable gate arrays (FPGA's) can perform useful computation at 250 MHz. The notion of an "event horizon" for FPGA's leads to clear design constraints for high-speed application developers, and can be applied to a variety of real-time signal processing algorithms. Recent estimates indicate that higher performance FPGA's available early in 1998 can attain speeds of over 300 MHz using 20% fewer logic elements than current designs. The results of this design work provide important clues on how to improve FPGA architectures for signal processing at hundreds of MHz. Direct routing channels between logic elements can significantly increase performance. Routing architectures with four-way symmetry allow for rotations and reflections of subcircuits needed for optimal packing density. Experimental results indicate that clock buffering often limits the top speed of the FPGA. Wave pipelining of the clock distribution network may improve FPGA performance.
The design of high-throughput large-state Viterbi decoders relies on the use of multiple arithmetic units. The global communication channels among these parallel processors often consist of long interconnect wires, resulting in large area and high power consumption. In this paper, we propose a data transfer oriented design methodology to implement a low-power 256-state rate-1/3 Viterbi decoder. Our architectural level scheme uses operation partitioning, packing, and scheduling to analyze and optimize interconnect effects in early design stages. In comparison with other published Viterbi decoders, our approach reduces the global data transfers by up to 75% and decreases the amount of global buses by up to 48%, while enabling the use of deeply pipelined datapaths with no data forwarding. In the register-transfer level (RTL) implementation, we apply precomputation in conjunction with saturation arithmetic to further reduce power dissipation with provably no coding performance degradation. Designed using a 0.25 /spl mu/m standard cell library, our decoder achieves a throughput of 20 Mb/s in simulation and dissipates only 0.45 W.
In this paper, we study and analyze the computational complexity of the deblocking filter in H.264/AVC baseline decoder based on SimpleScalar/ARM simulator. The simulation result shows that the memory reference, content activity check operations, and filter operations are known to be very time consuming in the decoder of this new video coding standard. In order to improve overall system performance, we propose a configurable, extensible, and synthesizable window-based processing architecture which simultaneously processes the horizontal filtering of vertical edge and vertical filtering of horizontal edge. As a result, the memory performance of the proposed architecture is improved by four times when compared to previous designs. Moreover, the system performance of our window-based architecture significantly outperforms the previous designs from 7 times to 20 times.
Multiple-reference-frame, quarter-pixel accuracy, and variable-block-size motion estimation (VBSME) employed in H.264/AVC is one of the major contributors to its outstanding compression efficiency and video quality. However, due to its high computational complexity, VBSME needs acceleration for real-time application. We propose a high throughput hardware architecture for H.264/AVC fractional motion estimation (FME). The proposed architecture consists of three parallel processing engines. In addition, we propose a resource sharing method which leads to 50% hardware saving in the computation sum of absolute transformed difference (SATD). Synthesized into a TSMC 130 nm CMOS cell library, our design takes 311.7K gates at 154 MHz and can encode 1080 pHD video at 30 frames per second (fps). Compared to previous works, the proposed design runs at much lower frequency for the same resolution and frame rate.
Variable block size motion estimation (VBSME) is one of several contributors to H.264/AVC's excellent coding efficiency. However, its high computational complexity and huge memory traffic make deign difficult. In this paper, we propose a memory-efficient and highly parallel VLSI architecture for full search VBSME (FSVBSME). Our architecture consists of 16 2-D arrays each consists of 16 ?? 16 processing elements (PEs). Four arrays form a group to match in parallel four reference blocks against one current block. Four groups perform block matching for four current blocks in a pipelined fashion. Taking advantage of overlapping among multiple reference blocks of a current block and between search windows of adjacent current blocks, we propose a novel data reuse scheme to reduce memory access. Compared with the popular Level C data reuse scheme, our approach can save 98% of on-chip memory access with only 25% of local memory overhead. Synthesized into a TSMC 180-nm CMOS cell library, our design is capable of processing 1920 ?? 1088 30 fps video when running at 130 MHz. The architecture is scalable for wider search range, multiple reference frames and pixel truncation as well as down sampling. We suggest a criterion called design efficiency for comparing different works. It shows that the proposed design is 72% more efficient than the best design to date.
Prediction, including intra prediction and inter prediction, is the most critical issue in H.264/AVC decoding in terms of processing cycles and computation complexity. These two predictions demand a huge number of memory accesses and account for up to 80% of the total decoding cycles. In this paper, we present the design and VLSI implementation of a novel power-efficient and highly self-adaptive prediction engine that utilizes a 4 times 4 block level pipeline. Based on the different prediction requirements, the prediction pipeline stages, as well as the correlated memory accesses and datapaths, are fully adjustable, which helps to reduce unnecessary decoding operations and energy dissipation while retaining the fixed real-time throughput. Compared with conventional designs, this paper has the advantage of higher efficiency and lower power consumption due to the elimination of all redundant operations and the wide employment of the pipeline and parallel processing. Under different prediction modes, our design is able to decode each macroblock within 500 cycles. A prototype H.264/AVC baseline decoder chip that utilizes the proposed prediction engine is fabricated with UMC 0.18-mu CMOS 1P6 M technology. The prediction engine contains 79 K gates and 2.8 kb single-port on-chip SRAM, and occupies half of the whole chip area. When running at 1.5 MHz for QCIF 30 f/s real-time decoding, the prediction engine dissipates 268 muW at a 1.8-V power supply.
This paper presents an efficient architecture of an application specific processor (ASP) designed for the deblocking filter algorithm of the H.264 video compression standard. Several optimization techniques at different design levels, such as vector register, pipeline processing, very long instruction word (VLIW) processor, and predication, are utilized in this design. The proposed ASP can meet the real time constraint of the deblocking filter algorithm for the 16:9 video format (4690 times 2304) at 30 frames per second (fps) at 200-MHz clock rate.
This paper describes the architecture, functionality, and design of NX-2700, a digital television and media processor chip from Philips Semiconductors. The NX-2700 is the second generation of an architectural family of programmable multimedia processors targeted at the digital television (DTV) markets, including the United States Advanced Television Systems Committee (ATSC) DTV-standard-based applications. The chip not only supports all of the 18 ATSC formats from standard-definition to wide-angle, high-definition video, but has also the power to handle high-definition television (HDTV) video and audio source decoding (high-level MPEG-5 AC-3 and ProLogic audio, closed captioning, etc.) as well as the flexibility to process advanced interactive services. NX-2700 is a programmable processor with a very powerful, general-purpose very long instruction word (VLIW) central processing unit (CPU) core that implements many nontrivial multimedia algorithms, coordinates all on-chip activities, and runs a small real-time operating system. The CPU core, aided by an array of peripheral devices (multimedia coprocessors and input-output units) and high-performance buses, facilitates concurrent processing of audio, video, graphics, and communication-data.
In this paper, an efficient digit-serial systolic array is proposed for multiplication in finite field GF(2/sup m/) using the standard basis representation. From the least significant bit first multiplication algorithm, we obtain a new dependence graph and design an efficient digit-serial systolic multiplier. If input data come in continuously, the proposed array can produce multiplication results at a rate of one every /spl lceil/m/L/spl rceil/ clock cycles, where L is the selected digit size. Analysis shows that the computational delay time of the proposed architecture is significantly less than the previously proposed digit-serial systolic multiplier. Furthermore, since the new architecture has the features of regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation.
Top-cited authors
Kaushik Roy
  • Purdue University
E.G. Friedman
  • University of Rochester
Luca Benini
  • University of Bologna
Massoud Pedram
  • University of Southern California
Mircea R Stan
  • University of Virginia