A 60Gb/s 173mW receiver frontend in 65nm CMOS technology

Jaeduk Han¹, Yue Lu², Nicholas Sutardja¹, Kwangmo Jung¹, Elad Alon¹
¹University of California, Berkeley, CA 94704, USA; ²Qualcomm Atheros Inc., San Jose, CA 95110, USA
Abstract
This paper presents a 65nm CMOS 60Gb/s receiver frontend
incorporating CTLE, FFE, DFE, output slicers and clock
generation as well as distribution circuits. Current-integration
along with cascode gain control is used to maintain equalizer
linearity under varying gain without sacrificing power
consumption. Interleaved deserializing slicers achieve the high
gain required for adaptive error sampling. The receiver
operates error-free over >1e12 bits at 60Gb/s, occupies
0.16mm², and consumes 173mW.
Introduction
The tremendous growth in data traffic has placed
increasingly stringent demands on high-speed wireline
communication systems. Since the total power budget assigned
to I/O remains fixed, the energy-efficiency of such high-speed
transceivers must be improved as their data-rates are increased
to the 40-60Gb/s range. Several groups have demonstrated
56-80Gb/s DFE and CTLE circuits [1-3] to address these needs.
However, in addition to the high-speed equalization, support
for equalizer adaptation as well as clocking must ultimately be
included; integration of multiple equalizer types (including
FFE), along with these functions while retaining energy
efficiency is critical. This work (Fig. 1) therefore presents a
60Gb/s receiver frontend including all of these components
while dissipating a total of 173mW.
Receiver Architecture and Circuit Implementation
Multiple types of equalizers are necessary to achieve
sufficient signal quality over a realistic channel operating at
60+Gb/s. Thus, in this design, three types of equalizers are
combined to cancel various types of ISI, as shown in Fig. 2.
First, the input ports are connected to a CTLE that removes
long-tail ISI and de-multiplexes the input signal for the
half-rate operation of subsequent stages. After the
CTLE+DMUX, a 2-tap FFE and 3-tap DFE are utilized to
cancel pre-cursor and post-cursor ISI, respectively.
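The division of labor between the FFE and the DFE can be illustrated with a short behavioral model. The sketch below is not the circuit described in this paper: the five-sample pulse response H, the tap values derived from it, and the function names are assumptions chosen only for illustration, and the CTLE is omitted. It sends a PRBS7 pattern through the assumed pulse response, cancels the pre-cursor with one feed-forward tap, and cancels three post-cursors with decision feedback.

```python
import numpy as np

def prbs7(n_bits, seed=0x7F):
    """PRBS7 (x^7 + x^6 + 1) bit sequence, period 127."""
    state, bits = seed, []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return np.array(bits)

# Hypothetical sampled pulse response: [pre-cursor, main, post1, post2, post3].
H = np.array([0.15, 1.0, 0.4, 0.2, 0.1])

def channel(symbols, h=H):
    """y[n] = sum_k h_k * d[n-k] for k = -1..3 (symbols outside the burst are 0)."""
    return np.convolve(symbols, h)[1:1 + len(symbols)]

def ffe_dfe(y, h=H):
    """2-tap FFE (main tap plus one pre-cursor tap) followed by a 3-tap DFE."""
    a = h[0] / h[1]                       # pre-cursor tap weight
    z = y[:-1] - a * y[1:]                # z[n] = y[n] - a * y[n+1]
    # Post-cursors of the FFE-filtered pulse: h_m - a * h_(m+1), m = 1..3.
    h_ext = np.append(h, 0.0)
    dfe_taps = h_ext[2:5] - a * h_ext[3:6]
    summed = np.zeros(len(z))
    decisions = np.zeros(len(z))
    for n in range(len(z)):
        feedback = sum(dfe_taps[k] * decisions[n - 1 - k]
                       for k in range(3) if n - 1 - k >= 0)
        summed[n] = z[n] - feedback       # summer output seen by the slicer
        decisions[n] = 1.0 if summed[n] >= 0 else -1.0
    return summed, decisions

bits = prbs7(1000)
d = 2.0 * bits - 1.0                      # NRZ symbols, +/-1
y = channel(d)
summed, decisions = ffe_dfe(y)
raw_margin = np.min(y * d)                # worst-case eye height before equalization
eq_margin = np.min(summed * d[:-1])       # worst-case eye height after FFE+DFE
print(f"raw margin {raw_margin:.4f}, equalized margin {eq_margin:.4f}")
```

With these assumed numbers the FFE removes the first pre-cursor at the cost of a small second pre-cursor, which is why the equalized margin stays slightly below the main cursor even though the DFE removes the post-cursors exactly.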
In order to improve upon the energy-efficiency offered by
earlier efforts at these data-rates, this design makes extensive
use of current-integration techniques [4] (as opposed to
resistively loaded stages). For example, the CTLE integrator in
Fig. 3(a) uses a source-degenerated input pair, with clocked
cascode devices for demultiplexing. Note that the resetting
behavior of such current-integration stages is often not
desirable for proper operation of subsequent stages. Dynamic
latches are therefore inserted after the CTLE to convert the
return-to-zero signals to non-return-to-zero, and to provide a
set of UI delayed signals to the FFE.
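The integrate-and-reset behavior described above can be sketched numerically. In an idealized current-integrating stage the output at the end of the integration window is gm * Vin * T / C, and the output returns to zero during reset; a latch that captures the final value restores a non-return-to-zero waveform. The transconductance and load capacitance below are hypothetical values, not taken from this design, and the linear ramp ignores finite output resistance.

```python
import numpy as np

UI = 1 / 60e9  # one bit period at 60 Gb/s, about 16.7 ps

def integrator_gain(gm, c_load, t_int=UI):
    """Ideal voltage gain of a current integrator: gm * T / C."""
    return gm * t_int / c_load

def integrate_and_hold(v_in, gm, c_load, steps=8):
    """Return (rz, nrz) waveforms for one half-rate path.

    Each input sample is integrated for one phase and reset for the next;
    the latch output (nrz) updates at the end of each integration phase.
    """
    gain = integrator_gain(gm, c_load)
    ramp = np.arange(1, steps + 1) / steps      # linear ramp ending at 1.0
    rz, nrz, held = [], [], 0.0
    for v in v_in:
        final = gain * v
        rz.extend(final * ramp)                 # integration phase
        nrz.extend([held] * steps)              # latch still holds the old value
        held = final                            # latch captures the new value
        rz.extend([0.0] * steps)                # reset phase: output returns to zero
        nrz.extend([held] * steps)
    return np.array(rz), np.array(nrz)

# Hypothetical values: gm = 2 mS, C = 20 fF.
rz, nrz = integrate_and_hold([0.1, -0.1], gm=2e-3, c_load=20e-15)
print(f"integrator gain: {integrator_gain(2e-3, 20e-15):.3f}")
```

The held waveform stays valid through the following reset and integration phases, which is what allows it to serve as a UI-delayed input for a later stage.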
An integration summer after the latches (Fig. 3(b)) replaces
the power-hungry continuous-time summer in [1] and also
includes a 2-tap FFE to correct pre-cursor ISI. The clocked
current source in the main integration branch turns off the tail
currents during the reset phase, thus reducing the power
consumption further. A key challenge in integrating an FFE
into the receiver is that FFEs require a mechanism to realize
tunable analog gain while retaining a sufficiently linear
transfer function across all such gain settings. Thus, unlike
DFE, one generally cannot set the magnitude of an equalizer
coefficient simply by changing the tail current of a branch in
the circuit, since this would result in low overdrive and hence
poor linearity at low gain. This design therefore instead varies
the cascode gate voltage bias of a cascoded differential pair
(Fig. 3(b)) to achieve variable gain; linearity over the gain
control range is maintained since the overdrive of the input
pair remains roughly constant. Common-mode (CM) control is
critical to ensure the functionality of the following stage, and
thus pull-up current sources [4] (to maintain CM during
integration) as well as a resistor DAC (to lower the CM during
reset) are utilized for this purpose.
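The linearity argument for the FFE gain control can be checked with a square-law model of a differential pair. The sketch below compares two ways of reducing gain: scaling the tail current, and leaving the input pair bias untouched while scaling the delivered output current. The second case is a simplifying stand-in for cascode gain control, assumed here to act as an ideal current scaling; the device parameter k and the bias current are hypothetical.

```python
import numpy as np

def diff_pair_current(v_id, i_tail, k):
    """Square-law differential output current; k = mu * Cox * W / L."""
    v_max = np.sqrt(2 * i_tail / k)          # input at which one side cuts off
    v = np.clip(v_id, -v_max, v_max)
    return 0.5 * k * v * np.sqrt(4 * i_tail / k - v ** 2)

def compression(v_amp, gain_scale, method, i_tail0=1e-3, k=10e-3):
    """Fractional gain loss at amplitude v_amp relative to small-signal gain."""
    if method == "tail":                     # gm ~ sqrt(I): scale I by gain^2
        i_tail, steer = i_tail0 * gain_scale ** 2, 1.0
    elif method == "cascode":                # constant overdrive, scaled output
        i_tail, steer = i_tail0, gain_scale
    else:
        raise ValueError("method must be 'tail' or 'cascode'")
    gm_small = steer * np.sqrt(k * i_tail)
    i_out = steer * diff_pair_current(v_amp, i_tail, k)
    return 1.0 - i_out / (gm_small * v_amp)

for g in (1.0, 0.5, 0.25):
    print(g, compression(0.1, g, "tail"), compression(0.1, g, "cascode"))
```

In this model the compression of the constant-overdrive case does not depend on the gain setting, whereas tail-current scaling shrinks the overdrive and the compression grows quickly as the gain is reduced, consistent with the reasoning above.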
Since it has extremely stringent feedback latency
requirements, the first post-cursor tap is cancelled in a separate
dynamic-latch based stage [1] following the integrating
FFE+DFE. Since the dynamic latch should ideally be
transparent when the integrator output is fully evaluated,
simulations showed that delaying the DFE clock relative to the
integrator clock by ~3ps resulted in the best overall
performance (Fig. 4(b)). Given the relatively small delay
compared to the bit period, this delay was simply implemented
by the passive RC network shown in Fig. 4(a).
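As a rough sizing check for such a passive delay, a first-order RC low-pass driven by a near-sinusoidal clock delays it by atan(2*pi*f*RC) / (2*pi*f). The sketch below inverts that relation for the ~3ps target at the 30GHz clock frequency. The single-pole model and the 10fF capacitance are assumptions for illustration; the actual network values are not given in the text.

```python
import math

F_CLK = 30e9  # half-rate clock frequency for 60 Gb/s

def rc_for_delay(t_delay, f=F_CLK):
    """RC time constant whose phase lag at f equals t_delay (single-pole model)."""
    w = 2 * math.pi * f
    phase = w * t_delay
    if not 0 <= phase < math.pi / 2:
        raise ValueError("a single RC pole cannot provide 90 degrees or more")
    return math.tan(phase) / w

def rc_attenuation(rc, f=F_CLK):
    """Amplitude ratio of the delayed clock to the input clock."""
    return 1 / math.sqrt(1 + (2 * math.pi * f * rc) ** 2)

rc = rc_for_delay(3e-12)
c_assumed = 10e-15                     # hypothetical load capacitance
print(f"RC = {rc * 1e12:.2f} ps, R = {rc / c_assumed:.0f} ohm for C = 10 fF")
print(f"clock amplitude retained: {rc_attenuation(rc):.2f}")
print(f"delay as a fraction of the 60 Gb/s UI: {3e-12 * 60e9:.2f}")
```

The amplitude loss that accompanies the delay is modest for a delay this small, which is in line with the choice of a simple passive network.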
Even though the output of the dynamic latch stage has
sufficient swing to force the DFE feedback devices into their
non-linear regime, significant additional gain is required after
this stage in order to support equalizer adaptation. This is
because the error slicer required for such adaptation is
intentionally driven by the adaptation loop to have essentially
zero input. Leveraging their regenerative nature and small
aperture windows [6], this design therefore uses two pairs
(even/odd, data/error) of interleaved, offset-cancelled
StrongArm latches following the DFE (Fig. 5) to efficiently
realize this gain requirement.
Fig. 1 Receiver Architecture
Fig. 2 Equalizer Architecture
Fig. 3 (a) CTLE + DMUX (b) Current integrating FFE+DFE
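The reason the error slicer sees essentially zero input can be illustrated with a sign-based adaptation loop for the reference level (vdLev in Fig. 1). The text does not specify the adaptation algorithm; the sketch below assumes a sign-sign style update, a hypothetical signal amplitude of 0.3 with Gaussian noise, and a step size mu picked for illustration.

```python
import numpy as np

def adapt_dlev(samples, decisions, mu=1e-3, dlev0=0.0):
    """Track the signal level with a sign-based update; returns the dLev trace."""
    dlev, trace = dlev0, []
    for y, d in zip(samples, decisions):
        # Error slicer: is the folded sample above or below the reference level?
        err_sign = 1.0 if d * y - dlev > 0 else -1.0
        dlev += mu * err_sign
        trace.append(dlev)
    return np.array(trace)

rng = np.random.default_rng(0)
n = 5000
d = rng.choice([-1.0, 1.0], size=n)                 # transmitted symbols
y = 0.3 * d + 0.02 * rng.standard_normal(n)         # equalized samples (assumed)
trace = adapt_dlev(y, d)

# Input seen by the error slicer once the loop has settled.
slicer_input = np.abs(d[-1000:] * y[-1000:] - trace[-1000:])
print(f"settled dLev: {trace[-1000:].mean():.3f}")
print(f"mean |error slicer input|: {slicer_input.mean():.3f}")
```

Once the loop settles, the reference sits in the middle of the sample distribution, so the error slicer resolves only the residual around it, which is much smaller than the data swing. This is the small-input condition that motivates the high-gain slicers.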
To complete the frontend, a 140µm by 250µm LC oscillator
with a number of digitally controlled capacitor arrays to
support CDR generates 30GHz differential clocks. The
oscillator also includes an injection locking path to support
testing, and is followed by a 220µm by 130µm resonant clock
buffer to efficiently drive the ~300fF differential capacitive
load from the equalizer circuitry.
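For a sense of scale, the inductance needed to resonate a capacitive load at the clock frequency follows from f = 1 / (2*pi*sqrt(L*C)). The sketch below applies this to the ~300fF load at 30GHz, treating the buffer as an ideal parallel LC tank and ignoring the buffer's own capacitance and inductor parasitics, which are assumptions made here. The capacitor-array helper uses hypothetical unit values.

```python
import math

def resonant_inductance(f_res, c_load):
    """Inductance that resonates with c_load at f_res (ideal parallel LC)."""
    return 1 / ((2 * math.pi * f_res) ** 2 * c_load)

def lc_frequency(l_tank, c_fixed, c_unit=0.0, code=0):
    """Resonant frequency with `code` units of a switched capacitor array enabled."""
    return 1 / (2 * math.pi * math.sqrt(l_tank * (c_fixed + code * c_unit)))

l_res = resonant_inductance(30e9, 300e-15)
print(f"L for 300 fF at 30 GHz: {l_res * 1e12:.1f} pH")

# Hypothetical tuning example: each enabled 2 fF unit lowers the frequency slightly.
for code in (0, 4, 8):
    f = lc_frequency(l_res, 300e-15, c_unit=2e-15, code=code)
    print(code, f"{f / 1e9:.2f} GHz")
```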
Measurement Results
As shown in Fig. 7(a), the receiver was fabricated in a 65nm
CMOS process, with the frontend occupying 0.16mm². To
characterize the design, a 60Gb/s PRBS7 pattern was
transmitted (by a separate chip/TX described in [1]) and
1/128x sub-samplers (using the same structure as Fig. 5, with
oscillator injection locking enabled) were used to reconstruct
the received pattern and measure the BER (Fig. 7(b)). Fig. 8(a)
and Fig. 8(b) show the eye diagram and pulse response of the
input signal, respectively. As shown by the timing bathtub
curve in Fig. 8(c), the receiver successfully recovers the
PRBS7 pattern, and operates error free over >1e12 bits in the
center region. The equalizer core consumes 48mW while
supporting substantially more functionality than previous
designs [1-3] (Table I); the complete frontend dissipates
173mW from 1.2V and 1.0V supplies.
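Two quick calculations help interpret these results. The first converts an error-free run of N bits into an upper bound on BER at a chosen confidence level, using the standard zero-error bound -ln(1 - CL) / N; the 95% confidence level is an assumption made here, not a figure from the measurements. The second recomputes energy per bit from the reported power and data rate, using the power breakdown listed in Table I.

```python
import math

def ber_upper_bound(n_bits, confidence=0.95):
    """BER upper bound after n_bits received with zero errors."""
    return -math.log(1 - confidence) / n_bits

def energy_per_bit_pj(power_w, data_rate_bps):
    """Energy efficiency in pJ/bit."""
    return power_w / data_rate_bps * 1e12

# Power breakdown for this work as listed in Table I (mW).
breakdown_mw = {
    "equalizer": 48,
    "deserializer": 28,
    "clock generation": 52,
    "clock distribution": 45,
}
total_mw = sum(breakdown_mw.values())

print(f"BER bound for 1e12 error-free bits: {ber_upper_bound(1e12):.2e}")
print(f"total power: {total_mw} mW")
print(f"frontend: {energy_per_bit_pj(total_mw * 1e-3, 60e9):.2f} pJ/bit")
print(f"equalizer core: {energy_per_bit_pj(48e-3, 60e9):.2f} pJ/bit")
```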
Acknowledgements
The authors would like to thank Systems on Nanoscale
Information fabriCs (SONIC), BWRC, BDA, Integrand EMX,
Lorentz PeakView, the TSMC University Shuttle Program,
and B. Casper of Intel, K. Chang of Xilinx, P. Y. Chiang of
OSU, C.K. K. Yang of UCLA, P. K. Hanumolu of UIUC, and
V. Stojanovic of UC Berkeley.
References
[1] Yue Lu, Elad Alon, JSSC, Dec. 2013
[2] Takayuki Shibasaki, et al., VLSI Symp., 2014
[3] Ahmed Awny, et al., JSSC, Feb. 2014
[4] Matt Park, et al., ISSCC, Feb. 2007
[5] Rui Bai, et al., ISSCC, 2014
[6] Jaeha Kim, et al., TCAS I, Aug. 2009
TABLE I: PERFORMANCE SUMMARY

Reference              [1]         [2]               [3]          This work
Process                65nm CMOS                     130nm SiGe   65nm CMOS
Data-rate (Gb/s)       66                            80           60
Equalizer              3-tap DFE   CTLE, 1-tap DFE   1-tap DFE    CTLE, 2-tap FFE, 3-tap DFE
VISI/VCURSOR or
Channel Loss (dB)      1.65                          12 dB        1.54
Power (mW)
  Equalizer            46                            1772         48
  Deserializer                                                    28
  Clock generation                                                52
  Clock distribution                                 2228         45
  Total                46                            4000         173
Efficiency (pJ/bit)    0.7                           50           2.88

Notes: Includes output buffer. LC oscillator + divider + PI.
*: Includes equalizer, 4:16 DES, clock distribution
Fig. 4 (a) Latched summer 1-tap DFE (b) DFE behavior without clock delay (dashed) and with clock delay (solid)
Fig. 5 Interleaved deserializing slicers
Fig. 6 Clock path architecture
Fig. 7 (a) Die photo (b) Measurement setup
Fig. 8 (a) Eye diagram at channel output (b) Estimated pulse response (Volts versus time index in UI) (c) Bathtub curve after equalization (BER versus Tx phase in UI)