VLSI Implementation of a CDMA Blind Adaptive Interference-Mitigating Detector

Luca Fanucci, Associate Member, IEEE, Edoardo Letta, Riccardo De Gaudenzi, Senior Member, IEEE, Filippo Giannetti, and Marco Luise, Senior Member, IEEE

Abstract—This paper presents the design and the main performance results of a single-ASIC implementation of the recently proposed extended complex-valued blind anchored interference-mitigating detector (EC-BAID) for code division multiple access (CDMA) transmission. Such a detector, which exhibits a remarkable robustness to multiple access interference, operates in blind mode, i.e., it only requires knowledge of the timing of the wanted user’s signature code, and it is therefore very well-suited for integration into handheld single-user terminal demodulators. The implementation of the interference-mitigating detector is based on a patented optimized architecture which leads, in 0.25-μm CMOS technology, to a roughly 25 Kgate plus 23-Kbit RAM single-chip ASIC supporting chip rates up to 4 Mchip/s with a maximum internal clock frequency of 32.768 MHz. The main design drivers are thoroughly discussed, and the relevant performance results are compared to the theoretical behavior. A possible extension to multirate CDMA systems adopting orthogonal variable spreading factor (OVSF) sequences is also briefly addressed.

Index Terms—Application-specific integrated circuits, code-division multiaccess, integrated circuit design, interference suppression.

I. INTRODUCTION

DIRECT-SEQUENCE code division multiple access (DS-CDMA) is the basic technology for the air interface of the universal wireless personal communication network planned for the start of the new century [1], [2]. CDMA radio systems provide several advantages (easy network planning, graceful degradation under loaded conditions, soft hand-off, and path diversity resolution) that are fundamental to improved quality of service (QoS) and overall capacity [3]–[5]. In the latter respect, one of the best approaches that can be pursued is the application of multiuser detection in the reverse link (from mobile to fixed network) and of single-user interference-mitigating adaptive detection in the forward link (from fixed network to mobile).

Most of the initial theoretical (see [6]) as well as experimental work (see [7]) on advanced detection techniques for CDMA was focused on the reverse link, wherein multiuser detectors may be implemented. More recently, the interest has also turned toward the forward link. Recent investigations showed that multiple access interference (MAI) also affects the forward link of third-generation wideband CDMA (W-CDMA) terrestrial [8] and satellite networks [10]. Unlike in the reverse link, the one-to-many nature of the power-controlled CDMA forward link causes power unbalance among the different channels and may lead to near–far conditions, whereby interference mitigation may be of help. Also, a large number of pilot channels can be seen at cell boundaries in networks employing microcells, causing the so-called pilot pollution phenomenon. CDMA system analysis [8]–[10] showed that linear interference-mitigating detectors can indeed provide relevant capacity and/or QoS improvement for both terrestrial and satellite networks.

Dealing with individual user terminals, it is natural to consider linear detectors that do not require joint detection and that are consequently simpler to implement [11]. The low-complexity blind minimum output energy (B-MOE) linear interference-mitigating detector scheme [12] minimizes the detrimental effects of MAI on the bit error rate (BER) performance of a CDMA demodulator without requiring training sequences or knowledge of interfering signals’ parameters. The B-MOE and its variants [13] only rely on side information about the useful channel signature sequence and chip timing, just like a conventional direct-sequence spread-spectrum (DS–SS) correlation receiver (CR) [12].

An enhanced version of the original B-MOE algorithm, named extended complex-valued blind anchored interference-mitigating detector (EC-BAID) [15], reveals robustness in the presence of interfering signals affected by a residual carrier-frequency shift. The new detector is also insensitive to a carrier-phase offset on the useful signal. This allows adoption of conventional carrier-phase estimators operating at symbol rate on the output of the detector [14]. Following promising theoretical and simulation results that confirm applicability of the EC-BAID to satellite networks,¹ the European Space Agency initiated a research activity to develop a real-time EC-BAID ASIC-based demodulator [22]. The main point of such a project is to demonstrate that an advanced adaptive CDMA demodulator can be integrated into a compact ASIC with limited additional complexity compared to a conventional CR.

After this brief introduction, we will describe in Section II the adaptive detector architecture, and we will outline in Section III the main issues related to the digital demodulator ASIC implementation, including numerical BER results. We will finally summarize in Section IV the main outcomes of the paper.

¹The EC-BAID cannot be applied to frequency-selective channels because of the so-called anchor mismatch phenomenon [12]. Recent results point out how to extend the B-MOE technique to a frequency-selective fading channel [28].

II. ADAPTIVE DETECTOR ARCHITECTURE
A. CDMA Signal Format and Baseline EC-BAID Detector Outline

The signal format we introduce here corresponds to conventional DS–SS QPSK modulation with real spreading. The real-spreading option (initially proposed for the W-CDMA third-generation radio interface) has a number of advantages, when used in conjunction with a linear multiuser detector, that are described in [15]. The main point is that real spreading, as compared to complex spreading, halves the number of spanned signal–space dimensions and thus improves the EC-BAID interference mitigation capabilities. The incoming binary data stream at rate $R_b = 1/T_b$ for the $k$th user is split between the two phase-quadrature (P-Q) rails by means of a serial-to-parallel converter. The resulting symbols $a_{k,p}(u), a_{k,q}(u) \in \{-1, 1\}$ are both spread by the same signature sequence $c_k(l)$ and then filtered prior to P-Q carrier modulation. The resulting $k$th user signal is given by

$$c_k(t) = \sqrt{P_k} \sum_{u=0}^{\infty} \left[a_{k,p}(u) + j a_{k,q}(u)\right] s_k(t - uT_s - \tau_k) \cdot \exp \left[j(2\pi \Delta f_k t + \phi_k)\right]$$

$$s_k(t) = \sum_{l=0}^{L-1} c_k(l) g_T(t - lT_c)$$

(1)

where $P_k$ is the $k$th signal power, $L$ is the period for both spreading sequences, $T_c$ is the chip time, $T_s = LT_c$ is the symbol time, $\Delta f_k$ is the $k$th carrier frequency offset with respect to the nominal frequency $f_0$, $\phi_k$ is the $k$th user carrier phase, $g_T(t)$ is the impulse response of the chip-shaping filter, and $\tau_k$ is the $k$th user signal delay.

Without loss of generality, we also assume $0 \leq \tau_k < T_s$. The signature sequences $c_k(l)$ are composite sequences, such as Walsh–Hadamard (WH) binary functions overlaid by extended pseudonoise (PN) scrambling sequences with the same period and start epoch [17]. Notice that we also assumed short codes, i.e., $T_s = LT_c$ in order for the EC-BAID to be applicable. This option is presently available for the uplink of the third-generation W-CDMA UTRA standard, and it is also contemplated by the ITU satellite IMT-2000 radio interface A [16]. In this case, the code length $L$ is also coincident with the spreading factor $T_s/T_c$. The extension to multirate CDMA signals with orthogonal variable spreading factor (OVSF) codes [18] is discussed in Section II-A2, and the relevant implementation details will be given in Section III-B4.
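The composite signature construction described above can be sketched as follows. This is a minimal illustration, assuming Sylvester-ordered WH rows and a seeded random ±1 sequence as a stand-in for the extended PN scrambler; the function names are hypothetical and do not come from the paper.

```python
import random

def walsh_hadamard(L):
    """Return the L x L Walsh-Hadamard matrix (+1/-1 entries), Sylvester construction."""
    if L < 1 or L & (L - 1):
        raise ValueError("L must be a power of two")
    H = [[1]]
    while len(H) < L:
        # [[H, H], [H, -H]] doubling step
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def composite_codes(L, seed=0):
    """Overlay every WH row with one common +/-1 scrambling sequence.

    Assumption: the scrambler is a seeded random sequence, used here only as a
    placeholder for the extended PN sequence with the same period and start epoch.
    """
    rng = random.Random(seed)
    scrambler = [rng.choice((-1, 1)) for _ in range(L)]
    return [[w * p for w, p in zip(row, scrambler)] for row in walsh_hadamard(L)]

def correlate(a, b):
    """Zero-lag correlation of two equal-length chip sequences."""
    return sum(x * y for x, y in zip(a, b))

codes = composite_codes(64)
```

Because every chip of the common scrambler squares to one, the overlay leaves the zero-lag orthogonality of the WH set untouched; asynchronous users, which are the case of interest for the detector, do not enjoy this property.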

For simplicity, we will assume that the carrier frequency error for the useful channel #1 is perfectly compensated for by means of an ideal automatic-frequency control (AFC) subsystem (i.e., $\Delta f_1 = 0$) and that perfect chip-timing recovery takes place ($\tau_1 = 0$). The impact of AFC residual carrier frequency error has been analyzed in [15]. The received signal goes through a baseband filter $g_R(t)$ performing Nyquist square-root-raised cosine chip-matched filtering (CMF), followed by chip-time sampling (or interpolation in a digital modem). Denote by $\nu(t) = \nu_p(t) + j\nu_q(t)$ the AWGN process, whose P-Q components $\nu_p(t)$ and $\nu_q(t)$ have two-sided power spectral density $N_0$. Neglecting intersymbol interference due to symbols spaced more than $T_s$ apart and considering that the maximum user delay is again $T_s$, the signal samples at the output of the CMF $g_R(t)$ at time $t_m = mT_c$ are

$$y(m) = \left[\sum_{k=1}^{K} c_k(t) + \nu(t)\right] \otimes g_R(t) \Big|_{t=mT_c}$$

$$\approx \sum_{k=1}^{K} \sum_{i=-2}^{1} \sqrt{P_k} \left[a_{k,p}(i + \lfloor m/L \rfloor) + j a_{k,q}(i + \lfloor m/L \rfloor)\right] f_k(mT_c - iT_s - \lfloor m/L \rfloor T_s) \exp\left[j(2\pi \Delta f_k mT_c + \phi_k)\right] + n(mT_c)$$

(2)

where $\lfloor \cdot \rfloor$ denotes integer part, and $f_k(t) = s_k(t - \tau_k) \otimes g_R(t)$. The discrete-time complex-valued white Gaussian noise process $n(mT_c) = n_p(mT_c) + j n_q(mT_c)$ has zero-mean, white, independent real and imaginary components with variance $\sigma_n^2 \triangleq E[n_h^2(m)] = N_0/T_c$, $h = p, q$. The range of the summation index $i$ in (2) is asymmetric because, on top of the ±1 adjacent intersymbol effect, the possible (positive) differential delay among interfering signals of up to one symbol is taken into account. Now, the description of the detector is simplified by adopting a vector notation (boldface symbols). The EC-BAID uses a three-symbol observation window to detect one information-bearing symbol, following results reported in [14]. Basically, it can be shown that, although in principle lengthening the observation window leads to improved BER performance in the presence of asynchronous MAI, an optimum window length exists for the adaptive detector (see Section III-C). Experimental results indicate that, for practical adaptation speeds, a three-symbol window exceeds requirements. The $3L$-dimensional array of CMF samples observed by the detector is

$$\mathbf{y}^e(r) = \left[\mathbf{y}_{-1}^T(r), \mathbf{y}_{0}^T(r), \mathbf{y}_{1}^T(r)\right]^T$$

$$\mathbf{y}_w(r) = \left[y((r+w)L), \ldots, y((r+w)L+L-1)\right]^T, \quad w = -1, 0, 1.$$

(3)
The detector output $b_1(r)$ is obtained by weighting these samples with $\mathbf{h}_1^e(r)$, the $3L$-dimensional array of the complex-valued detector coefficients. It is apparent that the detection of each symbol calls for observation of three symbol periods (i.e., the current, the leading, and the trailing ones). To ease implementation of the adaptive detector, as will be clear in the following, we resorted to the three-fold pipelined parallel implementation sketched in Fig. 1, wherein the first unit processes the $(r-1)$th, the $r$th, and the $(r+1)$th symbol periods for the detection of the $r$th symbol, the second unit processes the $r$th, the $(r+1)$th, and the $(r+2)$th periods for the detection of the $(r+1)$th symbol, and the third unit processes the $(r+1)$th, the $(r+2)$th, and the $(r+3)$th periods for the detection of the $(r+2)$th symbol.

The output data stream is obtained by sequentially selecting at rate $1/(3T_s)$ each of the three EC-BAID detector outputs by means of a multiplexer unit. We need thus a further clock reference ticking at the so-called supersymbol rate $B_{ss} = 1/(3T_s)$, i.e., once every three symbols. Using the unique decomposition (5), the EC-BAID output can be computed as

$$b_1(3s+n-1) = \frac{1}{L}\, \mathbf{h}_1^{e,n}(s)^T \cdot \mathbf{y}^e(3s+n-1), \quad n = 1, 2, 3$$

(5)

with $s$ running at supersymbol rate.
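The index bookkeeping behind the three-unit pipeline and its output multiplexer can be illustrated with a short sketch. It only restates the relation $r = 3s + n - 1$ used in (5); the helper names are hypothetical.

```python
def detector_for_symbol(r):
    """Map symbol index r to (supersymbol index s, detector number n), with r = 3s + n - 1."""
    s, offset = divmod(r, 3)
    return s, offset + 1

def window_symbols(r):
    """Symbol periods observed to detect symbol r: trailing, current, and leading."""
    return (r - 1, r, r + 1)

# The multiplexer visits detectors 1, 2, 3 in turn, once per supersymbol.
schedule = [detector_for_symbol(r) for r in range(6)]
```

Each unit therefore delivers one output every three symbols, and consecutive units work on observation windows shifted by one symbol period.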

The impulse response vector is determined according to the minimum output energy (MOE) criterion [12]. This is achieved through a canonical representation of $\mathbf{h}_1^e$ that helps find a blind adaptation rule, in that the complex detector coefficients are anchored to the user’s signature sequence [12], [14]

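$$\mathbf{h}_1^{e,n}(s) = \mathbf{c}_1^e + \mathbf{x}_1^{e,n}(s)$$

$$\mathbf{c}_1^e = \begin{bmatrix} \mathbf{0} \\ \mathbf{c}_1 \\ \mathbf{0} \end{bmatrix} \quad \mathbf{x}_1^{e,n}(s) = \begin{bmatrix} \mathbf{x}_{1,-1}^{n}(s) \\ \mathbf{x}_{1,0}^{n}(s) \\ \mathbf{x}_{1,1}^{n}(s) \end{bmatrix}$$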

(6)

through the “anchor” constraints

$$c_1^T \cdot x_{1,w}^n = 0 \quad w = -1, 0, 1.$$  

(7)

The error signal for detector $n$ is

$$\mathbf{e}_{1,w}^{n}(s) = b_1(3s+n-1) \cdot \left[ \mathbf{y}_w^{*}(3s+n-1) - \frac{\mathbf{y}_w^{*}(3s+n-1)^T \cdot \mathbf{c}_1}{L}\, \mathbf{c}_1 \right], \quad w = -1, 0, 1$$

(8)

where the asterisk denotes complex conjugation. It is shown in [15] that this error signal leads to a phase-invariant adaptive algorithm. Such invariance has also the side effect of making the EC-BAID algorithm robust to multiple-access interference carrier frequency offset [15]. If the three detectors were running independently, the updating equation for each detector would be simply

$$\mathbf{x}_1^{e,n}(s+1) = \mathbf{x}_1^{e,n}(s) - \gamma\, \mathbf{e}_1^{e,n}(s)$$

$$\mathbf{e}_1^{e,n}(s) = \begin{bmatrix} \mathbf{e}_{1,-1}^{n}(s) \\ \mathbf{e}_{1,0}^{n}(s) \\ \mathbf{e}_{1,1}^{n}(s) \end{bmatrix}$$

(9)

where $\gamma$ is the updating step, and with $s$ ticking at supersymbol rate. Equation (7) sets the so-called “chunk” orthogonality condition [14] for all of the three adaptive detector components $\mathbf{x}_{1,w}$. This makes the detector robust to the presence of nonrandom data patterns, such as a time-interleaved series of pilot symbols or a unique word. The “chunk” orthogonality prevents the $\mathbf{x}_{1,-1}$ and $\mathbf{x}_{1,1}$ subarrays from containing components in the useful signal space that, in the case of correlated data symbols, may have destructive effects on the useful signal component.
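One iteration of the baseline recursion for a single detector can be sketched as follows. This is a floating-point illustration under stated assumptions, not the authors' implementation: the output uses the anchored decomposition (6), the error is the conjugated input scaled by the output and projected chunk by chunk away from the code as in (7) and (8), and the update follows (9). The function names, the random test code, and the Gaussian input samples are all hypothetical.

```python
import random

def project_out(v, c):
    """Remove from the L-vector v its component along the real +/-1 code c."""
    k = sum(vi * ci for vi, ci in zip(v, c)) / len(c)
    return [vi - k * ci for vi, ci in zip(v, c)]

def detector_step(x, c, y, gamma):
    """One blind update of a 3L-tap anchored detector.

    x, y: lists of three L-element chunks (w = -1, 0, 1); c: L-chip code.
    Returns the output b and the updated chunks of x.
    """
    L = len(c)
    h = [x[0], [ci + xi for ci, xi in zip(c, x[1])], x[2]]       # h = c^e + x
    b = sum(hi * yi for hw, yw in zip(h, y) for hi, yi in zip(hw, yw)) / L
    new_x = []
    for xw, yw in zip(x, y):
        e = project_out([b * yi.conjugate() for yi in yw], c)    # error chunk
        new_x.append([xi - gamma * ei for xi, ei in zip(xw, e)])
    return b, new_x

rng = random.Random(1)
L = 8
c = [rng.choice((-1, 1)) for _ in range(L)]

# Noise-free check: with x = 0 the detector reduces to a plain correlator.
a = 1 - 1j
zeros = [0j] * L
b0, _ = detector_step([zeros, zeros, zeros], c, [zeros, [a * ci for ci in c], zeros], 1e-3)

# Run a few updates on random samples and measure the anchor-constraint residuals.
x = [[0j] * L for _ in range(3)]
for _ in range(50):
    y = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(L)] for _ in range(3)]
    b, x = detector_step(x, c, y, gamma=1e-3)
anchor_residuals = [abs(sum(xi * ci for xi, ci in zip(xw, c))) for xw in x]
```

Since every error chunk is orthogonal to the code by construction, the three chunks of the adaptive vector stay orthogonal to it up to rounding error, which is the “chunk” condition discussed above.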

1) Optimized EC-BAID Architectures: The “baseline” equations for the EC-BAID developed in the previous subsection represent the starting point toward an optimized detector architecture that attains a higher convergence speed and a reduced hardware complexity. Romero-Garcia et al. [15] introduce the so-called “overlap and add” architecture (O&A, Fig. 2), wherein the three units still process input data symbols and produce in turn the desired output $b_1$ but, this time, using a unique vector $\mathbf{x}_1^e$. This vector is now updated at supersymbol time with the sum of the three-unit error signals. This new arrangement is summarized in the relevant equations in Table I.

Fig. 2. “O&A” top-level functional block diagram.

A further enhanced architecture is the so-called “select and add” (S&A), which provides the same BER and convergence speed performance as the O&A but allows a considerable circuit complexity reduction. The S&A architecture exploits the possibility of using a clock of period $T_c/3$ so that the arithmetical part of the circuit can be reused three times for each chip period $T_c$ (a sort of hardware multiplexing). With three operations per chip over the $L$ chips of a symbol, this allows the computation of the output $b_1(r)$ in one period $T_s$, getting the entire product $\mathbf{x}_1^{e}(r)^T \cdot \mathbf{y}^e(r)$, and the update, in the same period $T_s$, of the whole vector $\mathbf{x}_1^e$ with the $3L$ error signal coefficients. The advantage obtained from these architectures (O&A and S&A with respect to the “baseline” one) is two-fold: 1) (area saving) only one vector $\mathbf{x}_1^e$ rather than three needs to be stored, and in the S&A version, there are fewer synchronization delay elements than in the O&A; 2) (speed) the three-fold faster updating rate brings an increased convergence speed. The conjecture is confirmed by Fig. 3, which shows the simulated BER transient diagram of the S&A versus the baseline architecture; there, the system parameters are $L = 64$, $N = 32$ asynchronous users with uniformly spaced delays on a symbol period, $E_b/N_0 = 6$ dB, $C/I = -6$ dB, and $\gamma = 1.22 \cdot 10^{-4}$; the signature sequences are Walsh–Hadamard channelization codes covered by a unique extended-PN scrambling sequence. The improvement of a factor of roughly three in the convergence speed is evident. Notice that in the S&A architecture, the $\mathbf{x}_1^e$ vector is updated with one error contribution every $T_s$, while in the “baseline” version, each $\mathbf{x}_1^{e,n}$ vector is updated with one error contribution every $3T_s$. Again, the equations for the S&A architecture are shown in Table I, and the relevant HW implementation is depicted in Fig. 4.

Blocks #1 and #2 evaluate the correlations $\mathbf{y}(r)^T \cdot \mathbf{c}_1$ and $\mathbf{x}_1^{e}(r)^T \cdot \mathbf{y}^e(r)$, respectively, yielding the output $b_1(r)$ at symbol rate. The vector $\mathbf{x}_1^e$ is stored in memory #6 before undergoing the orthogonalization to the spreading code (7) (see also Section III-B), and its $3L$ elements are updated one at a time, one every $T_c/3$; in particular, the coefficients of $\mathbf{x}_1^e$ relevant to the $i$th chip of $\mathbf{y}(r-1)$, $\mathbf{y}(r)$, and $\mathbf{y}(r+1)$ are updated during the $i$th chip interval within the $r$th symbol period. To this aim, memory #7 stores the most recent $3L$ input chips, and multiplexer #8 properly realigns the internal dataflow. The three vectors $\left[(\mathbf{x}_{1,w}^T \cdot \mathbf{c}_1)/L\right] \mathbf{c}_1$ with $w = -1, 0, 1$, which have to be subtracted from $\mathbf{x}_{1,w}$ in order to make $\mathbf{x}_{1,w}(r)$ orthogonal to the code, are built by the accumulator #5. The timing diagram of the S&A main signals is shown in Fig. 5. The AGC on the feedback loop (block #4) is needed in order to keep the output amplitude $|b_1|$ constant, irrespective of the different signal to noise + interference ratio (SNIR) operating conditions that may be experienced (see Section III-B2 for further details).
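
As a worked step, assuming the anchored decomposition (6) and writing $\mathbf{y}_0(r)$ for the central chunk of the observation window, the detector output splits into the two correlations that blocks #1 and #2 evaluate:

$$b_1(r) = \frac{1}{L}\left[\mathbf{c}_1^e + \mathbf{x}_1^e(r)\right]^T \mathbf{y}^e(r) = \frac{1}{L}\,\mathbf{c}_1^T\, \mathbf{y}_0(r) + \frac{1}{L}\sum_{w=-1}^{1} \mathbf{x}_{1,w}^T(r)\, \mathbf{y}_w(r)$$

The first term is the output of a conventional CR, while the second term is the adaptive correction contributed by the three chunks of the anchored vector.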

As discussed in Section II-A1, we focused on the S&A architecture because of its considerable hardware-complexity saving in terms of both arithmetical elements and memory cells. S&A calls for a single arithmetical unit, thanks to hardware multiplexing, and only a single vector $\mathbf{x}_1^e$, together with the $3L$ input samples ($\mathbf{y}^e$), has to be stored. On the contrary, O&A needs three distinct arithmetical units and some extra delays (memory elements) to provide proper timing between signals of the various circuit branches (Fig. 2). S&A, without affecting overall performance, allows for nearly 50% gate complexity saving and 70% RAM capacity saving at the expense of a three-times-higher clock rate. In particular, by considering the overall gate complexity reduction and clock frequency increase, a power consumption growth below 10% is expected.

### TABLE I
**PROPOSED EC-BAID ALGORITHMS**

<table>
<thead>
<tr>
<th>Overlap and Add</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output computation:</td>
</tr>
<tr>
<td>\( b_1(3s + n - 1) = \frac{1}{L} \mathbf{h}_1^{e}(s)^T \cdot \mathbf{y}^e(3s + n - 1) \) with \( \mathbf{h}_1^e(s) = \mathbf{x}_1^e(s) + \mathbf{c}_1^e \)</td>
</tr>
<tr>
<td>Updating of the vectors: \( \mathbf{x}_{1,w}(s + 1) = \mathbf{x}_{1,w}(s) - \gamma \left[ \mathbf{e}_{1,w}^{1}(s - 1) + \mathbf{e}_{1,w}^{2}(s - 1) + \mathbf{e}_{1,w}^{3}(s - 1) \right] \)</td>
</tr>
<tr>
<td>\( \mathbf{e}_{1,w}^{n}(s) = b_1(3s + n - 1) \left[ \mathbf{y}_w^{*}(3s + n - 1) - \frac{\mathbf{y}_w^{*}(3s + n - 1)^T \cdot \mathbf{c}_1}{L} \mathbf{c}_1 \right] \), \( n = 1, 2, 3 \), \( w = -1, 0, 1 \), with \( s \) super-symbol index and</td>
</tr>
<tr>
<td>\( \mathbf{c}_1^e = \begin{bmatrix} \mathbf{0} \\ \mathbf{c}_1 \\ \mathbf{0} \end{bmatrix} \), \( \mathbf{x}_1^e(s) = \begin{bmatrix} \mathbf{x}_{1,-1}(s) \\ \mathbf{x}_{1,0}(s) \\ \mathbf{x}_{1,1}(s) \end{bmatrix} \), \( \mathbf{e}_1^{e,n}(s) = \begin{bmatrix} \mathbf{e}_{1,-1}^{n}(s) \\ \mathbf{e}_{1,0}^{n}(s) \\ \mathbf{e}_{1,1}^{n}(s) \end{bmatrix} \)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Select and Add</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output computation:</td>
</tr>
<tr>
<td>\( b_1(r) = \frac{1}{L} \mathbf{h}_1^{e}(r)^T \cdot \mathbf{y}^e(r) \) with \( \mathbf{h}_1^e(r) = \mathbf{x}_1^e(r) + \mathbf{c}_1^e \)</td>
</tr>
<tr>
<td>Updating of the vector: \( \mathbf{x}_{1,w}(r + 1) = \mathbf{x}_{1,w}(r) - \gamma \mathbf{e}_{1,w}(r - 1) \)</td>
</tr>
<tr>
<td>\( \mathbf{e}_{1,w}(r) = b_1(r) \left[ \mathbf{y}_w^{*}(r) - \frac{\mathbf{y}_w^{*}(r)^T \cdot \mathbf{c}_1}{L} \mathbf{c}_1 \right] \), \( w = -1, 0, 1 \)</td>
</tr>
<tr>
<td>with \( r \) symbol index and</td>
</tr>
<tr>
<td>\( \mathbf{c}_1^e = \begin{bmatrix} \mathbf{0} \\ \mathbf{c}_1 \\ \mathbf{0} \end{bmatrix} \), \( \mathbf{x}_1^e(r) = \begin{bmatrix} \mathbf{x}_{1,-1}(r) \\ \mathbf{x}_{1,0}(r) \\ \mathbf{x}_{1,1}(r) \end{bmatrix} \), \( \mathbf{e}_1^e(r) = \begin{bmatrix} \mathbf{e}_{1,-1}(r) \\ \mathbf{e}_{1,0}(r) \\ \mathbf{e}_{1,1}(r) \end{bmatrix} \)</td>
</tr>
</tbody>
</table>

2) CDMA Multirate Extension: All terrestrial or satellite third-generation standards for wireless CDMA encompass multirate access capability to accommodate multimedia services.

All rates are integer submultiples of a maximum rate, which depends on the kind of terminal (fixed, indoor, fully mobile, etc.) and network (indoor, outdoor terrestrial, satellite, etc.).

Multirate CDMA is typically achieved by a combination of OVSF and multicode techniques (required only for the highest rate). Multirate transmission through multicodes has no impact on the receiver design, except for requiring an increased number of parallel detectors. CR implementation with OVSF is straightforward, as the only modification consists in changing the number of accumulated chip samples according to the actual spreading-sequence length. The situation is quite different for the EC-BAID with OVSF. Assume that our reference channel (#1) has the highest rate allowed in the network. All of the interfering signals will bear a longer code than user #1. The latter will therefore see a set of cyclically varying spreading codes on the interfering signal, whose repetition period will be in general the ratio \( M \) between the highest and the lowest bit rate allowed in the network. A (single-symbol) detector for user #1 shall therefore be designed exploiting this cyclical regularity. It will be made of a bank of \( M \) conventional interference-canceling single-symbol detectors, which are cyclically operated every \( M \) symbol periods, in such a way that each one always sees the same (sub)code on the interfering signals. The outputs of such detectors are then “demultiplexed” to give a symbol-rate stream for subsequent processing. Such a functional architecture is outlined in Fig. 6.

The situation is different if user #1 has the lowest rate in the network. In such a condition, it is easy to see that we can use a conventional receiver with no modifications. Of course, if user #1 has an intermediate bit rate, the only parameter that matters is the ratio between its actual rate and the minimum rate in the network to build a multirate architecture as in Fig. 6. This figure represents the functional block diagram for a user operating at a bit rate \( M \) times the basic system bit rate \( 1/T_b^{\text{bkw}} \). As is apparent from the previous discussion, the basic EC-BAID detectors can be reused, although each of the \( M \) parallel detectors operates on a disjoint short symbol. The \( M \) detectors are sequentially activated with periodicity \( T_{\text{av}} \) and a duty cycle \( T_{\text{av}} / M \). The increased complexity to cope with high-data-rate users is, however, apparent, as will be clarified in Section III-B4 dealing with ASIC multirate implementation issues.
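The cyclic structure exploited by the detector bank can be illustrated with a small sketch. It assumes plain OVSF codes in tree order with no scrambling overlay and chip-synchronous users, which is a simplification of the scenario in the text; the function names and the chosen code indices are hypothetical.

```python
def ovsf_codes(sf):
    """All OVSF codes of spreading factor sf (a power of two), in tree order.

    Each code c spawns the two children (c, c) and (c, -c).
    """
    codes = [[1]]
    while len(codes[0]) < sf:
        codes = [child for c in codes for child in (c + c, c + [-v for v in c])]
    return codes

def subcodes(long_code, L):
    """Split a long code into the consecutive length-L chunks seen by a length-L user."""
    return [long_code[i:i + L] for i in range(0, len(long_code), L)]

L, M = 32, 4
short = ovsf_codes(L)
long_codes = ovsf_codes(L * M)

user = short[5]                  # high-rate reference user
interferer = long_codes[2]       # low-rate code descending from short[0]
chunks = subcodes(interferer, L)
correlations = [sum(a * b for a, b in zip(user, ch)) for ch in chunks]

# Symbol r of the reference user is handled by detector r mod M of the bank,
# so every detector always faces the same interfering subcode.
bank_index = [r % M for r in range(2 * M)]
```

In this simplified setting, the chunks are sign-flipped copies of the ancestor code and repeat with period $M$ symbols of the reference user, so detector $r \bmod M$ always faces the same interfering subcode.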

III. IMPLEMENTATION ISSUES

A. Front End

In order to carry out testing and performance measurement on the detector circuit, the EC-BAID ASIC is being integrated into a breadboard based on field programmable gate arrays (FPGAs) and DSP and featuring all of the functionalities of a full-digital multirate receiver: intermediate-frequency (IF) analog-to-digital conversion (ADC), front-end signal processing (i.e., downconversion, decimation, and filtering), and ancillary functions (i.e., clock and carrier synchronization). The breadboard, whose functional block diagram including the ASIC is sketched in Fig. 7, was designed in order to meet the technical specifications listed in Table II. The input of the receiver breadboard is a signal modulated at 70 MHz, which is a typical value for the IF of satellite communications. Such an IF signal is then digitized on 7 bits by means of an ADC at the rate \( f_{\text{a}} \) and then digitally downconverted to baseband. Frequency conversion is carried out using a digitally controlled oscillator (DCO), which drives the in-phase (P) and quadrature (Q) rotators, featuring the so-called “coordinate rotation digital computer” (CORDIC) algorithm [25]. After downconversion, the quantized signal samples undergo filtering and decimation down to the rate of \( f_{\text{b}} = 4 \) samples per chip interval. Since the front-end digital signal is significantly oversampled, decimation is carried out via a so-called cascaded integrator and comb (CIC) decimator–interpolator [23]. The amplitude response of the CIC filter (in our implementation, a fourth-order cascade with unit differential delay) is not flat within the useful signal’s band, so that some form of compensation (equalization) should be adopted [24]. To this aim, we synthesized a low-complexity joint equalizer–CMF as a single 33-tap FIR filter. The 8-bit P (or Q) samples at the output of the equalizer–CMF are subsequently fed into first-order (i.e., linear) interpolators controlled by an estimate of the actual timing delay coming from the chip-clock tracking unit (CCTU). Each interpolator outputs a stream of samples at twice the chip rate \( 2R_c \), which is split into two interleaved streams at chip rate \( R_c \). The 7-bit “on-time” prompt samples are used by the EC-BAID ASIC for data detection and by the frequency error detector (FED) [26] for fine carrier tuning of the DCO, while the early–late samples are used by the CCTU for fine chip-clock recovery by means of a digital delay lock loop (DDLL) [27].
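The CIC decimation stage mentioned above can be sketched in a few lines. This is a generic integer CIC decimator with unit differential delay, written under the assumption of an arbitrary decimation ratio `rate` (the ratio used on the breadboard is not stated here); the function name is hypothetical and the output is left unscaled.

```python
def cic_decimate(samples, rate, order=4):
    """Decimate an integer sequence by `rate` with an `order`-stage CIC filter.

    Integrators run at the input rate, combs (unit differential delay) at the
    output rate. The DC gain is rate**order; no rescaling is applied.
    """
    integ = [0] * order
    comb_prev = [0] * order
    out = []
    for n, v in enumerate(samples):
        for i in range(order):              # integrator cascade
            integ[i] += v
            v = integ[i]
        if (n + 1) % rate == 0:             # keep one sample out of `rate`
            for i in range(order):          # comb cascade
                v, comb_prev[i] = v - comb_prev[i], v
            out.append(v)
    return out

# A constant input settles to the DC gain rate**order once the transient is over.
settled = cic_decimate([1] * 80, rate=8)[-1]
```

The cascade is equivalent to `order` boxcar averagers of length `rate`, whose drooping passband is the reason an equalizer is merged with the CMF in the text.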

Extensive bit-true simulations of the front-end chain were carried out to assess the performance degradation caused by signal quantization. In this respect, Fig. 8 shows the BER obtained using the quantized front end followed by a conventional CR (no carrier frequency–phase error). The solid line is the ideal theoretical BER (i.e., floating-point precision), while the marks are the bit-true model results. As is apparent, the implementation losses are around 0.1–0.2 dB, while the overall front-end complexity is kept at a sustainable level. All bit-true simulations were carried out in FORTRAN through proper selection of the datapath wordlength for all of the variables involved and using the same algorithm versions as those implemented in the VHDL model.

B. Adaptive Detector Implementation

1) Quantization Effects: The first step in ASIC design was the assessment of the impact of fixed-point arithmetic on the performance of S&A. In order to reduce the hardware complexity with a minimum impact on the BER performances, some well-known design techniques were adopted: For example, bus dimensions are kept under control by discarding the least significant bits where possible or by saturating signals between proper levels. Such operations may lead to numerical instability if the algorithm loop is implemented in hardware according to (8) and (9).

Even if the adaptation error contribution \( \mathbf{e}_1^e \) is made orthogonal to the code sequence by construction [see (8)], the above-mentioned truncation and saturation operations may cause a small deviation from perfect orthogonality; this unwanted \( \mathbf{e}_1^e \) component parallel to the code may have a nonzero mean value, which then builds up an accumulation error that impairs convergence.

A solution to this problem (see also [12] and [19]) is to split the updating operation into two steps: First, the \( \mathbf{x}_1^e \) vector is updated with a contribution which is not orthogonal to the code \( \mathbf{c}_1 \):

\[
\mathbf{x}_1^{e,\text{not orthog}}(r + 1) = \mathbf{x}_1^{e,\text{orthog}}(r) - \gamma b_1(r - 1) \mathbf{y}^{e*}(r - 1)
\]

(10)

and then it is made orthogonal to \( \mathbf{c}_1 \) (see block #5 in Fig. 4):

\[
\mathbf{x}_{1,w}^{\text{orthog}}(r) = \mathbf{x}_{1,w}^{\text{not orthog}}(r) - \frac{\mathbf{x}_{1,w}^{\text{not orthog}}(r)^T \cdot \mathbf{c}_1}{L} \mathbf{c}_1, \quad w = -1, 0, 1.
\]

(11)

In such a way, condition (7) is guaranteed, even if there is some bit truncation on the \( \mathbf{e}_1^e \) bus. Given this approach, finite arithmetic effects on all of the other S&A circuit internal signals can be regarded as additional noise without causing any particular trouble to the algorithm convergence toward the steady-state vector \( \mathbf{x}_1^{e,\text{opt}} \).
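The benefit of the two-step update can be illustrated numerically on a single chunk. The sketch below is an assumption-laden toy model, not the ASIC datapath: the term $b_1 \mathbf{y}^{*}$ is replaced by random complex samples, truncation is modeled as a floor to 8 fractional bits, and the final orthogonalization is done in floating point. All names and parameter values are hypothetical.

```python
import math
import random

def truncate(v, frac_bits):
    """Floor real and imaginary parts to multiples of 2**-frac_bits (bit truncation)."""
    q = 2.0 ** frac_bits
    return complex(math.floor(v.real * q) / q, math.floor(v.imag * q) / q)

def project_out(v, c):
    """Remove from v its component along the +/-1 code c."""
    k = sum(vi * ci for vi, ci in zip(v, c)) / len(c)
    return [vi - k * ci for vi, ci in zip(v, c)]

def along_code(v, c):
    """Magnitude of the component of v parallel to the code."""
    return abs(sum(vi * ci for vi, ci in zip(v, c))) / len(c)

rng = random.Random(7)
L, gamma, bits = 16, 0.05, 8
c = [rng.choice((-1, 1)) for _ in range(L)]
x_one_step = [0j] * L
x_two_step = [0j] * L
for _ in range(2000):
    g = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(L)]  # stands for b_1 * y^*
    # One-step: orthogonalize the error, then truncate it on the bus.
    e = [truncate(gamma * ei, bits) for ei in project_out(g, c)]
    x_one_step = [xi - ei for xi, ei in zip(x_one_step, e)]
    # Two-step: truncated non-orthogonal update, then orthogonalize the vector itself.
    u = [xi - truncate(gamma * gi, bits) for xi, gi in zip(x_two_step, g)]
    x_two_step = project_out(u, c)

drift_one_step = along_code(x_one_step, c)
drift_two_step = along_code(x_two_step, c)
```

In the one-step variant, the truncation residue along the code is accumulated and never removed, whereas re-projecting the vector at every iteration keeps the parallel component at rounding-error level regardless of what was truncated upstream.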

In selecting the tradeoff between BER performance and hardware complexity, priority was given to limiting implementation losses; on the other hand, the VHDL description of the circuit is fully parametric, allowing for the design of an area-saving version of the detector with minimum effort.

2) Automatic Loop-Gain Control: Fig. 9 shows the functional block diagram of the digital AGC circuit of Fig. 4. We recall that the goal of the AGC is to keep the EC-BAID convergence speed constant, irrespective of the different SNIR conditions. Considering the EC-BAID equations shown in Section II-A, the choice of the adaptation step \( \gamma \) determines the EC-BAID loop transient length and the rms error; in the practical implementation, the input signal \( \mathbf{y}^e \) has to be scaled by a factor, which depends on the SNIR conditions, in order to be digitally converted by an ADC. The AGC in Fig. 9 is needed to compensate for such a variable factor, and it is realized by means of a closed loop which keeps the mean value of \( |b_1| \) at a constant value \( b_{1, \text{REF}} \), according to the following:

\[
b_1 = G \cdot b_1^{(i)} \quad \epsilon = |b_1| - b_{1, \text{REF}}
\]

\[
G(r + 1) = G(r) - \gamma_{\text{AGC}} \cdot \epsilon(r)
\]

where \( b_1^{(i)} \) and \( b_1 \) are, respectively, the input and the output signals of the AGC, which is characterized by a variable gain \( G \) and an adaptation step \( \gamma_{\text{AGC}} \).
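
The AGC recursion above can be exercised with a minimal sketch. It assumes a noise-free, constant-modulus input and illustrative values for the reference level and the adaptation step; the function name `agc_run` is hypothetical.

```python
def agc_run(b_in, b_ref, gamma_agc, gain=1.0):
    """Apply G(r+1) = G(r) - gamma_agc * (|b_1(r)| - b_ref), with b_1(r) = G(r) * b_in(r)."""
    out = []
    for s in b_in:
        b = gain * s
        out.append(b)
        gain -= gamma_agc * (abs(b) - b_ref)
    return out, gain

# Constant-modulus QPSK-like input with |b_in| = 0.5 and target amplitude 1.
symbols = [0.5 * (1 + 1j) / abs(1 + 1j)] * 400
outputs, final_gain = agc_run(symbols, b_ref=1.0, gamma_agc=0.1)
```

With these values the gain error shrinks geometrically and the loop settles at the gain that maps the input amplitude onto the reference; a smaller `gamma_agc` gives a slower but smoother loop when the input is noisy.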

3) EC-BAID ASIC Implementation: The design of the circuit was carried out by using a “top-down” approach based on very high-speed integrated-circuit hardware description language (VHDL) to obtain a technology-independent, highly flexible description at the behavioral level.

Once a “synthesizable VHDL” description was reached, gate-level design was performed using the Synopsys logic synthesis tool together with the library of the relevant CMOS technology supplied by ST Microelectronics. At each design step, the VHDL model (behavioral, synthesizable, and gate-level) was automatically verified against the simulation results of a FORTRAN bit-true model. The design database was subsequently transferred to the Cadence design framework via electronic data interchange format (EDIF) for design back-end operations (such as floor planning, placement, routing, and post-layout verification). The resulting ASIC, whose layout is shown in Fig. 10, contains 25 K gates plus 23 K bits of static RAM, for a total equivalent to roughly 45 K gates. This figure corresponds to a nearly 35% complexity increase compared to a conventional detector, if we consider the whole receiver (EC-BAID or CR, with front-end and ancillary functions). The internal clock used to exploit hardware multiplexing is $4R_c$, instead of $3R_c$; this choice simplifies the external clock generation and allows for proper operation when the chip time-tracking unit needs to anticipate the interpolator outputs. The chip has 48 input–output (I–O) pads and 16 power–ground pads, and it is packaged in a ceramic 64 plastic leaded-chip carrier. The technology used is the HCMOS7 CMOS process characterized by 0.25-$\mu$m gate length (0.20 $\mu$m effective), up to 6 levels of metal layers, and 2.5 V power supply. A subset of the ASIC configuration parameters is serially transmitted on three pins in order to reduce ASIC pin count without negatively affecting the ASIC interface with the external world. This approach entails a minimum hardware complexity increase due to the additional serial protocol decoding circuitry, making the 64-pin package suitable for the chip.

4) Extension to Multirate Operation: The EC-BAID circuit is designed to operate with a maximum code-sequence length \( L = 128 \); a 384-word SRAM is needed to store the relevant \( 3L \) coefficients. With shorter codes \( (L = 32, 64) \), only part of this memory is used; the different memory subblocks can therefore be used separately to achieve multirate demodulation within the same ASIC. As discussed in Section II-A2, the multirate demodulator operating with a (short) code length \( L \) must use \( M \) distinct vectors \( \mathbf{x}(r) \), one for each of the \( M \) subportions of the (long) interfering code, where \( M \) is the ratio between the current bit-rate and the minimum bit-rate allowed in the network. If the interfering code lengths are among those supported by the circuit \( (L = 32, L = 64, \text{and } L = 128) \), it is possible to store the \( M \) \( \mathbf{x}(r) \) vectors \( (L \text{ chips long}) \) on the \( M \) memory portions \( (L \cdot M = L_{\text{MAX}}) \) and to use the whole SRAM cyclically with a reasonable increase of complexity. The memory control block shown in Fig. 4 has to be added to implement proper memory addressing, and some synchronization elements have to be inserted to ensure correct summations on the $x^3_1$ loop adders.
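The memory partitioning described above can be sketched as a simple address computation. The layout below is an assumption made for illustration only: each of the \( M \) vectors is taken to occupy a contiguous block of \( 3L \) words (so that \( M \) blocks fill the 384-word SRAM), and the block is selected cyclically as the symbol index modulo \( M \). The function names are hypothetical and do not describe the actual memory control block.

```python
L_MAX = 128               # maximum code length supported by the circuit (from the text)
SRAM_WORDS = 3 * L_MAX    # 384-word coefficient SRAM (from the text)


def multirate_memory_map(L):
    """Return (base_address, length) for each of the M coefficient blocks.

    Assumption: blocks are contiguous and each holds 3*L coefficients.
    """
    if L not in (32, 64, 128):
        raise ValueError("code length must be 32, 64, or 128")
    M = L_MAX // L                     # number of sub-portions, since L * M = L_MAX
    words_per_vector = 3 * L
    return [(m * words_per_vector, words_per_vector) for m in range(M)]


def sram_address(L, r, k):
    """Address of coefficient k of the vector used at symbol index r.

    Assumption: the M vectors are used cyclically, i.e. block index = r mod M.
    """
    if L not in (32, 64, 128):
        raise ValueError("code length must be 32, 64, or 128")
    if not 0 <= k < 3 * L:
        raise IndexError("coefficient index out of range")
    M = L_MAX // L
    m = r % M                          # cyclic selection of the memory portion
    return m * 3 * L + k
```

For \( L = 32 \) this gives four 96-word blocks that together cover the 384 words, while for \( L = 128 \) there is a single block and the addressing reduces to the single-rate case.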

C. Overall Performance Results

We report hereafter a few noteworthy samples from our extensive collection of simulation results. Figs. 11 and 12 show the simulated EC-BAID ASIC performance, obtained from both floating-point and bit-true FORTRAN simulations for various code lengths and CDMA loading conditions. In particular, the BER performance of the whole bit-true system (EC-BAID with front-end and synchronization loops) is compared with that of the ideal one (EC-BAID floating-point algorithm with ideal synchronization). As is apparent from Figs. 11 and 12, even with large CDMA loading and power unbalance ($C/I = -6$ dB), implementation losses are kept under control and satisfy the overall demodulator impairment specification outlined in Table II. Design robustness to the near–far effect is confirmed by Fig. 13: the EC-BAID withstands a large range of C/I ratios with minimal impact on the overall BER.

Due to the adaptive algorithm, the $3L$ elements of $x^3_1$ fluctuate around their steady-state values; more precisely, the outermost vector elements may have a mean value that is negligible with respect to their fluctuations, which is why better performance may be obtained by forcing them to zero. According to FORTRAN simulation results (see Fig. 14), the best BER performance for the selected value of $\gamma$ is obtained by reducing the $x^3_1$ vector length to $2L$ elements. All the simulation results presented in this section refer to such a shortened observation window. The ASIC nevertheless allows the window length to be programmed within the range $L$ to $3L$ with 1-code-chip precision.
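The shortening of the observation window can be sketched as a masking operation on the $3L$-element vector. The sketch assumes that the window is centered on the vector and that, when the number of zeroed elements is odd, the extra zero goes to the trailing side; neither detail is stated in the text, and the function name is hypothetical.

```python
def apply_observation_window(x, L, window_len):
    """Force the outermost elements of a 3L-element vector to zero.

    Assumptions (not from the paper): the kept window is centered, and an odd
    number of zeroed elements puts the extra zero at the trailing end.
    """
    if len(x) != 3 * L:
        raise ValueError("x must have 3*L elements")
    if not L <= window_len <= 3 * L:
        raise ValueError("window length must lie between L and 3*L")
    n_zero = 3 * L - window_len        # total number of elements forced to zero
    left = n_zero // 2                 # zeros at the leading end
    right = n_zero - left              # zeros at the trailing end
    return [0] * left + list(x[left:3 * L - right]) + [0] * right
```

With `window_len = 2 * L`, the first and last $L/2$ elements are zeroed, which corresponds to the $2L$-element window used for the results in this section.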

A further improvement of the original EC-BAID algorithm [14] has been added in order to cope efficiently with synchronization errors. In the case of asynchronous MAI, the chip-timing loop may lead to a timing bias error of up to $0.05T_c$, as indicated by computer simulations. Even though the EC-BAID can operate well under a timing jitter of this magnitude [14], such a timing bias may cause a significant degradation of the BER performance. To mitigate this effect, a “leakage” variant [20] has been

---

3 The additional control logic for this multirate extension calls for a hardware complexity increase of nearly 5 K gates.
represents the leak factor. An example of the gate length (0.20 μm effective), up to six levels of metal layers, and a 2.5-V power supply. The EC-BAID was designed for a negligible implementation loss, while the overall receiver loss is estimated not to exceed 0.5 dB under the specified maximum loading conditions.

A multirate multigeneration extension to the demodulator is presently being implemented in the user terminal assembly of a testbed representative of the recently ITU-approved IMT-2000 satellite radio interface A (SW-CDMA) [16]. The proposed scheme has been shown to outperform the CR under practical operating conditions of a satellite mobile CDMA network [21].

Further work is required to design a new version of the proposed blind linear detector that could operate in the frequency-selective fading channel typical of terrestrial networks, as discussed in [28], for instance.

ACKNOWLEDGMENT

The authors gratefully acknowledge the stimulating discussions with J. Romero-Garcia, now with Nokia TLC, England, G. Colleoni from ST Microelectronics, Italy, and M. Rovini from Pisa University, Italy.

REFERENCES


[22] D. Boudreau et al., http://www.ing.unipi.it/~d7384.


Luca Fanucci (S’95–A’96) was born in Montecatini Terme, Italy, in 1965. He received the Doctorate of Engineering (cum laude) and the Research Doctorate degrees, both in electronic engineering, from the University of Pisa, Pisa, Italy, in 1992 and 1996, respectively. From 1992 to 1996, he was with the European Space Agency’s Research and Technology Center, Noordwijk, The Netherlands, where he was involved in several activities in the field of VLSI for digital communications. He is currently a Research Scientist with CNR, the Italian National Research Council, at the Centro Studio per Metodi e Dispositivi per Radiotrasmissioni (CSMDR), Pisa. His main interests are in the areas of digital filter design, high-speed CMOS integrated circuit design, VLSI architectures for real-time image and signal processing, and applications of VLSI technology to digital communication systems.

Riccardo De Gaudenzi (M’88–SM’97) was born in Italy in 1960. He received the Doctorate of Engineering degree (cum laude) in electronic engineering from the University of Pisa, Pisa, Italy, in 1985 and the Ph.D. degree from the Technical University of Delft, Delft, The Netherlands, in 1999. From 1986 to 1988, he was with the European Space Agency (ESA), Stations and Communications Engineering Department, Darmstadt, Germany, where he was involved in satellite telecommunication ground systems design and testing. In particular, he followed the development of two new ESA satellite tracking systems. In 1988, he joined ESA’s Research and Technology Centre (ESTEC), Noordwijk, The Netherlands, where he is presently the head of the Communication Systems Section. He is responsible for the definition and development of advanced satellite communication systems for fixed and mobile applications. He is also involved in the definition of the Galileo European Navigation System. In 1996, he spent one year with Qualcomm Inc., San Diego, CA, in the Globalstar LEO project system group under an ESA fellowship. His current interests are in efficient digital modulation and access techniques for fixed and mobile satellite services, synchronization topics, adaptive interference mitigation techniques, and communication systems simulation techniques.

Filippo Giannetti was born in Pontedera, Italy, on September 16, 1964. He received the Doctorate of Engineering (cum laude) and the Research Doctorate degrees in electronic engineering from the University of Pisa, Pisa, Italy, in 1989, and from the University of Padova, Padova, Italy, in 1993, respectively. In 1988 and 1989, he spent a research period at TELETTRA (now ALCATEL), Vimercate, Milan, Italy, working on error-correcting codes for SONET–SDH radio modems. In 1992, he spent a research period at the European Space Agency Research and Technology Centre (ESA–ESTEC), Noordwijk, The Netherlands, where he was engaged in several activities in the field of digital satellite communications. From 1993 to 1998, he was a Research Scientist at the Department of Information Engineering, University of Pisa, where he is currently Associate Professor of Telecommunications. His main research interests are in mobile and satellite communications, synchronization, and spread-spectrum systems.

Edoardo Letta was born in Pisa, Italy, in 1972. He received the Doctorate of Engineering degree (cum laude) in electronic engineering from the University of Pisa, Pisa, Italy, in 1998. He is currently pursuing the Ph.D. degree in electronic engineering at the Department of Information Engineering, University of Pisa. His main interests are in the areas of digital filter design, VLSI architectures for real-time signal processing, and applications of VLSI technology to digital communication systems.

Marco Luise (M’88–SM’97) was born in Livorno, Italy, in 1960. He received the Doctorate of Engineering (cum laude) and the Research Doctorate degrees in electronic engineering from the University of Pisa, Pisa, Italy. In the past, he was a Research Fellow of the European Space Agency (ESA) at the European Space Research and Technology Centre (ESTEC), Noordwijk, The Netherlands, and a Research Scientist of CNR, the Italian National Research Council, at the Centro Studio Metodi Dispositivi Radiotrasmissioni (CSMDR), Pisa. He was recently appointed Full Professor of Telecommunications at the Department of Information Engineering, University of Pisa. His main research interests lie in the broad area of wireless communications, with particular emphasis on CDMA systems. Prof. Luise co-chaired four editions of the Tyrrhenian International Workshop on Digital Communications and in 1998 was the General Chairman of the URSI Symposium ISSSE. He served as Editor of Synchronization for the IEEE TRANSACTIONS ON COMMUNICATIONS.