N. Felber

ETH Zurich, Zürich, ZH, Switzerland

Publications (61) · 16.71 Total impact

  • Conference Proceeding: SPARSITY-BASED REAL-TIME AUDIO RESTORATION
    ISCAS; 01/2012
  • Conference Proceeding: Hardware platform and implementation of a real-time multi-user MIMO-OFDM testbed
    ABSTRACT: This paper describes a modular hardware platform of a multi-user (MU) multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) testbed. The hardware platform is based on multiple field programmable gate arrays (FPGAs), provides four integrated radio-frequency (RF) chains, and has capabilities for extension boards. The performance and modularity of the testbed enable real-time MU-MIMO-OFDM experiments as well as offline processing experiments. To this end, the MIMO physical (PHY) layer of Haene et al., IEEE J-SAC, 2008, has been adapted to the new hardware platform and extended with bi-directional communication facilities and a basic media access control (MAC) layer equipped with Ethernet connectivity.
    Circuits and Systems, 2009. ISCAS 2009. IEEE International Symposium on; 06/2009
  • Conference Proceeding: Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison
    ABSTRACT: The QR decomposition (QRD) is an important prerequisite for many different detection algorithms in multiple-input multiple-output (MIMO) wireless communication systems. This paper presents an optimized fixed-point VLSI implementation of the modified Gram-Schmidt (MGS) QRD algorithm that incorporates regularization and additional sorting of the MIMO channel matrix. Integrated in 0.18 μm CMOS technology, the proposed VLSI architecture processes up to 1.56 million complex-valued 4×4-dimensional matrices per second. The implementation results of this work are extensively compared to the Givens rotation (GR)-based QRD implementation of Luethi et al., ISCAS 2007. In order to ensure a fair comparison, both QRD circuits have been integrated in the same IC manufacturing technology, with equal functionality, and the same numeric precision. The comparison of the implementation results clearly showed superiority of the GR-based VLSI solution in terms of area, processing cycles, and throughput.
    Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on; 01/2009
  • Conference Proceeding: Hardware Comparison of the Hash Function Candidates RADIOGATÚN, MAME, and LAKE
    ABSTRACT: Several hash functions have been presented to replace previous standard hash families SHA-1 and SHA-2. Besides resistance to cryptanalysis, new candidates should feature a good flexibility to be implemented in hardware. This paper investigates the VLSI design of three emerging hash algorithms. The functions RADIOGATUN, MAME, and LAKE have been implemented and synthesized for ASIC and FPGA target devices. The achieved results point out that the fastest circuit, exceeding 600 MHz in a 0.18 μm CMOS technology and 300 MHz in a Xilinx Virtex-4 FPGA, is RADIOGATUN, while the 9.1 k gate equivalents (GE) implementation of MAME demonstrates the suitability of the algorithm for applications under limited resources.
    NORCHIP, 2008.; 12/2008
  • Conference Proceeding: VLSI hardware evaluation of the stream ciphers Salsa20 and ChaCha, and the compression function Rumba
    ABSTRACT: Salsa20 is a stream cipher candidate in the software-oriented profile of the eSTREAM project. ChaCha is a successor stream cipher with improved per round diffusion and, conjecturally, increased resistance to cryptanalysis. Based on the combination of four Salsa20 instances, Rumba is a compression function for hashing schemes. This paper presents the evaluation of five VLSI circuits for Salsa20. Synthesis results for a 0.18 μm CMOS technology point out that the fastest implementation achieves a throughput of 6.4 Gbps, while the smallest design requires only an area of 10 k gate equivalents (GE) at 16 Mbps. This work also presents the first hardware implementations of ChaCha and Rumba. The fastest ChaCha design achieves 6.8 Gbps and the smallest design requires an area of 9.1 kGE at 16 Mbps. Furthermore, two Rumba implementations are able to achieve 17.9 Gbps or a compact area of 16.8 kGE at 12 Mbps.
    Signals, Circuits and Systems, 2008. SCS 2008. 2nd International Conference on; 12/2008
  • Conference Proceeding: FPGA implementation of a 2G fibre channel link encryptor with authenticated encryption mode GCM
    ABSTRACT: The Galois/counter mode (GCM) algorithm enables fast encryption combined with per-packet message authentication. This paper presents an FPGA implementation of a complete bidirectional 2 Gbps fibre channel link encryptor hosting two area-optimized GCM cores for concurrent authenticated encryption and decryption. The proposed architecture fits into one Xilinx Virtex-4 device. Measurements in a working network link point out that per-packet authentication results in a speed decrease of up to 20% of the channel capacity for a reference frame length of 256 bits. Two methods of frame encryption are investigated to reduce the required GCM overhead and to exploit different network configurations.
    System-on-Chip, 2008. SOC 2008. International Symposium on; 12/2008
  • Conference Proceeding: An automatic gain controller for MIMO-OFDM WLAN systems
    ABSTRACT: MIMO-OFDM based wireless LAN standards are currently being defined. These systems employ packet-based communication, which requires fast and accurate automatic gain control. The precise estimation of the expected receive signal power of data symbols, based on preamble symbols, is required in order to optimally detect the data signals. In this paper, two different preamble OFDM-symbols are considered and analyzed with regard to their suitability for received signal power estimation in MIMO-OFDM systems. An AGC architecture for an IEEE 802.11a based MIMO system is proposed and FPGA implementation results are reported.
    Circuits and Systems for Communications, 2008. ECCSC 2008. 4th European Conference on; 08/2008
  • Conference Proceeding: OFDM channel estimation algorithm and ASIC implementation
    ABSTRACT: Coherent detection of OFDM signals requires channel state information which can be acquired by transmitting known training symbols and by appropriate channel estimation at the receiver. Focusing on robust algorithms suitable for integration in silicon, a novel channel estimation method is proposed. The algorithm is based on a suboptimal modification to the maximum-likelihood estimator and was designed to enable the use of highly optimized constant-coefficient multipliers that require less area on silicon compared to regular multipliers. The mean square error and the complexity of different estimators are analyzed analytically, while an actual ASIC implementation makes it possible to assess the real-world silicon area requirements for our proposed algorithm.
    Circuits and Systems for Communications, 2008. ECCSC 2008. 4th European Conference on; 08/2008
  • Article: Transmission Gates Combined With Level-Restoring CMOS Gates Reduce Glitches in Low-Power Low-Frequency Multipliers
    ABSTRACT: Various 16-bit multiplier architectures are compared in terms of dissipated energy, propagation delay, energy-delay product (EDP), and area occupation, in view of low-power low-voltage signal processing for low-frequency applications. A novel practical approach has been set up to investigate and graphically represent the mechanisms of glitch generation and propagation. It is found that spurious activity is a major cause of energy dissipation in multipliers. Measurements point out that, because of its shorter full-adder chains, the Wallace multiplier dissipates less energy than other traditional array multipliers (8.2 μW/MHz versus 9.6 μW/MHz for 0.18 μm CMOS technology at 0.75 V). The benefits of transistor sizing are also evaluated (Wallace including minimum-size transistors dissipates 6.2 μW/MHz). By combining transmission gates with static CMOS in a Wallace architecture, a new approach is proposed to improve the energy-efficiency further (4.7 μW/MHz), beyond recently published low-power architectures. The innovation consists in suppressing glitches via resistance-capacitance low-pass filtering, while preserving unaltered driving capabilities. The reduced number of Vdd-to-ground paths also contributes to a significant decrease of static consumption.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 08/2008; · 1.22 Impact Factor
  • Conference Proceeding: A 0.25 μm 0.92 mW per Mb/s Viterbi Decoder Featuring Resonant Clocking for Ultra-Low-Power 54 Mb/s WLAN Communication
    ABSTRACT: In this paper, resonant clocking is applied to a Viterbi decoder for ultra-low-power WLAN communication. Clock skew balancing and excessive cross-over currents are identified as the most relevant issues: H-clock-trees and a new latch circuit are proposed as innovative power-efficient design solutions. The chip has been integrated in a 0.25 μm CMOS process. Supplied at 1.75 V, the 1.35 mm² core dissipates 50 mW at 54 Mb/s throughput, with about 27% power savings compared to an equivalent circuit with conventional one-phase single-edge-triggered (SET) clocking strategy and a recently published competitor by C.C. Lin et al. (2005). The chip works up to 77 MHz.
    Custom Integrated Circuits Conference, 2007. CICC '07. IEEE; 10/2007
  • Conference Proceeding: A Self-Timed 16-bit Multiplier for Low-Power Low-Frequency Applications
    ABSTRACT: A comprehensive study of spurious activity propagation, based on transistor-level simulations targeting a 0.18 μm CMOS process, is carried out in traditional multiplier architectures (Carry-Save, Carry-Save with Booth recoding and Wallace tree). The results suggest implementing self-timed multipliers, i.e. multipliers in which partial products are triggered by an independent delay line: they have the property of suppressing unnecessary switching activity. They are discussed in terms of area occupation and, especially, power dissipation and Energy-Delay-Product (EDP). After that, a new self-timed multiplier architecture is introduced. Transistor-level simulations point out a dissipation of 2.0 μW/MHz against 4.8 μW/MHz of a recently published self-timed multiplier and 4.1 μW/MHz of the most efficient traditional architecture (Wallace), with a reduced 5% area overhead compared to the latter one.
    Circuits and Systems, 2006. MWSCAS '06. 49th IEEE International Midwest Symposium on; 09/2006
  • Conference Proceeding: 42% power savings through glitch-reducing clocking strategy in a hearing aid application
    ABSTRACT: Glitches are responsible for a significant proportion of overall power dissipation in digital signal processing circuits. Activity-reduction techniques that involve an optimized clocking strategy have been applied to a front-end block in a DSP adaptive directional microphone for hearing aids. Functionally equivalent implementations, differing only in their clocking scheme, have been integrated on silicon in a 0.25 μm CMOS technology. Measurements and post-layout simulations confirm a 42% reduction over single-edge-triggered clocking with clock gating. An overall power dissipation of 20 μW (@ 1.4 V, 374 kHz) has been measured. This achievement has been made possible by combining two novel techniques: a multi-stage clock gating, and a symmetric two-phase level-sensitive clocking with glitch-aware re-distribution of data-path registers.
    Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on; 06/2006
  • Conference Proceeding: A Frame-Start Detector for a 4×4 MIMO-OFDM System
    ABSTRACT: Future wireless LANs will increase the peak data rate by employing multiple antennas at both transmitter and receiver. Well-designed synchronization algorithms are a prerequisite for meeting stringent QoS requirements. In particular, OFDM modulation, which constitutes the basis for WLAN, is very sensitive to timing synchronization errors which incur inter-symbol interference. In this paper, a novel frame synchronization algorithm is proposed that is implemented in the FPGA of a real-time MIMO-OFDM testbed. Simulations show it to be of sufficient performance in scenarios of interest, while the hardware complexity is suitable for an FPGA implementation. Additionally, the algorithm exhibits a good resilience against narrow-band interference, which causes problems in traditional frame-start detection algorithms.
    Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on; 06/2006 · 4.63 Impact Factor
  • Conference Proceeding: GALS at ETH Zurich: success or failure?
    ABSTRACT: The Integrated Systems Laboratory (IIS) of ETH Zurich (Swiss Federal Institute of Technology) has been active in globally-asynchronous locally-synchronous (GALS) research since 1998. During this time, a number of GALS circuits have been fabricated and tested successfully on silicon. From a hardware designer's point of view, this article summarizes the evolution from proof of concept designs over multi-point interconnects to applications that specifically take advantage of GALS operation to improve cryptographic security. In spite of the fact that they fail to address numerous idiosyncrasies of GALS (such as good partitioning into synchronous islands, port controller design, pausable clock generators, design for test, etc.), hierarchical design flows have been found to form a workable basis. What mainly prevents GALS from gaining wider acceptance is the initial effort required to come up with a design flow that is efficient and dependable.
    Asynchronous Circuits and Systems, 2006. 12th IEEE International Symposium on; 04/2006
  • Conference Proceeding: Improving DPA security by using globally-asynchronous locally-synchronous systems
    ABSTRACT: Side channel analysis attacks, and particularly differential power analysis (DPA), pose a serious threat to cryptographic security. This is partly because the synchronous operation of traditional cipher hardware affords a fairly good correlation between the abstract power model used during analysis and the physical circuit under attack. As opposed to this, the globally-asynchronous locally-synchronous (GALS) AES cipher circuit discussed in this paper combines operation reordering and unpredictable latencies with three asynchronous clock domains and self-varying clock cycle times. Attackers are further confused by having functional units process random dummy data when idle. The design fabricated in a 0.25 μm CMOS technology comprises 39,000 gate-equivalents, occupies approximately 1 mm² and achieves a peak throughput of more than 256 Mb/s.
    Solid-State Circuits Conference, 2005. ESSCIRC 2005. Proceedings of the 31st European; 10/2005
  • Conference Proceeding: ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs
    ABSTRACT: Next generation wireless local area networks (WLANs) such as the IEEE 802.11a standard are expected to rely on multiple antennas at both transmitter and receiver to increase throughput and link reliability. However, these improvements come at a significant increase in signal processing and hence hardware complexity compared to existing single-antenna systems. This paper presents, to the best of the authors' knowledge, the first 4 × 4 MIMO-OFDM WLAN physical layer ASIC based on the OFDM specifications of the IEEE 802.11a standard. The ASIC achieves an uncoded throughput of 192 Mbps in a 20 MHz channel resulting in a spectral efficiency of 9.6 bits/s/Hz. We describe the hardware architectural differences to single-antenna OFDM systems as well as the extensions made necessary by the use of multiple antennas. Our implementation provides reference for the silicon complexity of MIMO-OFDM systems.
    Solid-State Circuits Conference, 2005. ESSCIRC 2005. Proceedings of the 31st European; 10/2005
  • Conference Proceeding: A 2.7-μW/MHz transmission-gate-based 16-bit multiplier for digital hearing aids
    ABSTRACT: Various 16-bit multiplier architectures are compared in terms of dissipated energy, EDP (energy-delay product), and area occupation, in view of low-power low-voltage signal processing for digital hearing aids and similar applications. It is found that the propagation of glitches along uneven and reconvergent paths results in large unproductive node activity. Because of their shorter full-adder chains, Wallace-tree multipliers indeed dissipate less energy than the carry-save and other traditional array multipliers (5.4 to 6.1 μW/MHz versus 9.4 μW/MHz and more for 0.25 μm CMOS technology at 0.75 V). By combining the Wallace-tree architecture with transmission gates, a new approach is proposed to further improve the energy-efficiency (2.7 μW/MHz), beyond recently published low-power architectures. Besides the reduction of the overall capacitance, transmission gate full-adders act as RC-low-pass filters that attenuate undesired switching.
    Circuits and Systems, 2005. 48th Midwest Symposium on; 09/2005
  • Conference Proceeding: Area, throughput and security considerations for AES crypto-ASICs
    Research in Microelectronics and Electronics, 2005 PhD; 08/2005
  • Conference Proceeding: Receiver design for multi-antenna wireless communications
    Research in Microelectronics and Electronics, 2005 PhD; 08/2005
  • Conference Proceeding: Towards an AES crypto-chip resistant to differential power analysis
    ABSTRACT: Differential power analysis (DPA) implies measuring the supply current of a cipher-circuit in an attempt to uncover part of a cipher-key. Cryptographic security gets compromised if the current waveforms so obtained correlate with those from a hypothetical power model of the circuit. Such correlations can be minimized by masking datapath operations with random bits in a reversible way. We analyze such countermeasures and discuss how they perform and how well they lend themselves to being incorporated into dedicated hardware implementations of the advanced encryption standard (AES) block cipher. Our favorite masking scheme entails a performance penalty of some 40-50%. We also present a VLSI design that can serve for practical experiments with DPA.
    Solid-State Circuits Conference, 2004. ESSCIRC 2004. Proceeding of the 30th European; 10/2004

Institutions

  • 2000–2009
    • ETH Zurich
      • Integrated Systems Laboratory
      Zürich, ZH, Switzerland
  • 2004
    • Technische Universität Graz
      Graz, Styria, Austria
  • 1993–2001
    • École Polytechnique Fédérale de Lausanne
      • Laboratoire des systèmes intégrés
      Lausanne, VD, Switzerland
  • 1999
    • University of Zurich
      Zürich, ZH, Switzerland
    • Siemens
      München, Bavaria, Germany
  • 1998
    • Integrated Laboratory Systems
      Chapel Hill, NC, USA