Conference Paper

SonicFFT: A system architecture for ultrasonic-based FFT acceleration


... Recently, an emerging ultrasonic wavefront computing (WFC) technique was proposed to compute the FFT 11,12. This method uses the principles of wave mechanics in the acoustic domain, implementing the Fourier transform through ultrasonic waves propagating within silicon. According to Patel et al. 12, the computational complexity of WFC is O(δ), where δ is the transit time of the ultrasonic wavefront. This is because the number of cycles consumed in the microprocessor is comparable to the transit time of the ultrasonic wavefront. ...
... For a WFC module with an N × N transducer array, the computational complexity is O(N). The WFC technique achieves a 2317× improvement in system-level energy-delay product, with a simultaneous 117.69× speedup and 19.69× energy reduction, compared with the state-of-the-art all-digital baseline configuration 12. Table 1 summarizes the above-mentioned physical Fourier transform realization approaches against digital computation, in terms of complexity and their pros and cons. ...
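The three figures quoted in this snippet are mutually consistent if the energy-delay product (EDP) gain is taken as the product of the delay gain and the energy gain. The short check below is only a sketch under that assumption; the function name `edp_improvement` is hypothetical and does not come from the cited paper.

```python
def edp_improvement(speedup: float, energy_reduction: float) -> float:
    """EDP = energy * delay, so the EDP gain is assumed to be the
    product of the speedup (delay gain) and the energy reduction."""
    return speedup * energy_reduction


# Figures quoted in the snippet above.
quoted_speedup = 117.69
quoted_energy_reduction = 19.69

gain = edp_improvement(quoted_speedup, quoted_energy_reduction)
print(f"EDP improvement: {gain:.1f}x")
```

Rounded to the nearest integer, the product agrees with the 2317× figure quoted in the snippet.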
Article
The recently emerging alternative to classic numerical fast Fourier transform (FFT) computation, based on GHz ultrasonic waves generated and detected by piezoelectric transducers for wavefront computing (WFC), is more efficient and energy-saving. In this paper, we present comprehensive studies on the modeling and simulation methods for ultrasonic WFC computation. We validate the design of the WFC system using ray-tracing, Fresnel diffraction (FD), and the full-wave finite element method (FEM). To effectively simulate the WFC system for inputs of 1-D signals and 2-D images, we verified the design parameters and focal length of an ideal plano-concave lens using the ray-tracing method. We also compared the analytical FFT solution with our Fourier transform (FT) results from 3-D and 2-D FD and novel 2-D full-wave FEM simulations of a multi-level Fresnel lens with 1-D signals and 2-D images as inputs. Unlike the previously reported WFC system, which catered only for 2-D images, our proposed method can also solve the 1-D FFT effectively. We validate our proposed 2-D full-wave FEM simulation method by comparing our results with the theoretical FFT and the Fresnel diffraction method. The FFT results from FD and FEM agree well with the digitally computed FFT, with computational complexity reduced from O(N^2 log N) to O(N) for the 2-D FFT, and from O(N log N) to O(N) for the 1-D FFT with a large number of signal sampling points N.
Article
This paper presents a fully integrated ultrasound system based on a single piezoelectric micromachined ultrasonic transducer (PMUT) monolithically fabricated with analog front-end circuitry in a 0.13 μm complementary metal oxide semiconductor (CMOS) process. The PMUT is a square aluminum nitride (AlN) device with an 80 μm side that resonates at 2.4 MHz in a liquid environment. The monolithic integration of the PMUT with the CMOS circuitry reduces the parasitic capacitance and the electronic noise contribution and clearly improves the signal-to-noise ratio (SNR ∼ 27 dB better) compared to a non-integrated equivalent system. A pulse-echo experiment with the single PMUT-on-CMOS transmitting and sensing simultaneously is demonstrated, ensuring a 17.3 dB SNR, higher than the minimum necessary for accurate fingerprint images, paving the way towards a pixel-sized imaging system with no need for multiple simultaneous PMUT transmitters. Consuming only 0.3 mW and achieving an input-referred noise of 3.26 mPa/√Hz at 2.4 MHz, the proposed pulse-echo system achieves a competitive noise efficiency factor in comparison with the state of the art.
Article
This paper presents an analog front-end transceiver for an ultrasound imaging system based on a high-voltage (HV) transmitter, a low-noise front-end amplifier (RX), and a complementary-metal-oxide-semiconductor, aluminum nitride, piezoelectric micromachined ultrasonic transducer (CMOS-AlN-PMUT). The system was designed using the 0.13-μm Silterra CMOS process and the MEMS-on-CMOS platform, which allowed for the implementation of an AlN PMUT on top of the CMOS-integrated circuit. The HV transmitter drives a column of six 80-μm-square PMUTs excited with 32 V in order to generate enough acoustic pressure at a 2.1-mm axial distance. On the reception side, another six 80-μm-square PMUT columns convert the received echo into an electric charge that is amplified by the receiver front-end amplifier. A comparative analysis between a voltage front-end amplifier (VA) based on capacitive integration and a charge-sensitive front-end amplifier (CSA) is presented. Electrical and acoustic experiments successfully demonstrated the functionality of the designed low-power analog front-end circuitry, which outperformed a state-of-the-art front-end application-specific integrated circuit (ASIC) in terms of power consumption, noise performance, and area.
Article
The fast Fourier transform (FFT) algorithm was developed by Cooley and Tukey in 1965. It reduces the computational complexity of the discrete Fourier transform significantly, from O(N^2) to O(N log2 N). The invention of the FFT is considered a landmark development in the field of digital signal processing (DSP), since it expedites DSP algorithms to the point that real-time digital signal processing becomes possible. During the past 50 years, many researchers have contributed to advancements in the FFT algorithm to make it faster and more efficient in order to match the requirements of various applications. In this article, we present a brief overview of the key developments in FFT algorithms along with some popular applications in speech and image processing, signal analysis, and communication systems.
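To give a feel for the complexity gap stated in this abstract, the sketch below tabulates N^2 against N log2 N for a few transform lengths. These are asymptotic operation counts with all constant factors dropped, not measured costs, and the helper names `dft_ops` and `fft_ops` are hypothetical.

```python
import math


def dft_ops(n: int) -> int:
    """Order-of-magnitude operation count for a direct DFT: N^2."""
    return n * n


def fft_ops(n: int) -> float:
    """Order-of-magnitude operation count for a radix-2 FFT: N log2 N."""
    return n * math.log2(n)


for n in (64, 1024, 2 ** 20):
    ratio = dft_ops(n) / fft_ops(n)
    print(f"N = {n:>8}: N^2 = {dft_ops(n):.3e}, "
          f"N log2 N = {fft_ops(n):.3e}, ratio = {ratio:.1f}")
```

The ratio is N / log2 N, so the advantage of the FFT grows almost linearly with the transform length.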
Article
To enable the design of large capacity memory structures, novel memory technologies such as non-volatile memory (NVM) and novel fabrication approaches, e.g., 3D stacking and multi-level cell (MLC) design have been explored. The existing modeling tools, however, cover only a few memory technologies, technology nodes and fabrication approaches. We present DESTINY, a tool for modeling 2D/3D memories designed using SRAM, resistive RAM (ReRAM), spin transfer torque RAM (STT-RAM), phase change RAM (PCM) and embedded DRAM (eDRAM) and 2D memories designed using spin orbit torque RAM (SOT-RAM), domain wall memory (DWM) and Flash memory. In addition to single-level cell (SLC) designs for all of these memories, DESTINY also supports modeling MLC designs for NVMs. We have extensively validated DESTINY against commercial and research prototypes of these memories. DESTINY is very useful for performing design-space exploration across several dimensions, such as optimizing for a target (e.g., latency, area or energy-delay product) for a given memory technology, choosing the suitable memory technology or fabrication method (i.e., 2D v/s 3D) for a given optimization target, etc. We believe that DESTINY will boost studies of next-generation memory architectures used in systems ranging from mobile devices to extreme-scale supercomputers. The latest source-code of DESTINY is available from the following git repository: https://bitbucket.org/sparshmittal/destinyv2.
Article
Many applications of thin films necessitate detailed information about their thicknesses and sound velocities. Here, we study SiO2/LiNbO3 layer systems by picosecond photoacoustic metrology and measure the sound velocities of the respective layers and the film thickness of SiO2, which pose crucial information for the fabrication of surface-acoustic-wave filters for communication technology. Additionally, we utilize the birefringence and the accompanying change in the detection sensitivity of coherent acoustic phonons in the LiNbO3 layer to infer information about the LiNbO3 orientation and the layer interface.
Article
Recently, both industry and academia have proposed many different roadmaps for the future of DRAM. Consequently, there is a growing need for an extensible DRAM simulator, which can be easily modified to judge the merits of today's DRAM standards as well as those of tomorrow. In this paper, we present Ramulator, a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility. Unlike existing simulators, Ramulator is based on a generalized template for modeling a DRAM system, which is only later infused with the specific details of a DRAM standard. Thanks to such a decoupled and modular design, Ramulator is able to provide out-of-the-box support for a wide array of DRAM standards: DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, as well as some academic proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP). Importantly, Ramulator does not sacrifice simulation speed to gain extensibility: according to our evaluations, Ramulator is 2.5× faster than the next fastest simulator. Ramulator is released under the permissive BSD license.
Article
This article gives an overview on the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel-compatible multicores, but we also discuss the IBM Cell and, briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to formally derive the necessary algorithmic transformations based on a few hardware parameters. Further code-level optimizations are discussed. The rigorous nature of this framework enables the complete automation of the implementation task as shown by the program generator Spiral. Finally, we show and analyze DFT benchmarks of the fastest libraries available for the considered platforms.
Conference Paper
This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area² product (EDA²P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
Article
The Analytical Theory of Heat / Joseph Fourier ; translated, with notes, by Alexander Freeman Note: The University of Adelaide Library eBooks @ Adelaide.
Conference Paper
The classical Cooley-Tukey fast Fourier transform (FFT) algorithm has a computational cost of O(N log2 N), where N is the length of the discrete signal. Spectrum resolution is improved by padding zeros at the tail of the discrete signal; if (p − 1)N zeros are padded at the tail of the data sequence (where p is an integer), the computational cost of the FFT becomes O(pN log2 pN). This paper proposes an alternative way of padding zeros to the data sequence that reduces the computational cost to O(pN log2 N). This modification can be used to achieve non-uniform upsampling that zooms in or out on a particular frequency band; in addition, it may be used for pruning the spectrum, which reduces the resolution of an unimportant frequency band.
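The conventional tail padding described in this abstract can be illustrated with a small sketch. It shows only the baseline scheme (appending (p − 1)N zeros), not the alternative padding the paper proposes, which the abstract does not spell out. A direct O(N^2) DFT with the common e^{-2πi jk/N} sign convention is used for brevity; the helper names `dft` and `zero_pad` and the sample data are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Direct DFT, X(j) = sum_k x[k] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def zero_pad(x, p):
    """Append (p - 1) * len(x) zeros at the tail of the sequence."""
    return list(x) + [0.0] * ((p - 1) * len(x))


signal = [1.0, 0.5, -0.25, 2.0]   # N = 4, arbitrary illustrative data
p = 4

coarse = dft(signal)               # N spectrum samples
fine = dft(zero_pad(signal, p))    # p*N samples of the same spectrum

# Every p-th bin of the padded spectrum coincides with an original bin;
# the bins in between are the extra, more finely spaced samples.
max_gap = max(abs(fine[p * j] - coarse[j]) for j in range(len(signal)))
print(len(coarse), len(fine), max_gap)
```

Padding adds no new information; it only samples the same underlying spectrum on a grid p times denser, which is why the cost of the longer transform is worth reducing.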
Conference Paper
High resolution Fast Fourier Transform (FFT) is important for various applications while increased memory access and parallelism requirement limits the traditional hardware. In this work, we explore acceleration opportunities for high resolution FFTs in spintronic computational RAM (CRAM) which supports true in-memory processing semantics. We experiment with Spin-Torque-Transfer (STT) and Spin-Hall-Effect (SHE) based CRAMs in implementing CRAFFT, a high resolution FFT accelerator in memory. For one million point fixed-point FFT, we demonstrate that CRAFFT can provide up to 2.57× speedup and 673× energy reduction. We also provide a proof-of-concept extension to floating-point FFT.
Article
Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 2^20 points. First, a parallel Cooley–Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with an area of 2.4 mm² and a power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal’s performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89× speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
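The idea of dividing one large transform into many small, independent ones with transpositions in between can be sketched with the textbook Cooley–Tukey factorization N = N1 · N2. This is a generic illustration, not the accelerator's actual algorithm or scheduling; the function names, the small direct `dft` used for the sub-transforms, and the e^{-2πi jk/N} sign convention are assumptions.

```python
import cmath


def dft(x):
    """Direct DFT used here as the 'small FFT' building block."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft_by_transposition(x, n1, n2):
    """Length n1*n2 DFT built from n2 DFTs of length n1, a twiddle
    multiplication, a transposition, and n1 DFTs of length n2."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # View x as an n1-by-n2 row-major matrix and read it column-wise
    # (first transposition): cols[c][r] = x[n2 * r + c].
    cols = [[x[n2 * r + c] for r in range(n1)] for c in range(n2)]
    # n2 independent length-n1 transforms (parallelizable).
    inner = [dft(col) for col in cols]
    # Twiddle factors exp(-2*pi*i*c*k1/n) couple the two stages.
    for c in range(n2):
        for k1 in range(n1):
            inner[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Second transposition: one row per k1, indexed by c.
    rows = [[inner[c][k1] for c in range(n2)] for k1 in range(n1)]
    # n1 independent length-n2 transforms (parallelizable).
    outer = [dft(row) for row in rows]
    # Output index mapping: X[k1 + n1 * k2] = outer[k1][k2].
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = outer[k1][k2]
    return out


data = [complex(i % 5, -i % 3) for i in range(12)]
reference = dft(data)
blocked = fft_by_transposition(data, 3, 4)
print(max(abs(a - b) for a, b in zip(reference, blocked)))
```

Each list comprehension over `cols` or `rows` has no data dependence between its elements, which is the property that lets the small transforms run in parallel.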
Conference Paper
In this paper we present designs of an aluminum nitride (AlN) based transducer stack for ultrasonic transmit/receive applications integrated in silicon. By optimal design of the mechanical layer thickness and material properties, channel gain, center frequency, and bandwidth can be controlled to allow for the use of lower gain and on-chip power electronics for integrated ultrasonic information processing. Certain materials in the stack were fixed due to the fabrication processing capability; however, some of the passive layers and the thicknesses of the layers could be controlled. Simulations were done to select the desired thickness of each layer, and the resulting chip was fabricated and verified. Previous tape-outs from the fab had resulted in receive signal levels of 200 µV at 1.3 GHz, whereas the current stack had signal levels of 18 mV at 1.3 GHz.
Article
At data rates beyond 10Gb/s, most wireline links employ NRZ signaling. Serial NRZ links as high as 56Gb/s and 60Gb/s have been reported [1]. Nevertheless, as the rate increases, the constraints imposed by the channel, package, and die become more severe and do not benefit from process scaling in the same fashion that circuit design does. Reflections from impedance discontinuities in the PCB and package caused by vias and connectors introduce significant signal loss and distortions at higher frequencies. Even with an ideal channel, at every package-die interface, there is an intrinsic parasitic capacitance due to the pads and the ESD circuit amounting to at least 150fF, and a 50Ω resistor termination at both the transmit and receive ends resulting in an intrinsic pole at 23GHz or lower. In light of all these limitations, serial NRZ signaling beyond 60Gb/s appears suboptimal in terms of both power and performance. Utilizing various modulation techniques such as PAM4, one can achieve a higher spectral efficiency [2]. To enable such transmission formats, high-speed moderate-resolution data converters are required. This paper describes a 36Gb/s transmitter based on an 18GS/s 8b DAC implemented in 28nm CMOS, compliant with the new IEEE802.3bj standard for 100G Ethernet over backplane and copper cables [3].
Conference Paper
On-chip wired interconnects present a bottleneck for VLSI integrated circuits. An additional channel with which to communicate information would be beneficial to supplement traditional wired designs. Utilizing virtual, reconfigurable ultrasonic interconnects operating at high bit rates could open new vistas for computer architecture and low-power computing. The first step toward this goal is demonstrated in this paper by using ultrasonic pulses to communicate between two aluminum nitride thin-film transducers on a silicon wafer representative of a VLSI substrate. Direct output voltages on receive pixels were on the order of 40-60 μVpp for a drive voltage on transmit pixels of 0.5 Vpp at 900 MHz. An FEA model was used to verify the time-of-flight and signal amplitudes to demonstrate that the primary mechanism is bulk acoustic waves travelling through the silicon substrate.
Conference Paper
Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial. We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores. We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
Conference Paper
We developed a high-density 1R/1W SRAM macro based on 8T-SRAM with an effective design-for-testability scheme. To achieve a smaller macro area, a differential sense amplifier is introduced to read the data, where the reference voltage for reading 0/1 data is generated by the unselected cell array. In addition, we proposed a screening test circuit for read-disturb operation. A 512 kbit two-port SRAM macro in a 28nm process was designed, confirming experimentally that the worst minimum operation voltage (Vmin) can be reproduced by our test circuit. A bit density of 3.16 Mb/mm² was achieved, which is the highest reported in the recent literature.
Article
Piezoelectric aluminum nitride (AlN) thin films have been developed to realize ultrasonic transducers. AlN up to 1.5 μm thick is deposited at low temperature (140 °C) by reactive DC magnetron sputtering of an Al target in argon and nitrogen on Si, Si/SiO2/Al, and Si/Al substrates, and is wet etched (rates from 0.1 μm/min to 0.2 μm/min, selectivity of 1:10 with Al, and no etching of Si). SiO2/Al/AlN/Al, Al/AlN/Al, and Si/AlN/Al square and circular membranes, from 10 μm to 1.5 mm in size, are fabricated using silicon deep reactive ion etching (DRIE), which gives etch profiles of about 90° and so allows larger integration density than wet anisotropic etching for ultrasonic transducer arrays. By varying the size and thickness of the membranes, resonance frequencies from 10 kHz to 20 MHz are expected; acoustic and electrical measurements are in progress. Ultrasonic transducers using this technology will be used to measure flow velocity by the Doppler method. Other potential applications for ultrasonic transducers include medical ultrasound and sonar. Other structures using this technology are also in progress, such as the Thin Film Bulk Acoustic Resonator (TFBAR) and Lamb wave devices.
Book
This manuscript describes a number of algorithms that can be used to quickly evaluate a polynomial over a collection of points and interpolate these evaluations back into a polynomial. Engineers define the “Fast Fourier Transform” as a method of solving the interpolation problem where the coefficient ring used to construct the polynomials has a special multiplicative structure. Mathematicians define the “Fast Fourier Transform” as a method of solving the evaluation problem. One purpose of the document is to provide a mathematical treatment of the topic of the “Fast Fourier Transform” that can also be understood by someone who has an understanding of the topic from the engineering perspective. The manuscript will also introduce several new algorithms that solve the fast multipoint evaluation problem over certain finite fields and require fewer finite field operations than existing techniques. The document will also demonstrate that these new algorithms can be used to multiply polynomials with finite field coefficients with fewer operations than Schonhage's algorithm in most circumstances. A third objective of this document is to provide a mathematical perspective of several algorithms which can be used to multiply polynomials of size which is not a power of two. Several improvements to these algorithms will also be discussed. Finally, the document will describe several applications of the “Fast Fourier Transform” algorithms presented and will introduce improvements in several of these applications. In addition to polynomial multiplication, the applications of polynomial division with remainder, the greatest common divisor, decoding of Reed-Solomon error-correcting codes, and the computation of the coefficients of a discrete Fourier Series will be addressed.
Article
An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N^2. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series X(j) = \sum_{k=0}^{N-1} A(k) \cdot W^{jk}, \quad j = 0, 1, \ldots, N-1. \qquad (1)
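Equation (1) and the N = 2^m case mentioned in this abstract can be sketched as follows. The abstract is cut off before W is defined, so the code assumes W = exp(2πi/N), the principal N-th root of unity; the function names are hypothetical, and the recursive radix-2 form shown is the textbook decimation-in-time variant rather than the in-place procedure the abstract alludes to.

```python
import cmath


def series_direct(a):
    """Evaluate X(j) = sum_k A(k) * W**(j*k) term by term: about N^2 operations.
    Assumes W = exp(2*pi*i/N)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * w ** (j * k) for k in range(n)) for j in range(n)]


def series_radix2(a):
    """Same sums for N = 2**m by splitting into even- and odd-indexed
    halves: about N log2 N operations."""
    n = len(a)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(a[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = series_radix2(a[0::2])
    odd = series_radix2(a[1::2])
    half = n // 2
    out = [0j] * n
    for j in range(half):
        # W**j times the odd-half result; W**(j + N/2) = -W**j.
        t = cmath.exp(2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + t
        out[j + half] = even[j] - t
    return out


coeffs = [1, 2, 3, 4, 0, -1, 2.5, 1j]   # N = 8 = 2**3, arbitrary data
slow = series_direct(coeffs)
fast = series_radix2(coeffs)
print(max(abs(s - f) for s, f in zip(slow, fast)))
```

Both routines compute the same N sums; the recursion simply reuses the two half-length results for the outputs j and j + N/2, which is where the saving from N^2 to N log N comes from.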
Conference Paper
A review is presented of the principles of both intrinsic and extrinsic fiber optic sensing systems and some of their applications to nondestructive evaluation (NDE), with special emphasis on current research demonstrating applications of fiber optics to ultrasonic NDE. Single- and multimode flexible optical-fiber elements, individually or in a variety of arrays, provide useful tools for many NDE applications ranging from directed photothermal excitation to remote sensing. Optical-fiber components have several important advantages. They are dielectric devices and thus are largely insensitive to electromagnetic interference. They can be readily adapted for use in harsh environments, and their dimensions and geometrical flexibility support compact, readily adaptable designs and facilitate access to remote or otherwise inaccessible locations
Conference Paper
We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4× over CUFFT and 8-40× over MKL for large sizes.
Article
In this paper we investigate possible ways to improve the energy efficiency of a general purpose microprocessor. We show that the energy of a processor depends on its performance, so we chose the energy-delay product to compare different processors. To improve the energy-delay product we explore methods of reducing energy consumption that do not lead to performance loss (i.e. wasted energy), and explore methods to reduce delay by exploiting instruction level parallelism. We found that careful design reduced the energy dissipation by almost 25%. Pipelining can give approximately a 2× improvement in energy-delay product. Superscalar issue, however, does not improve the energy-delay product any further since the overhead required offsets the gains in performance. Further improvements will be hard to come by since a large fraction of the energy (50-80%) is dissipated in the clock network and the on-chip memories. Thus, the efficiency of processors will depend more on the technology being used and the algorithm chosen by the programmer than the micro-architecture
Fourier optics: basic concepts
  • perrin
Texas Instruments White Paper: Very large FFT for TMS320C6678 processors
  • Xiaohui Li
  • Ellen Blinka