[Show abstract][Hide abstract] ABSTRACT: A novel digitally driven pixel circuit for active-matrix organic light-emitting diode (OLED) microdisplays is proposed and evaluated. This circuit supports both pulse width modulation and pulse density modulation digital drive approaches. Only three transistors and one capacitor are required per pixel for the proposed circuit. A current mirror is used to compensate for the pixel current changes that occur because of the degradation of the OLEDs over time. The compensation current depends on the potential of the common cathode, the properties of the current mirror, and the Width/Length (W/L) ratio of the drive transistor. The proposed digital pixel circuit also has advantages in circuit layout compared with analog pixel circuits.
[Show abstract][Hide abstract] ABSTRACT: Most digital systems operate on a positional representation of data, such as binary radix. An alternative is to operate on random bit streams where the signal value is encoded by the probability of obtaining a one versus a zero. This representation is much less compact than binary radix. However, complex operations can be performed with very simple logic. Furthermore, since the representation is uniform, with all bits weighted equally, it is highly tolerant of soft errors (i.e., bit flips). Both combinational and sequential constructs have been proposed for operating on stochastic bit streams. Prior work has shown that combinational logic can implement multiplication and scaled addition effectively while linear finite-state machines (FSMs) can implement complex functions such as exponentiation and tanh effectively. Prior work on stochastic computation has largely been validated empirically.This paper provides a rigorous mathematical treatment of stochastic implementation of complex functions such as exponentiation and tanh implemented using linear FSMs. It presents two new functions, an absolute value function and exponentiation based on an absolute value, motivated by specific applications. Experimental results show that the linear FSM-based constructs for these functions have smaller area-delay products than the corresponding deterministic constructs. They also are much more tolerant of soft errors.
Full-text · Article · Jun 2014 · IEEE Transactions on Computers
[Show abstract][Hide abstract] ABSTRACT: Maintaining the reliability of integrated circuits as transistor sizes continue to shrink to nanoscale dimensions is a significant looming challenge for the industry. Computation on stochastic bit streams, which could replace conventional deterministic computation based on a binary radix, allows similar computation to be performed more reliably and often with less hardware area. Prior work discussed a variety of specific stochastic computational elements (SCEs) for applications such as artificial neural networks and control systems. Recently, very promising new SCEs have been developed based on finite-state machines (FSMs). In this paper, we introduce new SCEs based on FSMs for the task of digital image processing. We present five digital image processing algorithms as case studies of practical applications of the technique. We compare the error tolerance, hardware area, and latency of stochastic implementations to those of conventional deterministic implementations using binary radix encoding. We also provide a rigorous analysis of a particular function, namely the stochastic linear gain function, which had only been validated experimentally in prior work.
No preview · Article · Mar 2014 · IEEE Transactions on Very Large Scale Integration (VLSI) Systems
[Show abstract][Hide abstract] ABSTRACT: We consider the design of IIR filters operating on oversampled sigma-delta modulated bit streams using stochastic arithmetic. Conventional digital filters process multi-bit data at the Nyquist rate using multi-bit multipliers and adders. High resolution ADCs based on the sigma-delta modulation generate random bits at an oversampled rate as intermediate data. We propose to filter the sigma-delta modulated bit streams directly and present first and second order low pass IIR filters based on the stochastic integrator. Experimental results show a significant reduction in hardware area by using stochastic filters.
[Show abstract][Hide abstract] ABSTRACT: The use of model-based software development is increasingly popular due to recent advancements in modeling technology. Numerous approaches exist; this paper seeks to organize and characterize them. In particular, important terminological confusion, challenges, ...
Full-text · Article · Oct 2013 · Software and Systems Modeling
[Show abstract][Hide abstract] ABSTRACT: Singular value decomposition (SVD) is a fundamental linear operation that has been used for many applications, such as pattern recognition and statistical information processing. In order to accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performanc solution to the secular equation with good numerical stability, overlapping the CPU and the GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm has better performance than MKL's divide-and-conquer routine  with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine , 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows up to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
[Show abstract][Hide abstract] ABSTRACT: Since stochastic computing performs operations using streams of bits that represent probability values instead of deterministic values, it can tolerate a large number of failures in a noisy system. However, the simulation of a stochastic implementation is extremely time-consuming. In this paper, we investigate two approaches to speed up the stochastic simulation: a GPU-based simulation and an OpenMP-based simulation. To compare these two approaches, we start with several basic stochastic computing elements SCEs and then use the stochastic implementation of a frame difference-based image segmentation algorithm as case study to conduct extensive experiments. Measured results show that the GPU-based simulation with 448 processing elements can achieve up to 119x performance speedup compared to the single-threaded CPU simulation and 17x performance speedup over the OpenMP-based simulation with eight processor cores. In addition, we present several performance optimisations for the GPU-based simulation which significantly benefit the performance of stochastic simulation.
Preview · Article · Feb 2013 · International Journal of Computational Science and Engineering
[Show abstract][Hide abstract] ABSTRACT: Stochastic computing is a novel approach to real arithmetic, offering better error tolerance and lower hardware costs over the conventional implementations. Stochastic modules are digital systems that process random bit streams representing real values in the unit interval. Stochastic modules based on finite state machines (FSMs) have been shown to realize complicated arithmetic functions much more efficiently than combinational stochastic modules. However, a general approach to synthesize FSMs for realizing arbitrary functions has been elusive. We describe a systematic procedure to design FSMs that implement arbitrary real-valued functions in the unit interval using the Taylor series approximation.
[Show abstract][Hide abstract] ABSTRACT: Energy consumption is a fundamental issue in today's data centers as data continue growing dramatically. How to process these data in an energy-efficient way becomes more and more important. Prior work had proposed several methods to build an energy-efficient system. The basic idea is to attack the memory wall issue (i.e., the performance gap between CPUs and main memory) by moving computing closer to the data. However, these methods have not been widely adopted due to high cost and limited performance improvements. In this paper, we propose the storage processing unit (SPU) which adds computing power into NAND flash memories at standard solid-state drive (SSD) cost. By pre-processing the data using the SPU, the data that needs to be transferred to host CPUs for further processing are significantly reduced. Simulation results show that the SPU-based system can result in at least 100 times lower energy per operation than a conventional system for data-intensive applications.
[Show abstract][Hide abstract] ABSTRACT: Most digital systems operate on a positional representation of data, such as binary radix. An alternative is to operate on random bit streams where the signal value is encoded by the probability of obtaining a one versus a zero. This representation is much less compact than binary radix. However, complex operations can be performed with very simple logic. Furthermore, since the representation is uniform, with all bits weighted equally, it is highly tolerant of soft errors (i.e., bit flips). Both combinational and sequential constructs have been proposed for operating on stochastic bit streams. Prior work has shown that combinational logic can implement multiplication and scaled addition effectively; linear finitestate machines (FSMs) can implement complex functions such as exponentiation and tanh effectively. Building on these prior results, this paper presents case studies of useful circuit constructs implement with the paradigm of logical computation on stochastic bit streams. Specifically, it describes finite state machine implementations of functions such as edge detection and median filter-based noise reduction.
[Show abstract][Hide abstract] ABSTRACT: Stochastic encoding represents a value using the probability of ones in a random bit stream. Computation based on this encoding has good fault-tolerance and low hardware cost. However, one of its major issues is long processing time. We have to use a long enough bit stream to represent a value to guarantee that random fluctuations introduce only small errors to final computation results. For example, for most digital image processing algorithms, we need a 512-bit stream to represent an 8-bit pixel value stochastically to guarantee that the final computation error is less than 5%. To solve this issue, this paper proposes to share bits between adjacent bit streams to represent adjacent deterministic values. For example, in image processing applications, the bit stream which represents the current pixel value can share parts of the bits in the bit stream which represents the previous pixel value. We use an image contrast stretching algorithm to evaluate this method. Our experimental results show that the proposed methods can improve the performance by 90%.
[Show abstract][Hide abstract] ABSTRACT: The paradigm of logical computation on stochastic bit streams has several key advantages compared to deterministic computation based on binary radix, including error-tolerance and low hardware area cost. Prior research has shown that sequential logic operating on stochastic bit streams can compute non-polynomial functions, such as the tanh function, with less energy than conventional implementations. However, the functions that can be computed in this way are quite limited. For example, high order polynomials and non-polynomial functions cannot be computed using prior approaches. This paper proposes a new finite-state machine (FSM) topology for complex arithmetic computation on stochastic bit streams. It describes a general methodology for synthesizing such FSMs. Experimental results show that these FSM-based implementations are more tolerant of soft errors and less costly in terms of the area-time product that conventional implementations.
[Show abstract][Hide abstract] ABSTRACT: Workload consolidation is a key technique in reducing costs in virtualized datacenters. When considering storage consolidation, a key problem is the unpredictable performance behavior of consolidated workloads on a given storage system. In practice, this often forces system administrators to grossly overprovision storage to meet application demands. In this paper, we show that existing modeling techniques are inaccurate and ineffective in the face of heterogenous devices. We introduce Romano, a storage performance management system designed to optimize truly heterogeneous virtualized datacenters. At its core, Romano constructs and adapts approximate workload-specific performance models of storage devices automatically, along with prediction intervals. It then applies these models to allow highly efficient IO load balancing. End-to-end experiments demonstrate that Romano reduces prediction error by 80% on average compared with existing techniques. The result is improved load balancing with lowered variance by 82% and reduced average and maximum latency observed across the storage systems by 52% and 78%, respectively.
[Show abstract][Hide abstract] ABSTRACT: Researchers showed that performing computation directly on storage devices improves system performance in terms of energy consumption and processing time. For example, Riedel et al.  proposed an active disk which performs computation using the processor in a hard disk drive (HDD). Their experimental results showed that the active disk-based system had a factor of 2x performance improvement . However, because the performance gap between the HDDs and CPUs becomes larger and larger, the active disk-based improvement is quite limited. As the role of flash memory increases in storage architectures, solid-state drives (SSDs) have gradually displaced the HDDs with higher access performance and lower power consumption. Researchers also proposed an active flash, which performs computation using a controller in the SSD . However, the SSD controller needs to implement a flash translation layer to make the SSD as an emulated HDD for most operating systems. It also needs to communicate with a host interface to transfer required data. The additional computation power can be utilized is quite limited. To maximize the computation power on the SSD, we propose a processor design called storage processing unit (SPU).
[Show abstract][Hide abstract] ABSTRACT: Computation performed on stochastic bit streams is less efficient than that based on a binary radix because of its long latency. However, for certain complex arithmetic operations, computation on stochastic bit streams can consume less energy and tolerate more soft errors. In addition, the latency issue could be solved by using a faster clock frequency or in combination with a parallel processing approach. To take advantage of this computing technique, previous work proposed a combinational logic-based reconfigurable architecture to perform complex arithmetic operations on stochastic streams of bits. In this paper, we enhance and extend this reconfigurable architecture using sequential logic. Compared to the previous approach, the proposed reconfigurable architecture takes less hardware area and consumes less energy, while achieving the same performance in terms of processing time and fault-tolerance.
[Show abstract][Hide abstract] ABSTRACT: Given an N-point sequence, finding its k largest components in the frequency domain is a problem of great interest. This problem, which is usually referred to as a sparse Fourier Transform, was recently brought back on stage by a newly proposed algorithm called the sFFT. In this paper, we present a parallel implementation of sFFT on both multi-core CPUs and GPUs using a human voice signal as a case study. Using this example, an estimate of k for the 3dB cutoff points was conducted through concrete experiments. In addition, three optimization strategies are presented in this paper. We demonstrate that the multi-core-based sFFT achieves speedups of up to three times a single-threaded sFFT while a GPU-based version achieves up to ten times speedup. For large scale cases, the GPU-based sFFT also shows its considerable advantages, which is about 40 times speedup compared to the latest out-of-card FFT implementations .
[Show abstract][Hide abstract] ABSTRACT: Numerical integration is a widely used approach for computing an approximate result of a definite integral. Conventional digital implementations of numerical integration using binary radix encoding are costly in terms of hardware and have long computational delay. This work proposes a novel method for performing numerical integration based on the paradigm of logical computation on stochastic bit streams. In this paradigm, ordinary digital circuits are employed but they operate on stochastic bit streams instead of deterministic values; the signal value is encoded by the probability of obtaining a one versus a zero in the streams. With this type of computation, complex arithmetic operations can be implemented with very simple circuitry. However, typically, such stochastic implementations have long computational delay, since long bit streams are required to encode precise values. This paper proposes a stochastic design for numerical integration characterized by both small area and short delay - so, in contrast to previous applications, a win on both metrics. The design is based on mathematical analysis that demonstrates that the summation of a large number of terms in the numerical integration could lead to a significant delay reduction. An architecture is proposed for this task. Experiments confirm that the stochastic implementation has smaller area and shorter delay than conventional implementations.
[Show abstract][Hide abstract] ABSTRACT: Recent advances in flash memory show great potential to replace traditional hard drives (HDDs) with flash-based solid state drives (SSDs) from personal computing to distributed systems. However, it is still a long way to go before completely using SSDs for enterprise data storage. Considering the cost, performance, and reliability of SSDs, a practical solution is to combine both SSDs and HDDs together. This paper proposes a hybrid storage system named PASS (Performance-dAta Synchronization - hybrid storage System) to tradeoff between I/O performance and data discrepancy between SSDs and HDDs. PASS includes a high-performance SSD and a traditional HDD to store mirrored data for reliability. All of the I/O requests are redirected to the primary SSD first and then the updated data blocks are copied to the backup HDD asynchronously. In order to hide the latency of copying operations, we use an I/O window to coalesce write requests and maintain an ordered I/O queue to shorten the HDD seek and rotation times. Depending on the charateristics of different I/O workloads, we develop an adaptive policy to dynamically balance the foreground I/O processing and background mirroring. We implement a prototype system of PASS by developing a Linux device driver and conduct experiments on the IoMeter, PostMark, and TPCC benchmarks. Our results show that PASS can achieve up to 12 times the performance of a RAID1 storage system for the IoMeter and PostMark workloads while tolerating less than 2% data discrepancy between the primary SSD and the backup HDD. More interestingly, while PASS does not produce any performance benefit for the TPC-C benchmark, it does allow the system to scale to larger sizes than when using an HDD-based RAID system alone.
[Show abstract][Hide abstract] ABSTRACT: Nanoelectromechanical systems (NEMS) is an emerging nanoscale technology that combines mechanical and electrical effects in devices. A variety of NEMS-based devices have been proposed for integrated chip designs. Amongst them are near-ideal digital switches. The electromechanical principles that are the basis of these switches impart the capability of extremely low power switching characteristics to digital circuits. NEMS switching devices have been mostly used as simple switches to provide digital operation, however, we observe that their unique operation can be used to accomplish logic functions directly. In this paper, we propose a novel technique called ‘weighted area logic’ to design logic circuits with NEMS-based switches. The technique takes advantage of the unique structural configurations possible with the NEMS devices to convert the digital switch from a simple ON-OFF switch to a logical switch. This transformation not only reduces the delay of complex logic units, but also decreases the power and area of the implementation further. To demonstrate this, we show the new designs of the logic functions of NAND, XOR and a three input function Y = A + B.C, and compose them into a 32-bit adder. Through simulation, we quantify the power, delay and area advantages of using the weighted area logic technique over a standard CMOS-like design technique applied to NEMS.
[Show abstract][Hide abstract] ABSTRACT: Phase change memory (PCM) is a promising technology to solve energy and performance bottlenecks for memory and storage systems. To help understand the reliability characteristics of PCM devices, we present a simple fault model to categorize four types of PCM errors. Based on our proposed fault model, we conduct extensive experiments on real PCM devices at the memory module level. Numerical results uncover many interesting trends in terms of the lifetime of PCM devices and error behaviors. Specifically, PCM lifetime for the memory chips we tested is greater than 14 million cycles, which is much longer than for flash memory devices. In addition, the distributions for four types of errors are quite different. These results can be used for estimating PCM lifetime and for measuring the fabrication quality of individual PCM memory chips.