David J. Lilja

University of Minnesota Twin Cities, Minneapolis, Minnesota, United States

Publications (289) · 63.93 Total Impact Points

  • ABSTRACT: Most digital systems operate on a positional representation of data, such as binary radix. An alternative is to operate on random bit streams where the signal value is encoded by the probability of obtaining a one versus a zero. This representation is much less compact than binary radix. However, complex operations can be performed with very simple logic. Furthermore, since the representation is uniform, with all bits weighted equally, it is highly tolerant of soft errors (i.e., bit flips). Both combinational and sequential constructs have been proposed for operating on stochastic bit streams. Prior work has shown that combinational logic can implement multiplication and scaled addition effectively, while linear finite-state machines (FSMs) can implement complex functions such as exponentiation and tanh effectively. Prior work on stochastic computation has largely been validated empirically. This paper provides a rigorous mathematical treatment of stochastic implementation of complex functions such as exponentiation and tanh implemented using linear FSMs. It presents two new functions, an absolute value function and exponentiation based on an absolute value, motivated by specific applications. Experimental results show that the linear FSM-based constructs for these functions have smaller area-delay products than the corresponding deterministic constructs. They also are much more tolerant of soft errors.
    (A generic software sketch of this bit-stream encoding and of bit-stream multiplication follows the publication list.)
    IEEE Transactions on Computers 01/2014; 63(6):1474-1486. · 1.38 Impact Factor
  • David J. Lilja, Raffaela Mirandola
    ABSTRACT: The use of model-based software development is increasingly popular due to recent advancements in modeling technology. Numerous approaches exist; this paper seeks to organize and characterize them. In particular, important terminological confusion, challenges, ...
    Software and Systems Modeling 10/2013; · 1.25 Impact Factor
  • Ding Liu, Ruixuan Li, David J. Lilja, Weijun Xiao
    ABSTRACT: Singular value decomposition (SVD) is a fundamental linear operation that has been used for many applications, such as pattern recognition and statistical information processing. In order to accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping the CPU and the GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm has better performance than MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
    Proceedings of the ACM International Conference on Computing Frontiers; 05/2013
  • Weijun Xiao, Peng Li, David J. Lilja
    ABSTRACT: Since stochastic computing performs operations using streams of bits that represent probability values instead of deterministic values, it can tolerate a large number of failures in a noisy system. However, the simulation of a stochastic implementation is extremely time-consuming. In this paper, we investigate two approaches to speed up stochastic simulation: a GPU-based simulation and an OpenMP-based simulation. To compare these two approaches, we start with several basic stochastic computing elements (SCEs) and then use the stochastic implementation of a frame difference-based image segmentation algorithm as a case study to conduct extensive experiments. Measured results show that the GPU-based simulation with 448 processing elements can achieve up to a 119x performance speedup compared to the single-threaded CPU simulation and a 17x performance speedup over the OpenMP-based simulation with eight processor cores. In addition, we present several performance optimisations for the GPU-based simulation which significantly benefit the performance of stochastic simulation.
    International Journal of Computational Science and Engineering. 02/2013; 8(1):34-46.
  • ABSTRACT: Stochastic computing is a novel approach to real arithmetic, offering better error tolerance and lower hardware costs than conventional implementations. Stochastic modules are digital systems that process random bit streams representing real values in the unit interval. Stochastic modules based on finite state machines (FSMs) have been shown to realize complicated arithmetic functions much more efficiently than combinational stochastic modules. However, a general approach to synthesize FSMs for realizing arbitrary functions has been elusive. We describe a systematic procedure to design FSMs that implement arbitrary real-valued functions in the unit interval using the Taylor series approximation.
    Computer Design (ICCD), 2013 IEEE 31st International Conference on; 01/2013
  • Peng Li, K. Gomez, D.J. Lilja
    ABSTRACT: Energy consumption is a fundamental issue in today's data centers as data volumes continue to grow dramatically. Processing these data in an energy-efficient way is becoming more and more important. Prior work has proposed several methods to build an energy-efficient system. The basic idea is to attack the memory wall issue (i.e., the performance gap between CPUs and main memory) by moving computing closer to the data. However, these methods have not been widely adopted due to high cost and limited performance improvements. In this paper, we propose the storage processing unit (SPU), which adds computing power to NAND flash memories at standard solid-state drive (SSD) cost. By pre-processing the data using the SPU, the amount of data that needs to be transferred to host CPUs for further processing is significantly reduced. Simulation results show that the SPU-based system can result in at least 100 times lower energy per operation than a conventional system for data-intensive applications.
    High Performance Extreme Computing Conference (HPEC), 2013 IEEE; 01/2013
  • Peng Li, D.J. Lilja
    ABSTRACT: Stochastic encoding represents a value using the probability of ones in a random bit stream. Computation based on this encoding has good fault-tolerance and low hardware cost. However, one of its major issues is long processing time. A sufficiently long bit stream must be used to represent a value in order to guarantee that random fluctuations introduce only small errors into the final computation results. For example, for most digital image processing algorithms, we need a 512-bit stream to represent an 8-bit pixel value stochastically to guarantee that the final computation error is less than 5%. To address this issue, this paper proposes sharing bits between adjacent bit streams that represent adjacent deterministic values. For example, in image processing applications, the bit stream that represents the current pixel value can share part of the bit stream that represents the previous pixel value. We use an image contrast stretching algorithm to evaluate this method. Our experimental results show that the proposed methods can improve performance by 90%.
    Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on; 01/2013
  • ABSTRACT: The paradigm of logical computation on stochastic bit streams has several key advantages compared to deterministic computation based on binary radix, including error-tolerance and low hardware area cost. Prior research has shown that sequential logic operating on stochastic bit streams can compute non-polynomial functions, such as the tanh function, with less energy than conventional implementations. However, the functions that can be computed in this way are quite limited. For example, high-order polynomials and non-polynomial functions cannot be computed using prior approaches. This paper proposes a new finite-state machine (FSM) topology for complex arithmetic computation on stochastic bit streams. It describes a general methodology for synthesizing such FSMs. Experimental results show that these FSM-based implementations are more tolerant of soft errors and less costly in terms of the area-time product than conventional implementations.
    Computer-Aided Design (ICCAD), 2012 IEEE/ACM International Conference on; 11/2012
  • Nohhyun Park, Irfan Ahmad, David J. Lilja
    ABSTRACT: Workload consolidation is a key technique in reducing costs in virtualized datacenters. When considering storage consolidation, a key problem is the unpredictable performance behavior of consolidated workloads on a given storage system. In practice, this often forces system administrators to grossly overprovision storage to meet application demands. In this paper, we show that existing modeling techniques are inaccurate and ineffective in the face of heterogeneous devices. We introduce Romano, a storage performance management system designed to optimize truly heterogeneous virtualized datacenters. At its core, Romano constructs and adapts approximate workload-specific performance models of storage devices automatically, along with prediction intervals. It then applies these models to allow highly efficient IO load balancing. End-to-end experiments demonstrate that Romano reduces prediction error by 80% on average compared with existing techniques. The result is improved load balancing, with variance lowered by 82% and with the average and maximum latency observed across the storage systems reduced by 52% and 78%, respectively.
    Proceedings of the Third ACM Symposium on Cloud Computing; 10/2012
  • Peng Li, Kevin Gomez, David J. Lilja
    ABSTRACT: Researchers have shown that performing computation directly on storage devices improves system performance in terms of energy consumption and processing time. For example, Riedel et al. [2] proposed an active disk which performs computation using the processor in a hard disk drive (HDD). Their experimental results showed that the active disk-based system achieved a factor of 2x performance improvement [2]. However, because the performance gap between HDDs and CPUs continues to widen, the improvement available from active disks is quite limited. As the role of flash memory in storage architectures grows, solid-state drives (SSDs) have gradually displaced HDDs, offering higher access performance and lower power consumption. Researchers have also proposed an active flash, which performs computation using the controller in the SSD [1]. However, the SSD controller needs to implement a flash translation layer to present the SSD as an emulated HDD to most operating systems. It also needs to communicate with a host interface to transfer required data. The additional computation power that can be utilized is therefore quite limited. To maximize the computation power on the SSD, we propose a processor design called the storage processing unit (SPU).
    Proceedings of the 21st international conference on Parallel architectures and compilation techniques; 09/2012
  • ABSTRACT: Numerical integration is a widely used approach for computing an approximate result of a definite integral. Conventional digital implementations of numerical integration using binary radix encoding are costly in terms of hardware and have long computational delay. This work proposes a novel method for performing numerical integration based on the paradigm of logical computation on stochastic bit streams. In this paradigm, ordinary digital circuits are employed but they operate on stochastic bit streams instead of deterministic values; the signal value is encoded by the probability of obtaining a one versus a zero in the streams. With this type of computation, complex arithmetic operations can be implemented with very simple circuitry. However, typically, such stochastic implementations have long computational delay, since long bit streams are required to encode precise values. This paper proposes a stochastic design for numerical integration characterized by both small area and short delay, so, in contrast to previous applications, it is a win on both metrics. The design is based on mathematical analysis demonstrating that the summation of a large number of terms in the numerical integration can lead to a significant delay reduction. An architecture is proposed for this task. Experiments confirm that the stochastic implementation has smaller area and shorter delay than conventional implementations.
    (A generic software sketch of the underlying summation follows the publication list.)
    Computer-Aided Design (ICCAD), 2012 IEEE/ACM International Conference on; 01/2012
  • ABSTRACT: Nanoelectromechanical systems (NEMS) are an emerging nanoscale technology that combines mechanical and electrical effects in devices. A variety of NEMS-based devices have been proposed for integrated chip designs. Amongst them are near-ideal digital switches. The electromechanical principles that are the basis of these switches impart the capability of extremely low power switching characteristics to digital circuits. NEMS switching devices have mostly been used as simple switches to provide digital operation; however, we observe that their unique operation can be used to accomplish logic functions directly. In this paper, we propose a novel technique called ‘weighted area logic’ to design logic circuits with NEMS-based switches. The technique takes advantage of the unique structural configurations possible with NEMS devices to convert the digital switch from a simple ON-OFF switch into a logical switch. This transformation not only reduces the delay of complex logic units, but also further decreases the power and area of the implementation. To demonstrate this, we show new designs of the NAND and XOR logic functions and of a three-input function Y = A + B·C, and compose them into a 32-bit adder. Through simulation, we quantify the power, delay, and area advantages of using the weighted area logic technique over a standard CMOS-like design technique applied to NEMS.
    01/2012;
  • Peng Li, Weikang Qian, D.J. Lilja
    ABSTRACT: Computation performed on stochastic bit streams is less efficient than that based on a binary radix because of its long latency. However, for certain complex arithmetic operations, computation on stochastic bit streams can consume less energy and tolerate more soft errors. In addition, the latency issue could be solved by using a faster clock frequency or by combining the approach with parallel processing. To take advantage of this computing technique, previous work proposed a combinational logic-based reconfigurable architecture to perform complex arithmetic operations on stochastic streams of bits. In this paper, we enhance and extend this reconfigurable architecture using sequential logic. Compared to the previous approach, the proposed reconfigurable architecture requires less hardware area and consumes less energy, while achieving the same performance in terms of processing time and fault-tolerance.
    Computer Design (ICCD), 2012 IEEE 30th International Conference on; 01/2012
  • ABSTRACT: Recent advances in flash memory show great potential to replace traditional hard disk drives (HDDs) with flash-based solid state drives (SSDs), from personal computing to distributed systems. However, there is still a long way to go before SSDs can be used exclusively for enterprise data storage. Considering the cost, performance, and reliability of SSDs, a practical solution is to combine SSDs and HDDs. This paper proposes a hybrid storage system named PASS (Performance-dAta Synchronization - hybrid storage System) to trade off I/O performance against data discrepancy between SSDs and HDDs. PASS includes a high-performance SSD and a traditional HDD that store mirrored data for reliability. All of the I/O requests are redirected to the primary SSD first, and the updated data blocks are then copied to the backup HDD asynchronously. In order to hide the latency of the copying operations, we use an I/O window to coalesce write requests and maintain an ordered I/O queue to shorten the HDD seek and rotation times. Depending on the characteristics of different I/O workloads, we develop an adaptive policy to dynamically balance the foreground I/O processing and the background mirroring. We implement a prototype system of PASS as a Linux device driver and conduct experiments with the IoMeter, PostMark, and TPC-C benchmarks. Our results show that PASS can achieve up to 12 times the performance of a RAID1 storage system for the IoMeter and PostMark workloads while tolerating less than 2% data discrepancy between the primary SSD and the backup HDD. More interestingly, while PASS does not produce any performance benefit for the TPC-C benchmark, it does allow the system to scale to larger sizes than when using an HDD-based RAID system alone.
    Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on; 01/2012
  • Shruti Patil, David J. Lilja
    ABSTRACT: Computer performance has traditionally played a fundamental role in computer system design. Performance measurements and their quantitative analyses are useful in understanding and enhancing a computer system and in comparing different systems. Owing to the stochastic nature of systems, measuring performance is prone to noise and experimental errors. The field of statistics provides a rich set of tools and techniques to effectively deal with such noise and errors, allowing one to obtain meaningful measurements. Statistical techniques have also been applied in many other areas of computer performance measurement, such as designing experiments, simulation, analyzing results, and predicting performance. In this article, we present a comprehensive discussion of how statistical theories apply during the process of computer performance measurement and describe the statistical techniques that have been used in the literature. WIREs Comp Stat 2012, 4:98–106. doi: 10.1002/wics.192
    (A generic confidence-interval sketch follows the publication list.)
    Wiley Interdisciplinary Reviews: Computational Statistics. 01/2012; 4(1).
  • ABSTRACT: The Stochastic Computational Element (SCE) uses streams of random bits (stochastic bit streams) to perform computation with conventional digital logic gates. It can guarantee reliable computation using unreliable devices. In stochastic computing, the linear Finite State Machine (FSM) can be used to implement some sophisticated functions, such as the exponentiation and tanh functions, more efficiently than combinational logic. However, a general approach for synthesizing a linear FSM-based SCE for a target function has not been available. In this paper, we introduce three properties of the linear FSM used in stochastic computing and demonstrate a general approach for synthesizing a linear FSM-based SCE for a target function. Experimental results show that our approach produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
    01/2012;
  • ABSTRACT: Phase change memory (PCM) is a promising technology to solve energy and performance bottlenecks for memory and storage systems. To help understand the reliability characteristics of PCM devices, we present a simple fault model to categorize four types of PCM errors. Based on our proposed fault model, we conduct extensive experiments on real PCM devices at the memory module level. Numerical results uncover many interesting trends in terms of the lifetime of PCM devices and error behaviors. Specifically, PCM lifetime for the memory chips we tested is greater than 14 million cycles, which is much longer than for flash memory devices. In addition, the distributions for four types of errors are quite different. These results can be used for estimating PCM lifetime and for measuring the fabrication quality of individual PCM memory chips.
    Computer Design (ICCD), 2012 IEEE 30th International Conference on; 01/2012
  • ABSTRACT: Given an N-point sequence, finding its k largest components in the frequency domain is a problem of great interest. This problem, usually referred to as the sparse Fourier transform, recently regained attention with a newly proposed algorithm called the sFFT. In this paper, we present a parallel implementation of the sFFT on both multi-core CPUs and GPUs, using a human voice signal as a case study. Using this example, an estimate of k for the 3 dB cutoff points was obtained through concrete experiments. In addition, three optimization strategies are presented in this paper. We demonstrate that the multi-core-based sFFT achieves speedups of up to three times over a single-threaded sFFT, while a GPU-based version achieves up to a ten times speedup. For large-scale cases, the GPU-based sFFT also shows a considerable advantage, achieving about a 40 times speedup compared to the latest out-of-card FFT implementations [2].
    Computer Architecture and High Performance Computing (SBAC-PAD), 2012 IEEE 24th International Symposium on; 01/2012
  • ABSTRACT: We investigated a magnetic tunnel junction (MTJ)-based circuit that allows direct communication between elements without intermediate sensing amplifiers. Two- and three-input circuits that consist of two and three MTJs connected in parallel, respectively, were fabricated and are compared. The direct communication is realized by connecting the output in series with the input and applying voltage across the series connections. The logic circuit relies on the fact that a change in resistance at the input modulates the voltage that is needed to supply the critical current for spin-transfer torque switching of the output. The change in the resistance at the input resulted in a voltage swing of 50-200 mV and 250-300 mV for the closest input states for the three- and two-input designs, respectively. The two-input logic gate realizes the AND, NAND, NOR, and OR logic functions. The three-input logic gate realizes the majority, AND, NAND, NOR, and OR logic operations.
    IEEE Transactions on Magnetics 11/2011; · 1.42 Impact Factor
  • Peng Li, D.J. Lilja
    ABSTRACT: The kernel density estimation (KDE)-based image segmentation algorithm has excellent segmentation performance. However, this algorithm is computationally intensive. In addition, although this algorithm can tolerate noise in the input images, such as the noise due to snow, rain, or camera shaking, it is sensitive to noise from the internal computing circuits, such as the noise due to soft errors or PVT (process, voltage, and temperature) variation. Tolerating this kind of noise becomes more and more important as device scaling continues to nanoscale dimensions. Stochastic computing, which uses streams of random bits (stochastic bit streams) to perform computation with conventional digital logic gates, can guarantee reliable computation using unreliable devices. In this paper, we present a stochastic computing implementation of the KDE-based image segmentation algorithm. Our experimental results show that, under the same time constraint, the stochastic implementation is much more tolerant of faults and consumes less hardware and power compared to a conventional (nonstochastic) implementation. Furthermore, compared to a Triple Modular Redundancy (TMR) fault tolerance technique, the stochastic architecture tolerates substantially more soft errors with lower power consumption.
    (A generic kernel density estimation sketch follows the publication list.)
    Application-Specific Systems, Architectures and Processors (ASAP), 2011 IEEE International Conference on; 10/2011
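
Illustrative sketch (not from the papers): several of the stochastic-computing entries above, such as the IEEE Transactions on Computers article, describe encoding a value in [0, 1] as the probability of ones in a random bit stream, with multiplication realized by simple combinational logic. The minimal Python sketch below shows that encoding and multiplication by a bitwise AND of two independent streams; the stream length and helper names are assumptions made only for illustration.

    import random

    def encode(value, length=1024):
        # Encode a real value in [0, 1] as a bit stream whose probability
        # of a 1 equals the value (hypothetical helper, not from the papers).
        return [1 if random.random() < value else 0 for _ in range(length)]

    def decode(stream):
        # Recover the encoded value as the fraction of ones in the stream.
        return sum(stream) / len(stream)

    def multiply(stream_a, stream_b):
        # A bitwise AND of two independent streams multiplies their values,
        # since P(a AND b) = P(a) * P(b).
        return [a & b for a, b in zip(stream_a, stream_b)]

    # 0.5 * 0.4 is roughly 0.2, up to random fluctuation in the finite streams.
    print(decode(multiply(encode(0.5), encode(0.4))))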
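
Illustrative sketch (not from the paper): the ICCAD 2012 numerical-integration entry above maps a summation of many terms onto stochastic bit-stream hardware. For reference, a generic software version of the underlying computation is a midpoint Riemann sum, sketched below; the function and parameter names are hypothetical and nothing here reflects the paper's architecture.

    def riemann_integral(f, a, b, n=10000):
        # Approximate the definite integral of f over [a, b] with a midpoint
        # Riemann sum of n terms.
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))

    # Example: the integral of x^2 over [0, 1] is 1/3.
    print(riemann_integral(lambda x: x * x, 0.0, 1.0))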
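
Illustrative sketch (not from the article): the Wiley Interdisciplinary Reviews entry above surveys statistical techniques for coping with noisy performance measurements. One textbook technique of that kind is a confidence interval for the mean of repeated timing measurements, sketched below using Student's t distribution; the sample data and function name are hypothetical.

    import math
    from statistics import mean, stdev

    from scipy.stats import t  # Student's t distribution

    def confidence_interval(samples, confidence=0.95):
        # Two-sided confidence interval for the mean, assuming approximately
        # normally distributed measurement errors.
        n = len(samples)
        half_width = t.ppf((1 + confidence) / 2, n - 1) * stdev(samples) / math.sqrt(n)
        return mean(samples) - half_width, mean(samples) + half_width

    # Hypothetical execution times (in seconds) from repeated benchmark runs.
    print(confidence_interval([12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2]))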
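
Illustrative sketch (not from the paper): the ASAP 2011 entry above builds on kernel density estimation (KDE) for image segmentation. The sketch below shows the generic per-pixel Gaussian KDE that such algorithms use, labeling a pixel as foreground when it is unlikely under a density estimated from recent samples; the bandwidth and threshold values are assumptions, and this is ordinary software rather than the paper's stochastic circuit.

    import math

    def kde_probability(pixel, history, bandwidth=10.0):
        # Gaussian kernel density estimate of the current pixel value, given
        # recent samples of that pixel from previous frames.
        norm = 1.0 / (len(history) * bandwidth * math.sqrt(2.0 * math.pi))
        return norm * sum(math.exp(-0.5 * ((pixel - h) / bandwidth) ** 2)
                          for h in history)

    def is_foreground(pixel, history, threshold=1e-3):
        # Label the pixel foreground when it is unlikely under the background model.
        return kde_probability(pixel, history) < threshold

    # A pixel that jumps far from its recent history is classified as foreground.
    print(is_foreground(200, [90, 92, 88, 91, 89, 93]))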

Publication Stats

3k Citations
63.93 Total Impact Points

Institutions

  • 1993–2013
    • University of Minnesota Twin Cities
      • Department of Electrical and Computer Engineering
      • Department of Computer Science and Engineering
      Minneapolis, Minnesota, United States
  • 1970–2013
    • University of Minnesota Duluth
      • Department of Electrical Engineering
      Duluth, Minnesota, United States
  • 2012
    • Shanghai Jiao Tong University
      Shanghai, Shanghai Shi, China
  • 2008
    • Virginia Polytechnic Institute and State University
      Blacksburg, Virginia, United States
  • 2005–2008
    • University of Rhode Island
      • Department of Electrical, Computer, and Biomedical Engineering
      Kingston, Rhode Island, United States
  • 2006–2007
    • Freescale Semiconductor, Inc.
      Austin, Texas, United States
    • Lockheed Martin Corporation
      Maryland, United States
    • Azusa Pacific University
      Azusa, California, United States
    • University of Texas at Austin
      • Department of Electrical & Computer Engineering
      Austin, Texas, United States
  • 2004
    • Honeywell
      Morristown, New Jersey, United States
  • 2000
    • University of Wisconsin - River Falls
      River Falls, Wisconsin, United States
    • University of British Columbia - Vancouver
      • Department of Electrical and Computer Engineering
      Vancouver, British Columbia, Canada
  • 1998
    • Rice University
      Houston, Texas, United States
  • 1989–1991
    • University of Illinois, Urbana-Champaign
      Urbana, Illinois, United States