Ping Tang

  • MTS at Rivos Inc

About

80
Publications
20,514
Reads
3,879
Citations
Current institution
Rivos Inc
Current position
  • MTS

Publications

Publications (80)
Article
The tremendous success of machine learning (ML) and the unabated growth in model complexity have motivated many ML-specific designs in hardware architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook'...
Preprint
Full-text available
Soft error, namely silent corruption of signal or datum in a computer system, cannot be cavalierly ignored as compute and communication density grow exponentially. Soft error detection has been studied in the context of enterprise computing, high-performance computing and more recently in convolutional neural networks related to autonomous driving...
Preprint
In recommendation systems, practitioners have observed that increasing the number of embedding tables and their sizes often leads to significant improvement in model performance. Given this and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to gr...
Preprint
Full-text available
A dynamical neural network consists of a set of interconnected neurons that interact over time continuously. It can exhibit computational properties in the sense that the dynamical system's evolution and/or limit points in the associated state space can correspond to numerical solutions to certain mathematical optimization or learning problems. Suc...
Article
Full-text available
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise...
Article
Full-text available
In a spiking neural network (SNN), individual neurons operate autonomously and only communicate with other neurons sparingly and asynchronously via spike signals. These characteristics render a massively parallel hardware implementation of an SNN a potentially powerful computer, albeit a non von Neumann one. But can one guarantee that an SNN computer s...
Article
Full-text available
Sparse methods and the use of Winograd convolutions are two orthogonal approaches each of which significantly accelerates convolution computations in modern CNNs. Sparse Winograd merges these two and thus has the potential to offer a combined performance benefit. Nevertheless, training convolution layers so that the resulting Winograd kernels are s...
Chapter
Full-text available
Erratum to: R.F. Boisvert and P.T.P. Tang (Eds.) The Architecture of Scientific Software DOI: 10.1007/978-0-387-35407-1
Article
Full-text available
The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually $32$--$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch...
Article
Full-text available
This paper establishes several convergence results about flows of the dynamical system LCA (Locally Competitive Algorithm) to the mixed $\ell_2$-$\ell_1$ minimization problem LASSO and the constrained version, called CLASSO here, where the parameters are required to be non-negative. (C)LASSO problems are closely related to various important applica...
Article
In this article, we present an efficient algorithm to compute the faithful rounding of the l2-norm of a floating-point vector. This means that the result is accurate to within 1 bit of the underlying floating-point type. This algorithm does not generate overflows or underflows spuriously, but does so when the final result calls for such a numerical...
Article
Full-text available
A detailed new upgrade of the FEAST eigensolver targeting non-Hermitian eigenvalue problems is presented and thoroughly discussed. It aims at broadening the class of eigenproblems that can be addressed within the framework of the FEAST algorithm. The algorithm is ideally suited for computing selected interior eigenvalues and their associated right/...
Article
Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for fine-grained event-driven operation, which improves performance by increasing the overlap of c...
Patent
A new function for calculating the reciprocal residual of a floating-point number X is defined as recip_residual(X)=1−X*recip(X), where recip(X) represents the reciprocal of X. The function may be implemented using a fused multiply-add unit in a processor. The reciprocal value of X, recip(X), may be obtained from a lookup table. The recip_residual...
Article
Full-text available
The articles in this special issue focus on current trends and developments in the field of computer arithmetic. This is a field that encompasses the definition and standardization of arithmetic systems for computers. The field also deals with issues of hardware and software implementations and their subsequent testing and verification. Many practit...
Article
Full-text available
The FEAST method for solving large sparse eigenproblems is equivalent to subspace iteration with an approximate spectral projector and implicit orthogonalization. This relation allows one to characterize the convergence of this method in terms of the error of a certain rational approximant to an indicator function. We propose improved rational approxim...
Article
Full-text available
Calculating portions of eigenvalues and eigenvectors of matrices or matrix pencils has many applications. An approach to this calculation for Hermitian problems based on a density matrix was proposed in 2009 and a software package called FEAST has been developed. The density-matrix approach allows FEAST's implementation to exploit a key streng...
Conference Paper
Full-text available
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nod...
Article
Full-text available
The calculation of a segment of eigenvalues and their corresponding eigenvectors of a Hermitian matrix or matrix pencil has many applications. A new density-matrix-based algorithm has been proposed recently and a software package FEAST has been developed. The density-matrix approach allows FEAST's implementation to exploit a key strength of modern...
Article
In high-performance computing on distributed-memory systems, communication often represents a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute-density growth follows the current technology and industry trends. Design of lower-communication alternatives to fundamental compu...
Article
Full-text available
Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonstrate this synergy through synthetic aperture radar (SAR) via backprojection, an image reconstruction method that can require hundreds of TFLOPS. Compu...
Conference Paper
Full-text available
In high-performance computing on distributed-memory systems, communication often represents a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute-density growth follows the current technology and industry trends. Design of lower-communication alternatives to fundamental compu...
Conference Paper
Full-text available
Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonstrate this synergy through synthetic aperture radar (SAR) via backprojection, an image reconstruction method that can require hundreds of TFLOPS. Compu...
Conference Paper
Full-text available
Digit-by-rounding algorithms enable efficient hardware implementations of algebraic functions such as the reciprocal, square root, or reciprocal square root, but certifying the correctness of such algorithms is a nontrivial endeavor. Traditionally, sufficient conditions for correctness are derived as closed-form formulae relating key design paramet...
Conference Paper
Full-text available
We describe a high-performance digit-recurrence algorithm for computing exactly rounded reciprocals, square roots, and reciprocal square roots in hardware at a rate of three result bits—one radix-8 digit—per recurrence iteration. To achieve a single-cycle recurrence at a short cycle time, we adapted the digit-by-rounding algorithm, which is normall...
Article
Full-text available
The IEEE Standard 754-1985 for binary floating-point arithmetic [19] was revised [20], and an important addition is the definition of decimal floating-point arithmetic [8], [24]. This is intended mainly to provide a robust reliable framework for financial applications that are often subject to legal requirements concerning rounding and precision of...
Conference Paper
Full-text available
The IEEE Standard 754-1985 for binary floating-point arithmetic [1] was revised [2], and an important addition is the definition of decimal floating-point arithmetic. This is intended mainly to provide a robust, reliable framework for financial applications that are often subject to legal requirements concerning rounding and precision of the result...
Conference Paper
Full-text available
Most implementations of the modular exponentiation, $M^E \bmod N$, computation in cryptographic algorithms employ Montgomery multiplication, $ABR^{-1} \bmod N$, instead of modular multiplication, $AB \bmod N$, even though the former requires some transformational overheads. This is so because a state-of-the-art Montgomery multiplication implementati...
Article
The Fast Fourier Transform (FFT) algorithm that calculates the Discrete Fourier Transform (DFT) is one of the major breakthroughs in scientific computing and is now an indispensable tool in a vast number of fields. Unfortunately, software applications that provide fast computation of DFT via FFT differ vastly in functionality and lack uniformity. A...
Conference Paper
Full-text available
New microprocessor architectures often require software support for basic arithmetic operations such as divide, or square root. The Intel® XScale™ processor, designed for low power mobile devices, provides no hardware support for floating point. We show that an efficient software implementation of the basic operations and math library routines can...
Article
Full-text available
The Intel® Itanium® architecture is increasingly becoming one of the major processor architectures present in the market today. Launched in 2001, the Intel Itanium processor was followed in 2002 by the Itanium 2 processor, with increased integer and floating-point performance. Measured by the SPEC CINT2000 benchmarks, the Itanium 2 processor still...
Conference Paper
New microprocessor architectures often require software support for basic arithmetic operations such as divide, or square root. The Intel® XScale™ processor, designed for low power mobile devices, provides no hardware support for floating-point. We show that an efficient software implementation of the basic operations and math library routines can...
Article
Full-text available
The 64-bit Intel® Itanium® architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large numbe...
Conference Paper
The 64-bit Intel® Itanium™ architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large number...
Article
The 64-bit Intel® Itanium™ architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large...
Article
The fast and accurate evaluation of transcendental functions (e.g. exp, log, sin, and atan) is vitally important in many fields of scientific computing. Intel provides a software library of these functions that can be called from both the C and FORTRAN programming languages. By exploiting some of the key features of the IA-64 floating-point archite...
Article
A good preconditioner is extremely important in order for the conjugate gradient method to converge quickly. In the case of Toeplitz matrices, a number of recent studies were made to relate approximation of functions to good preconditioners. In this paper, we present a new result relating the quality of the Toeplitz preconditioner C for the matrix T...
Chapter
Full-text available
The Fast Fourier Transform (FFT) algorithm that calculates the Discrete Fourier Transform (DFT) is one of the major breakthroughs in scientific computing and is now an indispensable tool in a vast number of fields. Unfortunately, software that provides fast computation of DFT via FFT differs vastly in functionality as well as uniformity. A widely acce...
Chapter
Full-text available
In this chapter we provide some background information on the conference that brought together the researchers whose work is described in this volume.
Book
Scientific applications involve very large computations that strain the resources of whatever computers are available. Such computations implement sophisticated mathematics, require deep scientific knowledge, depend on subtle interplay of different approximations, and may be subject to instabilities and sensitivity to external input. Software able...
Article
Full-text available
The Fast Fourier Transform (FFT) algorithm that calculates the Discrete Fourier Transform (DFT) is one of the major breakthroughs in scientific computing and is now an indispensable tool in a vast number of fields. Unfortunately, software that provides fast computation of DFT via FFT differs vastly in functionality as well as uniformity. A widely accep...
Conference Paper
Tight bounds on rounding errors accumulated during a sequence of operations in expression evaluation can be difficult to obtain. We present two recently developed methods that help solve this problem in the case of rational expressions in one variable. These methods have been successfully applied to recent software developments of highly-accurate tra...
Article
We introduce a pair of dual concepts: pivoted blocks and reverse pivoted blocks. These blocks are the outcome of a special column pivoting strategy in QR factorization. Our main result is that under such a column pivoting strategy, the QR factorization of a given matrix can give tight estimates of any two a priori-chosen consecutive singular values...
Conference Paper
Full-text available
The IA-64 architecture provides new opportunities and challenges for implementing an improved set of transcendental functions. Using several novel polynomial-based table-driven techniques, we are able to provide new algorithms for the transcendental functions. Major improvements include an accuracy level of about 0.6 ulps (units in the last place)...
Article
Full-text available
This paper presents a method for the design of FIR Hilbert transformers and differentiators in the complex domain. The method can be used to obtain conjugate-symmetric designs with smaller group delay compared to linear-phase designs. Non-conjugate symmetric Hilbert transformers are also designed. This paper is an extension of our previous work [sa...
Article
Full-text available
This paper presents a generalization of incremental condition estimation, a technique for tracking the extremal singular values of a triangular matrix. While the original approach allowed for the estimation of the largest or smallest singular value, the generalized scheme allows for the estimation of any number of extremal singular values. For ex...
Article
Three-dimensional microscopy imaging is important for understanding complex biological assemblies. Computers and digital image acquisition systems have made possible the three-dimensional reconstruction of images obtained from optical microscopes. Since processing such images requires tremendous CPU, storage, and I/O capacities, a high-performance...
Article
Full-text available
We develop efficient algorithms for reliable and accurate evaluations of the complex arcsine and arccosine functions. A tight error bound is derived for each algorithm; the results are valid for all machine-representable points in the complex plane. The algorithms are presented in a pseudocode that has a convenient exception-handling facility. Corre...
Conference Paper
Full-text available
The initial release of the Pentium processor has a flaw in its radix-4 SRT division implementation. It is widely-known that five entries were missing in the lookup table, yielding reduced-precision quotients occasionally. In this paper, we use mathematical techniques to analyze the divisors that can possibly cause failures. In particular, we show t...
Article
The initial release of the Pentium processor has a flaw in its radix-4 SRT division implementation. It is widely-known that five entries were missing in the lookup table, yielding reduced-precision quotients occasionally. In this paper, we use mathematical techniques to analyze the divisors that can possibly cause failures. In particular, we show th...
Article
Full-text available
We present a real-time MPEG (Motion Pictures Expert Group) software decoder that uses message-passing libraries such as MPL, p4, and MPI. The parallel MPEG decoder currently runs on the IBM SP system but can be easily ported to other parallel machines. This paper discusses our parallel MPEG decoding algorithm as well as the parallel programming env...
Article
We present an interpretation of multiresolution analysis of signals of arbitrary finite length in terms of matrix theory. In particular, we present a new nonorthogonal MRA associated with 4 coefficients that has vanishing moments up to order 2. A more general theory of this class of transform is also presented. © (1995) COPYRIGHT SPIE--The Intern...
Article
Full-text available
We present an interpretation of multiresolution analysis of signals of arbitrary finite length in terms of matrix theory. In particular, we present a new nonorthogonal MRA associated with 4 coefficients that has vanishing moments up to order 2. A more general theory of this class of transform is also presented.
Article
Full-text available
Wavelet theory and discrete wavelet transforms have had great impact on the field of signal and image processing... In this paper we propose a new class of discrete transforms. It "includes" the classical Haar and Daubechies transforms. Our transforms treat the endpoints of a signal in a different manner from that of conventional techniques. This n...
Article
Full-text available
This paper presents an improved version of incremental condition estimation, a technique for tracking the extremal singular values of a triangular matrix as it is being constructed one column at a time. We present a new motivation for this estimation technique using orthogonal projections. The paper focuses on an implementation of this estimation...
Article
In this paper we propose a new class of discrete transforms. It "includes" the classical Haar and Daubechies transforms. Our transforms treat the endpoints of a signal in a different manner from that of conventional techniques. This new approach allows us to handle signals of any length efficiently; thus, one is not restricted to work with signal or i...
Article
Full-text available
Algorithms are developed for reliable and accurate evaluations of the complex elementary functions required in Fortran 77 and Fortran 90, namely, cabs, csqrt, cexp, clog, csin, and ccos. The algorithms are presented in a pseudocode that has a convenient exception-handling facility. A tight error bound is derived for each algorithm. Corresponding For...
Article
It is shown here that the well-known Rayleigh-Ritz approximation method is applicable in dynamic condition estimation. In fact, it can be used as a common framework from which many recently proposed dynamic condition estimators can be viewed and understood. This framework leads to natural generalizations of some existing dynamic condition estimator...
Article
A good preconditioner is extremely important in order for the conjugate gradients method to converge quickly. In the case of Toeplitz matrices, a number of recent studies were made to relate approximation of functions to good preconditioners. In this paper, we present a new result relating the quality of the Toeplitz preconditioner C for the Toeplit...
Article
This paper presents an algorithm for maintaining Cholesky factors of symmetric positive definite matrices under arbitrary rank-one changes. The algorithm synthesizes Carlson's updating algorithm, and the downdating algorithm recently suggested by Pan to arrive at an algorithm which is both simple and allows for the pipelining of up- and downdates (...
Article
Algorithms and implementation details for the function $e^x - 1$ in both single and double precision of IEEE 754 arithmetic are presented here. With a table of moderate size, the implementations need only working-precision arithmetic and are provably accurate to within 0.58 ulp.
Article
We introduce a pair of dual concepts: pivoted blocks and reverse pivoted blocks. These blocks are the outcome of a special column pivoting strategy in QR factorization. Our main result is that under such a column pivoting strategy, the QR factorization of a given matrix can give tight estimates of any two a priori-chosen consecutive singular values...
Article
A detailed analysis on the accuracy issues in calculating the eigensystems of rank 1 perturbed diagonal systems is presented. Such calculations are the core of the divide-and-conquer technique proposed by Bunch, Nielsen, and Sorensen and refined by Dongarra and Sorensen. In particular, the computed eigenvectors are proved to be guaranteed orthogona...
Article
A comprehensive set of elementary functions has been implemented portably in Ada. The high accuracy of the implementation has been confirmed by rigorous analysis. Moreover, we present new test methods that are efficient and offer a high resolution of 0.005 unit in the last place. These test methods have been implemented portably here and confirm th...
Conference Paper
Full-text available
Table-lookup algorithms for calculating elementary functions offer superior speed and accuracy when compared with more traditional algorithms. It is shown that, with careful design, it is feasible to implement table-lookup algorithms in hardware. A uniform approach for carrying out a tight error analysis for such implementations is presented. The a...
Article
Algorithms and implementation details for the logarithm functions in both single and double precision of IEEE 754 arithmetic are presented here. With a table of moderate size, the implementations need only working-precision arithmetic and are provably accurate to within 0.57 ulp.
Article
Table-driven techniques can be used to test highly accurate implementations of EXP and LOG. The largest errors observed in EXP and LOG, measured accurately to within 1/500 unit in the last place, are reported in our tests. Methods to verify the tests' reliability are discussed. Results of applying the tests to our own as well as to a number of other implementation...
Article
We present several software implementations of the elementary functions sin and cos designed to fit a large class of machines. Implementation details are provided. We also provide a detailed error analysis that bounds the errors of these implementations, over the full range of input arguments, from 0.721 to 0.912 units in the last place. Tests perf...
Article
Algorithms and implementation details for the exponential function in both single- and double-precision of IEEE 754 arithmetic are presented here. With a table of moderate size, the implementations need only working-precision arithmetic and are provably accurate to within 0.54 ulp as long as the final result does not underflow. When the final resul...
Article
Full-text available
We propose a new algorithm for finding best minimax polynomial approximations in the complex plane. The algorithm is the first satisfactory generalization of the well-known Remez algorithm for real approximations. Among all available algorithms, ours is the only quadratically convergent one. Numerical examples are presented to illustrate rapid conv...
Article
Contents: Introduction; Application registers, predication, branches, and register rotation; Floating-point architecture; Memory and speculation; Introduction to assembly language programming for the Itanium architecture; Microarchitecture of the Itanium processor; Floating-point exception handling in the processor family It...
