About
170 Publications
19,756 Reads
2,064 Citations
Publications (170)
Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically-divergent clones very challenging. While recent deep neural networks (for example BERT-based artificial neural networks) seem more effective in detecting such cl...
This letter presents a novel and efficient hardware architecture to accelerate the computation of the point multiplication (PM) primitive over arbitrary Montgomery curves. It is based on a novel double field multiplier (DFM) that computes two field multiplications simultaneously. The DFM uses the interleaved multiplication technique, and it shorten...
The Number Theoretic Transform (NTT) is a central primitive to compute polynomial multiplication in a finite ring for both post-quantum cryptography (PQC) and fully homomorphic encryption (FHE) schemes. This brief presents a novel, efficient NTT hardware architecture suitable for CRYSTALS-Kyber, one of the NIST PQC standards. It is based on a new n...
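For readers who want the primitive made concrete, below is a minimal textbook radix-2 NTT in Python. The parameters are toy values chosen for brevity (this is not Kyber's incomplete 7-layer NTT over q = 3329); the structure, a bit-reversal permutation followed by butterfly stages, is what an NTT hardware pipeline maps onto.

```python
# Toy iterative Cooley-Tukey NTT sketch; parameters are illustrative,
# not Kyber's (q = 3329, n = 256, incomplete transform).
Q = 257                          # prime with Q - 1 divisible by N
N = 8                            # transform size (power of two)
ROOT = pow(3, (Q - 1) // N, Q)   # primitive N-th root of unity mod Q

def ntt(a):
    """Decimation-in-time NTT of a length-N list, mod Q."""
    a = list(a)
    j = 0
    for i in range(1, N):        # bit-reversal permutation
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= N:           # log2(N) butterfly stages
        w_len = pow(ROOT, N // length, Q)
        for start in range(0, N, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % Q
                a[k] = (u + v) % Q
                a[k + length // 2] = (u - v) % Q
                w = w * w_len % Q
        length <<= 1
    return a
```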
Elliptic curve scalar multiplication (ECSM) is the primitive operation that is also the main computational hurdle in almost all protocols based on elliptic curve cryptography (ECC). This work proposes a novel ECSM hardware architecture by adopting several optimization strategies at circuit and system levels. On the circuit level, it is based on an...
Sparse tensor computing is a core computational part of numerous applications in areas such as data science, graph processing, and scientific computing. Sparse tensors offer the potential of skipping unnecessary computations caused by zero values. In this paper, we propose a new strategy for extending row-wise product sparse tensor accelerators. We...
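The row-wise product formulation can be made concrete with a short Gustavson-style sketch: each nonzero of a row of A scales a row of B, and zero entries are skipped entirely. The dict-of-rows format below is a stand-in for CSR, chosen for brevity; it illustrates the dataflow, not the accelerator's design.

```python
# Gustavson (row-wise product) sparse matrix multiply sketch.
# A, B: {row: {col: value}}; zero entries are simply absent,
# so the inner loops skip all zero-valued work.
def spgemm_rowwise(A, B):
    C = {}
    for i, a_row in A.items():
        acc = {}
        for k, a_ik in a_row.items():             # nonzeros in row i of A
            for j, b_kj in B.get(k, {}).items():  # scale row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```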
Deep neural networks (DNNs) have become significant workloads on both cloud servers and edge devices. Meanwhile, the growing number of DNNs on those platforms raises the need to execute multiple DNNs on the same device. This paper proposes a dynamic partitioning algorithm to perform concurrent processing of multiple DNNs on a systolic-array-based...
Elliptic Curve Cryptography (ECC) based security protocols require a much smaller key space, which makes ECC the most suitable option for resource-constrained devices compared to other public key cryptography (PKC) schemes. This paper presents a highly efficient area-delay optimized ECC crypto processor over the general prime field (F_p)...
Deep neural networks are a promising solution for applications that solve problems by learning from data sets. DNN accelerators, as domain-specific processors, address the resulting processing bottleneck. Like other hardware solutions, there must be exact compatibility between the accelerator and other software components, especially the compiler. This paper p...
Code clones can detrimentally impact software maintenance and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent artificial deep neural networks (for example BERT-based artificial neural networks) seem to be h...
FPGA-based accelerators are becoming increasingly popular for deep neural network inference due to their ability to scale performance with increasing degree of specialization with dataflow architectures or custom data type precision. In order to reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based desig...
Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes mismatching number of feature maps between layers of the network. A simple solution is to forc...
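The inter-layer dependency the abstract describes fits in a few lines of NumPy, shown below as an illustrative sketch (tensor shapes and names are assumptions, not the paper's code): deleting output channels of one layer forces deletion of the matching input channels of the next, so both layers stay dense.

```python
import numpy as np

def prune_channels(w1, w2, remove):
    """w1: (O1, I1, kH, kW), w2: (O2, O1, kH, kW) in OIHW layout.
    Dropping output channels of layer L (w1) must drop the matching
    input channels of layer L+1 (w2) to keep shapes consistent."""
    keep = [c for c in range(w1.shape[0]) if c not in set(remove)]
    return w1[keep, :, :, :], w2[:, keep, :, :]
```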
Convolutional neural networks (CNNs) have dramatically improved the accuracy of image, video and audio processing for tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources for both training and inference, primarily because the convolution layers are computationally in...
FPGA-based accelerators are becoming more popular for deep neural networks due to the ability to scale performance with increasing degree of specialization with dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (...
Convolutional neural networks (CNNs) have dramatically improved the accuracy of tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources because of computationally intensive convolution layers. Fast convolution algorithms such as Winograd convolution can greatly reduce the...
Elliptic curve cryptography (ECC) protocols have been widely accepted and deployed due to their higher security strength per bit. Finite field multiplication is the most computationally intensive operation in data security protocols developed using ECC. This paper presents two high-speed parallel reconfigurable finite field multipliers: PIMD-2 and PIMD-3...
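The interleaved multiplication technique mentioned here (and in the DFM letter above) can be sketched in software as a bit-serial shift-add-reduce loop; the hardware multipliers parallelize this recurrence. A minimal sketch, assuming operands are already reduced mod p:

```python
def interleaved_modmul(a, b, p, k):
    """a * b mod p, scanning the k bits of b from MSB to LSB.
    Each step shifts the accumulator, reduces, and conditionally adds a."""
    r = 0
    for i in reversed(range(k)):
        r = (r << 1) % p         # shift accumulator and reduce
        if (b >> i) & 1:
            r = (r + a) % p      # add partial product and reduce
    return r
```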
Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes mismatching number of feature maps between layers of the network. A simple solution is to forc...
Pruning unimportant parameters can allow deep neural networks (DNNs) to reduce their heavy computation and memory requirements. A saliency metric estimates which parameters can be safely pruned with little impact on the classification performance of the DNN. Many saliency metrics have been proposed, each within the context of a wider pruning algori...
Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide, and square root. LNS with base-2 is most common, but in this article, we show that for low-precision LNS the choice of...
Constructing SSA form for static languages such as C/C++ and Java is a well-understood task. Dynamic scripting languages, such as PHP, Python, and JavaScript, present a much greater challenge. The information required to build SSA form is not available directly from the program source and cannot be derived from a simple analysis. Instead, we find a...
Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide and square root. LNS with base-2 is most common, but in this paper we show that for low-precision LNS the choice of bas...
Popular deep neural networks (DNNs) spend the majority of their execution time computing convolutions. The Winograd family of algorithms can greatly reduce the number of arithmetic operations required and is used in many DNN software frameworks. However, the performance gain is at the expense of a reduction in floating point (FP) numerical accuracy...
The next generation of embedded Information and Communication Technology (ICT) systems are interconnected, collaborative systems able to perform autonomous tasks. The remarkable expansion of the embedded ICT market, together with the rise and breakthroughs of Artificial Intelligence (AI), have put the focus on the Edge as it stands as one of the key...
Convolutional neural network (CNN) inference is commonly performed with 8-bit integer values. However, some applications require higher-precision floating-point inference. Existing processors support 16- or 32-bit FP but do not typically support custom-precision FP. We propose hardware optimized bit-sliced floating-point operators (HOBFLOPS), a method of generat...
Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs). Pruning removes near-zero weights in tensors and masks weak connections between neurons in neighbouring layers. Quantization reduces the precision of weights by replacing them with numerically similar values that...
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory a...
Logarithmic number systems (LNS) reduce hardware complexity for multiplication and division in embedded systems, at the cost of more complicated addition and subtraction. Existing LNS typically use base-2, meaning that representable numbers are some (often fractional) power of two. We argue that other bases should be considered. The base of the LNS...
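A minimal LNS sketch, with hypothetical function names, makes the trade-off visible: a nonzero real x is stored as a sign and log_b|x|, so multiplication collapses to exponent addition, while the base b stays a free design parameter, which is exactly the choice these papers examine.

```python
import math

def to_lns(x, base):
    """Encode nonzero x as (sign, log_base|x|)."""
    return (math.copysign(1.0, x), math.log(abs(x), base))

def lns_mul(u, v):
    (su, eu), (sv, ev) = u, v
    return (su * sv, eu + ev)    # multiply = add exponents

def from_lns(u, base):
    s, e = u
    return s * base ** e
```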
The computation and memory needed for Convolutional Neural Network (CNN) inference can be reduced by pruning weights from the trained network. Pruning is guided by a pruning saliency, which heuristically approximates the change in the loss function associated with the removal of specific weights. Many pruning signals have been proposed, but the per...
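For concreteness, two widely used saliencies, weight magnitude and a first-order Taylor estimate |w * dL/dw|, are sketched below together with a simple thresholding mask. This is illustrative only (names are assumptions), and which metric best predicts the loss change is precisely the question this line of work studies.

```python
import numpy as np

def magnitude_saliency(w):
    return np.abs(w)

def taylor_saliency(w, grad):
    return np.abs(w * grad)      # first-order estimate of loss change

def prune_mask(saliency, fraction):
    """Boolean mask keeping the most salient weights; 0 <= fraction < 1
    is the share of weights to prune."""
    k = int(saliency.size * fraction)
    thresh = np.partition(saliency.ravel(), k)[k]
    return saliency >= thresh
```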
Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly effi...
Winograd convolution is widely used in deep neural networks (DNNs). Existing work for DNNs considers only the subset Winograd algorithms that are equivalent to Toom-Cook convolution. We investigate a wider range of Winograd algorithms for DNNs and show that these additional algorithms can significantly improve floating point (FP) accuracy in many c...
In this paper we investigate a method to reduce the number of computations and associated activations in Convolutional Neural Networks (CNN) by using bitmaps. The bitmaps mask the regions of the input images that fall within a rectangular window but outside the boundaries of the objects the network is being trained upon. The...
Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly effi...
Convolutional neural networks (CNNs) are widely used for classification problems. However, they often require large amounts of computation and memory which are not readily available in resource constrained systems. Pruning unimportant parameters from CNNs to reduce these requirements has been a subject of intensive research in recent years. However...
We investigated a wider range of Winograd-family convolution algorithms for deep neural networks. We present the explicit Winograd convolution algorithm for the general case (using polynomials of degree higher than one). This allows us to construct many more variants with different performance characteristics than the commonly used Winograd convolution algori...
Quantization of weights and activations in Deep Neural Networks (DNNs) is a powerful technique for network compression, and has enjoyed significant attention and success. However, much of the inference-time benefit of quantization is accessible only through the use of customized hardware accelerators or by providing an FPGA implementation of quanti...
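A minimal symmetric uniform quantization sketch of the kind the abstract refers to (one per-tensor scale, no zero-point; real schemes add per-channel scales and other refinements, and all names here are illustrative):

```python
import numpy as np

def quantize(w, num_bits=8):
    """Map float weights to signed integers with one per-tensor scale.
    int8 storage assumes num_bits <= 8."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0              # all-zero tensor: avoid divide by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```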
Modern deep neural networks (DNNs) spend a large amount of their execution time computing convolutions. Winograd's minimal algorithm for small convolutions can greatly reduce the number of arithmetic operations. However, a large reduction in floating point (FP) operations in these algorithms can result in significantly reduced FP accuracy of the re...
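A worked instance makes the accuracy concern concrete: Winograd's F(2,3) produces two outputs of a 3-tap filter with 4 elementwise multiplies instead of 6, using the standard transform matrices of the Lavin-Gray formulation. The fractional entries in G hint at where floating-point error creeps in.

```python
import numpy as np

# Standard 1-D F(2,3) transform matrices (Lavin & Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs, 4 multiplies."""
    return A_T @ ((G @ g) * (B_T @ d))
```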
Deep Neural Networks (DNNs) require very large amounts of computation, and many different algorithms have been proposed to implement their most expensive layers, each of which has a large number of variants with different trade-offs of parallelism, locality, memory footprint, and execution time. In addition, specific algorithms operate much more ef...
Deep Neural Networks (DNNs) require very large amounts of computation, and many different algorithms have been proposed to implement their most expensive layers, each of which has a large number of variants with different trade-offs of parallelism, locality, memory footprint, and execution time. In addition, specific algorithms operate much more ef...
Convolutional neural networks (CNNs) are one of the most successful machine learning techniques for image, voice and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for CNNs which typically contain large numbers of multiply-accumulate (MAC) units, the multipliers of...
Deep Neural Networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. Many different algorithms have been proposed to implement the most computationally expensive layers of DNNs. Further, each of these algorithms has a large number of variants, which offer different trade-offs of paral...
Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different wa...
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on two...
Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand th...
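The "expand into a matrix" approach is the classic im2col lowering, sketched below (stride 1, no padding; names are illustrative): image patches become columns, so the entire convolution layer is one GEMM at the cost of a larger patch matrix in memory.

```python
import numpy as np

def im2col(x, kh, kw):
    """x: (C, H, W) -> (C*kh*kw, out_h*out_w) patch matrix."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def conv_gemm(x, w):
    """w: (M, C, kh, kw) kernels -> (M, out_h, out_w) via one GEMM."""
    M, C, kh, kw = w.shape
    out = w.reshape(M, -1) @ im2col(x, kh, kw)
    return out.reshape(M, x.shape[1] - kh + 1, x.shape[2] - kw + 1)
```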
The critical path of a group of tasks is an important measure that is commonly used to guide task allocation and scheduling on parallel computers. The critical path is the longest chain of dependencies in an acyclic task dependence graph. A problem arises on heterogeneous parallel machines where computation and communication costs can vary between...
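The critical path itself is simple to state as dynamic programming over a topological order, as in the sketch below; the heterogeneous case the abstract targets is harder because the cost of each task and edge then depends on placement (the fixed costs here are a simplifying assumption).

```python
from graphlib import TopologicalSorter

def critical_path(cost, deps):
    """cost: {task: time}; deps: {task: set of predecessor tasks}.
    Returns the length of the longest dependency chain in the DAG."""
    finish = {}
    for t in TopologicalSorter(deps).static_order():
        finish[t] = cost[t] + max((finish[p] for p in deps.get(t, ())),
                                  default=0)
    return max(finish.values())
```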
Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, poolin...
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a gener...
Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. Implementations of CNNs require very large amounts of processing capacity and data, which is problematic for low power mobile and embedded systems. Several designs for hardware accelerators have been pro...
The shift towards multicore processing has led to a much wider population of developers being faced with the challenge of exploiting parallel cores to improve software performance. Debugging and optimizing parallel programs is a complex and demanding task. Tools which support development of parallel programs should provide salient information to al...
Customizing the precision of data can provide attractive trade-offs between accuracy and hardware resources. We propose a novel form of vector computing aimed at arrays of custom-precision floating point data. We represent these vectors in bitslice format. Bitwise instructions are used to implement arithmetic circuits in software that operate on cu...
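The software-circuit idea can be illustrated with a bitslice ripple-carry adder: bitwise ops on 64-bit words perform 64 independent narrow additions at once, one word per bit position. A toy sketch, not the proposed operator library:

```python
import numpy as np

def bitslice_add(a_bits, b_bits):
    """a_bits, b_bits: lists of uint64 words, least significant bit first;
    bit lane i of every word belongs to the i-th of 64 independent sums."""
    carry = np.uint64(0)
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)              # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))    # full-adder carry bit
    return out
```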
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a ge...
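One simple way to realize such a continuum, offered as an illustrative assumption rather than the paper's exact scheme, is to zero out low-order mantissa bits of float32 values so each value carries only as many significand bits as the chosen format keeps:

```python
import numpy as np

def truncate_mantissa(x, keep_bits):
    """Keep only `keep_bits` of float32's 23 mantissa bits (0..23)."""
    raw = np.asarray(x, dtype=np.float32).view(np.uint32)
    drop = 23 - keep_bits
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)   # clear low-order bits
    return (raw & mask).view(np.float32)
```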
Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous ac...
The minimal sets within a collection of sets are defined as the ones which do not have a proper subset within the collection, and the maximal sets are the ones which do not have a proper superset within the collection. Identifying extremal sets is a fundamental problem with a wide range of applications in SAT solvers, data-mining and social network...
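The definitions admit a one-line naive baseline, sketched below; the point of this line of work is to beat this quadratic scan on large collections.

```python
def minimal_sets(collection):
    """Sets with no proper subset in the collection ('<' on frozensets
    is the proper-subset test)."""
    sets = [frozenset(s) for s in collection]
    return [s for s in sets if not any(t < s for t in sets)]
```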
This paper addresses the problem of finding class-representative itemsets up to subitemset isomorphism. An efficient algorithm is of practical importance in the domain of optimal sorting networks. Although only super-exponential algorithms for solving the problem exist in the literature, the complexity classification of the problem has never been...
Modern processors can provide large amounts of processing power with vector SIMD units if the compiler or programmer can vectorize their code. With the advance of SIMD support in commodity processors, more and more advanced features are introduced, such as flexible SIMD lane-wise operations (e.g. blend instructions). However, existing vectorizing t...
We present a simulated annealing based partitioning technique for mapping task graphs onto heterogeneous processing architectures. Task partitioning onto homogeneous architectures to minimize the makespan of a task graph is a known NP-hard problem. Heterogeneity greatly complicates the aforementioned partitioning problem, thus making heuristic so...
In this paper we extend the knowledge on the problem of empirically searching for sorting networks of minimal depth. We present new search space pruning techniques for the last four levels of a candidate sorting network by considering only the output set representation of a network. We present an algorithm for checking whether an $n$-input sorting...
In this article, we partition and schedule Synchronous Dataflow (SDF) graphs onto heterogeneous execution architectures in such a way as to minimize energy consumption and maximize throughput. Partitioning and scheduling SDF graphs onto homogeneous architectures is a well-known NP-hard problem. The heterogeneity of the execution architecture makes...
A complete set of filters $F_n$ for the optimal-depth $n$-input sorting network problem is such that if there exists an $n$-input sorting network of depth $d$ then there exists one of the form $C \oplus C'$ for some $C \in F_n$. Previous work on the topic presents a method for finding complete sets of filters $R_{n, 1}$ and $R_{n, 2}$ that consists...
In recent years, a new generation of ultralow-power processors have emerged that are aimed primarily at signal processing in mobile computing. However, their architecture could make some of these useful for other applications. Algorithms originally developed for scientific computing are used increasingly in signal conditioning and emerging fields s...
In this paper we put forward an annotation system for specifying a sequence of data layout transformations for loop vectorization. We propose four basic primitives for data layout transformations that programmers can compose to achieve complex data layout transformations. Our system automatically modifies all loops and other code operating on the t...
In this article we use model checking to statically distribute and schedule Synchronous DataFlow (SDF) graphs on heterogeneous execution architectures. We show that model checking is capable of providing an optimal solution and it arrives at these solutions faster (in terms of algorithm runtime) than equivalent ILP formulations. Furthermore, we als...
In this article we use model checking to statically distribute and schedule Synchronous DataFlow (SDF) graphs on heterogeneous execution architectures. We show that model checking is capable of providing an optimal solution and it arrives at these solutions faster (in terms of algorithm runtime) than equivalent ILP formulations. Furthermore, we als...
In this article we propose a novel framework -- Heterogeneous Multiconstraint Application Partitioner (HMAP) for exploiting parallelism on heterogeneous High performance computing (HPC) architectures. Given a heterogeneous HPC cluster with varying compute units, communication constraints and topology, HMAP framework can be utilized for partitioning...
Stream applications are often limited in their performance by their underlying communication system. A typical implementation relies on the operating system to handle the majority of network operations. In such cases, the communication stack, which was not designed to handle tremendous amounts of data, acts as a bottleneck and restricts the perform...
For most multi-threaded applications, data structures must be shared between threads. Ensuring thread safety on these data structures incurs overhead in the form of locking and other synchronization mechanisms. Where data is shared among multiple threads these costs are unavoidable. However, a common access pattern is that data is accessed primaril...
We propose a new language-neutral primitive for the LLVM compiler, which provides efficient context switching and message passing between lightweight threads of control. The primitive, called Swapstack, can be used by any language implementation based on LLVM to build higher-level language structures such as continuations, coroutines, and lightweig...
Understanding the baseline underwater acoustic signature of an offshore location is a necessary, early step in formulating an environmental impact assessment of wave energy conversion devices. But in order to even begin this understanding, infrastructure must be deployed to capture raw acoustic signals for an extended period of time. This infrastru...
Although scripting languages have become very popular, even mature scripting language implementations remain interpreted. Several compilers and reimplementations have been attempted, generally focusing on performance. Based on our survey of these reimplementations, we determine that there are three important features of scripting languages that ar...
Static single assignment form (SSA) [5] is nearly ubiquitous in the compiler world. It is dearly loved by most compiler writers, and even more so by undergraduate compiler-class instructors. Its popularity comes from a number of powerful features:
• It fits neatly into a 45-minute exam question.
• It provides flow-sensitivity for free.
• It adds sp...
We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code...
Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate much of the risk. Modern processors include indirect branch predictors which use part of the target...
Virtual machines (VMs) are commonly used to execute programs written in languages such as Java, Python and Lua. VMs are typically implemented using an interpreter, a JIT compiler, or some combination of the two. A long-standing question in the design of VM interpreters is whether it is worthwhile to reorder the cases in the main interpreter loop to...
Recent Intel processors provide hardware instructions that implement a full AES round in a single instruction. Existing libraries use hand-tuned assembly language to overlap the execution of multiple AES instructions and extract maximum performance. We present a program generator that creates optimized AES code automatically from a simple, annotate...
We present an output sensitive algorithm for computing a maximum independent set of an unweighted circle graph. Our algorithm requires O(n·min{d, α}) time for an n-vertex circle graph, where d is the density of the circle graph and α is its independence number. Previous algorithms for this problem required Θ(nd) time.
Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These...
Dynamic scripting languages offer programmers increased flexibility by allowing properties of programs to be defined at run-time. Typically, program execution begins with an interpreter where type checks are implemented using conditional statements. Recent JIT compilers have begun removing run-time checks by specializing native code to program prop...
In this article, we experimentally compare a number of data structures operating over keys that are 32- and 64-bit integers. We examine traditional comparison-based search trees as well as data structures that take advantage of the fact that the keys are integers such as van Emde Boas trees and various trie-based data structures. We propose a varia...
Adaptive filters are widely used in many applications of digital signal processing. Digital communications and digital video broadcasting are just two examples. Traditionally, small embedded systems have employed the least computationally intensive adaptive filter algorithms, such as normalized least mean squares (NLMS). This article shows that F...
This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known giving better opportunities for kernel lo...
Streaming languages were originally aimed at streaming architectures, but recent work has shown the stream programming model to be useful in exploiting parallelism on general purpose processors. Current research in mapping stream code onto GPPs deals with load balancing and generating threads based on hardware features. We look into improving prob...
Dynamic scripting languages are most commonly implemented using interpreters. These interpreters are highly portable but lack the performance required for use in demanding systems. Just-in-time (JIT) compilation has been used to improve the performance of some of these dynamic scripting languages. JIT compilers typically target a single platform...
Many important scientific, engineering and financial applications can benefit from offloading computation to emerging parallel systems, such as the Cell Broadband Engine™ (Cell/B.E.). However, traditional remote procedure call (RPC) mechanisms require significant investment of time and effort to rewrite applications to use a specific RPC system. As...
Although scripting languages are becoming increasingly popular, even mature scripting language implementations remain interpreted. Several compilers and reimplementations have been attempted, generally focusing on performance. Based on our survey of these reimplementations, we determine that there are three important features of scripting language...
Scripting languages, such as PHP, are among the most widely used and fastest growing programming languages, particularly for web applications. Static analysis is an important tool for detecting security flaws, finding bugs, and improving compilation of programs. However, static analysis of scripting languages is difficult due to features found in...
Full-text is available at http://www.doc.mmu.ac.uk/STAFF/A.Nisbet/PAPERS/fpga_opt.pdf We propose a classification of high and low-level compiler optimizations to reduce the clock period, power consumption and area requirements in Field-programmable Gate Array (FPGA) architectures. The potential of each optimization, its effect on clock period, powe...
Sorting is one of the most important and well studied problems in Computer Science. Many good algorithms are known which offer various trade-offs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and...