David Gregg
Trinity College Dublin | TCD · School of Computer Science and Statistics

About

170 Publications · 19,756 Reads
2,064 Citations

Publications (170)
Article
Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such cl...
Article
This letter presents a novel and efficient hardware architecture to accelerate the computation of the point multiplication (PM) primitive over arbitrary Montgomery curves. It is based on a novel double field multiplier (DFM) that computes two field multiplications simultaneously. The DFM uses the interleaved multiplication technique, and it shorten...
Article
The Number Theoretic Transform (NTT) is a central primitive to compute polynomial multiplication in a finite ring for both post-quantum cryptography (PQC) and fully homomorphic encryption (FHE) schemes. This brief presents a novel, efficient NTT hardware architecture suitable for CRYSTALS-Kyber, one of the NIST PQC standards. It is based on a new n...
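As a rough, self-contained illustration of the NTT primitive mentioned above (not of the paper's Kyber-specific architecture, which works in a negacyclic ring), the sketch below multiplies two polynomials in Z_q[x]/(x^n − 1) with a naive O(n²) transform; the modulus q = 17, length n = 4, and root of unity w = 4 are toy values chosen only to keep the example short.

```python
# Naive O(n^2) NTT-based polynomial multiplication in Z_q[x]/(x^n - 1).
# Toy parameters (assumptions for illustration only): q = 17, n = 4, and
# w = 4, a primitive 4th root of unity mod 17 (4^2 = 16, 4^4 = 1).
Q, N, W = 17, 4, 4

def ntt(a, w=W, q=Q):
    """Forward transform: evaluate polynomial a at powers of w."""
    return [sum(a[j] * pow(w, i * j, q) for j in range(N)) % q for i in range(N)]

def intt(A, w=W, q=Q):
    """Inverse transform: use w^-1 and scale by n^-1 mod q."""
    n_inv = pow(N, -1, q)
    a = ntt(A, pow(w, -1, q), q)
    return [(x * n_inv) % q for x in a]

def polymul_mod_xn_minus_1(a, b):
    """Multiply a and b in Z_q[x]/(x^n - 1) via pointwise products."""
    A, B = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(A, B)])

if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [5, 6, 7, 8]
    print(polymul_mod_xn_minus_1(a, b))  # cyclic convolution of a and b, mod 17
```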
Article
Full-text available
Elliptic curve scalar multiplication (ECSM) is the primitive operation that is also the main computational hurdle in almost all protocols based on elliptic curve cryptography (ECC). This work proposes a novel ECSM hardware architecture by adopting several optimization strategies at circuit and system levels. On the circuit level, it is based on an...
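For background on the primitive the paper accelerates, here is a minimal software sketch of scalar multiplication by double-and-add in affine coordinates; the toy curve y² = x³ + 2x + 3 over F₉₇ and the base point are illustrative assumptions, and hardware designs such as the one above use very different coordinate systems and schedules.

```python
# Minimal sketch of elliptic curve scalar multiplication (double-and-add) in
# affine coordinates over a toy prime field. The curve y^2 = x^3 + 2x + 3 mod 97
# and base point (3, 6) are illustrative assumptions, not parameters from the paper.
P_MOD, A, B = 97, 2, 3
INF = None  # point at infinity

def point_add(P, Q):
    if P is INF: return Q
    if Q is INF: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                      # P + (-P) = point at infinity
    if P == Q:                          # doubling
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:                               # general addition
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mul(k, P):
    """Compute k*P by scanning the bits of k (double-and-add)."""
    R = INF
    for bit in bin(k)[2:]:
        R = point_add(R, R)             # double
        if bit == "1":
            R = point_add(R, P)         # add
    return R

if __name__ == "__main__":
    G = (3, 6)                          # on the curve: 6^2 = 36 = 27 + 6 + 3 mod 97
    print(scalar_mul(13, G))
```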
Preprint
Sparse tensor computing is a core computational part of numerous applications in areas such as data science, graph processing, and scientific computing. Sparse tensors offer the potential of skipping unnecessary computations caused by zero values. In this paper, we propose a new strategy for extending row-wise product sparse tensor accelerators. We...
Preprint
Deep neural networks (DNN) have become significant applications in both cloud-server and edge devices. Meanwhile, the growing number of DNNs on those platforms raises the need to execute multiple DNNs on the same device. This paper proposes a dynamic partitioning algorithm to perform concurrent processing of multiple DNNs on a systolic-array-based...
Article
Full-text available
Elliptic Curve Cryptography (ECC) based security protocols require a much smaller key space, which makes ECC the most suitable option for resource-constrained devices compared to other public key cryptography (PKC) schemes. This paper presents a highly efficient area-delay optimized ECC crypto processor over the general prime field (F...
Preprint
Deep neural networks are a promising solution for applications that solve problems based on learning data sets. DNN accelerators solve the processing bottleneck as a domain-specific processor. Like other hardware solutions, there must be exact compatibility between the accelerator and other software components, especially the compiler. This paper p...
Conference Paper
Code clones can detrimentally impact software maintenance and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent artificial deep neural networks (for example BERT-based artificial neural networks) seem to be h...
Article
Full-text available
FPGA-based accelerators are becoming increasingly popular for deep neural network inference due to their ability to scale performance with increasing degree of specialization with dataflow architectures or custom data type precision. In order to reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based desig...
Preprint
Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes mismatching number of feature maps between layers of the network. A simple solution is to forc...
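A minimal numpy sketch of the feature-map mismatch described above (not the paper's solution): removing output channels from one convolution layer forces the matching input-channel slices of the next layer to be removed as well.

```python
import numpy as np

# Minimal sketch (not the paper's method): pruning output channel c of layer i
# leaves layer i+1 expecting a feature map that no longer exists, so the matching
# input-channel slice of layer i+1 must be removed to keep shapes consistent.
C_OUT1, C_IN1, C_OUT2, K = 8, 3, 16, 3
w1 = np.random.randn(C_OUT1, C_IN1, K, K)   # layer 1 weights: (out, in, kH, kW)
w2 = np.random.randn(C_OUT2, C_OUT1, K, K)  # layer 2 consumes layer 1's outputs

pruned = [2, 5]                              # channels chosen for removal (illustrative)
keep = [c for c in range(C_OUT1) if c not in pruned]

w1_pruned = w1[keep]                         # drop output-channel slices of layer 1
w2_pruned = w2[:, keep]                      # drop the matching input slices of layer 2

print(w1_pruned.shape, w2_pruned.shape)      # (6, 3, 3, 3) (16, 6, 3, 3)
```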
Article
Full-text available
Convolutional neural networks (CNNs) have dramatically improved the accuracy of image, video and audio processing for tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources for both training and inference, primarily because the convolution layers are computationally in...
Preprint
Full-text available
FPGA-based accelerators are becoming more popular for deep neural network due to the ability to scale performance with increasing degree of specialization with dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (...
Preprint
Full-text available
Convolutional neural networks (CNNs) have dramatically improved the accuracy of tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources because of computationally intensive convolution layers. Fast convolution algorithms such as Winograd convolution can greatly reduce the...
Article
Elliptic curve cryptography (ECC) protocols, due to their higher security strength per bit, have been widely accepted and deployed. Finite field multiplication is the most computationally intensive operation in data security protocols developed using ECC. This paper presents two high-speed parallel reconfigurable finite field multipliers: PIMD-2 and PIMD-3...
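As background for the field-multiplication bottleneck mentioned above, here is a sketch of the textbook bit-serial interleaved modular multiplication; the PIMD-2/PIMD-3 designs themselves are parallel hardware multipliers and are not reproduced here.

```python
# Textbook bit-serial interleaved modular multiplication over F_p: scan the
# multiplier MSB-first and reduce after every shift-and-add, so intermediate
# values stay small. This shows the general idea only, not the paper's designs.
def interleaved_modmul(a, b, p):
    acc = 0
    for i in reversed(range(p.bit_length())):
        acc <<= 1                   # shift: multiply partial result by 2
        if (a >> i) & 1:
            acc += b                # conditionally add the multiplicand
        if acc >= p: acc -= p       # at most two conditional subtractions
        if acc >= p: acc -= p       # keep the accumulator fully reduced
    return acc

if __name__ == "__main__":
    p = (1 << 13) - 1               # small Mersenne prime, illustrative only
    a, b = 1234, 5678
    assert interleaved_modmul(a, b, p) == (a * b) % p
    print(interleaved_modmul(a, b, p))
```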
Chapter
Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes mismatching number of feature maps between layers of the network. A simple solution is to forc...
Article
Full-text available
Pruning unimportant parameters can allow deep neural networks (DNNs) to reduce their heavy computation and memory requirements. A saliency metric estimates which parameters can be safely pruned with little impact on the classification performance of the DNN. Many saliency metrics have been proposed, each within the context of a wider pruning algori...
Article
Full-text available
Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide, and square root. LNS with base-2 is most common, but in this article, we show that for low-precision LNS the choice of...
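A minimal sketch of the LNS idea, assuming an illustrative 8-bit fractional fixed-point exponent: values are stored as scaled logarithms, multiplication becomes integer addition, and the choice of base changes how quantisation error is distributed.

```python
import math

# Minimal LNS sketch (illustrative parameters, not the paper's format): a positive
# real x is stored as the fixed-point value round(log_b(x) * 2^FRAC); multiply and
# divide then reduce to integer add/subtract on these stored exponents.
FRAC = 8                       # fractional bits of the fixed-point exponent

def lns_encode(x, base):
    return round(math.log(x, base) * (1 << FRAC))

def lns_decode(e, base):
    return base ** (e / (1 << FRAC))

def lns_mul(e1, e2):
    return e1 + e2             # multiplication is exponent addition

if __name__ == "__main__":
    for base in (2.0, 1.9):    # the choice of base changes the quantisation error
        a, b = lns_encode(3.7, base), lns_encode(0.42, base)
        approx = lns_decode(lns_mul(a, b), base)
        print(f"base={base}: {approx:.5f} vs exact {3.7 * 0.42:.5f}")
```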
Chapter
Constructing SSA form for static languages such as C/C++ and Java is a well-understood task. Dynamic scripting languages, such as PHP, Python, and JavaScript, present a much greater challenge. The information required to build SSA form is not available directly from the program source and cannot be derived from a simple analysis. Instead, we find a...
Preprint
Full-text available
Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide and square root. LNS with base-2 is most common, but in this paper we show that for low-precision LNS the choice of bas...
Article
Popular deep neural networks (DNNs) spend the majority of their execution time computing convolutions. The Winograd family of algorithms can greatly reduce the number of arithmetic operations required and is used in many DNN software frameworks. However, the performance gain is at the expense of a reduction in floating point (FP) numerical accuracy...
Article
Next generation of embedded Information and Communication Technology (ICT) systems are interconnected and collaborative systems able to perform autonomous tasks. The remarkable expansion of the embedded ICT market, together with the rise and breakthroughs of Artificial Intelligence (AI), have put the focus on the Edge as it stands as one of the key...
Preprint
Full-text available
Convolutional neural network (CNN) inference is commonly performed with 8-bit integer values. However, higher-precision floating-point inference is required. Existing processors support 16- or 32-bit FP but do not typically support custom-precision FP. We propose hardware optimized bit-sliced floating-point operators (HOBFLOPS), a method of generat...
Preprint
Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs). Pruning removes near-zero weights in tensors and masks weak connections between neurons in neighbouring layers. Quantization reduces the precision of weights by replacing them with numerically similar values that...
Preprint
Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory a...
Conference Paper
Full-text available
Logarithmic number systems (LNS) reduce hardware complexity for multiplication and division in embedded systems, at the cost of more complicated addition and subtraction. Existing LNS typically use base-2, meaning that representable numbers are some (often fractional) power of two. We argue that other bases should be considered. The base of the LNS...
Preprint
The computation and memory needed for Convolutional Neural Network (CNN) inference can be reduced by pruning weights from the trained network. Pruning is guided by a pruning saliency, which heuristically approximates the change in the loss function associated with the removal of specific weights. Many pruning signals have been proposed, but the per...
Preprint
Full-text available
Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly effi...
Chapter
Winograd convolution is widely used in deep neural networks (DNNs). Existing work for DNNs considers only the subset Winograd algorithms that are equivalent to Toom-Cook convolution. We investigate a wider range of Winograd algorithms for DNNs and show that these additional algorithms can significantly improve floating point (FP) accuracy in many c...
Conference Paper
In this paper we investigate a method to reduce the number of computations and associated activations in Convolutional Neural Networks (CNN) by using bitmaps. The bitmaps are used to mask the input images to the network that fall within a rectangular window but do not fall within the boundaries of the objects the network is being trained upon. The...
Conference Paper
Full-text available
Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly effi...
Preprint
Convolutional neural networks (CNNs) are widely used for classification problems. However, they often require large amounts of computation and memory which are not readily available in resource constrained systems. Pruning unimportant parameters from CNNs to reduce these requirements has been a subject of intensive research in recent years. However...
Preprint
Full-text available
We investigated a wider range of Winograd family convolution algorithms for Deep Neural Networks. We presented the explicit Winograd convolution algorithm in the general case (using polynomials of degree higher than one). This allows us to construct many more variants, differing in performance, than the commonly used Winograd convolution algori...
Preprint
Quantization of weights and activations in Deep Neural Networks (DNNs) is a powerful technique for network compression, and has enjoyed significant attention and success. However, much of the inference-time benefit of quantization is accessible only through the use of customized hardware accelerators or by providing an FPGA implementation of quanti...
Article
Full-text available
Modern deep neural networks (DNNs) spend a large amount of their execution time computing convolutions. Winograd's minimal algorithm for small convolutions can greatly reduce the number of arithmetic operations. However, a large reduction in floating point (FP) operations in these algorithms can result in significantly reduced FP accuracy of the re...
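For reference, the sketch below shows the standard 1D Winograd minimal filtering algorithm F(2,3), which computes two outputs of a 3-tap filter with four multiplications instead of six; it is a generic textbook instance of the algorithm family discussed above, not the paper's specific analysis, but it shows where the extra additions and transform constants that affect FP accuracy come from.

```python
# Standard 1D Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
# with 4 multiplications instead of 6. A generic illustration of the algorithm
# family discussed above, not the specific variants analysed in the paper.
def winograd_f23(d, g):
    d0, d1, d2, d3 = d                      # four input samples
    g0, g1, g2 = g                          # three filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2     # filter transform uses divisions by 2,
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2     # one source of floating-point error
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
    print(winograd_f23(d, g), direct_conv(d, g))   # the two results agree
```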
Conference Paper
Deep Neural Networks (DNNs) require very large amounts of computation, and many different algorithms have been proposed to implement their most expensive layers, each of which has a large number of variants with different trade-offs of parallelism, locality, memory footprint, and execution time. In addition, specific algorithms operate much more ef...
Article
Convolutional neural networks (CNNs) are one of the most successful machine learning techniques for image, voice and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for CNNs which typically contain large numbers of multiply-accumulate (MAC) units, the multipliers of...
Article
Deep Neural Networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. Many different algorithms have been proposed to implement the most computationally expensive layers of DNNs. Further, each of these algorithms has a large number of variants, which offer different trade-offs of paral...
Article
Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different wa...
Article
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on two...
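A rough sketch of the basic idea, assuming simple truncation of float32 mantissa bits (the paper's actual scheme and its vectorised implementation are not reproduced): zeroing low-order mantissa bits yields a continuum of precisions between IEEE-754 types while keeping values decodable as ordinary floats.

```python
import numpy as np

# Rough sketch of reduced-precision floats between IEEE-754 types: keep the
# float32 sign/exponent but zero the low mantissa bits, so each value needs
# fewer stored bits. Truncates rather than rounds; illustrative only.
def truncate_mantissa(x, kept_mantissa_bits):
    drop = 23 - kept_mantissa_bits                  # float32 has 23 mantissa bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & mask).view(np.float32)

if __name__ == "__main__":
    x = np.array([3.14159265, 2.71828183, 1e-3], dtype=np.float32)
    for kept in (23, 16, 10, 7):                    # a continuum of precisions
        print(kept, truncate_mantissa(x, kept))
```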
Article
Full-text available
Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand th...
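The expansion referred to above is commonly implemented as im2col followed by a single matrix multiplication; the sketch below is that common baseline, not the lower-memory alternatives the paper proposes.

```python
import numpy as np

# Common im2col lowering of a convolution layer to GEMM (illustrative sketch):
# each input patch becomes one column, so the whole layer is a single
# kernel-matrix x patch-matrix product.
def conv2d_im2col(x, w):
    c_in, h, wd = x.shape
    c_out, _, k, _ = w.shape
    oh, ow = h - k + 1, wd - k + 1
    cols = np.empty((c_in * k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + k, j:j + k].ravel()   # one patch per column
    out = w.reshape(c_out, -1) @ cols                              # the GEMM call
    return out.reshape(c_out, oh, ow)

if __name__ == "__main__":
    x = np.random.randn(3, 8, 8)          # (channels, height, width)
    w = np.random.randn(4, 3, 3, 3)       # (out_ch, in_ch, kH, kW)
    print(conv2d_im2col(x, w).shape)      # (4, 6, 6)
```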
Article
Full-text available
The critical path of a group of tasks is an important measure that is commonly used to guide task allocation and scheduling on parallel computers. The critical path is the longest chain of dependencies in an acyclic task dependence graph. A problem arises on heterogeneous parallel machines where computation and communication costs can vary between...
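A minimal sketch of the critical-path computation itself, with uniform, illustrative task costs (the heterogeneity problem the article addresses is not modelled): the longest chain is found by dynamic programming over a topological order.

```python
from graphlib import TopologicalSorter

# Critical path of a task DAG: the longest chain of dependencies, computed by
# dynamic programming over a topological order. Costs are illustrative; this
# homogeneous-cost sketch ignores the per-processor variation the paper tackles.
def critical_path(cost, deps):
    order = TopologicalSorter(deps).static_order()
    finish = {}
    for task in order:
        finish[task] = cost[task] + max((finish[d] for d in deps.get(task, ())), default=0)
    return max(finish.values())

if __name__ == "__main__":
    cost = {"a": 3, "b": 2, "c": 4, "d": 1}
    deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}   # d waits for b and c
    print(critical_path(cost, deps))                    # a -> c -> d = 8
```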
Article
Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, poolin...
Conference Paper
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a gener...
Article
Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. Implementations of CNNs require very large amounts of processing capacity and data, which is problematic for low power mobile and embedded systems. Several designs for hardware accelerators have been pro...
Article
The shift towards multicore processing has led to a much wider population of developers being faced with the challenge of exploiting parallel cores to improve software performance. Debugging and optimizing parallel programs is a complex and demanding task. Tools which support development of parallel programs should provide salient information to al...
Article
Customizing the precision of data can provide attractive trade-offs between accuracy and hardware resources. We propose a novel form of vector computing aimed at arrays of custom-precision floating point data. We represent these vectors in bitslice format. Bitwise instructions are used to implement arithmetic circuits in software that operate on cu...
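A small sketch of the bitslice idea, assuming 8-bit unsigned lanes for illustration: bit j of every lane is packed into one machine word, and a software ripple-carry adder built from bitwise operations then adds all lanes at once.

```python
# Bitslice sketch: bit j of every lane is packed into one integer ("bit-plane"),
# so a software ripple-carry adder built from AND/XOR/OR adds all lanes at once.
# A lane width of 8 unsigned bits is an illustrative choice, not the paper's format.
WIDTH = 8

def to_slices(values):
    return [sum(((v >> j) & 1) << i for i, v in enumerate(values)) for j in range(WIDTH)]

def from_slices(planes, lanes):
    return [sum(((planes[j] >> i) & 1) << j for j in range(WIDTH)) for i in range(lanes)]

def bitsliced_add(a, b):
    out, carry = [], 0
    for j in range(WIDTH):                      # ripple-carry, one bit-plane per step
        s = a[j] ^ b[j] ^ carry
        carry = (a[j] & b[j]) | (carry & (a[j] ^ b[j]))
        out.append(s)
    return out                                  # carry out of the top bit is discarded

if __name__ == "__main__":
    xs, ys = [10, 200, 37, 255], [5, 60, 100, 1]
    planes = bitsliced_add(to_slices(xs), to_slices(ys))
    print(from_slices(planes, len(xs)))         # [15, 4, 137, 0]  (sums mod 256)
```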
Article
We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a ge...
Article
Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous ac...
Article
The minimal sets within a collection of sets are defined as the ones which do not have a proper subset within the collection, and the maximal sets are the ones which do not have a proper superset within the collection. Identifying extremal sets is a fundamental problem with a wide range of applications in SAT solvers, data-mining and social network...
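For reference, a quadratic sketch that simply applies the definitions above by pairwise subset tests; the paper's algorithm is far more efficient and is not reproduced here.

```python
# Quadratic reference sketch for extremal sets: a set is minimal if no proper
# subset of it is in the collection, maximal if no proper superset is.
# This only pins down the definitions; it is not the paper's algorithm.
def extremal_sets(collection):
    sets = [frozenset(s) for s in collection]
    minimal = [s for s in sets if not any(t < s for t in sets)]
    maximal = [s for s in sets if not any(t > s for t in sets)]
    return minimal, maximal

if __name__ == "__main__":
    family = [{1, 2}, {1, 2, 3}, {2, 3}, {4}, {2, 3, 5}]
    mins, maxs = extremal_sets(family)
    print([set(s) for s in mins])   # {1, 2}, {2, 3}, {4}
    print([set(s) for s in maxs])   # {1, 2, 3}, {4}, {2, 3, 5}
```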
Article
This paper addresses the problem of finding a class of representative itemsets up to subitemset isomorphism. An efficient algorithm is of practical importance in the domain of optimal sorting networks. Although only super-exponential algorithms for solving the problem exist in the literature, the complexity classification of the problem has never been...
Conference Paper
Modern processors can provide large amounts of processing power with vector SIMD units if the compiler or programmer can vectorize their code. With the advance of SIMD support in commodity processors, more and more advanced features are introduced, such as flexible SIMD lane-wise operations (e.g. blend instructions). However, existing vectorizing t...
Article
We present a simulated annealing based partitioning technique for mapping task graphs onto heterogeneous processing architectures. Task partitioning onto homogeneous architectures to minimize the makespan of a task graph is a known NP-hard problem. Heterogeneity greatly complicates the aforementioned partitioning problem, thus making heuristic so...
Article
In this paper we extend the knowledge on the problem of empirically searching for sorting networks of minimal depth. We present new search space pruning techniques for the last four levels of a candidate sorting network by considering only the output set representation of a network. We present an algorithm for checking whether an $n$-input sorting...
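As background for the search problem above, the sketch below checks whether a comparator network sorts by applying the zero-one principle (a network sorts every input iff it sorts all 2^n binary inputs); the paper's output-set pruning techniques are not shown.

```python
from itertools import product

# Zero-one principle sketch: a comparator network on n wires sorts every input
# iff it sorts all 2^n binary inputs. Background only; none of the paper's
# search-space pruning techniques are reproduced here.
def apply_network(network, values):
    v = list(values)
    for i, j in network:                 # each comparator orders wires i < j
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def sorts_all_inputs(network, n):
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

if __name__ == "__main__":
    # A 4-input sorting network of depth 3 (levels separated by comments).
    net = [(0, 1), (2, 3),   # level 1
           (0, 2), (1, 3),   # level 2
           (1, 2)]           # level 3
    print(sorts_all_inputs(net, 4))   # True
```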
Article
In this article, we partition and schedule Synchronous Dataflow (SDF) graphs onto heterogeneous execution architectures in such a way as to minimize energy consumption and maximize throughput. Partitioning and scheduling SDF graphs onto homogeneous architectures is a well-known NP-hard problem. The heterogeneity of the execution architecture makes...
Article
A complete set of filters $F_n$ for the optimal-depth $n$-input sorting network problem is such that if there exists an $n$-input sorting network of depth $d$ then there exists one of the form $C \oplus C'$ for some $C \in F_n$. Previous work on the topic presents a method for finding complete sets of filters $R_{n, 1}$ and $R_{n, 2}$ that consist...
Article
Full-text available
In recent years, a new generation of ultralow-power processors has emerged that is aimed primarily at signal processing in mobile computing. However, their architecture could make some of these useful for other applications. Algorithms originally developed for scientific computing are used increasingly in signal conditioning and emerging fields s...
Conference Paper
In this paper we put forward an annotation system for specifying a sequence of data layout transformations for loop vectorization. We propose four basic primitives for data layout transformations that programmers can compose to achieve complex data layout transformations. Our system automatically modifies all loops and other code operating on the t...
Article
In this article we use model checking to statically distribute and schedule Synchronous DataFlow (SDF) graphs on heterogeneous execution architectures. We show that model checking is capable of providing an optimal solution and it arrives at these solutions faster (in terms of algorithm runtime) than equivalent ILP formulations. Furthermore, we als...
Conference Paper
In this article we propose a novel framework -- the Heterogeneous Multiconstraint Application Partitioner (HMAP) -- for exploiting parallelism on heterogeneous high-performance computing (HPC) architectures. Given a heterogeneous HPC cluster with varying compute units, communication constraints and topology, the HMAP framework can be utilized for partitioning...
Conference Paper
Stream applications are often limited in their performance by their underlying communication system. A typical implementation relies on the operating system to handle the majority of network operations. In such cases, the communication stack, which was not designed to handle tremendous amounts of data, acts as a bottleneck and restricts the perform...
Article
For most multi-threaded applications, data structures must be shared between threads. Ensuring thread safety on these data structures incurs overhead in the form of locking and other synchronization mechanisms. Where data is shared among multiple threads these costs are unavoidable. However, a common access pattern is that data is accessed primaril...
Article
We propose a new language-neutral primitive for the LLVM compiler, which provides efficient context switching and message passing between lightweight threads of control. The primitive, called Swapstack, can be used by any language implementation based on LLVM to build higher-level language structures such as continuations, coroutines, and lightweig...
Conference Paper
Understanding the baseline underwater acoustic signature of an offshore location is a necessary, early step in formulating an environmental impact assessment of wave energy conversion devices. But in order to even begin this understanding, infrastructure must be deployed to capture raw acoustic signals for an extended period of time. This infrastru...
Article
Although scripting languages have become very popular, even mature scripting language implementations remain interpreted. Several compilers and reimplementations have been attempted, generally focusing on performance. Based on our survey of these reimplementations, we determine that there are three important features of scripting languages that ar...
Article
Static single assignment form (SSA) [5] is nearly ubiquitous in the compiler world. It is dearly loved by most compiler writers, and even more so by undergraduate compiler-class instructors. Its popularity comes from a number of powerful features: • It fits neatly into a 45 minute exam question. • It provides flow-sensitivity for free. • It adds sp...
Article
Full-text available
We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code...
Article
Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate much of the risk. Modern processors include indirect branch predictors which use part of the target...
Article
Virtual machines (VMs) are commonly used to execute programs written in languages such as Java, Python and Lua. VMs are typically implemented using an interpreter, a JIT compiler, or some combination of the two. A long-standing question in the design of VM interpreters is whether it is worthwhile to reorder the cases in the main interpreter loop to...
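To make the object of study concrete, here is a tiny stack-based bytecode interpreter whose if/elif chain plays the role of "the cases in the main interpreter loop"; the opcode set and program are invented purely for illustration.

```python
# Tiny stack-based bytecode interpreter: the dispatch loop below is the kind of
# "main interpreter loop" whose case ordering the article studies. The opcode
# set and example program are invented for illustration only.
PUSH, ADD, MUL, PRINT, HALT = range(5)

def run(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:                            # cases believed most frequent are
            stack.append(code[pc]); pc += 1       # often placed first in the chain
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == PRINT:
            print(stack[-1])
        elif op == HALT:
            return

if __name__ == "__main__":
    # Computes (2 + 3) * 4 and prints 20.
    run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT])
```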
Conference Paper
Recent Intel processors provide hardware instructions that implement a full AES round in a single instruction. Existing libraries use hand-tuned assembly language to overlap the execution of multiple AES instructions and extract maximum performance. We present a program generator that creates optimized AES code automatically from a simple, annotate...
Article
We present an output sensitive algorithm for computing a maximum independent set of an unweighted circle graph. Our algorithm requires O(n min{d, α}) time for an n-vertex circle graph, where d is the density of the circle graph and α is its independence number. Previous algorithms for this problem required Θ(nd) time.
Conference Paper
Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These...
Conference Paper
Dynamic scripting languages offer programmers increased flexibility by allowing properties of programs to be defined at run-time. Typically, program execution begins with an interpreter where type checks are implemented using conditional statements. Recent JIT compilers have begun removing run-time checks by specializing native code to program prop...
Article
In this article, we experimentally compare a number of data structures operating over keys that are 32- and 64-bit integers. We examine traditional comparison-based search trees as well as data structures that take advantage of the fact that the keys are integers such as van Emde Boas trees and various trie-based data structures. We propose a varia...
Article
Full-text available
Adaptive filters are widely used in many applications of digital signal processing. Digital communications and digital video broadcasting are just two examples. Traditionally, small embedded systems have employed the least computationally intensive filter adaptive algorithms, such as normalized least mean squares (NLMS). This article shows that F...
Conference Paper
This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known giving better opportunities for kernel lo...
Conference Paper
Streaming languages were originally aimed at streaming architectures, but recent work has shown the stream programming model to be useful in exploiting parallelism on general purpose processors. Current research in mapping stream code onto GPPs deals with load balancing and generating threads based on hardware features. We look into improving prob...
Conference Paper
Dynamic scripting languages are most commonly implemented using interpreters. These interpreters are highly portable but lack the performance required for use in demanding systems. Just-in-time (JIT) compilation has been used to improve the performance of some of these dynamic scripting languages. JIT compilers typically target a single plat- form...
Conference Paper
Many important scientific, engineering and financial applications can benefit from offloading computation to emerging parallel systems, such as the Cell Broadband Engine™ (Cell/B.E.). However, traditional remote procedure call (RPC) mechanisms require significant investment of time and effort to rewrite applications to use a specific RPC system. As...
Article
Although scripting languages are becoming increasingly popular, even mature scripting language implementations remain interpreted. Several compilers and reimplementations have been attempted, generally focusing on performance. Based on our survey of these reimplementations, we determine that there are three important features of scripting language...
Article
Scripting languages, such as PHP, are among the most widely used and fastest growing programming languages, particularly for web applications. Static analysis is an important tool for detecting security flaws, finding bugs, and improving compilation of programs. However, static analysis of scripting languages is difficult due to features found in...
Article
Full-text available
Full-text is available at http://www.doc.mmu.ac.uk/STAFF/A.Nisbet/PAPERS/fpga_opt.pdf We propose a classification of high and low-level compiler optimizations to reduce the clock period, power consumption and area requirements in Field-programmable Gate Array (FPGA) architectures. The potential of each optimization, its effect on clock period, powe...
Article
Sorting is one of the most important and well studied problems in Computer Science. Many good algorithms are known which offer various trade-offs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and...
