Suhaib A Fahmy

King Abdullah University of Science and Technology | KAUST · Department of Computer Science

MEng PhD DIC

About

141
Publications
72,228
Reads
2,477
Citations
Citations since 2016
61 Research Items
1883 Citations
Introduction
Suhaib A. Fahmy is Associate Professor of Computer Science and Principal Investigator of the Accelerated Connected Computing Laboratory (ACCL) at KAUST. His research explores hardware acceleration of complex algorithms and the integration of these accelerators within wider computing infrastructure.
Additional affiliations
November 2020 - December 2020
King Abdullah University of Science and Technology
Position
  • Professor (Associate)
June 2019 - November 2020
The University of Warwick
Position
  • Lecturer
October 2015 - May 2019
The University of Warwick
Position
  • Professor (Associate)
Education
October 2003 - August 2007
Imperial College London
Field of study
  • Electrical and Electronic Engineering
October 1999 - June 2003
Imperial College London
Field of study
  • Information Systems Engineering

Publications (141)
Article
Long Short-Term Memory (LSTM) networks, and Recurrent Neural Networks (RNNs) in general, have demonstrated their suitability in many time series data applications, especially in Natural Language Processing (NLP). Computationally, LSTMs introduce dependencies on previous outputs in each layer that complicate their computation and the design of custo...
Article
Full-text available
A key challenge in building machine learning models for time series prediction is the incompleteness of the datasets. Missing data can arise for a variety of reasons, including sensor failure and network outages, resulting in datasets that can be missing significant periods of measurements. Models built using these datasets can therefore be biased....
Preprint
We present a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance comput...
Preprint
Full-text available
Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and...
Article
Full-text available
Coarse-grained FPGA overlays built around the runtime programmable DSP blocks in modern FPGAs can achieve high throughput and improved scalability compared to traditional overlays built without detailed consideration of FPGA architecture. These overlays can be mapped to using higher level compilers, achieving fast compilation, software-like program...
Conference Paper
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific applications. Existing tools consider either the worst case congestion with small abstract patterns...
Conference Paper
A Cloud Computing Environment (CCE) leverages the advantages offered by virtualisation to enable virtual machines (VMs) within the same physical machine (PM) to share physical resources. Cloud service providers (CSPs) accommodate the fluctuating resource demands of cloud users dynamically, through elastic resource provisioning. CSPs use VM allocati...
Article
Battery-powered unmanned aerial vehicles (UAVs) have been widely used as enablers of wireless networks. In this letter, the optimal battery weight for UAV-enabled wireless sensor networks is studied. The energy available for communication is maximized while accounting for propulsion energy consumption. Both numerical and approximate solutions to the opti...
Article
Full-text available
Accurate air quality monitoring requires processing of multi-dimensional, multi-location sensor data, which has previously been considered in centralised machine learning models. These are often unsuitable for resource-constrained edge devices. In this article, we address this challenge by: (1) designing a novel hybrid deep learning model for hourl...
Preprint
Full-text available
This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance comput...
Article
The increasing size of modern FPGAs allows for ever more complex applications to be mapped onto them. However, long design implementation times for large designs can severely affect design productivity. A modular design methodology can improve design productivity in a divide-and-conquer fashion but at the expense of degraded performance and power...
Article
The Internet of Things is manifested through a large number of low-capability connected devices. This means that for many applications, computation must be offloaded to more capable platforms. While this has typically been cloud datacenters accessed over the Internet, this is not feasible for latency sensitive applications. In this paper we investi...
Article
Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations that can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks, such as the DSP...
Article
Full-text available
Applications that involve analysis of data from distributed networked data sources typically involve computation performed centrally in a datacenter or cloud environment, with some minor pre-processing potentially performed at the data sources. As these applications grow in scale, this centralised approach leads to potentially impractical bandwidth...
Article
Full-text available
Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests...
Article
Full-text available
Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial archit...
Conference Paper
Full-text available
Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software-like programmability. Soft processor based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time multiplexed (TM) CGRA-like overlays...
Article
Full-text available
Air traffic has seen tremendous growth over the past decade, pushing the need for enhanced air traffic management schemes. The L-band digital aeronautical communication system (LDACS) is gaining traction as a scheme of choice, and aims to exploit the capabilities of modern digital communication techniques and computing architectures. Cognitive rad...
Article
Full-text available
Computing in vehicles has increased dramatically, with electronic control units (ECUs) communicating over increasingly complex and heterogeneous networks and presenting challenges in scalability, validation, and security. In this article, we describe the concept of smart network interfaces incorporating programmable computation at the network layer...
Technical Report
Full-text available
FPGAs are well established in the signal processing domain, where their fine-grained programmable nature allows the inherent parallelism in these applications to be exploited for enhanced performance. As architectures have evolved, FPGA vendors have added more heterogeneous resources to allow often-used functions to be implemented with higher perfo...
Article
Full-text available
Cognitive radios that are able to operate across multiple standards depending on environmental conditions and spectral requirements are becoming more important as the demand for higher bandwidth and efficient spectrum use increases. Traditional custom ASIC implementations cannot support such flexibility, with standards changing at a faster pace, w...
Article
Full-text available
Multi-context architectures like NATURE enable low-power applications to leverage fast context switching for improved energy efficiency and lower area footprint. The NATURE architecture incorporates 16-bit reconfigurable DSP blocks for accelerating arithmetic computations; however, their fixed precision prevents efficient reuse in mixed-width arith...
Conference Paper
Full-text available
Energy consumption is rapidly becoming a limiting factor in scientific computing. As a result, hardware manufacturers increasingly prioritise energy efficiency in their processor designs. Performance engineers are also beginning to explore software optimisation and hardware/software co-design as a means to reduce energy consumption. Energy efficien...
Article
Full-text available
FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, the prohi...
Article
Full-text available
Modern vehicles employ a large amount of distributed computation and require the underlying communication scheme to provide high bandwidth and low latency. Existing communication protocols like Controller Area Network (CAN) and FlexRay do not provide the required bandwidth, paving the way for adoption of Ethernet as the next generation network back...
Article
Full-text available
With the increasing amount of interconnections between vehicles, the attack surface of internal vehicle networks is rising steeply. Although these networks are shielded against external attacks, they often do not have any internal security to protect against malicious components or adversaries who can breach the network perimeter. To secure the in-...
Article
Full-text available
FPGAs offer high performance coupled with energy efficiency, making them extremely attractive computational resources within a cloud ecosystem. However, to achieve this integration and make them easy to program, we first need to enable users with varying expertise to easily develop cloud applications that leverage FPGAs. With the growing size of FP...
Article
Full-text available
For complex datapaths, resource sharing can help reduce area consumption. Traditionally, resource sharing is applied when the same resource can be scheduled for different uses in different cycles, often resulting in a longer schedule. Multi-pumping is a method whereby a resource is clocked at a frequency that is a multiple of the surrounding circui...
Conference Paper
Full-text available
We present an approach for on-demand acceleration of data center workloads using high performance architecture-centric coarse-grained FPGA overlays. The proposed approach allows on-the-fly generation of accelerators on a server node and dynamic reuse of FPGA resources for multiple workloads.
Poster
Full-text available
We present an approach for on-demand acceleration of data center workloads using high performance architecture-centric coarse-grained FPGA overlays. The proposed approach allows on-the-fly generation of accelerators on a server node and dynamic reuse of FPGA resources for multiple workloads.
Conference Paper
Full-text available
L-band Digital Aeronautical Communication System (LDACS) is an emerging standard that aims at enhancing air traffic management by transitioning the traditional analog aeronautical communication systems to the superior and highly efficient digital domain. The standard places stringent requirements on the communication channels to allow them to coexi...
Article
Full-text available
With the increasing interconnection of vehicles, security challenges have moved into focus. Attacks on in-vehicle networks can cause accidents resulting in financial damages and even loss of life. The impact of an attack can be mitigated by secure internal vehicle networks, employing authentication of ECUs and authorization of messages. However, qu...
Conference Paper
Combining processors with hardware accelerators has become a norm with systems-on-chip (SoCs) ever present in modern compute devices. Heterogeneous programmable system-on-chip platforms, sometimes referred to as hybrid FPGAs, tightly couple general purpose processors with high performance reconfigurable fabrics, providing a more flexible alternative...
Article
Full-text available
Coarse grained overlay architectures improve FPGA design productivity by providing fast compilation and software-like programmability. Throughput oriented spatially configurable overlays typically suffer from area overheads due to the requirement of one functional unit for each compute kernel operation. Hence, these overlays have often been of limi...
Conference Paper
Coarse-grained FPGA overlay architectures paired with general purpose processors offer a number of advantages for general purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity. However, the area overheads of these overlays, and in particular architectures...
Conference Paper
Design productivity is a major concern preventing the mainstream adoption of FPGAs. Overlay architectures have emerged as one possible solution to this challenge, offering fast compilation and software-like programmability. However, overlays typically suffer from area and performance overheads due to limited consideration for the underlying FPGA ar...
Conference Paper
Full-text available
Modern vehicles are complex distributed systems with critical real-time electronic controls that have progressively replaced their mechanical/hydraulic counterparts, for performance and cost benefits. The harsh and varying vehicular environment can induce multiple errors in the computational/communication path, with temporary or permanent effects,...
Conference Paper
Improved quality of results from high level synthesis (HLS) tools has led to their increased adoption. Despite the automated translation from high level descriptions to register-transfer level (RTL) implementations, functional verification remains a major challenge. Verification can take significantly more time than the design process; if there is...
Article
Variable digital filters (VDFs) are used in software defined radio handsets for extraction of individual radio channels corresponding to multiple wireless communication standards. In this paper, we propose a VDF based on the improved coefficient decimation method (ICDM). The proposed VDF provides variable lowpass, highpass, bandpass, bandstop and m...
Conference Paper
Full-text available
Hardware accelerators implement custom architectures to significantly speed up computations in a wide range of domains. As performance scaling in server-class CPUs slows, we propose the integration of hardware accelerators in the cloud as a way to maintain a positive performance trend. Field programmable gate arrays (FPGAs) represent the ideal way...