Mikel Luján

The University of Manchester · Department of Computer Science

About

229 Publications
42,436 Reads
3,984 Citations

Publications (229)
Conference Paper
The ARMv8 architecture introduced AArch64, a 64-bit execution mode with a new instruction set, while retaining binary compatibility with previous versions of the ARM architecture through AArch32, a 32-bit execution mode. Most hardware implementations of ARMv8 processors support both AArch32 and AArch64, which comes at a cost in hardware complexity....
Article
In this article, we present FastPath_MP, a novel low-overhead and energy-efficient storage multi-path architecture that leverages FPGAs to operate transparently to the main processor and improve the performance and energy efficiency of accessing storage devices. We prototyped FastPath_MP on both Arm-FPGA Zynq 7000 SoC and Zynq UltraScale+ MPSoC and...
Conference Paper
Feature selection is the data analysis process that selects a smaller and curated subset of the original dataset by filtering out data (features) which are irrelevant or redundant. The most important features can be ranked and selected based on statistical measures, such as mutual information. Feature selection not only reduces the size of dataset...
Article
Full-text available
A typical machine learning development cycle maximizes performance during model training and then minimizes the memory and area footprint of the trained model for deployment on processing cores, graphics processing units, microcontrollers or custom hardware accelerators. However, this becomes increasingly difficult as machine learning models grow l...
Article
Full-text available
Visual Odometry (VO) systems are widely used to determine the position and orientation of a robot or camera in an unknown environment. They are deployed on resource-constrained platforms, such as drones, and virtual reality or augmented reality headsets. VO systems harnessing modern Systems-on-Chip (SoCs) with integrated Field Programmable Gate Arra...
Preprint
Full-text available
A typical machine learning (ML) development cycle for edge computing is to maximise the performance during model training and then minimise the memory/area footprint of the trained model for deployment on edge devices targeting CPUs, GPUs, microcontrollers, or custom hardware accelerators. This paper proposes a methodology for automatically generat...
Preprint
Full-text available
The AMD UltraScale+ XCZU9EG device is a Multi-Processor System-on-Chip (MPSoC) with embedded Programmable Logic (PL) that excels in many Edge (e.g., automotive or avionics) and Cloud (e.g., data centres) terrestrial applications. However, it incorporates a large number of SRAM cells, making the device vulnerable to Neutron-induced Single Event Upse...
Preprint
We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dim...
Article
Full-text available
The AMD UltraScale+ XCZU9EG, a multiprocessor system-on-chip (MPSoC) with integrated programmable logic (PL), is vulnerable to the effects of atmospheric radiation due to its large SRAM count. This article explores the effectiveness of the MPSoC's embedded soft-error mitigation mechanisms through accelerated atmospheric-like neutron radiation testi...
Article
Full-text available
Top-of-rack switches based on photonic switching fabrics (PSF) could provide higher bandwidth and energy efficiency for datacenters (DC) and high-performance computers (HPC) than those with traditional electronic crossbars. However, because of their bufferless nature, PSF are affected by contention much more drastically than traditional packet-swit...
Preprint
Full-text available
This paper studies the dependability of the Xilinx Deep-Learning Processing Unit (DPU) under neutron irradiation. It analyses the impact of Single Event Effects (SEEs) on the accuracy of the DPU running the resnet50 model on a Xilinx UltraScale+ MPSoC.
Article
Full-text available
Simulation-based performance prediction is complicated and time-consuming. In this study, we apply supervised learning to predict the performance scores of Standard Performance Evaluation Corporation (SPEC) benchmarks. The SPEC CPU2017 is a public dataset of results obtained by executing 43 standardised performance benchmarks organised into 4 suite...
Preprint
QNNVerifier is the first open-source tool for verifying implementations of neural networks that takes into account the finite word-length (i.e. quantization) of their operands. The novel support for quantization is achieved by employing state-of-the-art software model checking (SMC) techniques. It translates the implementation of neural networks to...
Preprint
Full-text available
Progress in the last decade has brought about significant improvements in the accuracy and speed of SLAM systems, broadening their mapping capabilities. Despite these advancements, long-term operation remains a major challenge, primarily due to the wide spectrum of perturbations robotic systems may encounter. Increasing the robustness of SLAM algor...
Chapter
Dynamic Voltage and Frequency Scaling is the most commonly used power management technique in modern processors. However, the ability of an individual chip to operate under reduced supply voltage can no longer be predetermined at the design stage and may even change over time. This paper presents a dynamic power-management strategy for out-of-order...
Chapter
End-to-End training (E2E) is becoming more and more popular to train complex Deep Network architectures. An interesting question is whether this trend will continue—are there any clear failure cases for E2E training? We study this question in depth, for the specific case of E2E training an ensemble of networks. Our strategy is to blend the gradient...
Article
Silicon Photonic interconnects are a promising technology for scaling computing systems into the exa-scale domain. However, there exist significant challenges in terms of optical losses and complexity. In this work, we evaluate the applicability of a thermally/electrically tuned Beneš network based on Mach–Zehnder Interferometers for on-chip and in...
Article
Full-text available
SpiNNaker is a massively-parallel computer system optimized for the simulation, in real time, of very large networks of spiking neurons. The system consists of over 1 million energy-efficient ARM cores distributed over 57,600 SpiNNaker chips, each of which contains 18 cores interconnected by a neurobiologically-inspired, asynchronous (clock-less)...
Thesis
Full-text available
Energy-efficient machine learning has been gaining interest due to the increased use of machine learning, in particular deep learning, in applications that run on mobile and embedded devices. These devices are constrained in terms of resources in computation, memory and power, which limit the adoption of deep learning-based solutions, which are know...
Preprint
Full-text available
Building predictive models that estimate energy consumption of convolutional neural networks. Considers empirical power measurements derived from power sensors and power pins on mobile platforms. Tested on Snapdragon 820 and Jetson TX1.
Article
Full-text available
The design of new computer architectures relies heavily on simulation. New architectures that incorporate unconventional features or novel designs cannot usually use established simulators and, therefore, designers have to adapt an existing one or develop their own from scratch. Traditionally, software-based simulators have been the main platform...
Preprint
Full-text available
Modern operating systems all support multiple users, so that users can share a computer simultaneously without affecting each other. However, there are some limitations. For example, there is a privacy problem: users are visible to each other in terms of running processes and files. Moreover, users have little freedom to customize the system environment. L...
Article
HPC architects are currently facing myriad challenges from ever tighter power constraints and changing workload characteristics. In this article we discuss the current state of FPGAs within HPC systems. Recent technological advances show that they are well placed for penetration into the HPC market. However, there are still a number of research pro...
Conference Paper
Leveraging existing virtual machine infrastructures is attractive for programming language implementors because competitive runtime performance may be achieved with a reduced effort. For example, the Truffle framework has enabled Ruby (TruffleRuby) and C (Sulong) guest language implementations to be hosted on a Java Virtual Machine (JVM). In this p...
Conference Paper
With micro-services continuously gaining popularity and low-power processors making their way into data centers, efficient execution of managed runtime systems on low-power architectures is also gaining interest. Apart from the inherent performance differences between high and low power processors, porting a managed runtime system to a low-power ar...
Conference Paper
WebAssembly is a binary format compilation target for languages such as C/C++, Rust and Go. It enables execution within Web browsers and as standalone programs. Compiled modules may interoperate with other languages such as JavaScript, and use external calls (imports) to interact with a host environment. Such interoperability dependencies influence...
Conference Paper
Full-text available
As the silicon industry moves into deep nanoscale technologies, preserving Mean Time to Failure at acceptable levels becomes a first-order challenge. The operational stress, along with the inefficient power dissipation and the unsustainable thermal thresholds, increases the wear-induced failures. As a result, faster wear-out leads to earlier performa...
Conference Paper
Silicon Photonic interconnects are a promising technology for scaling computing systems into the exa-scale domain. However, significant challenges exist in terms of optical losses and complexity. In this work, we examine the applicability of a thermally/electrically tuned Beneš network based on Mach-Zehnder Interferometers for on-chip interconnects a...
Conference Paper
Full-text available
The demand on memory capacity from applications has always challenged the available technologies. It is therefore important to understand that this demand and the consequential limitations in various aspects led to the appearance of new memory technologies and system designs. Fundamentally, not a single solution has managed to fully solve this memo...
Conference Paper
Interconnection networks are one of the main limiting factors when it comes to scaling out computing systems. In this paper, we explore what role the hybridization of topologies has on the design of a state-of-the-art exascale-capable computing system. More precisely, we compare several hybrid topologies with common single-topology ones w...
Article
Full-text available
Modern applications generate massive amounts of data that are challenging to process or analyse. Graph algorithms have emerged as a solution for the analysis of such data because they can represent the entities participating in the generation of large-scale datasets in terms of vertices and their relationships in terms of edges. Graph analysis algor...
Chapter
Recent work has integrated semantics into the 3D scene models produced by visual SLAM systems. Though these systems operate close to real time, a study of the ways to achieve real-time performance by trading off between semantic model accuracy and computational requirements is still lacking. ORB-SLAM2 provides good scene accuracy and real-time proc...
Preprint
Full-text available
Predicting the execution time of queries is an important problem with applications in scheduling, service level agreements and error detection. During query planning, a cost is associated with the chosen execution plan and used to rank competing plans. It would be convenient to use that cost to predict execution time, but it has been claimed in the...
Conference Paper
Development of application-specific accelerators for deep convolutional neural networks (ConvNets) has mainly focussed on accelerating the computationally intensive layers, that is, the convolutional layers, to improve performance and energy efficiency. Traditional approaches in this space have relied on handcrafted dataflow implementations to leve...
Chapter
FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how to better support their communications, since standard protocols are known to hinder their performance greatly either by re...
Conference Paper
Full-text available
We demonstrate the feasibility of undertaking performance evaluations for JVMs using: (1) a hybrid JVM/OS tool, such as async-profiler, (2) OS centric profiling and tracing tools based on Linux perf, and (3) the Extended Berkeley Packet Filter Tracing (eBPF) framework where we demonstrate the rationale behind the standard offwaketime tool, for anal...
Article
Full-text available
This paper presents INRFlow, a mature, frugal, flow-level simulation framework for modelling large-scale networks and computing systems. INRFlow is designed to carry out performance-related studies of interconnection networks for both high performance computing systems and datacentres. It features a completely modular design in which adding new top...
Preprint
Full-text available
Related code is available at https://github.com/grey-area/modular-loss-experiments. We examine the practice of joint training for neural network ensembles, in which a multi-branch architecture is trained via a single loss. This approach has recently gained traction, with claims of greater accuracy per parameter along with increased parallelism. We in...
Preprint
Full-text available
We examine the practice of joint training for neural network ensembles, in which a multi-branch architecture is trained via a single loss. This approach has recently gained traction, with claims of greater accuracy per parameter along with increased parallelism. We introduce a family of novel loss functions generalizing multiple previously proposed a...
Conference Paper
Full-text available
There is a huge demand for on-device execution of deep learning algorithms on mobile and embedded platforms. These devices present constraints on the application due to limited hardware resources and power. However, current evaluation studies in existing deep learning frameworks (for example, Caffe, Tensorflow, Torch and others) are limited to perf...
Article
Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi‐Processor System‐on‐Chip), combining multiple hard‐core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, who are increasingly reliant on heterogeneity t...
Presentation
Full-text available
Energy measurements and energy predictive models for Conv layers on ARM mobile platforms
Preprint
Full-text available
The proliferation of heterogeneous hardware in recent years means that every system we program is likely to include a mix of compute elements, each with different characteristics. By utilizing these available hardware resources, developers can improve the performance and energy efficiency of their applications. However, existing tools for heterogen...
Conference Paper
Full-text available
It is attractive to host new or existing language implementations on top of, or reusing components of, existing managed language runtimes such as the Java Virtual Machine (JVM) or the Microsoft Common Language Infrastructure (CLI). A benefit is that software development effort may be reduced, as only one managed language runtime needs to be optimis...
Preprint
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functi...
Preprint
Full-text available
Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research ef...
Article
Full-text available
Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research ef...
Presentation
Full-text available
The presentation is about per-layer energy measurements and energy prediction of Convolutional Neural Networks on mobile systems like the Jetson TX1 and Snapdragon 820. It is developed in Caffe / Caffe2. It uses OpenBLAS and Eigen libraries to accelerate computations on the CPU.
Article
The convergence between computing‐ and data‐centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper, we investigate alternatives for the storage subsystem of a novel exascale‐capable system with special emphasis on how allocation strategies would affect the overall perfo...
Preprint
Full-text available
The constant growth of data and its importance to drive Machine Learning and Big Data is pushing storage systems towards ever increasing I/O bandwidth and lower latency requirements. In recent years, the Non Volatile Memory Express (NVMe) standard has enabled SSD drives to deliver high I/O rates by allowing the storage to be connected directly via...
Article
In large-scale software applications, programmers often combine different programming languages because this allows them to use the most suitable language for a given problem, to gradually migrate existing projects from one language to another, or to reuse existing source code. However, different programming languages have fundamentally different i...
Preprint
Full-text available
There is a huge demand for on-device execution of deep learning algorithms on mobile and embedded platforms. These devices present constraints on the application due to limited resources and power. Hence, developing energy-efficient solutions to address this issue will require innovation in algorithmic design, software and hardware. Such innovation...
Article
The ExaNeSt project started in December 2015 and is funded by the EU H2020 research framework (call H2020-FETHPC-2014, n. 671553) to study the adoption of low-cost, Linux-based power-efficient 64-bit ARM processor clusters for Exascale-class systems. The ExaNeSt consortium pools partners with industrial and academic research expertise in storage, inte...
Presentation
Full-text available
Measuring and predicting the energy consumed by the Convolutional layers in a Convolutional neural network on the CPU of Jetson TX1.
Conference Paper
Dynamic Binary Modification (DBM) is a technique for modifying applications transparently while they are executed, working at the level of native code. However, DBM introduces a performance overhead, which in some cases can dominate execution time, making many uses impractical. The ARM hardware ecosystem poses unique challenges for high performance...
Chapter
Power consumption is the main hurdle in the race for designing Exascale-capable computing systems which would require deploying millions of computing elements. While this problem is being addressed by designing increasingly more power-efficient processing subsystems, little effort has been put into reducing the power consumption of the interconnectio...
Article
Optical on-chip data transmission enabled by silicon photonics (SiP) is widely considered a key technology to overcome the bandwidth and energy limitations of electrical interconnects. The possibility of integrating optical links into the on-chip communication fabric has opened up a fascinating new research field—Optical Networks-on-Chip (ONoCs)—wh...