The end of Dennard scaling [1] and of Moore's law [2], and the resulting difficulty of increasing the clock frequency, forced the engineering community to shift to multi-/many-core processors and multi-node systems as an alternative way to improve performance. An increased number of cores benefits many workloads, but programming limitations still reduce performance because the available parallelism is not fully exploited.
From this perspective, new execution models are emerging to overcome such limitations and scale up performance. Execution models like Data-Flow can take advantage of the full parallelism of an application, thanks to the possibility of creating many asynchronous threads that run in parallel. These threads encapsulate the data to be processed and their dependencies and, once completed, write their output for other threads. Data-Flow Threads (DF-Threads) is a novel Data-Flow execution model that maps threads onto local or distributed cores transparently to the programmer [3]. This model can be parallelized massively across different cores, and it can handle hundreds of thousands of Data-Flow threads, or more, together with their associated data regions.
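The following minimal C sketch illustrates this firing rule. It is only an illustration of the general Data-Flow principle, under assumed structure and function names (df_thread, write_input), and not the actual DF-Threads implementation of [3]: a thread carries a synchronization count of pending inputs and becomes runnable only once every producer has written into its frame.

    #include <stdio.h>

    /* Minimal sketch of the Data-Flow firing rule (assumed names, not
     * the actual DF-Threads implementation of [3]). A thread carries a
     * synchronization count (SC) of pending inputs; it runs only when
     * every producer has written into its frame. */
    typedef struct df_thread {
        int  sc;                          /* inputs still missing       */
        long frame[2];                    /* data written by producers  */
        void (*body)(struct df_thread *); /* code to run when sc == 0   */
    } df_thread;

    /* A producer writes one input and decrements the consumer's SC;
     * the write that brings the SC to zero fires the consumer. */
    static void write_input(df_thread *t, int slot, long value) {
        t->frame[slot] = value;
        if (--t->sc == 0)
            t->body(t);                   /* all inputs ready: execute  */
    }

    static void add_body(df_thread *t) {
        printf("sum = %ld\n", t->frame[0] + t->frame[1]);
    }

    int main(void) {
        df_thread adder = { .sc = 2, .body = add_body };
        write_input(&adder, 0, 40);       /* adder still waits          */
        write_input(&adder, 1, 2);        /* second input: adder fires  */
        return 0;
    }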
This thesis presents a further implementation and evaluation of the DF-Threads model (previously proposed by R. Giorgi [3]). The proposed model can exploit the full parallelism of modern heterogeneous embedded architectures (e.g., the AXIOM-Board [4]). The work relies on the introduction of the "Data-Flow Engine" (DF-Engine), which accelerates the execution of functions and spawns many asynchronous, data-driven threads across the general-purpose cores of a multi-core/multi-node system. The DF-Engine can be implemented either in software or directly in hardware on a heterogeneous architecture (e.g., the AXIOM-Board). The DF-Engine handles the creation, the dependency resolution, and the locality of many fine-grained threads, leaving the general-purpose cores to focus only on the execution of the threads. This implementation is a hybrid Data-Flow/von-Neumann model, which harnesses the parallelism and data synchronization inherent in the Data-Flow paradigm, while maintaining the programmability of the von-Neumann model.
Starting from the DF-Threads execution model, we analyzed the trade-offs of a minimalistic API that enables an efficient implementation, capable of distributing the DF-Threads either locally, across the cores of a single multi-core system, or remotely, across the cores of a cluster.
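A hypothetical C sketch of what such a minimalistic API could look like is shown below; the names and signatures are illustrative assumptions in the spirit of [3], not the actual interface:

    /* Hypothetical shape of a minimalistic DF-Threads-style API; the
     * names and signatures are illustrative assumptions, not the
     * interface defined in [3]. */
    typedef unsigned long fid_t;  /* frame identifier of a DF-thread    */

    /* Create a thread that runs `ip` once `sc` inputs have arrived.    */
    fid_t df_schedule(void (*ip)(void), unsigned sc);

    /* Write a datum into slot `off` of thread `dest`'s frame,
     * implicitly decrementing its synchronization count.               */
    void df_write(fid_t dest, unsigned off, unsigned long value);

    /* Read slot `off` of the current thread's own frame.               */
    unsigned long df_read(unsigned off);

    /* Release the current thread's frame once its outputs are read.    */
    void df_destroy(void);

With an interface of this shape, the same calls can be served locally or forwarded to a remote node, since a frame identifier does not expose where the frame physically resides; this is the kind of trade-off analyzed in the thesis.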
Implementing and evaluating the proposed model directly on a real architecture requires time, resources, and effort. Therefore, the design has been preliminarily evaluated in a simulation framework, and the validated model has then been gradually migrated to a real board in collaboration with my research group.
The simulation framework presented in this thesis is based on the COTSon simulator [5] and on a set of tools named "MY Design Space Exploration" (MYDSE) [6], which have been implemented and adopted by our research group. We then explain how the validation of the simulation framework has been performed against real architectures such as x86_64 and AArch64. Moreover, we analyzed the impact of different Linux distributions on the execution.
Afterward, we explain the workflow adopted to migrate the design of the DF-Threads execution model from the COTSon simulator to a High-Level Synthesis (HLS) framework (e.g., Xilinx HLS), targeting a heterogeneous architecture such as the AXIOM-Board [4]. We used a driving example that models a two-way set-associative cache to demonstrate how simply and rapidly our framework migrates a design from the COTSon simulator to the HLS framework. The methodology has been adopted in the context of the AXIOM project [7], and it helped our research team reduce the development time from days/weeks to minutes/hours.
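As an illustration, such a driving example can be modeled along the lines of the following C sketch of a two-way set-associative cache lookup with LRU replacement; the sizes, field names, and replacement policy are assumptions made for illustration, not the thesis code:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative model of a two-way set-associative cache with LRU
     * replacement (parameters are assumptions, not the thesis code).  */
    #define NSETS 64                       /* sets per way             */
    #define OFFSET_BITS 4                  /* 16-byte cache lines      */
    #define SET_BITS 6                     /* log2(NSETS)              */

    typedef struct {
        uint32_t tag[2];   /* one tag per way                          */
        uint8_t  valid[2]; /* valid bit per way                        */
        uint8_t  lru;      /* way to evict next (0 or 1)               */
    } cache_set;

    static cache_set sets[NSETS];

    /* Returns 1 on hit, 0 on miss (the line is then filled).          */
    static int cache_access(uint32_t addr) {
        uint32_t set = (addr >> OFFSET_BITS) & (NSETS - 1);
        uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
        cache_set *s = &sets[set];

        for (int way = 0; way < 2; way++) {
            if (s->valid[way] && s->tag[way] == tag) {
                s->lru = 1 - way;          /* other way becomes LRU    */
                return 1;                  /* hit                      */
            }
        }
        int victim = s->lru;               /* miss: fill the LRU way   */
        s->tag[victim] = tag;
        s->valid[victim] = 1;
        s->lru = 1 - victim;
        return 0;
    }

    int main(void) {
        uint32_t trace[] = { 0x1000, 0x1000, 0x2000, 0x3000, 0x1000 };
        for (unsigned i = 0; i < 5; i++)
            printf("0x%04x -> %s\n", trace[i],
                   cache_access(trace[i]) ? "hit" : "miss");
        return 0;
    }

A cycle-level model of this kind can be exercised both inside the simulator and as an HLS kernel, which is what makes it a convenient driving example for the migration workflow.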
Finally, we present the evaluation of the proposed DF-Threads execution model. We are interested in stressing and analyzing the efficiency of the DF-Engine with thousands of Data-Flow threads or more. For this goal, we used the Recursive Fibonacci algorithm, which easily generates such a high number of threads. Moreover, we want to study the behavior of the execution model with data-intensive applications, in order to evaluate the performance of memory operations and data movements. For this reason, we adopted the Matrix Multiplication benchmark, which is the main computational kernel of widely used applications (e.g., Smart Home Living, Smart Video Surveillance, Artificial Intelligence). Both kernels are sketched below.
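The Recursive Fibonacci kernel is summarized by the following C sketch. The plain sequential form is shown only to make the exponential growth of the call tree evident; in a DF-Threads version, each call would become a Data-Flow thread whose two inputs are the results of its sub-calls, but no assumption is made here about the actual decomposition used in the thesis.

    #include <stdio.h>

    /* Recursive Fibonacci: fib(n) spawns two sub-problems, so the
     * call tree grows exponentially with n, which makes it a
     * convenient generator of very many fine-grained threads.        */
    static unsigned long fib(unsigned n) {
        if (n < 2)
            return n;                 /* leaf of the call tree        */
        return fib(n - 1) + fib(n - 2);
    }

    int main(void) {
        /* fib(30) already produces about 2.7 million calls.          */
        printf("fib(30) = %lu\n", fib(30));
        return 0;
    }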
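Similarly, the core of the Matrix Multiplication benchmark is the standard triple loop below, shown in its sequential reference form; a Data-Flow version would partition the work into independent blocks, but again the actual DF-Threads decomposition is not implied by this sketch.

    #include <stdio.h>

    /* Reference form of the dense matrix-multiplication kernel
     * (C = A x B) on which the benchmark is built.                    */
    static void matmul(int n, const double *A, const double *B,
                       double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;  /* ~2n floating-point ops/element */
            }
    }

    int main(void) {
        const double A[4] = { 1, 2, 3, 4 };   /* 2x2 example matrices */
        const double B[4] = { 5, 6, 7, 8 };
        double C[4];
        matmul(2, A, B, C);
        printf("%g %g / %g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }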
The proposed design has been evaluated against OpenMPI, which is typically adopted for cluster programming, and against CUDA, a parallel programming platform for GPUs. DF-Threads achieves better performance per core compared to both OpenMPI and CUDA. In particular, OpenMPI exhibits much more Operating System (OS) kernel activity than DF-Threads, and this OS activity slows down the OpenMPI performance. If we consider the delivered GFLOPS per core, DF-Threads is also competitive with respect to CUDA.