Project
RECIPE – FETHPC-02-2017 – Transition to Exascale Computing – project led by POLIMI (www.recipe-project.eu)
Project log
The increasing power density in modern high-performance multi-processor systems-on-chip (MPSoCs) is fueling a revolution in thermal management. On the one hand, thermal phenomena are becoming a critical concern, making accurate and efficient simulation a necessity. On the other hand, a variety of physically heterogeneous solutions are coming into play: liquid, evaporative, thermoelectric cooling, and more. A new generation of simulators, with unprecedented flexibility, is thus required. In this paper, we present 3D-ICE 3.0, the first thermal simulator to allow for accurate nonlinear descriptions of complex and physically heterogeneous heat dissipation systems, while preserving the efficiency of the latest compact modeling frameworks at the silicon die level. 3D-ICE 3.0 allows designers to extend the thermal simulator with new heat sink models while simplifying the time-consuming step of model validation. Support for nonlinear dynamic models is included, for instance to accurately represent variable coolant flows. Our results present validated models of a commercial water heat sink and an air heat sink plus fan that achieve an average error below 1 °C and simulate, respectively, up to 3x and 12x faster than the real physical phenomena.
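As a loose illustration of compact thermal modeling at the die level (this is not the 3D-ICE model, and all parameter values are invented for the example), a single lumped RC node already shows how temperature tracks a power trace:

```python
# Minimal sketch, assuming a single thermal node: the die is lumped into one
# thermal capacitance C [J/K] coupled to the ambient through a thermal
# resistance R [K/W]. Values are purely illustrative.

def simulate(power_trace, dt=1e-3, R=0.5, C=2.0, t_amb=25.0):
    """Explicit-Euler integration of C*dT/dt = P - (T - T_amb)/R."""
    temps = []
    t = t_amb
    for p in power_trace:
        t += dt * (p - (t - t_amb) / R) / C
        temps.append(t)
    return temps

# A 50 W step applied for 10 s drives the node towards T_amb + R*P = 50 °C.
trace = simulate([50.0] * 10_000)
print(f"final temperature: {trace[-1]:.2f} °C")
```

Real heat sink models, and in particular the nonlinear ones discussed in the paper, replace the single R and C with a full network whose element values may depend on the operating point.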
Application requirements in High-Performance Computing (HPC) are becoming increasingly exacting, and the demand for computational resources is rising. In parallel, new application domains are emerging, along with additional requirements such as meeting real-time constraints. This requirement, typical of embedded systems, is difficult to guarantee when dealing with HPC infrastructures, due to the intrinsic complexity of the system. Traditional embedded-system static analyses to estimate the Worst-Case Execution Time (WCET) are not applicable to HPC, because modeling and analyzing all the system's hardware and software components is not practical. Measurement-based probabilistic analyses for the WCET emerged in the last decade to overcome these issues, but they require the system to satisfy certain conditions to estimate a correct and safe WCET. In this work, we show the emerging application timing requirements, and we propose to exploit probabilistic real-time theory to achieve the required time predictability. After a brief recap of the fundamentals of this methodology, we focus on its applicability to HPC systems, checking their ability to satisfy such conditions. In particular, we study the advantages of having heterogeneous processors in HPC nodes and how resource management affects the applicability of the proposed technique.
Heterogeneous computing is a promising solution to scale the performance of computing systems while maintaining energy and power efficiency. Managing such resources is, however, complex, and it requires smart resource allocation strategies in both embedded and high-performance systems. In this short paper, we propose a game-theoretic approach to allocate heterogeneous resources to applications, with a focus on performance, power, and energy requirements. A congestion game model has been selected and a cost function designed. The proposed allocation strategy is then assessed through a preliminary experimental evaluation.
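To make the congestion-game idea concrete, the sketch below iterates best-response dynamics until no application wants to switch resource; the resource classes, power weights and cost function are illustrative placeholders, not the ones designed in the paper:

```python
# Hedged sketch of a congestion-game allocation: each application repeatedly
# picks the resource class minimizing a cost that grows with the number of
# co-located applications (congestion) plus a per-resource power penalty.
# Best-response dynamics in congestion games converge to a pure Nash
# equilibrium, which is used here as the allocation decision.

RESOURCES = {"big_core": 1.0, "little_core": 0.4, "gpu": 0.7}  # illustrative power weights

def cost(resource, users):
    return users + 2.0 * RESOURCES[resource]   # congestion term + power penalty

def allocate(apps):
    assignment = {app: "little_core" for app in apps}
    changed = True
    while changed:
        changed = False
        for app in apps:
            load = {r: sum(1 for a in apps if assignment[a] == r) for r in RESOURCES}
            load[assignment[app]] -= 1          # exclude the app before it re-chooses
            best = min(RESOURCES, key=lambda r: cost(r, load[r] + 1))
            if best != assignment[app]:
                assignment[app] = best
                changed = True
    return assignment

print(allocate([f"app{i}" for i in range(6)]))
```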
In modern embedded systems, the use of hardware-level online power monitors is crucial to support the run-time power optimizations required to meet the ever-increasing demand for energy efficiency. To be effective and to deal with time-to-market pressure, the presence of such requirements must be considered even during the design of the power monitoring infrastructure. This paper presents a power model identification and implementation strategy with two main advantages over the state of the art. First, our solution trades the accuracy of the power model against the amount of resources allocated to the power monitoring infrastructure. Second, the use of an automatic power model instrumentation strategy ensures a timely implementation of the power monitor regardless of the complexity of the target computing platform. Our methodology has been validated against 8 accelerators generated through a High-Level Synthesis flow and by considering a more complex RISC-V embedded computing platform. Depending on the imposed user-defined constraints, and with respect to unconstrained state-of-the-art power monitoring solutions, our methodology shows a resource saving between 37.3% and 81% while the maximum average accuracy loss stays within 5% when using an aggressive 20 µs temporal resolution. However, by relaxing the temporal resolution towards the values proposed in the state of the art, i.e., in the range of hundreds of microseconds, the average accuracy loss of our power monitors drops below 1% with almost the same overheads. In addition, our solution demonstrated the possibility of delivering a resource-constrained power monitor employing a 20 µs temporal resolution, i.e., far finer than the one used by current state-of-the-art solutions.
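The underlying counter-based power modeling step can be pictured as a simple least-squares identification, sketched below on synthetic data; the accuracy/resource trade-off and the automatic hardware instrumentation contributed by the paper are not captured here:

```python
# Hedged sketch: fit a linear power model P ≈ w·x + b from per-interval event
# counts x (e.g., ALU operations, memory accesses) and measured power P,
# using ordinary least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
events = rng.integers(0, 1000, size=(200, 3)).astype(float)   # 3 event counters
true_w = np.array([0.8, 1.5, 0.3])                            # mW per event (illustrative)
power = events @ true_w + 40.0 + rng.normal(0, 5.0, 200)      # 40 mW static power + noise

X = np.hstack([events, np.ones((200, 1))])                    # add intercept column
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
print("estimated weights and static power:", coef.round(2))
print("mean absolute error [mW]:", np.abs(X @ coef - power).mean().round(2))
```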
Modern embedded systems are in charge of an increasing number of tasks that extensively employ floating-point (FP) computations. The ever-increasing efficiency requirement, coupled with the additional computational effort to perform FP computations, motivates several microarchitectural optimizations of the FPU. This manuscript presents a novel modular FPU microarchitecture, which targets modern embedded systems and considers heterogeneous workloads including both best-effort and accuracy-sensitive applications. The design optimizes the EDP-accuracy-area figure of merit by allowing the precision of each FP operation to be independently configured at design time, while the FP dynamic range is kept common to the entire FPU to deliver a simpler microarchitecture. To ensure the correct execution of accuracy-sensitive applications, a novel compiler pass substitutes each FP operation for which only low-precision hardware support is offered with the corresponding soft-float function call. The assessment considers seven FPU variants encompassing three different state-of-the-art designs. The results on several representative use cases show that the binary32 FPU implementation offers an EDP gain of 15%, while, in case the FPU implements a mix of binary32 and bfloat16 operations, the EDP gain is 19%, the reduction in resource utilization is 21% and the average accuracy loss is less than 2.5%. Moreover, the resource utilization of our FPU variants is aligned with that of FPUs employing state-of-the-art, highly specialized FP hardware accelerators. Starting from the assessment, a set of guidelines is drawn to steer the design of the FP hardware support in modern embedded systems.
The strict requirements on timing correctness have biased the modeling and analysis of real-time systems toward worst-case performance. Such a focus on the worst case, however, does not provide enough information to effectively steer resource/energy optimization. In this article, we integrate a probabilistic energy prediction strategy with the precise scheduling of mixed-criticality tasks, where timing correctness must be met for all tasks in all scenarios. Dynamic voltage and frequency scaling (DVFS) is applied to this precise scheduling policy to enable energy minimization. We propose a probabilistic technique to derive an energy-efficient speed (for the processor) that minimizes the average energy consumption, while guaranteeing the (worst-case) timing correctness for all tasks, including LO-criticality ones, under any execution condition. We present a response time analysis for such systems under the non-preemptive fixed-priority scheduling policy. Finally, we conduct an extensive simulation campaign based on randomly generated task sets to verify the effectiveness of our algorithm with respect to energy savings, which reports up to 46% energy saving.
With the recent advances in quantum computing, code-based cryptography is foreseen to be one of the few mathematical solutions to design quantum-resistant public-key cryptosystems. Binary polynomial multiplication dominates the computational time of the primitives in such cryptosystems, thus the design of efficient multipliers is crucial to optimize the performance of post-quantum public-key cryptographic solutions. This manuscript presents a flexible template architecture for the hardware implementation of large binary polynomial multipliers. The architecture combines the iterative application of the Karatsuba algorithm, to minimize the number of required partial products, with the Comba algorithm, used to optimize the schedule of their computation. In particular, the proposed multiplier architecture supports operands in the order of tens of thousands of bits, and it offers a wide range of performance-resource trade-offs that is independent of the size of the input operands. To demonstrate the effectiveness of our solution, we employed the nine configurations of the LEDAcrypt public-key cryptosystem as representative use cases for large-degree binary polynomial multiplication. For each configuration, we showed that our template architecture can deliver a performance-optimized multiplier implementation for each FPGA of the Xilinx Artix-7 mid-range family. The experimental validation performed by implementing our multiplier for all the LEDAcrypt configurations on the Artix-7 12 and 200 FPGAs, i.e., the smallest and the largest devices of the Artix-7 family, demonstrated an average performance gain of 3.6x and 33.3x with respect to an optimized software implementation employing the gf2x C library.
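The partial-product reduction obtained by the iterative Karatsuba step can be illustrated in software on binary polynomials encoded as integers (a sketch only: the Comba scheduling of the base-case products and the hardware template itself are not represented):

```python
# Karatsuba multiplication over GF(2)[x]: binary polynomials are encoded as
# Python integers (bit i = coefficient of x^i), so addition is XOR and there
# are no carries.

def clmul(a, b):
    """Schoolbook carry-less (GF(2)) multiplication, used as the base case."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, threshold=64):
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul(a, b)
    m = n // 2
    a0, a1 = a & ((1 << m) - 1), a >> m
    b0, b1 = b & ((1 << m) - 1), b >> m
    low = karatsuba_gf2(a0, b0, threshold)
    high = karatsuba_gf2(a1, b1, threshold)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, threshold) ^ low ^ high  # cross terms
    return (high << (2 * m)) ^ (mid << m) ^ low

# Sanity check against the schoolbook multiplier on random large operands.
import random
random.seed(0)
for _ in range(20):
    x, y = random.getrandbits(4000), random.getrandbits(4000)
    assert karatsuba_gf2(x, y) == clmul(x, y)
print("Karatsuba over GF(2)[x]: OK")
```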
Estimating a tight and safe Worst-Case Execution Time (WCET), needed for certification in safety-critical environments, is a challenging problem for modern embedded systems. A possible solution proposed in past years is to exploit statistical tools to obtain a probability distribution of the WCET. These probabilistic real-time analyses for the WCET are, however, subject to errors, even when all the applicability hypotheses are satisfied and verified. This is caused by the uncertainties of the probabilistic-WCET distribution estimator. This article aims at improving the measurement-based probabilistic timing analysis approach by providing techniques to analyze and deal with such uncertainties. The so-called region of acceptance, a model based on state-of-the-art statistical test procedures, is defined over the distribution parameter space. From this model, a set of strategies is derived and discussed to deal with the safety/tightness trade-off of the WCET estimation. These techniques are then tested on real datasets, including industrial safety-critical applications, to show the added value of the proposed approach in probabilistic WCET analyses.
With the Internet-of-Things revolution, embedded devices are in charge of an ever-increasing number of tasks, ranging from sensing up to Artificial Intelligence (AI) functions. In particular, AI is gaining importance since it can dramatically improve the QoS perceived by the final user and it makes it possible to cope with problems whose algorithmic solution is hard to find. However, the associated computational requirements, mostly made of floating-point processing, impose a careful design and tuning of the computing platforms. In this scenario, there is a need for a set of benchmarks representative of the emerging AI applications and useful to compare the efficiency of different architectural solutions and computing platforms. In this paper we present a suite of benchmarks encompassing Computer Graphics, Computer Vision and Machine Learning applications, which are widely used in many AI scenarios. Unlike other suites, these benchmarks are kernels tailored to be effectively executed bare-metal, and they specifically stress the floating-point support offered by the computing platform.
This paper presents a methodology and framework to model the behavior of superscalar microprocessors. The simulation is focused on timing analysis and ignores all functional aspects. The methodology also provides a framework for building new simulators for generic architectures. The results obtained show good accuracy and satisfactory computational efficiency. Furthermore, the C++ SDK allows rapid development of new processor models, making the methodology suitable for design space exploration over new processor architectures.
In this paper we propose an application-level power consumption modeling and optimization technique for mobile devices. The application being considered is modeled as an FSM, and power consumption figures are associated with it through current measurements on selected states, followed by the application of a linear functional model. The FSM model is then used, together with a power management policy, to extend battery lifetime while guaranteeing the execution of essential states in the application. In this paper, the methodology is applied to a specific case study, namely the consumption of multimedia content in an e-learning scenario.
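A minimal sketch of the FSM-based power model follows; the state names, current figures and residence times are hypothetical, and the power management policy of the paper is not reproduced:

```python
# Hedged sketch: each application state carries a measured current draw, and
# battery lifetime follows from the average current weighted by the expected
# residence time in each state. All numbers are invented for the example.

STATES = {                 # state: (current [mA], mean residence time per cycle [s])
    "idle":     (5.0,  30.0),
    "download": (180.0, 10.0),
    "playback": (120.0, 60.0),
}
BATTERY_MAH = 1000.0

def average_current(states):
    total_time = sum(t for _, t in states.values())
    return sum(i * t for i, t in states.values()) / total_time

def lifetime_hours(states, capacity_mah):
    return capacity_mah / average_current(states)

print(f"average current: {average_current(STATES):.1f} mA")
print(f"estimated lifetime: {lifetime_hours(STATES, BATTERY_MAH):.1f} h")
```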
With the Internet-of-Things (IoT), embedded devices are required to execute multiple applications under severe energy and performance constraints. The traditional approaches to controlling the energy budget and the energy allocation by means of the DVFS actuator are losing effectiveness for two reasons. First, technological advancements prevent further voltage scaling. Second, the strict low-power requirements are moving the design of such control strategies from software- to hardware-level. This paper proposes a unified control-theoretic scheme to coordinate the design of energy-budget and energy-allocation solutions for low-power platforms. The scheme allows the integration of any performance policy, while still ensuring the theoretical exponential stability of the overall controller. We integrated the proposed control scheme into an open-hardware quad-core RISC architecture, also demonstrating the use of two actuators, Dynamic Frequency Scaling and Dynamic Clock Gating. The design has been implemented on the Nexys4-DDR board featuring a Xilinx Artix-7 100T FPGA. We report an area occupation of the controller limited to 0.86% (FFs) and 5.3% (LUTs) of the target FPGA chip. Collected results show that the average efficiency in exploiting the imposed budget is 98.27%. The average budget overflow is 1.43 mW. Finally, the performance utility loss due to the control scheme is 1.87% on average.
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project, which span run-time management, heterogeneous computing architectures, HPC memory/interconnection infrastructures, thermal modelling, reliability, programming models, and timing analysis. For each of these areas, the paper describes the relevant state of the art as well as the specific actions that the project will take to effectively address the identified technological challenges.
High-Performance Computing (HPC) is rapidly moving towards the adoption of nodes characterized by a heterogeneous set of processing resources. This has already shown benefits in terms of both performance and energy efficiency. On the other hand, heterogeneous systems are challenging from the application development and resource management perspectives. In this work, we discuss some outcomes of the MANGO project, showing the results of the execution of real applications on an emulated deeply heterogeneous system for HPC. Moreover, we assess a proposed resource allocation policy aimed at identifying a priori the best resource allocation options for a starting application.
Current integrated circuits exhibit an impressive and increasing power density. In this scenario, thermal modelling plays a key role in the design of next-generation cooling and thermal management solutions. However, extending existing thermal models, or designing new ones to account for new cooling solutions, requires parameter identification as well as a validation phase to ensure the correctness of the results. In this paper, we propose a flexible solution to the validation issue, in the form of a hardware platform based on a Thermal Test Chip (TTC). The proposed platform makes it possible to test a heat dissipation solution under realistic conditions, including fast spatial and temporal power gradients as well as hot spots, while collecting a temperature map of the active silicon layer. The combined power/temperature map is the key input to validate a thermal model, in both the steady-state and transient case. This paper presents the current development of the platform, and provides a first validation dataset for the case of a commercial heat sink.
Current integrated circuits exhibit an impressive and increasing component density, hence an alarming power density. Future devices will require breakthroughs in hardware power dissipation strategies and software active thermal management to operate reliably and maximise performance. In this scenario, thermal modelling plays a key role in the design of next-generation cooling and thermal management solutions. However, extending existing thermal models, or designing new ones to account for new cooling solutions, requires parameter identification as well as a validation phase to ensure the correctness of the results. In this paper, we propose a flexible solution to the validation issue, in the form of a hardware platform based on a Thermal Test Chip (TTC). The proposed platform makes it possible to test a heat dissipation solution under realistic conditions, including fast spatial and temporal power gradients as well as hot spots, while collecting a temperature map of the active silicon layer. The combined power/temperature map is the key input to validate a thermal model, in both the steady-state and transient case. This paper presents the current development of the platform, and provides a first validation dataset for the case of a commercial heat sink.
The probabilistic approaches for real-time systems are based on the estimation of the probabilistic-WCET distribution. Such estimation is naturally subject to errors, caused by both systematic and estimation uncertainties. To address this problem, statistical tests are applied to the resulting distribution to check whether such errors affect the validity of the output. In this paper, we show that the reliability of these tests depends on the statistical power, which must be estimated in order to select the proper sample size. This a priori analysis is required to obtain a reliable probabilistic-WCET result.
In battery-powered embedded systems, energy budget management is a critical aspect. For systems using unreliable power sources, e.g. solar panels, continuous system operation is a challenging requirement. In such scenarios, effective management policies must rely on accurate energy estimations. In this paper we propose a measurement-based probabilistic approach to address worst-case energy consumption (WCEC) estimation, coupled with a job admission algorithm for energy-constrained task scheduling. The overall goal is to demonstrate how the proposed approach can introduce benefits also in mission-critical systems, where unsafe energy budget estimations cannot be tolerated.
Modern multi-core architectures require a complex dynamic thermal management process: the presence of multiple cores makes temperature estimation more challenging, due to the heat-up induced by adjacent cores. Run-time estimation can benefit from floorplan information to better estimate the thermal characteristics of each core, and transient information can help the system to predict and avoid thermal alarms, increasing system reliability and lifetime. In this context, the present work aims at defining a novel on-line measurement methodology based on neighbor nodes and transient information, providing a metric to be employed in thermal-aware designs, either at design-time to characterize application and platform from a thermal viewpoint, or at run-time in conjunction with the Dynamic Thermal Management subsystem. The proposed methodology captures floorplan-induced thermal behavior that would otherwise go unrecognized, and it also shows how a non-floorplan-aware methodology can exhibit up to a 30% error in the estimate of the thermal status.
Current hardware platforms provide applications with an extended set of physical resources, as well as a well-defined set of power and performance optimization mechanisms (i.e., hardware control knobs). The software stack, meanwhile, is responsible for taking direct advantage of these resources, in order to meet application functional and non-functional requirements. The support of the Operating System (OS) is of utmost importance, since it gives the opportunity to optimize the system as a whole. The purpose of this chapter is to introduce the reader to the challenge of managing physical and logical resources in complex multi- and many-core architectures, focusing on emerging MPSoC platforms.
The execution of multiple multimedia applications on a modern Multi-Processor System-on-Chip (MPSoC) raises the need for a Run-Time Management (RTM) layer to match hardware and application needs. This paper proposes a novel model for the run-time resource allocation problem taking into account both architectural and application standpoints. Our model considers clustered and non-clustered resources, migration and reconfiguration overheads, quality of service (QoS) and application priorities. A near-optimal solution is computed focusing on spatial and computational constraints. Experiments reveal that our first implementation is able to manage tens of applications with an overhead of only a few milliseconds and a memory footprint of less than one hundred KB, making it suitable for use on real systems.
The paper aims at defining a methodology for the optimization of the switching power related to processor-to-memory communication on system-level buses. First, a methodology to profile the switching activity related to system-level buses has been defined, based on the tracing of benchmark programs running on the Sun SPARC V8 architecture. The bus traces have been analyzed to identify temporal correlations between consecutive patterns. Second, a framework has been set up for the design of high-performance encoder/decoder architectures to reduce the transition activity of system-level buses. Novel bus encoding schemes have been proposed, whose performance has been compared with the most widely adopted power-oriented encodings. The experimental results have shown that the proposed encoding techniques provide an average reduction in transition activity of up to 74.11% over binary encoding for instruction address streams. The results indicate the suitability of the proposed techniques for high-capacitance wide buses, for which the power saving due to the transition activity reduction is not offset by the extra power dissipation introduced in the system by the encoding/decoding logic.
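To make the transition-activity metric concrete, the sketch below counts per-cycle Hamming distances on a word stream and compares plain binary transmission with the classic bus-invert encoding, used here only as a well-known baseline; the encoding schemes proposed in the paper are different and tailored to instruction-address streams:

```python
# Hedged sketch: bus transition activity (Hamming distance between consecutive
# bus values), with and without bus-invert encoding on an 8-bit bus.
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1

def hamming(a, b):
    return bin(a ^ b).count("1")

def transitions_binary(stream):
    prev, total = 0, 0
    for word in stream:
        total += hamming(prev, word)
        prev = word
    return total

def transitions_bus_invert(stream):
    prev_bus, prev_inv, total = 0, 0, 0
    for word in stream:
        inv = 1 if hamming(prev_bus, word) > WIDTH // 2 else 0
        bus = word ^ MASK if inv else word
        total += hamming(prev_bus, bus) + (inv != prev_inv)   # data lines + invert line
        prev_bus, prev_inv = bus, inv
    return total

random.seed(0)
stream = [random.getrandbits(WIDTH) for _ in range(10_000)]
print("binary transitions:    ", transitions_binary(stream))
print("bus-invert transitions:", transitions_bus_invert(stream))
```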
Server consolidation leverages hardware virtualization to reduce the operational cost of data centers through the intelligent placement of existing workloads. This work proposes a consolidation model that considers power, performance and reliability aspects simultaneously. There are two main innovative contributions in the model, focused on performance and reliability requirements. The first contribution is the possibility to guarantee average response time constraints for multi-tier workloads. The second contribution is the possibility to model active/active clusters of servers, with enough spare capacity on the fail-over servers to manage the load of the failed ones. At the heart of the proposal is a non-linear optimization model that has been linearized using two different exact techniques. Moreover, a heuristic method that allows for the fast computation of near-optimal solutions has been developed and validated.
In the current multi-core scenario, Networks-on-Chip (NoCs) represent a suitable choice to face the increasing communication and performance requirements, while however introducing additional design challenges to already complex architectures. In this perspective, there is a need for flexible and configurable virtual platforms for early-stage design exploration. We present the Heterogeneous Architectures and Networks-on-Chip Design and Simulation framework for large-scale high-performance computer simulation, integrating performance, power, thermal and reliability metrics under a unique methodology. Moreover, NoC exploration is possible from both reliability/performance and thermal/performance trade-off perspectives.
The increasing complexity of multi-core architectures demands a comprehensive evaluation of different solutions and alternatives at every stage of the design process, considering different aspects at the same time. Simulation frameworks are attractive tools to fulfil this requirement, due to their flexibility. Nevertheless, state-of-the-art simulation frameworks lack a joint analysis of power, performance, temperature profile and reliability projection at system level, focusing only on a specific aspect. This paper presents a comprehensive estimation framework that jointly exploits these design metrics at system level, considering processing cores, interconnect design and storage elements. We describe the framework in detail, and provide a set of experiments that highlight its capabilities and flexibility, focusing on temperature and reliability analysis of multi-core architectures supported by Network-on-Chip interconnect.
Multi-core architectures are a promising paradigm to exploit the huge integration density reached by high-performance systems. However, integration density and technology scaling are causing undesirable operating temperatures, with a net impact in terms of reduced reliability and increased cooling costs. Dynamic Thermal Management (DTM) approaches have been proposed in the literature to control the temperature profile at run-time, while design-time approaches generally provide floorplan-driven solutions to cope with temperature constraints. Nevertheless, a suitable approach to jointly collect performance, thermal and reliability metrics has not yet been proposed. This work presents a novel methodology to jointly optimize the temperature/performance trade-off in reliable high-performance parallel architectures with security constraints achieved by workload physical isolation on each core. The proposed methodology is based on a linear formal model relating temperature and duty-cycle on one side, and performance and duty-cycle on the other side. Extensive experimental results on real-world use-case scenarios show the quality of the proposed model, which is suitable for design-time system-wide optimization in conjunction with DTM techniques.
Networks-on-Chip (NoCs) are a key component of new many-core architectures, from both the performance and reliability standpoints. Unfortunately, the continuous scaling of CMOS technology poses severe concerns regarding failure mechanisms such as NBTI and stress-migration. Process variation makes the scenario harder, decreasing device lifetime and performance predictability during chip fabrication. This paper presents a novel cooperative sensor-wise methodology to reduce NBTI degradation in the network-on-chip (NoC) virtual channel (VC) buffers, considering process variation effects as well. The changes introduced to the reference NoC model exhibit an area overhead below 4%. Experimental validation is obtained using a cycle-accurate simulator considering both real and synthetic traffic patterns. We compare our methodology to the best sensor-less round-robin approach, used as the reference model. The proposed sensor-wise strategy achieves up to 26.6% and 18.9% activity factor improvement over the reference policy on synthetic and real traffic patterns, respectively. Moreover, a net NBTI Vth saving of up to 54.2% is shown against the baseline NoC that does not account for NBTI.
Networks-on-Chip (NoCs) play a central role in determining performance and reliability in current and future multi-core architectures. The continuous scaling of CMOS technology enables the widespread adoption of multi-core architectures but, unfortunately, poses severe concerns regarding failures. Process variation (PV) is worsening the scenario, decreasing device lifetime and performance predictability during chip fabrication. This paper proposes two solutions exploiting power gating to cope with NBTI effects in NoC buffers. The techniques are evaluated with respect to a variable number of virtual channels (VCs), in the presence of process variation. Moreover, the power-gating delay overhead is accounted for. Experiments reveal a net NBTI Vth saving of up to 54.2% against the baseline NoC, with an area overhead below 5%.
Considering the energy-cap problem in battery-powered devices, DVFS and power gating represent the de-facto state-of-the-art actuators. However, the limited margin available to reduce the operating voltage and the impossibility of massively integrating such actuators on-chip, together with their actuation latency, force a revision of such design methodologies. We present an all-digital architecture and a design methodology that can effectively manage the energy-cap problem for CPUs and accelerators. Two quality metrics are put forward to capture the performance loss and the energy budget violations. We employed a vector processor supporting 4 hardware threads as a representative use case. Results show an average performance loss and energy-cap violations limited to 2.9% and 3.8%, respectively. Compared to solutions employing the DFS actuator, our all-digital architecture improves the energy-cap violations by 3x while maintaining a similar performance loss.
Temperature profile optimization is one of the most relevant and challenging problems in modern multi-core architectures. Several Dynamic Thermal Management (DTM) approaches have been proposed in the literature, and run-time policies have been designed to direct the allocation of tasks according to temperature constraints. Thermal coupling is recognized to play a role of paramount importance in determining the thermal envelope of the processor; nevertheless, several works in the literature do not directly take this aspect into account when determining the status of the system at run-time. Without this information, the DTM design is not able to fully account for the role that each core has in the system-level temperature, thus neglecting important information for temperature-constrained workload allocation.
The purpose of this work is to provide a novel mechanism to better support DTM policies, focusing on the estimation of the impact of thermal coupling in determining the appropriate status from a thermal standpoint. The presented approach is based on two stages: an off-line characterization of the target architecture estimates thermal coupling coefficients, which are then used at run-time for proper DTM decisions.
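The off-line characterization stage can be pictured as the identification of a steady-state linear coupling model, as in the hedged sketch below (the coupling matrix, noise level and number of experiments are synthetic; the characterization procedure of the paper may differ):

```python
# Hedged sketch: assume a steady-state linear model T = A @ P + T_amb, where
# A[i][j] (in °C/W) is the effect of core j's power on core i's temperature.
# A is identified by least squares from (power vector, temperature vector)
# experiments; here the "measurements" are generated from a known matrix.
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[2.0, 0.6, 0.2, 0.1],
                   [0.6, 2.0, 0.1, 0.2],
                   [0.2, 0.1, 2.0, 0.6],
                   [0.1, 0.2, 0.6, 2.0]])       # synthetic 4-core coupling matrix
T_AMB = 45.0

P = rng.uniform(0.0, 15.0, size=(40, 4))        # 40 power-mapping experiments [W]
T = P @ A_true.T + T_AMB + rng.normal(0, 0.3, (40, 4))   # measured temperatures [°C]

A_est, *_ = np.linalg.lstsq(P, T - T_AMB, rcond=None)
print(np.round(A_est.T, 2))                     # recovered coupling coefficients
```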
An embodiment of a method and system is provided for managing both the system resources and the power consumption of a computer system, involving different layers of the system: an application layer, a middle layer where the operating system is running and where a power manager is provided, and a hardware layer used for communicating with the hardware devices. Hardware devices have different operating modes which provide distinct trade-offs between performance and power consumption. Performance requirements defined at the level of the application layer, as well as the device power status of the system, set constraints on the system resources. The middle-layer power manager may be in charge of retrieving performance requirements in the form of constraints set on system parameters, aggregating these constraints opportunely and communicating the corresponding information to the device drivers, which may then select the best operating mode.
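A minimal sketch of the constraint aggregation and operating-mode selection idea follows; the mode table, the performance units and the aggregation rule (taking the most demanding minimum-performance constraint) are illustrative assumptions, not the claimed embodiment:

```python
# Hedged sketch: applications post minimum-performance constraints, the power
# manager aggregates them by keeping the most demanding one, and the driver
# picks the lowest-power operating mode that still satisfies the aggregate.

OPERATING_MODES = [            # (name, performance, power [mW]) -- illustrative
    ("low",    20,  50),
    ("medium", 60, 180),
    ("high",  100, 400),
]

def aggregate(constraints):
    """Most demanding minimum-performance requirement wins."""
    return max(constraints, default=0)

def select_mode(constraints):
    required = aggregate(constraints)
    feasible = [m for m in OPERATING_MODES if m[1] >= required]
    if not feasible:
        return OPERATING_MODES[-1]              # fall back to the fastest mode
    return min(feasible, key=lambda m: m[2])    # cheapest mode meeting the constraint

print(select_mode([15, 55]))    # -> ('medium', 60, 180)
print(select_mode([]))          # -> ('low', 20, 50)
```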
The interest in probabilistic real-time is increasing, in response to the inability of traditional static WCET analysis methods to handle applications running on complex systems, like multi/many-cores and COTS platforms. However, the probabilistic theory is still immature and, furthermore, it requires strong guarantees on the timing traces in order to provide safe probabilistic-WCET estimations. These requirements can be verified with appropriate statistical tests, as described in this paper, and tested with synthetic and realistic sources to assess their ability to detect unreliable results. In this work, we also identify the challenges and the problems of using statistical-test-based procedures for probabilistic real-time computing.
Networks-on-Chip (NoCs) are considered the prominent interconnection solution for current and future many-core architectures. While power is a key concern to deal with during architectural design, power-performance trade-off exploitation requires suitable analytical models to highlight the relations between the actuators and such optimization metrics. This paper presents a model of the dynamic relation between the frequency of a NoC router and its performance, to be used for the design of run-time Dynamic Voltage and Frequency Scaling (DVFS) schemes capable of optimizing the power consumption of a NoC. The model has been obtained starting from both physical considerations on the NoC routers and identification from traffic data collected using a cycle-accurate simulator. Experimental results show that the obtained model can explain the dependence of a router's congestion on its operating frequency, allowing it to be used as a starting point to develop power-performance optimal control policies.
In the context of cache-coherent Networks-on-Chip (NoCs), fully adaptive routing algorithms guarantee maximum flexibility to implement power-performance, fault-tolerant, thermal and Quality of Service (QoS) management policies. However, to avoid deadlock at both the protocol and network level, their implementation imposes a relevant resource increase. Moreover, their performance is inferior to that of deterministic and partially adaptive schemes, mainly due to the additional constraints imposed on the virtual channel (VC) re-use policy. This work proposes a novel flow control scheme to improve the performance of fully adaptive routing algorithms by allowing an aggressive reuse of VCs in the presence of both long and short packets. Our proposal works by splitting long packets into multiple chunks and by reallocating the VCs to the chunks rather than to the entire packet. By carefully sizing each chunk to fit the available space in the reallocated, possibly non-empty, VC, we avoid deadlocks while increasing NoC utilization and performance. Experimental results show that our solution offers a 23.8% increase, on average, in the saturation point when compared to the best state-of-the-art flow control scheme for fully adaptive routing algorithms. Moreover, our flow control scheme offers similar or better performance than the XY routing algorithm with the same number of resources, while also ensuring superior flexibility in the definition of the routing function.
The increasing functional and nonfunctional requirements of real-time applications, the advent of mixed-criticality computing, and the necessity of reducing costs are leading to an increased interest in employing COTS hardware in real-time domains. In this scenario, the Linux kernel is emerging as a valuable solution on the software side, thanks to its rich support for hardware devices and peripherals, along with a well-established programming environment. However, Linux has been developed as a general-purpose operating system, and several approaches have been proposed to introduce actual real-time capabilities into the kernel. Among these, the PREEMPT_RT patch, developed by the kernel maintainers, has the goal of increasing the predictability and reducing the latencies of the kernel by directly modifying the existing kernel code. This article aims at providing a survey of the state-of-the-art approaches for building real-time Linux-based systems, with a focus on PREEMPT_RT, its evolution, and the challenges that should be addressed in order to move PREEMPT_RT one step ahead. Finally, we present some applications and use cases that have already benefited from the introduction of this patch.
The transition to Exascale computing is going to be characterised by an increased range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards enabling access to its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration, moving from the traditional CPU-only organization, to the CPU plus GPU which currently dominates the Green500, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].
Measurement-Based Probabilistic Timing Analysis, a probabilistic real-time computing method, is based on the Extreme Value Theory (EVT), a statistical theory here applied to Worst-Case Execution Time analysis on real-time embedded systems. The output of EVT is a statistical distribution, in the form of a Generalized Extreme Value distribution or a Generalized Pareto distribution. Their cumulative distribution functions can asymptotically assume one of three possible forms: light, exponential or heavy tail. Recently, several works have proposed to upper-bound the light-tail distributions with their exponential version. In this paper, we show that this assumption is valid only under certain conditions and that it is often misinterpreted. This leads to unsafe estimations of the worst-case execution time, which cannot be accepted in applications targeting safety-critical embedded systems.
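For reference, in standard EVT notation (not specific to this paper), the GEV cumulative distribution function and the correspondence between its shape parameter and the three tail forms mentioned above are:

```latex
\[
  G(x;\mu,\sigma,\xi) =
  \begin{cases}
    \exp\!\left[-\left(1+\xi\,\frac{x-\mu}{\sigma}\right)^{-1/\xi}\right], & \xi \neq 0,\ 1+\xi\,\frac{x-\mu}{\sigma} > 0,\\[4pt]
    \exp\!\left[-\exp\!\left(-\frac{x-\mu}{\sigma}\right)\right], & \xi = 0,
  \end{cases}
\]
\[
  \xi < 0:\ \text{light tail (Weibull, bounded support)},\qquad
  \xi = 0:\ \text{exponential tail (Gumbel)},\qquad
  \xi > 0:\ \text{heavy tail (Fr\'echet)}.
\]
```

The upper-bounding question discussed in the paper concerns replacing the light-tail case (negative shape) with the exponential one (zero shape).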
Network-on-Chip (NoC) represents a flexible and scalable interconnection candidate for current and future multi-cores. In such a scenario, power represents a major design obstacle, requiring accurate early-stage estimation for both cores and NoCs. In this perspective, Dynamic Frequency Scaling (DFS) techniques have been proposed as a flexible and scalable way to optimize the power-performance trade-off. However, there is a lack of tools that allow for an early-stage evaluation of different DFS solutions as well as asynchronous NoCs. This work proposes a new cycle-accurate simulation framework supporting asynchronous NoC design, also allowing the assessment of heterogeneous and dynamic frequency schemes for NoC routers.
The multiprocessor revolution allows different processors to access their local memory controller and the corresponding RAM portion, while the possibility for each processor to access the remote memory of another processor enables the so-called Non-Uniform Memory Access (NUMA) multiprocessor paradigm. This paper discusses a control-inspired iterative algorithm to allocate the memory requests of different applications in NUMA systems, trying to maximize RAM utilization under the locality requirement. The proposed method is tested in two case studies.
Silicon technology continues to scale down following Moore's law. Device variability increases due to a loss of controllability during silicon chip fabrication. Current methodologies based on error detection and thread re-execution (roll back) may not be enough when the number of errors increases and reaches a threshold. This dynamic scenario can be very negative if we are executing programs in HPC systems where a correct, accurate and time-constrained solution is expected. The objective of the paper is to show preliminary results of the Barbeque OpenSource Project (BOSP) and its potential use in HPC systems. The BOSP framework is the core of a highly modular and extensible run-time resource manager which provides support for an easy integration and management of multiple applications competing for the usage of one (or more) shared MIMD many-core computing devices.
Power consumption is a critical consideration in high-performance computing systems and it is becoming the limiting factor to build and operate Petascale and Exascale systems. When studying the power consumption of existing systems running HPC workloads, we find that power, energy and performance are closely related, which leads to the possibility to optimize energy consumption without sacrificing (much or at all) the performance. In this paper, we propose an HPC system running with a GNU/Linux OS and a Real-Time Resource Manager (RTRM) that is aware of and monitors the health of the platform. On the system, an application for disaster management runs. The application can run with different QoS levels depending on the situation. We define two main situations. In the first, normal execution, there is no risk of a disaster, even though we still have to run the system to look ahead in the near future in case the situation changes suddenly. In the second scenario, the possibility of a disaster is very high; then the allocation of more resources for improving the precision and supporting the human decision has to be taken into account. The paper shows that, at design time, it is possible to describe different optimal points that are going to be used at runtime by the RTOS together with the application. This environment helps the system, which must run 24/7, to save energy with the trade-off of losing precision. The paper shows a model execution which can improve the precision of results by 65% on average by increasing the number of iterations from 1e3 to 1e4. This also produces an execution time one order of magnitude longer, which leads to the need to use a multi-node solution. The optimal trade-off between precision and execution time is computed by the RTOS with a time overhead of less than 10% with respect to a native execution.
While technology scaling allows more cores to be integrated in the same chip, the complexity of current designs requires accurate and fast techniques to explore different trade-offs. Moreover, the increased power densities in current architectures highlight thermal issues as a first-class design metric to be addressed. At the same time, the need for accurate models of the exploited actuators is of paramount importance, since their overheads can shadow the benefits of the proposed methodologies. This paper proposes a complete simulation framework for the assessment of run-time policies for thermal-performance and power-performance trade-off optimization, with two main improvements over the state of the art. First, it accurately models Dynamic Voltage and Frequency Scaling (DVFS) modules for both cores and NoC routers, as well as a complete Globally Asynchronous Locally Synchronous (GALS) design paradigm and power-gating support for the crossbar and buffers in the NoC. Second, it accounts for the chip thermal dynamics as well as the power and performance overheads of the actuators.
The increasing pervasiveness of mobile and embedded devices (IoT/Edge), combined with the access to Cloud infrastructures, makes it possible to build scalable distributed systems, characterized by a multi-dimensional architecture. The overall picture is a massive collection of computing devices, characterized by very heterogeneous levels of performance and power consumption, which we could exploit to reduce the need to access Cloud computing resources. This would play a key role, especially for emerging use cases, where huge amounts of data are generated. However, the deployment of distributed applications on such heterogeneous infrastructures requires suitable management layers. This position paper aims at: (a) proposing a fully-distributed, cooperative, dynamic and multi-layered architecture, capable of integrating different computing paradigms; (b) identifying a possible solution to manage workloads at run-time in a resource-continuity perspective, through an analysis of the open research challenges.
Network-on-Chip (NoC) is a flexible and scalable solution to interconnect multi-cores, with a strong influence on the performance of the whole chip. The on-chip network also affects the overall power consumption, thus requiring accurate early-stage estimation and optimization methodologies. In this scenario, the Dynamic Voltage and Frequency Scaling (DVFS) technique has been proposed both for CPUs and NoCs. The promise is to be a flexible and scalable way to jointly optimize power and performance, addressing both static and dynamic power sources. Since simulation is the de-facto prime solution to explore novel multi-core architectures, a reliable full-system analysis requires integrating into the toolchain accurate timing and power models for the DVFS block and for the resynchronization logic between different Voltage and Frequency Islands (VFIs). In this way, a more accurate validation of novel optimization methodologies which exploit such an actuator is possible, since both architectural and actuator overheads are considered at the same time. This work proposes a complete cycle-accurate framework for multi-core design supporting Globally Asynchronous Locally Synchronous (GALS) NoC design and DVFS actuators for the NoC. Furthermore, static and dynamic frequency assignment is possible with or without the use of the voltage regulator. The proposed framework is built on an accurate analytical timing model and SPICE-based power measurements, providing accurate estimates of both the timing and power overheads of the power control mechanisms.
Networks-on-Chip (NoCs) are considered a viable solution to fully exploit the computational power of multi- and many-cores, but their non-negligible power consumption requires ad-hoc power-performance design methodologies. In this perspective, several proposals have exploited the possibility to dynamically tune voltage and frequency for the interconnect, taking cues from traditional CPU-based power management solutions. However, the impact of the actuators, i.e. the limited range of frequencies for a PLL (Phase-Locked Loop) or the time to increase voltage and frequency for a Dynamic Voltage and Frequency Scaling (DVFS) module, is often not carefully accounted for, thus overestimating the benefits. This paper presents a control-based methodology for NoC power-performance optimization exploiting Dynamic Frequency Scaling (DFS). Both the timing and power overheads of the actuators are considered, thanks to an ad-hoc simulation framework. Moreover, the proposed methodology also allows for user and/or OS interactions to switch between different high-level power-performance modes, i.e. to trigger performance-oriented or power-saving system behaviours. The experimental validation considered a 16-core architecture, comparing our proposal with different settings of threshold-based policies. We achieved a speedup of up to 3x in execution time and a reduction of up to 33.17% in the power*time product against the best threshold-based policy. Moreover, our best control-based scheme provides an average power-performance product improvement of 16.50% and 34.79% against the best and the second-best considered threshold-based policy settings.
Router buffer design and management strongly influence the energy, area and performance of on-chip networks, hence it is crucial to encompass all of these aspects in the design process. At the same time, the NoC design cannot disregard the prevention of network-level and protocol-level deadlocks by devoting ad-hoc buffer resources to that purpose. In Chip Multiprocessors (CMPs), the coherence protocol usually requires different virtual networks (VNETs) to avoid deadlocks. Moreover, VNET utilization is highly unbalanced and there is no way to share buffers between VNETs due to the need to isolate different traffic types. This paper proposes CUTBUF, a novel NoC router architecture to dynamically assign VCs to VNETs depending on the actual VNET load, to significantly reduce the number of physical buffers in routers, thus saving area and power without decreasing NoC performance. Moreover, CUTBUF allows reusing the same buffer for different traffic types while ensuring that the optimized NoC is deadlock-free at both the network and protocol level. In this perspective, all the VCs are considered spare queues not statically assigned to a specific VNET, and the coherence protocol only imposes a minimum number of queues to be implemented. Synthetic applications as well as real benchmarks have been used to validate CUTBUF, considering architectures ranging from 16 up to 48 cores. Moreover, a complete RTL router has been designed to explore area and power overheads. Results highlight how CUTBUF can reduce router buffers by up to 33% with a 2% performance degradation, a 5% operating frequency decrease, and area and power savings of up to 30.6% and 30.7%, respectively. Conversely, the flexibility of the proposed architecture improves the performance of the baseline NoC router by 23.8% when the same number of buffers is used.
The rapid advance of computer architectures towards more powerful, but also more complex platforms, has the side effect of making the timing analysis of applications a challenging task (Cullmann et al., 2010). The increasing demand for computational power in cyber-physical systems (CPS) is getting hard to fulfill, if we consider typical real-time constrained applications. Time constraints in CPS are often mandatory requirements, i.e. they must be satisfied in any condition because of the mission-critical system purpose. The satisfaction of these constraints is traditionally demonstrated using well-established static analyses, providing the Worst-Case Execution Time (WCET) (Wilhelm et al., 2008). However, the increasing complexity of computing architectures – such as multi-core, multi-level caches, complex pipelines, etc. (Berg, Engblom, & Wilhelm, 2004) – makes these analyses computationally unaffordable or forces them into overly pessimistic approximations. The problem grows when dealing with Commercial-Off-The-Shelf (COTS) hardware (Dasari, Akesson, Nélis, Awan, & Petters, 2013) and complex operating systems (Reghenzani, Massari, & Fornaciari, 2017).
Probabilistic approaches for hard real-time systems have been proposed as a possible solution to address this complexity increase (Bernat, Colin, & Petters, 2002). In particular, the Measurement-Based Probabilistic Timing Analysis (MBPTA) (Cucu-Grosjean et al., 2012) is a probabilistic analysis branch for real-time systems that estimates the WCET directly from the observed execution times of real-time tasks. The time samples are collected across the application input domain and the WCET is provided in probabilistic terms, as the probabilistic-WCET (pWCET), i.e. a WCET with a probability of observing higher execution times. The statistical theory at the basis of the WCET estimation is the Extreme Value Theory (EVT) (E. Castillo, Hadi, Balakrishnan, & Sarabia, 2005) (De Haan & Ferreira, 2007), typically used in natural disaster risk evaluation. However, to obtain a safe pWCET estimation, the execution time traces must fulfill certain requirements. In particular, MBPTA requires the time measurements to be (Kosmidis, 2017): (1) independent and identically distributed, and (2) representative of all worst-case latencies. The first requirement comes from the EVT; it can be checked with suitable statistical tests and can be relaxed under some circumstances (Santinelli, Guet, & Morio, 2017), while the latter relates to input representativity and to the system (hardware/software) properties. Both requirements are necessary to obtain a safe, i.e. non-underestimated, pWCET.
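As a purely illustrative example of the estimation step (synthetic data; this is not the chronovise API, and the i.i.d. and representativity checks that MBPTA mandates are omitted), a block-maxima GEV fit on execution-time samples looks as follows:

```python
# Generic block-maxima sketch: fit a GEV to per-block maxima of execution
# times and read the pWCET as a high quantile of the fitted distribution.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(42)
exec_times = 1.0 + rng.gamma(shape=3.0, scale=0.05, size=10_000)   # synthetic measurements [ms]

BLOCK = 50
maxima = exec_times[: len(exec_times) // BLOCK * BLOCK].reshape(-1, BLOCK).max(axis=1)

c, loc, scale = genextreme.fit(maxima)          # SciPy's shape c corresponds to -xi
for p in (1e-3, 1e-6, 1e-9):                    # exceedance probability per block
    pwcet = genextreme.ppf(1.0 - p, c, loc=loc, scale=scale)
    print(f"pWCET at exceedance {p:.0e}: {pwcet:.3f} ms")
```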
The chronovise framework is an open-source software package aiming at standardizing the MBPTA flow, integrating both the estimation and testing phases. The few existing software tools presented in the literature (Lu, Nolte, Kraft, & Norstrom, 2010) (Lesage, Griffin, Soboczenski, Bate, & Davis, 2015) lack source code availability. Moreover, both works include a limited set of features, in addition to a poor maturity level due to the missing integration of the most recent scientific contributions. Another software tool is available as open-source (Abella, 2017), but it is specialized for a variant of the classical MBPTA analysis
called MBPTA-CV (Abella, Padilla, Castillo, & Cazorla, 2017). Our work aims at filling the absence of a stable software package with a well-defined EVT execution flow. The proposed framework supports the Block-Maxima (BM), Peaks-over-Threshold (PoT) and MBPTA-CV EVT approaches, i.e. the currently available methods to estimate the extreme distribution. The output distribution assumes, respectively, the Generalized Extreme Value (GEV) and the Generalized Pareto Distribution (GPD) form. Three estimators are already included, namely the Maximum Likelihood Estimator (MLE) (Bücher & Segers, 2017), the Generalized-MLE (GMLE) (Martins & Stedinger, 2000), and the Probability Weighted Moments (PWM, also called L-moments) (Hosking & Wallis, 1987), as well as two statistical tests: Kolmogorov-Smirnov (Massey, 1951) and the (Modified) Anderson-Darling (Sinclair, Spurr, & Ahmad, 1990). Finally, the implementation of an overall confidence estimation procedure for the results is also available. The provided API allows users to specify or to implement new input generators and input representativity tests.
The chronovise software is in fact presented as a flexible and extensible framework, deployed as a static C++ library. The selection of the C++ language enables the easy implementation of hardware-in-the-loop analyses. The underlying idea of chronovise is to provide a common framework for both researchers and end-users. Even if EVT is a well-known statistical theory, it is continuously evolving and it is still a hot topic in the mathematical community. The application of EVT to real-time computing is immature and still requires several theoretical advances. This has led us to implement this software: enabling the exploitation of an already implemented EVT process, in order to perform experiments on new theories and methods without the need to reimplement the algorithms from scratch. With our framework we want to create a common software base that would increase both the replicability of the experiments and the reliability of the results, which are common issues in research. On the other hand, end-users – i.e. engineers that use the already available algorithms to estimate the pWCET – can just implement the measurement part and use the framework without introducing further changes.
The multi-core revolution pushes to the limit the need for fresh interconnect and memory designs to deliver architectures that can match current and future application requirements. Moreover, the current trend towards multi-node multi-core architectures to further improve system performance imposes the use of hierarchical, and possibly heterogeneous, interconnection subsystems.
In this scenario the Network Interface (NI) controller is a critical component, since it provides an easy, efficient, and effective link between the cores and memory blocks and the interconnect. Moreover, the buffers and the pipeline stages of the NI must be carefully sized so as not to waste valuable on-chip resources.
The thesis explores NI design in multi-core multi-node architectures, with particular emphasis on timing and performance metrics. An NI design developed in RTL Verilog is used to analyze the timing metric and serves as the basis for a cycle-accurate software simulation model that has been integrated into the gem5 multi-core cycle-accurate simulator. Simulation results show the benefit of the proposed NI model from the accuracy viewpoint. In particular, our NI model demonstrates how the baseline gem5 NI model overestimates the performance of the simulated architectures, mainly because it neglects the contention among the cores of the same node when accessing the on-node bus interconnect. Moreover, the presented results highlight the modest impact of the NI, which grows linearly with the number of pipeline stages.
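The following C++ sketch gives a rough idea of how a cycle-level NI pipeline with a configurable number of stages can be modeled; the class and the traffic trace are hypothetical and do not correspond to the thesis code or to the gem5 integration, but they show why the added latency grows linearly with the number of stages.

    // Minimal sketch of a cycle-level NI pipeline model with a configurable
    // number of stages; names and structure are illustrative only.
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <optional>
    #include <vector>

    struct Flit { std::uint64_t injected_at; };

    class NiPipeline {
    public:
        explicit NiPipeline(std::size_t stages) : stages_(stages) {}

        // One simulated clock cycle: shift flits one stage forward and return
        // the flit (if any) that leaves the NI toward the interconnect.
        std::optional<Flit> tick(std::optional<Flit> input) {
            std::optional<Flit> output = stages_.back();
            for (std::size_t s = stages_.size() - 1; s > 0; --s)
                stages_[s] = stages_[s - 1];
            stages_[0] = input;
            return output;
        }

    private:
        std::vector<std::optional<Flit>> stages_;
    };

    int main() {
        NiPipeline ni(3);                 // 3-stage NI: 3 cycles of added latency
        std::deque<Flit> to_inject = {{0}, {1}, {2}};

        for (std::uint64_t cycle = 0; cycle < 10; ++cycle) {
            std::optional<Flit> in;
            if (!to_inject.empty() && to_inject.front().injected_at <= cycle) {
                in = to_inject.front();
                to_inject.pop_front();
            }
            if (auto out = ni.tick(in))
                std::cout << "flit injected at cycle " << out->injected_at
                          << " exits the NI at cycle " << cycle << "\n";
        }
    }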
Side channel attacks are a prominent threat to the security of embedded systems. To perform them, an adversary evaluates the goodness of fit of a set of key-dependent power consumption models against a collection of side channel measurements taken from an actual device, identifying the secret key value as the one yielding the best-fitting model. In this work, we analyze for the first time the microarchitectural components of a 32-bit in-order RISC CPU, showing which of them are accountable for unexpected side channel information leakage. We classify the leakage sources, identifying the data serialization points in the microarchitecture, and provide a set of hints that can be fruitfully exploited to generate implementations resistant against side channel attacks, either by writing or by generating proper assembly code.
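To make the attack flow concrete, the sketch below mimics the described goodness-of-fit evaluation with a Hamming-weight power model and Pearson correlation over a handful of hypothetical measurements. It is a noise-free toy, not the experimental setup of the paper: a real attack correlates against every time sample of many noisy traces.

    // Minimal sketch of correlation-based key recovery with a Hamming-weight
    // power model on a single intermediate byte (illustrative data only).
    #include <bit>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Pearson correlation between two equally sized series.
    double pearson(const std::vector<double>& x, const std::vector<double>& y) {
        double n = x.size(), sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double num = n * sxy - sx * sy;
        double den = std::sqrt(n * sxx - sx * sx) * std::sqrt(n * syy - sy * sy);
        return den == 0 ? 0 : num / den;
    }

    int main() {
        // Known plaintext bytes and the corresponding "measured" power samples.
        // Noise-free leakage for brevity; real traces are noisy and numerous.
        std::vector<std::uint8_t> plaintexts = {0x3a, 0x7f, 0x12, 0xc4,
                                                0x55, 0x90, 0x0e, 0xab};
        std::uint8_t secret_key = 0x5c;            // unknown to the attacker
        std::vector<double> traces;
        for (auto p : plaintexts)                  // leakage ~ HW(p XOR key)
            traces.push_back(std::popcount(static_cast<unsigned>(p ^ secret_key)));

        // Try every key hypothesis and keep the best-fitting power model.
        int best_key = -1;
        double best_corr = -1.0;
        for (int k = 0; k < 256; ++k) {
            std::vector<double> model;
            for (auto p : plaintexts)
                model.push_back(std::popcount(static_cast<unsigned>(p ^ k)));
            double c = std::abs(pearson(model, traces));
            if (c > best_corr) { best_corr = c; best_key = k; }
        }
        std::cout << "recovered key guess: 0x" << std::hex << best_key << "\n";
    }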
An increasing number of high-performance applications demand some form of time predictability, in particular in scenarios where correctness depends on both performance and timing requirements, and the failure to meet either of them is critical. Consequently, a more predictable HPC system is required, particularly for an emerging class of adaptive real-time HPC applications. Here we present our runtime approach, which produces results in a predictable time with a minimized allocation of hardware resources. The paper describes the advantages in terms of execution-time reliability and the trade-offs in terms of power/energy consumption and system temperature compared with the standard GNU/Linux governors.
The increasing complexity of computing architectures is pushing for novel Dynamic Thermal Management (DTM) techniques and, accordingly, for more accurate power and thermal models. In this work, we propose a thermal controller based on a constrained extremum-seeking algorithm, enabling resource allocation optimization under specific thermal constraints. This approach comes with several advantages. First, the controller does not require any model of the system, removing the need for a complex and potentially imprecise estimation phase. Second, it allows the control of derived measurements. We show how this may positively impact CPU reliability.
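A minimal, hypothetical sketch of a constrained extremum-seeking loop is shown below: a sinusoidal dither probes an unknown performance cost, the demodulated signal approximates its gradient, and the update is projected back whenever a simple thermal model exceeds the constraint. The plant, gains, and constraint handling are illustrative and do not reproduce the controller of the paper.

    // Toy constrained extremum-seeking loop (model-free from the controller's
    // point of view; the plant functions exist only to close the simulation).
    #include <algorithm>
    #include <cmath>
    #include <iostream>

    // Hypothetical plant: performance cost (to minimize) and chip temperature,
    // both unknown to the controller, as functions of the knob u (e.g. a power budget).
    double perf_cost(double u)   { return (u - 0.7) * (u - 0.7); }
    double temperature(double u) { return 45.0 + 40.0 * u; }

    int main() {
        const double a = 0.02, omega = 2.0;      // dither amplitude and frequency
        const double gain = 0.1, dt = 0.01;      // adaptation gain and time step
        const double t_max = 70.0;               // thermal constraint (degC)

        double u_hat = 0.2;                      // initial knob estimate
        double y_lp = perf_cost(u_hat);          // low-pass (washout) state

        for (int k = 0; k < 10000; ++k) {        // 100 simulated seconds
            double t = k * dt;
            double u = u_hat + a * std::sin(omega * t);   // inject the dither
            double y = perf_cost(u);                      // measure the cost

            y_lp += dt * (y - y_lp);                      // washout filter
            double grad_est = (y - y_lp) * (2.0 / a) * std::sin(omega * t);
            u_hat -= gain * grad_est * dt;                // gradient-descent step

            // Constraint handling: project back when the toy thermal model
            // exceeds the limit (its inverse is known here only because the
            // plant is a toy).
            if (temperature(u_hat) > t_max) u_hat = (t_max - 45.0) / 40.0;
            u_hat = std::clamp(u_hat, 0.0, 1.0);
        }
        std::cout << "knob setting: " << u_hat
                  << ", temperature: " << temperature(u_hat) << " degC\n";
    }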
Power consumption is a key metric in the design of computing platforms. In particular, the variety and complexity of current applications have fueled an increasing number of run-time power-aware optimization solutions that dynamically trade computational performance for power consumption. In this scenario, online power monitoring methodologies are the core of any power-aware optimization, since an incorrect assessment of the run-time power consumption prevents any effective actuation. This work proposes PowerTap, an all-digital power modeling methodology for designing online power monitoring solutions. In contrast with state-of-the-art solutions, PowerTap adds domain-specific constraints to the data-driven power modeling problem. PowerTap identifies the power model iteratively to balance the accuracy error of the power estimates and the complexity of the final monitoring infrastructure. As a representative use case, we employed a complex hardware multi-threaded SIMD processor, also considering different operating clock frequencies. The RTL implementation of the identified power model targeting a Xilinx Artix-7 XC7A200T FPGA highlights an accuracy error within 1.79%, with an area overhead of 9.95% (LUTs) and 3.87% (flip-flops) and an average power overhead of 12.17 mW regardless of the operating conditions, i.e., number of software threads and operating frequency.
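The sketch below illustrates, with purely hypothetical data and thresholds, the kind of iterative identification loop described above: activity signals are greedily added to a linear power model until either the error target is met or the budget of monitored signals, acting as a proxy for hardware cost, is exhausted. It is not the PowerTap algorithm itself.

    // Greedy, budget-constrained identification of a linear power model
    // (illustrative data; the stopping criteria mimic the accuracy/cost trade-off).
    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // One-dimensional least squares of residual r against signal x.
    double fit_coeff(const std::vector<double>& x, const std::vector<double>& r) {
        double num = 0, den = 0;
        for (std::size_t i = 0; i < x.size(); ++i) { num += x[i] * r[i]; den += x[i] * x[i]; }
        return den == 0 ? 0 : num / den;
    }

    double rms(const std::vector<double>& r) {
        double s = 0;
        for (double v : r) s += v * v;
        return std::sqrt(s / r.size());
    }

    int main() {
        // Activity counts of candidate signals (rows: signals, cols: sample windows)
        // and the measured power for each window (hypothetical values).
        std::vector<std::vector<double>> signals = {
            {1, 0, 2, 1, 3, 0}, {4, 5, 1, 0, 2, 3}, {0, 1, 0, 2, 1, 1}};
        std::vector<double> power = {5.1, 5.8, 3.9, 2.8, 6.7, 4.0};

        const double error_target = 0.5;   // maximum tolerated RMS error (W)
        const std::size_t budget = 2;      // maximum number of monitored signals

        std::vector<double> residual = power;
        std::vector<bool> used(signals.size(), false);
        std::size_t selected = 0;

        while (selected < budget && rms(residual) > error_target) {
            // Pick the unused signal that best explains the current residual.
            std::size_t best = signals.size();
            double best_gain = 0, best_coeff = 0;
            for (std::size_t s = 0; s < signals.size(); ++s) {
                if (used[s]) continue;
                double c = fit_coeff(signals[s], residual);
                std::vector<double> tmp = residual;
                for (std::size_t i = 0; i < tmp.size(); ++i) tmp[i] -= c * signals[s][i];
                double gain = rms(residual) - rms(tmp);
                if (gain > best_gain) { best_gain = gain; best = s; best_coeff = c; }
            }
            if (best == signals.size()) break;   // no signal improves the model
            for (std::size_t i = 0; i < residual.size(); ++i)
                residual[i] -= best_coeff * signals[best][i];
            used[best] = true;
            ++selected;
            std::cout << "added signal " << best << ", RMS error now "
                      << rms(residual) << " W\n";
        }
    }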
This paper is part of a long-term research effort on the application of event-based control to the thermal management of high-power, high-density microprocessors. Specifically, in this work we introduce and discuss a purpose-specific event-based realisation of a digital controller for thermal management, which integrates into a scheme that also takes care of the power/performance tradeoff, and we carry out a stability analysis and a preliminary performance analysis. We also present, extending previous research, a comprehensive Modelica library suitable for carrying out control studies in the addressed domain. We finally show experimental results on a modern processor architecture, which also compare our solution to the state of the art, to demonstrate the effectiveness of the proposal.
The Network-on-Chip (NoC) router buffers play an instrumental role in the performance of both the interconnection fabric and the entire multi-/many-core system. Nevertheless, the buffers also constitute the major leakage power consumers in NoC implementations. Traditionally, they are designed to accommodate worst-case traffic scenarios, so they tend to remain idle, or under-utilized, for extended periods of time. The under-utilization of these valuable resources is exemplified when one profiles real application workloads; the generated traffic is bursty in nature, whereby high traffic periods are sporadic and infrequent, in general. The mitigation of the leakage power consumption of NoC buffers via power gating has been explored in the literature, both at coarse (router-level) and fine (buffer-level) granularities. However, power gating at the router granularity is suitable only for low and medium traffic conditions, where the routers have enough opportunities to be powered down. Under high traffic, the sleeping potential rapidly diminishes. Moreover, disabling an entire router greatly affects the NoC functionality and the network connectivity. This article
presents BlackOut, a fine-grained power-gating methodology targeting individual router buffers. The goal is to minimize leakage power consumption, without adversely impacting the system performance. The proposed framework is agnostic of the routing algorithm and the network topology, and it is applicable to any router micro-architecture. Evaluation results obtained using both synthetic traffic patterns and real applications in 64-core systems indicate energy savings of up to 70%, as compared to a baseline NoC, with a near-negligible performance overhead of around 2%. BlackOut is also shown to significantly outperform – by 35%, on average – two current state-of-the-art power-gating solutions, in terms of energy savings.
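As a rough illustration of fine-grained buffer power gating, the following sketch switches a single router buffer off after a break-even number of idle cycles and charges a wake-up latency when traffic returns; the thresholds and the traffic trace are assumptions and do not reproduce the BlackOut decision logic.

    // Toy per-buffer power-gating policy: gate after a break-even idle period,
    // pay a wake-up latency on the next use (all constants are illustrative).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct BufferState {
        bool powered = true;
        int idle_cycles = 0;
        int wakeup_left = 0;
    };

    int main() {
        const int breakeven_idle = 8;   // idle cycles that amortize gating
        const int wakeup_latency = 3;   // cycles to restore the buffer

        // Hypothetical per-cycle "buffer needed" trace: bursty traffic pattern.
        std::vector<int> needed = {1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0};
        BufferState buf;
        int gated_cycles = 0, stall_cycles = 0;

        for (std::size_t cycle = 0; cycle < needed.size(); ++cycle) {
            if (needed[cycle]) {
                buf.idle_cycles = 0;
                if (!buf.powered) {                     // wake the buffer up
                    if (buf.wakeup_left == 0) buf.wakeup_left = wakeup_latency;
                    --buf.wakeup_left;
                    ++stall_cycles;                     // flit waits upstream
                    if (buf.wakeup_left == 0) buf.powered = true;
                }
            } else {
                ++buf.idle_cycles;
                if (buf.powered && buf.idle_cycles >= breakeven_idle)
                    buf.powered = false;                // gate this buffer only
            }
            if (!buf.powered) ++gated_cycles;
        }
        std::cout << "cycles gated: " << gated_cycles
                  << ", cycles stalled by wake-up: " << stall_cycles << "\n";
    }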
The density of modern microprocessors is so high that operating all their units at full power would destroy them by thermal runaway. Hence, thermal control is vital, but at the same time it has to integrate with power/performance management so as not to unduly limit computational speed. In addition, the controller must be simple and computationally light, as millisecond-scale response is required. Finally, since microprocessors face a variety of operating conditions, post-silicon tuning is an issue. Here we present a solution that exploits event-based control and a hardware/software partition to maximize efficiency, lightness, and flexibility. We show experiments on real hardware, evidencing the advantages obtained over the state of the art.
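The following toy sketch conveys the event-based idea: the power cap is adjusted only when the measured temperature leaves a band around the setpoint, so the control code runs on events rather than at every sample. The first-order thermal model and all constants are illustrative placeholders, not the controller presented in the paper.

    // Toy event-based thermal cap regulator over a first-order thermal plant.
    #include <algorithm>
    #include <iostream>

    int main() {
        const double setpoint = 80.0, band = 1.0;   // degC
        const double step = 0.02, dt = 0.01;        // cap adjustment, time step (s)
        double temp = 60.0, power_cap = 1.0;
        int events = 0;

        for (int k = 0; k < 5000; ++k) {            // 50 simulated seconds
            // Crude thermal plant standing in for the real silicon:
            // heating proportional to the cap, cooling toward a 45 degC ambient.
            temp += dt * (25.0 * power_cap - 0.5 * (temp - 45.0));

            if (temp > setpoint + band) {           // event: too hot, tighten cap
                power_cap = std::max(0.1, power_cap - step);
                ++events;
            } else if (temp < setpoint - band) {    // event: headroom, relax cap
                power_cap = std::min(1.0, power_cap + step);
                ++events;
            }                                        // inside the band: do nothing
        }
        std::cout << "final temperature: " << temp << " degC, control events: "
                  << events << " out of 5000 samples\n";
    }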
To sustain performance while facing always tighter power and energy envelopes, High Performance Computing (HPC) is increasingly leveraging heterogeneous architectures. This poses new challenges: to efficiently exploit the available resources, both in terms of hardware and energy, resource management must support a wide range of different heterogeneous devices and programming models that target different application domains. We present a strategy for resource management and programming model support for heterogeneous accelerators for HPC systems with requirements targeting performance, power and predictability. We show how resource management can, in addition to allowing multiple applications to share a set of resources, reduce the burden on the application developer and improve the efficiency of resource allocation.
The increasing pervasiveness of mobile devices, combined with their replacement rate, has led to the disposal of an increasing amount of still-working electronic devices. This work proposes an approach to mitigate this problem by extending the mobile devices' lifetime, integrating them as part of a distributed mobile computing system. Thanks also to the growing computational power of such devices, this paradigm opens up the opportunity to deploy mobile applications in a distributed manner, while keeping energy budget management as a paramount objective. In this work, we built a proof of concept based on the extension of a run-time resource manager to support Android applications. We introduced an energy-aware device selection policy to dispatch the application workload according to both device capabilities and run-time status. Experimental results show that, besides increasing the utilization of the multiple mobile devices available to a single user, an energy-efficient and distributed approach can increase battery duration by between 12% and 36%.
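A minimal sketch of an energy-aware selection policy of this kind is given below: among the devices able to run a task, it picks the one minimizing a cost that combines the estimated energy on that device with its remaining battery and current load. The scoring function and the device data are assumptions, not the policy implemented in the extended resource manager.

    // Toy energy-aware device selection for dispatching a task.
    #include <iostream>
    #include <limits>
    #include <string>
    #include <vector>

    struct Device {
        std::string name;
        double perf;          // relative compute capability (higher is faster)
        double power_w;       // average active power while computing
        double battery_frac;  // remaining battery, 0..1
        double load;          // current utilization, 0..1
    };

    double cost(const Device& d, double task_work) {
        double exec_time = task_work / (d.perf * (1.0 - d.load));  // slower if loaded
        double energy = exec_time * d.power_w;                     // joules
        return energy / d.battery_frac;        // penalize nearly empty batteries
    }

    int main() {
        std::vector<Device> devices = {
            {"old-phone", 1.0, 1.5, 0.9, 0.1},
            {"tablet",    2.5, 3.0, 0.4, 0.5},
            {"new-phone", 3.0, 2.5, 0.2, 0.2}};
        double task_work = 10.0;   // arbitrary work units

        const Device* best = nullptr;
        double best_cost = std::numeric_limits<double>::max();
        for (const auto& d : devices) {
            if (d.battery_frac < 0.05 || d.load > 0.95) continue;  // not eligible
            double c = cost(d, task_work);
            if (c < best_cost) { best_cost = c; best = &d; }
        }
        if (best) std::cout << "dispatching task to " << best->name << "\n";
    }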
The Last Level Cache (LLC) is a key element to improve application performance in multi-cores. To handle the worst case, the main design trend employs tiled architectures with a large LLC organized in banks, which remains underutilized in several realistic scenarios. Our proposal, named DarkCache, aims at properly powering off such unused banks to optimize the Energy-Delay Product (EDP) through an adaptive cache reconfiguration, thus aggressively reducing the leakage energy. The implemented solution is general, and it can recognize and skip the activation of the DarkCache policy for the few strongly memory-intensive applications that actually require the entire LLC. The validation has been carried out on 16- and 64-core architectures, also accounting for two state-of-the-art methodologies. Compared to the baseline solution, DarkCache exhibits a performance overhead within 2% and an average EDP improvement of 32.58% and 36.41% for 16 and 64 cores, respectively. Moreover, DarkCache shows an average EDP gain between 16.15% (16 cores) and 21.05% (64 cores) compared to the best state-of-the-art alternative we evaluated, and it confirms good scalability, since the gain improves with the size of the architecture.
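The sketch below gives a hypothetical, epoch-based flavor of such a reconfiguration: banks whose access count in the last epoch falls below a threshold are powered off, and the policy is skipped entirely when the workload looks strongly memory-intensive. Counters and thresholds are illustrative and do not reproduce the DarkCache policy.

    // Toy epoch-based LLC bank reconfiguration (illustrative counters/thresholds).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // Per-bank access counters collected during the last epoch (hypothetical).
        std::vector<long> bank_accesses = {120000, 90, 45000, 10, 0, 870, 64000, 5};
        const long low_use_threshold = 1000;      // accesses/epoch to keep a bank on
        const double mem_intensive_ratio = 0.75;  // fraction of busy banks that
                                                  // classifies the app as memory bound

        long busy = 0;
        for (long a : bank_accesses)
            if (a >= low_use_threshold) ++busy;

        std::vector<bool> powered(bank_accesses.size(), true);
        if (static_cast<double>(busy) / bank_accesses.size() < mem_intensive_ratio) {
            for (std::size_t b = 0; b < bank_accesses.size(); ++b)
                powered[b] = bank_accesses[b] >= low_use_threshold;
        }   // else: memory-intensive workload, keep the whole LLC active

        for (std::size_t b = 0; b < powered.size(); ++b)
            std::cout << "bank " << b << ": " << (powered[b] ? "on" : "off") << "\n";
    }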