Average energy consumption with proposed power management for microprocessor with different numbers of cores. Manoj et al. (2015), Rountree et al. (2011), Yang et al. (2015), and Zaman et al. (2015) (with minor adaptations such as power management at application-level) for a fair comparison. The rationale for choosing these are as follows: In Manoj et al. (2015), prediction of workload using AutoRegressive Moving Average (ARMA) and a Singular Value Decomposition (SVD)-based VF-level assignment is carried out, which has shown better scalability for future multi-core systems. Machine learning equipped power management is proposed in Zaman et al. (2015), where SVM-based regression for predicting workloads and SVM classifier-based VF-level assignment is employed. The sparse encoding is not implemented, as the data is not as large as that in the original work. A linear regression with offline learning or modeling-based workload prediction and VF-level assignment are utilized in Yang et al. (2015) and Rountree et al. (2011), which is lightweight in nature. Similar resemblances can be observed from other existing works. Figure 5 presents the normalized energy consumption for multi-core system with 2, 4, 8, 16, and 32 cores. In Figure 5, X-axis represents the number of cores on which the benchmark applications are run and the Y-axis represents the normalized energy. In the legend of Figure 5, "Proposed," "Linear," "SVM," and "STM" represent the energy consumptions with proposed technique, linear regression-based power management (Rountree et al. 2011; Yang et al. 2015), SVM (Zaman et al. 2015), and space-time multiplexing (Manoj et al. 2015)-based power management techniques, respectively. For the experimental evaluation of proposed and other power management works, the benchmark applications are randomly assigned to cores. 
The following observations can be made: For a system with a small number of cores (two cores), use of lightweight techniques (such as linear regression-based power management) is beneficial. However, for a large number of cores, proposed power manager has higher performance compared to other techniques. The rationale for these differences can be mentioned as follows:
… 
Application and Thermal-reliability-aware Reinforcement
Learning Based Multi-core Power Management
SAI MANOJ PUDUKOTAI DINAKARRAO, George Mason University, USA
ARUN JOSEPH and ANAND HARIDASS, IBM Systems, India
MUHAMMAD SHAFIQUE, Vienna University of Technology (TU Wien), Austria
JÖRG HENKEL, Karlsruhe Institute of Technology, Germany
HOUMAN HOMAYOUN, University of California, Davis, USA
Power management through dynamic voltage and frequency scaling (DVFS) is one of the most widely adopted
techniques. However, it impacts application reliability (due to soft errors, circuit aging, and deadline misses).
Moreover, increased power density impacts the thermal reliability of the chip, sometimes leading to permanent
failure. To balance application and thermal reliability while achieving power savings and maintaining
performance, we propose an application- and thermal-reliability-aware, reinforcement learning-based
multi-core power management scheme. The proposed scheme employs a reinforcement learner that
considers the power savings and the variations in application and thermal reliability caused
by DVFS. To limit the computational overhead, power management decisions are made at
the application level rather than at per-core or system-level granularity. The proposed scheme is
evaluated on a microprocessor with up to 32 cores running PARSEC applications to demonstrate
its applicability and efficiency. Compared to existing state-of-the-art techniques, the proposed
technique achieves average energy savings of up to ∼20% and a temperature reduction of up to
4.926°C without degrading application or thermal reliability.
CCS Concepts: • Hardware → On-chip resource management; Chip-level power issues; Temperature
optimization; Transient errors and upsets; Process, voltage and temperature variations;
Additional Key Words and Phrases: Multi-core processor, reinforcement learning, application reliability,
thermal reliability, power management, DVFS
ACM Reference format:
Sai Manoj Pudukotai Dinakarrao, Arun Joseph, Anand Haridass, Muhammad Shafique, Jörg Henkel, and
Houman Homayoun. 2019. Application and Thermal-reliability-aware Reinforcement Learning Based
Multi-core Power Management. J. Emerg. Technol. Comput. Syst. 15, 4, Article 33 (October 2019), 19 pages.
https://doi.org/10.1145/3323055
Coauthor Dr. Shafique’s contributions in this work are supported in part by the German Research Foundation (DFG) as
part of the GetSURE project in the scope of SPP-1500 priority program “Dependable Embedded Systems.”
Authors’ addresses: P. D. Sai Manoj, George Mason University, 4400 Patriot Circle, Fairfax, VA, 22030; email:
spudukot@gmu.edu; A. Joseph and A. Haridass, IBM Systems, Bannerghatta Rd, Bangalore, Karnataka, India; emails:
{arujosep, anharida}@in.ibm.com; M. Shafique, Vienna University of Technology, Institute of Computer Engineering,
Embedded Computing Systems, Treitlstraße 3, 1040 Wien, Österreich; email: muhammad.shafique@tuwien.ac.at;
J. Henkel, Haid-und-Neu-Str. 7, Bldg. 07.21, 76131 Karlsruhe, Germany; email: henkel@kit.edu; H. Homayoun, University
of California, Davis, 1 Shields Ave, Davis, CA, 95616; email: houmanhomayoun@gmail.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Association for Computing Machinery.
1550-4832/2019/10-ART33 $15.00
https://doi.org/10.1145/3323055
ACM Journal on Emerging Technologies in Computing Systems, Vol. 15, No. 4, Article 33. Pub. date: October 2019.
1 INTRODUCTION
The ever-increasing proliferation of multi-core processors in computing systems (ranging
from portable devices to datacenters) facilitates the multi-program execution of multi-threaded
applications. This enables high performance under tight power budgets (Bergamaschi et al. 2008;
Manoj et al. 2015, 2017; Pagani et al. 2017; Pagani et al. 2018; Tarsa et al. 2014; Wang and Pedram
2016). The high performance of multi-core systems, coupled with increased power density, poses
multiple challenges, with reliability being one of the key design parameters to be considered along
with power/energy and performance across a wide range of computing platforms, from miniature
embedded systems to massive datacenters (Shafique et al. 2014; Swaminathan et al. 2017). What
is more, increased power consumption forms a positive feedback loop with temperature, leading to
increased temperatures and eventually to thermal runaway failures (Wu et al. 2014; Xu et al.
2015). To overcome such concerns, reliability-aware power management is critical for processors
embedded in small-scale systems as well as in datacenters. Here, the term “reliability” encompasses
both application reliability and thermal reliability. Application reliability is composed of two parts:
(i) functional reliability, i.e., for a given input, the correctness of the output values of a given function
considering faults such as soft errors in the underlying hardware; and (ii) timing reliability, i.e., the
ability to meet the timing requirements. Although thermal reliability depends on multiple
factors, we consider the predominant ones, oxide breakdown and electromigration (Gnad
et al. 2015; Manoj et al. 2013; Pagani et al. 2014; Srinivasan et al. 2004), in this work.
Towards optimizing power and meeting power budget constraints, Dynamic Voltage and Frequency
Scaling (DVFS) (Esmaeilzadeh et al. 2011; Manoj et al. 2015; Pagani et al. 2017; Pagani et al. 2018;
Tarsa et al. 2014; Wang and Pedram 2016) has proven to be one of the most effective and widely used
adaptive techniques for power/energy savings. In earlier works, DVFS is performed
considering different parameters such as the worst-case execution time of the task (Choi et al. 2005),
temperature (Lee et al. 2010), and voltage demand (Choi et al. 2004; Dietrich et al. 2010). Many
existing works, such as Choi et al. (2005), Dietrich et al. (2010), and Wang and Pedram (2016),
perform DVFS by predicting one or more parameters for the next time interval(s). Based on this prediction,
the VF settings are chosen to meet the power/energy budgets under the given
performance constraints. Advancements in the machine learning (ML) field led to
its adoption for prediction and/or on-chip parameter adaptation required for power management
(DVFS) using techniques such as Bayesian learning (Wang et al. 2011), reinforcement learning
(Jung and Pedram 2010; Manoj et al. 2016; Shen et al. 2013), and regression analysis (Bartolini
et al. 2013; Bartolini et al. 2011; Manoj et al. 2018; Yang et al. 2015).
In addition to the power constraints, the on-chip temperature is a major concern in
multi-core processors, as it can have a non-trivial impact on the lifetime and reliability of the chip
(Shafique et al. 2014). Keeping the chip’s temperature under a certain thermal threshold (or critical
value) is of paramount importance, as otherwise high temperatures may cause permanent failures.
To dissipate the heat and reduce the temperature, chips are provided with a
cooling solution (e.g., a combination of thermal paste, heat spreader, heat sink, and cooling fan).
Note that power management helps reduce on-chip hot-spots, as the heat is
generated from the consumed power. However, power management primarily focuses on optimizing
the power, and persistent consumption of power (even if it is low) leads to hot-spots, which
might not be mitigated by power/energy saving-oriented DVFS techniques. To provide better
temperature regulation, multi-core systems are equipped with Dynamic Thermal Management
(DTM) techniques. These DTM techniques are commonly reactive (i.e., triggered once the critical
temperature is exceeded) and can power down cores, reduce their supply voltages and execution
frequencies, gate their clocks, boost the fan speed, and so on. In other words, if the chip heats up
above a critical value (identified using thermal sensors distributed across the chip), then the DTM
is triggered to reduce the temperature. Similar to power management, machine learning is widely
deployed for thermal management as well. Techniques for thermal management with machine
learning, such as DTM with temperature prediction by regression (Lee et al. 2010) and Q-learning (Lu
et al. 2015; Shen et al. 2012), have been proposed in the literature.
Fig. 1. (a) Fault rate and power consumption, (b) application reliability, and (c) thermal reliability under
different VF settings.
Many existing power management and thermal management works focus primarily on
optimizing the power and/or temperature of the system. Although power/thermal management
optimizes the power and temperature of the chip, it significantly degrades the reliability of
system components such as the processor core, application data (cache), and memories, especially in
scaled geometries, resulting in faults induced in the data (Makhzan et al. 2007; Sasan et al.
2009). The state-of-the-art soft-error reduction techniques mainly exploit software- and hardware-level
techniques (Kapadia and Pasricha 2015; Mukherjee et al. 2002; Qi et al. 2010; Shye et al. 2007;
Xu et al. 2013). However, these techniques are computationally expensive due to the continuous
redundancy checking performed at the software level. Further, reliability-aware techniques with power
optimization, such as Dabiri et al. (2007) and Wu and Marculescu (2014), require technology node
changes such as transistor sizing. Though such techniques can achieve the desired reliability, they
demand excessive design and manufacturing effort; moreover, they are geared towards
soft errors and do not consider physical reliability. To observe the impact of DVFS, i.e., VF
scaling, on application and thermal reliability, and to establish the need for power management
that accounts for both application and thermal reliability, a motivational case study is carried
out and presented below.
1.1 Motivational Case Study
A simple case study to understand the impact of VF scaling on the application and thermal
reliability is presented in Figure 1. The model from which the application and thermal reliability
are derived is presented in Section 2.3, and the experimental settings are described in Section 6.
As can be observed from Figure 1(a), as the voltage-frequency levels are scaled down, the power
consumption decreases, but the fault rate increases. Also, Figure 1(b) shows that the
reliability differs across applications, even under the same VF settings. Thus, an application’s
reliability is not a simple function of VF; rather, it also depends on application characteristics
such as runtime and instruction profile. Similar findings have been reported in Salehi et al. (2015)
and Salehi et al. (2015). Note that the plotted functional reliability is under the
best settings, i.e., minimal failure rate and deadline misses. To address this problem, in the proposed
reinforcement learning-based power management, reliability is learned and used as
feedback for the DVFS, along with the achieved power savings.
In addition to application reliability, the thermal reliability w.r.t. VF scaling is shown in Figure 1(c)
for different PARSEC benchmark applications. Here, the power and temperature values
are obtained from McPAT (Li et al. 2009) and HotSpot (Huang et al. 2006). The simulations are run
in SniperSim (Carlson et al. 2014), with more details presented in Section 6. As one can observe from
Figure 1(c), as the VF levels are scaled up, power consumption and temperature increase,
leading to reduced thermal reliability, whereas the application reliability
increases, and vice versa. Note that the application and thermal reliabilities of
different applications differ even under the same VF settings, due to their inherent characteristics.
As such, an optimal DVFS that meets the power-performance budget without degrading the
thermal and application reliabilities is needed. As power management-focused works can
degrade the reliability of the application and system, power management under
reliability constraints is non-trivial. Thus, the objective of this work is to perform power
management under the constraints of performance, application reliability, and thermal reliability,
i.e., to achieve low power consumption while meeting the reliability constraints and
the desired performance.
Associated Research Challenges
The key challenges in performing learning-based power management that considers
application characteristics and reliability can be outlined as follows:
Computational Overheads: Power management can be performed at different levels of
abstraction, such as at the core level or the system level. Per-core power management introduces
computational and hardware overheads, such as a VF controller per core (Jung and Pedram
2010; Manoj et al. 2015, 2018; Shafique et al. 2016). System-level power management, in contrast,
applies a single decision at the granularity of the whole system, which is efficient in terms
of computational overhead but achieves lower power savings and/or energy efficiency (Shen et al.
2013). VF-island-based power management, though efficient, lacks flexibility and scalability
(Rangan et al. 2009). As such, an intermediate solution is desired.
Application-Reliability Variation with DVFS: In addition to the traditional power management
challenges such as processing overhead, incorporating reliability into power management adds
the following challenges: the reliability of an application varies with the VF level at which the
application is executed, and the reliability differs across applications (Figure 1).
Additionally, for learning the reliability of an unseen application for efficient power management,
supervised learning is not an effective solution, as the reliability is hard to predict proactively
and is not known a priori for unseen applications (Wang and Pedram 2016).
Thermal Reliability with Power Consumption: As mentioned earlier, power consumption
generates heat and can lead to on-chip thermal hot-spots, eventually causing permanent
failures. Reducing power consumption or performing DVFS to mitigate the
hot-spots reduces performance and also affects the application reliability. In addition,
thermal reliability decreases as the on-chip temperature rises, i.e., the higher the temperature,
the lower the thermal reliability. Low temperature arises from lower power consumption,
which implies that improving thermal reliability has an adverse effect on application reliability. As
such, a trade-off has to be maintained between thermal and application reliability.
Contributions of This Work
To address the above-discussed problems, in this article, we make the following novel
contributions:
• To the best of our knowledge, this is the first work that considers both application and thermal
reliability along with performance to perform multi-core power management.
• To achieve the desired application and thermal reliabilities along with power/energy savings, a
reinforcement learning (RL)-based power manager is proposed. Here, the RL agent determines
the VF level based on the predicted power and the achieved reliability.
• The reward is determined based on the power savings, temperature, and application
reliability, which allows the power manager to optimize the power and temperature while
maintaining the application and thermal reliability.
Traditional power management works consider the power-performance trade-off and do not emphasize
reliability concerns. Similarly, reliability-aware works are limited to either power optimization
or the enhancement of only one kind of reliability. In contrast, this work considers both
thermal and application reliability. Furthermore, as reliability cannot
be known a priori, this work is one of the first to utilize machine learning to adapt to variations
in reliability during runtime.
Paper Organization
The rest of the paper is structured as follows: The models for the system, reliability, and applications
employed in this work are presented in Section 2. The system architecture is discussed in
Section 3. An introduction to reinforcement learning is presented in Section 4. Section 5 describes
the proposed reliability-aware power management scheme. Section 6 presents the experimental evaluation
and comparison of the proposed reliability-aware power management with other state-of-the-art
techniques. Conclusions are drawn in Section 7.
2 SYSTEM MODEL
2.1 Hardware Architecture Model
We consider a homogeneous multi-core processor comprising $N$ cores, $C = \{C_1, C_2, \ldots, C_N\}$.
Due to varying workloads, different cores execute at different frequencies to ensure proper execution.
There exists a maximum operating frequency level $f_{max}$ for every possible operating voltage
$V$. The frequency of a core can be varied between $f_{min}$ and $f_{max}$, and the corresponding voltage
between $v_{min}$ and $v_{max}$. Cores operating at higher VF levels consume more power when executing
an application. Furthermore, similar to Salehi et al. (2015), we assume that the performance of
the processor core is higher when running at a higher VF level.
2.2 Application Model
We consider a mixture of single-threaded and multi-threaded applications in this work, and each
core executes one thread. Figure 2 represents a snapshot of a multi-core system with multiple applications
deployed. In Figure 2, different shades on processor cores represent different applications
running on them. The distribution of applications is not uniform, i.e., different applications can
run on different numbers of cores, depending on their number of threads. Each application
is composed of multiple tasks. A task $\tau$ requires $w$ clock cycles for execution. Also, at any given
time, the total number of executing threads is less than or equal to the number of cores, similar to
Pagani et al. (2017).
2.3 Reliability Models
Here, we present the employed application and thermal reliability models, followed by the power
model for the applications running on a multi-core system.
Fig. 2. Multi-core microprocessor equipped with the proposed application- and thermal-reliability-aware
power manager.
2.3.1 Application Reliability Model. To determine the application reliability model, we consider
transient faults and the timing reliability model. Transient fault occurrences are assumed to follow
a Poisson process with a rate of $\lambda$ (Ejlali et al. 2012). The fault rate varies exponentially with the
operating voltage (Zhu et al. 2004). As such, the transient fault rate, depending on the operating
voltage $V$, is
$$\lambda(V) = \lambda_0 \, 10^{\frac{V_{max} - V}{\Delta}}, \qquad (1)$$
where $\lambda_0$ (= $10^{-6}$) indicates the fault rate when operating at the maximum possible voltage $V_{max}$, and
$\Delta$ (= 1 V) is a parameter that indicates the increase in fault rate when the voltage is decreased by one
level. As transient faults in the underlying hardware result in software faults, the Functional
Vulnerability Index (FVI), as in Salehi et al. (2015), is considered and set to 1. The Functional Reliability
(FR) under the transient fault rate $\lambda$ and the resulting software failure rate $\lambda(V) \times FVI$ is modeled as below:
$$FR(FVI, w, V, f) = e^{-\lambda(V) \times FVI \times \frac{w}{f}}, \qquad (2)$$
where $w$ indicates the number of clock cycles needed to execute the application, and $f$ represents
the operating frequency. The employed reliability model is based on the single-task execution model,
as in Ejlali et al. (2012). One of the main reasons for choosing this model is that it was
shown to be accurate and robust for reliability estimation in Salehi et al. (2015),
with <2.5% deviation in terms of reliability efficiency. Note, however, that the proposed
power management scheme is independent of the fault model used, as the reward function
requires the reliability variation rather than absolute reliability values. Hence, other application
reliability models can also be employed, as the proposed technique requires information regarding the
reliability rather than the model itself.
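To make Equations (1) and (2) concrete, the following minimal Python sketch evaluates the fault rate and the functional reliability at a few VF levels. The values λ0 = 10^-6, Δ = 1 V, and FVI = 1 are read from the text; the VF levels, V_max = 1.2 V, and the cycle count are hypothetical values chosen only for illustration and are not taken from the paper.

```python
import math

LAMBDA_0 = 1e-6  # fault rate at V_max (value stated in the text)
DELTA = 1.0      # volts (value stated in the text)


def fault_rate(v, v_max, lambda_0=LAMBDA_0, delta=DELTA):
    """Equation (1): transient fault rate at operating voltage v."""
    return lambda_0 * 10 ** ((v_max - v) / delta)


def functional_reliability(w, v, f, v_max, fvi=1.0):
    """Equation (2): functional reliability of w clock cycles run at (v, f)."""
    return math.exp(-fault_rate(v, v_max) * fvi * w / f)


# Hypothetical VF levels (volts, hertz) and task size, for illustration only.
V_MAX = 1.2
VF_LEVELS = [(0.8, 1.0e9), (1.0, 1.5e9), (1.2, 2.0e9)]
W_CYCLES = 2.0e9

for v, f in VF_LEVELS:
    lam = fault_rate(v, V_MAX)
    fr = functional_reliability(W_CYCLES, v, f, V_MAX)
    print(f"V={v:.1f} V  f={f / 1e9:.1f} GHz  lambda={lam:.3e}  FR={fr:.9f}")
```

Under these assumed numbers, lowering the VF level raises the fault rate and lengthens the execution time w/f, so the functional reliability drops on both counts.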
2.3.2 Thermal Reliability Model. The thermal reliability of the system depends on multiple factors,
of which oxide breakdown and electromigration (EM) are the predominant ones (Srinivasan
et al. 2004). As such, the thermal reliability of the system is given by
$$R(t) = \exp\left(-C \cdot t^{\beta} \cdot e^{-\frac{E_a \beta}{kT}}\right), \qquad (3)$$
where $R(t)$ indicates the reliability at time instant $t$; $C = \left(\frac{1}{\Gamma(1+1/\beta)} \cdot J^{n}\right)^{\beta}$, with $n$ being a material-based
constant (1.1 for copper (Srinivasan et al. 2005)) and $J$ being the energy consumption; $\beta$ is the Weibull
slope parameter (= 2 (Wu et al. 2002)); $k$ is the Boltzmann constant; and $E_a$ is the activation energy
(0.9 eV for copper).
Based on Equation (3), the reliability of a dual-core system with power consumptions $P_1$ and
$P_2$, leading to temperatures $T_1$ and $T_2$, is given as
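$$R_2(t) = \exp\left(-C \cdot t^{\beta} \cdot \left(e^{-\frac{E_a \beta}{kT_1}} + e^{-\frac{E_a \beta}{kT_2}}\right)\right). \qquad (4)$$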
In this work, the temperatures ($T_1$, $T_2$) are obtained from HotSpot (Huang et al. 2006), and the power of
the cores is obtained directly from McPAT (Li et al. 2009).
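As an illustration of Equations (3) and (4), the sketch below computes the thermal reliability from a list of per-core temperatures. The activation energy (0.9 eV) and the Weibull slope (2) are the values given in the text. The constant C is passed in as a plain parameter with a hypothetical magnitude, the time unit is left abstract, and summing the exponential term over more than two cores is an assumed generalization of the dual-core form in Equation (4).

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K
E_A_EV = 0.9             # activation energy for copper (from the text)
BETA = 2.0               # Weibull slope parameter (from the text)


def thermal_reliability(t, temps_kelvin, c):
    """R(t) for one or more cores; c plays the role of the constant C."""
    # One exponential term per core, as in Equation (4) for two cores.
    arrhenius_sum = sum(
        math.exp(-E_A_EV * BETA / (K_B_EV * temp)) for temp in temps_kelvin
    )
    return math.exp(-c * t ** BETA * arrhenius_sum)


# Hypothetical constant and elapsed time, chosen only to give visible numbers.
C_HYPOTHETICAL = 1.0e20
T_ELAPSED = 100.0

for temps in ([350.0], [370.0], [350.0, 370.0]):
    r = thermal_reliability(T_ELAPSED, temps, C_HYPOTHETICAL)
    print(f"temperatures={temps} K  R={r:.4f}")
```

With these assumed inputs, a hotter core and an additional core both lower R(t), which matches the qualitative behavior the model describes.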
2.4 Power Model
The total power consumption of a core is composed of static and dynamic power. The static power
is predominantly due to leakage and varies exponentially with the threshold voltage. The dynamic
power consumption is due to the application-dependent switching activities in the core. The total
power consumption (Brooks et al. 2007; Ejlali et al. 2012) when operating at voltage $V$ and
frequency $f$ is modeled as below:
$$P(V, f) = P_{static} + P_{dynamic} = I_0 \, e^{-\frac{V_{th}}{\eta V_T}} \, V + \alpha C V^{2} f. \qquad (5)$$
Here, $I_0$ and $\eta$ are technology parameters; $V_T$ is the thermal voltage; $V_{th}$ is the threshold voltage;
$\alpha$ represents the switching activity factor; and $C$ is the average capacitance. To obtain the per-application
power or energy trace, we sum the power traces of the cores on which the application
is executing.
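The power model of Equation (5) and the per-application summation can be sketched in Python as follows. All parameter values in the example (I0, η, V_th, α, the capacitance, and the core traces) are hypothetical placeholders; only the thermal voltage of roughly 26 mV near room temperature is a standard figure. The function and variable names are illustrative and do not come from the paper.

```python
import math


def core_power(v, f, i0, eta, v_t, v_th, alpha, cap):
    """Equation (5): static (leakage) plus dynamic power of one core."""
    static = i0 * math.exp(-v_th / (eta * v_t)) * v
    dynamic = alpha * cap * v ** 2 * f
    return static + dynamic


def application_trace(core_traces, cores_of_app):
    """Sum the power traces of the cores on which one application runs."""
    length = len(core_traces[cores_of_app[0]])
    return [sum(core_traces[c][k] for c in cores_of_app) for k in range(length)]


# Hypothetical technology parameters, for illustration only.
PARAMS = dict(i0=1.0, eta=1.5, v_t=0.026, v_th=0.3, alpha=0.5, cap=1.0e-9)
for v, f in [(0.8, 1.0e9), (1.2, 2.0e9)]:
    print(f"V={v} V  f={f / 1e9} GHz  P={core_power(v, f, **PARAMS):.4f} W")

# Hypothetical per-core traces (watts); the application occupies cores 0 and 1.
TRACES = {0: [1.0, 2.0], 1: [0.5, 0.5], 2: [3.0, 3.0]}
print(application_trace(TRACES, [0, 1]))
```

Summing per-core traces in this way yields one power trace per application, which is the granularity at which the proposed power manager operates.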
3 SYSTEM ARCHITECTURE
Figure 2 illustrates the system architecture with the proposed reliability-aware power management
for a multi-core microprocessor. The microprocessor is composed of multiple cores running
different applications. Each core is equipped with private L1 and L2 caches. Characteristics
such as the per-application power trace (in mW) and the reliability are obtained or derived
for the purpose of power management. The obtained application power trace and the derived reliability
are fed to the RL-based power manager, which generates the power management policy and
provides the optimal DVFS configuration.
The power management settings, i.e., VF levels, are determined in the OS layer. The power and
reliability data obtained from the application logs are collected iteratively over a time window of
length $n$ (10 μs in this work) and fed to the power manager, which learns the power profile and derives the
reliability for different applications. The power manager determines the optimal power management
policy based on the sensed data (power trace) and its reliability. The key advantage of employing
a reinforcement technique is that the decision is learned from experience rather than from
labels, which might prove less effective, especially for reliability, which differs across
applications. Moreover, the decision made by the RL agent changes if the achieved reward is
decreasing (or going in a negative direction), which helps improve the quality of power
management. To overcome the convergence constraints of RL, a threshold on the number of
loops to be run is enforced, as the use of deep RL might increase latency and operational costs. Furthermore,
power management is carried out at regular intervals ($n$) to allow sufficient time
for switching activities and decision making. More details on the simulation settings are provided
in Section 6. Note that the power and reliability data presented in Figure 2
are vectors and functions of time. The application-level power trace is represented as a matrix
$X$, where each column holds the power of the different applications at one time instant.
Similarly, the reliability is represented as a vector $R$.
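One possible way to hold the windowed data described above is sketched below in Python. The class name, its methods, and the choice of a fixed-length buffer are assumptions made for illustration; the text only specifies that X has one column per time instant and that the reliability is kept as a vector R.

```python
from collections import deque


class TraceWindow:
    """Keeps the most recent samples of per-application power and reliability."""

    def __init__(self, num_apps, length):
        self.num_apps = num_apps
        self.power = deque(maxlen=length)        # each entry is one column of X
        self.reliability = deque(maxlen=length)  # each entry is R at one instant

    def push(self, app_power, app_reliability):
        """Append one time instant; the oldest instant is dropped when full."""
        if len(app_power) != self.num_apps or len(app_reliability) != self.num_apps:
            raise ValueError("expected one value per application")
        self.power.append(list(app_power))
        self.reliability.append(list(app_reliability))

    def matrix_x(self):
        """Return X with one row per application and one column per instant."""
        return [[col[i] for col in self.power] for i in range(self.num_apps)]

    def latest_reliability(self):
        """Return the most recent reliability vector R."""
        return list(self.reliability[-1])


# Hypothetical samples for two applications over four instants, window of three.
window = TraceWindow(num_apps=2, length=3)
for k in range(4):
    window.push([100.0 + k, 250.0 - k], [0.99, 0.97])
print(window.matrix_x())
print(window.latest_reliability())
```

A bounded buffer keeps memory use constant regardless of how long the power manager runs, which fits the low-overhead goal stated in the text.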
4 REINFORCEMENT LEARNING (RL)
Reinforcement learning (RL) is an ML technique that mimics one of the most common learning
styles in natural life, i.e., learning to achieve a goal by trial-and-error interaction with a dynamic or
uncertain environment (Liu et al. 2010; Tan et al. 2009). In RL, interactions between the learning
agent and the environment are generally modeled using a finite state space $S$ (corresponding to
environment inputs), a set of available actions $A$ (corresponding to control/optimization knobs
used by the agent), and a reward function $R: S \times A \to \mathbb{R}$ (used to decide which action to take for
a given state). The ultimate goal of RL is to find a policy $\pi(s) = a$, which chooses an action
$a \in A$ in each state $s \in S$ (i.e., a mapping between the states and the actions), to optimize a reward
function (i.e., to maximize the cumulative rewards over a potentially infinite time span).
Q-learning: Q-learning is one of the most popular algorithms used to perform RL (Liu et al.
2010; Tan et al. 2009). In Q-learning, a Q-value is associated with every state-action pair $(s, a)$, denoted
as $Q(s, a)$. The value of $Q(s, a)$ approximates the expected long-term cumulative reward of taking
action $a$ starting from state $s$. In this way, the agent decides which action to perform in
the current state to achieve the maximum long-term reward based on the value function $Q(s, a)$.
Namely, at decision epoch $t_k$, when the system has just transitioned to state $s_k \in S$, the action $a_k$
with the highest Q-value is chosen. During the first few iterations, the RL agent chooses an action
randomly, and the actions are learned based on the obtained reward. Q-learning has the following benefit:
as it is a model-free learning algorithm, the Q-learning agent does not need any
prior system information, such as the transition probability from one state to another. Therefore,
it is a highly adaptive and flexible technique, which is one of the reasons it is considered in this
work.
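The action selection described above (random choices during the first few iterations, then the action with the highest Q-value) can be sketched as follows. The warm-up length of 10 iterations and the function name are hypothetical; the text does not state how many initial iterations are random.

```python
import random


def select_action(q_row, iteration, warmup_iterations=10, rng=random):
    """Pick an action index for the current state.

    q_row holds the Q-values of all actions in the current state.
    warmup_iterations is an assumed value, not taken from the paper.
    """
    if iteration < warmup_iterations:
        # Early on, explore by choosing uniformly at random.
        return rng.randrange(len(q_row))
    # Afterwards, exploit: choose the action with the highest Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])


# Hypothetical Q-values for three VF levels in one state.
Q_ROW = [0.1, 0.9, 0.3]
print(select_action(Q_ROW, iteration=0, rng=random.Random(0)))
print(select_action(Q_ROW, iteration=50))
```

After the warm-up phase the sketch is purely greedy, which mirrors the description in the text; practical implementations often keep a small exploration probability, but that is a design choice outside what the paper states.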
The fundamental aspect of the Q-learning algorithm is the value iteration update of the Q-value
function. In particular, the Q-value for each state-action pair is initially predefined (or set randomly).
These values are then updated every time an action is issued and a reward is received.
That is, at decision epoch $t_{k+1}$, the Q-value $Q(s_k, a_k)$ is updated according to the received reward
as:
$$Q(s_k, a_k) \leftarrow \underbrace{Q(s_k, a_k)}_{\text{old value}} + \underbrace{\beta_k}_{\text{learning rate}} \cdot \Big( \overbrace{\underbrace{r_{k+1}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a \in A} Q(s_{k+1}, a)}_{\text{max future value}}}^{\text{expected discounted reward}} - \underbrace{Q(s_k, a_k)}_{\text{old value}} \Big), \qquad (6)$$
where $r_{k+1}$ is the expected reward at time $t_{k+1}$ after taking action $a_k$ at time $t_k$; $\gamma \in (0, 1)$ is the
discount factor; and $\beta_k \in (0, 1)$ is the learning rate at time $t_k$. The next time state $s$ is visited, the action
with the maximum Q-value is chosen, i.e., $\pi(s) = \arg\max_{a \in A} Q(s, a)$. Because the Q-value
has been updated, this might lead to a different action than the one taken the last time state $s$ was visited. In
this work, we set the discount factor to 0.28 and the learning rate to 0.72. These values were chosen
after a wide range of experiments as the ones that yield the best performance.
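As a minimal sketch of the update in Equation (6), assuming a tabular Q-function stored in a Python dictionary, the following code applies the update with the discount factor and learning rate reported above. The state labels, action labels, and reward values are hypothetical and serve only to exercise the formula.

```python
from collections import defaultdict

GAMMA = 0.28  # discount factor reported in the text
BETA = 0.72   # learning rate reported in the text


def q_update(q, state, action, reward, next_state, actions,
             beta=BETA, gamma=GAMMA):
    """Apply one update of Equation (6) in place and return the new Q-value."""
    # Max future value: best Q-value reachable from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Expected discounted reward minus the old value.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += beta * td_error
    return q[(state, action)]


# Hypothetical toy problem: two states, two actions, Q-values initialized to 0.
actions = ["a1", "a2"]
q = defaultdict(float)
first = q_update(q, "s1", "a1", reward=1.0, next_state="s2", actions=actions)
second = q_update(q, "s2", "a2", reward=0.5, next_state="s1", actions=actions)
print(first, second)
```

With all Q-values starting at zero, the first update reduces to the learning rate times the reward; the second update already benefits from the value learnt for state s1.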
5 RELIABILITY-AWARE POWER MANAGEMENT
In this section, we present the proposed application- and thermal-reliability-aware power management,
which employs the previously discussed RL technique. One of the key challenges in performing
power management while considering reliability is that application and thermal reliabilities
have different units. For instance, time-dependent dielectric breakdown is reported in parts
per million (ppm) defective, whereas soft errors are quantified as failures in time (FIT) (Seifert et al.
2012; Swaminathan et al. 2017). As such, they cannot be combined directly.
ACM Journal on Emerging Technologies in Computing Systems, Vol. 15, No. 4, Article 33. Pub. date: October 2019.
Application and Thermal-reliability-aware Reinforcement Learning 33:9
As mentioned in the previous section, an RL agent performs near-optimal actions based on the
current state and the corresponding immediate reward it gets. First, we define the state space, then
the action space, followed by the way the reward is calculated in this work.
5.1 State Space
There exist various metrics, such as power or energy trace, memory access characteristics, priority
of the application, CPU utilization rate, Cycles-Per-Instruction (CPI), and temperature, that serve
as factors to perform multi-core power management and represent the current state of the system.
As processing all of these metrics leads to computational overhead and can cause convergence
issues, only a subset of them, chosen according to the applied constraints, is considered for power
management. The power trace is a direct representation of the power/energy consumption and
aids in performing efficient DVFS. As state variables such as power consumption or reliability
are continuous and can take any value, treating every value as a separate
state would incur large computational complexity and hinder convergence. To alleviate this,
a set of discretized values is considered, and each original value is mapped to the closest discrete
value of a state. For instance, an
original power consumption of 345 mW is mapped to a state with a state value of 350 mW.
Here, the example uses just one state variable, but in the simulations the state tuple
has three values, as described later. Furthermore, in contrast to other power management works,
this work also aims to meet application- and thermal-reliability constraints, so these reliabilities are also
considered to represent the state of the system. Incorporating these variables into
the state of the system to ensure its overall reliability is non-trivial.
Thus, the state of the system for the reinforcement learner (agent) comprises the per-application power
trace and the corresponding reliabilities derived based on Equations (2) and (4). As such, each application
has $k$ states denoted by $s_1, s_2, \ldots, s_k$, where $s_1 < s_2 < \cdots < s_k$, i.e., arranged in
ascending order of power consumption. Each state represents the power consumption of the
running application and its reliability, i.e., $s_i = \{p_i, r_i, tr_i\}$, where $p_i$ is the power in the $i$th state
and $r_i$ and $tr_i$ are the corresponding application and thermal reliability, respectively.
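The mapping of continuous values to discrete states can be sketched as a nearest-neighbor lookup. The sketch below reproduces the 345 mW to 350 mW example from the text; the grids of power and reliability levels are hypothetical, as the actual discretization used in the simulations is not given here.

```python
def nearest_level(value, levels):
    """Return the discrete level closest to a continuous value."""
    return min(levels, key=lambda level: abs(level - value))


# Hypothetical grids; the 50 mW spacing and the reliability levels are assumptions.
POWER_LEVELS_MW = [250, 300, 350, 400, 450]
RELIABILITY_LEVELS = [0.90, 0.95, 0.99, 0.999]


def quantize_state(power_mw, app_rel, thermal_rel):
    """Map (p, r, tr) to the closest discrete state tuple {p_i, r_i, tr_i}."""
    return (
        nearest_level(power_mw, POWER_LEVELS_MW),
        nearest_level(app_rel, RELIABILITY_LEVELS),
        nearest_level(thermal_rel, RELIABILITY_LEVELS),
    )


state_power = nearest_level(345, POWER_LEVELS_MW)  # the example from the text
state = quantize_state(345, 0.98, 0.992)
print(state_power, state)
```

A coarser grid yields fewer states and faster convergence at the cost of resolution, which is the trade-off the discretization is meant to manage.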
5.2 Action Space
Each RL agent searches a finite discrete space of possible target VF transitions, which forms the
action space, denoted by $A = \{a_1, a_2, \ldots, a_n\}$, where action $a_i$ indicates assigning the $i$th voltage and
frequency level $(v_i, f_i)$ to the application. To avoid the convergence and complexity issues arising
from the RL, we limit the number of feasible actions by having only four VF levels in this work.
5.3 Reward
The reward function has to be defined based on the state and the action taken by the RL agent. Thus,
the reward has to be composed of the power consumption and reliability (thermal and application).
As mentioned earlier, it is not straightforward to combine different reliabilities due to differences
in their behaviors and cardinality. To overcome these concerns, works such as Swaminathan et al.
(2017) proposed the use of principal component analysis. Though effective, this is limited by a few
factors, such as a non-linear or orthogonal relationship between application and thermal reliabilities
and the complexity involved in running it in the utilized scenario. As such, we consider the variation in
the reliabilities w.r.t. the desired reliability. The reward is calculated as a function of the reliability
and energy savings. The reward associated with transitioning from state $s$ to $s'$ is given by
\[
r_{k+1} \mid (s, a, s') = \alpha_1 (\Delta FR / FR_k) + \alpha_2 (\Delta TR / TR_k) + \alpha_3 (\Delta E / E_k), \tag{7}
\]
33:10 P. D. Sai Manoj et al.
where $s'$ indicates any of the possible states reachable from state $s$ when action $a$ is performed; $\Delta FR / FR_k$ and
$\Delta TR / TR_k$ are the changes in functional and thermal reliability relative to the existing reliability when
transitioning from state $s$ to $s'$ with action $a$; similarly, the relative change in energy consumption due to
the transition is given by the third term ($\Delta E / E_k$). The weights $\alpha_1$, $\alpha_2$, and $\alpha_3$ are constants, each set to 0.33
in this work. The functional and thermal reliability are derived based on Equations (2) and (3),
respectively.
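The reward of Equation (7) can be sketched as a weighted sum of relative changes. The sign convention is an assumption here: the sketch treats an increase in reliability and a decrease in energy as positive contributions, so that the agent is rewarded for saving energy. All numeric inputs are made up for illustration.

```python
ALPHAS = (0.33, 0.33, 0.33)  # weights reported in the text


def reward(fr_old, fr_new, tr_old, tr_new, e_old, e_new, alphas=ALPHAS):
    """Weighted sum of relative reliability changes and relative energy savings."""
    a1, a2, a3 = alphas
    d_fr = (fr_new - fr_old) / fr_old  # relative change in functional reliability
    d_tr = (tr_new - tr_old) / tr_old  # relative change in thermal reliability
    # Assumption: the energy term counts savings (old minus new) as positive.
    d_e = (e_old - e_new) / e_old
    return a1 * d_fr + a2 * d_tr + a3 * d_e


# Hypothetical transition: both reliabilities improve slightly, energy drops by 20%.
r = reward(fr_old=0.98, fr_new=0.99, tr_old=0.95, tr_new=0.96, e_old=2.0, e_new=1.6)
r_no_change = reward(0.98, 0.98, 0.95, 0.95, 2.0, 2.0)
print(r, r_no_change)
```

Because every term is a dimensionless ratio, quantities with different units (such as ppm and FIT) can be summed without a common scale.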
5.4 Power Management Policy Generation
We describe here how the RL agent generates the power management policy based on the described
state, action, and reward.
For effective power management, the scheme has to be proactive, as reactive
power management is inefficient due to computational delays. We first predict the power trace
based on the previous traces and then generate the power management policy as follows. The input to
the power management policy generator is the power trace of the system at application-level granularity.
To facilitate proactive runtime power management with low overhead, a linear predictor–based
power trace prediction is performed first, as in Equation (8),
\[
p(t+1) = \sum_{k=0}^{z} w_k \, p(t-k) + \epsilon, \tag{8}
\]
where $p(t+1)$ represents the predicted power at time instant $t+1$, $w_k$ represents the $k$th regression coefficient,
and the error is denoted by $\epsilon$. The order of the predictor is represented by $z$, set to 8 in the experiments.
The order is determined experimentally to achieve low error without excessive overhead. With
the chosen order, an average root mean square error (RMSE) of 0.53 is achieved. Once the power is
predicted, the corresponding reliability is derived, as given in Section 2.3. As the power trace is
continuous, assigning each value to its own state would increase the computational complexity
for the reinforcement learner. To avoid this, the predicted power trace
and the reliability are quantized, and the state whose power and reliability values are closest to the
predicted ones is chosen as the current state. The state is composed of power
and reliability, i.e., state $s_i = \{p_i, r_i, tr_i\}$, where $p_i$ denotes the power for state $i$ and $r_i$ and $tr_i$ denote the corresponding
reliabilities, as described previously. As each application has $k$ states denoted by
$S = \{s_1, s_2, \ldots, s_k\}$, one of the states is assigned based on the predicted power and reliability.
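A minimal sketch of the predictor in Equation (8) is given below, together with the RMSE used to assess it. The weights are hypothetical (a plain moving average over the last z+1 samples) rather than fitted regression coefficients, and the power trace is made up; how the coefficients are fitted is not specified in this section.

```python
import math


def predict_next(history, weights):
    """Return sum_k weights[k] * p(t-k), where history[-1] is p(t)."""
    if len(history) < len(weights):
        raise ValueError("need at least len(weights) past samples")
    recent = history[::-1][:len(weights)]  # p(t), p(t-1), ..., p(t-z)
    return sum(w * p for w, p in zip(weights, recent))


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


Z = 8  # predictor order from the text; k runs from 0 to z, i.e., z + 1 taps
weights = [1.0 / (Z + 1)] * (Z + 1)  # hypothetical: equal weights
trace_w = [1.50, 1.52, 1.55, 1.53, 1.56, 1.58, 1.57, 1.55, 1.56]  # made-up trace in W
p_next = predict_next(trace_w, weights)
print(p_next, rmse([p_next], [1.57]))
```

In practice the weights would be obtained by regression on recorded traces; the prediction step itself stays a single dot product, which keeps the runtime overhead low.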
Based on Bellman’s principle of optimality (Bellman 2003), given the states and the reward
function, the optimal policy can be derived as
\[
\pi(s) = \arg\max_{a} Q(s, a). \tag{9}
\]
$Q(s, a)$ is given in Equation (6). This $\pi(s)$ denotes the optimal policy for the system, given that
the system is in state $s$. As such, we generate the optimal state-action pairs based on the inputs.
As the power management policy is generated offline and deployed online, the associated
computational overhead does not impact power management at runtime. The proposed reliability-aware
power management policy is not restricted to any specific type of reliability model or architecture
and can be employed on different systems and with different reliability models.
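The offline policy generation of Equation (9) can be sketched as taking, for every state, the action with the largest learnt Q-value and storing the result in a lookup table. The Q-values and the state and action labels below are hypothetical.

```python
def extract_policy(q_table, states, actions):
    """Return {state: best action} by taking the argmax of Q(s, a) over a."""
    return {s: max(actions, key=lambda a: q_table[(s, a)]) for s in states}


# Hypothetical learnt Q-values for two states and two VF actions.
states = ["s1", "s2"]
actions = ["vf_high", "vf_low"]
q_table = {
    ("s1", "vf_high"): 0.40, ("s1", "vf_low"): 0.15,
    ("s2", "vf_high"): 0.05, ("s2", "vf_low"): 0.30,
}

policy = extract_policy(q_table, states, actions)
print(policy)
```

Once such a table exists, the online decision is a single dictionary lookup per application, which is consistent with the low deployment overhead described above.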
An example of the proposed Q-learning–based application- and thermal-reliability-aware power
management is shown in Figure 3. Based on the predicted power consumption, as given in
Equation (8), and the derived application and thermal reliability for the given application, one of
the states is mapped. For mapping, we consider the state with the closest power consumption value.
For instance, as shown in Figure 3, if the predicted power is 1.56 W, then the closest state is $s_4$;
as such, the current state is taken to be $s_4$. Further, depending on the current state and
the policy given by Equation (9), one of the actions is chosen. The chosen action and
transitions are shown with a dotted line in Figure 3. Based on the chosen action and the resulting power
consumption and reliability variations, the new reward is calculated and fed to the policy maker.
This process is repeated multiple times for convergence during the training phase. At the time of
testing, as the policies are already pre-defined, the assignment happens in one iteration, leading
to lower overhead. For brevity, the reliabilities are not shown in Figure 3.
Fig. 3. An example describing the proposed application- and thermal-reliability-aware reinforcement
learning–based power management.
Summary
The whole process of RL-based application- and thermal-reliability-aware power management is
outlined in Algorithm 1.
In the first step, based on the obtained power trace of an application, the power trace for future
time instants is predicted, as in Line 1 of Algorithm 1. The corresponding reliability is derived
for the application, as in Lines 2–3. Based on the predicted power and reliability, one of the states
is assigned, the reward for the next time step is calculated for all the possible actions in the given
state, and the corresponding Q-values are obtained, as given in Lines 4–6. Last, based
on Bellman’s optimality principle, the action with the maximum Q-value is considered optimal and is
fed to the DVFS controller to perform power management, as given in Lines 7–8 of Algorithm 1.
In the simulations, we constrain the number of iterations performed to improve
convergence.
ALGORITHM 1: Reliability-aware Power Management for multi-core system
Input: Power trace monitored at application-level granularity for all running applications (P),
and runtime
Output: Voltage-Frequency (VF) settings
1: Predict power trace as $P(t+1) = \sum_{k=0}^{z} w_k P(t-k) + \epsilon$
2: Estimate corresponding application reliability, as in (2)
3: Estimate corresponding thermal reliability, as in (3)
4: Assign state for predicted power trace and reliability, i.e., $\{p(t+1), r\} \rightarrow s_i$, $s_i \in S$
5: Calculate reward $r_{k+1}$ as in (7)
6: Obtain the Q-values, as in (6)
7: Based on Bellman’s principle, an action with optimal policy is derived as in (9)
8: The optimal policy provides the action to be taken, i.e., VF settings will be fed to the DVFS
controller for application-reliability-aware power management
Table 1. Overview of Core Configuration
Item                  Description        Value
Microprocessor core   Frequency (Max.)   2.0 GHz
                      Voltage (Max.)     1.0 V
                      Technology node    22 nm
                      L1-I cache         32 KB
                      L1-D cache         32 KB
                      L2 cache           256 KB
                      L3 cache           8 MB
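The deployment-time path through Algorithm 1 (predict, assign a state, look up the learnt action, emit VF settings) can be sketched as follows. The VF levels are those listed in Section 6.1; the state grid, the pre-trained policy table, the predictor weights, and the power trace are all hypothetical, and the state is reduced to power only, as in Figure 3.

```python
# VF levels (V, GHz) from Section 6.1; index 0 is the highest level.
VF_LEVELS = [(1.0, 2.0), (0.9, 1.8), (0.8, 1.5), (0.7, 1.0)]

# Hypothetical state grid (power in W) and a hypothetical pre-trained policy
# that maps each state to an index into VF_LEVELS.
STATE_POWER_W = {"s1": 0.9, "s2": 1.1, "s3": 1.3, "s4": 1.5, "s5": 1.7}
POLICY = {"s1": 0, "s2": 0, "s3": 1, "s4": 2, "s5": 3}


def predict_power(history, weights):
    """Line 1: weights[k] multiplies p(t-k), where history[-1] is p(t)."""
    recent = history[::-1][:len(weights)]
    return sum(w * p for w, p in zip(weights, recent))


def assign_state(power_w):
    """Line 4 (power only): pick the state with the closest power value."""
    return min(STATE_POWER_W, key=lambda s: abs(STATE_POWER_W[s] - power_w))


def decide_vf(history, weights):
    """Lines 7-8 at deployment: look up the learnt action and return its VF pair."""
    state = assign_state(predict_power(history, weights))
    return state, VF_LEVELS[POLICY[state]]


# Made-up three-sample trace (oldest first) and made-up predictor weights.
state, vf = decide_vf([1.50, 1.58, 1.60], [0.5, 0.3, 0.2])
print(state, vf)
```

During training, Lines 5 and 6 would additionally compute the reward and update the Q-values before the policy table is regenerated; that loop is omitted here because it depends on the reliability models of Equations (2) and (3).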
6 SIMULATION RESULTS
Here, we present the simulation settings, followed by the experimental analysis and comparison
with the existing traditional power management techniques.
6.1 System Settings
The proposed power management is implemented in the SniperSim simulator (Carlson et al. 2014),
which is a parallel, high-speed, interval-accurate x86 simulator. Standard Intel Xeon
microprocessor microarchitecture–based 22-nm core models are used in the simulations. The maximum
voltage and frequency levels are 1.0 V and 2.0 GHz, respectively. In the simulations, we use four
voltage-frequency levels for power management, which are supported by standard Xeon processor
microarchitecture–based cores: (1.0 V, 2.0 GHz), (0.9 V, 1.8 GHz), (0.8 V, 1.5 GHz), and (0.7 V, 1.0 GHz).
These levels can be modified depending on the simulation environment and the utilized cores,
as the proposed power management is independent of the underlying core architecture. To allow
enough time for switching VF levels and to reduce the processing overhead of the monitored
data, the application power traces are sampled every 10 μs, though the time required for switching is
in the range of a few μs, as reported in Singhal (2008). Additional details on the configuration of the
microprocessor core and other components are presented in Table 1. To validate the power management,
simulations are run with the PARSEC benchmark suite (Bienia et al. 2008): the blackscholes, x264,
bodytrack, swaptions, streamcluster, canneal, dedup, and fluidanimate applications are executed on the
multi-core system. The number of cores is varied from 2 to 32 in the simulations.
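To give a rough feel for the four VF levels, the following sketch estimates the relative dynamic power of each level under the standard CMOS approximation that dynamic power scales with V²f. This approximation ignores static (leakage) power and is an assumption made for illustration; it is not the power model used in the simulations.

```python
# (voltage in V, frequency in GHz) for the four VF levels listed in the text.
VF_LEVELS = [(1.0, 2.0), (0.9, 1.8), (0.8, 1.5), (0.7, 1.0)]


def relative_dynamic_power(vf_levels):
    """Return V^2 * f for each level, normalized to the first (maximum) level.

    Assumption: dynamic power scales as V^2 * f, with the switched capacitance
    and activity factor held constant; static (leakage) power is ignored.
    """
    v_max, f_max = vf_levels[0]
    reference = v_max ** 2 * f_max
    return [(v ** 2 * f) / reference for v, f in vf_levels]


relative_power = relative_dynamic_power(VF_LEVELS)
for (v, f), p in zip(VF_LEVELS, relative_power):
    print(f"({v:.1f} V, {f:.1f} GHz): {p:.3f} of maximum dynamic power")
```

Under this assumption, the lowest level draws roughly a quarter of the dynamic power of the highest level while running at half the frequency, which is why energy rather than power alone is used for the comparisons that follow.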
6.2 Performance Analysis
Here, we present the energy savings, runtime, and application reliability improvement with the
proposed power manager and some other existing power management techniques.
6.2.1 Power Management at Different Abstraction Levels. The proposed technique focuses on
power management at the application level. However, it is also possible to perform power management
at a lower abstraction level (core level) or a higher abstraction level (system level or per-chip
level). As a case study, we present the impact of power management at different abstraction levels
for a four-core processor. For analysis, multi-threaded applications are chosen based on the manner
in which the workloads are distributed among cores. Two workload categories are chosen: (a)
tightly coupled and (b) loosely coupled workloads. Here, a tightly coupled workload indicates that
the workload of an application is evenly distributed among multiple cores, and a loosely coupled
workload indicates that the workload of an application is unevenly distributed among multiple
cores.
Fig. 4. Average power savings with proposed power management at different abstraction levels.
The normalized average power consumption at three different granularity levels for a microprocessor
running multi-threaded application(s) is shown in Figure 4. The observations are as follows:
• For loosely coupled multi-threaded applications, application-level power management has
better power savings than system-level power management if the applications are uncorrelated, i.e.,
dissimilar.
• If the workloads are loosely coupled and correlated, i.e., similar workloads, then system-level
and application-level power management achieve similar power savings.
• In the case of a single multi-threaded application (shown as single application in Figure 4) distributed
among all the cores, the power management achieves similar performance irrespective of granularity
if the application is tightly coupled.
• For a loosely coupled application, system-level and application-level power management
have similar performance.
As seen, per-core power management has better power savings; however, it adds
overhead, such as monitoring power regulators for each of the cores. System-level power management
has lower overhead and reduced power savings compared to per-core power management.
Per-application power management performs in between per-core and system-level
power management. As running multiple applications that are dissimilar in nature is more realistic
on multi-core systems, per-application power management is considered a better choice
here. Some recent works have also shown that application-level management is
optimal for future multi-core power management and has lower overhead than per-core
power management (Rahmani et al. 2017; Shafique et al. 2016), despite the power savings with per-core
management being higher.
6.2.2 Energy Savings. To consider the power savings as well as performance (timing), we
evaluate the effectiveness of the proposed power management technique in terms of energy savings
and compare the achieved energy savings with those of other techniques.
There are many prior techniques for power/energy management. We implemented a few, such as
Manoj et al. (2015), Rountree et al. (2011), Yang et al. (2015), and Zaman et al. (2015) (with minor
adaptations such as power management at application-level) for a fair comparison. The rationale
for choosing these is as follows. In Manoj et al. (2015), workload prediction using AutoRegressive
Moving Average (ARMA) and a Singular Value Decomposition (SVD)–based VF-level
assignment is carried out, which has shown better scalability for future multi-core systems. Machine
learning–equipped power management is proposed in Zaman et al. (2015), where SVM-based
regression for predicting workloads and SVM classifier–based VF-level assignment are employed.
The sparse encoding is not implemented, as the data is not as large as that in the original work.
Linear regression with offline learning or modeling-based workload prediction and VF-level assignment
is utilized in Yang et al. (2015) and Rountree et al. (2011), which is lightweight in nature.
Similar resemblances can be observed in other existing works.
Fig. 5. Average energy consumption with proposed power management for microprocessor with different
numbers of cores.
Figure 5 presents the normalized energy consumption for multi-core systems with 2, 4, 8, 16, and
32 cores. In Figure 5, the X-axis represents the number of cores on which the benchmark applications
are run and the Y-axis represents the normalized energy. In the legend of Figure 5, “Proposed,”
“Linear,” “SVM,” and “STM” represent the energy consumption with the proposed technique, linear
regression–based power management (Rountree et al. 2011; Yang et al. 2015), SVM (Zaman et al.
2015), and space-time multiplexing (Manoj et al. 2015)-based power management techniques, respectively.
For the experimental evaluation of the proposed and other power management works, the
benchmark applications are randomly assigned to cores.
The following observations can be made: For a system with a small number of cores (two cores),
the use of lightweight techniques (such as linear regression–based power management) is beneficial.
However, for a large number of cores, the proposed power manager performs better than
the other techniques. The rationale for these differences is as follows:
• For miniature systems with two cores or fewer, Q-learning adds higher computational
overhead, i.e., the computations required to perform power management can incur
more overhead than executing the workloads without power
management.
• For larger systems, the achieved energy savings outweigh the additional
overhead.
Fig. 6. Average application reliability with proposed power management and other power management
works.
These observations clearly indicate that the proposed technique is scalable and beneficial for
modern-day and future multi-core and many-core systems. On average, energy savings of 20%
are achieved with our proposed technique compared to linear regression–based power management
(Rountree et al. 2011; Yang et al. 2015) for a system with up to 32 cores. Similarly, average
energy savings of 11% and 7.7% are achieved with our proposed power management technique
compared to SVM (Zaman et al. 2015)-based and space-time multiplexing (Manoj et al. 2015)-based
power management techniques, respectively.
6.3 Application Reliability
The employed reinforcement learning–based power manager not only considers power or energy
savings as feedback (reward), but also considers the reliability of the application. As with the
energy savings, we compare the achieved application reliability with that of existing power management
schemes.
Figure 6 presents the achieved application reliability with the proposed RL-based power
management and other power management works. One can observe that existing power-centric or
performance-centric power management techniques adversely impact reliability as the energy
savings improve.
In contrast to the power-saving–oriented works, with the proposed power management,
the reliability is also enhanced together with energy savings.
In this work, the $\Delta$ of Equation (1) is set to 1, and $\lambda_0$ is set to $10^{-6}$, similar to Salehi et al. (2015).
Even under the optimal setting of a low functional vulnerability index (FVI = 1), the proposed
RL-based power management achieves higher reliability than prior techniques. In
comparison with prior techniques that consider reliability for power management, the proposed
technique has the advantage of learning how reliability varies with VF settings; this learning
capability enables the proposed application- and thermal-reliability-aware power management
to achieve higher reliability. In comparison to linear regression, SVM, and STM-based power management,
the proposed power management has 1.8×, 1.99×, and 2.08× lower variance in
reliability, respectively, on average, for a microprocessor with up to 32 cores executing PARSEC
applications. This is shown in Figure 6; lower variance indicates better stability.
Fig. 7. (a) Reduction in temperature with the proposed power management; (b) improvement in thermal
reliability of system.
6.4 Thermal Reliability
In addition to power savings and improvement in the application reliability, the proposed power
management scheme also considers the thermal reliability. This leads to improvement in the
thermal reliability of the multi-core system. The thermal map at chip level is obtained through the
McPAT tool. To obtain the thermal reliability at application-level granularity,
we consider the worst-case temperature for each application; i.e., for an application running on
(say) cores 1, 2, and 4, with core 4 having the maximum temperature among the three, we use core
4’s temperature to obtain the thermal reliability, accounting for the worst-case scenario. The temperature
reduction and thermal reliability improvements are shown in Figure 7. Figure 7(a) shows the
thermal map of a 16-core processor. One can observe a reduction in temperature with the proposed
power management. A temperature reduction of up to 4.926°C is observed. For the performed experiments
with up to 32 cores, a 2.193°C reduction across cores is achieved on average. As most
existing power management works focus on power savings and application reliability, we did not
compare the thermal savings with those works, for fairness. Likewise,
thermal management works are temperature-focused rather than power-saving-focused; hence, a
comparison with them would also be unfair.
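The worst-case temperature rule described above amounts to taking, for each application, the maximum temperature over the cores it runs on. A minimal sketch follows; the core temperatures and the application-to-core mapping are made up, with application "A" mirroring the cores 1, 2, and 4 example from the text.

```python
def app_worst_case_temps(core_temps_c, app_to_cores):
    """Return {application: hottest temperature among the cores it runs on}."""
    return {
        app: max(core_temps_c[core] for core in cores)
        for app, cores in app_to_cores.items()
    }


# Hypothetical per-core temperatures in degrees Celsius, keyed by core number.
core_temps_c = {1: 61.2, 2: 63.0, 3: 58.4, 4: 66.7}
# Hypothetical mapping: application "A" runs on cores 1, 2, and 4.
app_to_cores = {"A": [1, 2, 4], "B": [3]}

worst = app_worst_case_temps(core_temps_c, app_to_cores)
print(worst)
```

Using the maximum rather than the mean makes the derived thermal reliability conservative: an application is never credited with being cooler than its hottest core.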
In addition to the reduction in temperature, an improvement in thermal reliability is also observed, as
shown in Figure 7(b). On average, 99.73% thermal reliability is achieved with the proposed power
management, which is nearly 5% higher, on average, than that of the multi-core system without
any power management. Though the difference might look small, it
can grow when the system is run for longer periods of time, due to accumulated heat.
Thus, in addition to the energy savings and reliability enhancement, the proposed power
management scheme can also result in lower on-chip temperatures, leading to higher
efficiency.
6.5 Overhead Analysis
As the proposed power management technique involves switching and computations (needed to
predict the VF levels), it adds overheads to the system, which we discuss here. We measure the
execution time of the application without any power management technique and under different
power management techniques. The additional execution time can be considered the overhead
caused by the involved computations and VF switching. The average runtime for all the executed
benchmark applications on multi-core systems with 2 to 32 cores under different power management
techniques is outlined in Table 2, obtained from McPAT of SniperSim. Compared to a system
that has no power management, the proposed power management adds nearly 24% overhead in terms
of runtime. However, compared to power management techniques such as linear regression, SVM,
and STM, our proposed technique has 22.3%, 6%, and 5.4% reduced runtime, respectively. In the
experiments, the linear regression–based power management has to be performed with a large
order to achieve similar power savings, leading to larger runtime. The reduced runtime with our
proposed technique is due to the learning of application characteristics and reliability that is
embedded in the proposed power management. We anticipate that the runtime for SVM is lower than that of our
proposed technique due to the involved complexity.
Table 2. Average Runtime (in Seconds) for Applications Running on Multi-core System
No DVFS   Linear   SVM     STM     Proposed
0.101     0.149    0.118   0.130   0.124
7 CONCLUSION
Existing power management techniques perform power management under the constraints of
power or performance budgets. However, application reliability is impacted by lowering the voltage
and frequency, and thermal reliability worsens with an increase in voltage-frequency levels.
In response, we proposed an application- and thermal-reliability-aware reinforcement learning–based
multi-core power management technique. In the proposed power management technique,
the power trace monitored at application-level granularity is fed to the reinforcement learner
(Q-learner) along with the application and thermal reliability. The Q-learner optimizes the VF settings
for the next time period for the application, considering both reliability and power consumption
(defined in the reward function). With the proposed technique, energy savings of up to 20% on
average, no degradation in application reliability (up to 2.08× lower variation in application
reliability), up to 4.926°C temperature reduction, and lower runtime are achieved when compared with
existing power management techniques.
REFERENCES
A. Bartolini, M. Cacciari, A. Tilli, and L. Benini. 2013. Thermal and energy management of high-performance multi-cores:
Distributed and self-calibrating model-predictive controller. IEEE Trans. Parallel Distrib. Syst. 24, 1 (Jan. 2013), 170–183.
A. Bartolini et al. 2011. A distributed and self-calibrating model-predictive controller for energy and thermal management
of high-performance multi-cores. In Proceedings of the Design, Automation and Test in Europe Conference (DATE’11).
Richard Ernest Bellman. 2003. Dynamic Programming. Dover Publications, Incorporated.
R. Bergamaschi et al. 2008. Exploring power management in multi-core systems. In Proceedings of the Asia and South Pacific
Design Automation Conference.
Christian Bienia et al. 2008. The PARSEC Benchmark suite: Characterization and architectural implications. In Proceedings
of the International Conference on Parallel Architectures and Compilation Techniques.
D. Brooks, R. P. Dick, R. Joseph, and L. Shang. 2007. Power, thermal, and reliability modeling in nanometer-scale
microprocessors. IEEE Micro 27, 3 (May 2007), 49–62.
Trevor E. Carlson et al. 2014. An evaluation of high-level mechanistic core models. ACM Trans. Archit. Code Optim. 11, 3
(Aug. 2014), 28:1–28:25.
Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. 2004. Dynamic voltage and frequency scaling based on workload
decomposition. In Proceedings of the International Symposium on Low Power Electronics and Design.
Kihwan Choi, R. Soma, and M. Pedram. 2005. Fine-grained dynamic voltage and frequency scaling for precise energy and
performance tradeoff based on the ratio of off-chip access to on-chip computation times. IEEE Trans. Comput.-Aided
Des. Integr. Circ. Syst. 24, 1 (Jan. 2005), 18–28.
Foad Dabiri, Ani Nahapetian, Miodrag Potkonjak, and Majid Sarrafzadeh. 2007. Soft error-aware power optimization using
gate sizing. In Integrated Circuit and System Design: Power and Timing Modeling, Optimization and Simulation
(PATMOS’07), N. Azémard and L. Svensson (Eds.). Lecture Notes in Computer Science, Vol. 4644. Springer, Berlin, Heidelberg.
B. Dietrich et al. 2010. LMS-based low-complexity game workload prediction for DVFS. In Proceedings of the IEEE
International Conference on Computer Design.
A. Ejlali, B. M. Al-Hashimi, and P. Eles. 2012. Low-energy standby-sparing for hard real-time systems. IEEE Trans. Comput.-
Aided Des. Integr. Circ. Syst. 31, 3 (Mar. 2012), 329–342.
Hadi Esmaeilzadeh et al. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the International Symposium
on Computer Architecture.
D. Gnad, M. Shafique, F. Kriebel, S. Rehman, and J. Henkel. 2015. Hayat: Harnessing dark silicon and variability for aging
deceleration and balancing. In Proceedings of the Design Automation Conference (DAC’15).
Wei Huang, Shougata Ghosh, Siva Velusamy, Karthik Sankaranarayanan, Kevin Skadron, and Mircea R. Stan. 2006. HotSpot:
A compact thermal modeling methodology for early-stage VLSI design. IEEE Trans. Very Large Scale Integr. Syst. 14, 5
(May 2006), 501–513.
H. Jung and M. Pedram. 2010. Supervised learning based power management for multicore processors. IEEE Trans. Comput.-
Aided Des. Integr. Circ. Syst. 29, 9 (Sept. 2010), 1395–1408. DOI:https://doi.org/10.1109/TCAD.2010.2059270
N. Kapadia and S. Pasricha. 2015. VARSHA: Variation and reliability-aware application scheduling with adaptive parallelism
in the dark-silicon era. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE’15).
J. S. Lee, K. Skadron, and S. W. Chung. 2010. Predictive temperature-aware DVFS. IEEE Trans. Comput. 59, 1 (Jan. 2010),
127–133.
S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An integrated power, area, and
timing modeling framework for multicore and manycore architectures. In Proceedings of the IEEE/ACM International
Symposium on Microarchitecture (MICRO’09).
W. Liu, Y. Tan, and Q. Qiu. 2010. Enhanced Q-learning algorithm for dynamic power management with performance
constraint. In Proceedings of the Design, Automation and Test in Europe Conference (DATE’10). 602–605.
DOI:https://doi.org/10.1109/DATE.2010.5457135
Shiting (Justin) Lu, Russell Tessier, and Wayne Burleson. 2015. Reinforcement learning for thermal-aware many-core task
allocation. In Proceedings of the Great Lakes Symposium on VLSI.
M. A. Makhzan, A. Khajeh, A. Eltawil, and F. Kurdahi. 2007. Limits on voltage scaling for caches utilizing fault tolerant
techniques. In Proceedings of the International Conference on Computer Design.
P. D. Sai Manoj, A. Jantsch, and M. Shafique. 2018. SmartDPM: Dynamic power management using machine learning for
multi-core microprocessors. J. Low-Power Electron. 14, 4 (Dec. 2018).
P. D. Sai Manoj, J. Lin, S. Zhu, Y. Yin, X. Liu, X. Huang, C. Song, W. Zhang, M. Yan, Z. Yu, and H. Yu. 2017. A scalable
network-on-chip microprocessor with 2.5D integrated memory and accelerator. IEEE Trans. Circ. Syst. I: Reg. Papers 64,
6 (June 2017), 1432–1443.
P. D. Sai Manoj, H. Yu, H. Huang, and D. Xu. 2016. A Q-Learning based self-adaptive I/O communication for 2.5D integrated
many-core microprocessor and memory. IEEE Trans. Comput. 65, 4 (Apr. 2016), 1185–1196.
P. D. Sai Manoj, H. Yu, Y. Shang, C. S. Tan, and S. K. Lim. 2013. Reliable 3-D clock-tree synthesis considering nonlinear
capacitive TSV model with electrical-thermal-mechanical coupling. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst.
32, 11 (Nov. 2013), 1734–1747.
P. D. Sai Manoj, H. Yu, and K. Wang. 2015. 3D Many-core microprocessor power management by space-time multiplexing
based demand-supply matching. IEEE Trans. Comput. 64, 11 (Nov. 2015), 3022–3036.
S. S. Mukherjee, M. Kontz, and S. K. Reinhardt. 2002. Detailed design and evaluation of redundant multi-threading alterna-
tives. In Proceedings of the International Symposium on Computer Architecture.
S. Pagani et al. 2017. Energy efficiency for clustered heterogeneous multicores. IEEE Trans. Parallel Distrib. Syst. 28, 5 (May
2017), 1315–1330.
S. Pagani, H. Khdr, W. Munawar, J. Chen, M. Shafique, M. Li, and J. Henkel. 2014. TSP: Thermal safe power - Efficient power
budgeting for many-core systems in dark silicon. In Proceedings of the International Conference on Hardware/Software
Codesign and System Synthesis.
S. Pagani, P. D. Sai Manoj, A. Jantsch, and J. Henkel. 2018. Machine learning for power, energy, and thermal management
on multi-core processors: A survey. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. PP, 1–17. DOI:10.1109/TCAD.2018.2878168
X. Qi, D. Zhu, and H. Aydin. 2010. Global reliability-aware power management for multiprocessor real-time systems. In
Proceedings of the IEEE International Conference on Embedded and Real-Time Computing Systems and Applications.
Amir M. Rahmani et al. 2017. Reliability-aware runtime power management for many-core systems in the dark silicon era.
IEEE Trans. Very Large Scale Integr. Syst. 25, 2 (Feb. 2017), 427–440.
Krishna K. Rangan, Gu-Yeon Wei, and David Brooks. 2009. Thread motion: Fine-grained power management for multi-core
systems. SIGARCH Comput. Archit. News 37, 3 (Jun. 2009), 302–313.
B. Rountree et al. 2011. Practical performance prediction under dynamic voltage frequency scaling. In Proceedings of the
International Green Computing Conference and Workshops.
M. Salehi et al. 2015. dsReliM: Power-constrained reliability management in dark-silicon many-core chips under pro-
cess variations. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS’15).
M. Salehi, M. K. Tavana, S. Rehman, F. Kriebel, M. Shafique, A. Ejlali, and J. Henkel. 2015. DRVS: Power-efficient reliability
management through dynamic redundancy and voltage scaling under variations. In Proceedings of the International
Symposium on Low Power Electronics and Design.
Avesta Sasan, Houman Homayoun, Ahmed Eltawil, and Fadi Kurdahi. 2009. A fault tolerant cache architecture for sub
500mV operation: Resizable data composer cache (RDC-cache). In Proceedings of the International Conference on Com-
pilers, Architecture, and Synthesis for Embedded Systems.
N. Seifert, B. Gill, S. Jahinuzzaman, J. Basile, V. Ambrose, Q. Shi, R. Allmon, and A. Bramnik. 2012. Soft error susceptibilities
of 22 nm tri-gate devices. IEEE Trans. Nucl. Sci. 59, 6 (Dec. 2012), 2666–2673.
Muhammad Shaque, Siddharth Garg, Jörg Henkel, and Diana Marculescu. 2014. The EDA challenges in the dark silicon
era: Temperature, reliability, and variability perspectives. In Proceedings of the Design Automation Conference.
M. Shaque, A. Ivanov, B. Vogel, and J. Henkel. 2016. Scalable power management for on-chip systems with malleable
applications. IEEE Trans. Comput. 65, 11 (Nov. 2016), 3398–3412.
H. Shen, J. Lu, and Q. Qiu. 2012. Learning-based DVFS for simultaneous temperature, performance and energy management.
In Proceedings of the International Symposium on Quality Electronic Design (ISQED’12).
Hao Shen, Ying Tan, Jun Lu, Qing Wu, and Qinru Qiu. 2013. Achieving autonomous power management using reinforcement
learning. ACM Trans. Des. Auto. Electron. Syst. 18, 2 (Apr. 2013), 24:1–24:32. DOI:https://doi.org/10.1145/2442087.2442095
A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors. 2007. Using process-level redundancy to exploit multiple
cores for transient fault tolerance. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and
Networks.
R. Singhal. 2008. Inside Intel® Core microarchitecture (Nehalem). In Proceedings of the IEEE Hot Chips Symposium.
J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers. 2004. The impact of technology scaling on lifetime reliability. In Proceedings
of the International Conference on Dependable Systems and Networks.
Jayanth Srinivasan, S. V. Adve, Pradip Bose, and J. A. Rivers. 2005. Lifetime reliability: Toward an architectural solution.
IEEE Micro 25, 3 (May 2005), 70–80.
K. Swaminathan, N. Chandramoorthy, C. Y. Cher, R. Bertran, A. Buyuktosunoglu, and P. Bose. 2017. BRAVO: Balanced
reliability-aware voltage optimization. In Proceedings of the IEEE International Symposium on High Performance Com-
puter Architecture (HPCA’17).
Ying Tan, Wei Liu, and Qinru Qiu. 2009. Adaptive power management using reinforcement learning. In Proceedings of the
International Conference on Computer-Aided Design (ICCAD’09). 461–467. DOI:https://doi.org/10.1145/1687399.1687486
S. J. Tarsa, A. P. Kumar, and H. T. Kung. 2014. Workload prediction for adaptive power scaling using deep learning. In
Proceedings of the IEEE International Conference on IC Design Technology.
Yanzhi Wang et al. 2011. Deriving a near-optimal power management policy using model-free reinforcement learning and
Bayesian classication. In Proceedings of the 48th Design Automation Conference (DAC’11).
Y. Wang and M. Pedram. 2016. Model-free reinforcement learning and Bayesian classification in system-level power
management. IEEE Trans. Comput. 65, 12 (Mar. 2016), 3713–3726.
E. Wu, J. Suñé, W. Lai, E. Nowak, J. McKenna, A. Vayshenker, and D. Harmon. 2002. Interplay of voltage and temperature
acceleration of oxide breakdown for ultra-thin gate oxides. Solid-State Electron. 46, 11 (2002), 1787–1798.
K. Wu and D. Marculescu. 2014. Power-planning-aware soft error hardening via selective voltage assignment. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 22, 1 (Jan. 2014), 136–145.
S. S. Wu, K. Wang, P. D. Sai Manoj, T. Y. Ho, M. Yu, and H. Yu. 2014. A thermal resilient integration of many-core mi-
croprocessors and main memory by 2.5D TSI I/Os. In Proceedings of the Design, Automation Test in Europe Conference
Exhibition (DATE’14).
D. Xu, N. Yu, P. D. Sai Manoj, K. Wang, H. Yu, and M. Yu. 2015. A 2.5-D Memory-logic integration with data-pattern-aware
memory controller. IEEE Design Test 32, 4 (Aug. 2015), 1–10.
X. Xu, K. Teramoto, A. Morales, and H. H. Huang. 2013. DUAL: Reliability-aware power management in data centers. In
Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing.
Sheng Yang et al. 2015. Adaptive energy minimization of embedded heterogeneous systems using regression-based learn-
ing. In Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation.
M. Zaman et al. 2015. Workload characterization and prediction: A pathway to reliable multi-core systems. In Proceedings
of the IEEE International On-Line Testing Symposium.
Dakai Zhu, R. Melhem, and D. Mosse. 2004. The effects of energy management on reliability in real-time embedded systems.
In Proceedings of the IEEE/ACM International Conference on Computer Aided Design.
Received July 2018; revised December 2018; accepted March 2019
ACM Journal on Emerging Technologies in Computing Systems, Vol. 15, No. 4, Article 33. Pub. date: October 2019.
... Existing research has primarily concentrated on improving the health of individual cores [2], minimizing mean time to failure (MTTF) [4], and enhancing the average MTTF of all cores in a manycore system [5], where architecture-level approaches have been extensively explored to achieve these improvements. Notably, dynamic voltage and frequency scaling (DVFS) [6] [7], dynamic thermal management (DTM) [8] [9], and dynamic reliability management (DRM) [10] have been investigated. Moreover, as task-to-core mapping is one of the most crucial issues in manycores [11], system-level approaches provide dynamic opportunities by controlling task-to-core mappings and per-core operating frequencies to address aging and reliability management challenges [2] [3] [5]. ...
... Based on HiMap, a hierarchical mapping solution involving voltage and frequency (VF) selection to enhance the lifetime reliability of processors was proposed in [3]. A power management method for manycore systems based on conscious reinforcement learning was proposed in [7]. The primary objective of this method is to create a tradeoff between functional and thermal reliability, while simultaneously achieving power savings and maintaining performance levels. ...
... This target is achieved by iteratively interacting with a dynamic or uncertain environment, employing a trial-and-error approach. In this framework, the interaction between the learning agent and the environment is represented by a finite state space S, a set of available actions A, and a reward function R: S×A → ℝ [7]. Through this process, the RL agent learns how to make effective decisions and take appropriate actions based on the observed state and associated rewards. ...
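The formalism in the excerpt above (a state space S, an action set A, and a reward function over S×A) can be illustrated with a minimal tabular Q-learning sketch. Everything specific in it is an assumption made for illustration and is not taken from [7] or from the paper: the two-state load model, the two VF-level actions, the reward values, and the names `reward`, `step`, `q_learning`, and `greedy_policy` are all hypothetical, as are the learning rate `alpha`, discount `gamma`, and exploration rate `epsilon`.

```python
import random

# Hypothetical toy problem: states are load levels, actions are VF levels.
STATES = ["low_load", "high_load"]
ACTIONS = ["low_vf", "high_vf"]


def reward(state, action):
    """Assumed reward R(s, a): +1 if the VF level matches the load, else -1."""
    matched = (state == "low_load") == (action == "low_vf")
    return 1.0 if matched else -1.0


def step(state, action, rng):
    """Assumed environment: the next load level is drawn uniformly at random."""
    return rng.choice(STATES), reward(state, action)


def q_learning(steps=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table by trial and error with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = rng.choice(STATES)
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the current Q-table.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, r = step(state, action, rng)
        # Standard Q-learning update toward r + gamma * max_a' Q(s', a').
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_policy(q):
    """Pick the highest-valued action in each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

In this toy setting the learned greedy policy pairs each load level with the matching VF level. A real power manager would replace `step` and `reward` with measurements from the system, which is where terms for energy, temperature, and reliability would enter.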
Preprint
The increasing scale of manycore systems poses significant challenges in managing reliability while meeting performance demands. At the same time, these systems become more susceptible to aging mechanisms such as negative-bias temperature instability (NBTI), hot carrier injection (HCI), and thermal cycling (TC), as well as electromigration (EM). In this paper, we propose a reinforcement learning (RL)-based task mapping method that improves the reliability of manycore systems with respect to these aging mechanisms. The method consists of three steps: bin packing, task-to-bin mapping, and task-to-core mapping. In the first step, density-based spatial clustering of applications with noise (DBSCAN) groups the cores into clusters (bins) based on core temperature. The Q-learning algorithm is then used for the two latter steps to map an arriving task to a core so that the thermal variation among the bins is minimized. In contrast to state-of-the-art works, the proposed method runs entirely at runtime and requires no parameters to be calculated offline. The effectiveness of the proposed technique is evaluated on 16-, 32-, and 64-core systems using SPLASH2 and PARSEC benchmark suite applications. The results demonstrate up to a 27% increase in the mean time to failure (MTTF) compared to state-of-the-art task mapping techniques.
... Generally, addressing the aging and reliability challenges in a manycore system while meeting power and performance requirements is considered as a complex problem [9]. Several approaches have been proposed to increase the reliability of manycore systems, such as Dynamic Voltage and Frequency Scaling (DVFS) [10] [11], and Dynamic Thermal Management (DTM) [12] [13]. However, most of these works have focused on the NBTI and EM mechanisms of the aging phenomena, while only a few works have also considered the thermal cycling mechanism in their reliability management method. ...
... In this section, we review prior research on manycore systems, specifically, the works that focused on increasing the lifetime of these systems, considering the thermal cycling phenomenon. DVFS is a well-known technique that can improve the lifetime of the system since aging is a strong function of the supply voltage and temperature [10] [11]. Pourmohseni et. ...
Article
Reliability management is one of the primary concerns in manycore systems design. Aging mechanisms such as Negative-Bias Temperature Instability (NBTI), Electromigration (EM), and thermal cycling (TC) can reduce the reliability of these systems. However, state-of-the-art works have mainly focused on NBTI and EM, whereas few have considered the thermal cycling effect, which can significantly shorten the system's lifetime. Moreover, the thermal effects of adjacent cores on each other may also influence the system's Mean Time to Failure (MTTF). This paper introduces a new technique to manage the reliability of manycore systems. The technique considers thermal cycling, adjacency of cores, and process variation-induced diversity of operating frequencies. It uses two levels of task mapping to improve system lifetime. At the first level, cores with close temperatures are packed into the same bin, and an arriving task is assigned to a bin with a similar temperature. At the second level, the task is assigned to a core inside the selected bin based on performance requirements and the core frequency. Compared to conventional thermal-cycling-aware techniques, the proposed method operates at a higher level (the bin level) to reduce the thermal variations of cores inside a bin, and improves the system's thermal-cycling MTTF (MTTF_TC), making it a promising solution for manycore systems. The efficacy of the proposed technique is evaluated on 16-, 32-, 64-, and 256-core systems using SPLASH2 and PARSEC benchmark suite applications. The results show up to a 20% increase in MTTF_TC compared to conventional thermal-cycling-aware task mapping techniques.
... During each learning episode, the agent interacts with the environment, observes rewards, and updates the Q-value estimates. The key update equations, based on value iteration, are as follows [15]: ...
Preprint
Embedded systems power many modern applications and must often meet strict reliability, real-time, thermal, and power requirements. Task replication can improve reliability by duplicating a task's execution to handle transient and permanent faults, but blindly applying replication often leads to excessive overhead and higher temperatures. Existing design-time methods typically choose the number of replicas based on worst-case conditions, which can waste resources under normal operation. In this paper, we present RL-TIME, a reinforcement learning-based approach that dynamically decides the number of replicas according to actual system conditions. By considering both the reliability target and a core-level Thermal Safe Power (TSP) constraint at run-time, RL-TIME adapts the replication strategy to avoid unnecessary overhead and overheating. Experimental results show that, compared to state-of-the-art methods, RL-TIME reduces power consumption by 63%, increases schedulability by 53%, and respects TSP 72% more often.
Preprint
Optimizing task-to-core allocation can substantially reduce power consumption in multi-core platforms without degrading user experience. However, many existing approaches overlook critical factors such as parallelism, compute intensity, and heterogeneous core types. In this paper, we introduce a statistical learning approach for feature selection that identifies the most influential features - such as core type, speed, temperature, and application-level parallelism or memory intensity - for accurate environment modeling and efficient energy optimization. Our experiments, conducted with state-of-the-art Linux governors and thermal modeling techniques, show that correlation-aware task-to-core allocation lowers energy consumption by up to 10% and reduces core temperature by up to 5 degrees Celsius compared to random core selection. Furthermore, our compressed, bootstrapped regression model improves thermal prediction accuracy by 6% while cutting model parameters by 16%, yielding an overall mean square error reduction of 61.6% relative to existing approaches. We provided results based on superscalar Intel Core i7 12th Gen processors with 14 cores, but validated our method across a diverse set of hardware platforms and effectively balanced performance, power, and thermal demands through statistical feature evaluation.
Preprint
Generating realistic and diverse unstructured data is a significant challenge in reinforcement learning (RL), particularly in few-shot learning scenarios where data is scarce. Traditional RL methods often rely on extensive datasets or simulations, which are costly and time-consuming. In this paper, we introduce a distribution-aware flow matching method designed to generate synthetic unstructured data tailored specifically to an application of few-shot RL: Dynamic Voltage and Frequency Scaling (DVFS) on embedded processors. This method leverages the sample efficiency of flow matching and incorporates statistical learning techniques such as bootstrapping to improve the generalization and robustness of the latent space. Additionally, we apply feature weighting through Random Forests to prioritize critical data aspects, thereby improving the precision of the generated synthetic data. This approach not only mitigates the overfitting and data correlation problems that unstructured data causes in traditional Model-Based RL but also aligns with the Law of Large Numbers, ensuring convergence to true empirical values and an optimal policy as the number of samples increases. Through extensive experimentation on an application of DVFS for low-energy processing, we demonstrate that our method provides stable convergence based on the max Q-value while enhancing frame rate by 30% in the very first timestamps, making this RL model efficient in resource-constrained environments.
Article
The multi-core method in the learning mode is an important method that can be used in the learning process. When we encounter more complex problems or need to extract a large amount of data, the multi-core mode learning method becomes very important. To judge the practicability of the multi-core method, the first thing to consider is the kernel function. Using the kernel function, we can study linear and nonlinear tasks. On this basis, the multi-core learning mode appears. More and more people focus on the multi-core learning mode, which has become a main research direction. In the multi-core mode, we first optimize the base core to obtain a more advanced core, so that we can solve the problem of choosing which function to calculate. What needs special attention is that two cores can be fused with each other: two adjacent layers in the multi-core mode fuse information together. This is an important feature of the multi-core learning mode, and it makes the mode valuable in both use and research. Through the continuous efforts of many researchers, the multi-core learning model has been greatly developed in various fields, but it also faces many problems. For example, the calculation method is single and the calculation time is long. Therefore, we need to develop more diverse learning methods to improve the efficiency of calculation. Only with a more complete system can the multi-core learning mode be introduced into more fields, so that more people can experience its advantages. In this report, we focus on the main ideas for designing a new multi-core calculation method and then continue to optimize the model. We also analyze the music quality of colleges and universities. According to the characteristics of music quality, we research the status and learning ability of college students in music, draw conclusions, and combine real-world factors to find the problems that college students encounter in learning music and cultivating music quality. Different correction methods and suggestions are given for different students' situations, which are very important for the development of students.
Article
Due to the high integration density and the roadblock of voltage scaling, modern multi-core processors experience higher power densities than previous technology nodes. When unattended, this issue might lead to temperature hot spots that in turn may cause non-uniform aging, accelerate chip failure, impair reliability, and reduce the performance of the system. This paper presents an overview of several research efforts that propose to use machine learning techniques for power and thermal management on single-core and multi-core processors. Traditional power and thermal management techniques rely on a certain a-priori knowledge of the chip’s thermal model, as well as information about the workloads/applications to be executed (e.g., transient and average power consumption). Nevertheless, this a-priori information is not always available, and even if it is, it cannot reflect the spatial and temporal uncertainties and variations that come from the environment, the hardware, or the workloads/applications. In contrast, techniques based on machine learning can potentially adapt to varying system conditions and workloads, learning from past events in order to improve themselves as the environment changes, resulting in improved management decisions.
Article
To address the power management challenge in multi-core microprocessors, we present a lightweight machine learning based dynamic power management (SmartDPM) scheme in which the voltage-frequency levels of the cores are dynamically adjusted along with online learning based workload prediction in an observer-controller loop. To enable scalability, SmartDPM employs a per-application autonomous power management policy, in which online machine learning principles are used to predict the workload and capture sporadic variations under the constraint of being accurate yet lightweight. Further, applications are assigned an appropriate voltage-frequency level for efficient power management. The learning helps in dynamically reducing prediction error. Compared to the non-DVFS implementation, SmartDPM achieves nearly 35% power saving, and nearly 15% higher power savings on average compared to existing machine learning based power management schemes, for a microprocessor with up to 32 cores.
Article
This paper presents a 2.5D integrated microprocessor die, memory die, and accelerator die with 2.5D silicon interposer I/Os. The use of such 2.5D silicon interposer I/Os provides a scalable interconnection for core-core (up to 32 cores), core-memory (4x storage capacity) and core-accelerator (4.4x speedup in H.264 decoder) communication. The 2.5D integrated chip was implemented in a GF 65 nm process, with the multicore microprocessor operated at 500 MHz under a 1.2 V supply with 1.08 W power dissipation. A pair of 8 Gbps 2.5D silicon interposer I/Os is designed for each of 12 inter-die communication channels, achieving a bandwidth of 24 GBps with 7.5 pJ/bit energy efficiency. As a result, the specified applications such as H.264 video data analytics and AES encryption can achieve significant improvements in throughput and energy efficiency.
Article
Heterogeneous multicore systems clustered in multiple Voltage Frequency Islands (VFIs) are the next-generation solution for power and energy efficient computing systems. Due to the heterogeneity, the power consumption and execution time of a task change not only with Dynamic Voltage and Frequency Scaling (DVFS), but also according to the task-to-island assignment, presenting major challenges for power management and energy minimization techniques. This paper focuses on energy minimization of periodic real-time tasks (or performance-constrained tasks) on such systems, in which the cores in an island are homogeneous and share the same voltage and frequency, but different islands have different types and numbers of cores and can be executed at other voltages and frequencies. We present an efficient algorithm to minimize the total energy consumption while satisfying the timing constraints of all tasks. Our technique consists of the coordinated selection of the voltage and frequency levels for each island, together with a task partitioning strategy that considers the energy consumption of the task executing on different islands and at different frequencies, as well as the impact of the frequency and the underlying core architecture on the resulting execution time. Every task is then mapped to the most energy efficient island for the selected voltage and frequency levels, and to a core inside the island such that the workloads of the cores in a VFI are balanced. We experimentally evaluate our technique and compare it to state-of-the-art solutions, resulting on average in 25% less energy consumption (and up to 87% in some cases), while guaranteeing that all tasks meet their deadlines.
Conference Paper
Due to the tight power envelope, in future technology nodes it is envisaged that not all cores in a many-core chip can be simultaneously powered on (at full performance level). The power-gated cores are referred to as Dark Silicon. At the same time, growing reliability issues due to process variations and soft errors challenge the cost-effective deployment of future technology nodes. This paper presents a reliability management system for Dark Silicon chips (dsReliM) that optimizes for reliability of on-chip systems while jointly accounting for soft errors, process variations and the thermal design power (TDP) constraint. Towards the TDP-constrained reliability optimization, dsReliM leverages multiple reliable application versions that can potentially execute on different cores with frequency variations and supporting different voltage-frequency levels, thus facilitating distinct power, reliability and performance tradeoffs at run time. Experiments show that our dsReliM system provides up to 20% reliability improvements under different TDP constraints when compared to a state-of-the-art technique. Also, compared to an ideal-case optimal solution, dsReliM deviates up to 2.5% in terms of reliability efficiency, but speeds up the reliability management decision time by a factor of up to 3100.
Conference Paper
This article consists of a collection of slides from the author's conference presentation. Some of the specific topics discussed include: Intel® Core™ microarchitecture (Nehalem) Philosophy; CPU Core Features; New Platform Architecture; and Power Management.
Article
Power management of networked many-core systems with runtime application mapping becomes more challenging in the dark silicon era. It necessitates considering network characteristics at runtime to achieve better performance while honoring the peak power upper bound. On the other hand, power management has a direct effect on chip temperature, which is the main driver of the aging effects. Therefore, alongside performance fulfillment, the controlling mechanism must also consider the current cores' reliability in its actuator manipulation to enhance the overall system lifetime in the long term. In this paper, we propose a multiobjective dynamic power management technique that uses current power consumption and other network characteristics including the reliability of the cores as the feedback while utilizing fine-grained voltage and frequency scaling and per-core power gating as the actuators. In addition, disturbance rejecter and reliability balancer are designed to help the controller to better smooth power consumption in the short term and reliability in the long term, respectively. Simulations of dynamic workloads and mixed criticality application profiles show that our method not only is effective in honoring the power budget while considerably boosting the system throughput, but also increases the overall system lifetime by minimizing aging effects by means of power consumption balancing.