A Scenario-based Run-time Task Mapping
Algorithm for MPSoCs
Wei Quan†,‡ and Andy D. Pimentel†
†Informatics Institute, University of Amsterdam, The Netherlands
‡School of Computer Science, National University of Defense Technology, Hunan, China
{w.quan,a.d.pimentel}@uva.nl, quanwei02@gmail.com
ABSTRACT
The application workloads in modern MPSoC-based embedded sys-
tems are becoming increasingly dynamic. Different applications concurrently execute and contend for resources in such systems, which can cause significant changes in the intensity and nature of the workload demands over time. To cope with the dynamism
of application workloads at run time and improve the efficiency
of the underlying system architecture, this paper presents a novel
scenario-based run-time task mapping algorithm. This algorithm
combines a static mapping strategy based on workload scenarios
and a dynamic mapping strategy to achieve an overall improvement
of system efficiency. We evaluated our algorithm using a homoge-
neous MPSoC system with three real applications. From the re-
sults, we found that our algorithm achieves an 11.3% performance
improvement and a 13.9% energy saving compared to running the applications without using any run-time mapping algorithm. Compared to three other, well-known run-time mapping algorithms, our algorithm finds mappings of higher quality while also reducing the overheads relative to most of these algorithms.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Performance Attributes
General Terms
Algorithm, Design, Performance
Keywords
Embedded systems, KPN, MPSoC, task mapping, simulation
1. INTRODUCTION
Modern embedded systems, which are more and more based on
MultiProcessor System-on-Chip (MPSoC) architectures, often re-
quire supporting an increasing number of applications and stan-
dards, where multiple applications can run simultaneously.

Figure 1: Intra-application scenario performance of MJPEG (MJPEG performance in simulation cycles per intra-application scenario ID).

For each single application, there are typically also different execution modes (or program phases) with different requirements. For
example, a video application could dynamically lower its resolu-
tion to decrease its computational demands in order to save battery.
As a consequence, the behavior of application workloads execut-
ing on the embedded system can change dramatically over time.
Here, one can distinguish two forms of dynamic application behav-
ior: inter-application dynamism and intra-application dynamism.
These forms of dynamism are often captured using scenarios [13,
8]. This means that there are two different kinds of scenarios: inter-
application scenarios to describe the simultaneously running appli-
cations in the system, and intra-application scenarios that define the
different execution modes for each application. The combinations of these inter- and intra-application scenarios are called workload
scenarios, and specify the application workload in terms of the dif-
ferent applications that are concurrently executing and the mode of
each application.
At design time of an embedded system, a designer could aim
at finding the optimal mapping of application tasks to MPSoC pro-
cessing resources for each inter- and intra-application scenario. How-
ever, when the number of applications and application modes in-
creases, the total number of workload scenarios will explode exponentially. Consider, e.g., 10 applications with 5 execution modes each: since each application is either inactive or running in one of its 5 modes, there will be $6^{10} \approx 60$ million workload scenarios. If finding the optimal mapping for a scenario takes one second at design time, then one would need nearly two years to obtain
all the optimal mappings. Moreover, storing all these optimal map-
pings such that they can be used at run time by the system to remap
tasks when a new scenario is detected would also be unrealistic as
this would take up too much memory storage.
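To make this scenario explosion concrete, the back-of-the-envelope calculation below (a sketch in Python; the exact count depends on whether the all-idle scenario is counted) reproduces the numbers used above.

apps = 10          # number of applications
modes = 5          # execution modes per application

# Each application is either inactive or runs in one of its 5 modes,
# giving 6 possibilities per application.
scenarios = (modes + 1) ** apps          # 60,466,176, i.e. roughly 60 million

seconds_per_mapping = 1                  # assumed design-time DSE cost per scenario
years = scenarios * seconds_per_mapping / (3600 * 24 * 365)
print(scenarios, round(years, 1))        # 60466176, ~1.9 years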
An approach to solve this problem is by clustering workload sce-
narios and only storing a single mapping per cluster of workload
scenarios to facilitate run-time mapping [8]. Such clustering significantly reduces the storage space needed for the mappings.
Moreover, so-called scenario-based design space exploration [17]
can be deployed to efficiently find these mappings by only evaluating a representative subset of scenarios for each cluster.

Figure 2: KPN for MJPEG application.

In this paper, we consider a clustering method¹ in which we find and store
a single mapping for each inter-application scenario that yields, on
average, best performance for all possible intra-application scenar-
ios within the inter-application scenario. However, as the behavior of a Motion-JPEG (MJPEG) encoder application in Figure 1 shows, representing an entire inter-application scenario by such a single mapping leads to considerable performance variation across the different intra-application scenarios within that inter-application scenario. In this particular example, the inter-
application scenario contains three simultaneously running multi-
media applications: MJPEG, a MP3 decoder, and a Sobel filter for
edge detection in images. The use of cluster-level mappings (i.e.,
mappings found to be good for an entire cluster of workload scenar-
ios) can provide a run-time mapping system with enough informa-
tion to quickly find an adequate mapping for a detected workload
scenario, but it will not immediately lead to finding the optimal sys-
tem mapping for any identified workload scenario. Therefore, we
propose a novel run-time Scenario-based Task Mapping algorithm
(STM) that uses the cluster-level mapping information derived from
design-time design space exploration (DSE) but, additionally, per-
forms run-time mapping optimization by continuously monitoring
the system and trying to perform (relatively small) mapping cus-
tomizations to gradually further improve the system performance.
The remainder of this paper is organized as follows. Section 2
gives some prerequisites and the problem definition for this paper.
Section 3 provides a detailed description of the scenario-based run-
time mapping algorithm. Section 4 introduces the experimental
environment and presents the results of our experiments. Section 5
discusses related work, after which Section 6 concludes the paper.
2. PREREQUISITES AND PROBLEM DEFINITION
In this section, we explain the necessary prerequisites for this
work and provide a detailed problem definition.
2.1 Application Model
In this paper, we target the multimedia application domain, as
was already illustrated in Figure 1. For this reason, we use the
Kahn Process Network (KPN) model of computation [11] to spec-
ify application behaviour since this model of computation fits well
to the streaming behaviour of multimedia applications. In a KPN,
an application is described as a network of concurrent processes
that are interconnected via FIFO channels. This means that an
application can be represented as a directed graph $KPN = (P, F)$, where $P$ is the set of processes (tasks) $p_i$ in the application and $f_{ij} \in F$ represents the FIFO channel between two processes $p_i$ and $p_j$. Figure 2 shows the KPN of the MJPEG application.
¹We note, however, that other clustering methods would also be possible and that our run-time mapping algorithm is independent of the clustering method used.
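As a concrete illustration of this application model, the sketch below gives one possible Python representation of a KPN as a directed graph of processes and FIFO channels; the process names are invented for illustration and do not reproduce the actual MJPEG network of Figure 2.

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fifo:
    src: str  # producing process p_i
    dst: str  # consuming process p_j


@dataclass
class KPN:
    """Directed graph KPN = (P, F): a set of processes P connected by FIFO channels F."""
    processes: set = field(default_factory=set)  # P
    fifos: list = field(default_factory=list)    # F

    def add_fifo(self, src, dst):
        self.processes.update({src, dst})
        self.fifos.append(Fifo(src, dst))


# Illustrative pipeline (hypothetical process names):
app = KPN()
app.add_fifo("video_in", "dct")
app.add_fifo("dct", "quant")
app.add_fifo("quant", "vlc")
app.add_fifo("vlc", "video_out")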
2.2 Architecture Model
In this work, we restrict ourselves to homogeneous MPSoC tar-
get architectures². An architecture can be modeled as a graph $MPSoC = (PE, C)$, where $PE$ is the set of processing elements used in the architecture and $C$ is a multiset of pairs $c_{ij} = (pe_i, pe_j) \in PE \times PE$ representing a communication channel (like a bus, NoC, etc.) between processors $pe_i$ and $pe_j$. Combining the definition of application and architecture models, the computation cost of task (process) $p_i$ on processing element $pe_j$ is expressed as $T^j_i$ and the communication cost between tasks $p_i$ and $p_j$ on channel $c_{xy}$ is $C^{c_{xy}}_{ij}$.

²In subsequent work, we will show how scenario-based run-time mapping can also be applied to heterogeneous MPSoCs.
2.3 Task Mapping
The task mapping defines the relationship between the tasks in a KPN application and the underlying architecture resources. For a single application, given the KPN of this application and a target MPSoC, a correct mapping is a pair of unique assignments $(\mu: P \rightarrow PE, \; \eta: F \rightarrow C)$ such that $\forall f \in F: src(\eta(f)) = \mu(src(f)) \wedge dst(\eta(f)) = \mu(dst(f))$. In the case of a multi-application workload, the state of the simultaneously running applications, distinguished as inter- and intra-application scenarios, should be considered in the task mapping. Let $A = \{app_0, app_1, \ldots, app_m\}$ be the set of all applications that can run on the system, and $M_i = \{md^i_0, md^i_1, \ldots, md^i_n\}$ be the set of possible execution modes for $app_i \in A$. Then $SE = \{se_0, se_1, \ldots, se_{n_{inter}}\}$, with $se_i = \{app_0 = 0/1, \ldots, app_m = 0/1\}$ and $app_i \in A$, is the set of all inter-application scenarios. Furthermore, $sa^i_j = \{app_0 = md^0_{j_0}, \ldots, app_m = md^m_{j_m}\}$, with $app_i \in A \wedge (app_i = 1) \in se_i$ and $md^i_{j_x} \in M_i$, represents the j-th intra-application scenario in inter-application scenario $se_i \in SE$. The set of all workload scenarios can then be defined as the disjoint union $S = \biguplus_{i \in SE} SA_i$, with $SA_i = \{sa^i_1, sa^i_2, \ldots, sa^i_{n^i_{intra}}\}$.
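To illustrate the validity condition on $(\mu, \eta)$ above, the following sketch (hypothetical helper names, Python) checks that every FIFO channel is mapped onto a communication channel whose endpoint PEs coincide with the PEs of the channel's source and destination tasks.

def is_valid_mapping(fifos, mu, eta, channel_endpoints):
    """Check that for all f in F: src(eta(f)) == mu(src(f)) and dst(eta(f)) == mu(dst(f)).

    fifos:             iterable of (src_task, dst_task) tuples, i.e. the set F
    mu:                dict task -> PE                      (mu: P -> PE)
    eta:               dict (src_task, dst_task) -> channel (eta: F -> C)
    channel_endpoints: dict channel -> (pe_src, pe_dst)     (the multiset C)
    """
    for f in fifos:
        src_task, dst_task = f
        pe_src, pe_dst = channel_endpoints[eta[f]]
        if pe_src != mu[src_task] or pe_dst != mu[dst_task]:
            return False
    return True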
As already explained in the previous section, we propose to per-
form the run-time mapping of applications in two stages. In the
first stage, which is performed at design time, we cluster work-
load scenarios (similar to [8]) and perform DSE for each of these
scenario clusters to find a mapping that shows the best average per-
formance for that particular cluster. More specifically, in this paper,
we consider each $se_i \in SE$ as a different cluster of scenarios (i.e.,
we cluster all intra-application scenarios of an inter-application sce-
nario). The mappings derived from design-time DSE are stored so
they can be used by the run-time mapping algorithm to re-map ap-
plications when a workload scenario is detected that belongs to a
different scenario cluster. Since these statically determined map-
pings may not be optimal for the current active intra-application
scenario, the second stage of the run-time mapping algorithm tries
to perform (relatively small) mapping customizations to gradually
further improve the system performance. In our goal to optimize
mappings, we recognize two kinds of objectives: system-level ob-
jectives and application-dependent objectives. System-level objec-
tives, denoted as $O_\alpha = \{O_{\alpha_0}, O_{\alpha_1}, \ldots\}$, define the system-wide metrics such as system energy consumption, total system execution time, etc. Application-dependent objectives, denoted as $O_\beta = \{O_{\beta_0}, O_{\beta_1}, \ldots\}$, are mainly used to define the performance requirements of each separate application, like throughput or latency. As
will be explained in the next section, the first stage of our run-time
mapping approach uses system-level objectives to find mappings
per scenario cluster. Here, we use system energy consumption
and total workload scenario execution time as metrics: $E_{s_i}$, $s_i \in S$, represents the system energy consumption of workload scenario $s_i$ and $X_{s_i}$, $s_i \in S$, is the execution time of scenario $s_i$. For the second
stage, during which the mapping is gradually optimized, we ap-
ply application-specific objectives – in our case throughput require-
ments for each application – for the optimization process. However,
to measure the results of the run-time optimization process, we also
use the system-level metrics $E_{s_i}$ and $X_{s_i}$.
Under these definitions and given the $KPN = (P, F)$ of each application and an $MPSoC = (PE, C)$, our goal is to continuously customize the mapping at run time such that the system-level and/or application-specific objectives are satisfied under every workload scenario $s_i \in S$.
3. SCENARIO-BASED TASK MAPPING
The STM algorithm, which is outlined in Algorithm 1, can be
divided into a static part and a dynamic part. The static part is
used to capture application dynamism at the granularity of inter-
application scenarios. For each inter-application scenario $se_i \in SE$, we have determined – using design-time DSE – a mapping that on average performs best for all intra-application scenarios $SA_i$ of $se_i$. That is, for each $se_i$ we search for a mapping by solving the following multi-objective optimisation problem:

$$\min\Big[\sum_{sa^i_j \in SA_i} E_{sa^i_j}, \;\; \sum_{sa^i_j \in SA_i} X_{sa^i_j}\Big] \qquad (1)$$
To this end, we have deployed the scenario-based DSE approach
presented in [17], which is based on the well-known NSGA-II ge-
netic algorithm and allows for effectively pruning the design space
by only evaluating a representative subset of intra-application sce-
narios of $SA_i$ for each $se_i \in SE$. As this design-time DSE stage is
not the main focus of this paper, we refer the interested reader to
[17] for further details. The mappings derived from this design-
time DSE are used by the STM algorithm as shown in lines 1-3 of
Algorithm 1. When the system detects the execution of a differ-
ent inter-application scenario, the static part of the STM algorithm
will choose the corresponding mapping as derived from the design-
time DSE stage and which has been stored in a so-called scenario
database. Because this database only stores mappings for entire
scenario clusters, its size can be controlled by choosing a proper
granularity of scenario clusters (e.g., inter-application scenarios).
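At run time the static part therefore reduces to a table lookup, while equation (1) is only evaluated at design time. The sketch below (hypothetical data layout and evaluation callbacks, not the actual Sesame or DSE interface) illustrates both sides.

def cluster_objectives(mapping, intra_scenarios, estimate_energy, estimate_time):
    """Design-time objective of equation (1): total energy and total execution
    time of one candidate mapping, accumulated over all intra-application
    scenarios SA_i of an inter-application scenario se_i."""
    total_energy = sum(estimate_energy(mapping, sa) for sa in intra_scenarios)
    total_time = sum(estimate_time(mapping, sa) for sa in intra_scenarios)
    return total_energy, total_time


# Run-time static part: one stored cluster-level mapping per inter-application
# scenario, keyed on the set of active applications (illustrative entries).
scenario_db = {
    frozenset({"mjpeg", "mp3", "sobel"}): {"dct": "PE0", "vlc": "PE1", "mp3_dec": "PE2"},
    frozenset({"mjpeg", "sobel"}):        {"dct": "PE0", "vlc": "PE0"},
}


def get_mapping(active_apps):
    """Lookup performed when a new inter-application scenario is detected."""
    return scenario_db[frozenset(active_apps)]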
The dynamic part of our STM algorithm is active during the en-
tire duration of an inter-application scenario. As explained in the
previous section, it uses application-specific objectives, specified
for each separate application, to continuously optimize the map-
ping. When the algorithm detects that an objective is unsatisfied, it
will try to find a new task mapping for that particular application
that missed the performance goal. If multiple applications miss
their performance goal, then the STM algorithm will start optimiz-
ing the most problematic application first. The main steps of the
dynamic part of the STM algorithm are described below.
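A minimal sketch of this objective check is shown below (the throughput objectives and monitor readings are illustrative, and the function names are ours, not the Sesame API): the application that misses its throughput requirement by the largest relative margin is optimized first.

def most_problematic_app(throughput_goal, measured_throughput):
    """Return the application missing its throughput objective by the largest
    relative margin, or None if every objective is satisfied."""
    worst_app, worst_ratio = None, 1.0
    for app, goal in throughput_goal.items():
        ratio = measured_throughput[app] / goal  # < 1.0 means the goal is missed
        if ratio < worst_ratio:
            worst_app, worst_ratio = app, ratio
    return worst_app


# Example: MP3 misses its goal more severely than MJPEG, so it is handled first.
goals    = {"mjpeg": 30.0, "mp3": 40.0, "sobel": 25.0}  # required throughput (illustrative units)
measured = {"mjpeg": 28.0, "mp3": 20.0, "sobel": 26.0}
assert most_problematic_app(goals, measured) == "mp3"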
3.1 Finding the Critical Task
The first step of the dynamic part of the STM algorithm is to
find the so-called critical task for the application that missed its
objective, as shown in lines 10-13 of Algorithm 1. The rationale
behind this is that by remapping this critical task and possibly its
neighbouring tasks (forming a bottleneck in the application), the
resulting effect will be optimal. To find the critical task, the STM
algorithm maintains three lists. The first list stores the task costs
(TC). For every application, it contains the cost of the application’s
tasks, where the cost is determined by the sum of the execution and
communication times of a task. These task costs are arranged in
descending order in the list. The two other lists concern the storing
of two other metrics for each task: the proportion of task cost in
Algorithm 1 STM algorithm
Input: KPN_{app_0}, ..., KPN_{app_m}, MPSoC, O_α, O_β, µ, η
Output: New(µ, η)
Lists: TC, CIC, CIB, PU
Thresholds: pCIC = δ_c, pCIB = δ_b
1: if detectScenario() == true : //new inter-application scenario
2: New(µ,η)= getMapping();
3: return New(µ,η);
4: else :
5: results[] = getStatistics();
6: if (i = objectiveUnsatisfied(results, Oα,Oβ)) != -1:
7: taskCost(KPNappi, results, TC, CIC, CIB);
8: peUsage(results, PU);
9: while(1) :
10: if (apptype = getType(KPNappi)) == DATA_PARALLEL :
11: critical = findDPCritical(KPNappi, CIC, CIB, pCIB, pCIC);
12: else :
13: critical = findCritical(KPNappi, CIC, CIB, pCIB, pCIC);
14: reason = findReason(critical, CIC, CIB, pCIB, pCIC);
15: if reason == POOR_LOCALITY :
16: MCC[] = minCircle(KPNappi, results, critical);
17: if GetSubstitute(PU, µ,η, MCC, apptype) == true :
18: return New(µ,η);
19: else failed;
20: else if reason == LOAD_IMBALANCE :
21: if GetSubstitute(PU, µ,η, apptype) == true :
22: return New(µ,η);
23: else failed;
24: else :
25: pCIB += ε;
26: pCIC -= ε;
the total busy time of the PE (i.e., processor) onto which the task is
currently mapped (CIB), and the proportion of task communication
time (read and write transactions) in the task cost (CIC).
Using the TC list, the algorithm checks the task at the top of the
list to find the critical task, taking the following two conditions into
account: 1) whether or not the task’s CIB proportion is lower than
a specific threshold, defined by pCIB. Here, the rationale is that
a high-cost task receiving only a small fraction of processor time
may imply that the processor is overloaded. If the task satisfies this
condition, then this task is considered as the critical task and the
process of finding the critical task ends. Otherwise, the algorithm
continues to check the other tasks in the TC list with lower costs
until it finds the critical task. If there is no task in the applica-
tion that satisfies the first condition, then the second condition will
be used: 2) Whether or not the CIC proportion is higher than the
threshold pCIC. The algorithm checks all the tasks using this sec-
ond condition just like it did for the first condition. If none of the tasks satisfies either condition, then the algorithm will, respectively, increase and decrease the pCIB and pCIC thresholds by ε, after which the above process is restarted.
For data parallel applications, the process of finding the critical
task has one additional test as compared to regular applications.
This extra test (performed in the function findDPCritical) involves
the check whether or not all data-parallel tasks are mapped onto
different PEs. If there are data-parallel tasks that are mapped onto
the same processor, then those tasks with higher task costs will
be treated as critical tasks. Otherwise, the process of finding the
critical task will be the same as for regular applications.
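The search for the critical task described above can be summarized as in the following sketch (our reconstruction from the text, not the authors' code; the extra check for data-parallel tasks sharing a PE is omitted).

def find_critical(task_cost, cib, cic, p_cib, p_cic, eps=0.05):
    """task_cost: dict task -> cost (execution plus communication time)
    cib:  dict task -> share of its PE's busy time taken by the task
    cic:  dict task -> share of the task's cost spent on communication
    Returns (critical_task, p_cib, p_cic), where the thresholds may have
    been relaxed by eps if no task satisfied either condition initially."""
    ordered = sorted(task_cost, key=task_cost.get, reverse=True)  # the TC list
    while True:
        # Condition 1: an expensive task receives only a small slice of its PE.
        for t in ordered:
            if cib[t] < p_cib:
                return t, p_cib, p_cic
        # Condition 2: an expensive task is dominated by communication.
        for t in ordered:
            if cic[t] > p_cic:
                return t, p_cib, p_cic
        # Neither condition holds: relax the thresholds and scan again.
        p_cib += eps
        p_cic -= eps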
3.2 Remapping the Critical Task
After the critical task has been found, the STM algorithm tries to
analyze the reason for missing the application’s performance goal.
In this respect, we recognize two different reasons: poor locality
and load imbalance. Here, we use the process of determining the
critical task to also determine the reason for not meeting the perfor-
mance goal: If the CIC proportion of the critical task is higher than
the value of the current pCIC threshold, then the algorithm assumes
that poor locality is the reason. Otherwise it takes load imbalance
as the reason for not meeting the application demands. This means
that poor locality has a higher priority than load imbalance as a
reason for not meeting the application demands, which helps to reduce the energy consumption caused by communication.
Subsequently, the function GetSubstitute in the STM algorithm
can follow different strategies to find a target PE to which the crit-
ical task will be remapped. The selection of remapping strategy
depends on the reason for not meeting the application’s perfor-
mance demands as well as on the type of application (data parallel
or not). The strategies that are used to find the substitute PE for
data-parallel applications are similar to the ones for regular appli-
cations except that one additional condition is taken into account
for finding the substitute PE: the substitute PE should not be a PE
onto which its parallel tasks are mapped.
3.2.1 Poor locality
In the case of poor locality, the STM algorithm will try to find a
better mapping for the application in question based on a minimal
cost circle (MCC) approach. A situation that has been identified as
"poor locality" is mainly due to the communication overhead be-
tween tasks. Evidently, if the communication frequency between two tasks is very high or the communicated data volume is very large, then these two tasks should preferably be mapped onto the same PE or onto two different PEs with a more efficient interconnect between them. The MCC strategy aims at redistributing
the critical task and its neighbouring tasks over PEs such that com-
munication overhead is reduced while trying to avoid creating new
computational bottlenecks. To this end, it first finds the minimal
cost circle based on equation (2) for the critical task $p_i$:

$$\min\big(Circle\_Cost(p_i)_{mn}\big), \quad \text{with } 0 \le m, n \le |P| \qquad (2)$$

where:

$$Circle\_Cost(p_i)_{mn} = \sum_{m \le i \le n} T^z_i \; + \sum_{0 \le i < |P|} \; \sum_{m \le j \le n} C^{c_{xy}}_{ij} \qquad (3)$$

where $T^z_i$ denotes the execution time of task $i$ on the PE $z$ onto which task $i$ is currently mapped, and $C^{c_{xy}}_{ij}$ denotes the communication overhead between tasks $i$ and $j$ (see Section 2.2). This strategy
is applicable for heterogeneous MPSoC architectures. However,
in this paper, our focus is on homogeneous architectures using a
shared bus interconnect. This means that each task will have a constant computational cost irrespective of the PE it is mapped on,
and that communication overhead only involves internal commu-
nication within a single PE (i.e., when the communicating tasks
are mapped to the same PE) or external communication between
PEs via shared memory. Clearly, internal communication costs are
much lower than external communication costs. Figure 3.a shows
an example of an MCC (indicated by the red oval) that contains
two tasks, including the critical task (red task), whereas Figure 3.b
illustrates an MCC that only contains the critical task itself.
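Under the homogeneous, shared-bus assumptions above, the minimal cost circle can be found by an exhaustive scan over contiguous task windows that contain the critical task, as in the sketch below (our reading of equations (2) and (3); identifiers are hypothetical).

def min_cost_circle(exec_time, comm_cost, critical, n_tasks):
    """exec_time: list where exec_time[i] is T_i of task i on its current PE.
    comm_cost:  dict (i, j) -> communication overhead between tasks i and j
                under the current mapping (0 if the tasks do not communicate).
    Returns the window (m, n) of task indices containing the critical task
    that minimizes Circle_Cost of equation (3)."""
    best_cost, best_window = float("inf"), (critical, critical)
    for m in range(critical + 1):
        for n in range(critical, n_tasks):
            compute = sum(exec_time[m:n + 1])
            comm = sum(comm_cost.get((i, j), 0)
                       for i in range(n_tasks)
                       for j in range(m, n + 1))
            if compute + comm < best_cost:
                best_cost, best_window = compute + comm, (m, n)
    return best_window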
After the MCC of the critical task has been determined, the func-
tion GetSubstitute will choose a substitute PE for all the tasks in-
cluded in the identified MCC to achieve a new mapping. For this purpose, the PU list is used, which contains the processor utilisation of each PE.

Figure 3: Examples of an MCC for a critical task (red task): a. the MCC of the critical task contains multiple processes; b. the MCC of the critical task contains a single process. Each task node is annotated with its computation cost on the PE, and each edge with the communication cost when the two tasks are mapped to different PEs and to the same PE, respectively.

The substitute PE is the PE with the lowest utilization
in the PU list that is different from the PE onto which the critical
task is currently mapped. If the MCC solely consists of the crit-
ical task itself, then the critical task will be mapped onto the PE
of a neighboring task that has the heaviest communication with the
critical task. This is, e.g., shown in Figure 3.b, where the criti-
cal task will be mapped onto the same PE as the task with cost
70. Moreover, the substitute PE should be different from the PE the critical task is currently mapped on; otherwise, the algorithm fails
to find a new mapping. After the substitute PE has been found,
the FIFO channels between the tasks that need to be remapped are
either mapped as internal communication onto the new PE (if com-
municating tasks are mapped onto this PE) or onto the system bus.
3.2.2 Load imbalance
In the case that a load imbalance has been identified as the reason for
not meeting the application demands, a load balancing strategy is
used to remap the critical task. The substitute PE should satisfy the
condition that it is different from the current PE of the critical task
and should have the lowest processor utilization in the PU list. If
such a substitute does not exist, then the algorithm cannot find a
better mapping.
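Both remapping strategies boil down to selecting a substitute PE from the PU list; the combined sketch below is our own reading of GetSubstitute, not the authors' implementation. A single-task MCC with poor locality is co-located with its heaviest communication partner; in all other cases the least-utilized PE different from the critical task's current PE is chosen, excluding PEs that host parallel sibling tasks of a data-parallel application.

def get_substitute(pu, current_pe, mcc_tasks, mu, reason,
                   heaviest_neighbour=None, forbidden_pes=()):
    """pu: dict PE -> utilisation; mu: dict task -> PE (current mapping).
    Returns the substitute PE for the tasks in mcc_tasks, or None on failure.
    forbidden_pes: PEs hosting parallel sibling tasks (data-parallel apps)."""
    if reason == "POOR_LOCALITY" and len(mcc_tasks) == 1 and heaviest_neighbour:
        # Single-task MCC: move next to the heaviest communication partner.
        candidate = mu[heaviest_neighbour]
        return candidate if candidate != current_pe else None
    # Multi-task MCCs and load imbalance: pick the least-utilized other PE.
    candidates = [pe for pe in pu
                  if pe != current_pe and pe not in forbidden_pes]
    return min(candidates, key=pu.get) if candidates else None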
4. EXPERIMENTS
4.1 Experimental Framework
To evaluate the efficiency of our STM algorithm and the map-
pings found at run time by this algorithm, we deploy the (open-
source) Sesame system-level MPSoC simulator [14]. To this end,
we have extended this simulator with our run-time resource schedul-
ing framework, as illustrated in Figure 4. Our extension includes
the Scenario DataBase (SDB), a System Monitor (SM) and a run-
time Resource Scheduler (RS). The SDB is used to store the map-
pings for each inter-application scenario as derived from design-
time DSE. The SM is in charge of recording the running statis-
tics for each active application as well as monitoring system-wide
statistics. The RS uses the run-time task mapping algorithm and
the statistics provided by the SM to dynamically remap application
tasks when needed, as explained in the previous section.
4.2 Experimental Results
In this subsection, we present several experimental results in
which we investigate various aspects of our STM algorithm and
compare it to three well-known mapping algorithms: First-Fit Bin-
Packing (FFBP) [7] which has been frequently adapted to do task
mapping by means of modelling it as a bin-packing problem, Output-
Rate Balancing (ORB) [5] and Recursive BiPartition and Refining
(RBPR) [18]. We modified these algorithms to fit our mapping
problem and extended them to also allow for mapping data-parallel applications by constraining the data-parallel tasks so that they have to be mapped onto different processing elements.

Figure 4: Extended Sesame framework. The application models (one per application, each with its intra-application scenarios and tasks) generate event traces; a mapping layer (an abstract RTOS that schedules these events) maps them onto the architecture model with processors P0-P4 and shared memory; the Scenario DataBase, System Monitor and run-time Resource Scheduler are the added components.

For the FFBP al-
gorithm, the PE with the lowest utilization is taken as the first-fit bin
and the computational cost of each task in the target application is
considered as the object that needs to be packed into the bins.
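As an illustration of this FFBP adaptation (our own sketch, not the implementation used in the experiments; the extra constraint for data-parallel tasks is omitted), each task is packed onto the currently least-utilized PE, which then accumulates the task's computational cost.

def ffbp_map(task_cost, pe_load):
    """task_cost: dict task -> computational cost;
    pe_load: dict PE -> current utilisation. Returns a dict task -> PE."""
    load = dict(pe_load)
    mapping = {}
    for task, cost in task_cost.items():
        pe = min(load, key=load.get)  # least-utilized PE acts as the first-fit bin
        mapping[task] = pe
        load[pe] += cost              # pack the task's cost into that bin
    return mapping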
For our experiments, we use the three typical multi-media appli-
cations that were already introduced in Section 1: MJPEG, Sobel
and MP3. The KPN of the MJPEG application contains 8 pro-
cesses and 18 FIFO channels, Sobel contains 6 processes and 6
FIFO channels, and MP3 contains 27 processes and 52 FIFO chan-
nels. In the Sobel and MP3 applications, data parallelism is ex-
ploited. Moreover, MJPEG has 11 intra-application scenarios, MP3
has 3 intra-application scenarios, whereas Sobel only has 1 intra-
application scenario. This results in a total of 95 different workload
scenarios. At design time, we have determined the on-average best
mapping for each possible inter-application scenario as explained
in Section 3. With respect to the target architecture, we modeled
a homogeneous MPSoC containing 5 processors, connected to a
shared bus and memory. The model also includes the required com-
ponents for our run-time scheduling framework.
As there are just three applications and each application con-
tains a limited number of intra-application scenarios, we are able
to exhaustively evaluate all workload scenarios. For each work-
load scenario, we have simulated the system using two methods:
one is deploying only the static part of our STM algorithm to deal
with the dynamism at the level of different inter-application sce-
narios, whereas the other one is running all the workload scenar-
ios under a single, fixed mapping: the on-average best mapping
found for the inter-application scenario in which all three applica-
tions are concurrently executing. The results of this experiment
are shown in Figure 5. From this figure, we can see that the static
part of our STM algorithm already yields both performance im-
provements and energy savings by dynamically adjusting the map-
ping based on the variation in inter-application scenarios. For this
specific test case, the performance improvements for the different
inter-application scenarios range from 1.69% to 29.49% and the
energy savings range from 1.09% to 24.51%. Overall, for the exe-
cution of all 95 workload scenarios, the improvements in terms of
performance and energy saving are 7.4% and 9.4%, respectively.
Figures 6.a and 6.b show the intra-application scenario execution
times and energy consumption for the FFBP, ORB, RBPR and STM
run-time mapping algorithms for a single inter-application scenario
in which all three applications are concurrently executing.

Figure 5: Normalized execution time and energy consumption of each inter-application scenario, with and without run-time remapping.

Moreover, these two graphs also contain the results when using optimal
mappings (OPT) for each intra-application scenario (we derived
these mappings in a design-time DSE experiment). The results in
these two graphs have been ordered in a monotonically increasing
fashion based on the results from the OPT mappings. Figures 6.c-
e show the overall (for the entire inter-application scenario) per-
formance, energy consumption and overhead. Here, the overhead
includes the run-time calculation of new mappings as well as the
migration of tasks. From Figure 6, we can see that our STM clearly
performs better than the other algorithms in terms of the execution
time of scenarios. For several intra-application scenarios, the STM
algorithm even approaches the OPT results. With respect to energy
consumption and overhead, the STM algorithm also performs well:
it ranks second, closely behind the ORB algorithm. The reason for ORB's low overhead is that it only needs to migrate a few tasks in our experiment, which results in a very low task migration cost.
In our last experiment, we used the full STM algorithm, includ-
ing the static and dynamic parts and thus combining the dynamism
of inter-application as well as intra-application scenarios, to test
all the 95 workload scenarios of our three applications. Our al-
gorithm could achievea11.3% performance improvement and an
energy saving of 13.9% compared to an approach in which we run
the applications using the (static) on-average best mapping for the
inter-application scenario in which all three applications are active.
Comparing these results to those when only using the static part of
our STM algorithm (improvements of 7.4% and 9.4%, respectively;
see above), this means that the dynamic part of the STM algorithm
is capable of significantly further improving the mappings.
5. RELATED RESEARCH
In recent years, much research has been performed in the area
of run-time mapping for embedded systems. The authors of [6]
propose a run-time mapping strategy that incorporates user behav-
ior information in the resource allocation process. An agent based
distributed application mapping approach for large MPSoCs is pre-
sented in [1]. The work of [9] proposes a run-time spatial mapping
technique to map streaming applications on MPSoCs. In [3], dy-
namic task allocation strategies based on bin-packing algorithms
for soft real-time applications are presented. A Dynamic Spiral
Mapping (DSM) algorithm for mapping an application on an MP-
SoC arranged in a 2-D mesh topology is proposed in [2]. The
authors of [4] present network congestion-aware heuristics for
mapping tasks on NoC-based MPSoCs at run-time. The work of
[16] uses a Smart Nearest Neighbour approach to perform run-time
task mapping.

Figure 6: Comparing different run-time mapping algorithms (OPT, FFBP, ORB, RBPR, STM): a. estimated execution time per intra-application scenario; b. estimated energy consumption per intra-application scenario; c. total intra-app scenario execution time; d. total intra-app scenario energy consumption; e. cost of the algorithms, including algorithm execution cost and task migration cost.

In [10], a run-time task allocator is presented that uses an adaptive task allocation algorithm and an adaptive clustering
approach for efficient reduction of the communication load. Mar-
iani et al. [12] proposed a run-time management framework in
which Pareto-fronts with system configuration points for different
applications are determined during design-time DSE, after which
heuristics are used to dynamically select a proper system configu-
ration at run time. Compared with these algorithms, our STM al-
gorithm takes a scenario-based approach and considers both computational and communication behavior when making (re-)mapping
decisions. Recently, Schor et al. [15] also proposed a scenario-
based run-time mapping approach in which mappings derived from
design-time DSE are stored for run-time mapping decisions, but
they do not cluster mappings to reduce mapping storage nor do
they dynamically optimize the mappings at run time.
6. CONCLUSION
We have proposed a run-time mapping algorithm for MPSoC-
based embedded systems that improves performance and reduces energy consumption by capturing the dynamism of the application work-
loads executing on the system. This algorithm is based on the
idea of application scenarios and consists of a design-time and run-
time phase. The design-time phase produces mappings for clus-
ters of application scenarios after which the run-time phase aims
to optimize these mappings by continuously monitoring the system
and trying to perform (relatively small) mapping customizations to
gradually further improve the system performance. In various ex-
periments, we have evaluated our algorithm and compared it with
three other algorithms. The results show that our algorithm can
yield considerable improvements as compared to just using a static
mapping strategy. Compared with three other, well-known run-time mapping algorithms, our algorithm shows a better trade-off between the quality and the cost of the mappings found at run time.
7. REFERENCES
[1] M. A. Al Faruque, R. Krist, and J. Henkel. ADAM: run-time
agent-based distributed application mapping for on-chip
communication. In Proc. of DAC’08, pages 760–765, 2008.
[2] M. Armin, K. Ahmad, and S. Samira. DSM: A heuristic
dynamic spiral mapping algorithm for network on chip.
IEICE Electronics Express, 5(13):464–471, 2008.
[3] E. W. Brião, D. Barcelos, and F. R. Wagner. Dynamic task
allocation strategies in mpsoc for soft real-time applications.
In Proc. of DATE’08, pages 1386–1389, 2008.
[4] E. Carvalho and F. Moraes. Congestion-aware task mapping
in heterogeneous MPSoCs. In Int. Symposium on
System-on-Chip, pages 1–4, Nov. 2008.
[5] J. Castrillon, R. Leupers, and G. Ascheid. Maps: Mapping
concurrent dataflow applications to heterogeneous mpsocs.
IEEE Trans. on Industrial Informatics, PP(99):1, 2011.
[6] C.-L. Chou and R. Marculescu. User-aware dynamic task
allocation in networks-on-chip. In Proc. of DATE’08, pages
1232–1237, 2008.
[7] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson.
Approximation algorithms for bin packing: a survey. In
Approximation algorithms for NP-hard problems, pages
46–93. PWS Publishing Co., 1997.
[8] S. V. Gheorghita, M. Palkovic, J. Hamers, A. Vandecappelle,
S. Mamagkakis, T. Basten, L. Eeckhout, H. Corporaal,
F. Catthoor, F. Vandeputte, and K. D. Bosschere.
System-scenario-based design of dynamic embedded
systems. ACM Trans. Design Autom. Electr. Syst., 14(1),
2009.
[9] P. K. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. Smit.
Run-time spatial mapping of streaming applications to a
heterogeneous multi-processor system-on-chip (mpsoc). In
Proc. of DATE’08, pages 212–217, March 2008.
[10] J. Huang, A. Raabe, C. Buckl, and A. Knoll. A workflow for
runtime adaptive task allocation on heterogeneous mpsocs.
In Proc. of DATE’11, pages 1119–1134, 2011.
[11] G. Kahn. The semantics of a simple language for parallel
programming. In Information processing, pages 471–475.
North Holland, Amsterdam, Aug 1974.
[12] G. Mariani, P. Avasare, G. Vanmeerbeeck,
C. Ykman-Couvreur, G. Palermo, C. Silvano, and
V. Zaccaria. An industrial design space exploration
framework for supporting run-time resource management on
multi-core systems. In Proc. of DATE’10, pages 196–201, March 2010.
[13] J. M. Paul, D. E. Thomas, and A. Bobrek. Scenario-oriented
design for single-chip heterogeneous multiprocessors. IEEE
Trans. VLSI Syst., 14(8):868–880, 2006.
[14] A. D. Pimentel, C. Erbas, and S. Polstra. A systematic
approach to exploring embedded system architectures at
multiple abstraction levels. IEEE Trans. Computers,
55(2):99–112, 2006.
[15] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and
L. Thiele. Scenario-based design flow for mapping streaming
applications onto on-chip many-core systems. In Proc. of
CASES’12, pages 71–80, 2012.
[16] A. K. Singh, W. Jigang, A. Kumar, and T. Srikanthan.
Run-time mapping of multiple communicating tasks on
mpsoc platforms. Procedia CS, 1(1):1019–1026, 2010.
[17] P. van Stralen and A. D. Pimentel. Scenario-based design
space exploration of mpsocs. In Proc. of IEEE ICCD’10,
October 2010.
[18] J. Yu, J. Yao, L. Bhuyan, and J. Yang. Program mapping onto
network processors by recursive bipartitioning and refining.
In Proc. of DAC’07, pages 805–810, June 2007.