A Scenario-based Run-time Task Mapping
Algorithm for MPSoCs
Wei Quan and Andy D. Pimentel
Informatics Institute, University of Amsterdam, The Netherlands
School of Computer Science, National University of Defense Technology, Hunan, China
{w.quan,a.d.pimentel}@uva.nl, quanwei02@gmail.com
ABSTRACT
The application workloads in modern MPSoC-based embedded sys-
tems are becoming increasingly dynamic. Different applications
concurrently execute and contend for resources in such systems
which could cause serious changes in the intensity and nature of
the workload demands over time. To cope with the dynamism
of application workloads at run time and improve the efficiency
of the underlying system architecture, this paper presents a novel
scenario-based run-time task mapping algorithm. This algorithm
combines a static mapping strategy based on workload scenarios
and a dynamic mapping strategy to achieve an overall improvement
of system efficiency. We evaluated our algorithm using a homoge-
neous MPSoC system with three real applications. From the re-
sults, we found that our algorithm achieves an 11.3% performance
improvement and a 13.9% energy saving compared to running the
applications without using any run-time mapping algorithm. When
comparing our algorithm to three other, well-known run-time map-
ping algorithms, it is superior to these algorithms in terms of quality
of the mappings found while also reducing the overheads compared
to most of these algorithms.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Performance Attributes
General Terms
Algorithm, Design, Performance
Keywords
Embedded systems, KPN, MPSoC, task mapping, simulation
1. INTRODUCTION
Modern embedded systems, which are more and more based on
MultiProcessor System-on-Chip (MPSoC) architectures, often re-
quire supporting an increasing number of applications and stan-
dards, where multiple applications can run simultaneously. For
each single application, there are typically also different execu-
tion modes (or program phases) with different requirements. For
example, a video application could dynamically lower its resolu-
tion to decrease its computational demands in order to save battery.
As a consequence, the behavior of application workloads execut-
ing on the embedded system can change dramatically over time.

[Figure 1: Intra-application scenario performance of MJPEG: MJPEG performance in simulation cycles for each intra-application scenario ID.]
Here, one can distinguish two forms of dynamic application behav-
ior: inter-application dynamism and intra-application dynamism.
These forms of dynamism are often captured using scenarios [13,
8]. This means that there are two different kinds of scenarios: inter-
application scenarios to describe the simultaneously running appli-
cations in the system, and intra-application scenarios that define the
different execution modes for each application. The combination
of these inter- and intra-application scenarios are called workload
scenarios, and specify the application workload in terms of the dif-
ferent applications that are concurrently executing and the mode of
each application.
At design time of an embedded system, a designer could aim
at finding the optimal mapping of application tasks to MPSoC pro-
cessing resources for each inter- and intra-application scenario. How-
ever, when the number of applications and application modes in-
crease, the total number of workload scenarios will explode ex-
ponentially. Considering, e.g., 10 applications with 5 execution
modes for each application, there will be 60 million workload sce-
narios. If each scenario takes one second to find the optimal map-
ping at design time, then one would need nearly two years to obtain
all the optimal mappings. Moreover, storing all these optimal map-
pings such that they can be used at run time by the system to remap
tasks when a new scenario is detected would also be unrealistic as
this would take up too much memory storage.
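The arithmetic behind these figures, assuming each application is either inactive or runs in exactly one of its five modes (a back-of-the-envelope check of the numbers quoted above, not an additional claim):

\[
(5+1)^{10} = 6^{10} \approx 6.0 \times 10^{7}\ \text{workload scenarios}, \qquad
6.0 \times 10^{7}\ \text{s} \approx 700\ \text{days} \approx 1.9\ \text{years}.
\]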
An approach to solve this problem is by clustering workload sce-
narios and only storing a single mapping per cluster of workload
scenarios to facilitate run-time mapping [8]. Such clustering im-
plies a significant space reduction needed to store the mappings.
Moreover, so-called scenario-based design space exploration [17]
can be deployed to efficiently find these mappings by only evalu-
!"#$%&'( )*+,
-./01+!
)23
4+5637 !89
4:2%(;7
!"#$%&<=;
Figure 2: KPN for MJPEG application.
ating a representative subset of scenarios for each cluster. In this
paper, we consider a clustering method¹ in which we find and store
a single mapping for each inter-application scenario that yields, on
average, best performance for all possible intra-application scenar-
ios within the inter-application scenario. However, as we can see
from the behavior of a Motion-JPEG (MJPEG) encoder applica-
tion in Figure 1, using such a single mapping to represent an entire
inter-application scenario shows considerable performance varia-
tions for the different intra-application scenarios that exist in this
inter-application scenario. In this particular example, the inter-
application scenario contains three simultaneously running multi-
media applications: MJPEG, a MP3 decoder, and a Sobel filter for
edge detection in images. The use of cluster-level mappings (i.e.,
mappings found to be good for an entire cluster of workload scenar-
ios) can provide a run-time mapping system with enough informa-
tion to quickly find an adequate mapping for a detected workload
scenario but it will not immediately lead to finding the optimal sys-
tem mapping for any identified workload scenario. Therefore, we
propose a novel run-time Scenario-based Task Mapping algorithm
(STM) that uses the cluster-level mapping information derived from
design-time design space exploration (DSE) but, additionally, per-
forms run-time mapping optimization by continuously monitoring
the system and trying to perform (relatively small) mapping cus-
tomizations to gradually further improve the system performance.
The remainder of this paper is organized as follows. Section 2
gives some prerequisites and the problem definition for this paper.
Section 3 provides a detailed description of the scenario-based run-
time mapping algorithm. Section 4 introduces the experimental
environment and presents the results of our experiments. Section 5
discusses related work, after which Section 6 concludes the paper.
2. PREREQUISITES AND PROBLEM DEF-
INITION
In this section, we explain the necessary prerequisites for this
work and provide a detailed problem definition.
2.1 Application Model
In this paper, we target the multimedia application domain, as
was already illustrated in Figure 1. For this reason, we use the
Kahn Process Network (KPN) model of computation [11] to spec-
ify application behaviour since this model of computation fits well
to the streaming behaviour of multimedia applications. In a KPN,
an application is described as a network of concurrent processes
that are interconnected via FIFO channels. This means that an
application can be represented as a directed graph KPN = (P,F)
where $P$ is the set of processes (tasks) $p_i$ in the application and $f_{ij} \in F$
represents the FIFO channel between two processes $p_i$ and $p_j$. Fig-
ure 2 shows the KPN of the MJPEG application.
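The KPN structure used throughout this paper can be captured by a very small data structure. The sketch below is illustrative only (the process names and pipeline shape are made up and do not reproduce the actual 8-process MJPEG network of Figure 2):

    # Minimal sketch of a KPN application graph: processes (tasks) connected by
    # FIFO channels, i.e. a directed graph KPN = (P, F).
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Fifo:
        src: str   # producing process
        dst: str   # consuming process

    @dataclass
    class KPN:
        processes: set = field(default_factory=set)
        fifos: list = field(default_factory=list)

        def add_fifo(self, src, dst):
            self.processes.update((src, dst))
            self.fifos.append(Fifo(src, dst))

    # A toy MJPEG-like pipeline (illustrative; the real network has 8 processes
    # and 18 FIFO channels).
    mjpeg = KPN()
    for s, d in [("video_in", "dct"), ("dct", "quant"),
                 ("quant", "vle"), ("vle", "video_out")]:
        mjpeg.add_fifo(s, d)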
¹We note, however, that other clustering methods would also be possible and that our run-time mapping algorithm is independent of the clustering method used.
2.2 Architecture Model
In this work, we restrict ourselves to homogeneous MPSoC target architectures². An architecture can be modeled as a graph $MPSoC = (PE, C)$, where $PE$ is the set of processing elements used in the architecture and $C$ is a multiset of pairs $c_{ij} = (pe_i, pe_j) \in PE \times PE$ representing a communication channel (like a bus, NoC, etc.) between processors $pe_i$ and $pe_j$. Combining the definitions of the application and architecture models, the computation cost of task (process) $p_i$ on processing element $pe_j$ is expressed as $T_i^j$, and the communication cost between tasks $p_i$ and $p_j$ on channel $c_{xy}$ is $C_{ij}^{c_{xy}}$.
2.3 Task Mapping
The task mapping defines the corresponding relationship between
the tasks in a KPN application and the underlying architecture re-
sources. For a single application, given the KPN of this application and a target MPSoC, a correct mapping is a pair of unique assignments $(\mu: P \to PE,\ \eta: F \to C)$ such that $\forall f \in F:\ src(\eta(f)) = \mu(src(f)) \wedge dst(\eta(f)) = \mu(dst(f))$. In the case of a multi-application workload, the state of the simultaneously running applications, distinguished as inter- and intra-application scenarios, should be considered in the task mapping. Let $A = \{app_0, app_1, \ldots, app_m\}$ be the set of all applications that can run on the system, and $M_i = \{md_0^i, md_1^i, \ldots, md_n^i\}$ be the set of possible execution modes for $app_i \in A$. Then $SE = \{se_0, se_1, \ldots, se_{n_{inter}}\}$, with $se_i = \{app_0 = 0/1, \ldots, app_m = 0/1\}$ and $app_i \in A$, is the set of all inter-application scenarios. And $sa_j^i = \{app_0 = md_{j_0}^0, \ldots, app_m = md_{j_m}^m\}$, with $app_i \in A \wedge app_i = 1$ in $se_i$ and $md_{j_x}^i \in M_i$, represents the $j$-th intra-application scenario in inter-application scenario $se_i \in SE$. The set of all workload scenarios can then be defined as the disjoint union $S = \biguplus_{i \in SE} SA_i$, with $SA_i = \{sa_1^i, sa_2^i, \ldots, sa_{n^i_{intra}}^i\}$.
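The mapping correctness condition above can be made concrete with a short sketch, building on the KPN structure sketched in Section 2.1 (all names are illustrative, not the authors' code):

    # mu: process name -> PE, eta: Fifo -> (pe_src, pe_dst). A mapping is correct
    # if every FIFO is mapped to a channel whose endpoints are exactly the PEs
    # of the FIFO's source and destination processes.
    def mapping_is_correct(kpn, mu, eta):
        for f in kpn.fifos:
            pe_src, pe_dst = eta[f]
            if pe_src != mu[f.src] or pe_dst != mu[f.dst]:
                return False
        return True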
As already explained in the previous section, we propose to per-
form the run-time mapping of applications in two stages. In the
first stage, which is performed at design time, we cluster work-
load scenarios (similar to [8]) and perform DSE for each of these
scenario clusters to find a mapping that shows the best average per-
formance for that particular cluster. More specifically, in this paper,
we consider each seiSE as a different cluster of scenarios (i.e.,
we cluster all intra-application scenarios of an inter-application sce-
nario). The mappings derived from design-time DSE are stored so
they can be used by the run-time mapping algorithm to re-map ap-
plications when a workload scenario is detected that belongs to a
different scenario cluster. Since these statically determined map-
pings may not be optimal for the current active intra-application
scenario, the second stage of the run-time mapping algorithm tries
to perform (relatively small) mapping customizations to gradually
further improve the system performance. In our goal to optimize
mappings, we recognize two kinds of objectives: system-level ob-
jectives and application-dependent objectives. System-level objec-
tives, denoted as $O_\alpha = \{O_{\alpha_0}, O_{\alpha_1}, \ldots\}$, define the system-wide metrics such as system energy consumption, total system execution time, etc. Application-dependent objectives, denoted as $O_\beta = \{O_{\beta_0}, O_{\beta_1}, \ldots\}$, are mainly used to define the performance require-
ments of each separate application like throughput, latency, etc. As
will be explained in the next section, the first stage of our run-time
mapping approach uses system-level objectives to find mappings
per scenario cluster. Here, we use system energy consumption
and total workload scenario execution time as metrics: $E_{s_i}$, $s_i \in S$, represents the system energy consumption of workload scenario $s_i$, and $X_{s_i}$, $s_i \in S$, is the execution time of scenario $s_i$. For the second stage, during which the mapping is gradually optimized, we apply application-specific objectives – in our case throughput requirements for each application – for the optimization process. However, to measure the results of the run-time optimization process, we also use the system-level metrics $E_{s_i}$ and $X_{s_i}$.

²In subsequent work, we will show how scenario-based run-time mapping can also be applied to heterogeneous MPSoCs.
Under these definitions and given the $KPN = (P, F)$ for each application and an $MPSoC = (PE, C)$, our goal is to continuously customize the mapping at run time such that the system-level and/or application-specific objectives under every workload scenario $s_i \in S$ are satisfied.
3. SCENARIO-BASED TASK MAPPING
The STM algorithm, which is outlined in Algorithm 1, can be
divided into a static part and a dynamic part. The static part is
used to capture application dynamism at the granularity of inter-
application scenarios. For each inter-application scenario $se_i \in SE$, we have determined – using design-time DSE – a mapping that on average performs best for all intra-application scenarios $SA_i$ of $se_i$. That is, for each $se_i$ we search for a mapping by solving the following multi-objective optimisation problem:

$$\min\left[\sum_{sa_j^i \in SA_i} E_{sa_j^i},\ \ \sum_{sa_j^i \in SA_i} X_{sa_j^i}\right]. \qquad (1)$$
To this end, we have deployed the scenario-based DSE approach
presented in [17], which is based on the well-known NSGA-II ge-
netic algorithm and allows for effectively pruning the design space
by only evaluating a representative subset of intra-application sce-
narios of SAifor each seiSE. As this design-time DSE stage is
not the main focus of this paper, we refer the interested reader to
[17] for further details. The mappings derived from this design-
time DSE are used by the STM algorithm as shown in lines 1-3 of
Algorithm 1. When the system detects the execution of a differ-
ent inter-application scenario, the static part of the STM algorithm
will choose the corresponding mapping as derived from the design-
time DSE stage and which has been stored in a so-called scenario
database. Because this database only stores mappings for entire
scenario clusters, its size can be controlled by choosing a proper
granularity of scenario clusters (e.g., inter-application scenarios).
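A minimal sketch of this static part (lines 1-3 of Algorithm 1), assuming the scenario database is keyed by the set of active applications; the layout and names below are illustrative, not the actual implementation:

    # Scenario DataBase: inter-application scenario -> cluster-level mapping
    # (mu, eta) found by design-time DSE. The entries here are placeholders.
    scenario_db = {
        frozenset({"mjpeg", "sobel", "mp3"}): ("mu_012", "eta_012"),
        frozenset({"mjpeg", "mp3"}):          ("mu_02", "eta_02"),
    }

    def on_scenario_change(active_apps):
        """Static part: look up the stored mapping for the detected scenario."""
        return scenario_db[frozenset(active_apps)]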
The dynamic part of our STM algorithm is active during the en-
tire duration of an inter-application scenario. As explained in the
previous section, it uses application-specific objectives, specified
for each separate application, to continuously optimize the map-
ping. When the algorithm detects that an objective is unsatisfied, it
will try to find a new task mapping for that particular application
that missed the performance goal. If multiple applications miss
their performance goal, then the STM algorithm will start optimiz-
ing the most problematic application first. The main steps of the
dynamic part of the STM algorithm are described below.
3.1 Finding the Critical Task
The first step of the dynamic part of the STM algorithm is to
find the so-called critical task for the application that missed its
objective, as shown in lines 10-13 of Algorithm 1. The rationale
behind this is that by remapping this critical task and possibly its
neighbouring tasks (forming a bottleneck in the application), the
resulting effect will be optimal. To find the critical task, the STM
algorithm maintains three lists. The first list stores the task costs
(TC). For every application, it contains the cost of the application’s
tasks, where the cost is determined by the sum of the execution and
communication times of a task. These task costs are arranged in
descending order in the list. The two other lists concern the storing
of two other metrics for each task: the proportion of task cost in
Algorithm 1 STM algorithm
Input: KPN_app0, ..., KPN_appm, MPSoC, Oα, Oβ, µ, η
Output: New(µ, η)
Lists: TC, CIC, CIB, PU
Thresholds: pCIC = δc, pCIB = δb
 1: if detectScenario() == true :            // new inter-application scenario
 2:     New(µ, η) = getMapping();
 3:     return New(µ, η);
 4: else :
 5:     results[] = getStatistics();
 6:     if (i = objectiveUnsatisfied(results, Oα, Oβ)) != -1 :
 7:         taskCost(KPN_appi, results, TC, CIC, CIB);
 8:         peUsage(results, PU);
 9:         while (1) :
10:             if (apptype = getType(KPN_appi)) == DATA_PARALLEL :
11:                 critical = findDPCritical(KPN_appi, CIC, CIB, pCIB, pCIC);
12:             else :
13:                 critical = findCritical(KPN_appi, CIC, CIB, pCIB, pCIC);
14:             reason = findReason(critical, CIC, CIB, pCIB, pCIC);
15:             if reason == POOR_LOCALITY :
16:                 MCC[] = minCircle(KPN_appi, results, critical);
17:                 if GetSubstitute(PU, µ, η, MCC, apptype) == true :
18:                     return New(µ, η);
19:                 else failed;
20:             else if reason == LOAD_IMBALANCE :
21:                 if GetSubstitute(PU, µ, η, apptype) == true :
22:                     return New(µ, η);
23:                 else failed;
24:             else :
25:                 pCIB += ε;
26:                 pCIC -= ε;
the total busy time of the PE (i.e., processor) onto which the task is
currently mapped (CIB), and the proportion of task communication
time (read and write transactions) in the task cost (CIC).
Using the TC list, the algorithm checks the task at the top of the
list to find the critical task, taking the following two conditions into
account: 1) whether or not the task’s CIB proportion is lower than
a specific threshold, defined by pCIB. Here, the rationale is that
a high-cost task receiving only a small fraction of processor time
may imply that the processor is overloaded. If the task satisfies this
condition, then this task is considered as the critical task and the
process of finding the critical task ends. Otherwise, the algorithm
continues to check the other tasks in the TC list with lower costs
until it finds the critical task. If there is no task in the applica-
tion that satisfies the first condition, then the second condition will
be used: 2) Whether or not the CIC proportion is higher than the
threshold pCIC. The algorithm checks all the tasks using this sec-
ond condition just like it did for the first condition. If no task satisfies either condition, the algorithm increases the pCIB threshold and decreases the pCIC threshold by ε, after which the process above is restarted.
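A sketch of this search, assuming each task entry carries its cost and its CIB/CIC proportions (a loose reading of findCritical, not the authors' implementation):

    # tasks_by_cost_desc: task dicts sorted by descending cost (the TC list).
    def find_critical(tasks_by_cost_desc, p_cib, p_cic):
        # Condition 1: a costly task that receives only a small share of its
        # PE's busy time suggests an overloaded processor.
        for t in tasks_by_cost_desc:
            if t["cib"] < p_cib:
                return t
        # Condition 2: otherwise, a task whose cost is dominated by communication.
        for t in tasks_by_cost_desc:
            if t["cic"] > p_cic:
                return t
        return None   # caller relaxes pCIB/pCIC by epsilon and retries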
For data parallel applications, the process of finding the critical
task has one additional test as compared to regular applications.
This extra test (performed in the function findDPCritical) involves
the check whether or not all data-parallel tasks are mapped onto
different PEs. If there are data-parallel tasks that are mapped onto
the same processor, then those tasks with higher task costs will
be treated as critical tasks. Otherwise, the process of finding the
critical task will be the same as for regular applications.
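The extra test of findDPCritical could look as follows; this sketch reuses find_critical from above and assumes the grouping of data-parallel tasks and the current mapping are available as inputs:

    def find_dp_critical(parallel_tasks, mapping, tasks_by_cost_desc, p_cib, p_cic):
        by_pe = {}
        for t in parallel_tasks:
            by_pe.setdefault(mapping[t["name"]], []).append(t)
        clashes = [grp for grp in by_pe.values() if len(grp) > 1]
        if clashes:
            # data-parallel tasks share a PE: the costliest clashing task is critical
            return max((t for grp in clashes for t in grp), key=lambda t: t["cost"])
        return find_critical(tasks_by_cost_desc, p_cib, p_cic)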
3.2 Remapping the Critical Task
After the critical task has been found, the STM algorithm tries to
analyze the reason for missing the application’s performance goal.
In this respect, we recognize two different reasons: poor locality
and load imbalance. Here, we use the process of determining the
critical task to also determine the reason for not meeting the perfor-
mance goal: If the CIC proportion of the critical task is higher than
the value of the current pCIC threshold, then the algorithm assumes
that poor locality is the reason. Otherwise it takes load imbalance
as the reason for not meeting the application demands. This means
that poor locality has a higher priority than load imbalance as a
reason for not meeting the application demands, which is helpful
to reduce the energy consumption due to communications.
Subsequently, the function GetSubstitute in the STM algorithm
can follow different strategies to find a target PE to which the crit-
ical task will be remapped. The selection of remapping strategy
depends on the reason for not meeting the application’s perfor-
mance demands as well as on the type of application (data parallel
or not). The strategies that are used to find the substitute PE for
data-parallel applications are similar to the ones for regular appli-
cations except that one additional condition is taken into account
for finding the substitute PE: the substitute PE should not be a PE
onto which its parallel tasks are mapped.
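The reason classification itself is small in spirit; a sketch under the same assumptions as the previous listings:

    POOR_LOCALITY, LOAD_IMBALANCE = "poor_locality", "load_imbalance"

    def find_reason(critical, p_cic):
        # Poor locality is checked first: cost dominated by communication.
        if critical["cic"] > p_cic:
            return POOR_LOCALITY
        return LOAD_IMBALANCE   # otherwise the PE is simply overloaded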
3.2.1 Poor locality
In the case of poor locality, the STM algorithm will try to find a
better mapping for the application in question based on a minimal
cost circle (MCC) approach. A situation that has been identified as
"poor locality" is mainly due to the communication overhead be-
tween tasks. Evidently, if the communicating frequency between
two tasks is very high or the communicating data size is very large,
then these two tasks should preferably be mapped onto the same
PE or onto two different PEs that contain a more efficient intercon-
nect between each other. The MCC strategy aims at redistributing
the critical task and its neighbouring tasks over PEs such that com-
munication overhead is reduced while trying to avoid creating new
computational bottlenecks. To this end, it first finds the minimal
cost circle based on equation (2) for the critical task $p_i$:

$$\min\big(Circle\_Cost(p_i)_{mn}\big),\ \text{with } 0 \le n, m \le |P| \qquad (2)$$

where:

$$Circle\_Cost(p_i)_{mn} = \min\Big(T_i^z + \sum_{\substack{0 \le j < |P| \\ m \le j \le n}} C_{ij}^{c_{xy}}\Big) \qquad (3)$$

where $T_i^z$ denotes the execution time of task $i$ on the PE $z$ onto which task $i$ is currently mapped, and $C_{ij}^{c_{xy}}$ denotes the communication
overhead between tasks $i$ and $j$ (see Section 2.2). This strategy
is applicable for heterogeneous MPSoC architectures. However,
in this paper, our focus is on homogeneous architectures using a
shared bus interconnect. This means that each task will have a con-
stant computational cost irrespective of the PE it is mapped on,
and that communication overhead only involves internal commu-
nication within a single PE (i.e., when the communicating tasks
are mapped to the same PE) or external communication between
PEs via shared memory. Clearly, internal communication costs are
much lower than external communication costs. Figure 3.a shows
an example of an MCC (indicated by the red oval) that contains
two tasks, including the critical task (red task), whereas Figure 3.b
illustrates an MCC that only contains the critical task itself.
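For the homogeneous, shared-bus case, the intent of equations (2) and (3) can be sketched as follows: grow a small group of tasks around the critical task such that the group's execution time plus the communication crossing the group boundary is minimal. This is a free interpretation for illustration, not the authors' code:

    def circle_cost(group, comm_cost):
        # comm_cost[(a, b)] = cost when tasks a and b sit on different PEs.
        inside = set(t["name"] for t in group)
        exec_time = sum(t["exec_time"] for t in group)
        external = sum(c for (a, b), c in comm_cost.items()
                       if (a in inside) != (b in inside))   # boundary traffic
        return exec_time + external

    def min_cost_circle(critical, neighbours, comm_cost):
        best_group = [critical]
        best_cost = circle_cost(best_group, comm_cost)
        for n in neighbours:                        # try absorbing one neighbour
            group = [critical, n]
            cost = circle_cost(group, comm_cost)
            if cost < best_cost:
                best_group, best_cost = group, cost
        return best_group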
[Figure 3: Examples of an MCC for a critical task (red task): (a) an MCC containing multiple processes and (b) an MCC containing only the critical task. Node labels give the computation cost of each task on its PE; edge labels give the communication cost when the two tasks are mapped to different PEs versus the same PE.]

After the MCC of the critical task has been determined, the function GetSubstitute will choose a substitute PE for all the tasks included in the identified MCC to achieve a new mapping. For this purpose, the PU list is used, containing the processor utilisations
for each PE. The substitute PE is the PE with the lowest utilization
in the PU list that is different from the PE onto which the critical
task is currently mapped. If the MCC solely consists of the crit-
ical task itself, then the critical task will be mapped onto the PE
of a neighboring task that has the heaviest communication with the
critical task. This is, e.g., shown in Figure 3.b, where the criti-
cal task will be mapped onto the same PE as the task with cost
70. Moreover, the substitute PE should be different than the PE the
critical task is currently mapped on. Otherwise, the algorithm fails
to find a new mapping. After the substitute PE has been found,
the FIFO channels between the tasks that need to be remapped are
either mapped as internal communication onto the new PE (if com-
municating tasks are mapped onto this PE) or onto the system bus.
3.2.2 Load imbalance
In the case a load imbalance has been identified as the reason for
not meeting the application demands, a load balancing strategy is
used to remap the critical task. The substitute PE should satisfy the
condition that it is different from the current PE of the critical task
and should have the lowest processor utilization in the PU list. If
such a substitute does not exist, then the algorithm cannot find a
better mapping.
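Both remapping strategies can be summarised in one sketch of GetSubstitute (Sections 3.2.1 and 3.2.2); PU maps each PE to its utilisation, POOR_LOCALITY is the constant from the earlier sketch, and all names are illustrative:

    def get_substitute(pu, mapping, critical, mcc, reason, heaviest_neighbour=None):
        current_pe = mapping[critical["name"]]
        if reason == POOR_LOCALITY and len(mcc) == 1 and heaviest_neighbour:
            # singleton MCC: join the PE of the heaviest communication partner
            target = mapping[heaviest_neighbour["name"]]
        else:
            # otherwise: least-utilised PE different from the current one
            candidates = [pe for pe in pu if pe != current_pe]
            target = min(candidates, key=lambda pe: pu[pe], default=None)
        if target is None or target == current_pe:
            return None                     # no better mapping could be found
        for t in mcc:                       # remap every task in the MCC
            mapping[t["name"]] = target
        return mapping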
4. EXPERIMENTS
4.1 Experimental Framework
To evaluate the efficiency of our STM algorithm and the map-
pings found at run time by this algorithm, we deploy the (open-
source) Sesame system-level MPSoC simulator [14]. To this end,
we have extended this simulator with our run-time resource schedul-
ing framework, as illustrated in Figure 4. Our extension includes
the Scenario DataBase (SDB), a System Monitor (SM) and a run-
time Resource Scheduler (RS). The SDB is used to store the map-
pings for each inter-application scenario as derived from design-
time DSE. The SM is in charge of recording the running statis-
tics for each active application as well as monitoring system-wide
statistics. The RS uses the run-time task mapping algorithm and
the statistics provided by the SM to dynamically remap application
tasks when needed, as explained in the previous section.
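The interplay of these three components can be pictured with a small control-loop sketch; the class and method names below are illustrative and do not correspond to the actual Sesame interfaces:

    def scheduler_tick(sm, sdb, stm, current_mapping):
        stats = sm.collect()                          # per-application and system stats
        if stats.inter_app_scenario_changed:
            return sdb.lookup(stats.active_apps)      # static part: stored mapping
        return stm.optimize(current_mapping, stats)   # dynamic part: small customisation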
4.2 Experimental Results
In this subsection, we present several experimental results in
which we investigate various aspects of our STM algorithm and
compare it to three well-known mapping algorithms: First-Fit Bin-
Packing (FFBP) [7] which has been frequently adapted to do task
mapping by means of modelling it as a bin-packing problem, Output-
Rate Balancing (ORB) [5] and Recursive BiPartition and Refining
(RBPR) [18].

[Figure 4: Extended Sesame framework: the application model (KPN tasks per intra-application scenario) generates event traces that a mapping layer (abstract RTOS) schedules onto the architecture model (processors P0-P4 and shared MEM), extended with the Scenario DataBase, System Monitor and run-time Resource Scheduler.]

We modified these algorithms to fit our mapping
problem and extended them to also allow for mapping data-parallel
applications by constraining the data-parallel tasks so that they have
to be mapped onto different processing elements. For the FFBP al-
gorithm, the PE with the lowest utilization is taken as the first-fit bin
and the computational cost of each task in the target application is
considered as the object that needs to be packed into the bins.
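A sketch of this FFBP adaptation, assuming tasks are handed to the packer in decreasing cost order and that data-parallel tasks of one group must land on distinct PEs (an illustrative reconstruction, not the original algorithm of [7]):

    def ffbp_map(tasks, pes, parallel_group=frozenset()):
        load = {pe: 0.0 for pe in pes}
        used_by_group = set()
        mapping = {}
        for t in sorted(tasks, key=lambda t: t["cost"], reverse=True):
            candidates = [pe for pe in pes
                          if t["name"] not in parallel_group or pe not in used_by_group]
            candidates = candidates or list(pes)          # fallback if the group is too large
            pe = min(candidates, key=lambda p: load[p])   # lowest-utilisation "bin"
            mapping[t["name"]] = pe
            load[pe] += t["cost"]
            if t["name"] in parallel_group:
                used_by_group.add(pe)
        return mapping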
For our experiments, we use the three typical multi-media appli-
cations that were already introduced in Section 1: MJPEG, Sobel
and MP3. The KPN of the MJPEG application contains 8 pro-
cesses and 18 FIFO channels, Sobel contains 6 processes and 6
FIFO channels, and MP3 contains 27 processes and 52 FIFO chan-
nels. In the Sobel and MP3 applications, data parallelism is ex-
ploited. Moreover, MJPEG has 11 intra-application scenarios, MP3
has 3 intra-application scenarios, whereas Sobel only has 1 intra-
application scenario. This results in a total of 95 different workload
scenarios. At design time, we have determined the on-average best
mapping for each possible inter-application scenario as explained
in Section 3. With respect to the target architecture, we modeled
a homogeneous MPSoC containing 5 processors, connected to a
shared bus and memory. The model also includes the required com-
ponents for our run-time scheduling framework.
As there are just three applications and each application con-
tains a limited number of intra-application scenarios, we are able
to exhaustively evaluate all workload scenarios. For each work-
load scenario, we have simulated the system using two methods:
one is deploying only the static part of our STM algorithm to deal
with the dynamism at the level of different inter-application sce-
narios, whereas the other one is running all the workload scenar-
ios under a single, fixed mapping: the on-average best mapping
found for the inter-application scenario in which all three applica-
tions are concurrently executing. The results of this experiment
are shown in Figure 5. From this figure, we can see that the static
part of our STM algorithm already yields both performance im-
provements and energy savings by dynamically adjusting the map-
ping based on the variation in inter-application scenarios. For this
specific test case, the performance improvements for the different
inter-application scenarios range from 1.69% to 29.49% and the
energy savings range from 1.09% to 24.51%. Overall, for the exe-
cution of all 95 workload scenarios, the improvements in terms of
performance and energy saving are 7.4% and 9.4%, respectively.
[Figure 5: Normalized execution time and energy consumption of each inter-application scenario, with and without remapping.]

Figures 6.a and 6.b show the intra-application scenario execution times and energy consumption for the FFBP, ORB, RBPR and STM run-time mapping algorithms for a single inter-application scenario in which all three applications are concurrently executing. More-
over, these two graphs also contain the results when using optimal
mappings (OPT) for each intra-application scenario (we derived
these mappings in a design-time DSE experiment). The results in
these two graphs have been ordered in a monotonically increasing
fashion based on the results from the OPT mappings. Figures 6.c-
e show the overall (for the entire inter-application scenario) per-
formance, energy consumption and overhead. Here, the overhead
includes the run-time calculation of new mappings as well as the
migration of tasks. From Figure 6, we can see that our STM clearly
performs better than the other algorithms in terms of the execution
time of scenarios. For several intra-application scenarios, the STM
algorithm even approaches the OPT results. With respect to energy
consumption and overhead, the STM algorithm also performs well:
it ranks second, closely behind the ORB algorithm. The reason for ORB's low overhead is that it only needs to migrate a few tasks in our experiment, which implies a very low task migration cost.
In our last experiment, we used the full STM algorithm, includ-
ing the static and dynamic parts and thus combining the dynamism
of inter-application as well as intra-application scenarios, to test
all the 95 workload scenarios of our three applications. Our al-
gorithm could achieve an 11.3% performance improvement and an
energy saving of 13.9% compared to an approach in which we run
the applications using the (static) on-average best mapping for the
inter-application scenario in which all three applications are active.
Comparing these results to those when only using the static part of
our STM algorithm (improvements of 7.4% and 9.4%, respectively;
see above), this means that the dynamic part of the STM algorithm
is capable of significantly further improving the mappings.
5. RELATED RESEARCH
In recent years, much research has been performed in the area
of run-time mapping for embedded systems. The authors of [6]
propose a run-time mapping strategy that incorporates user behav-
ior information in the resource allocation process. An agent based
distributed application mapping approach for large MPSoCs is pre-
sented in [1]. The work of [9] proposes a run-time spatial mapping
technique to map streaming applications on MPSoCs. In [3], dy-
namic task allocation strategies based on bin-packing algorithms
for soft real-time applications are presented. A Dynamic Spiral
Mapping (DSM) algorithm for mapping an application on an MP-
SoC arranged in a 2-D mesh topology is proposed in [2]. The
authors of [4] present network congestion-aware heuristics for
mapping tasks on NoC-based MPSoCs at run-time. The work of
[16] uses a Smart Nearest Neighbour approach to perform run-time
task mapping. In [10], a run-time task allocator is presented that
[Figure 6: Comparing different run-time mapping algorithms (OPT, FFBP, ORB, RBPR, STM): (a) estimated execution time (cycles) and (b) estimated energy consumption (nJ) per intra-application scenario; (c) total intra-app scenario execution time; (d) total intra-app scenario energy consumption; (e) cost of the algorithms in ns, including algorithm execution cost and task migration cost.]
uses an adaptive task allocation algorithm and adaptive clustering
approach for efficient reduction of the communication load. Mar-
iani et al. [12] proposed a run-time management framework in
which Pareto-fronts with system configuration points for different
applications are determined during design-time DSE, after which
heuristics are used to dynamically select a proper system configu-
ration at run time. Compared with these algorithms, our STM al-
gorithm takes a scenario-based approach, and takes computational
and communication behavior into account to make (re-)mapping
decisions. Recently, Schor et al. [15] also proposed a scenario-
based run-time mapping approach in which mappings derived from
design-time DSE are stored for run-time mapping decisions, but
they do not cluster mappings to reduce mapping storage nor do
they dynamically optimize the mappings at run time.
6. CONCLUSION
We have proposed a run-time mapping algorithm for MPSoC-
based embedded systems to improve their performance and energy
consumption by capturing the dynamism of the application work-
loads executing on the system. This algorithm is based on the
idea of application scenarios and consists of a design-time and run-
time phase. The design-time phase produces mappings for clus-
ters of application scenarios after which the run-time phase aims
to optimize these mappings by continuously monitoring the system
and trying to perform (relatively small) mapping customizations to
gradually further improve the system performance. In various ex-
periments, we have evaluated our algorithm and compared it with
three other algorithms. The results show that our algorithm can
yield considerable improvements as compared to just using a static
mapping strategy. Comparing our algorithm with three other, well-
known run-time mapping algorithms, it shows a better trade-off be-
tween the quality and the cost of the mappings found at run time.
7. REFERENCES
[1] M. A. Al Faruque, R. Krist, and J. Henkel. Adam: run-time
agent-based distributed application mapping for on-chip
communication. In Proc. of DAC’08, pages 760–765, 2008.
[2] M. Armin, K. Ahmad, and S. Samira. Dsm: A heuristic
dynamic spiral mapping algorithm for network on chip.
IEICE Electronics Express, 5(13):464–471, 2008.
[3] E. W. Brião, D. Barcelos, and F. R. Wagner. Dynamic task
allocation strategies in mpsoc for soft real-time applications.
In Proc. of DATE’08, pages 1386–1389, 2008.
[4] E. Carvalho and F. Moraes. Congestion-aware task mapping
in heterogeneous MPSoCs. In Int. Symposium on
System-on-Chip, pages 1–4, Nov. 2008.
[5] J. Castrillon, R. Leupers, and G. Ascheid. Maps: Mapping
concurrent dataflow applications to heterogeneous mpsocs.
IEEE Trans.on Industrial Informatics, PP(99):1, 2011.
[6] C.-L. Chou and R. Marculescu. User-aware dynamic task
allocation in networks-on-chip. In Proc. of DATE’08, pages
1232–1237, 2008.
[7] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson.
Approximation algorithms for bin packing: a survey. In
Approximation algorithms for NP-hard problems, pages
46–93. PWS Publishing Co., 1997.
[8] S. V. Gheorghita, M. Palkovic, J. Hamers, A. Vandecappelle,
S. Mamagkakis, T. Basten, L. Eeckhout, H. Corporaal,
F. Catthoor, F. Vandeputte, and K. D. Bosschere.
System-scenario-based design of dynamic embedded
systems. ACM Trans. Design Autom. Electr. Syst., 14(1),
2009.
[9] P. K. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. Smit.
Run-time spatial mapping of streaming applications to a
heterogeneous multi-processor system-on-chip (mpsoc). In
Proc. of DATE’08, pages 212–217, March 2008.
[10] J. Huang, A. Raabe, C. Buckl, and A. Knoll. A workflow for
runtime adaptive task allocation on heterogeneous mpsocs.
In Proc. of DATE’11, pages 1119–1134, 2011.
[11] G. Kahn. The semantics of a simple language for parallel
programming. In Information processing, pages 471–475.
North Holland, Amsterdam, Aug 1974.
[12] G. Mariani, P. Avasare, G. Vanmeerbeeck,
C. Ykman-Couvreur, G. Palermo, C. Silvano, and
V. Zaccaria. An industrial design space exploration
framework for supporting run-time resource management on
multi-core systems. In Proc. of DATE’10, pages 196 –201,
march 2010.
[13] J. M. Paul, D. E. Thomas, and A. Bobrek. Scenario-oriented
design for single-chip heterogeneous multiprocessors. IEEE
Trans. VLSI Syst., 14(8):868–880, 2006.
[14] A. D. Pimentel, C. Erbas, and S. Polstra. A systematic
approach to exploring embedded system architectures at
multiple abstraction levels. IEEE Trans. Computers,
55(2):99–112, 2006.
[15] L. Schor, I. Bacivarov, D. Rai, H. Yang, S.-H. Kang, and
L. Thiele. Scenario-based design flow for mapping streaming
applications onto on-chip many-core systems. In Proc. of
CASES’12, pages 71–80, 2012.
[16] A. K. Singh, W. Jigang, A. Kumar, and T. Srikanthan.
Run-time mapping of multiple communicating tasks on
mpsoc platforms. Procedia CS, 1(1):1019–1026, 2010.
[17] P. van Stralen and A. D. Pimentel. Scenario-based design
space exploration of mpsocs. In Proc. of IEEE ICCD’10,
October 2010.
[18] J. Yu, J. Yao, L. Bhuyan, and J. Yang. Program mapping onto
network processors by recursive bipartitioning and refining.
In Proc. of DAC’07, pages 805 –810, june 2007.
Modern Multiprocessor Systems-on-Chips (MPSoCs) are ideal platforms for co-hosting multiple applications, which may have very distinct resource requirements (e.g. data processing intensive or communication intensive) and may start/stop execution independently at time instants unknown at design time. In such systems, the runtime task allocator, which is responsible for assigning appropriate resources to each task, is a key component to achieve high system performance. This paper presents a new task allocation strategy in which self-adaptability is introduced. By dynamically adjusting a set of key parameters at runtime, the optimization criteria of the task allocator adapts itself according to the relative scarcity of different types of resources, so that resource bottlenecks can be effectively mitigated. Compared with traditional task allocators with fixed optimization criteria, experimental results show that our adaptive task allocator achieves significant improvement both in terms of hardware efficiency and stability.