Memory Allocation for Embedded Systems with a
Compile-Time-Unknown Scratch-Pad Size
Nghi Nguyen Angel Dominguez Rajeev Barua
nghi@eng.umd.edu angelod@eng.umd.edu barua@eng.umd.edu
Electrical and Computer Engineering Department
University of Maryland
College Park, MD 20742, USA
ABSTRACT This paper presents the first memory allocation
scheme for embedded systems having scratch-pad memory whose
size is unknown at compile time. A scratch-pad memory (SPM) is a
fast compiler-managed SRAM that replaces the hardware-managed
cache. Its uses are motivated by its better real-time guarantees as
compared to cache and by its significantly lower overheads in en-
ergy consumption, area and access time.
Existing data allocation schemes for SPM all require that the
SPM size be known at compile-time. Unfortunately, the resulting
executable is tied to that size of SPM and is not portable to proces-
sor implementations having a different SPM size. Such portability
would be valuable in situations where programs for an embedded
system are not burned into the system at the time of manufacture,
but rather are downloaded onto it during deployment, either using
a network or portable media such as memory sticks. Such post-
deployment code updates are common in distributed networks and
in personal hand-held devices. The presence of different SPM sizes
in different devices is common because of the evolution in VLSI
technology across years. The result is that SPM cannot be used in
such situations with downloaded code.
To overcome this limitation, this work presents a compiler
method whose resulting executable is portable across SPMs of any
size. The executable at run-time places frequently used objects in
SPM; it considers code, global variables and stack variables for
placement in SPM. The allocation is decided by modified loader
software before the program is first run and once the SPM size
can be discovered. The loader then modifies the program binary
based on the decided allocation. To keep the overhead low, much
of the pre-processing for the allocation is done at compile-time.
Results show that our benchmarks average a 36% speed increase
versus an all-DRAM allocation, while the optimal static allocation
scheme, which knows the SPM size at compile-time and is thus an
un-achievable upper-bound, is only slightly faster (41% faster than
all-DRAM). Results also show that the overhead from our embed-
ded loader averages about 1% in both code-size and run-time of our
benchmarks.
Categories and Subject Descriptors
B.3.1 [Memory Structures]: Semiconductor Memories-DRAM,
SRAM; B.3.2 [Memory Structures]: Design Styles-Cache Mem-
ories; C.3 [Special-Purpose And Application-Based Systems]:
Real-time and Embedded Systems; D.3.4 [Programming Lan-
guages]: Processors-Code Generation, Compilers; E.2 [Data Stor-
age Representation]: Linked Representations
General Terms
Performance, Algorithms, Management, Design
Keywords
Memory Allocation, Scratch-Pad, Compiler, Embedded Systems,
Downloadable Codes, Embedded Loading, Data Linked List
1. INTRODUCTION
In both desktop and embedded systems, SRAM and DRAM are
the two most common writable memory organizations used for pro-
gram data allocation. SRAM is fast but expensive while DRAM
is slower (by a factor of 10 to 100) but less expensive (by a fac-
tor of 20 or more). To combine their advantages, a large amount
of DRAM is often used to provide low-cost capacity, along with
a small-size SRAM to reduce runtime by storing frequently used
data. The proper use of SRAM in embedded systems can introduce
an average speedup of 2x compared to using DRAM only.
This gain is likely to increase in the future since the speed of SRAM
is increasing by 60% a year versus only 7% a year for DRAM [12].
There are two common ways of adding SRAM: either as a
hardware cache or a Scratch-Pad Memory (SPM). In desktop sys-
tems, caches are the most popular approach. The caching mech-
anism dynamically stores a subset of the frequently used data in
SRAM, satisfying the dynamic behavior of program data. Caches
have been a great success for desktops, a trend that is likely to
continue in the future. On the other hand, in most embedded sys-
tems, the overheads of cache come with serious drawbacks. Cache
incurs a significant penalty in area cost, energy, hit latency and
real-time guarantees. A detailed recent study [6] compares the
tradeoffs of a cache as compared to a SPM. Their results show
that a SPM has 34% smaller area and 40% lower power consump-
tion than a cache memory of the same capacity. Further, the run-
time with a SPM using a simple static knapsack-based [6] alloca-
tion algorithm was measured to be 18% better as compared to a
cache. Thus, defying conventional wisdom, they found absolutely
no advantage to using a cache, even in high-end embedded systems
where performance is important. Given the power, cost, perfor-
mance and real time advantages of SPM, it is not surprising that
SPM is the most common form of SRAM in embedded CPUs to-
day. Examples of embedded processor families having SPM in-
clude low-end chips such as the Motorola MPC500, Analog De-
vices ADSP-21XX, Philips LPC2290; mid-grade chips such as the
Analog Devices ADSP-21160m, Atmel AT91-C140, ARM 968E-
S, Hitachi M32R-32192, Infineon XC166 and high-end chips such
as Analog Devices ADSP-TS201S, Hitachi SuperH-SH7050, and
Motorola Dragonball; there are many others. Trends in recent em-
bedded designs indicate that the dominance of SPM will likely con-
solidate further in the future [6,19], for regular as well as network
processors.
A great variety of allocation schemes for SPM have been pro-
posed recently [5,6,9, 17,24], but all of them require the SPM size
to be known at compile-time. This is because they establish their
solutions by reasoning about which data variables and code blocks
will fit in SPM at compile-time, which inherently and unavoidably
requires knowledge of the SPM size. This has not been a problem
for traditional embedded systems where the code is typically fixed
at the time of manufacture, usually by burning it into ROM, and
is not changed thereafter. There is, however, an emerging and in-
creasing class of embedded systems, where this simple allocation
strategy is no longer feasible. These are systems where the code is
updated on the embedded system after deployment, through either
downloading or portable media, where there is a need for the same
executable to run on different implementations of the same ISA.
Such a situation is common in networked embedded infrastructure
where the amount of SPM is increased every year, due to techno-
logical evolution, as expected from Moore's law. Further, in these
systems, code-updates that fix bugs, update security features or en-
hance functionality are common. Consequently, the downloaded
code may not know the SPM size of the processor, and thus is un-
able to use the SPM properly. This leaves the designers with no
choice but to use an all-DRAM allocation or a processor with a
cache, in which the well-known advantages of SPMs are lost.
To make code portable across platforms with varying SPM size,
one theoretical approach is to recompile the source code separately
using all the SPM sizes that exist in practice; and then download
all the resulting executables to each embedded node; discover the
node’s SPM size at run-time; and finally discard all the executables
for SPM sizes other than the one actually present. This has several
drawbacks. First, many executables need to be broadcast and re-
ceived, increasing network bandwidth consumption, energy use on
portable devices, and storage requirements on any portable me-
dia. Second, the complexity of the system increases and software
used to update code becomes significantly larger and more com-
plex. Third, when an unanticipated SPM size (usually larger) is
used at a future time, an executable for that size may not be readily
available; obtaining one may therefore require contacting the vendor who wrote
the application source. This is time-consuming at best and impos-
sible if the vendor no longer exists, which is possible since applica-
tion sources often linger for decades after their first development.
It is important to emphasize that this approach is our speculation –
we have not found this approach being suggested in the literature,
which is not surprising considering its drawbacks listed above. It
would be vastly preferable to have a single executable that could
run on a system with any SPM size.
Challenges Effectively utilizing memory in SPM-based-
embedded systems has always been a challenge. Deriving a mem-
ory allocation scheme for such systems when the sizes of SPMs are
unknown at compile-time is a greater challenge. Without knowing
the size of the SPM at compile-time, it becomes impossible to know
which variables could be placed in SPM at runtime. Placing even
a single variable in SPM requires knowing all locations in the
binary where that variable is accessed, so that those accesses can be
updated to the new SPM address. Extending this to multiple
variables with an unknown SPM size becomes challenging. To il-
lustrate, consider a variable A of size 4000 bytes. If the available
size of SPM is less than 4000 bytes, this variable A must remain al-
located in DRAM at some address, say, 0x8000. Otherwise, A can
be effectively allocated to SPM to achieve speedup at some address,
say, 0x400. Without the knowledge of the SPM size, the address of
A could be either 0x8000 or 0x400, and thus remains unknowable
at compile-time. Hence, it becomes difficult to generate an instruc-
tion at compile-time that accesses this variable since that requires
knowledge of its assigned address. A level of indirection can be
introduced in software for each memory access to discover its lo-
cation first, but that would incur an unacceptably high overhead.
An SPM allocation for even a single variable such as A is therefore
hard to achieve without knowing the SPM size beforehand.
Method Features and Outline In this paper, we introduce a
compiler technique for managing SPM-based-embedded systems,
which for the first time, does not require the knowledge of the
SPM size at compile time. Our method is described as follows.
At compile-time, the compiler analyzes the program to identify all
the locations in the code segment of the executable that contain
the unknowable offsets and addresses. These locations are the load
and store instructions that access the program stack, and all loca-
tions that store the addresses of global variables. The compiler then
stores the addresses of all these locations along with additional in-
formation about each variable accessed, such as their Frequency-
Per-Byte (FPB), original stack offsets, original global addresses,
and variable sizes as part of the executable for use by the embed-
ded loader. This “original” offset or address of a variable is that in
an all-DRAM no-SPM allocation. The FPB of a variable denotes
the number of times each variable is accessed in the profile data,
divided by its size.
At the next step, when the program is loaded into memory at the
beginning of run-time, a modified embedded loader is used. The
embedded loader is a set of compiler-inserted routines that exe-
cute at the beginning of the application's first execution, but not
subsequent executions. The loader routines perform the following
three tasks. First, they discover the size of SPM present on the de-
vice, either by making an OS or low-level system call if available on
that ISA, or by probing addresses in memory using a binary search
pattern and observing the latency to find the range of addresses be-
longing to SPM. Second, the loader routines compute a suitable
allocation to the SPM using its just-discovered size and the FPB of
variables. Third, the loader implements the allocation by travers-
ing the locations in the code segment of the executable that have
unknown fields and replacing them with the SPM stack offsets and
global addresses for the run-time-decided allocation. The resulting
executable is now tailored for the SPM size on the target device,
and can be executed without any further overhead. The executable
can be re-run indefinitely, as is common in embedded systems, with
no further overhead.
The embedded loader needs a list of all code locations that con-
tain memory address references for stack and global variables that
will be optimized. This list could be appended to the executable;
but that would increase its code size significantly. To avoid this in-
crease, our approach stores the locations to-be-updated in a linked
list in-place in those locations themselves. In other words, each to-be-
modified location stores the displacement in words to the next to-
be-modified location in the executable. These displacements are
stored in the bits where the still-unknown stack offsets and global
addresses will be stored at run-time. At the start of run-time this in-
place linked list is traversed, and after the displacement to the next
location is read, it is overwritten with the correct offset or address.
Since these lists are stored in the unused bits of the executable file,
they cause no increase in code size. Only the first element of each
linked list needs to be stored in the executable.
Finally at run-time when the SPM size is known, the allocation
of code and data to SPM is decided. Our allocation scheme is able
to share space for stack variables with non-overlapping lifetimes.
The resulting allocation is implemented when the embedded loader
modifies the program code at the start of run-time to correctly refer
to the addresses of objects in SPM.
Besides data objects, our method can also place frequently ex-
ecuted code blocks in SPM. They are handled much like global
variables in deciding their allocation. The implementation of the
allocation, however, requires patching the code to insert a branch
to the start of the moved block in SPM from its predecessor blocks;
as well as a branch to subsequent blocks from the end of the block
in SPM. Our method shows how this patching can be done in the
embedded loader.
Results and Paper Overview Our method is implemented on
GNU tool chain from CodeSourcery [8] for the ARM v5e embed-
ded processor family [3]. Running our allocator for eight bench-
marks, we achieve on average a 36% speedup compared to an all-
DRAM, no-SPM allocation; this compares to the 41% speedup
achieved by the optimal static allocation scheme [5]. The runtime
and code size overheads for our embedded loader are around 1.0%
each for a single run of the application, and lower still when amor-
tized across multiple runs. These results indicate that with very
low runtime and code-size overhead, our method achieves the goal
of generating code that is portable across platforms with different
sizes of SPM, while obtaining a performance that is close to that of
the unachievable optimal upper bound [5].
The rest of the paper is organized as follows. Section 2 describes
the scenarios where our method is useful. Section 3 overviews re-
lated work. Section 4 discusses the method in [5] whose allocation
we aim to reproduce without the knowledge of the SPM size. Sec-
tion 5 discusses our method in detail from the profiling stage to
embedded loading stage. Section 6 discusses the allocation policy
used in the embedded loader. Section 7 describes how program
code is allocated to SPM. Section 8 presents the experimen-
tal environment, benchmark properties, and our method's results.
Section 9 concludes.
2. SCENARIOS WHERE OUR PROPOSED
METHOD IS USEFUL
Our method is useful in the situations when the code is not
burned into ROM at the time of manufacture, but is instead down-
loaded later onto the systems; and moreover, when due to tech-
nological evolution, the code may be required to run on multi-
ple processor implementations of the same ISA having differing
amounts of SPM.
One situation where this often occurs is in distributed networks
such as a network of ATM machines at financial institutions. Such
ATM machines may be deployed in different years and therefore
have different sizes of SPM. Code-updates are usually issued to
these ATM machines over the network, to update their function-
ality, fix bugs, or install new security features. In current practice,
SPMs cannot be used for such code-updates since they do not know
the SPM size. We would like to enable such code to run on any
ATM machine with any SPM size. This is made possible by our
method.
Another situation where our technology may be useful is in sen-
sor networks. Examples of such networks are the sensors that de-
tect traffic conditions on roads or the ones that monitor environ-
mental conditions over various points in a terrain. In these long-
lived sensor networks, nodes may be added over a period of several
years. At the pace of technology evolution today, where a new
processor implementation is released every few months, this may
represent several generations of processors with increasing sizes
of SPM that are present simultaneously in the same network. Our
method will allow remote code updates, common in such sensor
networks, to use the SPM regardless of its size.
A third example is downloadable programs for personal digital
assistants (PDAs), mobile phones and other consumer electron-
ics. These applications may be downloaded over a network or
from portable media such as flash memory sticks. These programs
are designed and provided independently of the SRAM configurations
of the consumer products. Therefore, to efficiently
utilize the SPM for such downloadable software, a memory alloca-
tion scheme for unknown size SPMs is much needed. There exists a
variety of these downloadable programs on the market used for dif-
ferent purposes such as entertainment, education, business, health
and fitness, and hobbies. Real-world examples include games such
as Pocket DVD Studio [11], FreeCell, and Pack of Solitaires [27];
document managers such as PhatNotes [18], PlanMaker [22], and
e-book readers; and other tools such as Pocket Slideshow [7] and
Pocket Quicken [15]. In all these situations, our technology would
allow such code to take advantage of the SPM for the first time.
We expect that in the future our technology may eventually even
allow desktop systems to use SPM efficiently. One of the primary
reasons that caches are popular in desktops is that they deliver good
performance for any executable, without requiring it to be cus-
tomized for any particular cache size. This is in contrast to SPMs,
which so far have required customization to a particular SPM size.
By freeing programs of this restriction, SPMs can overcome one
hurdle to their use in desktops. However, there are still other hur-
dles to SPMs becoming the norm in desktop systems, including
that heap data, which our method does not handle, are more com-
mon in desktops than in embedded systems. In addition, the in-
herent advantages of SPM over cache are less important in desktop
systems. For this reason we do not consider desktops further in this
paper.
3. RELATED WORK
Static methods to allocate data to SPM include [4–6,13, 17,20,
21]. Static methods are those whose SPM allocation does not
change at run-time. Some of these methods [6, 17,20] are restricted
to allocating only global variables to SPM, while others [4,5,13, 21]
can allocate both global and stack variables to SPM. These static
allocation methods either use greedy strategies to find an efficient
solution, or model the problem as a knapsack problem or an integer-
linear programming problem (ILP) to find an optimal solution.
Some static allocation methods [2, 25] aim to allocate code to
SPM rather than data. Other static methods [23,26] can allocate
both code and data to SPM. Their data allocation is still restricted
to global and stack data only. The goal of the work in [1] is yet an-
other: to map the data in the scratch-pad among its different banks
in multi-banked scratch-pads; and then to turn off (or send to a
lower energy state) the banks that are not being actively accessed.
Dynamic methods are those which can change the SPM allo-
cation during run-time [9, 16,24]. The method in [16] can place
global and stack arrays accessed through affine functions of enclos-
ing loop induction variables in SPM. No other variables are placed
in SPM; further the optimization for each loop is local in that it
does not consider other code in the program. The method in [24] is
a fully general dynamic method that can place all kinds of global
Figure 1: Example of stack split into two separate memory units.
Variables a and b are placed in SRAM and DRAM respectively. A call
to foo() requires the stack pointers in both memories to be incremented.
and stack variables in SPM. It uses a whole-program analysis that
aims to consider the interactions between neighboring code regions
to minimize the transfer of data between SPM and DRAM while
maximizing the fraction of data found in SPM. The method in [9]
is a dynamic method that is the first SPM allocation method to place
a portion of the heap data in the SPM.
All the existing methods discussed above require the compiler
to know the size of the SPM. Moreover, the resulting executable
is meant only for processors with that size of SPM. Our method is
the first to produce an executable that makes no assumptions about
SPM size and thus is portable to any possible size.
4. BACKGROUND
The allocation strategy used by the loader in our method aims to
produce an allocation that is as similar as possible to the optimal
static allocation method presented in [5] for global and stack vari-
ables. This section outlines their method since our method builds
upon its foundation.
The allocation in [5] is as follows. In effect, for global variables,
the ones with highest FPB are placed in SPM. For stack variables,
to allow variables in the same stack frame to be allocated to differ-
ent memories (SPM vs DRAM), a distributed stack is used. Here
the stack is partitioned into two stacks for the same application:
one for SPM and the other for DRAM. Each stack frame is parti-
tioned, and two stack pointers are maintained, one pointing to the
top of the stack in each memory. An example of how a single stack
frame is distributed is shown in Figure 1. The allocator places the
frequently used stack variables in the SPM stack, and the rest are in
the DRAM stack. In this way, at run-time, only the frequently-used
stack variables (such as variable a in Figure 1) appear in SPM.
The method in [5] formulates the problem of searching the space
of possible allocations with an objective function and a set of con-
straints. The objective function to be minimized is the expected
run-time with the allocation, expressed in terms of the proposed al-
location and the profile-discovered frequency-per-bytes (FPBs) of
the variables. The constraints are that for each path through the
call graph of the program, the size of the SPM stack fits within
the SPM’s size. This constraint automatically takes advantage of
the limited lifetime of stack variables: if main() calls f1() and f2(),
then the variables in f1() and f2() share the same space in SPM, and
the constraint correctly estimates the stack height in each path. As
we shall see later, our method also takes advantage of the limited
lifetime of stack variables.
This search problem is solved in two ways: using a greedy search
and a provably optimal search based on Integer-Linear Program-
ming (ILP). Results show that both approaches produce good re-
sults, with the ILP solution doing only marginally better. Therefore
the greedy solver is also near-optimal. For this reason, and since
the greedy solver is much more practical in real compilers, in our
Figure 2: Stack variable access locations in the executable file.
Instruction "ldr rx, [fp, c]" performs rx ← mem[fp+c] and "str rx, [fp, c]"
performs mem[fp+c] ← rx.
Figure 3: Global variable address locations. Instruction
"ldmia rx, {ry, rz, rw}" performs ry ← mem[rx], rz ← mem[rx+4],
rw ← mem[rx+8].
evaluation we use the greedy solver for both the method in [5] and
in the off-line component of our method, although either could be
used.
5. METHOD
Our method’s goal is to achieve as close an allocation as possi-
ble to that in Avissar et al. [5]. In particular, we will emulate the
run-time behavior of [5] to the extent possible including its use
of a distributed stack. However, our mechanisms at compile-time
and loading-time will be significantly different from their method,
and more complicated, because of the difficulties introduced by not
knowing the SPM’s size at compile-time.
Our method introduces modifications to the profiling, compila-
tion, linking and loading stages of code development. The tasks at
each stage are described below.
Profiling Stage The application is run multiple times with differ-
ent inputs to collect the number of accesses for each variable in the
program for each input; and an average is taken. These numbers
of accesses represent how frequently a variable is accessed in the
application. Next, this frequency of each variable is divided by its
size in bytes to yield its FPB. Intuitively, variables with higher FPB
should have higher priority for placement in SPM. Thereafter, a
list of variables sorted in decreasing order of their FPBs is created.
This list is stored in the output of the profiling stage; along with
each variable is also stored its FPB and its size.
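For concreteness, the per-variable output of this stage can be pictured with the following C sketch; the record layout and the names (var_info, build_fpb_list) are our own illustrative assumptions, not taken from the actual implementation.

    #include <stdlib.h>

    /* Hypothetical record emitted by the profiling stage for one variable. */
    struct var_info {
        const char   *name;
        unsigned      size;      /* size in bytes                          */
        unsigned long accesses;  /* access count averaged over the inputs  */
        double        fpb;       /* frequency-per-byte = accesses / size   */
    };

    static int by_fpb_desc(const void *a, const void *b)
    {
        const struct var_info *x = a, *y = b;
        return (y->fpb > x->fpb) - (y->fpb < x->fpb);
    }

    /* Compute each variable's FPB and sort the list in decreasing FPB order. */
    void build_fpb_list(struct var_info *v, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            v[i].fpb = (double)v[i].accesses / (double)v[i].size;
        qsort(v, n, sizeof v[0], by_fpb_desc);
    }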
Compiling Stage Since the SPM size is unknown, we do not
fix the allocation at compile-time, but delay the assignments of
variable addresses and offsets until runtime. Various types of pre-
processing are done in the compiler to reduce the embedded loader
overhead. These are described next.
As the first step, the compiler analyzes the program to identify
all the code locations that contain data which is unknown due to
not knowing the SPM’s size at compile time. These locations are
the load and store instructions that access program stack variables,
and all locations that store the addresses of global variables.
Figure 4: Before Modification
These locations are identified by their addresses in the executable
file.
Let us consider how stack accesses are handled. For the ARM
architecture on which we performed our experiments, the locations
in the executable file that affect the stack offset assignments are the
load and store instructions that access the stack variables, and the
arithmetic instructions that calculate their offsets. In the usual case,
when the stack offset value is small enough to fit into the immediate
field of the load/store instruction, these load and store instructions
are the only ones that affect the stack offset assignments. The first
ldr and the subsequent str instructions in Figure 2 illustrate two
accesses of this type, where the stack offset value of -44 from the
frame pointer (fp) fits in the 12-bit immediate field of the load/store
instructions in ARM.
In some rare cases, when the stack offset value is larger than the
range of the immediate field of load/store instruction, additional
arithmetic instructions are needed to calculate the correct offset of
a stack variable. Such cases arise for procedures with frame sizes
that are larger than the span of the immediate field of the load/store
instructions. In ARM, this translates to stack offsets larger than
2^12 = 4096 bytes. In these rare cases, the stack offset is first moved
to a register and then added to the frame pointer. An example is
seen in the three-instruction sequence (mov,add,ldr) at the bot-
tom of figure 2. Since the mov instruction allows 16-bit immedi-
ates, stack offsets of up to 64Kbytes are allowed in this addressing
mode. So the offset shown (-4160) fits in the immediate field of the
mov instruction. In this case, only the mov instruction needs to be
added to the linked list of locations with unknown immediate fields
that we maintain, since only its field needs to be changed by the
embedded loader.
For global variables in ARM, the addresses are stored in the lit-
eral tables. Literal tables are locations in the code segment which
contain the full 32-bit addresses of global variables in the program.
In ARM, they reside just after each procedure in the executable file.
In a rare situation, the literal tables can also appear in the middle of
the code segment of a function with a branch instruction jumping
around it for the sake of program correctness. This situation occurs
only when the code length of a function is larger than the range of
the load immediates used in the code to access the literal tables. An
example of a literal table is presented in Figure 3.
After identifying the locations that need to be modified by the
embedded loader – those are locations containing stack offsets and
global addresses – the compiler creates a linked-list of such loca-
tions for each variable for use in the linking stage. This compiler
linked-list is not yet the in-place linked list stored in the instruc-
tions themselves. It is, however, used later to establish the actual
in-place linked-list at linking time, when the exact displacements
of the to-be-modified locations are known.
The compiler also analyzes the limited lifetimes of the stack vari-
ables to determine the additional sets of variables for allocating into
SPM for each cut-off point. Details of the allocation policy and life-
Figure 5: After Modification
time analysis are presented in section 6. Finally the compiler inserts
the embedded loader routines into the code. A part of the loader is
code that will find the SPM vs. DRAM allocation at run-time.
Linking Stage At the end of the linking stage, to avoid significant
code-size overhead, we store the linked-list of all locations in code
segment with unknown immediate values in-place in the locations
themselves. This is possible since the locations will be overwritten
with the correct immediate values only at the start of run-time, and
until then, they can be used to store the displacement to the next
element in the list, expressed in words. To achieve this, at the end of
the linker stage, the linker traverses the compiler-generated linked
lists, and converts them to the in-place format. This is possible in
the linker stage since the exact displacements in the executable of
all locations are now known. The addresses of the first locations in
the linked-lists are also stored elsewhere in a table in the executable
to be used at run-time as the starting addresses of the linked-lists.
With this technique, the linked lists can be easily traversed at the
start of run-time.
An example of the code conversion in the linker is shown in fig-
ures 4 and 5. Figure 4 shows the output code from the compiler
with the stack offsets assuming an all-DRAM allocation. Figure 5
shows the same code after the linker converts the separately stored
linked lists to in-place linked lists in the code. Each instruction now
stores the displacement to the next address.
The in-place linked list representation is possible because in
most cases the bit-width of the immediate fields is sufficient to store
the displacement to the next access of a variable. For example, for
stack accesses in ARM, the immediates are either 12 or 16-bit as
described earlier, which yield an allowed displacement to the next
instruction of 4096 or 64K words. The presence of these multiple
widths in the same linked list causes no problem since the loader
will look at the instruction opcode to know which is used (ldr/str
12 bits; mov 16 bits). In most cases, the next use of a variable
is close to the current use, so this displacement is adequate. In the
rare case it is not, a new linked list is created for this same variable
and handled identically thereafter.
For global variables, the literal table entries are 32-bits wide. In
a 32-bit address space, this is wide enough to store all possible
displacements, so a single linked list is always adequate for each
global variable.
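The load-time traversal of one such in-place linked list can be sketched as follows. This is a simplified sketch, not the authors' loader code: it views the code segment as an array of 32-bit words, assumes a single immediate width per list (the loader instead infers the width from each instruction's opcode), and assumes a zero displacement marks the end of a list.

    #include <stdint.h>

    /* Sketch: walk one in-place linked list and patch each location's
       immediate field with the run-time-decided SPM offset or address.
       'code'  : the code segment viewed as 32-bit words
       'head'  : word index of the first to-be-patched location (from the
                 table of list heads stored in the executable)
       'width' : immediate width in bits for this list (e.g., 12, 16 or 32)
       'value' : the final stack offset or global address to write       */
    static void patch_list(uint32_t *code, uint32_t head,
                           unsigned width, uint32_t value)
    {
        uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
        uint32_t idx  = head;

        for (;;) {
            uint32_t disp = code[idx] & mask;               /* words to next entry */
            code[idx] = (code[idx] & ~mask) | (value & mask);
            if (disp == 0)                                  /* assumed end marker  */
                break;
            idx += disp;
        }
    }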
Although our method above is illustrated with the example of the
ARM ISA, it is applicable to most embedded ISAs. To apply our
method for any other ISA, the locations in the program code that
store the immediates for stack offsets and global addresses must be
identified and stored in the linked lists. The exact widths of the
immediate fields may differ from ARM, leading to more or fewer
linked lists than in ARM. However because accesses to the same
variable are often close together in the code, the number of linked
lists is expected to remain small.
Embedded Loader The embedded loader is implemented in a set
of compiler-inserted codes that are executed just after the program
Define:
A: the list of all global and stack variables in decreasing FPB order
Greedy_Set: the set of variables allocated greedily to SPM
Limited_Lifetime_Bonus_Set: the limited-lifetime-bonus-set of variables for SPM
GREEDY_SIZE: the cumulative size of variables greedily allocated to SPM at each cut-off point
BONUS_SIZE: the cumulative size of variables in the limited-lifetime-bonus-set
MAX_HEIGHT_SPM_STACK: the maximum height of the SPM stack during the lifetime of the current variable

void Find_allocation(A) { /* Run at compile-time */
1.  for (i = beginning to end of FPB list A) {
2.    GREEDY_SIZE ← 0; BONUS_SIZE ← 0;
3.    Greedy_Set ← NULL; Limited_Lifetime_Bonus_Set ← NULL;
4.    for (j = 0 to i) {
5.      GREEDY_SIZE ← GREEDY_SIZE + size of A[j]; /* jth variable in FPB list */
6.      Add A[j] to the Greedy_Set;
7.    }
8.    Call Find_limited_lifetime_bonus_set(i, GREEDY_SIZE);
9.    Save Limited_Lifetime_Bonus_Set for cut-off at variable A[i] in the executable;
10. }
11. return; }

void Find_limited_lifetime_bonus_set(cut-off-point, GREEDY_SIZE) {
12. for (k = cut-off-point to end of FPB list A) {
13.   Add stack variables in Greedy_Set ∪ Limited_Lifetime_Bonus_Set to the SPM stack;
14.   if (A[k] is a stack variable) {
15.     Find MAX_HEIGHT_SPM_STACK among all call-graph paths from main() to leaf procedures that go through the procedure containing A[k];
16.   } else { /* A[k] is a global variable */
17.     Find MAX_HEIGHT_SPM_STACK among all call-graph paths from main() to leaf procedures;
18.   }
19.   ACTUAL_SPM_FOOTPRINT ← (size of globals in Greedy_Set ∪ Limited_Lifetime_Bonus_Set) + MAX_HEIGHT_SPM_STACK;
20.   if (GREEDY_SIZE - ACTUAL_SPM_FOOTPRINT ≥ size of A[k]) { /* L.H.S. is the over-estimated amount */
21.     Add A[k] to the Limited_Lifetime_Bonus_Set;
22.     BONUS_SIZE ← BONUS_SIZE + size of A[k];
23.   }
24. }
25. return; }
Figure 6: Compiler pre-processing pseudo-code that finds Limited Lifetime Bonus Set at each cut-off
is loaded in memory. As a part of the executable, it is executed by
an OS call to a loader routine in the executable, or by an application
call at the start of main(). It is executed only before the first time
the program is run, and not before subsequent runs; these can be
differentiated by a persistent is-first-time boolean variable in the
loader routine. In this way, the overhead of the embedded loader is
encountered only once even if the program is re-run indefinitely, as
is common in embedded systems.
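The once-only behaviour amounts to a guard such as the following; the flag name and its placement in persistent storage are illustrative assumptions.

    /* Sketch of the persistent is-first-time guard described above. */
    static int spm_loader_done;        /* assumed to reside in persistent storage */

    void spm_embedded_loader(void)
    {
        if (spm_loader_done)
            return;                    /* re-runs skip all loader work */

        /* ... discover SPM size, decide the allocation, patch the binary ... */

        spm_loader_done = 1;           /* recorded so that later runs see it */
    }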
The loader routines perform the following three tasks. First, they
discover the size of SPM present on the device, either by making an
OS system call if available on that ISA, or by probing addresses in
memory using a binary search and observing the latency to find the
range of addresses in SPM. Second, the loader routines compute a
suitable allocation to the SPM using its just-discovered size and the
frequency-per-byte of variables. The details of the allocation are
described in section 6. Third, the loader implements the allocation
by traversing the locations in the code that have unknown fields and
replacing them with the stack offsets and global addresses for the
run-time-decided allocation. The resulting executable is now tai-
lored for the SPM size on that device, and can be executed without
any further overhead.
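When no OS call is available, the latency-probing discovery of the SPM size might look like the sketch below. It assumes that SPM occupies a contiguous range starting at a known base address, that reading beyond it is harmless on the target, and that a cycle counter is readable; read_cycle_counter(), SPM_BASE and the constants are hypothetical.

    #include <stdint.h>

    extern uint32_t read_cycle_counter(void);               /* assumed platform hook */
    #define SPM_BASE      ((volatile uint32_t *)0x00200000) /* assumed SPM base      */
    #define SPM_MAX_WORDS (1u << 20)                        /* assumed upper bound   */
    #define FAST_CYCLES   4u                                /* SPM-vs-DRAM threshold */

    static int is_fast(volatile uint32_t *p)
    {
        uint32_t t0 = read_cycle_counter();
        (void)*p;                                           /* timed load            */
        return (read_cycle_counter() - t0) <= FAST_CYCLES;
    }

    /* Binary-search for the first word whose access latency looks like DRAM;
       everything below that boundary is taken to be SPM. */
    uint32_t discover_spm_size(void)
    {
        uint32_t lo = 0, hi = SPM_MAX_WORDS;                /* lo assumed inside SPM */
        while (hi - lo > 1) {
            uint32_t mid = lo + (hi - lo) / 2;
            if (is_fast(SPM_BASE + mid))
                lo = mid;
            else
                hi = mid;
        }
        return hi * sizeof(uint32_t);                       /* SPM size in bytes     */
    }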
Although the embedded loader name implies that it is part of
the already-provided default system loader, it does not have to be.
Indeed the system loader need not be modified at all in our frame-
work. Instead the embedded loader is usually implemented as a set
of routines that are executed from inside the application at its be-
ginning; the routines themselves can be stored as part of the appli-
cation or in a library. Nevertheless since its functionality is closer
in spirit to a loader, we feel the “loader” name is appropriate.
6. ALLOCATION POLICY IN EMBEDDED
LOADER
The SPM-DRAM allocation is decided by the embedded loader
using the run-time discovered SPM-size, the frequency-per-byte
(FPB) of each variable and additional pre-processing information
about limited lifetimes of stack variables that the compiler pro-
vides. The greedy profile-driven cost allocation in the loader is
as follows. The embedded loader traverses the list of all global
and stack variables stored by the compiler in decreasing order of
their FPBs, placing variables into SPM, until the cumulative size of
the variables allocated so far exceeds the SPM size. This point in
the list is called its cut-off point.
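One reading of this greedy pass is sketched below; sizes[] holds each variable's size in decreasing-FPB order, and stopping just before the SPM overflows is our simplification.

    #include <stddef.h>

    /* Sketch of the greedy pass: walk the FPB-sorted list and mark variables
       for SPM until the next one no longer fits; that variable is the cut-off. */
    size_t greedy_cutoff(const unsigned *sizes, size_t n, unsigned spm_size)
    {
        unsigned used = 0;
        size_t i;
        for (i = 0; i < n; i++) {
            if (used + sizes[i] > spm_size)
                break;                 /* sizes[i] is the cut-off point */
            used += sizes[i];          /* variable i is placed in SPM   */
        }
        return i;                      /* index of the cut-off variable */
    }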
We observe, however, that the SPM may not actually be full on
each call graph path at the cut-off point because of the limited life-
times of stack variables. For example, if main() calls f1() and f2(),
then the variables in f1() and f2() can share the same space in SPM
since they have non-overlapping lifetimes, and simply cumulating
their sizes over-estimates the maximum height of the SPM stack.
Thus the greedy allocation under-utilizes the SPM.
Our method uses this opportunity to allocate an additional set of
stack variables in SPM to utilize the remaining SPM space. We call
this the limited-lifetime-bonus-set of variables to place in SPM. To
avoid an expensive search at loading time, this set is computed off-
line by the compiler and stored in the executable for each possible
cut-off point in the FPB-sorted list. Since the greedy search can
cut-off at any variable, a bonus set must be pre-computed for each
variable in the program. Once this list is available to our embedded
loader at the start of run-time, it implements its allocations in the
same way as for other variables. In this way, our method takes
advantage of the limited lifetimes of stack variables.
The compiler algorithm to compute the limited-lifetime-bonus-
set of variables at each cut-off point in the FPB list is presented in
figure 6. Lines 1-11 show the main loop traversing the FPB-sorted
list in decreasing order of FPB. Lines 4-7 find the greedy allocation
for the cut-off point at variable i. Line 8 makes the call to a rou-
Figure 7: Program code is divided into code regions
tine to find the limited-lifetime-bonus-set at this cut-off point; the
routine is in lines 12-25. Then the unutilized space in SPM is com-
puted as the difference of the greedily-estimated size and the actual
memory footprint (line 20), which may be lower because of lim-
ited lifetimes. Additional variables are then found to fill this space
in decreasing order of FPB among the remaining variables. This
search of bonus variables considers the stack allocation only along
paths through the current variable’s procedure if it is a stack vari-
able (line 15); therefore it does not itself over-estimate the memory
footprint.
Two factors reduce the code-size increase from storing the bonus
sets at each cut-off. First, the bonus sets are stored in bit-vector
representation on the set of variables, and so are extremely com-
pact. Second, in a simple optimization, instead of defining cut-
offs at each variable, a cut-off is defined at a variable only if
the cumulative size of variables from the previous cut-off ex-
ceeds CUT OFF THRESHOLD, a small constant currently set at
10 words. This avoids defining a new cut-off every time a sin-
gle scalar variable is considered; instead, groups of adjacent scalars
with similar FPBs are considered together for the purpose of com-
puting a bonus set. This can reduce the code space increase by up
to a factor of 10, with only a small cost in SPM utilization.
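The thinning of cut-off points can be sketched as follows, again over the decreasing-FPB list; the flag array and names are illustrative.

    #include <stddef.h>

    #define CUT_OFF_THRESHOLD (10 * 4)   /* 10 words, as described above */

    /* Sketch: define a cut-off at a variable only if more than
       CUT_OFF_THRESHOLD bytes have accumulated since the previous cut-off. */
    void choose_cutoff_points(const unsigned *sizes, size_t n, int *is_cutoff)
    {
        unsigned since_last = 0;
        for (size_t i = 0; i < n; i++) {
            since_last += sizes[i];
            is_cutoff[i] = (since_last > CUT_OFF_THRESHOLD);
            if (is_cutoff[i])
                since_last = 0;          /* a bonus set is pre-computed here */
        }
    }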
7. CODE ALLOCATION
Our method allows program code to be allocated to SPM in sim-
ilar manner to data. Code is considered for placement in SPM at
the granularity of regions. For this reason, the program code is par-
titioned into regions. Some criteria for a good choice of regions are
(i) the regions should not be too big, so that fine-grained considera-
tion of placement of code in SPM remains possible; (ii) the regions should not be
so small that they create a very large search problem and excessive
patching of code; (iii) the regions should correspond to significant
changes in frequency of access, so that regions are not forced to
allocate infrequent code in them just to bring their frequent parts
in SPM; and (iv) except in nested loops, the regions should contain
entire loops in them so that the patching at the start and end of the
region is not inside a loop, and therefore has low overhead. With
these considerations, we define a new region to begin at (i) the start
of each procedure; and (ii) just before the start, and at the end, of
every loop (even inner loops of nested loops). Other choices are
possible, but we have found this heuristic choice to work well in
practice.
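As an illustration of the heuristic (not the tool's output), the comments below mark where new regions would begin in a small hypothetical function.

    /* Illustration: region boundaries under the heuristic above. */
    int sum_array(const int *a, int n)    /* new region: start of the procedure */
    {
        int s = 0;
                                          /* new region: just before the loop   */
        for (int i = 0; i < n; i++)
            s += a[i];
                                          /* new region: at the end of the loop */
        return s;
    }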
An example of how code is partitioned into regions is in Figure 7.
As the following step, each region’s profiled data such as size, FPB,
start and end addresses are collected at the profiling stage along
with profiled data for program variables.
Figure 8: Jump instruction is inserted to redirect control flow
between SPM and DRAM
Since code regions and global variables have the same lifetime
characteristics, code allocation, and therefore region allocation, is de-
cided at embedded loading time using the same allocation policy
as global variables. The greedy profile-driven cost allocation in
the embedded loader is modified to include code regions as fol-
lows. The embedded loader traverses the list of all global vari-
ables, stack variables, and code regions stored by the compiler in
a decreasing order of their FPBs, placing variables and transferring
code regions into SPM, until the cumulative size of the variables
and code allocated so far exceeds the SPM size. At this cut-off
point, an additional set of variables and code regions, which are
established at compile time by the limited-lifetime-bonus-set algo-
rithm for both data variables and code regions, are also allocated
to SPM. The limited-lifetime-bonus-set algorithm is modified to
include code regions, which are treated as additional global vari-
ables.
Since relocating a region of program code into SPM can break
the control flow of the program, code-patching is needed at sev-
eral places to ensure that the code with SPM allocation is function-
ally correct. Figure 8 shows the patching needed. At embedded
loading time, for each code region that is transferred to SPM, our
method inserts a jump instruction at the original DRAM address of
the start of this region. The copy of this region in DRAM becomes
unused DRAM space1. Upon reaching this loading-time inserted
instruction, execution will jump to the SPM address this region is
assigned, thereby redirecting all incoming execution paths of this
region to the correct address in SPM.
Similarly, we also insert a patching instruction as the last instruc-
tion of the SPM allocated code region, which redirects program
flow back to DRAM. The distance from the original DRAM space
to the newly allocated SPM space of the transferred region usually
fits into the immediate field of the jump instructions. In the ARM
architecture, which we use for evaluation, jump instructions have a
24-bit offset which is large enough in most cases. In the rare cases
that the offset is too large to fit in the space available in the jump in-
struction, a longer sequence of instructions is needed for the jump;
this sequence first places the offset into a register and then jumps
to the contents of the register.
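The common-case patch can be sketched as the following helper, which overwrites the first word of the region's DRAM copy with an unconditional ARM branch. It assumes 32-bit ARM (not Thumb) code and a displacement that fits in the 24-bit field; the rare long-jump sequence is omitted, and this is our sketch rather than the authors' loader code.

    #include <stdint.h>

    /* Sketch: write "B <to_addr>" at the word located at 'at', whose run-time
       address is 'from_addr'. ARM branches are PC-relative with an implicit
       +8 and a word-granular 24-bit signed offset. */
    static void patch_branch(uint32_t *at, uint32_t from_addr, uint32_t to_addr)
    {
        int32_t words = (int32_t)(to_addr - (from_addr + 8)) / 4;
        *at = 0xEA000000u | ((uint32_t)words & 0x00FFFFFFu);   /* cond=AL, opcode=B */
    }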
Besides incoming and outgoing paths, side entries and side exits
of the optimized regions also need modification to ensure correct
1We do not attempt to recover this space since it will require patch-
ing code even when it is not moved to SPM, unlike in our current
scheme. Moreover since the SPM is usually a small fraction of the
DRAM space, the space recovered in DRAM will be small.
Application    Source    Description                         Data Size (Bytes)   Lines of Code   # of assembly instr.
StringSearch   MIBench   A Pratt-Boyer-Moore String Search   12820               3037            4433
CRC            MIBench   32-bit ANSI X3.66 CRC checksum      1068                187             504
Dijkstra       MIBench   Shortest Path Algorithm             4097                174             501
EdgeDetect     UTDSP     Edge Detection in an image          196848              297             701
FFT            UTDSP     Fast Fourier Transform              16568               189             478
KS             PtrDist   Minimum Spanning Tree for Graphs    27702               408             1327
MMULT          UTDSP     Matrix Multiplication               120204              164             416
Qsort          MIBench   Quick Sort Algorithm                7680000             45              116
Table 1: Application Characteristics
Figure 9: Runtime speedup compared to all-DRAM method and Static Optimal Method
control flow. With our definition of regions, side entries are usually
caused by unstructured control flow from programming statements
such as “goto”, which are rare in applications. Our method does not
consider regions which are the target of unstructured control flow
for SPM allocation; therefore, no further modification is needed for
side entries of SPM-allocated regions.
On the other hand, side exits such as procedure calls from our
code regions are common. The returns from procedures do not
need patching since their target address is computed at run-time.
However side exits such as the calls to procedures are patched as
follows. For each SPM-allocated code region, the branch offsets of
all control transfer instructions that branch to outside of the region
they belong to, are adjusted to the new corrected offsets. These
new corrected branch offsets are calculated by adding the original
branch offsets to the distance between DRAM and SPM starting
addresses of the transferred regions.
The final step in the patching required for code allocation is the
modification of load-address instructions of global variables, which
are accessed in the SPM-allocated regions. The load-address in-
struction of a global variable is obtained from a PC-relative load
that loads the address of the global variables from the literal table
also in the code. Allocating code regions with such load-address
instructions into SPM will make the original relative offsets in-
valid. Moreover, for the ARM architecture, the relative offsets of
the load-address instructions are 12-bit. Thus, it is quite likely that
the distance from the load-address instructions in SPM to the literal
tables in DRAM is too large to fit into those 12-bit relative offsets.
To solve these two problems, our method generates a second set of
literal tables which reside in SPM. While code objects are
being placed in SPM one after another, a literal table is generated
at a point in the SPM layout if the code about to be placed cannot re-
fer to the previously generated literal table in SPM because it is out of
range. This leads to (roughly) one literal table per 2^12 = 4096 bytes
of code in SPM. Each secondary SPM literal table contains the
addresses of only those global variables that are referenced through it. Afterward,
the relative offsets of these load-address instructions are adjusted
to the corrected offsets, which are calculated by the distance from
the load-address instructions to the SPM literal table.
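The placement of secondary literal tables can be sketched as below; the layout bookkeeping (the region_size array, the cursor, and the table size itself) is simplified and hypothetical, but the 4096-byte reach follows the ARM discussion above.

    #include <stddef.h>

    #define LDR_LIT_RANGE 4096u              /* assumed 12-bit PC-relative reach */

    /* Sketch: while laying code regions out in SPM, start a new literal table
       whenever the next region could not reach the previous table. Returns the
       next free SPM offset; table sizes themselves are ignored for brevity. */
    unsigned place_regions(const unsigned *region_size, size_t nregions,
                           unsigned spm_cursor)
    {
        unsigned last_table = spm_cursor;    /* a table is emitted up front      */
        for (size_t r = 0; r < nregions; r++) {
            if (spm_cursor + region_size[r] - last_table > LDR_LIT_RANGE)
                last_table = spm_cursor;     /* emit a new literal table here    */
            spm_cursor += region_size[r];    /* region r is copied to SPM here   */
        }
        return spm_cursor;
    }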
8. RESULTS
This section presents our results by comparing the proposed al-
location scheme for embedded systems with unknown size SPM
against an all-DRAM allocation and against Avissar et al.'s method
in [5]. We compare our method to the all-DRAM-allocation
method since all existing methods are inapplicable in our target
systems where the SPM size is unknown at compile-time; thus,
they have to force all data and code allocation to DRAM. We also
compare our scheme to [5] to show that our scheme obtains a per-
formance which is close to the un-achievable optimal upper-bound.
Experimental Environment Our method is implemented on the
GNU tool chain from CodeSourcery [8] that produces code for the
ARM v5e embedded processor family [3]. The process of identi-
fying variable accesses and addresses, analysis of variable limited
lifetime, and embedded loader codes generation are implemented in
the GCC v3.4 cross-compiler. The modifications of executable file
are done in the linker of the same tool chain. The memory charac-
teristics are as follows. An external DRAM with 20-cycle latency,
Flash memory with 10-cycle latency, and an internal SRAM (SPM)
with 1-cycle latency are simulated. Data is placed in DRAM and
code in Flash memory. Code is most commonly placed in Flash
memory today when it needs to be downloaded. A set of most fre-
quently used data and code is intelligently allocated to SPM by our
compiler. The memory latencies assumed for Flash and DRAM are
representative of those in modern embedded systems [10,14]. The
SRAM size is configured to be 20% of the total data size in the
program2. The benchmarks’ characteristics are shown in table 1.
Runtime Speedup The run-time for each benchmark is presented
in figure 9 for five configurations: all-DRAM, our method for data
2We could have also chosen a second SRAM size to be 20% of total
code + data size, when evaluating the methods for both code and
data. However, to make comparisons between methods for data
only and for both code and data, we had to choose one SRAM size
so that the comparison is fair.
Figure 10: Runtime speedup of our method with and without limited lifetime compared to all-DRAM method
Figure 11: Variation of embedded loading time across benchmarks
Figure 12: Runtime Overhead
allocation only, optimal upper bound obtained by using [5] for data
allocation only, our method enhanced for both code and data al-
locations, and the optimal upper bound obtained by using [5] for
both code and data allocations. Averaging across the eight bench-
marks, our full method (the fourth bar) achieves a 36% speedup
compared to all-DRAM allocation (the first bar). The provable op-
timal static allocation method [5], extended for code in addition
to data, achieves a speedup of 41% on the same set of benchmarks
(the fifth bar). This small difference indicates that we can obtain a
performance that is close to that in [5] without requiring the knowl-
edge of SPM’s size at compile time.
The figure also shows that when only data is considered for al-
location to SPM, a smaller run-time gain of 27% is observed ver-
sus an upper bound of 31% for the optimal static allocation. This
shows that considering code for SPM placement rather than just
data yields an additional 36%-27%=9% improvement in run-time
for the same size of SPM.
The performance of the limited-lifetime algorithm is shown in
figure 10. The difference between the second and third bars in fig-
ure 10 gives the improvement using our limited lifetime analysis, as
compared to a greedy allocation, for each benchmark. Although the
average benefit is small (4% on average), for certain benchmarks
(for example, StringSearch and Dijkstra), the benefit is greater.
This shows that the limited lifetime enhancement is worth doing
but is not critical.
Loading Time Overhead Figure 12 shows the increase in the
run-time from the embedded loader as a percentage of the run-
time of one execution of the application. The figure shows that this
run-time overhead from the loader averages only 1.3% across the
benchmarks. A majority of the overhead is from code allocation
including the latency of copying code from DRAM to SRAM at
embedded loading time. The overhead is an even smaller percent-
age when amortized over several runs of the application; re-runs
are common in embedded systems. The reason why the runtime
overhead is small is explained as follows. The embedded load-
ing time is proportional to the total number of appearances in the
executable file of load and store instructions that access the pro-
gram stack, and of the locations that store global variable addresses.
These numbers are in turn upper-bounded by the number of static
instructions in the code. On the other hand, the run-time of the ap-
plication is proportional to the number of dynamic instructions ex-
ecuted, which usually far exceeds the number of static instructions
because of loops and repeated calls to procedures. Consequently
the overhead of the loader is small as a percentage of the run-time
of the application.
Another metric is the absolute time taken by the embedded
loader. This is the waiting time between when the application has
finished downloading and is ready to run after the loader has com-
pleted its work. For a good response time, this number should be
low. Figure 11 shows that this waiting time is very low, and av-
erages 50 micro-seconds across the eight benchmarks. It will be
larger for larger benchmarks, and is expected to grow roughly lin-
early in the size of the benchmark.
Code Size Overhead Figure 13 shows the code size overhead
of our method for each benchmark. The code-size increase from
our method compared to the unmodified executable that does not
use the SPM averages 1.0% across the benchmarks. The code-size
overhead is small because of our technique of reusing the unused
bit-fields in the executable file to store the linked lists containing
locations with unknown stack offsets and global addresses. In ad-
dition, the figure shows the code size overhead from its constituent
two parts: the embedded loader code and the additional informa-
tion about each variable and code region in the program which is
stored until runtime. This additional information consists of the starting
addresses of the location linked-lists, region sizes, region start
and end addresses, variable sizes, and original stack offsets and global
variable addresses.
Memory Access Distribution Figure 14 shows the distribution
of memory accesses between SRAM and DRAM. This is another
view of our method's performance in terms of what percentage
of the memory accesses our method is able to direct to SRAM
instead of allocating them in DRAM. Figure 14 indicates that on
average, 53% of memory accesses are to SRAM for our method,
vs. 59% for Avissar et al.'s method in [5].
Figure 13: Variation of code size overhead across benchmarks
Figure 14: Comparison of memory access distribution against Static Optimal Method
Figure 15: Runtime Speedup with varying SPM Sizes for Dijkstra Benchmark
Runtime vs. SPM size Figure 15 shows the variation of runtime
for the Dijkstra benchmark with different SPM size configurations
ranging from 5% to 35% of the data size. When the SPM size
is set lower than 15% of the data size, neither our method nor the
optimal solution in [5] gains much speedup for this particular
benchmark. Our method starts achieving good performance when
the SPM size exceeds 15% of the data size, since at that point
the more significant data structures in the benchmark start to fit
in the SPM. When the SPM size exceeds 30% of the data set, a
point of diminishing returns is reached, in that the variables that do
not fit are not frequently used. The point of this example is not
so much to illustrate the absolute performance of the methods.
Rather, it is to demonstrate that our method is able to closely track
the performance of the optimal static allocation in a robust manner
across the different sizes, using the exact same executable. In
contrast, the optimal static allocation uses a different executable for
each size.
9. CONCLUSION
In this paper, we introduce a compiler technique that, for the
first time, is able to generate code that is portable across different
SPM sizes. With technology evolution every year leading to differ-
ent SPM sizes across processor implementations of the same ISA, there
is a need for a method that can generate such portable code. Our
method is also able to share memory between stack variables that
have mutually disjoint lifetimes. Our results indicate that on aver-
age, the proposed method achieves 36% speedup compared to all-
DRAM allocation without knowing the size of the SPM at compile-
time. The speedup is only slightly higher (41% versus all-DRAM) for
an unattainable optimal upper-bound allocation that requires know-
ing the SPM size at compile-time [5].
Our method currently only considers program code, global vari-
ables and stack variables for static SPM allocation. Dynamic allo-
cation schemes and heap allocation schemes may also be investi-
gated in the future.
10. REFERENCES
[1] F. Angiolini, L. Benini, and A. Caprara. Polynomial-time
algorithm for on-chip scratchpad memory partitioning. In
Proceedings of the 2003 international conference on
Compilers, architectures and synthesis for embedded
systems, pages 318–326. ACM Press, 2003.
[2] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and
M. Olivieri. A post-compiler approach to scratchpad
mapping of code. In Proceedings of the 2004 international
conference on Compilers, architecture, and synthesis for
embedded systems, pages 259–267. ACM Press, 2004.
[3] ARM968E-S 32-bit Embedded Core. Arm, Revised March
2004.
http://www.arm.com/products/CPUs/ARM968E-S.html.
[4] O. Avissar, R. Barua, and D. Stewart. Heterogeneous
Memory Management for Embedded Systems. In
Proceedings of the ACM 2nd International Conference on
Compilers, Architectures, and Synthesis for Embedded
Systems (CASES), November 2001. Also at
http://www.ece.umd.edu/barua.
[5] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory
Allocation Scheme for Scratch-Pad Based Embedded
Systems. ACM Transactions on Embedded Systems (TECS),
1(1), September 2002.
[6] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and
P. Marwedel. Scratchpad Memory: A Design Alternative for
Cache On-chip memory in Embedded Systems. In Tenth
International Symposium on Hardware/Software Codesign
(CODES), Estes Park, Colorado, May 6-8 2002. ACM.
[7] Cnetx. Downloadable software.
http://www.cnetx.com/slideshow/.
[8] CodeSourcery. http://www.codesourcery.com/.
[9] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data
Allocation to Scratch-Pad Memory in Embedded Systems.
Journal of Embedded Computing (JEC), 2005. Cambridge
International Science Publishing. To appear August 2005.
[10] Intel wireless flash memory (W30). Intel Corporation.
http://www.intel.com/design/flcomp/datashts/290702.htm.
[11] Handango. Downloadable software.
http://www.handango.com/.
[12] J. Hennessy and D. Patterson. Computer Architecture A
Quantitative Approach. Morgan Kaufmann, Palo Alto, CA,
second edition, 1996.
[13] J. D. Hiser and J. W. Davidson. Embarc: an efficient memory
bank assignment algorithm for retargetable compilers. In
Proceedings of the 2004 ACM SIGPLAN/SIGBED
conference on Languages, compilers, and tools for
embedded systems, pages 182–191. ACM Press, 2004.
[14] J. Janzen. Calculating Memory System Power for DDR
SDRAM. In DesignLine Journal, volume 10(2). Micron
Technology Inc., 2001.
http://www.micron.com/publications/designline.html.
[15] Landware. Downloadable software.
http://www.landware.com/pocketquicken/.
[16] M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan,
I. Kadayif, and A. Parikh. Dynamic Management of
Scratch-Pad Memory Space. In Design Automation
Conference, pages 690–695, 2001.
[17] P. R. Panda, N. D. Dutt, and A. Nicolau. On-Chip vs.
Off-Chip Memory: The Data Partitioning Problem in
Embedded Processor-Based Systems. ACM Transactions on
Design Automation of Electronic Systems, 5(3), July 2000.
[18] Phatware. Downloadable software.
http://www.phatware.com/phatnotes/.
[19] Compilation Challenges for Network Processors. Industrial
Panel, ACM Conference on Languages, Compilers and Tools
for Embedded Systems (LCTES), June 2003. Slides at
http://www.cs.purdue.edu/s3/LCTES03/.
[20] J. Sjodin, B. Froderberg, and T. Lindgren. Allocation of
Global Data Objects in On-Chip RAM. Compiler and
Architecture Support for Embedded Computing Systems,
December 1998.
[21] J. Sjodin and C. V. Platen. Storage Allocation for Embedded
Processors. Compiler and Architecture Support for
Embedded Computing Systems, November 2001.
[22] Softmaker. Downloadable software.
http://www.softmaker.de.
[23] S. Steinke, L. Wehmeyer, B. Lee, and P. Marwedel.
Assigning program and data objects to scratchpad for energy
reduction. In Proceedings of the conference on Design,
automation and test in Europe, page 409. IEEE Computer
Society, 2002.
[24] S. Udayakumaran and R. Barua. Compiler-decided dynamic
memory allocation for scratch-pad based embedded systems.
In Proceedings of the international conference on Compilers,
architectures and synthesis for embedded systems (CASES),
pages 276–286. ACM Press, 2003.
[25] M. Verma, L. Wehmeyer, and P. Marwedel. Cache-aware
scratchpad allocation algorithm. In Proceedings of the
conference on Design, automation and test in Europe, page
21264. IEEE Computer Society, 2004.
[26] L. Wehmeyer, U. Helmig, and P. Marwedel.
Compiler-optimized usage of partitioned memories. In
Proceedings of the 3rd Workshop on Memory Performance
Issues (WMPI2004), 2004.
[27] Xi-art. Downloadable software. http://www.xi-art.com/.
Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on processor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications, so that critical data can be stored there with a guaranteed fast access time. We present a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and arrayed variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications. We also present extensions of our proposed memory assignment strategy to handle context switching between multiple programs, as well as a generalized memory hierarchy. Our experiments on code kernels from typical applications show that our technique results in significant performance improvements.