A Comparison of Memory Allocators for Real-Time
Applications
Miguel Masmano Ismael Ripoll Alfons Crespo
Real-Time Systems Group
Universidad Politecnica de Valencia
Valencia, Spain
{mmasmano, iripoll,acrespo}@disca.upv.es
ABSTRACT
Real-time applications can require dynamic storage management. However, this feature has been systematically avoided due to the general belief that allocation and deallocation operations perform poorly in both time and space. Moreover, the use of Java technologies in real-time systems requires a detailed analysis of the performance of this feature due to its intensive use. In a previous paper, the authors proposed a new dynamic storage allocator that performs malloc and free operations in constant time (O(1)) with very high efficiency. In this paper, we compare the behaviour of several allocators under "real-time" loads, measuring the temporal cost and the fragmentation incurred by each allocator. In order to compare the temporal cost of the allocators, two parameters have been considered: number of instructions and processor cycles. To measure the fragmentation, we have calculated the maximum memory used by each allocator relative to the maximum amount of memory used by the load. Additionally, we have measured the impact of delayed deallocation, similar to what a periodic garbage collector server would do. The results of this paper show that the TLSF allocator obtains the best results when both aspects, temporal and spatial, are considered.
Categories and Subject Descriptors
D.4 [Operating Systems]: Performance; D.4.2 [Allocation/Deallocation Strategies]: Metrics—complexity measures, performance measures
General Terms
Dynamic storage management
Keywords
Dynamic storage management, Real-time systems, Operating systems
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
JTRES ’06, October 11-13, 2006 Paris, France
Copyright 2006 ACM 1-59593-544-4/06/10 ...$5.00.
1. INTRODUCTION
Although dynamic storage allocation has been extensively
studied, it has not been widely used in real-time systems due
to the commonly accepted idea that, because of the intrin-
sic nature of the problem, it is difficult or even impossible
to design an efficient, time-bounded algorithm. Even the
name, dynamic storage allocation, seems to suggest the idea
of dynamic and unpredictable behaviour.
An application can request and release blocks of different sizes in a sequence that is, a priori, unknown to the allocator. The allocator must keep track of released blocks in order to reuse them to serve new allocation requests; otherwise memory will eventually be exhausted. A key factor in an allocator is the data structure used to record information about free blocks. Although not explicitly stated, it seems to have been accepted that, even using a very efficient and smart data structure, the allocator algorithm in some cases has to perform some sort of linear or logarithmic search to find a suitable free block; otherwise, significant fragmentation¹ may occur.
Regarding the way allocation and deallocation are managed, there are two general approaches to dynamic storage allocation (DSA): (i) explicit allocation and deallocation, where the application has to explicitly call the primitives of the DSA algorithm to allocate memory (e.g., malloc) and to release it (e.g., free); and (ii) implicit memory deallocation (also known as garbage collection), where the DSA is in charge of collecting the blocks of memory that have been previously requested but are not needed anymore. This paper focuses on the analysis of explicit, low-level allocation and deallocation primitives; garbage collection is not addressed in this work. Explicit allocation is a low-level functionality extensively used by Java technologies.
Several dynamic storage allocation strategies have been proposed and analysed under different real or synthetic loads. In [19], a detailed survey of dynamic storage allocation was presented, which has been considered the main reference since then. In the literature, several efficient implementations of dynamic memory allocators exist [1, 3, 4, 8, 9, 10, 13, 18]. There also exist worst-case analyses of these algorithms from the temporal or spatial point of view [16, 17, 7].
Some allocators have been proposed specifically for real-time use. Ogasawara [12] proposed the Half-fit allocator, which was the first to perform both allocation and deallocation in constant time. TLSF [11] is another allocator performing these operations in constant time. In 2002, Puaut [15] presented a performance analysis of a set of general purpose allocators with respect to real-time requirements.

¹ Although the term "wasted memory" better describes the inability to use some parts of the memory, for historical reasons we will use the term "fragmentation" to refer to the same idea.
In this paper, we compare the behaviour of several allocators under "real-time" loads, measuring the temporal cost and the fragmentation incurred by each allocator. In the absence of "real-time" workloads for dynamic memory use, we have generated synthetic loads following a model similar to the Real-Time Java memory model. In order to compare the temporal cost of the allocators, two parameters have been considered: number of instructions and processor cycles. To measure the fragmentation, we have calculated the maximum memory used by each allocator relative to the maximum amount of memory used by the load (live memory).
An allocator must fulfil two requirements in order to be used in real-time systems: 1) it must have a bounded response time, so that schedulability analysis can be performed; 2) it must cause low fragmentation. The results of this paper show that TLSF obtains the best results when both aspects, temporal and spatial, are considered.
The paper is organised as follows: the following section provides a brief description of the allocators used in this paper. Section 3 presents the workload model and the test generation used in the comparison. Section 4 describes the metrics and the experimental framework, and presents and discusses the experimental results. The last section concludes by summarising the results obtained and outlines open issues and directions of future work.
2. DESCRIPTION OF THE ALLOCATORS
This section presents a brief description of the allocators used in the evaluation. Allocators can be classified according to the policy used (first fit, best fit, good fit, etc.) and the mechanism implemented (doubly linked lists, segregated lists, bitmaps, etc.), based on the work of Wilson et al. [19].
To perform the evaluation presented in this paper, we have selected some of the most representative allocators, taking into account considerations such as:
Representatives of well-known policies. First-fit and Best-fit are two of the most representative sequential fit allocators. The First-fit allocator is used in all comparisons; it does not provide good results in terms of time and fragmentation, but it is a reference. Best-fit provides very good results on fragmentation but bad results in time. Both of them are usually implemented with a doubly linked list, whose pointers are embedded inside the header of each free block. First-fit searches the free list and selects the first block whose size is equal to or greater than the requested size, whereas Best-fit goes further and selects the block which best fits the request.
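As a minimal sketch of these two sequential-fit policies (our own illustration, not the implementation evaluated in the paper):

```c
#include <stddef.h>

/* Illustrative free-list node; the links are embedded in the header of
 * each free block, as described above. */
typedef struct free_block {
    size_t size;
    struct free_block *prev, *next;
} free_block_t;

/* First-fit: return the first block large enough for the request. */
static free_block_t *first_fit(free_block_t *head, size_t req) {
    for (free_block_t *b = head; b != NULL; b = b->next)
        if (b->size >= req)
            return b;
    return NULL; /* no suitable block */
}

/* Best-fit: scan the whole list, keeping the smallest block that fits. */
static free_block_t *best_fit(free_block_t *head, size_t req) {
    free_block_t *best = NULL;
    for (free_block_t *b = head; b != NULL; b = b->next)
        if (b->size >= req && (best == NULL || b->size < best->size))
            best = b;
    return best;
}
```

The full scan in best_fit is what makes its time cost grow with the number of free blocks, as the results in Section 4 show.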
Widely used in several environments. Doug Lea's allocator [10] is the most representative hybrid allocator; it is used in Linux systems and several other environments. It is a combination of several mechanisms. This allocator uses a single array of lists, where the first 48 indexes are lists of blocks of an exact size (16 to 64 bytes), called "fast bins". The remaining part of the array contains lists of segregated lists, called "bins". Each of these segregated lists is sorted by block size. A mapping function is used to quickly locate a suitable list. DLmalloc uses a delayed coalescing strategy, that is, the deallocation operation does not coalesce blocks. Instead, a massive coalescing is done when the allocator cannot serve a request.
Labelled as "real-time" allocators. Binary-buddy and Half-fit are good-fit allocators that provide excellent results in response time. However, the fragmentation produced by these allocators is known to be non-negligible.
Buddy systems [9] are a particular case of segregated free lists. With H being the heap size, there are only log2(H) lists, since the heap can only be split in powers of two. This restriction yields efficient splitting and merging operations, but it also causes high memory fragmentation. There exist several variants of this method [13], such as Binary-buddy, Fibonacci-buddy, Weighted-buddy and Double-buddy.
The Binary-buddy [9] allocator is the most representative of the buddy system allocators and has always been considered a real-time allocator. The initial heap size has to be a power of two. If a smaller block is needed, then any available block can only be split into two blocks of the same size, which are called buddies. When both buddies are again free, they are coalesced back into a single block; only buddies are allowed to be coalesced. When a small block is requested and no free block of the requested size is available, a bigger free block is split one or more times until one of a suitable size is obtained.
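The efficient splitting and merging mentioned above usually relies on the fact that a buddy's address can be computed with a single XOR; a sketch under that standard assumption (not code from the paper):

```c
#include <stddef.h>

/* A free block of size 2^k located at byte offset `off` from the heap base
 * has its buddy at the offset obtained by flipping bit k. */
static inline size_t buddy_of(size_t off, unsigned k) {
    return off ^ ((size_t)1 << k);
}

/* If the buddy is also free and of the same order k, the pair coalesces
 * into one block of order k+1 whose offset is the smaller of the two. */
static inline size_t merged_offset(size_t off, unsigned k) {
    return off & ~((size_t)1 << k);
}
```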
Half-fit [12] uses bitmaps to find free blocks rapidly without having to perform an exhaustive search. Half-fit groups free blocks whose sizes lie in the range [2^i, 2^(i+1)) in a list indexed by i. Bitmaps that keep track of empty lists, together with bitmap processor instructions, are used to speed up search operations. When a block of size r is required, the search for a suitable free block starts at index i = ⌊log2(r − 1)⌋ + 1 (or 0 if r = 1). Note that list i always holds blocks whose sizes are equal to or larger than the requested size. If this list is empty, then the next non-empty free list is used instead. If the size of the selected free block is larger than the requested one, the block is split into two blocks of sizes r and r′. The remaining block of size r′ is re-inserted into the list indexed by i′ = ⌊log2(r′)⌋.
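A sketch of this index mapping using a count-leading-zeros builtin (our reading of [12]; Half-fit relies on exactly this kind of bit-scan instruction to stay constant-time):

```c
#include <limits.h>

/* Half-fit allocation index: i = floor(log2(r - 1)) + 1, or 0 when r == 1.
 * For x > 0, floor(log2(x)) = (word bits - 1) - clz(x). GCC/Clang builtin
 * assumed; real implementations use the CPU's bit-scan instruction. */
static inline unsigned halffit_index(unsigned long r) {
    if (r <= 1)
        return 0;
    unsigned bits = sizeof(unsigned long) * CHAR_BIT;
    unsigned floor_log2 = (bits - 1) - (unsigned)__builtin_clzl(r - 1);
    return floor_log2 + 1; /* list i only holds blocks of size >= r */
}
```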
New real-time allocator. TLSF (Two-Level Segregated Fit) [11] is a bounded-time, good-fit allocator. TLSF implements a combination of segregated-list and bitmap-fit mechanisms. The use of bitmaps allows implementing fast, bounded-time mapping and searching functions. The TLSF data structure can be represented as a two-dimensional array. The first dimension splits free blocks into size ranges a power of two apart from each other, so that first-level index i refers to free blocks of sizes in the range [2^i, 2^(i+1)). The second dimension splits each first-level range linearly into a number of ranges of equal width. The number of such ranges, 2^L, should not exceed the number of bits of the underlying architecture, so that a one-word bitmap can represent the availability of free blocks in all the ranges. TLSF uses word-size bitmaps and processor bit instructions to find a suitable list in constant time.
The range of sizes of the segregated lists has been chosen so that a mapping function can be used to locate the position of the segregated list given the block size, with no sequential or binary search. Also, ranges have been spread along the whole range of possible sizes in such a way that the relative width of each range (its length compared to the sizes it covers) is similar for small and for large blocks. In other words, there are more lists for smaller blocks than for larger blocks.
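A sketch of the two-level mapping just described, assuming L = 5 (32 second-level ranges) and a GCC-style count-leading-zeros builtin; TLSF implementations differ in details such as the handling of sizes smaller than 2^L, which this sketch omits:

```c
#include <limits.h>
#include <stddef.h>

#define SL_LOG2 5 /* L: 2^5 = 32 subranges, one 32-bit word per bitmap */

/* Two-level TLSF mapping (based on the description in [11]): fl is
 * floor(log2(size)); sl selects one of the 2^L equal-width subranges
 * inside [2^fl, 2^(fl+1)). Assumes size >= 2^SL_LOG2. */
static inline void tlsf_mapping(size_t size, unsigned *fl, unsigned *sl) {
    unsigned bits = sizeof(size_t) * CHAR_BIT;
    *fl = (bits - 1) - (unsigned)__builtin_clzl(size);     /* floor(log2) */
    *sl = (unsigned)((size >> (*fl - SL_LOG2)) - ((size_t)1 << SL_LOG2));
}
```

For example, a request of 460 bytes maps to fl = 8 (range [256, 512)) and sl = 25, with no search of any kind.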
One important aspect is the theoretical temporal cost (complexity) of each allocator. Table 1 summarises these costs, where H is the heap size and M the minimum block size.

Table 1: Algorithm complexity

Allocator              Allocation       Deallocation
First-fit/Best-fit     O(H/(2M))        O(1)
Binary-buddy           O(log2(H/M))     O(log2(H/M))
DLalloc                O(H/M)           O(1)
Half-fit               O(1)             O(1)
TLSF                   O(1)             O(1)
In [11], the worst-case or bad-case² scenario of each allocator has been analysed and detailed. For each allocator, a synthetic load was generated to drive it to its worst-case allocation and deallocation scenarios. Once these scenarios were reached, we measured the number of instructions performed by the allocation or deallocation operations. Table 2 shows a summary of these results.

Table 2: Worst-case (WC) and Bad-case (BC) allocation and deallocation: Processor instructions

Allocator       Malloc   Free
First-fit        81995    126
Best-fit         98385    126
Binary-buddy      1403   1379
DLalloc         721108     83
Half-fit           136    166
TLSF               170    194

These results can change slightly depending on the compiler version and the optimisation options used.

² A bad-case may be the worst-case, but it has not been proved.
3. WORKLOAD MODEL
While there exists a complete and consolidated model for the temporal requirements of real-time applications, the memory parameters that describe task behaviour are far from being well-defined and understood. Considering the temporal requirements, real-time tasks are commonly represented as a set of parameters (computation time, deadline and period). This model is both simple, because the three parameters are characteristics that can be easily understood and measured, and complete, because it is possible to perform schedulability analysis using only those few parameters.
Memory requirements in the task model have been considered in some works, such as [6, 5, 14]. However, only the maximum amount of memory that can be allocated per task is considered. Feizabadi et al. [6] used this simple model and suggested extending it to take into account the allocation patterns of tasks. Feizabadi called this new model WCAR (Worst Case Allocation Requirements).
To the best of our knowledge, there is no memory model for real-time periodic threads except the model proposed in the Real-Time Specification for Java [2]. In this model, a RealtimeThread can define its memory requirements using the MemoryParameters class, which contains the following information:

maxMemoryArea A limit on the amount of memory the thread may allocate in the memory area.

maxImmortal A limit on the amount of memory the thread may allocate in the immortal area (memory that lives until the end of the application).

allocationRate A limit on the rate of allocation in the heap. Units are bytes per second.

The Real-Time Java memory model has been designed around the needs of the garbage collector. That is, the activation period of the garbage collector can be calculated from the parameters of the tasks.
Defining a memory model for real-time periodic tasks remains an open question. A complete memory model should contain parameters that determine the minimum and maximum block size, the holding time, the maximum live memory, etc.
Let τ = {T1, ..., Tn} be a periodic task system. Each task Ti ∈ τ has the following parameters: Ti = (ci, pi, di, gi, hi), where ci is the worst-case execution time, pi is the period, di is the deadline, gi is the maximum amount of memory that task Ti can request per period, and hi is the longest time the task can hold a block after its allocation (holding time).
In this model a task Ti can ask for several blocks, up to a maximum of gi bytes per period. Each request rik can have a different holding time, but shorter than hi units of time.
To serve all memory requests, the system provides an area of free memory (also known as the heap) of H bytes.
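As an illustration only, the task tuple above maps naturally onto a plain C record; the field names are our own, not part of the model's notation:

```c
#include <stddef.h>

/* Illustrative encoding of the task model Ti = (ci, pi, di, gi, hi).
 * Times are in abstract time units. */
typedef struct {
    unsigned long c; /* worst-case execution time */
    unsigned long p; /* period */
    unsigned long d; /* deadline */
    size_t        g; /* max bytes the task may request per period */
    unsigned long h; /* max holding time of any allocated block */
} rt_task_t;
```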
From this model, it is easy to derive the parameters pro-
posed by th e Real-Time Specification for Java.
Independently of the deallocation policy of each allocator,
in order to cope different global deallocation strategies, two
alternatives for freeding blocks have been considered:
Immediate de allocation : As soon as a block is not used,
an explicit free operation is executed by the application
task.
Delayed deallocation : Application tasks do not perform
explicit block deallocation. Instead of deallocating
blocks application task mark blocks as ”not used”. A
periodic deallocation task performs the free operation
on blocks th at are not used by any task.
Whereas the first strategy permits the analysis of explicit
allocation/deallocation applications, the second one can be
used to analyse the effects of implicit d eallocation systems
as those based on garbage collector.
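A minimal sketch of the delayed strategy, assuming a hypothetical tracked_block list shared between the application tasks and the periodic deallocation task (the paper does not publish its test harness):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Tasks mark blocks instead of freeing them; a periodic task sweeps. */
typedef struct tracked_block {
    void *mem;
    bool in_use;                  /* cleared by the owning task */
    struct tracked_block *next;
} tracked_block_t;

/* Body of the periodic deallocation task: free every unused block. */
static void deallocation_server(tracked_block_t **list) {
    tracked_block_t **link = list;
    while (*link) {
        tracked_block_t *b = *link;
        if (!b->in_use) {
            *link = b->next;      /* unlink, then release block and node */
            free(b->mem);
            free(b);
        } else {
            link = &b->next;
        }
    }
}
```

Run with a short period, this behaves like immediate deallocation; run with a long period, it approximates a periodic garbage collector server, which is what Test 4 studies.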
3.1 Workload model
In order to evaluate the considered allocators under the proposed model, a synthetic load generator has been designed. The workload model generates sets of tasks under the following premises:

1. Periods (P) are randomly generated using a uniform distribution between a minimum and maximum admissible period.

2. The maximum amount of memory requested per period (Gmax) is randomly generated using a uniform distribution in the range of maximum and minimum block size defined by the user.

3. The maximum number of requests (Mnr) per period follows a uniform distribution in the range of maximum and minimum block size.

4. The amount of memory per request follows a normal distribution with an average value (Gavg) obtained as Gmax/Mnr and a standard deviation calculated as a constant factor of the block size.

5. The holding time is determined using a uniform distribution between the minimum and maximum admissible holding time.
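For illustration, a generator following these five premises could be sketched as below; all function and type names (uniform, normal, gen_task_t) are our own assumptions, not the authors' generator:

```c
#include <math.h>
#include <stdlib.h>

static double uniform(double lo, double hi) {
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

static double normal(double mean, double sdev) { /* Box-Muller transform */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return mean + sdev * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

typedef struct {
    double period, gmax, gavg, gsdev, hmin, hmax;
    int    mnr;
} gen_task_t;

static gen_task_t gen_task(double pmin, double pmax, double gmin, double gmax,
                           int mnr_min, int mnr_max, double hmin, double hmax,
                           double sdev_factor) {
    gen_task_t t;
    t.period = uniform(pmin, pmax);                        /* premise 1 */
    t.gmax   = uniform(gmin, gmax);                        /* premise 2 */
    t.mnr    = mnr_min + rand() % (mnr_max - mnr_min + 1); /* premise 3 */
    t.gavg   = t.gmax / t.mnr;                             /* premise 4 */
    t.gsdev  = sdev_factor * t.gavg;
    t.hmin   = hmin;                                       /* premise 5 */
    t.hmax   = hmax;
    return t;
}
/* Each individual request size is then drawn as normal(t.gavg, t.gsdev),
 * and its holding time as uniform(t.hmin, t.hmax). */
```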
Three different profiles of tasks have been defined:

1. Sets of tasks allocating a large amount of memory per period, requesting big blocks.

2. Sets of tasks allocating a small amount of memory per period, requesting small blocks.

3. Sets of tasks drawn from both of the previous profiles.
3.2 Test definition
In order to evaluate each allocator, the following tests have been designed.

Test 1: This first test is designed to evaluate the behaviour of allocators when big blocks are requested. The load consists of a set of [3..10] tasks with periods in the range [10..150]. The maximum memory per period gi of each task is in the range [2048..20480], with [2..5] requests per period. The holding time hi of each request has been calculated with a uniform distribution in the range [30..50]. The policy used to free blocks is immediate deallocation. An example of these scenarios is:

Table 3: Task set example of profile 1.

Task   Per.    Gmax    Gavg   Gsdv   Hmax   Hmin
T0      10    10203   10200    300     43     11
T1      19    17728    8471    200     24      7
T2      29    14503    6404    150     21     11
Figure 1 shows an example of the histogram of the malloc sizes generated by this profile.
Test 2: The second load is oriented to evaluate allocator behaviour when only small blocks are requested. The difference with respect to the previous test is the maximum memory per period of each task, which is in the range [8..1024]. One of the possible scenarios generated is shown in Table 4.
Figure 1: Block size histogram example of profile 1 (x axis: block sizes; y axis: number of mallocs)
Table 4: Task set example of profile 2.

Task   Per.   Gmax   Gavg   Gsdv   Hmax   Hmin
T0      23    786    235     28     43     14
T1      50    128     31     24     45      9
T2      60    418    169     32     30     12
Test 3: The third load tries to cover both previous cases. Now the maximum amount of memory requested per period can vary in the range [8..20480].

Test 4: This test has been designed to analyse the effects of the delayed deallocation policy on the memory needed to serve a workload. One of the scenarios generated by Test 3 is taken as the basis for the analysis. Using this set of tasks, we executed it for a range of deallocation task periods, from the shortest task period to 10 times the largest period of the application task set (see Section 4.4).
4. EXPERIMENTAL ANALYSIS OF THE ALLOCATORS
In this section we present the results obtained by the selected allocators under different loads. The First-fit, Best-fit and Binary-buddy allocators have been implemented for this evaluation, based on other implementations. The DLmalloc code has been taken from the author's web site (version 2.7.2). Finally, we could not find any detailed implementation of Half-fit, so it has been implemented from the description in Ogasawara's paper.
Each allocator test consists of the execution of 10 tests of each profile with different random seeds. Each test measures the following metrics:
Execution time: The number of processor cycles needed by each allocation and deallocation operation has been measured. In order to avoid the effects of system interrupts, these operations have been executed with interrupts disabled. Although the test platform has been designed to reduce interference from other processes, there are still some factors (such as cache misses, TLB misses, etc.) that can produce significant variations in the execution. Additionally, each test was executed twice, using the same seed. As the operation size requests occur in the same order, both executions should produce a very similar number of cycles for each operation; differences are attributed to processor interference, which can be minimised by selecting the minimum number of cycles of the two outputs. Final results are presented as average, standard deviation, maximum and minimum number of cycles.
Table 5: Temporal cost of the allocator operations in processor cycles
(a) Malloc
Test1 Test2 Test3
Alloc. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min.
First-fit 315 239 2185 97 248 243 2274 97 291 279 4758 97
Best-fit 511 513 2330 112 347 352 2380 95 1134 1133 6432 107
Binary-buddy 170 472 5517 143 157 262 7433 105 161 263 5960 121
DLmalloc 342 345 5769 114 249 278 5859 79 292 296 6985 83
Half-fit 196 332 1237 129 148 569 1269 108 161 520 1491 128
TLSF 216 274 2017 137 173 257 2056 115 196 230 2473 115
(b) Free
Test1 Test2 Test3
Alloc. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min.
First-fit 163 189 1433 84 148 185 1387 84 176 196 1529 85
Best-fit 153 189 1420 85 120 212 1292 84 143 178 1448 85
Binary-buddy 147 285 1728 120 150 287 855 120 153 284 1412 120
DLmalloc 124 198 644 85 99 354 425 74 128 181 732 75
Half-fit 184 209 1273 104 173 216 1209 104 186 210 1099 110
TLSF 201 220 1641 118 170 223 741 111 191 217 1095 118
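The paper does not state how the per-operation cycle counts are taken; on x86 this kind of measurement is typically done with the time-stamp counter. A minimal sketch, assuming GCC's __rdtsc intrinsic (our own illustration, not the authors' harness):

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

/* Cycle cost of a single malloc, read from the time-stamp counter.
 * Taking the minimum over two identically seeded runs, as described
 * above, filters out interference such as cache and TLB misses. */
static inline uint64_t cycles_for_malloc(size_t size, void **out) {
    uint64_t t0 = __rdtsc();
    *out = malloc(size);
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}
```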
Processor instructions: One way to eliminate the interference is to measure the number of instructions executed by each allocator. To measure it, the test program has been instrumented using the ptrace system call. This system call allows a parent process to control the execution of a child (test) process. The single-step mode allows the parent process to be notified each time a child instruction is executed. As in the previous metric, results are provided as average, standard deviation, maximum and minimum number of instructions executed.
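A minimal version of this ptrace single-step counting pattern might look as follows (our own sketch of the standard Linux idiom; ./alloc_test is a hypothetical test binary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count instructions executed by a traced child via PTRACE_SINGLESTEP. */
int main(void) {
    pid_t pid = fork();
    if (pid == 0) {                   /* child: run the test program */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("./alloc_test", "alloc_test", (char *)NULL); /* hypothetical */
        _exit(1);
    }
    long long count = 0;
    int status;
    waitpid(pid, &status, 0);         /* child stops at exec */
    while (WIFSTOPPED(status)) {
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid(pid, &status, 0);     /* one stop per instruction */
        count++;
    }
    printf("instructions: %lld\n", count);
    return 0;
}
```

In practice the harness would restrict counting to the malloc/free calls themselves, e.g. by toggling tracing around them.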
Fragmentation: To measure the fragmentation incurred by each allocator, we have calculated the factor F, computed as the maximum memory used by the allocator relative to the maximum amount of memory used by the load (live memory).
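The later description of F as "the percentage of additional memory required" (Section 4.3) suggests one plausible formalisation; this is our reading, not a formula given in the paper:

\[
F = \frac{M_{\mathrm{alloc}} - M_{\mathrm{live}}}{M_{\mathrm{live}}} \times 100\,\%
\]

where \(M_{\mathrm{alloc}}\) is the maximum memory used by the allocator and \(M_{\mathrm{live}}\) is the maximum live memory of the load. This reading is consistent with values above 100% appearing in Table 7.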
In order to clarify the fragmentation measure, Figure 2 shows a first, trivial example generated by a single task under the Best-fit allocator. The figure shows the memory used by this allocator to satisfy the requests of this task. The continuous line shows the maximum memory address reached by the allocator, while the non-continuous line plots the live memory (used by the task). The live memory draws a periodic shape whose rise coincides with the period of the task (15 u.t.) and which falls at the end of the holding time of each allocation (60 u.t.). In Figure 2, F is 47.06% and corresponds to point 1 relative to point 2.
The simulation time is measured in number of malloc operations. The number of mallocs analysed has been 300,000 for the processor-cycle and instruction-count measurements and 1,500,000 for the fragmentation measurement.
4.1 Execution time results
Figure 2: Memory usage of Best-Fit with 1 task (x axis: time (usec); y axis: memory (Kbytes); lines: used memory and live memory; points 1 and 2 mark the maxima used to compute F)
Table 5 shows a summary of the processor cycles spent by each allocator for both operations, malloc and free.
In general, all allocators need more cycles when allocating large blocks than small ones. Binary-buddy, Half-fit and TLSF show more stable (lower standard deviation) and more efficient (fewer cycles) behaviour than the other allocators. The Doug Lea allocator presents good execution performance; in this allocator, the cost of the deferred coalescing is paid in some malloc operations (see the maximum number of cycles), so its maximum value is relatively higher than the others'.
This deferred coalescing is the reason why the Doug Lea allocator presents the best results when free operations are executed. All allocators perform the free operation with a reduced number of cycles and similar standard deviations. The free operation implementation of the First-fit and Best-fit allocators is the same, so they obtain very similar results.
4.2 Processor instructions results
Table 6 summarises the number of instructions executed by each allocator in the previously described tests.
Table 6: Temporal cost of the allocator operations in processor instructions
(a) Malloc
Test1 Test2 Test3
Alloc. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min.
First-fit 204 21 478 71 201 17 818 70 203 23 957 70
Best-fit 582 69 798 76 442 130 1006 76 805 179 1539 76
Binary-buddy 169 17 843 157 136 22 1113 95 153 24 1113 95
DLmalloc 279 107 921 64 161 126 933 49 232 152 1277 57
Half-fit 118 1 123 115 116 7 123 76 118 1 123 82
TLSF 147 13 164 104 118 25 164 84 133 22 164 84
(b) Free
Test1 Test2 Test3
Alloc. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min.
First-fit 93 96 128 59 90 92 128 57 92 95 128 57
Best-fit 91 115 126 57 69 148 126 57 79 198 128 57
Binary-buddy 68 70 225 65 68 72 277 65 69 73 228 65
DLmalloc 70 128 77 53 59 177 77 39 67 168 77 39
Half-fit 117 117 167 73 115 116 165 76 117 117 167 76
TLSF 140 140 217 91 107 110 216 87 120 122 217 87
TLSF and Half-fit demonstrate excellent behaviour: both obtain the lowest averages and standard deviations. For the same reasons stated for the number of cycles, DLmalloc presents a reasonably good average but high maximum values.
With respect to the free operations, the results are very similar. As expected, DLmalloc presents the best results.
A more detailed analysis of each allocator's behaviour can be seen in Figure 3. These plots show the evolution of the processor instructions executed by the malloc operation over time (in number of malloc requests). Note that the plots have different y-axis ranges. We can appreciate that the Half-fit and TLSF response times (in instructions) are constant: fewer than 139 and 170 instructions, respectively.
At the beginning of the simulation, the First-fit allocator works reasonably well, but as the number of free blocks grows, the variation of the response time gets larger. Best-fit presents the worst behaviour of the analysed allocators: the longer the application runs, the more free blocks exist and, therefore, the more time it needs to find the block which fits best. In the Binary-buddy plot, it is possible to appreciate the bands caused by the power-of-two splits. Although the average response time of the Douglas Lea allocator is very fast, from time to time it has to coalesce blocks (points above 500 instructions), which is a costly operation.
A deeper analysis of the TLSF plot (Figure 3(f)) allows identifying the different groups of instruction counts, which depend on the sizes of the allocation requests. The two horizontal bands between 90 and 100 instructions correspond to block requests whose size is lower than 128 bytes when the block used to serve the request is not split. Bands between 105 and 115 correspond to requests for blocks greater than 128 bytes without a block split. When TLSF has to split a block in order to serve a request (split and insert the remaining block into the structure), the cost gets higher: bands above 140 instructions. Variations around a band are a consequence of insertion into an empty list or deletion of the last element of a list.
Half-fit also has two groups: points below 85 and points above 110 instructions. The lower bands happen when no split is done, and the upper bands correspond to requests that require splitting the free block.
4.3 Fragmentation results
Table 7 shows the fragmentation obtained by each allocator. As detailed above, factor F has been measured at the end of each scenario. This factor provides information about the percentage of additional memory required to allocate the application requests.
These results show that, overall, TLSF is the allocator that requires the least memory (least fragmentation), closely followed by Best-fit and DLmalloc. As shown, TLSF behaves even better than Best-fit in most cases. This can be explained by the fact that TLSF always rounds up all requests to fit them to an existing list, allowing TLSF to reuse one block for several requests of similar size, independently of their arrival order. On the other hand, Best-fit always splits blocks to fit the requested sizes, making it impossible to reuse some blocks when they have previously been allocated to slightly smaller requests. For example, for a request of 130 bytes, TLSF will always allocate 132 bytes, whereas Best-fit will allocate 130 bytes. If this block is released and 132 bytes are requested later, TLSF will be able to reuse the previous block whereas Best-fit will not.
Besides, the DLalloc and Best-fit allocators show worse results when small blocks are requested (Test 2). Finally, as can be seen, simulating the same load with a different seed produces almost the same results, i.e., a very low standard deviation.
On the other hand, we have Binary-buddy and Half-fit, with a fragmentation of around 80% in the case of Half-fit and 70% for Binary-buddy. As expected, the high fragmentation caused by Binary-buddy is due to the excessive size round-up (round up to a power of two); all wasted memory in Binary-buddy is caused by internal fragmentation. Half-fit's fragmentation was also expected, because of its incomplete memory use. As can be seen, both allocators are quite sensitive to request sizes that are not close to a power of two, causing high fragmentation: internal fragmentation in the case of Binary-buddy and external fragmentation in the case of Half-fit.
Figure 3: Malloc processor instructions (Test 1). Panels: (a) First Fit, (b) Best Fit, (c) Binary Buddy, (d) Douglas Lea, (e) Half Fit, (f) TLSF (x axis: time (number of malloc requests); y axis: number of processor instructions)
Table 7: Fragmentation results: Factor F
Test1 Test2 Test3
Alloc. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min. Avg. Stdv. Max. Min.
First-fit 93.25 3.99 99.58 87.57 83.21 9.04 98.17 70.67 87.63 4.41 94.82 70.76
Best-fit 10.26 1.25 14.23 7.20 21.51 2.73 26.77 17.17 11.76 1.32 14.14 9.71
Binary-buddy 73.56 6.36 85.25 66.61 61.97 1.97 65.06 58.79 77.58 5.39 84.34 64.88
DLmalloc 10.11 1.55 12.90 7.39 17.13 2.07 21.75 14.71 11.79 1.39 13.72 9.90
Half-fit 84.67 3.02 90.07 80.40 71.50 3.44 75.45 65.02 98.14 3.12 104.67 94.21
TLSF 10.49 1.66 11.79 6.51 14.86 2.15 18.56 9.86 11.15 1.10 13.91 7.48
First-fit, which has been studied due to its relevance in the existing theoretical fragmentation analyses, presents the worst fragmentation in all cases. First-fit tends to split large blocks to satisfy small-size requests, preventing their reuse for incoming requests.
Figure 4 shows the evolution of the fragmentation with the number of mallocs requested. The lowest plot corresponds to the requested workload (blocks requested by the tasks). As the previous table showed, TLSF, Best-fit and DLalloc have similar evolutions. The highest fragmentation is obtained by Binary-buddy and Half-fit. The plot shows the first 1,400 of the 1,500,000 mallocs of the simulation.
4.4 Analysis of the impact of the delayed deallocation policy
The goal of Test 4 was to analyse the evolution of the fragmentation incurred by each allocator when tasks do not perform explicit deallocation. Deallocation is carried out by the deallocation task, which frees all blocks that are not in use. The fragmentation of the set of tasks with explicit deallocation (without the deallocation task) is taken as the reference. The fragmentation incurred when the deallocation task is present is then calculated as the maximum memory used by the allocator in the reference task set relative to the maximum amount of memory used by the load in the scenario with the deallocation task. This approach makes it possible to evaluate the impact of delayed deallocation on fragmentation.
For each deallocation task period, we executed the task set 50 times with different seeds, and average values of the memory used and maximum memory required were obtained. These results are shown in Figure 5. The initial value of the fragmentation corresponds to the same scenario with explicit deallocation (the reference task set). The deallocation task period varies from the shortest application task period to 10 times the largest period (in this case [45..1400]).
As shown in Figure 5, TLSF, DLalloc and Best-fit work reasonably well, increasing the fragmentation (measured by factor F) from an initial value of 7% up to 45%. We have included only these allocators because the others present very high fragmentation.
5. CONCLUSIONS
TLSF is a "good-fit" dynamic storage allocator designed to meet real-time requirements. TLSF has been designed as a compromise between constant and fast response time and efficient memory use (low fragmentation). In this paper we have compared it with other allocators under "real-time" workloads.
Figure 4: Memory used evolution during one of the simulations (Test 3) (x axis: time (in malloc requests); y axis: memory (kbytes); lines: TLSF, First-fit, Binary-buddy, Half-fit, Best-fit, DLmalloc, memory requested)
Figure 5: Fragmentation incurred when the deallocation task period changes (Test 4) (x axis: server task period; y axis: fragmentation (F) in %; lines: TLSF, Best-fit, DLmalloc, with markers for the shortest and largest task periods)
TLSF and Half-fit exhibit a stable, bounded response time, which makes them suitable for real-time applications. Unlike Half-fit, TLSF does not achieve its bounded response time at the cost of wasted memory. Besides a bounded response time, a good average response time is also achieved under realistic workloads.
Our analysis also shows that allocators designed to optimise average response time by considering the usage patterns of conventional applications, such as DLalloc or Binary-buddy, are not suitable for real-time systems.
Since real-time applications are long-running, and start-up and shutdown are usually done during the non-critical phases of the system, the analysis has been focused on the stable phase, designing experiments with a very large number of malloc operations. Also, the specific characteristics of real-time applications have been considered, including periodic request patterns, a limited amount of allocated memory per task, bounded holding times, etc.
From the fragmentation point of view, TLSF and DLmalloc present the best results, followed by Best-fit. Binary-buddy and Half-fit generate very significant fragmentation.
In summary, an allocator must fulfil two requirements in order to be used in real-time systems: 1) it must have a bounded response time, so that schedulability analysis can be performed; 2) it must cause low fragmentation. It would also be desirable to have some kind of worst-case fragmentation analysis, similar to the schedulability analysis of tasks in real-time systems. From this point of view, TLSF achieves both requirements. Moreover, besides a bounded response time, its average response time is as good as those of DLmalloc and Binary-buddy. All the code is available at: http://rtportal.upv.es/rtmalloc.
6. REFERENCES
[1] E. D. Berger, B. G. Zorn, and K. S. McKinley.
Composing high-performance memory allocators. In
SIGPLAN Conference on Programming Language
Design and Implementation, pages 114–124, 2001.
[2] G. Bollella and J. Gosling. The real-time specification for Java. IEEE Computer, 33(6):47–54, 2000.
[3] J. Bonwick. The slab allocator: An object-caching
kernel memory allocator. In USENIX Summer, pages
87–98, 1994.
[4] R. P. Brent. Efficient implementation of the first-fit
strategy for dynamic storage allocation. ACM Trans.
Program. Lang. Syst., 11(3):388–403, 1989.
[5] K. Danne and M. Platzner. Memory demanding periodic real-time applications on FPGA computers. In WIP Session of the 17th Euromicro Conference on Real-Time Systems, Palma de Mallorca, Spain, pages 65–68, 2005.
[6] S. Feizabadi, B. Ravindran, and E. D. Jensen. MSA: a memory-aware utility accrual scheduling algorithm. In SAC, pages 857–862, 2005.
[7] M. Garey, R. Graham, and J. Ullman. Worst Case
Analysis of Memory Allocation Algorithms. In
Proceedings of the 4th Annual ACM Symposium on the
Theory of Computing (STOC’72). ACM Press, 1972.
[8] D. Grunwald and B. Zorn. Customalloc: Efficient
synthesized memory allocators. Software Practice and
Experience, 23(8):851–869, 1993.
[9] D. Knuth. The Art of Computer Programming, volume
1: Fundamental Algorithms. Addison-Wesley, Reading,
Massachusetts, USA, 1973.
[10] D. Lea. A Memory Allocator. Unix/Mail, 6/96, 1996.
[11] M. Masmano, I. Ripoll, A. Crespo, and J. Real. TLSF:
A new dynamic memory allocator for real-time
systems. In 16th Euromicro Conference on Real-Time
Systems, pages 79–88, Catania, Italy, July 2004. IEEE.
[12] T. Ogasawara. An algorithm with constant execution
time for dynamic storage allocation. 2nd Int.
Workshop on Real-Time Computing Systems and
Applications, page 21, 1995.
[13] J. Peterson and T. Norman. Buddy Systems.
Communications of the ACM, 20(6):421–431, 1977.
[14] P.-E. Hladik, H. Cambazard, A.-M. Déplanche, and N. Jussien. How to solve allocation problems with constraint programming. In WIP Session of the 17th Euromicro Conference on Real-Time Systems, Palma de Mallorca, Spain, pages 25–28, 2005.
[15] I. Puaut. Real-time performance of dynamic memory allocation algorithms. In 14th Euromicro Conference on Real-Time Systems, page 41, 2002.
[16] J. M. Robson. Bounds for some functions concerning
dynamic storage allocation. J. ACM, 21(3):491–499,
1974.
[17] J. M. Robson. Worst case fragmentation of first fit
and best fit storage allocation strategies. The
Computer Journal, 20(3):242–244, 1977.
[18] R. Sedgewick. Algorithms in C. Third Edition.
Addison-Wesley, Reading, Massachusetts, USA, 1998.
[19] P. R. Wilson, M. S. Johnstone, M. Neely, and
D. Boles. Dynamic Storage Allocation: A Survey and
Critical Review. In H. Baker, editor, Proc. of the Int.
Workshop on Memory Management, Kinross,
Scotland, UK, Lecture Notes in Computer Science.
Springer-Verlag, Berlin, Germany, 1995. Vol:986,
pp:1–116.