SLIMalloc: a Safer, Faster, and more Capable
Heap Allocator
July 3, 2020 (February 2021 update)
Pierre Gauthier
TWD Industries AG
pierre@trustleap.com
Abstract
For 50 years, one of the most puzzling issues when writing computer programs has been memory
allocation errors or patterns causing instant or delayed crashes, memory leaks, or poor resource
management compromising security, performance, scalability – and process lifespan. Despite
significant advances, performance, security, and convenience are often seen as incompatible
goals. We introduce SLIMalloc, a rewrite of the 2019 secure SlimGuard heap allocator.
SLIMalloc, as compared to the non-secure GLIBC standard allocator and 2019 Microsoft
Research secure Mimalloc, delivers (1) the most advanced features available, (2) an
unprecedented real-time invalid-pointer detection capability preventing allocator misuse such as
double-free or invalid-free/realloc errors, (3) new troubleshooting features to assist developers
with the exact location of allocation errors, (4) the detection, location and correction of all
memory leaks, (5) tracing all allocation calls in third-party code, (6) a much smaller source-code,
(7) higher performance than GLIBC and Mimalloc, (8) a lower memory usage, (9) and the ability
to monitor resource usage (RAM, CPU, disk or network I/O...) and generate SVG charts during
any program's execution (using SLIMalloc or other allocators). As far as we know, SLIMalloc is
the first scalable allocator able to catch and report invalid pointers in real-time. And speed
matters: memory allocation consumes 7% of all CPU cycles in Google datacenters.
Keywords: Energy-savings, IoT, System on Chip, Datacenter, Software Engineering,
Troubleshooting, Security, Privacy, Operating Systems, Memory Allocation, Malloc.
1. Introduction
Never before has I.T. been deployed so widely – from ubiquitous IoT to datacenters – nor have the
skills involved been so diverse, and operating systems and runtime libraries have never been under
such pressure to deliver optimal results[6] and resilience across such a wide variety of use cases
and exposures to risk[7]. Novel architectures and capabilities are required so that the cost of poor
design does not impact billions of users, even more devices – and unpatchable Systems on Chips
(cf. Apple T2)[23].
[Figure: “Profiling a warehouse-scale computer”[6]]
For example, when a developer has made a mistake, with effects eventually seen in his code or in
third-party code, a crash-dump is not enough, since it may be the long-deferred consequence of
something that (1) could have been detected immediately, (2) reported in the most intelligible
manner available at the time, and (3) automatically blocked or corrected when possible.
Is memory safety relevant? In 2017, 55% of the remote-code execution (RCE) bugs at Microsoft
were due to memory errors[21]. Microsoft describes decades-old, sticky root causes[20]. Given the
scale of the damage, are memory errors being fixed? No, says Microsoft[22].
Reality confirms the vendor figures, as the number of heap-related security breaches explodes[2].
Therefore, a practicable (“plug & play”) solution is required – at least until the hardware
(mainly CPUs) is designed to prevent memory management from being the most important and
persistent root cause of all remotely-exploitable vulnerabilities.
2. The GLIBC standard allocator (1987-), and Microsoft Research Mimalloc (2019-)
The GLIBC allocator[1] (derived from ptmalloc2, itself derived from dlmalloc) has evolved
into a fast and scalable tool, but it still lacks modern security features. Mimalloc[3] seems to be
slower than GLIBC (often but not always: see our KV test), but it offers more security:
(1) The GLIBC allocator's metadata is embedded in the allocated blocks, exposed to accidental
errors and competent opponents. Mimalloc segregates metadata… at a known offset, excluding
accidental errors but leaving its internal structures exposed to opponents.
(2) GLIBC's canary implementation requires instrumentation, so it is slow, and it uses known
values, helping to fix errors but not to stop attacks. Mimalloc supports canaries in debug mode.
(3) GLIBC and Mimalloc lack most SLIMalloc features, half of them inherited from SlimGuard[2]:
randomly segregated metadata, user-chosen guard-page density, allocation location tracing, leak
detection/location/correction, real-time allocation error locating, reporting and blocking,
real-time bad-pointer blocking, and program memory usage rendered in SVG charts.
(4) GLIBC's allocation error reporting capabilities are primitive: mallopt(M_CHECK_ACTION, 1)
did not work as expected (there's no “detailed error message” after the double-free), and the
invalid-free error caused a SEGFAULT instead of the expected abort:
--- test double-free() ----------------------------------------------------
--- test free(NULL) -------------------------------------------------------
--- test free(0xbadbeef) --------------------------------------------------
Segmentation fault
The GLIBC allocator statistics are either not thread-capable and limited to the “main area”
(mallinfo and malloc_stats) or not human-readable (malloc_info) and generated in XML files.
Mimalloc does a bit better but also crashes: it caught the free(0xbadbeef) test only because this
pointer did not use the mimalloc-page alignment, and the realloc test caused a crash without any
mimalloc error message – despite the problem being the very same as for free(0xbadbeef):
--- test double-free() ----------------------------------------------------
mimalloc: error: double free detected of block 0x56d60403880 with size 16
--- test free(NULL) -------------------------------------------------------
--- test free(0xbadbeef) --------------------------------------------------
mimalloc: error: trying to free an invalid (unaligned) pointer: 0xbadbeef
--- test realloc(0xbadbeef) -----------------------------------------------
Segmentation fault
Yet, usability is also lacking:
(5) With GLIBC, if you need to allocate a new heap dedicated to a given task (for isolation,
security, performance, locality, etc.) then you have to write your own. This is a major concern,
as dedicated heaps can greatly enhance performance, reduce memory fragmentation, and ease the
hurdle of identifying problems hidden within a program-wide set of allocations.
In this regard, Mimalloc does much better and offers mi_heap_new and mi_heap_destroy.
(6) GLIBC's detection of memory leaks via mtrace is cumbersome and slow, and requires malloc
API instrumentation and debug symbols, but it produces human-readable output. Again, this is a
development feature – not something fast enough to be usable in production.
Mimalloc leaves it as an exercise to the developer via mi_heap_visit_blocks – but, since it comes
after the fact, doing so will miss many blocks and the location of leaks (name and/or address of
the function(s) and source-code file names having made these allocations).
(7) GLIBC's malloc custom functions malloc_info, malloc_stats, malloc_usable_size, malloc_trim,
mcheck and mtrace are sometimes redundant, not always reliable nor even thread-safe, and some
significantly slow down the allocator – on top of a convoluted API and documentation.
The Mimalloc API is more reliable and provides detailed statistics but... how memory is actually
consumed by programs remains a mystery with “after-the-fact” summaries:
heap stats: peak total freed unit count
-----------------------------------------------------------------
normal 1: 13.4 mb 20.2 mb 20.2 mb 8 b 2.6 m
normal 4: 53.8 mb 81.2 mb 81.2 mb 32 b 2.6 m
normal 6: 80.7 mb 121.8 mb 121.8 mb 48 b 2.6 m
normal 8: 64 b 64 b 64 b 64 b 1
normal 9: 134.6 mb 203.0 mb 203.0 mb 80 b 2.6 m
normal 13: 269.0 mb 405.8 mb 405.8 mb 160 b 2.6 m
normal 17: 213.2 mb 213.2 mb 213.2 mb 320 b 698.9 k
normal 21: 426.7 mb 426.7 mb 426.7 mb 640 b 699.1 k
normal 23: 671.7 mb 687.4 mb 687.4 mb 896 b 804.4 k
normal 27: 61.8 mb 92.8 mb 92.8 mb 1.7 kb 54.3 k
normal 31: 123.1 mb 185.9 mb 185.9 mb 3.5 kb 54.4 k
normal 35: 249.1 mb 374.9 mb 374.9 mb 7.0 kb 54.8 k
normal 39: 489.9 mb 744.3 mb 744.3 mb 14.0 kb 54.4 k
normal 43: 395.7 mb 395.7 mb 395.7 mb 28.1 kb 14.4 k
normal 47: 780.8 mb 780.8 mb 780.8 mb 56.2 kb 14.2 k
normal 63: 5.2 mb 5.2 mb 5.2 mb 899.5 kb 6
normal 67: 10.5 mb 10.5 mb 10.5 mb 1.7 mb 6
normal 68: 2.0 mb 2.0 mb 2.0 mb 2.0 mb 1
Also, as compared to GLIBC, the extra Mimalloc features come at a price.
For example, Mimalloc's trimming strategy is eye-wateringly expensive despite not being as
effective as other allocators', due to the Mimalloc “deferred free” tactic (this particular
point is very visible on the SVG charts later in this document):
--- Microsoft Research malloc stress-test -----------------------------------
GLIBC v2.19
-----------------------------------------------------------------------------
- THREADS:6, SCALE:10%, ITER:1000, LARGE:0, 1-SIZE:0
- total time: 1.936 seconds (hh:mm:ss 00:00:01)
- user CPU time ................ 4.358 sec (0.872 per thread)
- system CPU time .............. 2.515 sec (0.503 per thread)
- VM, current virtual memory ... 442904576 bytes (422.3 MB)
- RSS, current real RAM use .... 2510848 bytes (2.4 MB)
- RSS peak ..................... 9011200 bytes (8.6 MB)
- page reclaims ................ 7290859520 bytes (6.8 GB)
- voluntary context switches ... 8472 (threads waiting, locked)
- involuntary context switches . 6324 (time slice expired)
--- Microsoft Research malloc stress-test -----------------------------------
MIMalloc (secure: 4) v1.63
-----------------------------------------------------------------------------
- THREADS:6, SCALE:10%, ITER:1000, LARGE:0, 1-SIZE:0
- total time: 6.708 seconds (hh:mm:ss 00:00:06)
- user CPU time ................ 10.928 sec (2.186 per thread)
- system CPU time .............. 11.193 sec (2.239 per thread)
- VM, current virtual memory ... 308785152 bytes (294.4 MB)
- RSS, current real RAM use .... 1892352 bytes (1.8 MB)
- RSS peak ..................... 14848000 bytes (14.1 MB)
- page reclaims ................ 15408136192 bytes (14.3 GB)
- voluntary context switches ... 1227420 (threads waiting, locked)
- involuntary context switches . 7003 (time slice expired)
3. SLIMalloc Allocator Features
Despite being faster, SLIMalloc offers all the GLIBC and Mimalloc features (and new ones) by
design, as options that can be enabled or disabled at run time and on a per-heap basis:
heap->opt.abort = false; // warn & abort on double/invalid-free (or continue)
heap->opt.canary = false; // slightly enlarge (small) blocks for canary byte
heap->opt.guardpages = true; // catch buffer overflows (user-defined density)
heap->opt.random = false; // randomized block addresses (over-provisioning)
heap->opt.reclaim = false; // @free() release unused OS PAGES to the system
heap->opt.trace = false; // record functions making/freeing allocations
heap->opt.trim = true; // @free() release all unused memory to the system
heap->opt.zeronfree = false; // useful for short-life confidential data
All blocks, small and large, are picked from areas allocated by mmap. The options above can be
enabled for a portion of your application and disabled later – this is useful to avoid their
memory and performance penalty when they are not needed.
There are no complex APIs involved in any of the tasks associated with these options. And they
only marginally slow down the allocator when enabled (we have measured an average 5-10%
execution time increase with the very demanding Microsoft Research malloc stress-test).
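For instance, an option can be switched on for one phase of a program only. A minimal sketch,
assuming the per-thread default heap is reached through the s_heap pointer shown later in this
paper (the exact way to toggle options mid-run may differ from this illustration):
s_heap->opt.canary = true;   // protect only this phase of the program
parse_untrusted_input();     // hypothetical allocation-heavy function
s_heap->opt.canary = false;  // back to full speed for the rest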
Performance matters since some problems can only be experienced at high concurrencies[19].
Therefore, tools aimed at assisting users during a troubleshooting session should not prevent the
inspected program from reaching the state at which trouble is expected. This is even more true for
core system organs, like the system memory allocator.
This is how SLIMalloc compares to GLIBC and Mimalloc, using the features seen earlier:
(1) The allocator metadata is stored at random addresses and the allocated blocks are stored at
unrelated addresses, avoiding accidental errors and making it much easier to keep competent
opponents at bay.
(2) The canary implementation is very fast and uses a different byte value for each block in order
to discourage (instead of invite) abuse of the protection – even in production.
(3) SLIMalloc has inherited some of its security features from SlimGuard, but a very optimized
implementation made it possible to have all these features available at all times (SlimGuard used
#defines instead, which require recompilation and linkage), and to add new desirable features,
including the kind never seen before in allocators – without impacting performance (SLIMalloc
is much faster than SlimGuard without options, even with all its options enabled).
Custom heaps and per-thread default heaps can free the memory of any other heap and can use
distinct options, making them scale ideally on parallelized applications, even with legacy code
(relying on the standard malloc/free API, and not using the extended SLIMalloc API).
(4) Error reporting, allocation tracing, and memory-leak detection get the most out of the
information available in a process, making the location feature work at all times (better than
libunwind, which misses most frames if the code is optimized with gcc -Os/-O1/-O2/-O3):
Executable           source file name   line number   function name   address
exported symbols     yes                no            partial         yes
debug symbols        yes                yes           yes             yes
nothing (stripped)   partial            no            no              yes
SLIMalloc prevents allocation errors in real-time, such as double-free/realloc or invalid
free/realloc, so that the memory allocator will not cause crashes nor be an available way for
hackers to corrupt allocator metadata.
Dereferencing invalid pointers causes SEGFAULTs, but this does not involve the memory
allocator. And, if you are in doubt before dereferencing a pointer, SLIMalloc's isgoodptr
and/or isfreeableptr tell you whether this can safely be done – without impacting performance.
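A minimal usage sketch (assuming these helpers take a raw pointer and return a boolean; the
exact signatures may differ):
if (isgoodptr(p))         // p belongs to allocated memory (even a freed block)
{
   if (isfreeableptr(p))  // p points into a block that is still in use
      free(p);            // safe: this cannot be a double- or invalid-free
}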
Mimalloc mi_check_owned cannot be used to prevent allocation errors because, as its
documentation states, this is an “expensive function, linear in the pages in the heap”.
SLIMalloc is the first scalable memory allocator able to catch and report invalid
pointers in real-time – including unallocated pointers in valid areas.
Here are properly characterized errors handled by SLIMalloc, which prevents the error
condition from corrupting data before reporting it in human-readable text:
--- test CANARY: malloc(10), memset(16), free() -----------------------
(10 bytes requested, 16 allocated, now writing 16 bytes)
> ERROR: heap[test-1] buffer overflow (canary)
ptr:0x800000000
end:0x80000000f block-size:16
in slim.c:320 get_canary()
caller slim.c:514 heap_free()
caller test.c:100 main()
--- test double-free() ------------------------------------------------
> ERROR: !sfree(h[test-1] 0x14000000000 sz:2048):double-free
in slim.c:354 mark_free()
caller test.c:194 main()
--- test free(0xbadbeef) ----------------------------------------------
> ERROR: !free(h[null] 0xbadbeef):invalid-ptr
in slim.c:346 heap_free()
caller test.c:197 main()
--- test realloc(0xbadbeef) -------------------------------------------
> ERROR: !realloc(h[null] 0xbadbeef):invalid-ptr
in slim.c:541 heap_realloc()
caller test.c:204 main()
--- test free(0x8000000f0) -------------------------------------------
(PTR belongs to heap and class area, but block was never allocated)
> ERROR: !free(h[test-1] 0x8000000f0):unallocated-ptr
in slim.c:357 mark_free()
caller test.c:212 main()
BUFFER OVERFLOW (compilation warning and runtime protection)
One of the most vexing memory errors is the buffer overflow (writing more bytes than allocated).
In the example below, this definition supports a more subtle verification:
char *p = malloc(10); // request 10 bytes (16 bytes allocated due to size-class)
memset(p, 'A', 11); // writing 11 bytes (11 can be dynamically calculated)
Despite the fact that 16 bytes were allocated, the above memset will crash.
Why? How?
The SLIMalloc API tells the GCC compiler how much memory has been requested (by malloc,
calloc, realloc, memalign...) and to which variable it has been assigned (here: char *p = ...).
Note: the protection works only if the assignment to the pointer p is known at compile time
and not modified later (by another allocation assigned to p for example).
GCC uses this information to check that its own implementations of memset, memcpy and strcpy
do not overflow the buffer.
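This hint plausibly relies on GCC's alloc_size function attribute; a minimal sketch of the kind
of declaration involved (an illustrative prototype, not SLIMalloc's actual header):
#include <stddef.h>
// Illustrative declaration: alloc_size(1) tells GCC that the returned
// object is 'size' bytes long, which feeds __builtin_object_size() and
// the _FORTIFY_SOURCE checks used by memset, memcpy and strcpy.
__attribute__((malloc, alloc_size(1)))
void *slim_malloc(size_t size);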
As a bonus, at compile time GCC warns about the buffer overflow:
In file included from /usr/include/string.h:640:0,
from ./slim.cinc:229,
from ./overflow.c:12:
In function 'memset',
inlined from 'main' at ./overflow.c:59:13:
warning: '__builtin___memset_chk'
writing 11 bytes into a region of size 10 overflows the destination
[-Wstringop-overflow=]
return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ ./overflow
-----------------------------------------------------------------------
SLIM / test BUFFER OVERFLOW: malloc(10), memset(11)
-----------------------------------------------------------------------
(10 bytes requested, 16 allocated, now writing 11 bytes)
*** buffer overflow detected ***: ./overflow terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7f5d48c5029f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f5d48ceb87c]
/lib/x86_64-linux-gnu/libc.so.6(+0x10d750)[0x7f5d48cea750]
./overflow(main+0x91)[0x4026a1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5d48bfef45]
./overflow[0x4027cd]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:11 12395702 ./overflow
00609000-0060a000 r--p 00009000 08:11 12395702 ./overflow
0060a000-00612000 rw-p 0000a000 08:11 12395702 ./overflow
7e0400000000-7e0400001000 rw-p 00000000 00:00 0
7e0400001000-7e0400002000 ---p 00000000 00:00 0
7e0400002000-7e0600000000 rw-p 00000000 00:00 0
If you want to avoid the above abort triggered by GCC, then you have to write your own memset,
memcpy and strcpy functions (or just use a loop to set or copy the bytes).
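For instance, a hand-written byte loop is enough (a sketch; the overflow itself of course
remains a bug to fix):
#include <stddef.h>
// A plain loop is not replaced by GCC's fortified __memset_chk,
// so it will not trigger the abort shown above.
static void byte_set(char *dst, int c, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++)
       dst[i] = (char)c;
}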
Exploiting Heaps to hack programs requires bugs. Programming languages predictably allocate
variables behind the scenes in standard libraries (like GLIBC for C) or, even worse, as part of the
language design (C++, Java, C#, PHP, Go, Rust...). Hackers then just have to read the memory
layout and find where to alter memory (block size/address, function pointer, return address) to
trigger control-flow and code-injection attacks[7, 11, 12, 14, 15, 16]:
1. buffer overflows/underflows: a block is written beyond its size, or before its start
2. use-after-free: a block is freed, maliciously modified, and reused
3. double-free: a block is deleted multiple times, corrupting metadata
4. invalid-free/realloc: a never-allocated block is deleted, corrupting metadata
5. uninitialized-reuse/execute: a previously freed block is reused with unknown contents
6. use-after-return: a portion of the stack is reused after a function returned
7. format bugs, integer overflows, signedness bugs, bad casts, variadic arguments, etc.
SLIMalloc contributes to the line of (investigation and) defense with[2, 4, 5, 8, 9, 13, 15]:
1. abort-on-overflow, targeted zero-on-malloc, canaries, guard-pages, segregated metadata
2. delayed block reuse, over-provisioning, and block address randomization
3. double-free real-time detection, blocking and detailed reporting
4. invalid-free/realloc real-time detection, blocking and detailed reporting
5. zero-on-free, targeted zero-on-malloc, and blocked access to unallocated areas
6. refresh of memory areas used by the heaps, destruction/reconstruction of new heaps,
7. ability to discard invalid pointers in real-time – a feature rarely implemented in software
(too slow) but which helps to prevent accidental errors and planned abuses (as previously
seen with the first GLIBC and Mimalloc tests at the beginning of this document).
SLIMalloc reports memory usage per block-size with fragmentation (here due to a tiny number
of blocks highlighting over-provisioning), without malloc/free overhead – as anything below the
last block in use has been used in the past (or reserved for use with address-randomization):
SLIMalloc stats (one heap; same situation previously shown by MIMalloc)
--- heap[1] --------------------------------------------------
block-size[used / total] fragmentation of-total (419.5 MB)
8[ 0 / 194446] 0% 0%
16[ 0 / 193992] 0% 1%
24[ 9 / 9] 10% 0%
32[ 0 / 194003] 0% 1%
64[ 0 / 193931] 0% 3% =
128[ 0 / 194914] 0% 6% ===
256[ 0 / 77189] 0% 4% ==
512[ 0 / 78201] 0% 9% ====
800[ 0 / 3945] 0% 1%
1600[ 0 / 4079] 0% 1%
3200[ 0 / 4180] 0% 3% =
6400[ 0 / 4044] 0% 6% ===
12800[ 0 / 3949] 0% 11% =====
25600[ 0 / 1987] 0% 11% =====
51200[ 0 / 2606] 0% 26% =============
819200[ 0 / 1] 1% 0%
(5) Custom heaps dedicated to a given task are trivial to create and use, and can be given a name
(per-thread heaps are named by their thread index) to identify them in error messages or during
tracing (as in the above tests for canary overwrite, double-free, and invalid-free/realloc).
char *p = malloc(size); // using the implicit per-thread default heap
heap_stats(s_heap, 0,0,0, true); // s_heap: explicit per-thread default heap
heap_t heap = { .options = GUARDPAGES | ABORT, .name = "custom-1" },
*h = &heap; // custom heap, can free() blocks of other heaps (and vice-versa)
p = heap_malloc(h, size); // allocate block
heap_stats(h, 0,0,0, 3); // get heap statistics (size-class, bar-chart, total)
Given its crucial value for application development, it is beyond understanding that a venerable
allocator like GLIBC (which is 33 years old) does not offer the ability to create custom heaps.
(6) SLIMalloc can easily detect, locate and fix memory leaks – even in system libraries:
--- Microsoft Research malloc stress-test ----------------------------------------
SLIMalloc heap[default].opt(40): guardpages:10 trace trim
----------------------------------------------------------------------------------
- THREADS:6, SCALE:10%, ITER:1000, LARGE:0, 1-SIZE:0
> 4 GLIBC memory leak(s) detected:
1.1 KB in 4 small-block(s)
calloc(0x7c8800000240, 288) 0x7f0bdf9e7ee5 ld-linux-x86-64.so _dl_allocate_tls()
calloc(0x7c8800000360, 288) 0x7f0bdf9e7ee5 ld-linux-x86-64.so _dl_allocate_tls()
calloc(0x7c8800000480, 288) 0x7f0bdf9e7ee5 ld-linux-x86-64.so _dl_allocate_tls()
calloc(0x7c88000005a0, 288) 0x7f0bdf9e7ee5 ld-linux-x86-64.so _dl_allocate_tls()
> leaked blocks are freed now
- total time: 1.383 seconds (hh:mm:ss 00:00:01)
- user CPU time ................ 4.720 sec (0.944 per thread)
- system CPU time .............. 0.144 sec (0.029 per thread)
- VM, current virtual memory ... 17226698752 bytes (16.0 GB)
- RSS, current real RAM use .... 1908736 bytes (1.8 MB)
- RSS peak ..................... 9793536 bytes (9.3 MB)
- page reclaims ................ 33251328 bytes (31.7 MB)
- voluntary context switches ... 5627 (threads waiting, locked)
- involuntary context switches . 5302 (time slice expired)
In this particular example, if you launch new threads after the GLIBC “leaks” were freed by
SLIMalloc freeleaks, then pthread_create will crash because the memory cached by GLIBC to
skip malloc calls (in an attempt to scale?) is missing.
Third-party code “leaks” might be designed as an optimization for future use and if you
decide to ditch it then you must know what you are doing (freeing the GLIBC blocks
above is fine if you no longer create threads).
Many other GLIBC functions keep allocated memory instead of just using the stack or
dedicated OS-pages allocated by mmap when persistence is required. This bad design
generating deferred trouble should be banned, especially from system core libraries.
Due to the way blocks are allocated, these leaks can amount to a megabyte or more(!), as with
GLIBC's setlocale(LC_ALL, ""), despite a few allocations amounting to very little memory:
--- allocations performed by GLIBC setlocale() -------------------------
000000002801818 malloc (0x400200000000, 5) 0x7f7fc5c346c5 libc.so.6
000000002906963 free (0x400200000000, 8) 0x7f7fc5c2ec8f libc.so.6
000000003116942 malloc (0x401e00000000, 120) 0x7f7fc5c2eac6 libc.so.6
000000004203406 malloc (0x400400000000, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000005312436 malloc (0x407200000000, 776) 0x7f7fc5c2ddf0 libc.so.6
000000005463572 malloc (0x401c00000000, 112) 0x7f7fc5c2ddf0 libc.so.6
000000006574515 malloc (0x407c00000000, 952) 0x7f7fc5c2ddf0 libc.so.6
000000007671427 malloc (0x403600000000, 216) 0x7f7fc5c2ddf0 libc.so.6
000000008779512 malloc (0x405600000000, 432) 0x7f7fc5c2ddf0 libc.so.6
000000008923012 malloc (0x401a00000000, 104) 0x7f7fc5c2ddf0 libc.so.6
000000009056633 malloc (0x401600000000, 88) 0x7f7fc5c2ddf0 libc.so.6
000000009150742 malloc (0x401e00000078, 120) 0x7f7fc5c2ddf0 libc.so.6
000000010327739 malloc (0x402a00000000, 168) 0x7f7fc5c2ddf0 libc.so.6
000000010422612 malloc (0x401a00000068, 104) 0x7f7fc5c2ddf0 libc.so.6
000000010556942 malloc (0x401400000000, 80) 0x7f7fc5c2ddf0 libc.so.6
000000011682154 malloc (0x403000000000, 192) 0x7f7fc5c2ddf0 libc.so.6
000000011778948 malloc (0x400400000010, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000011908027 malloc (0x400400000020, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012023430 malloc (0x400400000030, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012138036 malloc (0x400400000040, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012251403 malloc (0x400400000050, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012362448 malloc (0x400400000060, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012474906 malloc (0x400400000070, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012587300 malloc (0x400400000080, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012698372 malloc (0x400400000090, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012810936 malloc (0x4004000000a0, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000012923157 malloc (0x4004000000b0, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000013034624 malloc (0x4004000000c0, 12) 0x7f7fc5c88b4a libc.so.6 __strdup()
000000013144145 malloc (0x4004000000d0, 12) 0x7f7fc5c2cfec libc.so.6
000000013235266 free ( 0, 0) 0x7f7fc5c2d3d8 libc.so.6 setlocale()
000000013325048 free ( 0, 0) 0x7f7fc5c2d3e2 libc.so.6 setlocale()
The trailing free(NULL) GLIBC calls above might be legitimate (i.e., depending on the setlocale
function parameters), but could also be dead code that you want to clean up. SLIMalloc lets you
get that information effortlessly – along with the time of each allocation call.
This shows how vital it is for long-lived applications to know exactly what the system (and other
third-party libraries) are doing with the memory they use.
Then, on a per-case basis, you might decide to rewrite GLIBC functions (for example, to switch
from fopen to open) to avoid leaks potentially leading to (1) heap exploitation, (2) memory
fragmentation, and (3) preventing your applications from releasing memory back to the system
(something that will end badly if your application is a long-lasting server process).
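A minimal sketch of how this leak check can be driven from code, using the heap_getleaks and
heap_freeleaks functions listed in the next section (their exact signatures are assumed here):
run_third_party_code();  // e.g. code calling GLIBC setlocale(LC_ALL, "")
heap_getleaks(s_heap);   // report the blocks that were never freed
heap_freeleaks(s_heap);  // free them (only if you know what you are doing)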
(7) Let's now review the extended features implemented as functions:
size2used       how much is allocated for a given size (useful to choose the optimal size)
ptr2heap        a pointer to the heap that contains the pointer, or NULL if no such heap exists
heap_ptr2block  the pointer's block address that can be freed, if any
heap_ptr2size   the block size of the specified pointer, if it is valid
heap_goodptr    valid pointer (it belongs to allocated memory, including freed blocks)
heap_freeable   pointer can be freed (belongs to still-in-use allocated memory)
Example showing some of the above functions in action, first after malloc and then after free:
--- SMALL(1.9 KB) malloc(2047), memset(), free() -------------------
(2047 bytes requested, 2048 bytes allocated)
ptr:0x14000000000, freeable-block-addr:0x14000000000,
block-size:2048 bytes, freeable-ptr:yes good-ptr:yes
now, after free():
ptr:0x14000000000, freeable-block-addr:0x14000000000,
block-size:2048 bytes, freeable-ptr:no good-ptr:yes
heap_getleaks   list the specified heap's allocated blocks that have not yet been freed
heap_freeleaks  free the specified heap's allocated blocks that have not yet been freed
heap_gettrace   list the heap's allocated blocks recorded since opt.trace was set
heap_deltrace   free the list of the heap's allocated blocks recorded since opt.trace was set
free_class      release memory to the system; any new class allocation will reallocate an area
heap_trim       return the specified heap's allocated memory to the system... but keep it available!
heap_reset      free the specified heap's allocated memory (new allocations will reallocate it)
heap_stats      get per-heap statistics with the breakdown per size, still used, or used in the past
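A hedged usage sketch of the trimming functions above, reusing the custom heap h from the
earlier example (argument lists are assumed; the real prototypes may differ):
void *tmp = heap_malloc(h, 64 * 1024); // large scratch buffer on heap 'h'
// ... use tmp ...
heap_free(h, tmp);  // assumed per-heap free; plain free(tmp) also works
heap_trim(h);       // give unused pages back to the OS, keep the heap
// ... later, once the whole task is done:
heap_reset(h);      // drop all of this heap's memory in one call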
Below, we have enabled the SLIMalloc opt.trace flag to trace allocations done by our code or by
third parties (GLIBC, system libraries, and third-party libraries, whether closed-source or
open-source), even if this takes place before or after your program's main:
--- allocations performed by GLIBC popen() ---------------------------
000000014677045 malloc (0x7c8000000100, 256) at 0x7f76616b2c48 libc.so.6 popen()
000000014920715 free (0x7c8000000100, 256) at 0x7f76616b0a25 libc.so.6 fclose()
Since these flags can enable SLIMalloc options within the code at any given time during the life
of a program, they can target a specific portion of the code (and even a single function call),
limiting the need to deal with gigabytes of trace data to find the information you are looking for.
After SLIMalloc has traced memory allocations, the list can be searched by heap, time,
block address, block size, and malloc API name/error/caller (file+function).
This saves a lot of time usually wasted on conjecture and on third-party tools (like VALGRIND)
that are not as flexible, practical, or fast as just a couple of if / then / else tests in the code of your
application – even in production (the place of all unknowns and mysteries).
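A minimal sketch of such targeted tracing (field and function names follow the examples in this
paper; the exact dump and search API is assumed):
s_heap->opt.trace = true;  // start recording allocations from here
setlocale(LC_ALL, "");     // the third-party call under investigation
s_heap->opt.trace = false; // stop recording
heap_gettrace(s_heap);     // list the recorded allocation calls
heap_deltrace(s_heap);     // free the recorded list when done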
4. SLIMalloc's Architecture and Implementation
How can we do more (features, performance, scalability) with less (code, CPU and RAM)?
SLIMalloc's architecture is based on SlimGuard (written by US and UK academics). Little was
changed except the ability to use per-thread and custom heaps, and to efficiently spot invalid
pointers (the rest consists of either drastic optimizations or new features):
(A) each AREA is divided into n blocks of a given size-class, covering the whole allocator range.
(B) subdivisions make it easy to calculate a block-size from a class-size index and vice-versa.
(C) an AREA contains same-sized blocks, which can also feature canary values and guardpages.
(D) the used section is below the pointer that is incremented by block-size+canary to enlarge it.
(E) the unused section, below the allocation pointer, is consumed by new memory allocations.
(F) the data area limit pointer may go backwards if the memory below it is freed.
(G) the freelist uses free blocks for small blocks and is segregated for large blocks.
(H) freeing blocks can be done without a loop, by lookups in the segregated metadata.
(I) a per size-class bitmap gives direct access to the free/used status of each block.
(J) guardpages protect against block over/underflows, and their density is user-defined.
The advantages given by the SlimGuard design, as upgraded by SLIMalloc, are:
1. mostly direct access requiring little computation,
2. overall compactness, so little memory is wasted.
The downsides are:
1. some lists remain, still requiring loops in the malloc() path,
2. in-area guardpages break block alignment (bad for checks, good for security),
3. not all freelists are segregated, which is bad for security but good to save memory.
SLIMalloc has greatly improved the advantages and aims to reduce the downsides as much as
possible (downside #1 is gone, #2 is almost gone, and #3 is on its way to being fixed).
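As an illustration of point (B) above, a generic size-class mapping of this kind can be sketched
as follows (power-of-two classes for simplicity; SLIMalloc's actual class table is finer-grained,
as the heap_stats output earlier shows: 8, 16, 24, 32, 64, ... bytes):
#include <stddef.h>
static size_t class2size(unsigned idx)  // class index -> block size
{
   return (size_t)8 << idx;             // 8, 16, 32, 64, ... bytes
}
static unsigned size2class(size_t size) // block size -> class index
{
   unsigned idx = 0;
   while (((size_t)8 << idx) < size)
       idx++;
   return idx;
}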
5. Code-Size and Features: GLIBC v2.19 / Microsoft Research Mimalloc v1.63 / SLIMalloc
In this paper we only compare SLIMalloc to GLIBC and to Microsoft Research's Mimalloc
(2019), described by its authors as the epitome of all allocators:
“We tested mimalloc against many other leading allocators over a wide range of
benchmarks and mimalloc consistently outperforms all others”.
Since Mimalloc offers a large range of what has been done before in terms of features, portability
and optimizations, we did not test older allocators – except one, the standard GLIBC allocator.
Why? Because with its non-secure design GLIBC has a major advantage in performance and
memory savings. It is therefore quite a feat for a secure allocator to be noticeably faster and
consume less memory than GLIBC – especially if it also offers many more features.
This summary of all the features is contrasted with the source code required to implement them.
For SLIMalloc, this breakdown includes features that Mimalloc and GLIBC miss or implement
in external tools – despite SLIMalloc's code being significantly smaller (hence faster, more
reliable, and easier to understand and therefore to audit and maintain).
Source code

Allocator    Language   blank-lines   comment-lines         code-lines
GLIBC        C          2005          3333 ( 48% of code)   6957
Mimalloc     C          1709          2764 ( 33% of code)   8336
SLIMalloc    C           294          3150 (133% of code)   2376
Features (security and troubleshooting tools)
Spot invalid pointers
  GLIBC:     no (crash)
  Mimalloc:  no (too slow, unused; crash)
  SLIMalloc: yes (real-time, locate in src, warn, abort or continue)
Block allocation errors
  GLIBC:     no (crash)
  Mimalloc:  yes (very limited; crash)
  SLIMalloc: yes (locate in src, abort or continue)
Buffer overflow detection
  GLIBC:     no (ignore)
  Mimalloc:  yes (abort)
  SLIMalloc: yes (abort)
Double/invalid-free/realloc
  GLIBC:     yes (abort/warn, continue with corruption; crash)
  Mimalloc:  yes (very limited warning and blocking, mostly corruption; crash)
  SLIMalloc: yes (locate in src-code, warn, abort or continue, unharmed)
Canary
  GLIBC:     yes (constant, slow)
  Mimalloc:  yes (slow, “debug mode”)
  SLIMalloc: yes (fast, encoded, locate in src)
Guard-pages
  GLIBC:     no
  Mimalloc:  yes (at 4 MB mimalloc-page)
  SLIMalloc: yes (user-defined density)
Segregated metadata
  GLIBC:     no
  Mimalloc:  yes (at known offset)
  SLIMalloc: yes (random by-design)
Address randomization
  GLIBC:     no
  Mimalloc:  yes (at free())
  SLIMalloc: yes (at malloc())
Zero-memory on free
  GLIBC:     no
  Mimalloc:  no
  SLIMalloc: yes
Zero-memory on malloc
  GLIBC:     no
  Mimalloc:  no
  SLIMalloc: partial (targeted)
Delayed memory reuse
  GLIBC:     no
  Mimalloc:  yes (via delayed free())
  SLIMalloc: yes (picking random blocks)
Over-provisioning
  GLIBC:     no
  Mimalloc:  no
  SLIMalloc: yes (by-design)
Detect memory leaks
  GLIBC:     yes (slow, clunky, requires debug symbols)
  Mimalloc:  no
  SLIMalloc: yes (fast, locate in src: address, visible/debug symbol names, line numbers; fix)
Benchmark and/or monitoring charts
  GLIBC:     yes (GNU memusage instruments the malloc API, slowing it down, and
             generates PNG charts with external tools using libgd)
  Mimalloc:  no
  SLIMalloc: yes (without instrumentation, for any executable, even using different memory
             allocators, in one pass with several series to compare different implementations;
             generates SVG charts without dependencies)
6. Disclaimer
Based in the greater Zurich area, TWD Industries AG was founded in 1998. SLIMalloc upgraded
the G-WAN Web application server (released in 2009 for Windows and Linux), which fuels Global-WAN.
During this research, several bugs, some of them leading to memory allocator crashes, were
found in SlimGuard and Mimalloc, and responsibly reported to their respective authors.
7. Performance: GNU GLIBC / Microsoft Research Mimalloc / TWD SLIMalloc
HW: 6-Core MacPro (1x Intel Xeon CPU W3680 @ 3.33GHz), 8 GB RAM DDR3 1333 MHz
OS : Ubuntu 14.04.2 LTS, GLIBC v2.19 (v2.26-2.32 builds all fail with: “too old: GNU ld”)
---------------------------------------------------------------------------------------------------------------------
(1) Microsoft Research stress-test (6 threads, 450-5k scale, small/large blocks, 1-50 iterations)
---------------------------------------------------------------------------------------------------------------------
Note: Microsoft Research warns that its stress test “tries to reflect real-world workloads, but
execution can still depend on random thread scheduling; do not use this test as a benchmark”.
Note: we used the thread-index instead of the tid to get a fixed PRNG seed value.
All the other benchmarks Microsoft has used, as user-mode processes, are also subject to
(1) the randomness of the kernel task-scheduler, on top of (2) the (ever-increasing)
operating-system tasks running in the background, and (3) the overhead of the test itself
and the OS kernel – both of which consume much more time than SLIMalloc's malloc/free.
A test should be able to push things to the limits so we can find where things break and
improve them – and this is what this stress-test does. Don't be shy about using it.
GLIBC: trimming takes place by default and is fast but does not go as far as SLIMalloc, even
with malloc_trim(0), the best option available. v2.26's per-thread cache should be faster.
Mimalloc: as mi_collect did not do much, we have tested Mimalloc's mi_option_page_reset /
mi_option_segment_reset, which do better; mi_option_reset_delay in the [0-10,000] range did nothing;
mi_reserve_huge_os_pages gave an error (our test OS does not set up huge pages at boot time).
SLIMalloc: the enabled-by-default opt.trim option provided the expected results with a 10%
CPU time overhead with this Microsoft Research stress-test.
[Chart: stress-test / Microsoft (6 threads, 5000% scale, 1 iteration) – Memory Usage (0 to 3.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
What the chart above shows:
1. the peak, timeline, and average memory usage for the same program using different allocators
2. the execution speed of the same program when using different memory allocators
3. the trimming behavior's profile (lazy: MIMalloc, good: GLIBC, sharp: SLIMalloc)
4. a vertical initial slope reveals a fast design – but other mechanisms may compromise it
5. how much memory allocators impact the pressure on the system for the same program
6. SLIMalloc has more (smaller) peaks than GLIBC, which does better than MIMalloc
7. SLIMalloc wins on all of the above metrics, despite checking for dangling pointers.
What the chart above does not show:
We are told that advanced features and execution speed are incompatible. And indeed
MIMalloc, the largest code-base investigated here, is both the slowest and the least efficient.
VALGRIND locates memory issues while making your code 20-30 times slower (says its manual
– and threads make things a lot worse) – a service done in real-time, in production, by SLIMalloc
which, as a malloc API replacement, is immensely easier to use from within the application.
Is VALGRIND really the problem, or is it GLIBC that deserved more care in the first
place? Why nobody noticed, despite the gargantuan budgets, is even more concerning.
SLIMalloc does many more things than all other memory allocators: checking for invalid and
unallocated realloc/free pointers in real-time, reporting the exact location of memory allocation
errors in source code or executables, tracing calls in the application and third-party libraries,
finding, locating and fixing all memory leaks, monitoring memory usage and creating SVG
charts, etc. – just because, despite being a newborn, it is better designed and implemented.
Microsoft stress-test without trimming:
[Chart: stress-test / Microsoft (6 threads, 5000% scale, 1 iteration), without trimming – Memory Usage (0 to 4.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
GLIBC does not have any such option, so it keeps the same behavior as before (with trimming).
MIMalloc is a bit faster, and its “deferred-free” tactic consumes even more memory – over
the whole life of the program.
SLIMalloc is as fast as it can be*, not trying to automatically reclaim memory (yet developers
can force heap trimming manually – on a per-heap basis, at the time they wish in their code).
(*) SLIMalloc would be even faster if, like MIMalloc, it enabled or disabled its many features at
compilation time instead of at runtime on a per-heap basis. Yet, the comfort of having all these
features at hand (during development and in production) is certainly worth the few milliseconds
that could be spared. In case of need, two versions could be released (features vs. speed).
In the chart above, the average memory consumption of GLIBC is closer to SLIMalloc's (even if
SLIMalloc is much faster) because SLIMalloc is now using more memory than GLIBC (which
lacks the option to disable trimming).
Microsoft stress-test with large blocks and trimming:
Here, “large blocks” means many more larger blocks than in the regular stress-test flavor. The
exact size depends on the allocator's block-size granularity and block-area allocation strategy.
Again, the ability to visualize the whole execution profile of each process is very rewarding:
GLIBC is doing much better than MIMalloc – despite its lack of the v2.26 per-thread cache!
MIMalloc is again handicapped by its slow (and lazy) trimming option.
SLIMalloc is by far the fastest and uses far less memory (see the dotted line).
[Chart: stress-test / Microsoft (6 threads, 450% scale, 3 iterations) – Memory Usage (0 to 6.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
Microsoft stress-test with large blocks and without trimming:
Doing accurate trimming is difficult because it conflicts with the need of a program to reallocate
memory as soon as possible, and, on top of that, it must be done quickly.
GLIBC v2.19 is at a double disadvantage here because (1) it lacks the v2.26 per-thread cache
and (2) its implementation does efficient trimming by design, so trimming cannot be disabled.
MIMalloc is not impressive – doing only slightly better than GLIBC with trimming.
SLIMalloc is by far the fastest and uses far less memory (see the dotted line), and here, much
faster than when performing trimming. This is the intended behavior.
Microsoft stress-test with many iterations:
Several iterations of the stress-test executed consecutively may reveal problems with the
ability of an allocator to free and reallocate heaps for repeatedly created and destroyed threads.
They also show how much CPU time and RAM is lost in the long term when you use GLIBC
instead of SLIMalloc.
Also, instead of running benchmarks, you might want to monitor an application.
Monitoring comes to mind for system status and critical services (Web and email servers, etc.),
to make sure that you get the whole picture at all times.
This was the purpose of SLIMalloc in the first place, as we created it to protect G-WAN
applications from the system and third-party libraries (updating them on a remote system might
introduce bugs and incompatibilities that you did not experience on development machines).
[Chart: stress-test / Microsoft (6 threads, 450% scale, 3 iterations), without trimming – Memory Usage (0 to 6.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
Below, the stress-test with small blocks and trimming at 5000% scale (50 iterations), and then
the stress-test with large blocks and without trimming at 450% scale (10, then 50 iterations):
This SLIMalloc feature could be used by an application (with a sliding window of RAM, CPU samples, disk or network I/O) to watch the
system, another application, or itself (even if G-WAN's daemon is stopped, the supervisor will keep collecting data samples).
SLIMalloc divides your Cloud servers' (or dedicated servers') recurring costs by a factor of 3 or more (as compared to GLIBC).
[Chart: stress-test / Microsoft (6 threads, 5000% scale, 50 iterations) – Memory Usage (0 to 3.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
[Chart: stress-test 10 rounds, no-trimming / Microsoft (6 threads, 450% scale, large-blocks, 10 iterations) – Memory Usage (0 to 6.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
[Chart: stress-test 50 rounds, no-trimming / Microsoft (6 threads, 450% scale, large-blocks, 50 iterations) – Memory Usage (0 to 6.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
Microsoft stress-test on a busy computer:
Instead of running one single application (the test), the test now runs on a machine transcoding a
long video – a task consuming 80% of all the CPU cores and 32% of the available RAM. For
reference, here are the test charts without and then with video-transcoding:
Here we do trimming to be fair with GLIBC (which is always trimming).
SLIMalloc is far less affected than GLIBC and MIMalloc, proof that design simplicity makes
performance more resilient.
[Chart: stress-test / Microsoft (6 threads, 5000% scale, 1 iteration) – Memory Usage (0 to 3.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
[Chart: stress-test + video-transcoding (6 threads, 5000% scale, 1 iteration) – Memory Usage (0 to 3.0 GB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
---------------------------------------------------------------------------------------------------------------------
(2) Patricia Trie / Key-Value TEST (6 threads, one per CPU Core)
---------------------------------------------------------------------------------------------------------------------
As compared to the synthetic Microsoft Research malloc stress-test used earlier, a Key-Value
Store offers the best of both worlds (pressure on the allocator and a real-life use-case) if used
with many random-length keys (like the paragraphs of a very long book) that are added, sorted,
searched (top to bottom and in reverse order), modified, traversed, and freed – first 5 times, and
in a second test 25 times.
GLIBC is faster than MIMalloc in the first test – despite lacking its v2.26 thread-cache!
Mimalloc is faster than GLIBC in the longer test, and a bit slower in the shorter one.
SLIMalloc is much faster here and requires far less memory than the others in this KV test.
[Chart: KV / Patricia Trie (6 threads, 5 rounds) – Memory Usage (0 to 96.0 MB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
[Chart: KV / Patricia Trie (6 threads, 25 rounds) – Memory Usage (0 to 96.0 MB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
---------------------------------------------------------------------------------------------------------------------
(3) Intel Corp. & NMT University / EBIZZY TEST (6 threads, one per physical CPU Core)
---------------------------------------------------------------------------------------------------------------------
Part of the “Linux Test Project”, it was written by Intel Corp. and Val Henson from NMT University:
“Ebizzy is designed to replicate a common web search application server workload. A lot of
search applications have the basic pattern: (1) get a request to find a certain record, (2) index
into the chunk of memory that contains it, (3) copy it into another chunk, then (4) look it up via
binary search. The interesting parts of this workload are:
* large working set
* data alloc/copy/free cycle
* unpredictable data access patterns
The records per second should be as high as possible, and the system time as low as possible.”
Results and discussion
Intel has designed Ebizzy to report the number of searches performed during a fixed period of
time (like 10 seconds). This shows the execution speed of the program, but of course the program
execution time is fixed; so, in comparison tests, every chart area has the same length.
We have modified Ebizzy to instead reach a goal (40,000 searches) and stop after that point.
As a result, Ebizzy now makes better use of memory-usage charts:
GLIBC is more than 2 seconds slower than SLIMalloc. The volatility (noise) of its horizontal
line demonstrates that it is working a bit too hard (or that the kernel has difficulties satisfying
its requests) to stay at this level.
Mimalloc is much slower than in previous tests and still requires more memory than the other
allocators. This is not encouraging for a test simulating a multi-threaded server workload.
SLIMalloc is faster, has a low memory consumption, and its area's top line is effortless.
[Chart: Ebizzy / Intel Corp (6 threads, one per physical CPU Core) – Memory Usage (0 to 384.0 MB) vs. CPU Time (hh:mm:ss) for MIMalloc, GLIBC, and SLIMalloc]
The mysterious dotted line
Initially, we calculated the average of each area, but this information was not a good comparison
basis because some areas last longer than others.
So, instead, we use a “relative average” where each series' area (the sum of its memory samples)
is divided by the largest area among the series.
This “score” does not differentiate between fast (on the x-axis) and low memory usage (on the y-
axis), but it is a relevant metric to get an overall score number in comparison tests.
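In other words, assuming an “area” is the sum of a series' memory samples, the score can be
computed as sketched below (the sample values are made up for illustration):
#include <stdio.h>
int main(void)
{
   double area[3] = { 9.1, 5.3, 2.2 }; // made-up areas: MIMalloc, GLIBC, SLIMalloc
   double max = area[0];
   for (int i = 1; i < 3; i++)
       if (area[i] > max)
           max = area[i];
   for (int i = 0; i < 3; i++)         // relative average in (0..1], lower is better
       printf("series %d score: %.2f\n", i, area[i] / max);
   return 0;
}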
Guardpages vs CPU and RAM usage
GLIBC does not use guardpages at all so it does not dedicate any CPU or RAM to the task.
MIMalloc uses 1 guardpage per 4 MB “mimalloc-page” (using 1 mprotect() syscall and 1
allocated 4096-byte guardpage every 1,024 OS pages: 4,194,304 / 4096).
SLIMalloc has passed the performance tests above with a guardpage density of 90 (that is, 1
guardpage every 90 OS pages: blocks with a size of at least 4096 * 90 = 360 KB are all
separated by a guardpage). Smaller blocks are protected every n blocks, depending on their block
size: 120 KB blocks have a guardpage every 3 blocks.
Despite offering this much higher (user-tunable) security level, SLIMalloc consumes less RAM
and CPU resources than MIMalloc and GLIBC.
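The density arithmetic above can be sketched as follows (illustrative only, using the paper's
numbers: 4096-byte OS pages and a density of 90):
#include <stddef.h>
// One guard page every 'density' OS pages: blocks of at least
// page * density bytes are all separated by a guard page; smaller
// blocks share one guard page every 'span / block' blocks.
size_t page = 4096, density = 90;
size_t span = page * density;  // 368,640 bytes (360 KB)
size_t block = 120 * 1024;     // a 120 KB block size
size_t every = span / block;   // = 3: one guard page every 3 blocks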
8. Conclusion
Programming has always been about execution time. A memory allocator's scalability reduces the
execution time on multicore systems. Its performance reduces the execution time. Trimming
(giving freed memory back to the OS) reduces the memory pressure on the system and other
processes while using more CPU time, and locality reduces the execution time with short-term
benefits that accumulate over the long term.
These properties not only save money by extracting more performance from the same hardware,
they also enhance system stability and reliability – benefiting the whole ecosystem.
Further, security is an ever-rising concern, and the source of ever-increasing expenses. It is time
to recognize that core system organs bear a major responsibility in everyone's exposure to risk.
Given the ever-increasing monopolistic market power of the OS vendors (the GAFAM), it is
difficult to justify the fact that the “standard” memory allocators are still not secure nowadays.
9. References
[1] “The GNU C library's (glibc's) malloc library” (1987-present)
https://sourceware.org/glibc/wiki/MallocInternals
[2] “SlimGuard: A Secure and Memory Efficient Heap Allocator” (2019)
Beichen Liu, Pierre Olivier, Binoy Ravindran
[3] “Mimalloc: Free List Sharding in Action” (2019)
Daan Leijen, Benjamin Zorn, Leonardo de Moura, Microsoft Research
[4] “Guarder : A Tunable Secure Allocator” (2018)
Sam Silvestro, Hongyu Liu, Tianyi Liu, Zhiqiang Lin, Tongping Liu
[5] “FreeGuard: A Faster Secure Heap Allocator” (2017)
Sam Silvestro, Hongyu Liu, Corey Crosser, Zhiqiang Lin, Tongping Liu
[6] “Profiling a warehouse-scale computer” (2015)
Harvard University, Universidad de Buenos Aires, Google, Yahoo Labs
[7] “Security vulnerabilities of the top ten programming languages: C, Java, C++, Objective-C, C#, PHP, Visual Basic,
Python, Perl, and Ruby” (2015)
Stephen Turner, Journal of Technology Research
[8] “SoK: Eternal War in Memory” (2013)
László Szekeres, Mathias Payer, Tao Wei, Dawn Song
[9] “Watchdog: Hardware for Safe and Secure Manual Memory Management and Full Memory Safety” (2012)
Santosh Nagarakatte, Milo Martin, Stephan A. Zdancewic
[10] “Enhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization” (2012)
Cristiano Giuffrida, Anton Kuijsten, Andrew S. Tanenbaum
[11] “DieHarder: Securing the Heap” (2011)
Gene Novark, Emery D. Berger
[12] “Exploiting Memory Corruption Vulnerabilities in the Java Runtime” (2011)
Joshua J. Drake, Black Hat Abu Dhabi
[13] “Heap Taichi: Exploiting Memory Allocation Granularity in Heap-Spraying Attacks” (2010)
Yu Ding, Tao Wei, TieLei Wang, Zhenkai Liang, Wei Zou
[14] “Improving memory management security for C and C++” (2008)
Yves Younan, Wouter Joosen, Frank Piessens, Hans Van den Eynden
[15] “A Memory Allocation Model For An Embedded Microkernel” (2007)
Dhammika Elkaduwe, Philip Derrin, Kevin Elphinstone
[16] “DieHard: Probabilistic Memory Safety for Unsafe Languages” (2006)
Emery D. Berger, Benjamin G. Zorn
[17] “Shredding Your Garbage: Reducing Data Lifetime Through Secure Deallocation” (2005)
Jim Chow, Ben Pfaff, Tal Garfinkel, Mendel Rosenblum
[18] “Security of memory allocators for C and C++” (2005)
Yves Younan, Wouter Joosen, Frank Piessens, Hans Van den Eynden
[19] “Hoard: A Scalable Memory Allocator for Multithreaded Applications” (2001)
Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, Paul R. Wilson
[20] “Microsoft_Security_Intelligence_Report, Volume 16” (2014)
[21] “Practical Memory Safety with REST” (2018)
Kanad Sinha and Simha Sethumadhavan, Columbia University, New York, NY, USA
[22] “Trends, Challenges, and Strategic Shifts in the Software Vulnerability Mitigation Landscape” (2019)
Microsoft Security Response Center (MSRC)
[23] “Apple's T2 Security Chip Has an Unfixable Flaw – allowing hackers to disable macOS security features like System
Integrity Protection and Secure Boot and install malware” WIRED.com (2020)