FlexOS: Making OS Isolation Flexible
Hugo Lefeuvre
The University of Manchester
Vlad-Andrei Bădoiu
University Politehnica of Bucharest
Ștefan Teodorescu
University Politehnica of Bucharest
Pierre Olivier
The University of Manchester
Tiberiu Mosnoi
University Politehnica of Bucharest
Răzvan Deaconescu
University Politehnica of Bucharest
Felipe Huici
NEC Laboratories Europe GmbH
Costin Raiciu
University Politehnica of Bucharest
Abstract
OS design is traditionally heavily intertwined with protection mechanisms. OSes statically commit to one or a combination of (1) hardware isolation, (2) runtime checking, and (3) software verification early at design time. Changes after deployment require major refactoring; as such, they are rare and costly. In this paper, we argue that this strategy is at odds with recent hardware and software trends: protections break (Meltdown), hardware becomes heterogeneous (Memory Protection Keys, CHERI), and multiple mechanisms can now be used for the same task (software hardening, verification, HW isolation, etc.). In short, the choice of isolation strategy and primitives should be postponed to deployment time.
We present FlexOS, a novel, modular OS design whose compartmentalization and protection profile can seamlessly be tailored towards a specific application or use-case at build time. FlexOS offers a language to describe components’ security needs/behavior, and to automatically derive from it a compartmentalization strategy. We implement an early prototype of FlexOS that can automatically generate a large array of different OSes implementing different security strategies.
CCS Concepts
Software and its engineering → Operating systems; Security and privacy → Operating systems security.
ACM Reference Format:
Hugo Lefeuvre, Vlad-Andrei Bădoiu, Ștefan Teodorescu, Pierre Olivier, Tiberiu Mosnoi, Răzvan Deaconescu, Felipe Huici, and Costin Raiciu. 2021. FlexOS: Making OS Isolation Flexible. In Workshop on Hot Topics in Operating Systems (HotOS ’21), May 31–June 2, 2021, Ann Arbor, MI, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3458336.3465292
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
HotOS ’21, May 31–June 2, 2021, Ann Arbor, MI, USA
©2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8438-4/21/05...$15.00
https://doi.org/10.1145/3458336.3465292
1 Introduction
To create secure and fast software, programmers can use three main approaches offering various trade-offs between human effort, safety guarantees and runtime performance: software verification, runtime checking and hardware isolation. Today’s software statically commits to one or a combination of these approaches. At design time, systems are built around the protection ensuing from these mechanisms; changing them after deployment is rare and costly.
In operating systems, the current landscape (illustrated in Figure 1) broadly consists of micro-kernels [24, 29], which favor hardware protection and verification over performance, monolithic kernels [8], which choose privilege separation and address spaces to isolate apps, but assume all kernel code is trusted, and single-address-space OSes (SASOSes), which attempt to bring isolation within the address space [10, 23, 32], or drop all protection for maximum performance [30, 36, 43].
OS implementations are heavily interlinked with the protection mechanisms they rely upon, making changes to them difficult to implement. For instance, removing user/kernel separation [37] requires a lot of engineering effort, as does breaking down a process into multiple address spaces for isolation purposes [27]. This is the case despite the fact that alternative approaches can be used to provide the same guarantees. First, verification and software hardening (SH) such as SFI can help ensure memory isolation between separate components even if they run in the same address space, thus avoiding the need for hardware isolation [19, 52]. Second, hardware isolation, SH and runtime property checking can be used to check that certain correctness properties hold (e.g. when specified as pre- and post-conditions), thus relieving the user from needing to prove code correctness statically against a specification [47]. Third, both software verification and protection domains can be used to ensure (a form of) control-flow integrity between components, guaranteeing that code execution starts only at well-defined entry points, without needing software runtime checks [53].
[Figure 1: Design space of OS kernels. Axes: Performance (slower to faster) vs. Security (less to more secure). Labels: Micro/Separation kernels, Monolithic kernels, SAS/Dataplane OSes, FlexOS area, trade-off, OS research trends, compatibility with existing apps (good to less compatible).]
The rigid use of safety primitives in modern OSes poses a number of problems. First, when the protection offered by hardware primitives breaks down (e.g. Meltdown), it is difficult to decide how they should be replaced, and at what cost. In cases where multiple mechanisms can be used for the same task (e.g. SH and verification), choosing the primitive that provides the best performance depends on many factors such as the hardware, the workload, etc., and should ideally be postponed to deployment time, not design time. Locking the design to a certain isolation primitive will result in poor performance in many scenarios.
Second, computer hardware is becoming heterogeneous [54] and certain primitives are hardware-dependent (e.g. Intel Memory Protection Keys, MPK [12]). When running the same software on different hardware, how can we minimize the porting effort while preserving safety?
Software modularization should, in principle, provide better robustness and security. Most software, including OSes, integrates modules from different sources, with various levels of trust. Unfortunately, the isolation primitives assumed by the module designers affect the way in which a module can be used, limiting its usefulness. Take, for instance, a formally verified OS subsystem: how does one go about embedding it into a larger project while still maintaining its safety properties? Clearly, if one embeds this component alongside untrusted C code, its verified properties may not hold in practice.
This leads us to the following research problem: How can we enable users to easily and safely switch between different isolation and protection primitives at deployment time, avoiding the lock-in that characterizes the status quo?
Our answer is FlexOS, a novel, modular OS design whose compartmentalization and protection profile can easily and cost-efficiently be tailored towards a specific application or use-case at build time, as opposed to design time as is the case today. To that aim, we extend the Library OS model (LibOS) and augment its capacity to be specialized towards a given use case, historically done for performance [18, 26], towards the security dimension.
With FlexOS, the user can decide at build time which of the fine-grained OS components should be compartmentalized, as well as how to instantiate isolation and protection primitives for each compartment. FlexOS allows developers to easily explore the trade-offs that can be achieved with different isolation technologies and granularities, and to select the best security/performance profile for their use case. Concretely, our research contributions are:
• The design and implementation of FlexOS, a novel framework for effortlessly investigating performance vs. security trade-offs in operating systems.
• The identification of fundamental primitives that are needed in order to provide isolation and protection via a wide range of software and hardware-based mechanisms.
• A preliminary evaluation showing how FlexOS can be used to explore a wide array of security/performance profiles for two apps: iperf and Redis.
2 Design Overview
The goal of FlexOS is to allow developers and OS researchers to easily inspect and select different points in the security vs. performance trade-off space. Exploring such a space is far from trivial, and our aim is also to automate such exploration. Here, various strategies could be followed:
• Given a performance target and a set of predefined compartments (e.g. isolate the application and the network stack from everything else), find the combination of isolation primitives that maximizes security within a certain performance budget.
• Given a set of safety requirements (e.g. no buffer overflows), find a compliant instantiation that yields the best performance or that can run on the largest number of devices (based on the availability of hardware-based mechanisms).
Both objectives above have in common the need to describe
the security attained by each mechanism, and the implications
of running one software component in the same compartment
as another one.
We design FlexOS as an extension of the LibOS model [18], since LibOSes are by nature divided into fine-grained components/libraries. Our approach consists in supporting a set of hardware and software hardening mechanisms, and complementing the API of each such library with FlexOS metadata specifying 1) the expected memory access behavior of other components running in the same compartment as the library for its safety properties to hold; 2) the areas of memory this library can access in normal but also adversarial operation (for example if the library’s execution flow is hijacked); and 3) API-specific information.
Such metadata are created manually for each library by its developer, a one-time and relatively low effort for the library’s author. The purpose of the metadata is to capture the effects upon the overall safety properties of running this library alongside other libraries in the same or in a different compartment. For instance, here is an example describing FlexOS’ formally verified scheduler that we have implemented in Dafny [31]:
[Memory access] Read(Own,Shared); Write(Own,Shared)
[Call] alloc::malloc, alloc::free
[API] thread_add(...); thread_rm(...); yield(...)
[Requires] *(Read,Own), *(Write,Shared), *(Call, thread_add), *...
The description concisely specifies (1) that the library accesses its own memory and a segment shared with other libraries (e.g. its callers), (2) that it only calls functions provided by the memory allocator, (3) which functions it exposes as its API, and (4) that it expects other libraries to be able to read its own memory (but not write to it) and to write to shared memory.
Consider now a component written in an unsafe language, such as C, that is deemed potentially unsafe (perhaps due to variable-length writes to a buffer that cannot be proven safe statically); its description will read:
[Memory access] Read(*); Write(*)
[Call] *
This specification simply outlines that the control/data flow of this component may be hijacked at runtime, resulting in arbitrary code execution/memory access. Since there is no Requires clause, other libraries do not need to be prevented from writing to memory owned by this library.
Given two libraries and their metadata, we now have enough information to automatically decide whether they can run in the same compartment. If neither library has a Requires clause, the answer is yes. If either library has such clauses, each clause can be automatically checked in the presence of the other library. In our example above, for its verified properties to hold, the scheduler expects others to only read, not write, its own memory. The C component, on the other hand, could write to all memory it has access to (in its compartment), thus breaking the expectation: as a result, these two libraries cannot be run in the same compartment.
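The pairwise check described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the FlexOS implementation: the `struct lib_meta` encoding, its boolean fields, and the function names are hypothetical, and only the memory-access clauses of the metadata are modeled.

```c
#include <stdbool.h>

/* Hypothetical, simplified encoding of the per-library metadata.
 * Only accesses to memory owned by *other* libraries are modeled. */
struct lib_meta {
    const char *name;
    bool reads_foreign;        /* Read(*): may read memory owned by others   */
    bool writes_foreign;       /* Write(*): may write memory owned by others */
    bool has_requires;         /* a [Requires] section is present            */
    bool allows_foreign_read;  /* [Requires] contains *(Read,Own)            */
    bool allows_foreign_write; /* [Requires] contains *(Write,Own)           */
};

/* Verified scheduler: Read/Write(Own,Shared); requires *(Read,Own) only. */
static const struct lib_meta verified_sched = {
    "sched", false, false, true, true, false
};

/* Unsafe C component: Read(*); Write(*); no [Requires] section. */
static const struct lib_meta unsafe_c = {
    "unsafe_c", true, true, false, false, false
};

/* Does library a keep its guarantees with b in the same compartment? */
static bool tolerates(const struct lib_meta *a, const struct lib_meta *b)
{
    if (!a->has_requires)
        return true; /* nothing to uphold */
    if (b->reads_foreign && !a->allows_foreign_read)
        return false;
    if (b->writes_foreign && !a->allows_foreign_write)
        return false;
    return true;
}

/* Two libraries may share a compartment only if each tolerates the other. */
bool can_share_compartment(const struct lib_meta *a, const struct lib_meta *b)
{
    return tolerates(a, b) && tolerates(b, a);
}
```

With these definitions, `can_share_compartment(&verified_sched, &unsafe_c)` is false, matching the example in the text, while two unsafe components can be co-located because neither states any requirement.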
Armed with information about pair-wise incompatibility, selecting the smallest number of compartments in a FlexOS image can be reduced to the classical graph coloring problem: each library is a vertex, and an edge connects two incompatible libraries. Graph coloring assigns the smallest number of colors to the vertices of a graph such that no two adjacent vertices have the same color. For each color, we will instantiate a separate compartment that holds the libraries that have been painted with that color. In the worst case where all libraries have conflicts, each library will be instantiated in its own compartment.
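As an illustration, compartment assignment over a conflict matrix can be sketched with a greedy coloring pass. Note the swap: finding the minimum number of colors is NP-hard in general, so this sketch uses the simple greedy heuristic, which always yields a valid assignment but may use more compartments than strictly necessary. `MAX_LIBS` and `assign_compartments` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LIBS 16

/* Greedy coloring: conflict[i][j] is true when libraries i and j cannot
 * share a compartment. On return, color[i] holds the compartment index of
 * library i; the return value is the number of compartments used. */
int assign_compartments(int n, bool conflict[MAX_LIBS][MAX_LIBS],
                        int color[MAX_LIBS])
{
    assert(n >= 0 && n <= MAX_LIBS);
    int ncolors = 0;

    for (int i = 0; i < n; i++) {
        bool used[MAX_LIBS] = { false };

        /* Mark the colors already taken by conflicting, earlier libraries. */
        for (int j = 0; j < i; j++)
            if (conflict[i][j])
                used[color[j]] = true;

        /* Pick the lowest free color; at most i colors can be taken. */
        int c = 0;
        while (used[c])
            c++;

        color[i] = c;
        if (c + 1 > ncolors)
            ncolors = c + 1;
    }
    return ncolors;
}
```

For three libraries where library 1 conflicts with both 0 and 2, the sketch places 0 and 2 together in compartment 0 and library 1 alone in compartment 1.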
When to Enable SH? In certain cases, it is preferable from a performance or deployment point of view to use runtime checks (CFI [2], DFI [3, 9], etc., grouped in the rest of this paper under the SH acronym) instead of multiple compartments, possibly only for a subset of the system/compartments.
To automate the process of selecting SH mechanisms, we first create in FlexOS a machine-readable description of the impact each SH technique has on the safety behavior of a library. This is a transformation that takes as input a library definition and outputs a changed definition describing the safety behavior of the library when the SH technique is enabled.
For control-flow integrity, the transformation is simple: libraries that previously declared Call(*) are transformed into Call(func. list) where the list of functions is populated via a standard control-flow analysis of the library. For data-flow integrity, the transformation is similar: if the data flow graph of a library shows that all its writes are to its own data, Write(*) will be transformed to Write(Own); other SH techniques are handled similarly. To enumerate feasible deployments with SH, we proceed as follows: 1) for each library that writes to all memory, enable DFI / ASAN; 2) for each library that can execute arbitrary code, enable CFI.
The result of this step will be a list of libraries that have two versions: one with SH, and one without. We then iterate through all combinations of such library versions and run the graph coloring algorithm described above. This will result in as many colorings as there are possible combinations of library versions. Consider our example above: the unsafe C library will now have two versions, one with SH and one without SH. When put together with the scheduler in the same image, the SH version will be able to share a compartment with the scheduler, while the original version will require a separate compartment.
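The enumeration of hardened and unhardened variants can be sketched with a bitmask over the libraries. This is an illustrative sketch only: `struct lib`, `harden`, and the other names are hypothetical, the metadata is reduced to three flags, and the sketch counts conflicting pairs per variant instead of running a full graph coloring. It also assumes that the data-flow analysis confirms all writes of a hardened library stay within its own data.

```c
#include <stdbool.h>

/* Hypothetical three-flag reduction of a library's metadata. */
struct lib {
    const char *name;
    bool writes_all;             /* declares Write(*)                     */
    bool calls_all;              /* declares Call(*)                      */
    bool forbids_foreign_writes; /* [Requires] lacks *(Write,Own)         */
};

/* Metadata transformation for the SH build of a library: DFI/ASAN narrows
 * Write(*) to Write(Own), CFI narrows Call(*) to a fixed function list.
 * Assumes the analysis found that all writes target the library's own data. */
static struct lib harden(struct lib l)
{
    l.writes_all = false;
    l.calls_all = false;
    return l;
}

/* One library may write everywhere while the other forbids foreign writes. */
static bool conflict(const struct lib *a, const struct lib *b)
{
    return (a->writes_all && b->forbids_foreign_writes) ||
           (b->writes_all && a->forbids_foreign_writes);
}

/* Count conflicting pairs for one variant: bit i of mask set means
 * library i is built with SH. */
int conflicts_for_variant(const struct lib *libs, int n, unsigned mask)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        struct lib a = ((mask >> i) & 1u) ? harden(libs[i]) : libs[i];
        for (int j = i + 1; j < n; j++) {
            struct lib b = ((mask >> j) & 1u) ? harden(libs[j]) : libs[j];
            if (conflict(&a, &b))
                count++;
        }
    }
    return count;
}

/* Iterate over all 2^n variants (n must be below 32) and count those in
 * which every library could share a single compartment. */
int count_conflict_free_variants(const struct lib *libs, int n)
{
    int free_variants = 0;
    for (unsigned mask = 0; mask < (1u << n); mask++)
        if (conflicts_for_variant(libs, n, mask) == 0)
            free_variants++;
    return free_variants;
}
```

For the scheduler and the unsafe C library, the variant with the C library unhardened has one conflicting pair, while the variant with SH enabled on it has none. A fuller version would restrict the mask to libraries that actually have two versions and feed each variant's conflict graph to the coloring step.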
Handling pre- and post-conditions. The approach we took to handling memory access requirements could also be applied to pre-conditions that certain API functions may request to be true when called. For instance, in the case of the scheduler, one of thread_add’s preconditions is to not add a thread that has already been added. In such cases, FlexOS could be extended to automatically check whether the pre-conditions always hold on call (based on a static analysis of the call graph); if they don’t, runtime checks should be added to ensure they do hold. In our current prototype, we add these checks manually in our scheduler code; in future work we intend to explore ways of deriving this automatically.
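A runtime check of this kind can be sketched as a thin wrapper in front of the verified function. The sketch below is hypothetical: `struct sched`, `thread_add_unchecked`, and `thread_add_checked` are stand-ins invented for this example and do not reflect the actual scheduler interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical scheduler state: a flat list of thread identifiers. */
struct sched {
    int tids[MAX_THREADS];
    size_t count;
};

static bool sched_contains(const struct sched *s, int tid)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->tids[i] == tid)
            return true;
    return false;
}

/* Stand-in for the verified function: correct only if its precondition
 * holds (tid not yet present, and capacity available). */
static void thread_add_unchecked(struct sched *s, int tid)
{
    s->tids[s->count++] = tid;
}

/* Runtime-checked entry point: returns 0 on success, -1 when the
 * precondition is violated, in which case the state is left untouched. */
int thread_add_checked(struct sched *s, int tid)
{
    if (s->count >= MAX_THREADS || sched_contains(s, tid))
        return -1;
    thread_add_unchecked(s, tid);
    return 0;
}
```

Callers whose call sites are statically known to satisfy the precondition could call the unchecked function directly; all others would go through the checked wrapper.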
FlexOS Architecture. FlexOS is based on a modular LibOS (Unikraft, a unikernel framework [30]) and allows fine-grained OS software modules to be placed in compartments (see Figure 2). Note that the granularity of such modules is much finer than that of traditional microkernel/multi-server OSes. Compartments in FlexOS are separated via gates which are made up of the API each compartment exposes. The gates also implement isolation between compartments, and can leverage different isolation mechanisms depending on the available hardware (e.g. protection keys [12, 14], capabilities [55]) or software (e.g. CFI or ASAN).
[Figure 2: FlexOS architecture. Gates isolate an arbitrary number of compartments using a wide set of software and hardware-based security mechanisms. Compartments shown: TCP stack, scheduler, rest of the kernel and app. Gates library: MPK/shared stack, MPK/registers clear, MPK/stack switch, EPT/RPC, function call, ... Compartment hardening library: ASAN, CFI, Safe stack, None, ...]
Gates are instantiated at link time based on the requirements provided by the user or automated tools. Implementations vary from cheap function calls all the way to expensive RPC across VM boundaries. Depending on the chosen gates, the compartments will be running in the same protection domain or in different ones. Further, each compartment can be individually hardened by using SH without code changes.
FlexOS leverages Unikraft’s micro-library granularity (e.g. a scheduler, a memory allocator or a message queue are all micro-libs) but replaces each micro-lib’s standard function-call based API with call gates. In the porting process, developers replace cross-micro-lib function calls with gate placeholders. Once replaced by a particular implementation in the linking stage, gates take care of executing the function call in the foreign compartment, and of copying the return value back. Programmers do not need knowledge of the internal functioning of gates; all gates are exposed by the same, simple API:
rc = listen(sockfd, 5); // before porting
uk_gate_r(rc, listen, sockfd, 5); // after porting
Programmers also annotate data shared with other micro-libs so that it is allocated in shared areas according to the compartmentalization graph.
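To make the gate placeholder idea concrete, the following sketch shows how a single call-site syntax could expand either to a direct call or to a call bracketed by domain-switch hooks. It is not the FlexOS definition of `uk_gate_r`: the `demo_` macros, the hooks, and the callee are all hypothetical, and the hooks merely count switches instead of changing protection domains.

```c
/* Hypothetical domain-switch hooks; a real backend would change the
 * protection domain here. This sketch only counts the transitions. */
static int domain_switches;
static void demo_enter_domain(void) { domain_switches++; }
static void demo_leave_domain(void) { domain_switches++; }

/* Same-compartment case: the gate collapses to a plain function call. */
#define demo_gate_r(ret, fn, ...) \
    do { (ret) = fn(__VA_ARGS__); } while (0)

/* Cross-compartment case: the call is wrapped in enter/leave hooks and
 * the return value is copied back to the caller. */
#define demo_gate_switch_r(ret, fn, ...)  \
    do {                                  \
        demo_enter_domain();              \
        (ret) = fn(__VA_ARGS__);          \
        demo_leave_domain();              \
    } while (0)

/* Hypothetical callee living in another micro-library. */
static int demo_listen(int fd, int backlog)
{
    return (fd >= 0 && backlog > 0) ? 0 : -1;
}
```

A call site written as `demo_gate_r(rc, demo_listen, sockfd, 5);` stays unchanged whichever expansion the build selects, which is the property the porting step relies on.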
FlexOS’s build system extends Unikraft’s to allow specifying how many compartments the resulting image should have, how they should be isolated, and whether SH techniques should be applied to one or multiple of these. Using this information, FlexOS’s builder will generate the required protection domains (one per compartment) and replace the call gate placeholders with the relevant code. For libraries in the same compartment, it will replace the call gates with direct function calls. For inter-compartment crossings, it will use the appropriate gate for switching protection domains: in our example in Figure 2, we have three separate compartments.
Using these basic primitives, a developer will be able to easily experiment with various isolation techniques to find the fastest implementation for a given task. We show, via experiments in §4, how we can create several networking images that do not trust the networking stack, with vastly different performance and security characteristics.
3 Implementation Prototype
To demonstrate its practicality, we implement a prototype of FlexOS on top of Unikraft v0.4 [49] in 1.5K LoC. Gate support is provided by two isolation mechanisms, referred to as isolation backends: Intel MPK and VM (EPT) isolation. SH support is available with CFI, ASAN, etc. We ported a subset of the Unikraft micro-libraries to FlexOS, manually created compartment specifications and identified shared data to showcase the trade-offs FlexOS enables; implementing the automated approach to defining compartments is left as future work. Note that although we focused on virtualized environments for this prototype, nothing fundamentally precludes FlexOS from running as a bare-metal OS.
Intel MPK Backend. Intel MPK is a mechanism providing low-overhead intra-address space memory isolation [1, 5, 46] at the granularity of a page. Our MPK backend places each compartment in its own MPK memory region, including static memory, heap, stack, and TLS. MPK permissions for the thread executing on a core are held in a register named PKRU. Since any compartment can modify its value, the MPK backend has to prevent such unauthorized writes; it can do so via static analysis [50], runtime checks [22] or page-table sealing [36].
In addition, the MPK backend introduces isolation requirements for the scheduler and the Memory Manager (MM): the scheduler holds the value of the PKRU for threads that are not currently running, and so its memory is as critical as the PKRU register itself. The MM’s domain includes the page-table holding the mapping between pages and protection domains. This implies that the scheduler and MM have to be trusted when using MPK. In our implementation we use a provably correct scheduler implemented in Dafny, and we can also use SH to harden schedulers/MMs implemented in C.
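To illustrate what the PKRU value encodes, the sketch below computes a PKRU bit pattern from a set of accessible keys. It relies on the architectural layout of the register (two bits per key: access-disable at bit 2k, write-disable at bit 2k+1, for 16 keys); the function name `pkru_for` and the mask-based interface are assumptions of this example, and the sketch does not execute the unprivileged WRPKRU instruction that would actually load the value.

```c
#include <stdint.h>

#define PKEY_COUNT 16

/* PKRU layout: for protection key k, bit 2k disables all data access
 * (AD) and bit 2k+1 disables writes (WD). */
#define PKRU_AD(key) (UINT32_C(1) << (2 * (key)))
#define PKRU_WD(key) (UINT32_C(1) << (2 * (key) + 1))

/* Build a PKRU value granting read/write on the keys in rw_mask,
 * read-only access on the keys in ro_mask, and no access elsewhere.
 * Bit k of each mask refers to protection key k. */
uint32_t pkru_for(uint16_t rw_mask, uint16_t ro_mask)
{
    uint32_t pkru = 0;

    for (int key = 0; key < PKEY_COUNT; key++) {
        if (rw_mask & (1u << key))
            continue;               /* both bits clear: full access    */
        if (ro_mask & (1u << key))
            pkru |= PKRU_WD(key);   /* reads allowed, writes blocked   */
        else
            pkru |= PKRU_AD(key);   /* no data access at all           */
    }
    return pkru;
}
```

For example, a compartment that owns key 0 and may only read a region tagged with key 1 would run with `pkru_for(0x0001, 0x0002)`, which evaluates to `0x55555558`. A scheduler saving and restoring this value per thread is what makes its memory as sensitive as the register itself.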
Our MPK backend supports two types of gates. In the shared-stack gate, heap and static memory are isolated and only shared data is accessible from all compartments in dedicated heap/static memory segments. Thread stacks are located in a domain shared by all compartments. This gate is similar to ERIM’s [50]. With the switched-stack gate, the heap, stacks, and static memory are all isolated. There is one stack per thread per compartment and the stack is switched at domain boundaries. Parameters are copied to the target domain stack, and shared stack data is placed on a shared heap. This gate is similar to HODOR’s [22].
VM-based Backend. Many works use virtualization to support isolation within a kernel [33, 40, 41, 56]. VM-based isolation provides strong security guarantees and is widely supported, at the cost of higher overhead.
[Figure 3: iperf throughput, various configs (Sha: shared stack, Sw: switched stack). x-axis: iperf payload size (B), from 2^6 to 2^20; y-axis: iperf throughput (Mb/s), log scale. Series: SH (KVM), MPK-Sha. (KVM), MPK-Sw. (KVM), KVM Baseline, VM RPC (Xen), Xen Baseline.]

Component C         SH: all but C          SH: C only
Scheduler           496 Mb/s               2.90 Gb/s
Network stack       631 Mb/s               2.76 Gb/s
LibC                1.47 Gb/s              1.25 Gb/s
Rest of the system  1.08 Gb/s              2.50 Gb/s
Entire system       2.94 Gb/s (baseline)   489 Mb/s
Table 1: iperf throughput with SH on various components.

[Figure 4: Redis throughput for various SH configs and our verified scheduler. x-axis: SET and GET for payloads of 5B, 50B and 500B; y-axis: Redis Mreq/s. Series: No SH, SH global alloc, SH local alloc, Verified Sched.]
Our toolchain generates one VM image per compartment. Images contain the minimum set of micro-libraries necessary to run the VM independently (platform code, memory allocator, scheduler), along with a thin RPC implementation based on inter-VM notifications and a shared area of memory for shared heap/static data. This area is mapped in all compartments (VMs) at an identical address so that pointers to/in shared structures remain valid. Compartments no longer share a single address space, and run on different vCPUs. Hence, each compartment needs its own memory allocator and scheduler, so these have to be trusted. Our VM-based isolation backend is currently based on Xen, with KVM support underway.
SH Support. FlexOS’s SH support is modular: we can apply hardening mechanisms per compartment (not system-wide), allowing for fine-grained protection and performance trade-offs. For example, it is possible to apply SH only to components that interact directly with the outside world, such as the network stack. Our implementation supports KASAN, Stack protector and UBSAN on GCC, and CFI and SafeStack under clang. A key requirement for SH is the ability to have a separate memory allocator per compartment: as many SH techniques instrument malloc, using a single global allocator would result in the entire system paying the cost of the instrumented allocator. FlexOS can be configured to use separate memory allocators per compartment to avoid such overheads when only a subset of compartments are hardened.
4 Preliminary Results
We studied the performance impact brought by a number of FlexOS’ configurations for two apps: an iperf server and Redis. We aim to confirm that FlexOS allows easy exploration of a wide design space of security/performance trade-offs: each configuration is obtained by setting a few options and recompiling the LibOS against the app sources. Both apps were manually ported to the prototype, though most of this process should be easy to automate. Experiments were run on a Xeon Silver 4110 (2.1 GHz), with KVM and Xen.
Safe iperf. In our first test, we created an iperf server where an untrusted network stack is isolated from the rest of the OS image. We test three configs: 1) two compartments with MPK, one for the stack and one for the rest of the OS; 2) separate VMs for the two compartments; 3) a single compartment, with SH applied only to the network stack.
Performance results as measured by an iperf client are shown in Fig. 3. At the server side, we vary the size of the buffer passed to recv. With SH and MPK, for small buffers there is a non-negligible slowdown (2x to 3x). However, these solutions catch up quickly to the baseline, yielding similar performance starting at 1KB buffer size. Xen’s numbers are lower due to Unikraft not being optimized for this hypervisor; still, we observe that the payload needs to be larger (32KB) for the VM backend to catch up to the baseline, due to increased domain switching costs. These results show that the performance impact of various protection mechanisms depends on the workload, so locking into a protection mechanism at design time is suboptimal.
iperf: Fine-Grained SH. FlexOS’ modular design allows us to enable/disable SH at micro-library granularity. We ran iperf with a variable number of FlexOS’ components running with SH: the network stack, the scheduler, the standard C library (LibC), and the rest of the system including iperf itself.
Results are in Table 1. The performance impact strongly depends on the component running with SH: the scheduler brings a 1% overhead while the LibC has a 2.3x slowdown. Interestingly, the slowdown with SH for the network stack is low (6%). SH for the entire system has a 6x slowdown, demonstrating the benefits of FlexOS’ flexibility, useful in scenarios where components have variable levels of trust and variable performance impact when protected with SH.
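The quoted factors follow from the throughput values of Table 1 (baseline 2.94 Gb/s; SH on one component only: scheduler 2.90 Gb/s, network stack 2.76 Gb/s, LibC 1.25 Gb/s; SH on the entire system: 489 Mb/s). The small sketch below recomputes them; the function and macro names are chosen for this example, and the values in the text appear to be rounded.

```c
/* Throughput values from Table 1, in Mb/s. */
#define BASELINE_MBPS  2940.0  /* no SH anywhere            */
#define SH_SCHED_MBPS  2900.0  /* SH on the scheduler only  */
#define SH_NET_MBPS    2760.0  /* SH on the network stack   */
#define SH_LIBC_MBPS   1250.0  /* SH on the LibC only       */
#define SH_ALL_MBPS     489.0  /* SH on the entire system   */

/* Slowdown factor of a configuration relative to the baseline. */
double slowdown(double baseline_mbps, double measured_mbps)
{
    return baseline_mbps / measured_mbps;
}

/* The same ratio expressed as a percentage overhead. */
double overhead_percent(double baseline_mbps, double measured_mbps)
{
    return (slowdown(baseline_mbps, measured_mbps) - 1.0) * 100.0;
}
```

With these inputs the scheduler overhead is between 1% and 2%, the network stack overhead between 6% and 7%, the LibC slowdown about 2.35x, and the whole-system slowdown about 6.0x.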
Redis: Isolation Strategies. We ran Redis in various scenarios. We defined 4 compartmentalization models: {NW stack, rest of the system} (NW only), {NW stack, scheduler, rest of the system} (NW/sched/rest), {NW stack + scheduler, rest of the system} (NW and sched/rest), and a baseline with no isolation. These demonstrate FlexOS’ capacity to seamlessly manage various trust models. For MPK we ran both the shared and switched stack versions.
The results are in Figure 5. The isolation overhead depends on the number of compartments and how they communicate. Isolating only the network stack brings on average a 17% slowdown, while also isolating the scheduler brings a 1.4x (shared stack) and 2.25x (switched stack) slowdown, an increase due to the stack switch overhead. This points to frequent communication between the scheduler and the network stack, making intensive use of wait queues through semaphores. However, putting the network stack and the scheduler in the same compartment does not increase performance: this is due to semaphores being implemented in another compartment (LibC). This calls for further compartmentalization or redesign of the components. Similar to iperf, the isolation overhead drops significantly when the request size increases.
[Figure 5: Redis throughput with MPK isolation. x-axis: No Isol., NW-only, NW/Sched/Rest and NW+Sched/Rest, each isolated configuration with shared (Sh.) and switched (Sw.) stacks; y-axis: GET Mreq/s. Series: 5B, 50B and 500B payload.]
Redis: SH. We ran Redis enabling SH for the network stack with 1) a global allocator for the entire system and 2) a dedicated local allocator for the network stack and another for the rest of the system. The results are in Figure 4. With a global allocator, the slowdown from running the network stack with SH is on average 1.45x. FlexOS’ capacity to easily set up a local allocator for the network stack allows us to reduce that overhead to a 1.24x slowdown. Overall, the results for Redis show that FlexOS can manage a wide range of security/performance requirement scenarios.
Verified Scheduler. We developed a verified cooperative scheduler written in Dafny [31]; the scheduler’s safety is given by pre- and post-conditions that are statically proven to hold by Dafny. We generate C++ code from the scheduler and integrate it in FlexOS by adding glue code. How can we embed this safely alongside untrusted code? To protect the scheduler’s memory from external writes we can either apply SH to the rest of the unikernel or use MPK. To check that pre-conditions hold on call we integrate the checks in the glue code, and disable interrupts. In future work we will generate glue code automatically.
The context switch latency of our verified scheduler is 218.6ns, 3x slower than the C scheduler (76.6ns). This is fairly high, but Fig. 4 shows that the verified scheduler’s overhead over the C one is always below 6% for Redis.
5 Open Questions
Decoupling OSes’ isolation and safety primitives from their fundamental design brings a number of challenges.
How to minimize porting effort? FlexOS requires porting, not only for kernel-internal libraries, but also for external user-space libraries. This porting process usually boils down to identifying shared data and handling indirect cross-component calls, a common effort among isolation frameworks [21, 38]. While it is a one-time, reasonably inexpensive effort, we recognize that it might hinder the adoption of our approach [43]. Further, manual approaches are not fail-proof and can result, for example, in over- or under-sharing data. To address these issues, automated porting techniques, so far mostly studied at the user-space level [6, 35, 48], can be explored.
Another element of the porting process is the writing of
per-library metadata. These metadata are used by the design
space exploration tool to automatically derive a
compartmentalization strategy. The tool is then able to guarantee
that properties hold according to the specified characteristics
of each component. The resulting kernel is guaranteed to be
correct as long as the metadata themselves are correct. But who
verifies the specification/metadata? The process of writing
metadata is error-prone, and methods for (semi-)automatically
generating them should be explored.
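As a rough illustration of how a strategy can be derived from metadata, and of why incorrect metadata undermines the result, consider the following sketch. The metadata fields (`may_write_anywhere`, `needs_write_isolation`), the library names and the greedy grouping are all assumptions made for this example. They are not the FlexOS metadata language or the algorithm of its exploration tool.

```python
def conflicts(meta, a, b):
    """True if one library may write anywhere while the other forbids it."""
    return (meta[a]["may_write_anywhere"] and meta[b]["needs_write_isolation"]) or (
        meta[b]["may_write_anywhere"] and meta[a]["needs_write_isolation"]
    )


def assign_compartments(meta):
    """Greedily place each library in the first compartment without a conflict."""
    compartments = []
    for lib in meta:
        for comp in compartments:
            if not any(conflicts(meta, lib, other) for other in comp):
                comp.append(lib)
                break
        else:
            # No existing compartment is compatible: open a new one.
            compartments.append([lib])
    return compartments


# Hypothetical metadata for four libraries.
METADATA = {
    "sched": {"may_write_anywhere": False, "needs_write_isolation": True},
    "netstack": {"may_write_anywhere": True, "needs_write_isolation": False},
    "libc": {"may_write_anywhere": True, "needs_write_isolation": False},
    "fs": {"may_write_anywhere": False, "needs_write_isolation": False},
}

print(assign_compartments(METADATA))
# prints [['sched', 'fs'], ['netstack', 'libc']]

# If the metadata wrongly claims that netstack never writes outside its own
# memory, the derived strategy silently places it next to the scheduler.
WRONG = dict(METADATA, netstack={"may_write_anywhere": False,
                                 "needs_write_isolation": False})
print(assign_compartments(WRONG))
# prints [['sched', 'netstack', 'fs'], ['libc']]
```

The second result is consistent with the metadata it was given, yet unsafe if the network stack can in fact corrupt arbitrary memory. This is the sense in which the guarantees are only as good as the metadata.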
Isolation alone is not enough. Traditional system call APIs
are designed from the outset as a trust boundary. Not only
are they copy-based and carefully check function arguments,
they are also designed so as to avoid more subtle privilege
escalation vulnerabilities, e.g., confused deputies. For such
APIs, swapping the isolation mechanism (e.g., from standard
system calls to MPK domain switching) is relatively
straightforward. On the other hand, when the API was developed
without a trust model (as is the case with all kernel-internal
APIs, but also userland library APIs), introducing isolation
is a more complex task; isolation alone is not enough, and in
order to provide protection against a wide range of attacks,
APIs have to be carefully revisited [11]. Further, in the case of
FlexOS, we only want to execute such checks when they are
really needed, depending on the instantiated kernel
configuration: if component A is in the same trust domain as
component B, then checks are not necessary, but they are
when component C (in another domain) calls component B.
A possible approach to tackle this problem is the one that
we envision for preconditions: by enriching all microlibraries
with API metadata, the build system would have sufficient
information to automatically generate wrappers that include
or exclude these checks on demand.
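The wrapper-generation idea can be sketched as follows. This is a hypothetical illustration in Python, not the FlexOS build system: `buf_read`, `buf_read_check`, `make_entry_point` and the `DOMAINS` table are invented for the example, and the "API metadata" is reduced to a single argument check.

```python
def buf_read(buf, n):
    """Component B's API: return the first n bytes of buf, with no checks."""
    return buf[:n]


def buf_read_check(buf, n):
    """Hypothetical API metadata for buf_read: n must lie within the buffer."""
    return isinstance(n, int) and 0 <= n <= len(buf)


def make_entry_point(fn, check, caller_domain, callee_domain):
    """Build-time choice: direct call within a domain, checked call across."""
    if caller_domain == callee_domain:
        return fn  # same trust domain: the checks are omitted

    def checked(*args):
        if not check(*args):
            raise ValueError(f"{fn.__name__}: check failed at domain boundary")
        return fn(*args)

    return checked


# Hypothetical configuration: A and B share a domain, C is isolated from them.
DOMAINS = {"A": 0, "B": 0, "C": 1}

call_from_a = make_entry_point(buf_read, buf_read_check, DOMAINS["A"], DOMAINS["B"])
call_from_c = make_entry_point(buf_read, buf_read_check, DOMAINS["C"], DOMAINS["B"])
```

Here `call_from_a` is `buf_read` itself, so calls inside the shared domain pay no checking cost, while `call_from_c` validates its arguments before crossing into B's domain. Changing the `DOMAINS` table changes which entry points carry checks, without touching the component code.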
6 Related Work
Previous work addressed the isolation inefficiencies of
monolithic kernels by reducing the TCB through separation [4, 44]
and micro-kernels [20, 24]. More recently, OSes providing
security through software isolation brought by safe languages
[7, 13, 25, 36, 39] have been proposed. In SASOSes, isolation has
been provided with traditional page tables [10, 23, 32] and
recently through intra-address-space hardware isolation
mechanisms [34, 42, 45, 47]. Formal verification offers
deterministic security guarantees, but has trouble scaling to
modern OSes’ large codebases [28, 29]. Low-overhead runtime
protection mechanisms are commonly found in production kernels
[15, 17]. However, the most security-efficient ones [16] are only
enabled for test runs [51] due to their high performance impact.
FlexOS: Making OS Isolation Flexible HotOS ’21, May 31–June 2, 2021, Ann Arbor, MI, USA
In all, each of these approaches represents a single point
in the OS design space and lacks the flexibility of FlexOS to
automatically configure variable, fine-grained
security/performance profiles. LibrettOS [41] does allow a LibOS
to switch between SASOS and microkernel modes, but remains
limited to a small subset of the security/performance design
space. SOAAP [21] proposes a system to explore software’s
compartmentalization space using static/dynamic analysis;
however, this work targets monolithic user-space code-bases, as
opposed to modular kernel code-bases for FlexOS.
7 Conclusion and Future Work
FlexOS gives developers the ability to mix and match isolation
primitives, be they hardware or software. This allows creating
tailor-made versions of the same app for target workloads, with
good performance and improved security, as our experiments
have shown for two apps.
This paper is only an initial exploration of the potential
benefits of FlexOS. Our future work aims to automate checking
the safety of a proposed configuration, and searching for
configurations with desired properties automatically. This
will in turn enable developers to build robust software by
mixing and matching components with various trust levels.
Acknowledgments
We would like to thank the anonymous reviewers for their
comments and insights. A special thanks goes to Julia Lawall
for her help on Coccinelle. This work was partly funded by
EU H2020 grant agreements 825377 (UNICORE), 871793 (ACCORDION)
and 758815 (CORNET), as well as the UK’s EPSRC
New Investigator Award grant EP/V012134/1.
References
[1] [n.d.]. Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 3A, Section 4.6.2.
[2] Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. 2005. Control-Flow Integrity. In Proceedings of the 12th ACM Conference on Computer and Communications Security (Alexandria, VA, USA) (CCS ’05). Association for Computing Machinery, New York, NY, USA, 340–353. https://doi.org/10.1145/1102120.1102165
[3] P. Akritidis, C. Cadar, C. Raiciu, M. Costa, and M. Castro. 2008. Preventing Memory Error Exploits with WIT. In 2008 IEEE Symposium on Security and Privacy. 263–277. https://doi.org/10.1109/SP.2008.30
[4] J. Alves-Foss, P. Oman, C. Taylor, and S. Harrison. 2006. The MILS architecture for high-assurance embedded systems. Int. J. Embed. Syst. 2 (2006), 239–247.
[5] Steve Bannister. 2019. Memory Tagging Extension: Enhancing memory safety through architecture. https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety Online; accessed October 27, 2020.
[6] Markus Bauer and Christian Rossow. 2021. Cali: Compiler Assisted Library Isolation. In Proceedings of the 16th ACM Asia Conference on Computer and Communications Security (ASIA CCS’21). Association for Computing Machinery.
[7] Kevin Boos, Namitha Liyanage, Ramla Ijaz, and Lin Zhong. 2020. Theseus: an Experiment in Operating System Structure and State Management. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20). USENIX Association, 1–19. https://www.usenix.org/conference/osdi20/presentation/boos
[8] Daniel P Bovet and Marco Cesati. 2005. Understanding the Linux Kernel: from I/O ports to process management. O’Reilly Media, Inc.
[9] Miguel Castro, Manuel Costa, and Tim Harris. 2006. Securing software by enforcing data-flow integrity. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI’06). USENIX. https://www.microsoft.com/en-us/research/publication/securing-software-by-enforcing-data-flow-integrity/
[10] Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, and Edward D. Lazowska. 1994. Sharing and Protection in a Single-Address-Space Operating System. ACM Trans. Comput. Syst. 12, 4 (Nov. 1994), 271–307. https://doi.org/10.1145/195792.195795
[11] R. Joseph Connor, Tyler McDaniel, Jared M. Smith, and Max Schuchard. 2020. PKU Pitfalls: Attacks on PKU-based Memory Isolation Systems. In Proceedings of the 29th USENIX Security Symposium (USENIX Security’20). USENIX Association, 1409–1426. https://www.usenix.org/conference/usenixsecurity20/presentation/connor
[12] Jonathan Corbet. 2015. Memory protection keys. Linux Weekly News (2015). https://lwn.net/Articles/643797/.
[13] Cody Cutler, M Frans Kaashoek, and Robert T Morris. 2018. The benefits and costs of writing a POSIX kernel in a high-level language. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 89–105.
[14] Leila Delshadtehrani, Sadullah Canakci, Manuel Egele, and Ajay Joshi. 2021. Efficient Sealable Protection Keys for RISC-V. (2021).
[15] Jack Edge. 2013. Kernel address space layout randomization. https://lwn.net/Articles/569635/.
[16] Jack Edge. 2014. The kernel address sanitizer. https://lwn.net/Articles/612153/.
[17] Jack Edge. 2014. "Strong" stack protection for GCC. https://lwn.net/Articles/584225/.
[18] D. R. Engler, M. F. Kaashoek, and J. O’Toole. 1995. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (Copper Mountain, Colorado, USA) (SOSP ’95). Association for Computing Machinery, New York, NY, USA, 251–266. https://doi.org/10.1145/224056.224076
[19] Matt Fleming. 2017. A thorough introduction to eBPF. https://lwn.net/Articles/740157/.
[20] David B Golub, Daniel P Julin, Richard F Rashid, Richard P Draves, Randall W Dean, Alessandro Forin, Joseph Barrera, Hideyuki Tokuda, Gerald Malan, and David Bohman. 1992. Microkernel operating system architecture and Mach. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures. 11–30.
[21] Khilan Gudka, Robert N.M. Watson, Jonathan Anderson, David Chisnall, Brooks Davis, Ben Laurie, Ilias Marinos, Peter G. Neumann, and Alex Richardson. 2015. Clean Application Compartmentalization with SOAAP. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (Denver, Colorado, USA) (CCS ’15). Association for Computing Machinery, New York, NY, USA, 1016–1031. https://doi.org/10.1145/2810103.2813611
[22] Mohammad Hedayati, Spyridoula Gravani, Ethan Johnson, John Criswell, Michael L. Scott, Kai Shen, and Mike Marty. 2019. Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries. In 2019 USENIX Annual Technical Conference (USENIX ATC’19). USENIX Association, Renton, WA, 489–504. https://www.usenix.org/conference/atc19/presentation/hedayati-hodor
[23] Gernot Heiser, Kevin Elphinstone, Jerry Vochteloo, Stephen Russell, and Jochen Liedtke. 1999. The Mungi Single-Address-Space Operating System. Software: Practice and Experience 28, 9 (July 1999), 901–928. https://doi.org/10.1002/(SICI)1097-024X(19980725)28:9%3C901::AID-SPE181%3E3.0.CO;2-7
[24] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. 2006. MINIX 3: A Highly Reliable, Self-Repairing Operating System. SIGOPS Oper. Syst. Rev. 40, 3 (July 2006), 80–89. https://doi.org/10.1145/1151374.1151391
[25] Galen C. Hunt and James R. Larus. 2007. Singularity: Rethinking the Software Stack. SIGOPS Oper. Syst. Rev. 41, 2 (April 2007), 37–49.
[26] M Frans Kaashoek, Dawson R Engler, Gregory R Ganger, Héctor M Briceno, Russell Hunt, David Mazieres, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. 1997. Application performance and flexibility on exokernel systems. In Proceedings of the sixteenth ACM symposium on Operating systems principles. 52–65.
[27] Douglas Kilpatrick. 2003. Privman: A Library for Partitioning Applications. In USENIX Annual Technical Conference, FREENIX Track. 273–284.
[28] Gerwin Klein. 2009. Operating system verification—an overview. Sadhana 34, 1 (2009), 27–69.
[29] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. 2009. SeL4: Formal Verification of an OS Kernel. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (Big Sky, Montana, USA) (SOSP ’09). Association for Computing Machinery, New York, NY, USA, 207–220. https://doi.org/10.1145/1629575.1629596
[30] Simon Kuenzer, Vlad-Andrei Bădoiu, Hugo Lefeuvre, Sharan Santhanam, Alexander Jung, Gaulthier Gain, Cyril Soldani, Costin Lupu, Ştefan Teodorescu, Costi Răducanu, Cristian Banu, Laurent Mathy, Răzvan Deaconescu, Costin Raiciu, and Felipe Huici. 2021. Unikraft: Fast, Specialized Unikernels the Easy Way. In Proceedings of the 16th European Conference on Computer Systems (Online Event, United Kingdom) (EuroSys ’21). Association for Computing Machinery, New York, NY, USA, 376–394. https://doi.org/10.1145/3447786.3456248
[31] K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (Dakar, Senegal) (LPAR’10). Springer-Verlag, Berlin, Heidelberg, 348–370.
[32] I. M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. 1996. The design and implementation of an operating system to support distributed multimedia applications. IEEE Journal on Selected Areas in Communications 14, 7 (1996), 1280–1297. https://doi.org/10.1109/49.536480
[33] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz. 2004. Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI’04). 17–30.
[34] Guanyu Li, Dong Du, and Yubin Xia. 2020. Iso-UniK: lightweight multi-process unikernel through memory protection keys. Cybersecurity 3, 1 (May 2020), 11.
[35] Shen Liu, Gang Tan, and Trent Jaeger. 2017. PtrSplit: Supporting General Pointers in Automatic Program Partitioning. In Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security (Dallas, Texas, USA) (CCS ’17). Association for Computing Machinery, New York, NY, USA, 2359–2371. https://doi.org/10.1145/3133956.3134066
[36] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13). Association for Computing Machinery, 461–472.
[37] Toshiyuki Maeda and Akinori Yonezawa. 2003. Kernel Mode Linux: Toward an operating system protected by a type theory. In Annual Asian Computing Science Conference. Springer, 3–17.
[38] Shravan Narayan, Craig Disselkoen, Tal Garfinkel, Nathan Froyd, Eric Rahm, Sorin Lerner, Hovav Shacham, and Deian Stefan. 2020. Retrofitting Fine Grain Isolation in the Firefox Renderer. In Proceedings of the 29th USENIX Security Symposium (USENIX Security’20). USENIX Association, 699–716. https://www.usenix.org/conference/usenixsecurity20/presentation/narayan
[39] Vikram Narayanan, Tianjiao Huang, David Detweiler, Dan Appel, Zhaofeng Li, Gerd Zellweger, and Anton Burtsev. 2020. RedLeaf: Isolation and Communication in a Safe Operating System. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20). USENIX Association. https://www.usenix.org/conference/osdi20/presentation/narayanan-vikram
[40] Ruslan Nikolaev and Godmar Back. 2013. VirtuOS: An Operating System with Kernel Virtualization. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (Farmington, Pennsylvania) (SOSP ’13). Association for Computing Machinery, New York, NY, USA, 116–132. https://doi.org/10.1145/2517349.2522719
[41] Ruslan Nikolaev, Mincheol Sung, and Binoy Ravindran. 2020. LibrettOS: A Dynamically Adaptable Multiserver-Library OS. In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (Lausanne, Switzerland) (VEE ’20). Association for Computing Machinery, New York, NY, USA, 114–128. https://doi.org/10.1145/3381052.3381316
[42] Pierre Olivier, Antonio Barbalace, and Binoy Ravindran. 2020. The Case for Intra-Unikernel Isolation. Proceedings of the 10th Workshop on Systems for Post-Moore Architectures (April 2020).
[43] Pierre Olivier, Daniel Chiba, Stefan Lankes, Changwoo Min, and Binoy Ravindran. 2019. A Binary-Compatible Unikernel. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (Providence, RI, USA) (VEE 2019). Association for Computing Machinery, New York, NY, USA, 59–73. https://doi.org/10.1145/3313808.3313817
[44] J. M. Rushby. 1981. Design and Verification of Secure Systems. In Proceedings of the 8th ACM Symposium on Operating Systems Principles (Pacific Grove, California, USA) (SOSP ’81). Association for Computing Machinery, New York, NY, USA, 12–21. https://doi.org/10.1145/800216.806586
[45] Vasily A. Sartakov, Lluis Vilanova, and Peter Pietzuch. 2021. CubicleOS: A Library OS with Software Componentisation for Practical Isolation (extended abstract). In Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[46] David Schrammel, Samuel Weiser, Stefan Steinegger, Martin Schwarzl, Michael Schwarz, Stefan Mangard, and Daniel Gruss. 2020. Donky: Domain Keys – Efficient In-Process Isolation for RISC-V and x86. In Proceedings of the 29th USENIX Security Symposium (USENIX Security’20). USENIX Association, 1677–1694. https://www.usenix.org/conference/usenixsecurity20/presentation/schrammel
[47] Mincheol Sung, Pierre Olivier, Stefan Lankes, and Binoy Ravindran. 2020. Intra-Unikernel Isolation with Intel Memory Protection Keys. In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (Lausanne, Switzerland) (VEE ’20). Association for Computing Machinery, New York, NY, USA, 143–156. https://doi.org/10.1145/3381052.3381326
[48] Stylianos Tsampas, Akram El-Korashy, Marco Patrignani, Dominique Devriese, Deepak Garg, and Frank Piessens. 2017. Towards automatic compartmentalization of C programs on capability machines. In Workshop on Foundations of Computer Security 2017. 1–14.
[49] Unikraft Contributors. 2020. Unikraft release 0.4. https://github.com/unikraft/unikraft/tree/RELEASE-0.4.
[50] Anjo Vahldiek-Oberwagner, Eslam Elnikety, Nuno O. Duarte, Michael Sammler, Peter Druschel, and Deepak Garg. 2019. ERIM: Secure, Efficient In-process Isolation with Protection Keys (MPK). In Proceedings of the 28th USENIX Security Symposium (USENIX Security’19). USENIX Association, Santa Clara, CA, 1221–1238. https://www.usenix.org/conference/usenixsecurity19/presentation/vahldiek-oberwagner
[51] Dmitry Vyukov. 2020. Syzkaller: Adventures in continuous coverage-guided kernel fuzzing. BlueHat IL.
[52] Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. 1993. Efficient Software-Based Fault Isolation. In Proceedings of the 14th ACM Symposium on Operating Systems Principles (Asheville, North Carolina, USA) (SOSP ’93). Association for Computing Machinery, New York, NY, USA, 203–216. https://doi.org/10.1145/168619.168635
[53] Jiong Wang. 2018. Initial Control Flow Support for eBPF Verifier. https://lwn.net/Articles/753724/.
[54] Xiaoguang Wang, SengMing Yeoh, Robert Lyerly, Pierre Olivier, Sang-Hoon Kim, and Binoy Ravindran. 2020. A Framework for Software Diversification with ISA Heterogeneity. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020). 427–442.
[55] Jonathan Woodruff, Robert NM Watson, David Chisnall, Simon W Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G Neumann, Robert Norton, and Michael Roe. 2014. The CHERI capability model: Revisiting RISC in an age of risk. In Proceedings of the 41st International Symposium on Computer Architecture. 457–468.
[56] Yiming Zhang, Jon Crowcroft, Dongsheng Li, Chengfen Zhang, Huiba Li, Yaozheng Wang, Kai Yu, Yongqiang Xiong, and Guihai Chen. 2018. KylinX: A Dynamic Library Operating System for Simplified and Efficient Cloud Virtualization. In 2018 USENIX Annual Technical Conference (USENIX ATC’18). USENIX Association, 173–186.