
Argo: An Exascale Operating System and Runtime

Swann Perarnau (swann@anl.gov), Rinku Gupta (rgupta@anl.gov), Pete Beckman (beckman@anl.gov)

Abstract

New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. Compute nodes are expected to host both general-purpose and special-purpose processors or accelerators, with more complex memory hierarchies. At that scale, the HPC community expects that new programming models will also be required to take advantage of both intra-node and inter-node parallelism. In this context, the Argo project is developing a new operating system and runtime for exascale machines, designed from the ground up to run future HPC applications at extreme scales. At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload changes, allowance for massive concurrency, a hierarchical framework for management of nodes, and a cross-layer communication infrastructure that allows resource managers and optimizers to communicate efficiently across the platform.
Keywords: High-Performance Computing, Supercomputers, Exascale, Operating System, Runtime
1. INTRODUCTION
Exascale supercomputers are expected to comprise hundreds of thousands of heterogeneous compute nodes linked by complex networks. Those compute nodes will have an intricate mix of general-purpose multicores and special-purpose accelerators targeting compute-intensive workloads, with deep multi-level memory hierarchies. As such, the HPC community expects exascale systems to require new programming models to take advantage of both intra-node and inter-node parallelism.
The Argo project, funded under the DOE ExaOSR initiative, aims to provide an Operating System and Runtime (OS/R) designed to support extreme-scale scientific computations. With this goal in mind, Argo seeks to efficiently exploit new processor, memory, and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale. At the heart of this project are four key innovations: dynamic reconfiguration of node resources in response to workload changes, allowance for massive concurrency, a hierarchical framework for management of nodes, and a cross-layer communication infrastructure that allows resource managers and optimizers to communicate efficiently across the platform. These innovations will result in an open-source prototype system that is expected to form the basis of production exascale systems deployed in the 2020 timeframe.
We provide here an overall description of the project before highlighting recent achievements in performance and integration with existing systems.
2. THE ARGO PROJECT
Providing a complete software stack for exascale systems, the Argo components span all levels of the machine: a parallel runtime sits on top of an HPC-aware operating system
for each node, while a distributed collection of services manages all nodes, using a global communication bus.
NodeOS is the operating system running on each node of an Argo machine. It is based on the Linux kernel, tuned and extended for HPC use on future architectures. In particular, we leverage the control groups interface and extend it to provide lightweight compute containers with exclusive access to hardware resources. To limit OS noise on the node, system services are restricted to a small dedicated share of cores and memory nodes. Additionally, the NodeOS provides custom memory and scheduling policies, as well as specialized interfaces for parallel runtimes.
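
As an illustration of the mechanism NodeOS builds on, not its actual implementation, the following sketch uses the standard Linux cgroup-v1 cpuset interface to give a compute container exclusive CPUs and a dedicated memory node; the CPU and node numbers are arbitrary examples, and the program must run as root.

```c
/*
 * Illustrative sketch (not the NodeOS implementation) of cpuset-based
 * resource partitioning: carve out CPUs 2-15 and NUMA node 1 for a
 * "compute" container, leaving the remaining cores to system services.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    /* Create the container's cgroup under the cpuset hierarchy. */
    mkdir("/sys/fs/cgroup/cpuset/compute", 0755);

    /* Reserve CPUs and a memory node for application work. */
    write_str("/sys/fs/cgroup/cpuset/compute/cpuset.cpus", "2-15");
    write_str("/sys/fs/cgroup/cpuset/compute/cpuset.mems", "1");

    /* Move the calling process (e.g., an application launcher) in. */
    char pid[16];
    snprintf(pid, sizeof pid, "%d", (int)getpid());
    write_str("/sys/fs/cgroup/cpuset/compute/tasks", pid);
    return 0;
}
```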
Argobots is the runtime component of Argo. It implements a low-level threading and tasking framework entirely in user space, giving users total control over their resource utilization, and provides data-movement infrastructure and tasking libraries for massively concurrent systems.
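
For example, a minimal program built on the public Argobots C interface spawns a few user-level threads (ULTs) on the primary execution stream; scheduling happens entirely in user space, and error checking is omitted for brevity.

```c
/* Minimal sketch using the public Argobots C interface. */
#include <abt.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_ULTS 8

/* Work executed by each ULT. */
static void hello(void *arg)
{
    printf("ULT %d running\n", (int)(intptr_t)arg);
}

int main(int argc, char **argv)
{
    ABT_init(argc, argv);

    /* Get the pool attached to the primary execution stream. */
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Create lightweight ULTs, then join and free them. */
    ABT_thread ults[NUM_ULTS];
    for (int i = 0; i < NUM_ULTS; i++)
        ABT_thread_create(pool, hello, (void *)(intptr_t)i,
                          ABT_THREAD_ATTR_NULL, &ults[i]);
    for (int i = 0; i < NUM_ULTS; i++)
        ABT_thread_free(&ults[i]);

    ABT_finalize();
    return 0;
}
```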
GlobalOS is a collection of services implementing distributed, dynamic control of the Argo machine. It divides the system into enclaves: groups of nodes sharing the same configuration and managed as a whole. These enclaves can be subdivided, forming a hierarchy, with dedicated nodes (masters) at each level to respond to events. Among the provided services, the GlobalOS includes distributed algorithms for power management and for fault management that mimics exception handling across the enclave tree.
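
To make the hierarchical idea concrete, here is a hypothetical sketch, not Argo's actual code: each enclave master splits its power budget among child enclaves in proportion to their node counts and recurses down the tree. The enclave structure and the proportional policy are illustrative assumptions.

```c
/* Hypothetical sketch of hierarchical power budgeting over enclaves. */
#include <stdio.h>

struct enclave {
    const char *name;
    int nodes;               /* nodes in this enclave's subtree */
    double budget_watts;     /* budget assigned by the parent master */
    struct enclave *children;
    int nchildren;
};

static void distribute_budget(struct enclave *e)
{
    if (e->nchildren == 0) {
        printf("%s: %.0f W for %d nodes\n", e->name, e->budget_watts, e->nodes);
        return;
    }
    for (int i = 0; i < e->nchildren; i++) {
        struct enclave *c = &e->children[i];
        c->budget_watts = e->budget_watts * c->nodes / e->nodes;
        distribute_budget(c);        /* recurse down the enclave tree */
    }
}

int main(void)
{
    struct enclave jobs[] = {
        { "jobA", 600, 0, NULL, 0 },
        { "jobB", 400, 0, NULL, 0 },
    };
    struct enclave machine = { "machine", 1000, 2.0e6, jobs, 2 };
    distribute_budget(&machine);     /* machine-wide 2 MW budget */
    return 0;
}
```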
The Global Information Bus (GIB) is a scalable communication infrastructure that takes advantage of modern high-performance networks to provide applications and system services with efficient reporting and resource-monitoring services.
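
The GIB itself is distributed across nodes; purely as an illustration of the publish/subscribe pattern it offers to services, the following in-process sketch uses hypothetical names (gib_subscribe, gib_publish) that are not BEACON's actual API.

```c
/* Illustrative in-process publish/subscribe; names are hypothetical. */
#include <stdio.h>
#include <string.h>

#define MAX_SUBS 16

typedef void (*handler_fn)(const char *topic, const char *msg);

static struct { const char *topic; handler_fn fn; } subs[MAX_SUBS];
static int nsubs;

static void gib_subscribe(const char *topic, handler_fn fn)
{
    subs[nsubs].topic = topic;
    subs[nsubs].fn = fn;
    nsubs++;
}

static void gib_publish(const char *topic, const char *msg)
{
    /* Deliver the message to every matching subscriber. */
    for (int i = 0; i < nsubs; i++)
        if (strcmp(subs[i].topic, topic) == 0)
            subs[i].fn(topic, msg);
}

/* A power manager might subscribe to per-node power samples. */
static void on_power_sample(const char *topic, const char *msg)
{
    printf("power manager got [%s]: %s\n", topic, msg);
}

int main(void)
{
    gib_subscribe("node/power", on_power_sample);
    gib_publish("node/power", "node42 watts=153.2");
    return 0;
}
```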
In its current state, the Argo project has prototype implementations of most of its components:
1. A prototype design and implementation of GlobalOS built on top of OpenStack services. The current implementation relies on bare-metal provisioning of compute nodes and provides enclave creation and tracking, configuration of system services, and job launching.
2. The GlobalOS also includes distributed enclave-level and system-wide power management algorithms.
3. BEACON, the pub/sub framework of the Global Information Bus, is available in its 1.0 version. A prototype implementation on top of EVPath and the Riak key-value store is also available.
4. The Argobots runtime has been successfully integrated with several existing programming models: MPI, OpenMP, Charm++, Cilk Plus, and PTG.
5. In addition, a collaboration with RIKEN in Japan led to a highly scalable OpenMP implementation for nested and irregular loops/tasks on top of Argobots (see the sketch after this list).
6. The NodeOS currently provides partitioning of CPU and memory resources, a prototype implementation of its compute containers, and a custom scheduling policy for modern HPC runtimes.
7. DI-MMAP, a tool that integrates NVRAM into the memory hierarchy of the system and makes it usable by parallel applications, is also integrated.
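
The sketch below shows the kind of nested, irregular task parallelism targeted by the Argobots-based OpenMP work in item 5: a naive recursive Fibonacci written with standard OpenMP tasking. Any OpenMP compiler accepts it; only the runtime underneath changes.

```c
/* Standard OpenMP tasking example of nested, irregular parallelism. */
#include <stdio.h>

static long fib(int n)
{
    if (n < 2)
        return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait             /* wait for both child tasks */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single               /* one thread seeds the task tree */
    r = fib(20);
    printf("fib(20) = %ld\n", r);
    return 0;
}
```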
For the near future, the Argo project will focus on greater integration between its components, aiming for scalability and functionality testing on large-scale DOE facilities and applications. In particular, we will focus on the following points:
1. Refine the functionality of GlobalOS to include failure management, fault tolerance, recursive enclave management, and user customization of enclaves.
2. Add functionality to EXPOSE, the performance monitoring component, and integrate it with TAU.
3. Research new features in the NodeOS, the Argobots runtime, and the GIB that may arise as more information on future systems becomes available.
4. Prepare and demonstrate at future conferences a complete integrated software stack on a large-scale system.
3. ADDITIONAL AUTHORS
Argonne National Laboratory: Judicael Zounmevo, Huiwei Lu, Kenneth Raffenetti, Sangmin Seo, Pavan Balaji, Franck Cappello, Kamil Iskra, Rajeev Thakur, Kazutomo Yoshii, Marc Snir.
University of Illinois at Urbana-Champaign: Cyril Bordage, Laxmikant Kale, Yanhua Sun, Jonathan Lifflander.
University of Tennessee: George Bosilca, Jack Dongarra, Damien Genet, Thomas Herault.
University of Oregon: Sameer Shende, Xuechen Zheng, Wyatt Spear, Daniel Ellsworth, Allen D. Malony.
Lawrence Livermore National Laboratory: Maya Gokhale, Barry Rountree, Martin Schulz, Brian Van Essen, Edgar Leon.
University of Chicago: Henry Hoffman, Nikita Mishra, Huazhe Zhang.
Pacific Northwest National Laboratory: Sriram Krishnamoorthy, Roberto Gioiosa, David Callahan, Gokcen Kestor.
B. Van Essen, M. Jiang, and M. Gokhale. Developing a framework for analyzing data movement within a memory management runtime for data-intensive applications. In Non-Volatile Memories Workshop, San Diego, CA, Mar. 2015.