Bernhard Egger
Seoul National University | SNU · Department of Computer Science and Engineering

PhD

About

73
Publications
9,697
Reads
705
Citations
Additional affiliations
February 2011 - present
Seoul National University
Position
  • Professor
February 2008 - January 2011
Samsung Advanced Institute of Technology
Position
  • Senior Researcher
Education
February 2003 - March 2008
Seoul National University
Field of study
  • Computer Science
August 1996 - October 2001
ETH Zurich
Field of study
  • Computer Science

Publications (73)
Article
Infrastructure-as-a-Service (IaaS) clouds not only have to meet business requirements but also need to consider other important metrics that influence the quality of service, such as performance, availability, and power consumption. In this paper, we present a Stochastic Activity Network (SAN) based analytical approach that simultaneously computes...
Article
The era of personal resources being sufficient for enterprise big data computations has passed. As computations are executed in the cloud, small policy changes of cloud operators may cause considerable changes in operational costs. Carefully choosing the amount of resources for a given application is thus of great importance. This, however, require...
Chapter
Full-text available
Containerization technology helps achieve not only better portability and interoperability but also better performance and efficiency in various cloud computing arrangements. Such technology is expected to empower cloud federations by enhancing portability and scalability across the federation. In this paper, we propose an architecture by adding...
Chapter
As more and more memory-intensive applications are moved into the cloud, data center operators face the challenge of providing sufficient main memory resources while achieving high resource utilization. Solutions to overcome the unsatisfying performance degradation of traditional on-demand paging include memory disaggregation that allows applicatio...
Chapter
Cloud resource providers are putting more and more emphasis on efficiently managing the resources of their data centers to achieve high utilization while minimizing energy consumption. Despite these efforts, an analysis of recent data center traces reveals that the utilization of CPU and memory resources has not improved significantly over the pa...
Article
Companies depend on mining data to grow their business more than ever. To achieve optimal performance of Big Data analytics workloads, a careful configuration of the cluster and the employed software framework is required. The lack of flexible and accurate performance models, however, renders this a challenging task. This paper fills this gap by pre...
Chapter
Power management and task placement pose two of the greatest challenges for future many-core processors in data centers. With hundreds of cores on a single die, cores experience varying memory latencies and cannot individually regulate voltage and frequency, therefore calling for new approaches to scheduling and power management. This work presents...
Chapter
Live migration of virtual machines (VMs) enables maintenance, load balancing, and power management in data centers. The cost of live migration on several key metrics combined with strict service-level objectives (SLOs), however, typically limits its practical application to situations where the underlying physical host has to undergo maintenance. A...
Chapter
Live migration of virtual machines (VMs) is an important tool for data center operators to achieve maintenance, power management, and load balancing. The relatively high cost of live migration makes it difficult to employ live migration for rapid load balancing or power management operations, leaving much of its promised benefits unused. The advanc...
Conference Paper
Full-text available
High-performance computing (HPC) clusters suffer from an overall low memory utilization that is caused by the node-centric memory allocation combined with the variable memory requirements of HPC workloads. The recent provisioning of nodes with terabytes of memory to accommodate workloads with extreme peak memory requirements further exacerbates the...
Conference Paper
Full-text available
Changyeon Jo [0000-0002-9707-1256], Hyunik Kim [0000-0001-7740-8344], and Bernhard Egger [0000-0002-6645-6161]. Abstract: Live migration of virtual machines (VMs) is an important tool for data center operators to achieve maintenance, power management, and load balancing. The relatively high cost of live migration makes it difficult to employ live...
Conference Paper
Full-text available
Power management and task placement pose two of the greatest challenges for future many-core processors in data centers. With hundreds of cores on a single die, cores experience varying memory latencies and cannot individually regulate voltage and frequency, therefore calling for new approaches to scheduling and power management. This work presents...
Conference Paper
Full-text available
Youngsu Cho [0000-0002-5519-3562], Changyeon Jo [0000-0002-9707-1256], Hyunik Kim [0000-0001-7740-8344], and Bernhard Egger [0000-0002-6645-6161]. Abstract: Live migration of virtual machines (VMs) enables maintenance, load balancing, and power management in data centers. The cost of live migration on several key metrics combined with strict serv...
Article
Understanding memory performance in multi-core platforms is a prerequisite to perform optimizations. To this end, this paper presents analytical models based on Stochastic Reward Nets (SRNs) to model and evaluate the memory performance of Non-Uniform Memory Access (NUMA) multi-core architectures. The approach considers the details of the architectu...
Article
Full-text available
Predicting the performance of parallel loops on modern shared-memory multi-socket multi-core systems as a function of the allocated resources is an important means to achieve better system utilization. Previous prediction techniques are tied to specific architectures and do not allow for purely online performance predictions without requiring an of...
Article
Full-text available
This paper describes a framework to verify the functional correctness of the Samsung Reconfigurable Processor, a dual-mode very-long instruction word (VLIW) and coarse-grained reconfigurable array (CGRA) processor integrated in smartphones, cameras, printers, and SmartTVs. Reflecting the reconfigurable nature of the processor, the test generator co...
Conference Paper
Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on the computational power, memory size, and energy consumption while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. Thus it is typical to use an energy-efficient a...
Conference Paper
Full-text available
With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to co-located applications. This paper presents NuPoCo, a framework for automatically managing parallelism of co-located para...
Conference Paper
Full-text available
Integrating CPUs and GPUs on the same die provides new opportunities for optimization, especially for irregular data-parallel workloads that fail to fully exploit the computational power of the GPU. Such workloads benefit from a proper partitioning between the CPU and the GPU. This paper presents an on-the-fly workload partitioning technique for ir...
Article
Full-text available
A convolutional neural network (CNN) architecture supporting on-device user customization is proposed. The network architecture consists of a large CNN trained on general data and a smaller augmenting network that can be re-trained on-device using a small amount of user-specific data provided by the user. The proposed approach is applied to handwritten cha...
Article
Full-text available
As more and more deep learning tasks are pushed to mobile devices, accelerators for running these networks efficiently gain in importance. We show that an existing class of general purpose accelerators, modulo-scheduled coarse-grained reconfigurable array (CGRA) processors typically used to accelerate multimedia workloads, can be a viable alterna...
Conference Paper
Full-text available
We propose and evaluate a framework to test the functional correctness of coarse-grained reconfigurable array (CGRA) processors for pre-silicon verification and post-silicon validation. To reflect the reconfigurable nature of CGRAs, an architectural model of the system under test is built directly from the hardware description files. A guided place...
Article
We propose and evaluate a framework to test the functional correctness of coarse-grained reconfigurable array (CGRA) processors for pre-silicon verification and post-silicon validation. To reflect the reconfigurable nature of CGRAs, an architectural model of the system under test is built directly from the hardware description files. A guided place...
Article
Full-text available
Modulo-scheduled coarse-grained reconfigurable array (CGRA) processors excel at exploiting loop-level parallelism at a high performance per watt ratio. The frequent reconfiguration of the array, however, causes between 25% and 45% of the consumed chip energy to be spent on the instruction memory and fetches therefrom. This article presents a hardware...
Article
In this paper, Stochastic Activity Networks (SANs) are used to model and evaluate the performance and power consumption of an Infrastructure-as-a-Service (IaaS) cloud. The proposed SAN model is scalable and flexible, yet encompasses some details of an IaaS cloud, such as Virtual Machine (VM) provisioning, VM multiplexing, and failure/repair behavio...
Conference Paper
This paper presents a convolutional neural network architecture that supports transfer learning for user customization. The architecture consists of a large basic inference engine and a small augmenting engine. Initially, both engines are trained using a large dataset. Only the augmenting engine is tuned to the user-specific dataset. To preserve th...
Conference Paper
Full-text available
Live migration is one of the key technologies to improve data center utilization, power efficiency, and maintenance. Various live migration algorithms have been proposed, each exhibiting distinct characteristics in terms of completion time, amount of data transferred, virtual machine (VM) downtime, and VM performance degradation. To make matters wo...
Conference Paper
Full-text available
Traditional approaches for cache-coherent shared memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future many-core chips where power management presents one of the most important challenges. In this work, we present a power management framework for many-core systems that does not require coherent sh...
Conference Paper
Full-text available
Schedulers for symmetric multiprocessing (SMP) machines use sophisticated algorithms to schedule processes onto the available processor cores. Hardware-dependent code and the use of locks to protect shared data structures from simultaneous access lead to poor portability, difficulty in proving correctness, and a myriad of problems associated with...
Conference Paper
Full-text available
Many-core chips are especially attractive for data center operators providing cloud computing service models. With the advance of many-core chips in such environments, energy-conscious scheduling of independent processes or operating systems (OSes) is gaining importance. An important research question is how the scheduler of such a system should ass...
Conference Paper
Full-text available
Space-sharing is regarded as the proper resource management scheme for many-core OSes. For today’s many-core chips and parallel programming models providing no explicit resource requirements, an important research problem is to provide a proper resource allocation to the running applications while considering not only the architectural features but...
Article
Full-text available
The ability to save the state of a running virtual machine (VM) for later restoration is an important tool for home, server, and virtual desktop cloud (VDC) environments in order to achieve optimal and balanced hardware utilization. With guest memory sizes of four to eight gigabytes being the norm, the time- and space-overhead of storing VM checkpoi...
Conference Paper
Full-text available
We present an accurate online scalability prediction model for data-parallel programs on NUMA many-core systems. Memory contention is considered to be the major limiting factor of program scalability as data parallelism limits the amount of synchronization or data dependencies between parallel work units. Reflecting the architecture of NUMA systems...
Patent
Provided is an apparatus and method for generating code overlay capable of minimizing the number of memory copies. A static temporal relationship graph (STRG) is generated in which each function of a program corresponds to a node of the STRG and a conflict miss value corresponds to an edge of the STRG. The conflict miss value is the maximum num...
Patent
A multiprocessor using a shared virtual memory (SVM) is provided. The multiprocessor includes a plurality of processing cores and a memory manager configured to transform a virtual address into a physical address to allow a processing core to access a memory region corresponding to the physical address.
Patent
Full-text available
A reconfigurable processor which merges an inner loop and an outer loop which are included in a nested loop and allocates the merged loop to processing elements in parallel, thereby reducing processing time to process the nested loop. The reconfigurable processor may extract loop execution frequency information from the inner loop and the outer loo...
Patent
Full-text available
An apparatus and method for scheduling an instruction are provided. The apparatus includes an analyzer configured to analyze dependency of a plurality of recurrence loops and a scheduler configured to schedule the recurrence loops based on the analyzed dependencies. When scheduling a plurality of recurrence loops, the apparatus first schedules a domin...
Patent
Full-text available
A debugging apparatus and method are provided. The debugging apparatus may include a breakpoint setting unit configured to store a first instruction corresponding to a breakpoint in a table, stop a program currently being executed, and insert a breakpoint instruction including current location information of the first instruction into the breakpoin...
Patent
Full-text available
An apparatus and method for dynamically determining the execution mode of a reconfigurable array are provided. Performance information of a loop may be obtained before and/or during the execution of the loop. The performance information may be used to determine whether to operate the apparatus in a very long instruction word (VLIW) mode or in a coa...
Patent
Full-text available
A scheduler of a reconfigurable array, a method of scheduling commands, and a computing apparatus are provided. To perform a loop operation in a reconfigurable array, a recurrence node, a producer node, and a predecessor node are detected from a data flow graph of the loop operation such that resources are assigned to such nodes so as to increase t...
Patent
Full-text available
A processor and a processor control method which efficiently perform an operation on data using a register are provided. The register may include a data type field and a data field. The processor may generate the data type bits and store the generated data type bits in the data type field.
Patent
Full-text available
Provided are a reconfigurable processor and operating method thereof. The reconfigurable processor may use a configuration memory distributed to each operation unit. The distributed configuration memory may be separated into a distributed operation configuration memory including configuration information about an operation of a function unit, and a...
Article
Full-text available
To exploit the abundant computational power of the world’s fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two accurate...
Patent
Full-text available
An apparatus and method for generating a very long instruction word (VLIW) command that supports predicated execution, and a VLIW processor and method for processing a VLIW are provided herein. The VLIW command includes an instruction bundle formed of a plurality of instructions to be executed in parallel and a single value indicating predicated ex...
Conference Paper
Full-text available
Live migration of virtual machines (VMs) from one physical host to another is a key enabler for virtual desktop clouds (VDC). The prevalent algorithm, pre-copy, suffers from long migration times and a high data transfer volume for non-idle VMs, which hinders effective use of live migration in VDC environments. In this paper, we present an optimizatio...
Patent
Full-text available
A memory managing apparatus and method are provided. The memory managing apparatus may determine, based on a pointer indicator bit, the target memory area on which garbage collection is to be performed, and may perform the garbage collection on the target memory area. The memory managing apparatus may generate the pointer indicator bit and store th...
Patent
Full-text available
Described herein is a reconfigurable processor which uses a distributed configuration memory structure and an operation method thereof in which power consumption is reduced. A processing unit which configures the reconfigurable processor includes a functional unit, a distributed configuration memory, a no-operation (NOP) register, and a controller....
Patent
Full-text available
An interrupt support determining apparatus and method for an equal-model processor, and a processor including the interrupt support determining apparatus are provided. The interrupt support determining apparatus determines whether an instruction input to a processor decoder is a multiple latency instruction, compares a current latency of the instru...
Patent
Full-text available
A computing apparatus and method of handling an interrupt are provided. The computing apparatus includes a coarse-grained array, a host processor, and an interrupt supervisor. When an interrupt occurs in the coarse-grained array while performing a loop operation, the host processor processes the interrupt, and the interrupt supervisor may perform m...
Article
Full-text available
Saving the state of a running virtual machine (VM) for later restoration has become an indispensable tool to achieve balanced and energy-efficient usage of the underlying hardware in virtual desktop cloud (VDC) environments. To free up resources, a remote user’s VM is saved to external storage when the user disconnects and restored when the user re...
Patent
Full-text available
An interrupt handling technology and a reconfigurable processor are provided. The reconfigurable processor includes a plurality of processing elements, and some of the processing elements are designated for interrupt handling. When an interrupt request occurs while the reconfigurable processor is executing a loop operation, the designated processin...
Conference Paper
Full-text available
Live migration of virtual machines (VMs) across distinct physical hosts is an important feature of virtualization technology for maintenance, load-balancing and energy reduction, especially so for data center operators and cluster service providers. Several techniques have been proposed to reduce the downtime of the VM being transferred, often at t...
Conference Paper
Full-text available
The advent of many-core chips in the embedded world poses new challenges for programmers. One of the most important challenges to achieve optimal performance is that the variance in memory access time depending on the issuing core and the location of the data has to be taken into consideration in the already complex environment of parallel applica...
Article
Full-text available
There is an increasing interest in explicitly managed memory hierarchies, where a hierarchy of distinct memories is exposed to the programmer and managed explicitly in software. These hierarchies can be found in typical embedded systems and an emerging class of multicore architectures. To run an application that requires more code memory than the a...
Conference Paper
Full-text available
Checkpointing, i.e., recording the volatile state of a virtual machine (VM) running as a guest in a virtual machine monitor (VMM) for later restoration, includes storing the memory available to the VM. Typically, a full image of the VM's memory along with processor and device states is recorded. With guest memory sizes of up to several gigabytes,...
Conference Paper
Full-text available
In high-end embedded systems, coarse-grained reconfigurable architectures (CGRA) continue to replace traditional ASIC designs. CGRAs offer high performance at a low power consumption, yet provide flexibility through programmability. In this paper we introduce a recurrence cycle-aware scheduling technique for CGRAs. Our modulo scheduler groups opera...
Article
Full-text available
There is strong demand for fast and cycle-accurate virtual platforms in the embedded systems area where developers can do meaningful software development including performance debugging in the context of the entire platform. In this paper, we describe the design and implementation of a fast and cycle-accurate architecture simulator called...
Conference Paper
Full-text available
There is strong demand for fast and cycle-accurate virtual platforms in the embedded systems area where developers can do meaningful software development including performance debugging in the context of the entire platform. In this paper, we describe the design and implementation of a fast and cycle-accurate architecture simulator called...