Sunwoo Lee
Inha University · Department of Computer Science & Engineering

Ph.D.

About

38 Publications
3,996 Reads
256 Citations
Introduction
I am an assistant professor of Computer Engineering at Inha University, South Korea, and the director of the Large-Scale Machine Learning Systems research lab (LMLS-Lab). My research interests are scalable Deep Learning and Machine Learning for large-scale applications. I also study High-Performance Computing problems such as parallel I/O and communication algorithms.
Additional affiliations
December 2008 - March 2013
Humax
Position
  • Software Engineer
Description
  • (Alternative Military Service) Worked as a software engineer developing user-level device drivers for embedded systems.
October 2020 - August 2022
University of Southern California
Position
  • Postdoc
Description
  • Worked as a postdoctoral researcher advised by Prof. Salman Avestimehr.
April 2013 - January 2015
Samsung
Position
  • Researcher
Description
  • Worked on projects developing system software for high-performance SSD-based storage servers.
Education
March 2015 - September 2019
Northwestern University
Field of study
  • Computer Science
March 2007 - February 2009
Hanyang University
Field of study
  • Computer Engineering
March 2003 - February 2007
Hanyang University
Field of study
  • Computer Engineering

Publications (38)
Conference Paper
Full-text available
Training a Convolutional Neural Network (CNN) is a computationally intensive task whose parallelization has become critical in order to complete the training in an acceptable time. However, there are two obstacles to developing a scalable parallel CNN in a distributed-memory computing environment. One is the high degree of data dependency exhibited i...
Preprint
Full-text available
Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discr...
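As a point of reference, the scheme described above can be sketched in a few lines: each worker runs SGD locally and the models are averaged every few steps. The sketch below uses a toy quadratic objective per worker; the worker count, step counts, and learning rate are illustrative, not the paper's setup.
```python
import numpy as np

# Toy local SGD with periodic model averaging (FedAvg-style).
# Each worker minimizes 0.5 * ||model - target_w||^2 on its own "data".
rng = np.random.default_rng(0)
num_workers, dim, local_steps, rounds, lr = 4, 10, 8, 5, 0.1
targets = [rng.normal(size=dim) for _ in range(num_workers)]   # per-worker optima
global_model = np.zeros(dim)

for _ in range(rounds):
    local_models = []
    for w in range(num_workers):
        model = global_model.copy()
        for _ in range(local_steps):
            grad = model - targets[w]      # gradient of the local quadratic loss
            model -= lr * grad             # local SGD step (no communication)
        local_models.append(model)
    global_model = np.mean(local_models, axis=0)   # periodic model averaging
```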
Article
We consider synchronous data-parallel neural network training with a fixed large batch size. While the large batch size provides a high degree of parallelism, it degrades the generalization performance due to the low gradient noise scale. We propose a general learning rate adjustment framework and three critical heuristics that tackle the poor gene...
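For illustration only, a widely used large-batch heuristic is the linear learning-rate scaling rule combined with gradual warmup; the sketch below shows that generic adjustment and is not necessarily one of the three heuristics proposed in this article.
```python
def scaled_lr(base_lr, base_batch, large_batch, step, warmup_steps):
    """Linear LR scaling with gradual warmup (generic large-batch heuristic)."""
    target_lr = base_lr * (large_batch / base_batch)   # linear scaling rule
    if step < warmup_steps:                            # ramp up from base_lr
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr

# e.g., base LR 0.1 tuned for batch 256, scaled for batch 8192 with 500 warmup steps
lr_at_step_100 = scaled_lr(0.1, 256, 8192, step=100, warmup_steps=500)
```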
Article
In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. Thus, to implement large-scale FL applications, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows al...
Conference Paper
Sharpness-aware minimization (SAM) is known to improve the generalization performance of neural networks. However, it is not widely used in real-world applications yet due to its expensive model perturbation cost. A few variants of SAM have been proposed to tackle such an issue, but they commonly do not alleviate the cost noticeably. In this paper,...
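For background, the perturbation cost in question comes from the baseline SAM step, which can be sketched in PyTorch as follows: compute the gradient, perturb the weights along the normalized gradient, recompute the gradient at the perturbed point, then restore and update. The model, data, and rho below are placeholders, and this is the standard SAM recipe rather than the cheaper variant proposed in this paper.
```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)
rho = 0.05

# 1) gradient at the current weights
loss_fn(model(x), y).backward()
grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

# 2) perturb the weights by rho * g / ||g|| and recompute the gradient there
eps = []
with torch.no_grad():
    for p in model.parameters():
        e = rho * p.grad / (grad_norm + 1e-12)
        p.add_(e)
        eps.append(e)
model.zero_grad()
loss_fn(model(x), y).backward()

# 3) restore the original weights and update with the perturbed-point gradient
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):
        p.sub_(e)
opt.step()
opt.zero_grad()
```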
Preprint
Full-text available
In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. Thus, to implement large-scale FL applications, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows all a...
Article
Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple clients and periodically averages the model across all the clients. This periodic model averaging potentially causes a significant model discrepancy across the clients mak...
Preprint
Full-text available
Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute cost of Hessian-related computations and instability issues in stochastic training. A well-known method, L-BFGS, which efficiently approximates the Hessian using a history of parameter and gradient changes, suffers from convergence inst...
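For context, the core of classical L-BFGS is the two-loop recursion, which builds a search direction from recent parameter differences s and gradient differences y. Below is a minimal NumPy sketch of that textbook recursion, not the stabilized variant studied in the preprint.
```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate -(inverse Hessian) @ grad.
    s_hist, y_hist: recent parameter and gradient differences, oldest first."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):    # newest to oldest
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        q -= a * y
        alphas.append((a, rho))
    # initial Hessian approximation: gamma * identity
    gamma = s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1]) if s_hist else 1.0
    r = gamma * q
    for (s, y), (a, rho) in zip(zip(s_hist, y_hist), reversed(alphas)):  # oldest to newest
        b = rho * y.dot(r)
        r += (a - b) * s
    return -r   # descent direction
```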
Article
In Federated Learning (FL), a common approach for aggregating local solutions across clients is periodic full model averaging. It is, however, known that different layers of neural networks can have a different degree of model discrepancy across the clients. The conventional full aggregation scheme does not consider such a difference and synchroniz...
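The idea can be sketched as follows, assuming each client model is a dict of NumPy arrays: only the layers selected for a round are averaged and broadcast, while the remaining layers stay local. The selection rule here is a placeholder rather than the discrepancy-aware criterion of the paper.
```python
import numpy as np

def partial_aggregate(client_models, layers_to_sync):
    """client_models: list of dicts mapping layer name -> np.ndarray.
    Average and broadcast only the selected layers; the rest stay local."""
    for name in layers_to_sync:
        avg = np.mean([m[name] for m in client_models], axis=0)
        for m in client_models:
            m[name] = avg.copy()
    return client_models

# e.g., synchronize only the classifier layer this round (layer names are hypothetical)
clients = [{"conv1": np.random.randn(3, 3), "fc": np.random.randn(4)} for _ in range(8)]
partial_aggregate(clients, layers_to_sync=["fc"])
```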
Preprint
Full-text available
In cross-device Federated Learning (FL) environments, scaling synchronous FL methods is challenging as stragglers hinder the training process. Moreover, the availability of each client to join the training is highly variable over time due to system heterogeneities and intermittent connectivity. Recent asynchronous FL methods (e.g., FedBuff) have be...
Preprint
Federated Learning (FL) enables collaboration among clients to train machine learning models while protecting their data privacy. Existing FL simulation platforms, designed from the perspective of traditional distributed training, suffer from laborious code migration between simulation and production, low efficiency, low GPU utility, low...
Conference Paper
In Federated Learning, a common approach for aggregating local models across clients is periodic averaging of the full model parameters. It is, however, known that different layers of neural networks can have a different degree of model discrepancy across the clients. The conventional full aggregation scheme does not consider such a difference and...
Preprint
Full-text available
Federated learning (FL) has gained substantial attention in recent years due to the data privacy concerns related to the pervasiveness of consumer devices that continuously collect data from users. While a number of FL benchmarks have been developed to facilitate FL research, none of them include audio data and audio-related tasks. In this paper, w...
Preprint
Full-text available
Limited compute and communication capabilities of edge users create a significant bottleneck for federated learning (FL) of large models. We consider a realistic, but much less explored, cross-device FL setting in which no client has the capacity to train a full large model nor is willing to share any intermediate activations with the server. To th...
Article
Exploiting oxygen vacancies has emerged as a versatile tool to tune the electronic and optoelectronic properties of complex oxide heterostructures. For the precise manipulation of the oxygen vacancies, the capability of directly probing the defect distribution at the nanoscale is essential, but still lacking. Here we estimate the spatial distribution o...
Article
Full-text available
Resistive switching devices have been regarded as a promising candidate of multi-bit memristors for synaptic applications. The key functionality of the memristors is to realize multiple non-volatile conductance states with high precision. However, the variation of device conductance inevitably causes the state-overlap issue, limiting the number of...
Preprint
Full-text available
In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, helps us better understand the fundamental particles and their interactions. This data is often captured in many files of small size, creating a data management challenge for scientists. In order to better facilitate data management, transfer, and ana...
Article
Synchronous SGD with data parallelism, the most popular parallelization strategy for CNN training, suffers from the expensive communication cost of averaging gradients among all workers. The iterative parameter updates of SGD cause frequent communication, which becomes the performance bottleneck. In this paper, we propose a lazy parameter update a...
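For reference, the per-iteration communication in question is a gradient allreduce across all workers; a minimal mpi4py sketch of that baseline step, with an illustrative gradient buffer, is shown below.
```python
from mpi4py import MPI
import numpy as np

# Baseline synchronous data-parallel step: average local gradients every iteration.
comm = MPI.COMM_WORLD
grad = np.random.default_rng(comm.Get_rank()).normal(size=1024)  # local gradient (toy)
comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)                   # sum across workers
grad /= comm.Get_size()                                          # averaged gradient
# ... apply the averaged gradient to the local model replica ...
```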
Conference Paper
Full-text available
Many scientific applications have started using deep learning methods for their classification or regression problems. However, for data-intensive scientific applications, I/O performance can be the major performance bottleneck. In order to effectively solve important real-world problems using deep learning methods on High-Performance Computing (HP...
Article
In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many files of small size, creating a data management challenge for scientists. In order to better facilitate data management, transfer, and anal...
Preprint
Full-text available
Memristors are essential elements for hardware implementation of artificial neural networks. The key functionality of the memristors is to realize multiple non-volatile conductance states with high precision. However, the variation of device conductance limits the number of allowed states. Since actual data for neural network training inherently ha...
Preprint
Full-text available
Federated Learning (FL) is transforming the ML training ecosystem from a centralized over-the-cloud setting to distributed training over edge devices in order to strengthen data privacy. An essential but rarely studied challenge in FL is label deficiency at the edge. This problem is even more pronounced in FL compared to centralized training due to...
Article
Full-text available
The massive amount of data produced during simulation on high-performance computers has grown exponentially over the past decade, exacerbating the need for streaming compression and decompression methods for efficient storage and transfer of this data—key to realizing the full potential of large-scale computational science. Lossy compression approa...
Conference Paper
Full-text available
Synchronous Stochastic Gradient Descent (SGD) with data parallelism, the most popular parallel training strategy for deep learning, suffers from expensive gradient communications. Local SGD with periodic model averaging is a promising alternative to synchronous SGD. The algorithm allows each worker to locally update its own model, and periodically...
Article
Two-phase I/O is a well-known strategy for implementing collective MPI-IO functions. It redistributes I/O requests among the calling processes into a form that minimizes the file access costs. As modern parallel computers continue to grow into the exascale era, the communication cost of such request redistribution can quickly overwhelm collective I...
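As an illustration, a collective write that MPI-IO libraries commonly service with this two-phase strategy can be issued from Python via mpi4py; the file name and block size below are placeholders.
```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
local = np.full(1024, rank, dtype=np.float64)        # this process's data block

fh = MPI.File.Open(comm, "output.bin", MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * local.nbytes                         # byte offset of this block
fh.Write_at_all(offset, local)                       # collective write (two-phase I/O)
fh.Close()
```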
Conference Paper
Full-text available
Training a Convolutional Neural Network (CNN) is a computationally intensive task, requiring efficient parallelization to shorten the execution time. Considering the ever-increasing size of available training data, the parallelization of CNN training becomes more important. Data-parallelism, a popular parallelization strategy that distributes the inp...
Preprint
Two-phase I/O is a well-known strategy for implementing collective MPI-IO functions. It redistributes I/O requests among the calling processes into a form that minimizes the file access costs. As modern parallel computers continue to grow into the exascale era, the communication cost of such request redistribution can quickly overwhelm collective I...
Conference Paper
Full-text available
Training modern Convolutional Neural Network (CNN) models is extremely time-consuming, and the efficiency of its parallelization plays a key role in finishing the training in a reasonable amount of time. The well-known parallel synchronous Stochastic Gradient Descent (SGD) algorithm suffers from high costs of inter-process communication and synchro...
Conference Paper
Full-text available
Intel Xeon Phi is a processor based on MIC architecture that contains a large number of compute cores with a high local memory bandwidth and 512-bit vector processing units. To achieve high performance on Xeon Phi, it is important for programmers to explore all the software features provided by the Intel compiler and libraries to fully utilize the...
Conference Paper
Community detection is an important data clustering technique for studying graph structures. Many serial algorithms have been developed and well studied in the literature. As the problem size grows, the research attention has recently been turning to parallelizing the technique. However, the conventional parallelization strategies that divide the p...
Conference Paper
Multiprocessor embedded software presents major challenges including the increased complexity and stringent performance requirements raised by parallel processing capability. Although component-based approaches can greatly alleviate the complexity problem, traditional approaches do not provide adequate support for performance requirements on multip...
