Mingrui Liu’s research while affiliated with George Mason University and other places

Publications (32)


Local Steps Speed Up Local GD for Heterogeneous Distributed Logistic Regression
  • Preprint

January 2025

Michael Crawshaw

·

Blake Woodworth

·

Mingrui Liu

We analyze two variants of Local Gradient Descent applied to distributed logistic regression with heterogeneous, separable data and show convergence at the rate O(1/KR) for K local steps and sufficiently large R communication rounds. In contrast, all existing convergence guarantees for Local GD applied to any problem are at least Ω(1/R), meaning they fail to show the benefit of local updates. The key to our improved guarantee is showing progress on the logistic regression objective when using a large stepsize η ≫ 1/K, whereas prior analysis depends on η ≤ 1/K.
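
As a concrete illustration of the setup, here is a minimal NumPy sketch of Local GD on a toy two-client separable logistic regression problem with a large stepsize η ≫ 1/K; the data, constants, and function names are illustrative and not taken from the paper.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss (1/n) * sum log(1 + exp(-y_i * x_i^T w))."""
    margins = y * (X @ w)
    return -(X.T @ (y * (1.0 / (1.0 + np.exp(margins))))) / len(y)

def local_gd(client_data, K=10, R=50, eta=1.0):
    """Local GD: each client runs K local gradient steps, then the server averages.
    eta=1.0 with K=10 reflects the large-stepsize regime (eta >> 1/K) from the abstract."""
    d = client_data[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(R):                      # R communication rounds
        local_models = []
        for X, y in client_data:            # heterogeneous clients
            w_local = w.copy()
            for _ in range(K):              # K local steps
                w_local -= eta * logistic_grad(w_local, X, y)
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)   # server averages the local models
    return w

# toy heterogeneous, separable data on two clients (illustrative only)
rng = np.random.default_rng(0)
X1 = rng.normal(loc=+2.0, size=(50, 5)); y1 = np.ones(50)
X2 = rng.normal(loc=-2.0, size=(50, 5)); y2 = -np.ones(50)
w_hat = local_gd([(X1, y1), (X2, y2)], K=10, R=50, eta=1.0)
```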


A Nearly Optimal Single Loop Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness

December 2024

·

1 Read

This paper studies the problem of stochastic bilevel optimization where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level function is strongly convex. This problem is motivated by meta-learning applied to sequential data, such as text classification using recurrent neural networks, where the smoothness constant of the upper-level loss function scales linearly with the gradient norm and can be potentially unbounded. Existing algorithms crucially rely on a nested-loop design, which requires significant tuning effort and is not practical. In this paper, we address this issue by proposing a Single Loop bIlevel oPtimizer (SLIP). The proposed algorithm first updates the lower-level variable by a few steps of stochastic gradient descent, and then simultaneously updates the upper-level variable by normalized stochastic gradient descent with momentum and the lower-level variable by stochastic gradient descent. Under standard assumptions, we show that our algorithm finds an ε-stationary point within Õ(1/ε^4) oracle calls of stochastic gradient or Hessian-vector product, both in expectation and with high probability (here Õ(·) compresses logarithmic factors of 1/ε and 1/δ, where δ ∈ (0,1) denotes the failure probability). This complexity result is nearly optimal up to logarithmic factors without mean-square smoothness of the stochastic gradient oracle. Our proof relies on (i) a refined characterization and control of the lower-level variable and (ii) establishing a novel connection between bilevel optimization and stochastic optimization under distributional drift. Our experiments on various tasks show that our algorithm significantly outperforms strong baselines in bilevel optimization.
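
The single-loop structure described above can be sketched as follows; the hypergradient and lower-level gradient oracles, the toy example, and all step sizes are placeholders, not the paper's implementation.

```python
import numpy as np

def slip_sketch(hypergrad_oracle, lower_grad_oracle, x0, y0,
                warmup_steps=20, T=1000, eta_x=0.01, eta_y=0.05, beta=0.9):
    """Single-loop bilevel sketch in the spirit of the abstract:
    (1) warm-start the lower-level variable y with a few SGD steps;
    (2) then, in one loop, simultaneously update x by normalized SGD with
        momentum and y by SGD.
    `hypergrad_oracle(x, y)` returns a stochastic hypergradient estimate and
    `lower_grad_oracle(x, y)` a stochastic gradient of the lower-level loss in y;
    both are user-supplied placeholders."""
    x, y = x0.copy(), y0.copy()
    for _ in range(warmup_steps):                        # lower-level warm start
        y -= eta_y * lower_grad_oracle(x, y)
    m = np.zeros_like(x)                                 # momentum buffer
    for _ in range(T):                                   # single loop
        g_x = hypergrad_oracle(x, y)
        m = beta * m + (1.0 - beta) * g_x                # momentum update
        x -= eta_x * m / (np.linalg.norm(m) + 1e-12)     # normalized upper-level step
        y -= eta_y * lower_grad_oracle(x, y)             # simultaneous lower-level step
    return x, y

# toy example: lower level y*(x) = argmin_y 0.5*||y - A x||^2, upper level
# 0.5*||y*(x)||^2 + 0.05*||x||^2 (hypergradient evaluated at the current y)
A = np.array([[1.0, 0.5], [0.0, 2.0]])
lower_grad = lambda x, y: y - A @ x
hypergrad  = lambda x, y: A.T @ y + 0.1 * x
x_star, y_star = slip_sketch(hypergrad, lower_grad, np.ones(2), np.zeros(2))
```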


Federated Learning under Periodic Client Participation and Heterogeneous Data: A New Communication-Efficient Algorithm and Analysis

October 2024

·

4 Reads

In federated learning, it is common to assume that clients are always available to participate in training, which may not be feasible with user devices in practice. Recent works analyze federated learning under more realistic participation patterns, such as cyclic client availability or arbitrary participation. However, all such works either require strong assumptions (e.g., all clients participate almost surely within a bounded window), do not achieve linear speedup and reduced communication rounds, or are not applicable in the general non-convex setting. In this work, we focus on nonconvex optimization and consider participation patterns in which the chance of participation over a fixed window of rounds is equal among all clients, which includes cyclic client availability as a special case. Under this setting, we propose a new algorithm, named Amplified SCAFFOLD, and prove that it achieves linear speedup, reduced communication, and resilience to data heterogeneity simultaneously. In particular, for cyclic participation, our algorithm is proved to enjoy O(ε^-2) communication rounds to find an ε-stationary point in the non-convex stochastic setting. In contrast, the prior work under the same setting requires O(κ^2 ε^-4) communication rounds, where κ denotes the data heterogeneity. Therefore, our algorithm significantly reduces communication rounds due to better dependency in terms of ε and κ. Our analysis relies on a fine-grained treatment of the nested dependence between client participation and errors in the control variates, which results in tighter guarantees than previous work. We also provide experimental results with (1) synthetic data and (2) real-world data with a large number of clients (N = 250), demonstrating the effectiveness of our algorithm under periodic client participation.
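
For intuition about the control variates mentioned above, here is a sketch of one SCAFFOLD-style communication round restricted to the clients that participate; the amplification mechanism and periodic-participation bookkeeping of Amplified SCAFFOLD are not reproduced, and all names and constants are illustrative.

```python
import numpy as np

def scaffold_round(x, c, clients, c_locals, participating, K=5, eta_l=0.1, eta_g=1.0):
    """One communication round of a SCAFFOLD-style method with control variates,
    run only on the clients available this round. `clients[i]` is a stochastic
    gradient oracle for client i and `c_locals[i]` its control variate (updated
    in place). This is a simplified sketch, not the paper's Amplified SCAFFOLD."""
    delta_x, delta_c = np.zeros_like(x), np.zeros_like(x)
    for i in participating:
        grad_fn = clients[i]
        y = x.copy()
        for _ in range(K):                                     # K corrected local SGD steps
            y -= eta_l * (grad_fn(y) - c_locals[i] + c)
        c_new = c_locals[i] - c + (x - y) / (K * eta_l)        # updated control variate
        delta_x += (y - x)
        delta_c += (c_new - c_locals[i])
        c_locals[i] = c_new
    n_part, n_total = len(participating), len(clients)
    x = x + eta_g * delta_x / n_part                           # server model update
    c = c + delta_c / n_total                                  # server control variate update
    return x, c
```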


An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness

September 2024

·

2 Reads

This paper investigates a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level problem is strongly convex. These problems have significant applications in sequential data learning, such as text classification using recurrent neural networks. The unbounded smoothness is characterized by the smoothness constant of the upper-level function scaling linearly with the gradient norm, lacking a uniform upper bound. Existing state-of-the-art algorithms require Õ(1/ε^4) oracle calls of stochastic gradient or Hessian/Jacobian-vector product to find an ε-stationary point. However, it remains unclear whether we can further improve the convergence rate when the assumptions on the function at the population level also hold for each random realization almost surely (e.g., Lipschitzness of each realization of the stochastic gradient). To address this issue, we propose a new Accelerated Bilevel Optimization algorithm named AccBO. The algorithm updates the upper-level variable by normalized stochastic gradient descent with recursive momentum and the lower-level variable by the stochastic Nesterov accelerated gradient descent algorithm with averaging. We prove that our algorithm achieves an oracle complexity of Õ(1/ε^3) to find an ε-stationary point. Our proof relies on a novel lemma characterizing the dynamics of the stochastic Nesterov accelerated gradient descent algorithm under distribution drift with high probability for the lower-level variable, which is of independent interest and also plays a crucial role in analyzing the hypergradient estimation error over time. Experimental results on various tasks confirm that our proposed algorithm achieves the predicted theoretical acceleration and significantly outperforms baselines in bilevel optimization.
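
A rough sketch of the update structure described above follows; the oracle interfaces, step sizes, and momentum constants are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def accbo_sketch(hypergrad, lower_grad, sample_batch, x0, y0,
                 T=1000, eta_x=0.01, eta_y=0.05, alpha=0.1, nesterov_beta=0.9):
    """Sketch of the structure in the abstract: the upper-level variable uses
    normalized SGD with recursive (STORM-style) momentum, the lower-level
    variable uses stochastic Nesterov acceleration with iterate averaging.
    `hypergrad(x, y, batch)` and `lower_grad(x, y, batch)` are user-supplied
    stochastic oracles and `sample_batch()` draws a fresh mini-batch."""
    x, y = x0.copy(), y0.copy()
    y_avg, v = y.copy(), np.zeros_like(y)            # running average and velocity
    m = hypergrad(x, y_avg, sample_batch())          # initialize recursive momentum
    for t in range(1, T + 1):
        x_prev = x.copy()
        x -= eta_x * m / (np.linalg.norm(m) + 1e-12)          # normalized upper-level step
        # lower level: Nesterov-accelerated stochastic gradient with averaging
        batch_y = sample_batch()
        y_look = y + nesterov_beta * v
        v = nesterov_beta * v - eta_y * lower_grad(x, y_look, batch_y)
        y = y + v
        y_avg = (t * y_avg + y) / (t + 1)
        # recursive momentum: reuse the same batch at the new and old upper iterate
        batch_x = sample_batch()
        m = hypergrad(x, y_avg, batch_x) + (1 - alpha) * (m - hypergrad(x_prev, y_avg, batch_x))
    return x, y_avg
```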


Algorithmic Foundation of Federated Learning with Sequential Data

March 2024

·

1 Read

Proceedings of the AAAI Conference on Artificial Intelligence

The current analysis of federated optimization algorithms for training deep neural networks assumes that the data is non-sequential (e.g., images), which incurs a smooth loss objective. In contrast, edge devices generate large amounts of sequential data every day, where these sequences exhibit significant sequential correlation across time stamps (e.g., text messages). To learn from such sequential data, people typically use a class of neural networks that is inherently nonsmooth, with a potentially unbounded smoothness parameter. Examples include recurrent neural networks, long short-term memory networks, and transformers. It remains unclear how to design provably efficient algorithms for training these neural networks to learn from sequential data. My goal is to lay the algorithmic foundation of federated learning with sequential data, which contributes novel algorithms for learning from a range of real-world sequential data (e.g., natural language, electronic health records, transportation, time series, etc.) using state-of-the-art deep neural networks. In this talk, I will first motivate the problem by showing that the transformer, which is widely used for sequential data learning, has a loss landscape with unbounded smoothness. Then, I will introduce provably efficient federated deep learning algorithms in the presence of unbounded smoothness. In particular, I will introduce a few efficient algorithms for various settings of federated learning, including homogeneous data, heterogeneous data, and partial client participation. The main result is twofold. First, we show that the designed algorithms provably achieve small computational and communication complexities. Second, we establish fundamental hardness results in the unbounded smoothness setting. Ultimately, I will discuss the future challenges of extending our research framework from small-scale neural networks to large language models.


EPISODE: Episodic Gradient Clipping with Periodic Resampled Corrections for Federated Learning with Heterogeneous Data

February 2023

·

2 Reads

Gradient clipping is an important technique for deep neural networks with exploding gradients, such as recurrent neural networks. Recent studies have shown that the loss functions of these networks do not satisfy the conventional smoothness condition, but instead satisfy a relaxed smoothness condition, i.e., the Lipschitz constant of the gradient scales linearly in terms of the gradient norm. Due to this observation, several gradient clipping algorithms have been developed for nonconvex and relaxed-smooth functions. However, the existing algorithms only apply to the single-machine or multiple-machine setting with homogeneous data across machines. It remains unclear how to design provably efficient gradient clipping algorithms in the general Federated Learning (FL) setting with heterogeneous data and limited communication rounds. In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting. The key ingredients of the algorithm are two new techniques called episodic gradient clipping and periodic resampled corrections. At the beginning of each round, EPISODE resamples stochastic gradients from each client and obtains the global averaged gradient, which is used to (1) determine whether to apply gradient clipping for the entire round and (2) construct local gradient corrections for each client. Notably, our algorithm and analysis provide a unified framework for both homogeneous and heterogeneous data under any noise level of the stochastic gradient, and it achieves state-of-the-art complexity results. In particular, we prove that EPISODE can achieve linear speedup in the number of machines, and it requires significantly fewer communication rounds. Experiments on several heterogeneous datasets show the superior performance of EPISODE over several strong baselines in FL.
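
The round structure described above can be sketched as follows; the clipping threshold, step sizes, and oracle interfaces are illustrative assumptions rather than the exact rules used in EPISODE.

```python
import numpy as np

def episode_round(x, clients, K=10, eta=0.1, gamma=1.0):
    """One round of an EPISODE-style method (sketch). At the round start, each
    client's freshly resampled gradient G_i is averaged into G; the norm of G
    decides whether clipping is applied for the *entire* round, and G - G_i
    serves as client i's correction. `clients` is a list of stochastic gradient
    oracles; the exact clipping rule and constants in the paper differ."""
    G_locals = [grad_fn(x) for grad_fn in clients]       # resampled at round start
    G = np.mean(G_locals, axis=0)
    clip_round = np.linalg.norm(G) > gamma / eta         # episodic clipping decision
    updates = []
    for grad_fn, G_i in zip(clients, G_locals):
        y = x.copy()
        for _ in range(K):
            d = grad_fn(y) - G_i + G                     # periodic resampled correction
            if clip_round:
                d = d * (gamma / (eta * np.linalg.norm(d) + 1e-12))   # step length capped at gamma
            y -= eta * d
        updates.append(y - x)
    return x + np.mean(updates, axis=0)                  # server averages client updates
```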


F-Measure Optimization for Multi-class, Imbalanced Emotion Classification Tasks

September 2022

·

10 Reads

·

1 Citation

Lecture Notes in Computer Science

Recent NLP breakthroughs have significantly advanced the state of emotion classification (EC) over text data. However, current treatments guide learning by traditional performance metrics, such as classification error rate, which are not suitable for the highly-imbalanced EC problems; in fact, EC models are predominantly evaluated by variations of the F-measure, recognizing the data imbalance. This paper addresses the dissonance between the learning objective and the performance evaluation for EC with moderate to severe data imbalance. We propose a series of increasingly powerful algorithms for F-measure improvement. An ablation study demonstrates the superiority of learning an optimal class decision threshold. Increased performance is demonstrated when joint learning is carried out over both the representation and the class decision thresholds. Thorough empirical evaluation on benchmark EC datasets that span the spectrum of number of classes and class imbalance shows clear F-measure improvements over baseline models, with good improvements over pre-trained deep models and higher improvements over untrained deep architectures. Keywords: Emotion classification · Multi-class classification · Class imbalance · F-measure optimization · Transformer models · Deep learning
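
As a simple baseline version of the threshold-learning idea, the sketch below tunes per-class decision offsets to maximize macro-F1 on held-out scores; it does not reproduce the paper's joint learning of representations and thresholds, and all names are illustrative.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1 over all classes (a class with no TP/FP/FN contributes 0)."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))

def tune_class_thresholds(scores, y_true, grid=np.linspace(-0.5, 0.5, 21)):
    """Coordinate-wise search for per-class decision offsets that maximize
    macro-F1 on validation scores; prediction is argmax over (score - offset)."""
    n_classes = scores.shape[1]
    offsets = np.zeros(n_classes)
    for c in range(n_classes):
        best, best_f1 = 0.0, -1.0
        for t in grid:
            trial = offsets.copy(); trial[c] = t
            y_pred = np.argmax(scores - trial, axis=1)
            f1 = macro_f1(y_true, y_pred, n_classes)
            if f1 > best_f1:
                best, best_f1 = t, f1
        offsets[c] = best
    return offsets
```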


Robustness to Unbounded Smoothness of Generalized SignSGD

August 2022

·

7 Reads

Michael Crawshaw

·

Mingrui Liu

·


[...]

·

Zhenxun Zhuang

Traditional analyses in non-convex optimization typically rely on the smoothness assumption, namely requiring the gradients to be Lipschitz. However, recent evidence shows that this smoothness condition does not capture the properties of some deep learning objective functions, including the ones involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this relaxed assumption, it has been theoretically and empirically shown that the gradient-clipped SGD has an advantage over the vanilla one. In this paper, we show that clipping is not indispensable for Adam-type algorithms in tackling such scenarios: we theoretically prove that a generalized SignSGD algorithm can obtain similar convergence rates as SGD with clipping but does not need explicit clipping at all. This family of algorithms on one end recovers SignSGD and on the other end closely resembles the popular Adam algorithm. Our analysis underlines the critical role that momentum plays in analyzing SignSGD-type and Adam-type algorithms: it not only reduces the effects of noise, thus removing the need for large mini-batches in previous analyses of SignSGD-type algorithms, but it also substantially reduces the effects of unbounded smoothness and gradient norms. We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
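
A minimal sketch of one step of such a momentum-based sign/Adam-style update is shown below; the exact interpolation between SignSGD and Adam used in the paper is not reproduced, and the hyperparameters are illustrative.

```python
import numpy as np

def momentum_sign_adam_step(x, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                            eps=1e-8, adam_like=True):
    """One step of a momentum-based sign/Adam-style update, sketching the family
    discussed in the abstract: with adam_like=False the step reduces to SignSGD
    with momentum, with adam_like=True it resembles Adam. Neither case uses
    explicit gradient clipping."""
    m = beta1 * m + (1 - beta1) * grad               # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2          # second-moment estimate
    if adam_like:
        x = x - lr * m / (np.sqrt(v) + eps)          # coordinate-wise normalized step
    else:
        x = x - lr * np.sign(m)                      # sign of the momentum
    return x, m, v
```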


Fast Composite Optimization and Statistical Recovery in Federated Learning

July 2022

·

6 Reads

As a prevalent distributed learning paradigm, Federated Learning (FL) trains a global model on a massive number of devices with infrequent communication. This paper investigates a class of composite optimization and statistical recovery problems in the FL setting, whose loss function consists of a data-dependent smooth loss and a non-smooth regularizer. Examples include sparse linear regression using Lasso, low-rank matrix recovery using nuclear norm regularization, etc. In the existing literature, federated composite optimization algorithms are designed only from an optimization perspective without any statistical guarantees. In addition, they do not consider the (restricted) strong convexity commonly used in statistical recovery problems. We advance the frontiers of this problem from both optimization and statistical perspectives. On the optimization front, we propose a new algorithm named Fast Federated Dual Averaging for strongly convex and smooth loss and establish state-of-the-art iteration and communication complexity in the composite setting. In particular, we prove that it enjoys a fast rate, linear speedup, and reduced communication rounds. On the statistical front, for restricted strongly convex and smooth loss, we design another algorithm, namely Multi-stage Federated Dual Averaging, and prove a high probability complexity bound with linear speedup up to optimal statistical precision. Experiments on both synthetic and real data demonstrate that our methods perform better than other baselines. To the best of our knowledge, this is the first work providing fast optimization algorithms and statistical recovery guarantees for composite problems in FL.
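
To make the composite setting concrete, here is a sketch of a federated dual-averaging scheme for an l1-regularized objective, where clients accumulate gradients in a dual state, the server averages dual states, and the primal iterate is recovered by soft-thresholding; the fast and multi-stage variants proposed in the paper are not reproduced, and all names and constants are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def federated_dual_averaging(clients, d, R=50, K=5, eta=0.01, lam=0.1):
    """Sketch of a federated dual-averaging round structure for an l1-regularized
    composite problem (e.g., sparse linear regression). Each client runs K local
    dual-averaging steps on its dual state (accumulated gradients); the server
    averages dual states; the primal iterate is the composite minimizer
    argmin_w <z_bar, w> + lam*||w||_1 + ||w||^2/(2*eta), i.e. soft-thresholding
    of the average gradient. `clients` is a list of stochastic gradient oracles."""
    z, t = np.zeros(d), 0                    # server dual state and step counter
    for _ in range(R):
        z_locals = []
        for grad_fn in clients:
            z_i, t_i = z.copy(), t
            for _ in range(K):
                w_i = -eta * soft_threshold(z_i / max(t_i, 1), lam)   # primal from dual state
                z_i += grad_fn(w_i)                                   # accumulate stochastic gradient
                t_i += 1
            z_locals.append(z_i)
        z, t = np.mean(z_locals, axis=0), t + K                       # server averages dual states
    return -eta * soft_threshold(z / max(t, 1), lam)
```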


Will Bilevel Optimizers Benefit from Loops

May 2022

·

7 Reads

Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two currently popular bilevel optimizers, AID-BiO and ITD-BiO, naturally involve solving one or two sub-problems, and consequently, whether we solve these problems with loops (that take many iterations) or without loops (that take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of those implementation choices, and the complexity bounds available are not refined enough to enable rigorous comparison among different implementations. In this paper, we first establish a unified convergence analysis for both AID-BiO and ITD-BiO that is applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity for all implementations, which enables an explicit comparison among them. Our result indicates that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it causes higher complexity for each update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.
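
To make the two loops concrete, here is a sketch of one outer step of an AID-style method with an N-step inner loop and an M-step Neumann-series loop for the Hessian-inverse-vector product; the oracle names and constants are placeholders, not the implementations analyzed in the paper.

```python
import numpy as np

def aid_bio_step(x, y, grad_y_g, hess_yy_g, jac_xy_g, grad_x_f, grad_y_f,
                 N=10, M=10, alpha=0.1, eta=0.01):
    """One outer step of an AID-style bilevel method, showing where the two loops
    discussed in the abstract appear: an N-step loop approximating the lower-level
    minimizer and an M-step Neumann-series loop approximating the
    Hessian-inverse-vector product. Oracles (all user-supplied placeholders):
    grad_y_g(x, y), grad_x_f(x, y), grad_y_f(x, y) return gradients;
    hess_yy_g(x, y, r) returns (d^2 g / dy^2) r; jac_xy_g(x, y, v) returns
    (d^2 g / dx dy) v."""
    for _ in range(N):                               # loop 1: inner-problem gradient descent
        y = y - alpha * grad_y_g(x, y)
    v = np.zeros_like(y)
    r = grad_y_f(x, y)
    for _ in range(M):                               # loop 2: Neumann series for H^{-1} r
        v = v + r
        r = r - alpha * hess_yy_g(x, y, r)           # r <- (I - alpha*H) r
    v = alpha * v                                    # v ~ alpha * sum_k (I - alpha*H)^k r ~ H^{-1} grad_y_f
    hypergrad = grad_x_f(x, y) - jac_xy_g(x, y, v)   # implicit-function hypergradient
    return x - eta * hypergrad, y
```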


Citations (8)


... There is also some work on the generalization analysis of pairwise or triplet-wise loss functions in a similar i.i.d. setting as we consider [Lei et al., 2020, Yang et al., 2021, Lei et al., 2021]. However, such works do not control the dependence on the number of samples in each input tuple. ...

Reference:

Generalization Analysis for Deep Contrastive Representation Learning
Generalization Guarantee of SGD for Pairwise Learning
  • Citing Conference Paper
  • Full-text available
  • December 2021

... Many works in the literature address the problem of imbalanced classification using optimization with different objective functions; accuracy, F1-score, and G-mean are utilized. However, extensive empirical testing on imbalanced datasets reveals significant F1-score improvements over baseline models [4,25]. For that, the F1-score is selected as the fitness value (F) of the detector d, which is defined as follows: ...

F-Measure Optimization for Multi-class, Imbalanced Emotion Classification Tasks
  • Citing Chapter
  • September 2022

Lecture Notes in Computer Science

... Balancing the issues of gradient vanishing and gradient exploding is a crucial challenge in deep learning models. Currently, the main solutions include using ReLU activation function (13), batch normalization (14) and gradient clipping (15). However, these methods have their own advantages and disadvantages. ...

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
  • Citing Preprint
  • May 2022

... calculated in validation dataset). The initial learning rate value is varied between 0.001 and 0.000001, the batch size between 2 and 32, the optimizer for training chosen from a set of available optimizers (i.e., SGD, Adam, AdamW, NAdam, RAdam, and RMSProp [35]), and the architecture chosen from various configurations of YOLOv8 (i.e., YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv8l-seg, and YOLOv8x-seg) [29]. The input frame size is fixed at 640 × 640 px. ...

Understanding AdamW through Proximal Methods and Scale-Freeness
  • Citing Preprint
  • January 2022

... In AD-SGD, workers do not wait for all others and only communicate in a decentralized manner, thus enabling wait-free computation and communication. The asynchronous decentralized algorithm has been widely studied due to its superior performance (Jiang et al. 2021;Cui et al. 2021;Nadiradze et al. 2021;Lan and Zhou 2021;Xu, Zhang, and Wang 2021). ...

Asynchronous Decentralized Distributed Training of Acoustic Models
  • Citing Article
  • October 2021

IEEE/ACM Transactions on Audio Speech and Language Processing

... Many methods have recently been proposed for the escape from saddle points. These methods are either based on adding noise to gradients [7,13,14,17] or leveraging high-order information, such as Hessian, Hessian-vector product or relaxed Hessian information [1,3,4,5,6,19,20,22,23,25]. To the best of our knowledge, little work has been done in terms of avoiding saddle points with only first-order information. ...

On Noisy Negative Curvature Descent: Competing with Gradient Descent for Faster Non-convex Optimization
  • Citing Article
  • September 2017