Yunming Liao’s research while affiliated with University of Science and Technology of China and other places


Publications (28)


Resource-Efficient Federated Fine-Tuning Large Language Models for Heterogeneous Data
  • Preprint

March 2025 · 1 Read

Jun Liu · Yunming Liao · [...]

Fine-tuning large language models (LLMs) via federated learning, i.e., FedLLM, has been proposed to adapt LLMs to various downstream applications in a privacy-preserving way. To reduce fine-tuning costs on resource-constrained devices, FedLoRA fine-tunes only a small subset of model parameters by integrating low-rank adaptation (LoRA) into FedLLM. However, beyond resource constraints, another critical challenge, data heterogeneity, severely hinders the deployment of FedLoRA in practical applications. Herein, inspired by the group-based federated learning paradigm, we propose a hierarchical FedLoRA framework, termed HierFedLoRA, to address these challenges. Specifically, HierFedLoRA partitions all devices into multiple near-IID groups and adjusts the intra-group aggregation frequency for each group to eliminate the negative effects of non-IID data. Meanwhile, to reduce computation and communication costs, HierFedLoRA dynamically assigns a suitable fine-tuning depth (i.e., the number of consecutive layers fine-tuned from the output) to each group. HierFedLoRA jointly optimizes aggregation frequency and depth based on their coupled relationship to further enhance the performance of FedLoRA. Extensive experiments are conducted on a physical platform with 80 commercial devices. The results show that HierFedLoRA improves final model accuracy by 1.6% to 4.2% and speeds up the fine-tuning process by at least 2.1×, compared to strong baselines.
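
As a rough illustration of the grouping idea described in the abstract, the sketch below partitions devices into near-IID groups by their label distributions. The function name `partition_near_iid` and the greedy distance heuristic are assumptions for illustration only, not the paper's actual algorithm.

```python
import numpy as np

def partition_near_iid(label_dists, num_groups):
    """Greedily assign devices to groups so that each group's aggregate
    label distribution stays close to the global (near-IID) distribution.
    label_dists: (num_devices, num_classes) array, each row sums to 1."""
    label_dists = np.asarray(label_dists, dtype=float)
    global_dist = label_dists.mean(axis=0)
    groups = [[] for _ in range(num_groups)]
    group_sums = np.zeros((num_groups, label_dists.shape[1]))
    # Process the most skewed devices first so they can be balanced out later.
    order = np.argsort(-np.abs(label_dists - global_dist).sum(axis=1))
    for dev in order:
        best, best_gap = None, None
        for g in range(num_groups):
            cand = group_sums[g] + label_dists[dev]
            gap = np.abs(cand / cand.sum() - global_dist).sum()
            # Prefer the smaller group on ties to keep group sizes balanced.
            if best is None or (gap, len(groups[g])) < (best_gap, len(groups[best])):
                best, best_gap = g, gap
        groups[best].append(int(dev))
        group_sums[best] += label_dists[dev]
    return groups
```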


A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models

March 2025

Zuan Xie · [...] · Zhiwei Yao

Recent advancements in large language models (LLMs) have catalyzed a substantial surge in demand for LLM services. While traditional cloud-based LLM services satisfy high-accuracy requirements, they fall short in meeting critical demands for low delay and enhanced privacy. To address these limitations, we propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding. HAT partitions the LLM into three submodels, and the input and output submodels, stacked with a lightweight adapter network, are deployed as a small language model (SLM) on each end device. Meanwhile, the middle submodel, encompassing the majority of the LLM's decoder layers, is hosted in the cloud to perform speculative decoding with on-device SLMs. During inference, HAT exchanges hidden states (rather than raw tokens) of input or draft tokens between devices and the cloud, thereby incurring substantial communication delays. Besides, processing the hidden states of long prompts exacerbates computation delays in the cloud, further compromising inference efficiency. To improve efficiency, we introduce a prompt chunking mechanism that segments long prompts into shorter chunks, enabling parallel transmission and processing. Furthermore, HAT dynamically determines optimal chunk sizes for devices handling long prompts, thereby improving overall inference speed. Extensive experiments are conducted on a physical testbed comprising 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs. Experimental results demonstrate that HAT achieves promising performance improvements, reducing time-to-first-token (TTFT) by 41% to 54% and time-between-tokens (TBT) by 41% to 77% compared to the baselines.
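
The prompt chunking mechanism can be sketched roughly as follows; `send_hidden_states` and `cloud_process` are hypothetical stand-ins for the real transport and cloud-side prefill calls, and the overlap logic is a simplified reading of the abstract rather than HAT's implementation.

```python
from typing import List

def chunk_prompt(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Split a long prompt into fixed-size chunks so that transmission of
    chunk i can overlap with cloud-side processing of chunk i-1."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def pipelined_prefill(token_ids, chunk_size, send_hidden_states, cloud_process):
    """Overlap device-to-cloud transfer with cloud computation, chunk by chunk.
    Both callbacks are hypothetical hooks, not HAT's actual API."""
    in_flight = None
    for chunk in chunk_prompt(token_ids, chunk_size):
        handle = send_hidden_states(chunk)       # asynchronous upload of this chunk
        if in_flight is not None:
            cloud_process(in_flight)             # process the previously uploaded chunk
        in_flight = handle
    if in_flight is not None:
        cloud_process(in_flight)                 # flush the last chunk
```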



Fig. 1: Layer redundancy of different models.
Fig. 3: Illustration of the pruning process of COMP.

Algorithm 1: Pruning process of COMP.
Input: model M, calibration data batch X, pruning ratio r, number of removed layers n. Output: pruned model M′.
// Layer-grained pruning
repeat
    Calculate layer importance by Eq. (3).
    Remove the least important layer.
until n layers are removed.
// Neuron-grained pruning
for each remaining layer l do
    Calculate the pruning ratio r_l by Eq. (10).
    Set the variance threshold v_T to 0.
    repeat
        for each dense layer k in the l-th layer do
            Calculate neuron importance by Eq. (9).
            Set the number of pruned neurons c to 0.
            repeat
                Prune the c least important neurons.
                Tune the mask m̂_c by Eq. (5).
                Increase c.
            until Var(m̂_c) ≥ v_T
            Update the number of pruned parameters.
        end for
        Increase v_T.
    until the pruning ratio of the l-th layer reaches r_l.
end for
Lightweight and Post-Training Structured Pruning for On-Device Large Language Models
  • Preprint
  • File available

January 2025 · 8 Reads

Considering the hardware-friendly characteristics and broad applicability, structured pruning has emerged as an efficient solution to reduce the resource demands of large language models (LLMs) on resource-constrained devices. Traditional structured pruning methods often need fine-tuning to recover performance loss, which incurs high memory overhead and substantial data requirements, rendering them unsuitable for on-device applications. Additionally, post-training structured pruning techniques typically necessitate specific activation functions or architectural modifications, thereby limiting their scope of applications. Herein, we introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP initially prunes selected model layers based on their importance at a coarse granularity, followed by fine-grained neuron pruning within the dense layers of each remaining model layer. To more accurately evaluate neuron importance, COMP introduces a new matrix condition-based metric. Subsequently, COMP utilizes mask tuning to recover accuracy without the need for fine-tuning, significantly reducing memory consumption. Experimental results demonstrate that COMP improves performance by 6.13% on the LLaMA-2-7B model with a 20% pruning ratio compared to LLM-Pruner, while simultaneously reducing memory overhead by 80%.
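
A minimal sketch of the coarse, layer-grained stage is given below. It scores a layer by how little its removal perturbs the calibration output, which is a generic proxy and not COMP's matrix condition-based metric; `forward_fn` is a hypothetical hook into the model.

```python
import torch

@torch.no_grad()
def prune_least_important_layers(model_layers, calib_batch, forward_fn, num_remove):
    """Coarse-grained stage of a COMP-style pruner: repeatedly drop the layer
    whose removal changes the calibration output the least.
    forward_fn(layers, batch) -> output tensor; a stand-in for the real model."""
    layers = list(model_layers)
    for _ in range(num_remove):
        baseline = forward_fn(layers, calib_batch)
        best_idx, best_delta = None, None
        for i in range(len(layers)):
            candidate = layers[:i] + layers[i + 1:]
            delta = torch.norm(forward_fn(candidate, calib_batch) - baseline)
            if best_delta is None or delta < best_delta:
                best_idx, best_delta = i, delta
        layers.pop(best_idx)    # remove the least important layer
    return layers
```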


Efficient Deployment of Large Language Models on Resource-constrained Devices

January 2025 · 4 Reads

Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experimental results conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4×-6.9× and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
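
The per-device configuration choice can be illustrated with a textbook UCB1 bandit over candidate (pruning ratio, LoRA rank) pairs. This is a generic sketch of the MAB idea, not FedSpine's specific algorithm, and the reward definition is an assumption.

```python
import math

class UCBConfigSelector:
    """Minimal UCB1 bandit over candidate (pruning ratio, LoRA rank) pairs.
    The reward is a stand-in, e.g. accuracy gain per unit of training time."""

    def __init__(self, configs):
        self.configs = configs            # e.g. [(0.2, 8), (0.4, 8), (0.4, 16), ...]
        self.counts = [0] * len(configs)
        self.values = [0.0] * len(configs)
        self.total = 0

    def select(self):
        self.total += 1
        for i, c in enumerate(self.counts):
            if c == 0:                    # play every arm once before using UCB
                return i
        ucb = [
            self.values[i] + math.sqrt(2 * math.log(self.total) / self.counts[i])
            for i in range(len(self.configs))
        ]
        return max(range(len(self.configs)), key=lambda i: ucb[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n   # running mean of rewards
```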


PairingFL: Efficient Federated Learning With Model Splitting and Client Pairing

January 2025

Federated learning (FL) has recently gained tremendous attention in edge computing and the Internet of Things, due to its capability of enabling clients to perform model training at the network edge or on end devices (i.e., clients). However, these end devices are usually resource-constrained and unable to train large-scale models. To accelerate the training of large-scale models on these devices, we incorporate Split Learning (SL) into FL and propose a novel FL framework, termed PairingFL. Specifically, we split a full model into a bottom model and a top model, and arrange participating clients into pairs, each of which collaboratively trains the two partial models as a single client does in typical FL. Driven by the advantages of SL and FL, PairingFL is able to relax the computation burden on clients and protect model privacy. However, given the system and statistical heterogeneity in edge networks, it is challenging to pair the clients by carefully designing the strategies of client partitioning and matching for efficient model training. To this end, we first theoretically analyze the convergence property of PairingFL and obtain a convergence upper bound. Guided by this bound, we then design a greedy and efficient algorithm that makes the joint decision of client partitioning and matching, so as to balance the trade-off between convergence rate and model accuracy. The performance of PairingFL is evaluated through extensive simulation experiments. The experimental results demonstrate that PairingFL can speed up the training process by 4.6× compared to baselines when achieving the corresponding convergence accuracy.
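
A simplified pairing heuristic is sketched below: it matches the fastest remaining client with the slowest one so that each pair's combined workload is balanced. PairingFL's actual algorithm jointly decides partitioning and matching from its convergence bound; this sketch only conveys the matching intuition, and `pair_clients` is a hypothetical helper.

```python
def pair_clients(client_speeds):
    """Greedy pairing heuristic: match the fastest remaining client with the
    slowest one, so each pair's bottom+top training time stays balanced.
    client_speeds: dict mapping client id -> estimated compute speed."""
    ordered = sorted(client_speeds, key=client_speeds.get)
    pairs = []
    while len(ordered) >= 2:
        slow = ordered.pop(0)        # slowest remaining client
        fast = ordered.pop(-1)       # fastest remaining client
        pairs.append((slow, fast))   # e.g. the slow client trains the bottom model
    leftover = ordered[0] if ordered else None
    return pairs, leftover
```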


Enhancing Federated Learning Through Layer-Wise Aggregation Over Non-IID Data

January 2025 · 3 Reads

IEEE Transactions on Services Computing

Nowadays, federated learning (FL) has been widely adopted to train deep neural networks (DNNs) among massive devices without revealing their local data in edge computing (EC). To relieve the communication bottleneck of the central server in FL, hierarchical federated learning (HFL), which leverages edge servers as intermediaries to perform model aggregation among nearby devices, has emerged. Nevertheless, existing HFL systems may not perform training effectively due to bandwidth constraints and non-IID issues on devices. To conquer these challenges, we introduce an HFL system with device-edge assignment and layer selection, namely Heal. Specifically, Heal organizes all the devices into a hierarchical structure (i.e., device-edge assignment) and enables each device to forward only a sub-model with several valuable layers for aggregation (i.e., layer selection). This processing procedure is called layer-wise aggregation. To further save communication resources and improve convergence performance, we then design an iteration-based algorithm to optimize our layer-wise aggregation strategy by considering the data distribution as well as resource constraints among devices. Extensive experiments on both a physical platform and a simulated environment show that Heal accelerates DNN training by about 1.4-12.5×, and reduces network traffic consumption by about 31.9-64.1%, compared with existing HFL systems.
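
The layer-wise aggregation step can be sketched as follows: each device uploads only its selected layers, and the server averages every layer over the devices that actually sent it. This is a minimal illustration, not Heal's optimization of which layers each device should select.

```python
import numpy as np

def layerwise_aggregate(global_weights, device_updates):
    """Layer-wise aggregation: each device uploads only a subset of layers,
    and every layer is averaged over the devices that actually sent it.
    device_updates: list of dicts {layer_name: weight array}."""
    aggregated = dict(global_weights)
    for name in global_weights:
        received = [upd[name] for upd in device_updates if name in upd]
        if received:                               # keep the global weights if no device sent this layer
            aggregated[name] = np.mean(received, axis=0)
    return aggregated
```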


Adaptive Parameter-Efficient Federated Fine-Tuning on Heterogeneous Devices

December 2024 · 3 Reads

Federated fine-tuning (FedFT) has been proposed to fine-tune pre-trained language models in a distributed manner. However, there are two critical challenges for efficient FedFT in practical applications, i.e., resource constraints and system heterogeneity. Existing works rely on parameter-efficient fine-tuning methods, e.g., low-rank adaptation (LoRA), but with major limitations. Herein, based on the inherent characteristics of FedFT, we observe that LoRA layers with higher ranks added close to the output help to save resource consumption while achieving comparable fine-tuning performance. We then propose a novel LoRA-based FedFT framework, termed LEGEND, which must determine the number of LoRA layers (termed LoRA depth) and the rank of each LoRA layer (termed rank distribution). We analyze the coupled relationship between LoRA depth and rank distribution, and design an efficient LoRA configuration algorithm for heterogeneous devices, thereby promoting fine-tuning efficiency. Extensive experiments are conducted on a physical platform with 80 commercial devices. The results show that LEGEND can achieve a speedup of 1.5-2.8× and save communication costs by about 42.3% when achieving the target accuracy, compared to the advanced solutions.
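
The observation that higher-rank LoRA layers near the output are most useful can be turned into a simple configuration sketch. The linear rank schedule below is an assumption for illustration, not LEGEND's allocation algorithm.

```python
def lora_config(num_layers, depth, max_rank):
    """Build a per-layer LoRA rank assignment: only the last `depth` layers get
    LoRA adapters, with higher ranks closer to the output (a simplified reading
    of the observation in the abstract, not LEGEND's actual allocation rule)."""
    ranks = {}
    for layer in range(num_layers):
        if layer < num_layers - depth:
            ranks[layer] = 0                       # no adapter on early layers
        else:
            # linearly grow the rank from the first adapted layer to the output
            pos = layer - (num_layers - depth)     # 0 .. depth-1
            ranks[layer] = max(1, round(max_rank * (pos + 1) / depth))
    return ranks

# Example: a 12-layer model with depth 4 and max rank 16 yields ranks 4, 8, 12, 16
# on the last four layers and 0 (no LoRA) elsewhere.
```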


Collaborative Inference for Large Models with Task Offloading and Early Exiting

December 2024 · 6 Reads

In 5G smart cities, edge computing is employed to provide nearby computing services for end devices, and large-scale models (e.g., GPT and LLaMA) can be deployed at the network edge to boost service quality. However, due to the constraints of memory size and computing capacity, it is difficult to run these large-scale models on a single edge node. To meet the resource constraints, a large-scale model can be partitioned into multiple sub-models and deployed across multiple edge nodes. Tasks are then offloaded to the edge nodes for collaborative inference. Additionally, we incorporate the early exit mechanism to further accelerate inference. However, system heterogeneity and the dynamic environment significantly affect inference efficiency. To address these challenges, we theoretically analyze the coupled relationship between the task offloading strategy and confidence thresholds, and develop a distributed algorithm, termed DTO-EE, based on this coupled relationship and convex optimization. DTO-EE enables each edge node to jointly optimize its offloading strategy and confidence threshold, so as to achieve a promising trade-off between response delay and inference accuracy. The experimental results show that DTO-EE can reduce the average response delay by 21%-41% and improve inference accuracy by 1%-4%, compared to the baselines.
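
The early-exit side of the framework can be sketched as a confidence-thresholded pass over sub-model blocks; `blocks` and `exit_heads` are hypothetical module lists, and `threshold` plays the role of the confidence threshold that DTO-EE optimizes together with the offloading strategy.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def infer_with_early_exit(blocks, exit_heads, x, threshold):
    """Run sub-model blocks in sequence; after each block, an exit head
    predicts class probabilities, and inference stops once the top-1
    confidence exceeds `threshold` (higher threshold -> higher accuracy,
    longer delay). Assumes a single input sample and non-empty `blocks`."""
    for i, (block, head) in enumerate(zip(blocks, exit_heads)):
        x = block(x)
        probs = F.softmax(head(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:
            return pred.item(), i          # exit early after block i
    return pred.item(), len(blocks) - 1    # fall through to the final exit
```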



Citations (14)


... For AGNews, 100 devices are selected per round [66]. The network bandwidth for each device fluctuates randomly between 1 Mbps and 100 Mbps, a typical setting for end devices in prior literature [38,39]. DropPEFT improves time-to-accuracy performance. ...

Reference:

Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout
ParallelSFL: A Novel Split Federated Learning Framework Tackling Heterogeneity Issues
  • Citing Conference Paper
  • December 2024

... This issue is more pronounced compared to CL, where the model has access to the full dataset at once and can optimize without needing to aggregate partial updates. SFL continues to face robustness limitations, particularly in environments with significant data variability across nodes [43,44,45]. ...

MergeSFL: Split Federated Learning with Feature Merging and Batch Size Regulation
  • Citing Conference Paper
  • May 2024

... Edge nodes typically possess limited and heterogeneous resources [17]- [20]. For instance, computing and bandwidth capacities may vary by more than tenfold among different edge nodes [21]- [23]. Given the same computing load across different edge nodes, the system heterogeneity will lead to long response delay on weak nodes and resource waste on strong nodes, significantly impacting the response efficiency. ...

Asynchronous Decentralized Federated Learning for Heterogeneous Devices
  • Citing Article
  • October 2024

IEEE/ACM Transactions on Networking

... HFMDS [15] learned essential class-relevant features of real samples to generate an auxiliary synthetic dataset, which was shared among clients for local training, helping to alleviate data heterogeneity. Additionally, Aorta [16] utilized the mixup data augmentation method in clients to balance class distributions and assigned aggregation weights based on local model quality, ensuring better models had greater influence during global aggregation. Despite these advancements, these studies primarily focused on improving local training and global aggregation algorithms, often overlooking the influence of client selection on FL convergence. ...

Overcoming Noisy Labels and Non-IID Data in Edge Federated Learning
  • Citing Article
  • December 2024

IEEE Transactions on Mobile Computing

... By leveraging the predictable nature of data correlations, substantial benefits can be achieved, including streamlined computations and efficient pipeline execution [28]. To highlight the importance of data correlation, we focus on both the temporal and spatial localities of intermediate data using the ResNet101 model [29] on the widely-used UCF101 video dataset [30]. Additionally, building on experiences in Section IV-B with the ImageNet-100 dataset [31], we demonstrate the versatility of our observations across various scenarios. ...

Federated Learning With Experience-Driven Model Migration in Heterogeneous Edge Networks
  • Citing Article
  • August 2024

IEEE/ACM Transactions on Networking

... 2) Dynamic Environment. The task arrival rate changes over time and space [24]- [27]. For example, the cameras deployed at a crowded train station will generate more tasks than those at an empty campus. ...

Decentralized Federated Learning With Adaptive Configuration for Heterogeneous Participants
  • Citing Article
  • January 2023

IEEE Transactions on Mobile Computing

... To effectively handle the data heterogeneity, the aggregate local data distribution of each group should be close to IID. Herein, the label distribution, a vector Γ = {γ_c ∈ [0, 1], c ∈ [1, C]} with Σ_{c=1}^{C} γ_c = 1 that parameterizes a categorical distribution of class labels over C classes [40,41], is utilized to guide the device grouping. ...

YOGA: Adaptive Layer-Wise Model Aggregation for Decentralized Federated Learning
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Networking

... FL aims to minimize the average of the loss functions over the distributed and scattered data samples and to find a corresponding set of model parameters. Thus, model training can be formally described as optimizing the following objective function [17], as in Eq. (2): ...

Adaptive Configuration for Heterogeneous Participants in Decentralized Federated Learning
  • Citing Conference Paper
  • May 2023
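
The objective referenced in the excerpt above (its Eq. (2) is not reproduced on this page) is conventionally the standard federated averaging objective; the form below is given for reference only and may differ in notation from the cited paper's Eq. (2).

```latex
% Standard FL objective over K clients, where client k holds n_k of the n total
% samples (dataset D_k) and \ell is the per-sample loss:
\min_{w} \; F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w),
\qquad
F_k(w) = \frac{1}{n_k} \sum_{\xi \in \mathcal{D}_k} \ell(w; \xi)
```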

... This enables designers to locally generate sketches with fused styles, aiding the creative process. Regarding Challenge 2, Federated Learning (FL) [39], [66] meets this requirement well, as the FL framework only transfers model weights, preserving data privacy and reducing communication load [35]. ...

Accelerating Federated Learning With Data and Model Parallelism in Edge Computing
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Networking