Dongsheng Li

  • Professor (Full) at National lab for parallel and distributed processing

About

296
Publications
25,605
Reads
4,061
Citations
Current institution
National lab for parallel and distributed processing
Current position
  • Professor (Full)


Publications (296)
Article
LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short and concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is harder. Some researchers use external knowledge to detect hallucinat...
Preprint
Full-text available
Graph Edit Distance (GED) is an important similarity measure in graph retrieval, which quantifies the minimum cost of transforming one graph into another through edit operations, and offers flexibility by allowing customizable operation costs. Recent learning-based approaches approximate GEDs with the distances between representations in vector spa...
Article
Transformer-based models like large language models (LLMs) have attracted significant attention in recent years due to their superior performance. A long sequence of input tokens is essential for industrial LLMs to provide better user services. However, memory consumption increases quadratically with sequence length, posing challeng...
Article
In current federated learning frameworks, a central server randomly selects a small number of clients to train local models at the beginning of each global iteration. Since clients' local data are non-independent and identically distributed (non-IID), some local models are not consistent with the global model. Existing studies employ model cleaning methods...
Article
Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various applications. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present...
Article
Event extraction (EE) is a complex natural language processing (NLP) task that aims at identifying and classifying triggers and arguments in raw text. The polysemy of triggers and arguments stands out as one of the key challenges affecting the precise extraction of events. Existing approaches commonly consider the semantic distribution of triggers...
Article
Pipeline parallelism is a crucial technique for large-scale model training, enabling parameter splitting and performance enhancement. However, creating effective pipeline schedules often requires significant manual effort and coding skills, leading to practical inconveniences and complex debugging. Major frameworks such as DeepSpeed and ColossalAI...
Article
Full-text available
Antigen peptides that are presented by a major histocompatibility complex (MHC) and recognized by a T cell receptor (TCR) have an essential role in immunotherapy. Although substantial progress has been made in predicting MHC presentation, accurately predicting the binding interactions between antigen peptides, MHCs and TCRs remains a major computat...
Article
Full-text available
Low-precision training has emerged as a practical approach, reducing the time, memory, and energy costs of deep neural network (DNN) training. Typically, the use of lower precision introduces quantization errors that need to be minimized to maintain model performance, often neglecting to consider the potential benefits of reducing training pr...
Preprint
Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the...
Preprint
Full-text available
Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as si...
Article
Full-text available
Large-scale Language Models (LLMs) have achieved significant breakthroughs in Natural Language Processing (NLP), driven by the pre-training and fine-tuning paradigm. While this approach allows models to specialize in specific tasks with reduced training costs, the substantial memory requirements during fine-tuning present a barrier to broader deplo...
Article
Full-text available
Asynchronous pipeline model parallelism with a “1F1B” (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the “1F1B” schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously...
Article
Large-scale deep learning models are trained in a distributed manner due to memory and computing resource limitations. Few existing strategy generation approaches take optimal memory minimization as the objective. To fill this gap, we propose a novel algorithm that generates optimal parallelism strategies with the constraint of minimal memory redundancy....
Article
This paper presents URSAL, an HDD-only block storage system that achieves ultra-efficiency, reliability, scalability and availability at low cost. Compared to existing block stores such as URSA, Ceph, and Sheepdog, URSAL has the following distinctions. First, since parallelism is harmful to the random I/O performance on HDDs, we restrict URSAL stor...
Article
To accommodate the increasingly large-scale models within limited-capacity GPU memory, various coarse-grained techniques, such as recomputation and swapping, have been proposed to optimize memory usage. However, these methods have encountered limitations, either in terms of inefficient memory reduction or diminished training performance. In respons...
Preprint
The size of deep learning models has been increasing to enhance model quality. The linear increase in training computation budget with model size means that training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) has drawn significant attention as it can scale models to extra-large sizes with a s...
Preprint
Full-text available
LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short and concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is more challenging. Some researchers use external knowledge to de...
Preprint
In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validit...
Article
The transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. With the increasing parameter number, hybrid parallel training becomes imperative to scale training. The primary bottleneck in s...
Preprint
Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), greatly reducing memory usage but re...
Article
Full-text available
Graph index as an effective data structure is widely applied in subgraph retrieval and matching. It records and compares the frequencies of a set of specific features to detect subgraph containment on the fly, which is the foundation of the filtering techniques for subgraph retrieval and matching. However, due to the NP-hardness of the subgraph cou...
Preprint
Full-text available
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations. The knowledge boundary (KB) of an LLM limits its factual understanding, beyond which it may begin to hallucinate. Investigating the perception of LLMs' KB is crucial for detecting hallucinations and LLMs' reliable generation. Current studies perceive...
Preprint
Full-text available
Score-based generative models have demonstrated significant practical success in data-generating tasks. The models establish a diffusion process that perturbs the ground truth data to Gaussian noise and then learn the reverse process to transform noise into data. However, existing denoising methods such as Langevin dynamics and numerical stochastic...
Conference Paper
Full-text available
The heavy ball momentum method is a commonly used technique for accelerating training processes in the machine learning community. However, empirical evidence suggests that the convergence of stochastic gradient descent (SGD) with heavy ball may slow down when the momentum hyperparameter approaches 1. Despite this observation, there are no establis...
Article
Full-text available
In this paper, we propose a general deep learning training framework, XGrad, which introduces weight prediction into popular gradient-based optimizers to boost their convergence and generalization when training deep neural network (DNN) models. In particular, ahead of each mini-batch training, the future weights are predicted according to the...
Preprint
Full-text available
Graph index as an effective data structure is widely applied in subgraph retrieval and matching. It records and compares the frequencies of a set of specific features to detect subgraph containment on the fly, which is the foundation of the filtering techniques for subgraph retrieval and matching. However, due to the NP-hardness of the subgraph cou...
Chapter
Face recognition has made significant progress in recent years due to deep convolutional neural networks (CNN). In many face recognition (FR) scenarios, face images are acquired from a sequence with huge intra-variations. These intra-variations, which are mainly affected by low-quality face images, cause instability of recognition performance. Prev...
Preprint
Full-text available
Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples are beneficial for model convergence. In p...
Article
The substantial success of Vision Transformer (ViT) in computer vision tasks is largely attributed to the architecture design. This underscores the necessity of efficient architecture search for designing better ViTs automatically. As training-based architecture search methods are computationally intensive, there’s a growing interest in training-fr...
Article
Gradient-based optimization methods implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the high communication overhead for exchanging information, such as stochastic gradients, between workers. The inherent causes of this bottl...
Article
Low-resource conversation models are becoming increasingly important. Existing conversation models tend to generate uninformative responses that lack diversity, especially when the training data are limited. Researchers address this issue by refining training objectives or incorporating additional data sources. Learned from masses of diverse texts,...
Article
In recent years, there has been growing interest in knowledge graph embedding (KGE), which maps symbolic entities and relations into a low-dimensional vector space to effectively represent structured data from the knowledge graph. In addition, the concept of temporal knowledge graph is proposed to document dynamically changing facts in the real world. Exist...
Article
Recently, the data-parallel pipeline approach has been widely used in training DNN models on commodity GPU servers. However, there are still three challenges for hybrid parallelism on commodity GPU servers: i) a balanced model partition is crucial for efficiency, whereas prior works lack a sound solution to generate a balanced partition automatical...
Preprint
Full-text available
The Lipschitz smoothness assumption is crucial for analyzing the convergence of nonconvex Decentralized Stochastic Gradient Descent (DSGD). However, it is often unrealistic and cannot be satisfied in practical scenarios, such as polynomial function optimization and complex neural networks. In this study, we propose a novel approach for DSGD that in...
Preprint
Full-text available
Communication-efficient distributed learning with massive workers has demonstrated significant success in training large-scale deep neural networks. However, the empirical performance of various communication-efficient algorithms deteriorates severely in heterogeneous settings, where each local data distribution differs drastically from the others....
Article
Graph neural networks (GNNs) have been successfully applied to many important application domains on graph data. As graphs become increasingly large, existing GNN training frameworks typically use mini-batch sampling during feature aggregation to lower resource burdens, which unfortunately suffers from long memory access latency and inefficient d...
Preprint
Full-text available
Sign-based stochastic methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates. However, the current convergence analysis of sign-based methods relies on the strong assumptions of first-order gradient Lipschitz and second-order gradient Lipschitz, which may not ho...
Chapter
Full-text available
The heavy ball momentum technique is widely used in accelerating the machine learning training process, which has demonstrated significant practical success in optimization tasks. However, most heavy ball methods require a preset hyperparameter that will result in excessive tuning, and a calibrated fixed hyperparameter may not lead to optimal perfo...
Chapter
Graph Neural Networks (GNNs) have gained considerable attention in recent years for their exceptional performance on graph-structured data. Sampling-based GNN training is the most common method used for training GNNs on large-scale graphs, and it is often accelerated by caching feature data on the GPU. However, the emergence of more complex models...
Preprint
Full-text available
Stochastic gradient descent (SGD) performed in an asynchronous manner plays a crucial role in training large-scale machine learning models. However, the generalization performance of asynchronous delayed SGD, which is an essential metric for assessing machine learning algorithms, has rarely been explored. Existing generalization error bounds are ra...
Chapter
Recent studies have shown that the integration of external knowledge greatly improves the performance of commonsense question answering. However, the problems of semantic representation discrepancy between questions and external knowledge as well as weak discrimination between choices have not been well ameliorated. To address the above problems, w...
Conference Paper
Full-text available
Self-training emerges as an important research line on domain adaptation. By taking the model's prediction as the pseudo labels of the unlabeled data, self-training bootstraps the model with pseudo instances in the target domain. However, the prediction errors of pseudo labels (label noise) challenge the performance of self-training. To address thi...
Preprint
Full-text available
Self-training emerges as an important research line on domain adaptation. By taking the model's prediction as the pseudo labels of the unlabeled data, self-training bootstraps the model with pseudo instances in the target domain. However, the prediction errors of pseudo labels (label noise) challenge the performance of self-training. To address thi...
Preprint
Full-text available
Text classification is a fundamental task for natural language processing, and adapting text classification models across domains has broad applications. Self-training generates pseudo-examples from the model's predictions and iteratively trains on the pseudo-examples, i.e., minimizes the loss on the source domain and the Gibbs entropy on the targe...
Article
Prompt learning is an effective paradigm that bridges gaps between the pre-training tasks and the corresponding downstream applications. Approaches based on this paradigm have achieved excellent results in various applications. However, it still needs to be answered how to design a general-purpose framework based on the prompt learning par...
Article
Deep learning (DL) has gained great success in recent years, leading to state-of-the-art performance in the research community and in industrial fields like computer vision and natural language processing. One of the reasons for this success is the huge number of parameters adopted in DL models. However, it is impractical to train a moderately large model wi...
Conference Paper
Full-text available
Text classification is a fundamental task for natural language processing, and adapting text classification models across domains has broad applications. Self-training generates pseudo-examples from the model's predictions and iteratively trains on the pseudo-examples, i.e., minimizes the loss on the source domain and the Gibbs entropy on the targe...
Conference Paper
Full-text available
Open knowledge graph (OpenKG) link prediction aims to predict missing factual triples in the form of (head noun phrase, relation phrase, tail noun phrase). Since triples are not canonicalized, previous methods either focus on canonicalizing noun phrases (NPs) to reduce graph sparsity, or utilize textual forms to improve type compatibility. However,...
Conference Paper
Full-text available
The heavy ball momentum technique is widely used in accelerating the machine learning training process, which has demonstrated significant practical success in optimization tasks. However, most heavy ball methods require a preset hyperparameter that will result in excessive tuning, and a calibrated fixed hyperparameter may not lead to optimal per...
Article
Full-text available
The decentralized stochastic gradient method emerges as a promising solution for solving large-scale machine learning problems. This paper studies the decentralized Markov chain gradient descent (DMGD), a variant of the decentralized stochastic gradient method, which draws random samples along the trajectory of a Markov chain. DMGD arises when obta...
Article
The generalization ability often determines the success of machine learning algorithms in practice. Therefore, it is of great theoretical and practical importance to understand and bound the generalization error of machine learning algorithms. In this paper, we provide the first generalization results of the popular stochastic gradient descent (SGD...
Preprint
Large Language Models (LLMs) are gaining increasing attention due to their exceptional performance across numerous tasks. As a result, the general public utilize them as an influential tool for boosting their productivity while natural language processing researchers endeavor to employ them in solving existing or new research problems. Unfortunatel...
Conference Paper
Full-text available
Sign Stochastic Gradient Descent (SIGNSGD) is a communication-efficient stochastic algorithm that only uses the sign information of the stochastic gradient to update the model's weights. However, the existing convergence theory of SIGNSGD either requires increasing batch sizes during training or assumes the gradient noise is symmetric and unimodal....
Preprint
Deep learning is experiencing a rise in foundation models that are expected to lead in various fields. The massive number of parameters necessitates the use of tensor model parallelism (TMP) in foundation model training. However, TMP requires frequent communication operations which significantly reduces the training efficiency. In this paper, we pr...
Preprint
Text classification tasks often encounter few-shot scenarios with limited labeled data, and addressing data scarcity is crucial. Data augmentation with mixup has been shown to be effective on various text classification tasks. However, most of the mixup methods do not consider the varying degree of learning difficulty in different stages of training and...
Article
Gradient quantization has been widely used in distributed training of deep neural network (DNN) models to reduce communication cost. However, existing quantization methods overlook that gradients have a nonuniform distribution changing over time, which can lead to significant compression error that not only increases the number of training iteratio...
Article
Full-text available
Class-Incremental Few-Shot Named Entity Recognition (CIFNER) aims to identify entity categories that have appeared with only a few newly added (novel) class examples. However, existing class-incremental methods typically introduce new parameters to adapt to new classes and treat all information equally, resulting in poor generalization. Meanwhile,...
Article
Full-text available
Knowledge graphs (KGs) are crucial foundations for building intelligent systems, such as question answering and recommendation. However, their performance is hampered by the incompleteness of KGs, so knowledge graph completion arises to infer whether a triple of the form (head entity, relation, tail entity) is a missing fact. The path-based approach...
Article
Foundation models are in the process of becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the pretraining process is extremely memory- and communication-intensive. These challenges make it...
Article
Distributed data-parallel training (DDP) is prevalent in large-scale deep learning. To increase the training throughput and scalability, high-performance collective communication methods such as AllReduce have recently proliferated for DDP use. However, these approaches require long communication periods with increasing model sizes. Collective comm...
Chapter
The search space is crucial in neural architecture search (NAS), and can determine the upper limit of the performance. Most methods focus on the design of depth and width when designing the search space, ignoring the receptive field. With a larger receptive field, the model is able to aggregate hierarchical information and strengthen its representa...
Article
Full-text available
The matrix multiplication-based convolutional algorithm, which can efficiently implement convolutions with different parameters, is the first choice of convolution performance optimization for a given chip. Based on the architecture of Phytium heterogeneous multi-core digital signal processors (DSPs) developed by National University of Defense Techn...
Conference Paper
Neural architecture search (NAS) has brought significant progress in recent image recognition tasks. Most existing NAS methods apply restricted search spaces, which limits the upper-bound performance of searched models. To address this issue, we propose a new search space named MobileNet3-MT. By reducing human-prior knowledge in omni dimensions of...
Preprint
Full-text available
Practical networks for edge devices adopt shallow depth and small convolutional kernels to save memory and computational cost, which leads to a restricted receptive field. Conventional efficient learning methods focus on lightweight convolution designs, ignoring the role of the receptive field in neural network design. In this paper, we propose the...
Preprint
Full-text available
Neural architecture search (NAS) has made tremendous progress in the automatic design of effective neural network structures but suffers from a heavy computational burden. One-shot NAS significantly alleviates the burden through weight sharing and improves computational efficiency. Zero-shot NAS further reduces the cost by predicting the performanc...
Article
This paper systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-descriptio...
Preprint
Full-text available
TAPS is a Topology-Aware intra-operator Parallelism strategy Searching algorithm that generates intra-operator parallelism strategies by considering both intra-node and inter-node bandwidth. Most of the existing auto-parallelism works use the communication volume as the communication cost directly when generating strategies, which we prove to be su...
