Conference Paper

Distilling the Knowledge in a Neural Network

Authors: Geoffrey Hinton, Oriol Vinyals, Jeff Dean

Abstract

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
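As a rough, hedged illustration of the recipe described in the abstract (not the authors' released code), the PyTorch sketch below averages the temperature-softened predictions of an ensemble of teachers into soft targets and trains a single student against them alongside the usual hard-label loss; the temperature T and the weight alpha are illustrative choices.

```python
# Hedged sketch: distilling an ensemble into a single student with
# temperature-scaled soft targets. T and alpha are illustrative choices.
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               T=2.0, alpha=0.5):
    # Soft targets: average of the teachers' temperature-softened predictions.
    soft_targets = torch.stack(
        [F.softmax(t / T, dim=1) for t in teacher_logits_list]
    ).mean(dim=0)
    # Student distribution at the same temperature.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradients stay comparable in
    # magnitude to the hard-label cross-entropy term.
    soft_loss = F.kl_div(log_p_student, soft_targets, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```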


... To address these challenges, efficient lightweight techniques, also known as model optimization techniques, have emerged. These techniques, such as weight pruning [19], model quantization [20], and knowledge distillation [21], reduce the computational complexity and inference time of DL-based malware classification methods. In particular, knowledge distillation has been actively studied as an efficient lightweight technique for DL-based malware classification methods. ...
... Knowledge distillation was first introduced by Bucila et al. for model compression [26] and gained popularity after Hinton et al. generalized it [21]. Knowledge distillation is a model optimization technique that allows the small and shallow student model to mimic the performance of a large and deep teacher model. ...
... For all methods, including Vanilla KD, MLKD, and Self-MCKD, we set the temperature factor to 4, following the common practices in knowledge distillation experiments [21,30,36]. ...
Article
Full-text available
As malware continues to evolve, AI-based malware classification methods have shown significant promise in improving the malware classification performance. However, these methods lead to a substantial increase in computational complexity and the number of parameters, increasing the computational cost during the training process. Moreover, the maintenance cost of these methods also increases, as frequent retraining and transfer learning are required to keep pace with evolving malware variants. In this paper, we propose an efficient knowledge distillation technique for AI-based malware classification methods called Self-MCKD. Self-MCKD transfers output logits that are separated into the target class and non-target classes. With the separation of the output logits, Self-MCKD enables efficient knowledge transfer by assigning weighted importance to the target class and non-target classes. Also, Self-MCKD utilizes small and shallow AI-based malware classification methods as both the teacher and student models to overcome the need to use large and deep methods as the teacher model. From the experimental results using various malware datasets, we show that Self-MCKD outperforms the traditional knowledge distillation techniques in terms of the effectiveness and efficiency of its malware classification.
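The following is a speculative PyTorch sketch of the target/non-target logit separation that the abstract describes, loosely following the decoupled logit-distillation pattern; the weights w_t and w_nt and all helper names are assumptions, not the paper's implementation.

```python
# Speculative sketch of a decoupled (target vs. non-target) logit distillation
# loss; weights and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def decoupled_logit_kd(student_logits, teacher_logits, labels,
                       T=4.0, w_t=1.0, w_nt=2.0, eps=1e-8):
    num_classes = student_logits.size(1)
    target_mask = F.one_hot(labels, num_classes).bool()

    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    # Target-class part: a binary distribution (target vs. all other classes).
    bin_s = torch.stack([p_s[target_mask], 1.0 - p_s[target_mask]], dim=1)
    bin_t = torch.stack([p_t[target_mask], 1.0 - p_t[target_mask]], dim=1)
    target_kd = F.kl_div(bin_s.clamp_min(eps).log(), bin_t, reduction="batchmean")

    # Non-target part: distribution over the remaining classes, renormalized by
    # masking the target logit out before the softmax.
    masked = target_mask.float() * -1e9
    nontarget_kd = F.kl_div(
        F.log_softmax(student_logits / T + masked, dim=1),
        F.softmax(teacher_logits / T + masked, dim=1),
        reduction="batchmean",
    )
    return (w_t * target_kd + w_nt * nontarget_kd) * T * T
```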
... As a measure of similarity between the vectors, we follow the literature [9,20] and leverage the use of a temperature-scaled cosine similarity s: ...
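A minimal sketch of a temperature-scaled cosine similarity of the kind referred to in the excerpt above; the temperature name tau and its default value are assumptions.

```python
# Minimal sketch of a temperature-scaled cosine similarity.
import torch.nn.functional as F

def temperature_scaled_cosine(u, v, tau=0.1):
    # Cosine similarity along the feature dimension, sharpened by tau < 1.
    return F.cosine_similarity(u, v, dim=-1) / tau
```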
... Table 7 shows a detailed report of the results of all recent methods on CIFAR-10, including the ones based on rotations. Table 4: Detailed results of the maximum AUC reached for each CIFAR-10 class, over three different λ values (5, 10, 20); the mean of those maximums is reported in the last column, and the best AUC for each class in the last row. ...
... We reproduce the experiments (PLAD † ) fixing λ to the most frequent value (λ = 5). PLUME performance is also reported with λ values fixed over classes (5,10,20), and considering the best value per class (∼). ...
Preprint
Full-text available
One-class anomaly detection aims to detect objects that do not belong to a predefined normal class. In practice, training data lack those anomalous samples; hence, state-of-the-art methods are trained to discriminate between normal and synthetically-generated pseudo-anomalous data. Most methods use data augmentation techniques on normal images to simulate anomalies. However, the best-performing ones implicitly leverage a geometric bias present in the benchmarking datasets, which limits their usability in more general conditions. Others rely on basic noising schemes that may be suboptimal in capturing the underlying structure of normal data. In addition, most still favour the image domain to generate pseudo-anomalies, training models end-to-end from only the normal class and overlooking richer representations of the information. To overcome these limitations, we consider frozen yet rich feature spaces given by pretrained models and create pseudo-anomalous features with a novel adaptive linear feature perturbation technique. It adapts the noise distribution to each sample, applies decaying linear perturbations to feature vectors, and further guides the classification process using a contrastive learning objective. Experimental evaluation conducted on both standard and geometric bias-free datasets demonstrates the superiority of our approach with respect to comparable baselines. The codebase is accessible via our public repository.
... As an approach to address the challenge of degrading model performance in lightweight networks, several techniques, such as model compression [12], transfer learning [13], and knowledge distillation (KD) [14], have been proposed in the literature. Of these, KD has been one of the most popular and successful methods in academia and industry. ...
... Knowledge Distillation (KD) [14] is a popular technique that enables complex and larger networks to train lightweight models to improve their performance without sacrificing on efficiency. As KD was first developed with the main idea of transferring the output, lately, there is a growing interest in the image segmentation community to distil feature and structural information as well. ...
... Knowledge distillation [14] (Hinton et al., 2014) aims to make the student network replicate the teacher's final output, utilizing metrics such as cross-entropy and Kullback-Leibler (KL) divergence. In this paper, a Prediction Maps Distillation (PMD) module is introduced; its main approach is to calculate the differences among the final layers. PMD further improves the teacher-student interrelationship by comparing the output segmentation maps of the two models to strengthen the student network from a spatial perspective. ...
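A hedged sketch of how an output-map distillation term of the kind described above could be computed, treating each pixel of the teacher and student segmentation maps as a class distribution to be matched with KL divergence; the shapes and the temperature T are illustrative, not the paper's code.

```python
# Hedged sketch of pixel-wise distillation of output segmentation maps.
import torch.nn.functional as F

def prediction_map_distillation(student_maps, teacher_maps, T=1.0):
    # student_maps, teacher_maps: logits of shape [B, num_classes, H, W].
    b, c, h, w = student_maps.shape
    s = F.log_softmax(student_maps.permute(0, 2, 3, 1).reshape(-1, c) / T, dim=1)
    t = F.softmax(teacher_maps.permute(0, 2, 3, 1).reshape(-1, c) / T, dim=1)
    # Every pixel is treated as an independent classification to be matched.
    return F.kl_div(s, t, reduction="batchmean") * T * T
```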
Article
Recent advances in feature-based knowledge distillation have shown promise in computer vision, yet their direct application to medical image segmentation has been challenging due to the inherent high intra-class variance and class imbalance prevalent in medical images. This paper introduces a novel approach that synergizes knowledge distillation with contrastive learning to enhance the performance of student networks in medical image segmentation. By leveraging importance maps and region affinity graphs, our method encourages the student network to deeply explore the regional feature representations of the teacher network, capturing essential structural information and detailed features. This process is complemented by class-guided contrastive learning, which sharpens the discriminative capacity of the student network for different class features, specifically addressing intra-class variance and inter-class imbalance. Experimental validation on the colorectal cancer tumor dataset demonstrates notable improvements, with student networks ENet, MobileNetV2, and ResNet-18 achieving Dice coefficient score enhancements of 4.92%, 4.34%, and 4.59%, respectively. When benchmarked against teacher networks FANet, PSPNet, SwinUnet, and AttentionUnet, our best-performing student network exhibited performance boosts of 2.45%, 5.84%, 6.58%, and 3.56%, respectively, underscoring the efficacy of integrating knowledge distillation with contrastive learning for medical image segmentation.
... The recent advent of Generative AI systems like Large Language Models (LLMs) has transformed how we solve problems but our understanding of their decision-making processes and metacognitive capabilities is significantly limited [1,2]. In addition, traditional approaches to model distillation and ensemble methods have primarily focused on performance metrics rather than the qualitative aspects of decision-making that characterize human cognition [3,4]. In this work-in-progress paper, we propose to bridge this gap by introducing a framework, as shown in Figure 1, that integrates Dual-Process Cognitive Theory from neuroscience with LLM ensemble architectures as a way to generalize the traditional Teacher-Student Model as well as to quantify metacognition in LLMs. ...
... The teacher model is usually trained on a comprehensive dataset and then the student model is trained using a combination of the original training data and the soft targets provided by the teacher model; these soft targets are the probabilities assigned to each class by the teacher model, which contain more information than the hard targets (binary class labels) [5,6]. Current Teacher-Student models in machine learning primarily focus on knowledge transfer through probability distribution matching [3]. ...
Preprint
Full-text available
In this work-in-progress paper, we propose a novel framework for quantifying meta-cognitive processes in ensembles of Large Language Models (LLMs) by extending the traditional teacher-student model through the lens of dual-process cognitive theory. We take a systematic approach that maps the System 1 and System 2 thinking paradigms onto various LLM architectures. This mapping enables the quantifiable measurement of metacognitive processes through emotional response analysis, correctness evaluation, experiential matching, conflicting information estimation, and problem importance task prioritization. In this approach, the rapid, intuitive thinking of System 1 is mapped onto a smaller "student" (single LLM or ensemble of bagged LLMs) while the deliberate, analytical reasoning of System 2 is mapped onto a larger "teacher" (ensemble of boosted LLMs). Additionally, we utilize a graph-theoretic architecture to model the LLM ensembles' interactions in order to understand how different LLMs collaborate for improved decision-making. Our proposed framework provides a potential theoretical foundation for both quantifying metacognition in LLMs as well as understanding and implementing more sophisticated decision-making processes using ensembles of LLMs.
... Knowledge distillation [22] is a model compression method: a lightweight model is first constructed and trained, and supervised information from a larger model with better performance is then used to improve the performance and accuracy of the constructed model. As shown in Fig. 1, in a visible-to-infrared image translation algorithm based on a GAN, the discriminator often contains rich infrared image information. ...
... Knowledge distillation [22] was initially proposed to solve the application problem of deep neural networks on mobile devices with limited computing resources. However, knowledge distillation has recently been widely explored and adopted in various applications, such as image classification [29,30], domain adaptation [31,32], object detection [33,34], semantic segmentation [35,36], and image generation [37]. ...
Article
Full-text available
This paper proposes a discriminator-guided visible-to-infrared image translation algorithm based on a generative adversarial network and designs a multi-scale fusion generative network. The generative network enhances the perception of the image’s fine-grained features by fusing features of different scales in the channel direction. Meanwhile, the discriminator performs the infrared image reconstruction task, which provides additional infrared information to train the generator. The convergence efficiency of generator training is enhanced through soft-label guidance generated by knowledge distillation. The experimental results show that, compared to existing typical infrared image generation algorithms, the proposed method can generate higher-quality infrared images, achieves better performance in both subjective visual description and objective metric evaluation, and performs better in the downstream tasks of template matching and image fusion.
... Since semantic segmentation has shown great potential in many applications like autonomous driving, video surveillance, robot sensing, and so on, how to keep efficient inference speed and high accuracy with high-resolution images is a critical problem. The focus here is knowledge distillation, which was introduced by Hinton et al. [1] based on a teacher-student framework and has received increasing attention in the semantic segmentation community. ...
... In importance sampling, for each sample s ∈ kM, we set 9 anchors with 3 scales and 3 aspect ratios at the location. In this way, we generate 9 region proposals {r_i} (i ∈ [1, 9]) for each s, and its sample probability can be calculated by, ...
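An illustrative sketch of placing 9 anchors (3 scales × 3 aspect ratios) at one sampled location, as the excerpt describes; the concrete scale and ratio values are assumptions, and the elided sampling-probability formula is not reproduced.

```python
# Illustrative sketch: 9 anchor boxes from 3 scales x 3 aspect ratios.
def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5          # width/height ratio equals r,
            h = s / r ** 0.5          # area stays roughly s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors                     # 9 (x1, y1, x2, y2) boxes
```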
Preprint
Low-level texture feature/knowledge is also of vital importance for characterizing the local structural pattern and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high-level deep features. In this paper, we aim to re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features with iterative Laplacian pyramid and directional filter bank to mine the structural texture knowledge, and Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks, namely STLNet++ and U-SSNet, to enable existing segmentation networks to harvest the structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets, respectively.
... Additionally, newer and more innovative network architectures have recently been introduced [17], which could potentially be suitable for seizure detection tasks if they are optimized for hardware efficiency [18,19]. We hypothesize that training a deep model on seizure data and then distilling this knowledge [20] into a smaller model can simultaneously improve both accuracy and efficiency. ...
... Knowledge distillation is a technique in which a deep, pre-trained model is used to train a smaller (or otherwise optimized) model that achieves comparable accuracy with greater efficiency [20]. Typically, this process follows a teacher-student framework, where the smaller model learns not directly from the label distribution but from the output distribution of the teacher (deeper) model, a process known as teacher-student distillation [31]. ...
Preprint
Full-text available
Enhancing the accuracy and efficiency of machine learning algorithms employed in neural interface systems is crucial for advancing next-generation intelligent therapeutic devices. However, current systems often utilize basic machine learning models that do not fully exploit the natural structure of brain signals. Additionally, existing learning models used for neural signal processing often demonstrate low speed and efficiency during inference. To address these challenges, this study introduces Micro Tree-based NAM (MT-NAM), a distilled model based on the recently proposed Neural Additive Models (NAM). The MT-NAM achieves a remarkable 100× improvement in inference speed compared to standard NAM, without compromising accuracy. We evaluate our approach on the CHB-MIT scalp EEG dataset, which includes recordings from 24 patients with varying numbers of sessions and seizures. NAM achieves an 85.3% window-based sensitivity and 95% specificity. Interestingly, our proposed MT-NAM shows only a 2% reduction in sensitivity compared to the original NAM. To regain this sensitivity, we utilize a test-time template adjuster (T3A) as an update mechanism, enabling our model to achieve higher sensitivity during test time by accommodating transient shifts in neural signals. With this online update approach, MT-NAM achieves the same sensitivity as the standard NAM while achieving approximately 50× acceleration in inference speed.
... • Technique: Using optimization techniques such as pruning [23], [24], transfer learning [25], and model distillation [26] to reduce the number of neurons and layers without sacrificing performance. Pruning has been shown to effectively reduce computational cost and memory usage while retaining accuracy by removing redundant parameters [23]. ...
... Transfer learning leverages pre-trained models to adapt to new tasks, reducing training time and resource requirements [25]. Model distillation, as introduced in [26], compresses large networks into smaller, efficient models by transferring knowledge from teacher networks to student networks. By strategically simplifying the architecture, we can achieve computational efficiency while maintaining accuracy. ...
Preprint
Deep learning models are often considered black boxes due to their complex hierarchical transformations. Identifying suitable architectures is crucial for maximizing predictive performance with limited data. Understanding the geometric properties of neural networks involves analyzing their structure, activation functions, and the transformations they perform in high-dimensional space. These properties influence learning, representation, and decision-making. This research explores neural networks through geometric metrics and graph structures, building upon foundational work in arXiv:2007.06559. It addresses the limited understanding of geometric structures governing neural networks, particularly the data manifolds they operate on, which impact classification, optimization, and representation. We identify three key challenges: (1) overcoming linear separability limitations, (2) managing the dimensionality-complexity trade-off, and (3) improving scalability through graph representations. To address these, we propose leveraging non-linear activation functions, optimizing network complexity via pruning and transfer learning, and developing efficient graph-based models. Our findings contribute to a deeper understanding of neural network geometry, supporting the development of more robust, scalable, and interpretable models.
... In conventional offline knowledge distillation paradigms (Hinton et al., 2015; Tian et al., 2019; Chen et al., 2021), a pre-trained teacher model provides soft supervision to guide student training through post-hoc knowledge transfer. This two-stage paradigm inherently suffers from computational redundancy and temporal decoupling between teacher-student interactions. ...
... KD has attracted wide interest in vision and language applications (Meng et al., 2022; Niu et al., 2022; Sanh et al., 2019). KD can be formulated in logits-based (Hinton et al., 2015; Zhao et al., 2022; Huang et al., 2022; Sun et al., 2024), feature-based (Heo et al., 2019; Chen et al., 2021; Li, 2022; Xiaolong et al., 2023; Huang et al., 2024), and relation-based (Tian et al., 2019) forms. Recent studies have focused on automating the process of discovering effective knowledge distillation strategies or architectures (Dong et al., 2023; Hao et al., 2024). ...
Preprint
Full-text available
Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, eliminating the need for knowledge transfer from a pretrained teacher network to a more compact student network. This paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on foreground objects. (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM facilitates the student models to catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation and diffusion distillation tasks.
... The Generator is inspired by the well-established work of Huang et al. [30] in video frame interpolation and incorporates convolutional blocks that efficiently estimate intermediate optical flows in an end-to-end manner. Additionally, our approach integrates a knowledge distillation framework [31,32], where ZAugNet's Generator (Student) is a smaller, optimized subset of a larger Teacher network. The Teacher is trained in a self-supervised manner, as described earlier, and its knowledge is transferred to the Student without loss of validity. ...
... For these pixel-wise losses, the Laplacian pyramid representation loss [41] was used. Additionally, we use the distillation loss [32] L_dis to transfer the extra knowledge learned by the teacher to the student, which is then used for prediction, as defined in the following equation: ...
Preprint
Full-text available
Three-dimensional biological microscopy has significantly advanced our understanding of complex biological structures. However, limitations due to microscopy techniques, sample properties or phototoxicity often result in poor z-resolution, hindering accurate cellular measurements. Here, we introduce ZAugNet, a fast, accurate, and self-supervised deep learning method for enhancing z-resolution in biological images. By performing nonlinear interpolation between consecutive slices, ZAugNet effectively doubles resolution with each iteration. Compared on several microscopy modalities and biological objects, it outperforms competing methods on most metrics. Our method leverages a generative adversarial network (GAN) architecture combined with knowledge distillation to maximize prediction speed without compromising accuracy. We also developed ZAugNet+, an extended version enabling continuous interpolation at arbitrary distances, making it particularly useful for datasets with nonuniform slice spacing. Both ZAugNet and ZAugNet+ provide high-performance, scalable z-slice augmentation solutions for large-scale 3D imaging. They are available as open-source frameworks in PyTorch, with an intuitive Colab notebook interface for easy access by the scientific community.
... Numerous methods leverage KD to mitigate CF by designating a previous version of the model as a teacher, which guides the current model as a student [15,21,31,38,39,41,42,46]. KD [14] is a technique used to transfer knowledge from a teacher model to a student model. This is achieved by having the student model learn from targets provided by the teacher model, thereby capturing nuanced patterns that enhance the student model's generalization ability [11]. ...
... In this context, p represents a temperature parameter used to control the smoothness of the probability distribution, with values typically set to p > 1 [14]. ...
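A quick numerical illustration of the smoothing effect of a temperature greater than 1, using the excerpt's notation p for the temperature.

```python
# Temperature smoothing: larger p spreads probability mass more evenly.
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.2])
print(F.softmax(logits, dim=0))        # p = 1: sharp, roughly [0.93, 0.05, 0.02]
print(F.softmax(logits / 4.0, dim=0))  # p = 4: smoother, roughly [0.54, 0.25, 0.21]
```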
Preprint
Continual Learning (CL) remains a central challenge in deep learning, where models must sequentially acquire new knowledge while mitigating Catastrophic Forgetting (CF) of prior tasks. Existing approaches often struggle with efficiency and scalability, requiring extensive memory or model buffers. This work introduces "No Forgetting Learning" (NFL), a memory-free CL framework that leverages knowledge distillation to maintain stability while preserving plasticity. Memory-free means the NFL does not rely on any memory buffer. Through extensive evaluations of three benchmark datasets, we demonstrate that NFL achieves competitive performance while utilizing approximately 14.75 times less memory than state-of-the-art methods. Furthermore, we introduce a new metric to better assess CL's plasticity-stability trade-off.
... Although these high-performance models excel in accuracy, they often face issues of slow inference speed and high computational costs, which limit their deployment in resource-constrained environments. To address this issue, Hinton et al. proposed knowledge distillation [14], which enables small student models to learn from large teacher models without adding extra computational costs. Knowledge distillation is mainly divided into two categories: distillation based on logit outputs [14] and distillation based on intermediate layer features [15][16][17][18], with the latter being widely used in various tasks due to its rich semantic information. ...
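For the "intermediate layer feature" flavour of distillation named above, a common generic pattern is to adapt the student's feature map to the teacher's channel width and penalize their difference; the sketch below is a hedged, generic example (layer sizes and names are illustrative), not a specific method from the cited works.

```python
# Generic, hedged sketch of feature-based distillation: a 1x1 convolution
# adapts the student's feature map to the teacher's channel width, and an
# MSE term pulls the two maps together.
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Teacher features are detached so only the student and the adapter
        # receive gradients.
        return F.mse_loss(self.adapt(student_feat), teacher_feat.detach())
```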
... The KL divergence loss is also scaled by the square of the temperature (T²) before it is incorporated into the weighted loss. This temperature scaling approach was recommended by the authors of the paper [69]. ...
... Instead of reducing the parameter count directly, KD transfers knowledge from a high-capacity teacher model to a smaller student model. Through this process, the student model captures the essential patterns learned by the teacher during its training, enabling it to achieve a similar performance, despite its highly reduced complexity [69,81]. ...
Article
Full-text available
DeepFake detection models play a crucial role in ambient intelligence and smart environments, where systems rely on authentic information for accurate decisions. These environments, integrating interconnected IoT devices and AI-driven systems, face significant threats from DeepFakes, potentially leading to compromised trust, erroneous decisions, and security breaches. To mitigate these risks, neural-network-based DeepFake detection models have been developed. However, their substantial computational requirements and long training times hinder deployment on resource-constrained edge devices. This paper investigates compression and transfer learning techniques to reduce the computational demands of training and deploying DeepFake detection models, while preserving performance. Pruning, knowledge distillation, quantization, and adapter modules are explored to enable efficient real-time DeepFake detection. An evaluation was conducted on four benchmark datasets: “SynthBuster”, “140k Real and Fake Faces”, “DeepFake and Real Images”, and “ForenSynths”. It compared compressed models with uncompressed baselines using widely recognized metrics such as accuracy, precision, recall, F1-score, model size, and training time. The results showed that a compressed model at 10% of the original size retained only 56% of the baseline accuracy, but fine-tuning in similar scenarios increased this to nearly 98%. In some cases, the accuracy even surpassed the original’s performance by up to 12%. These findings highlight the feasibility of deploying DeepFake detection models in edge computing scenarios.
... To reduce computational overhead, researchers are increasingly turning to knowledge distillation techniques (Hinton et al., 2015). These works focus on transferring general capabilities from advanced LLMs to their more cost-efficient counterparts through carefully curated instructions (Taori et al., 2023;Chiang et al., 2023;Wu et al., 2024). ...
... Knowledge Distillation from LLMs. In light of the high computational demands or issues of proprietary access, many studies explore knowledge distillation techniques (Hinton et al., 2015) to transfer the capabilities of LLMs into more compact and accessible models (Taori et al., 2023;Chiang et al., 2023;Wu et al., 2024;Muralidharan et al., 2024). Recent advancements in this field concentrate on optimizing distillation objectives to improve the efficiency and effectiveness of the distillation process (Zhong et al., 2024;Ko et al., 2024;Agarwal et al., 2024). ...
Preprint
Full-text available
This paper presents a compact model that achieves strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). Our methodology decouples the distillation target into two key components: sentiment-related knowledge and task alignment. To transfer these components, we propose a two-stage distillation framework. The first stage, knowledge-driven distillation (KnowDist), transfers sentiment-related knowledge to enhance fundamental sentiment analysis capabilities. The second stage, in-context learning distillation (ICLDist), transfers task-specific prompt-following abilities to optimize task alignment. For evaluation, we introduce SentiBench, a comprehensive sentiment analysis benchmark comprising 3 task categories across 12 datasets. Experiments on this benchmark demonstrate that our model effectively balances model size and performance, showing strong competitiveness compared to existing small-scale LLMs.
... Research is advancing in parameter-efficient transfer learning methods, including adapter layers and sparse fine-tuning approaches [20]. Knowledge distillation techniques [21] are still being developed to transfer learning from larger to smaller models whilst hardware-aware algorithms and distributed frameworks are being created to handle large-scale applications more effectively. This is further augmented by advancements in federated learning. ...
... These methods integrate regularization techniques with replay mechanisms, utilizing a memory of previous class samples that are replayed during the learning of new tasks. iCaRL [40] employs Nearest Mean Classification (NMC) along with exemplars and utilizes Knowledge Distillation [41] during training to mitigate forgetting. Techniques like BiC [42] and IL2M [43] correct network outputs using exemplars to mitigate inter-task confusion and task-recency bias. ...
Preprint
Full-text available
Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, resulting in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose an effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our approach, which we call Elastic Feature Consolidation++ (EFC++), exploits a tractable second-order approximation of feature drift based on a proposed Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes. In addition, we introduce a post-training prototype re-balancing phase that updates classifiers to compensate for feature drift. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset, ImageNet-1K and DomainNet demonstrate that EFC++ is better able to learn new tasks by maintaining model plasticity and significantly outperforms the state-of-the-art.
... Knowledge distillation (KD) [26] has become a valuable technique for improving deep learning models, especially in resource-limited environments. In 3D object detection, KD enables the transfer of knowledge from a larger teacher model to a smaller student model, enhancing detection accuracy without significant computational overhead [27], [28]. ...
Preprint
Full-text available
LiDAR-based 3D object detection presents significant challenges due to the inherent sparsity of LiDAR points. A common solution involves long-term temporal LiDAR data to densify the inputs. However, efficiently leveraging spatial-temporal information remains an open problem. In this paper, we propose a novel Semantic-Supervised Spatial-Temporal Fusion (ST-Fusion) method, which introduces a novel fusion module to relieve the spatial misalignment caused by the object motion over time and a feature-level semantic supervision to sufficiently unlock the capacity of the proposed fusion module. Specifically, the ST-Fusion consists of a Spatial Aggregation (SA) module and a Temporal Merging (TM) module. The SA module employs a convolutional layer with progressively expanding receptive fields to aggregate the object features from the local regions to alleviate the spatial misalignment, the TM module dynamically extracts object features from the preceding frames based on the attention mechanism for a comprehensive sequential presentation. Besides, in the semantic supervision, we propose a Semantic Injection method to enrich the sparse LiDAR data via injecting the point-wise semantic labels, using it for training a teacher model and providing a reconstruction target at the feature level supervised by the proposed object-aware loss. Extensive experiments on various LiDAR-based detectors demonstrate the effectiveness and universality of our proposal, yielding an improvement of approximately +2.8% in NDS based on the nuScenes benchmark.
... However, mainstream CNN contains millions of parameters and floating-point operations (FLOPs), which significantly limits their deployment on resource-constrained devices. In recent years, model compression techniques, such as network pruning [5][6][7][8][9][10], low-rank weight approximation [11][12][13], weight quantization [14][15][16], and knowledge distillation [17][18][19], have emerged as active research fields. Among them, pruning aims to extract sub-network from the original network, and maintains model performance comparable to the original, thereby effectively reducing the parameters and computational overhead of the model. ...
Article
Full-text available
Channel pruning can effectively compress Convolutional Neural Networks (CNNs) for deployment on edge devices. Most existing pruning methods are data-driven, relying heavily on datasets and necessitating fine-tuning the pruned models for several epochs. However, data privacy protection increases the difficulty of getting a dataset, making data inaccessible in some scenarios. Inaccessible datasets render current pruning methods infeasible. To solve this issue, we propose a fine-grained data-free CNN pruning method that does not require data. It involves filter reconstruction and feature reconstruction. To reduce the number of kernels in each filter, we group the kernels in each filter based on the similarity of kernels and calculate a representative kernel for each group to reconstruct the filters. During inference, we conduct feature reconstruction to match the input channels of the reconstructed filters so as to satisfy the operational criteria of convolutional neural networks. We validate the effectiveness of our method through extensive experiments using ResNet, MobileNet, and VGG on CIFAR and ImageNet datasets. For MobileNet-V2, we obtain a FLOPs reduction of 53.2% with only a Top-1 accuracy reduction of 0.64% without fine-tuning on ImageNet.
... Present investigations can be bifurcated into two principal streams: knowledge distillation and knowledge transfer. Via knowledge distillation, the expertise of a large model is conferred upon a more compact model, ensuring the latter maintains commendable performance while remaining computationally efficient [26]. By integrating both local and global modality information and applying distillation techniques, Bao et al. improved the feature representation of video understanding models, thereby achieving greater accuracy in temporal localization tasks [16]. ...
Article
Full-text available
Given an untrimmed video and a natural language query, the video temporal grounding task aims to accurately locate the target segment within the video. Functioning as a critical conduit between computer vision and natural language processing, this task holds profound importance in advancing video comprehension. Current research predominantly centers on enhancing the performance of individual models, thereby overlooking the extensive possibilities afforded by multi-model synergy. While knowledge flow methods have been adopted for multi-model and cross-modal collaborative learning, several critical concerns persist, including the unidirectional transfer of knowledge, low-quality pseudo-label generation, and gradient conflicts inherent in cooperative training. To address these issues, this research proposes a Multi-Model Collaborative Learning (MMCL) framework. By incorporating a bidirectional knowledge transfer paradigm, the MMCL framework empowers models to engage in collaborative learning through the interchange of pseudo-labels. Concurrently, the mechanism for generating pseudo-labels is optimized using the CLIP model’s prior knowledge, bolstering both the accuracy and coherence of these labels while efficiently discarding extraneous temporal fragments. The framework also integrates an iterative training algorithm for multi-model collaboration, mitigating gradient conflicts through alternate optimization and achieving a dynamic balance between collaborative and independent learning. Empirical evaluations across multiple benchmark datasets indicate that the MMCL framework markedly elevates the performance of video temporal grounding models, exceeding existing state-of-the-art approaches in terms of mIoU and Rank@1. Concurrently, the framework accommodates both homogeneous and heterogeneous model configurations, demonstrating its broad versatility and adaptability. This investigation furnishes an effective avenue for multi-model collaborative learning in video temporal grounding, bolstering efficient knowledge dissemination and charting novel pathways in the domain of video comprehension.
... Unlike traditional homogeneous FL, several clients expect to design local models themselves, which brings about model heterogeneous FL. Most current approaches for heterogeneity [42], [43], [44], [45] leverage knowledge distillation technique [46] to facilitate communication between clients without direct data or parameter sharing. FedMD [20] enables heterogeneous models to interact by learning the average aggregated soft labels across all clients. ...
Preprint
This paper studies a challenging robust federated learning task with model heterogeneous and data corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments. Our code and models are publicly available at https://github.com/FangXiuwen/RAHFL.
... In this work, we propose a more effective approach to learn KMs, by relying on distillation (Hinton, 2015), where the model's output is trained to match the output of a teacher network that can have access to additional information or knowledge. Our idea is based on Context Distillation (Snell et al., 2023), originally proposed to internalize task instructions or reasoning steps into a LM. ...
Preprint
Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.
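A hedged sketch of the objective the abstract describes: a student that does not see the document is trained to match the logits and last hidden states of a teacher that does take the document in context. All names are illustrative, and both outputs are assumed to be already aligned on the same target tokens.

```python
# Hedged sketch of a context-distillation objective (logit KL + hidden-state MSE).
import torch.nn.functional as F

def deep_context_distillation_loss(student_out, teacher_out, beta=1.0):
    # student_out / teacher_out expose .logits [B, L, V] and .hidden_states,
    # assumed aligned on the same target token positions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits.detach(), dim=-1),
        reduction="batchmean",
    )
    hidden = F.mse_loss(student_out.hidden_states[-1],
                        teacher_out.hidden_states[-1].detach())
    return kl + beta * hidden
```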
... Moreover, the proxy task-learner may learn distinct visual features from those of the task-learner. To align their behavior, we establish teacher-student learning between the target and proxy task-learners, employing knowledge-distillation loss (Hinton et al., 2015) where the target task-learner acts as the teacher network. This process involves minimizing the mean-squared error between the corresponding logits generated by the target and proxy task-learner. ...
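The alignment step quoted above reduces to a single mean-squared-error term between the two classifiers' logits; a minimal sketch, with the target task-learner treated as a detached teacher.

```python
# Minimal sketch: MSE between proxy and target task-learners' logits.
import torch.nn.functional as F

def proxy_alignment_loss(proxy_logits, target_logits):
    return F.mse_loss(proxy_logits, target_logits.detach())
```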
Preprint
Full-text available
Active learning aims to select optimal samples for labeling, minimizing annotation costs. This paper introduces a unified representation learning framework tailored for active learning with task awareness. It integrates diverse sources, comprising reconstruction, adversarial, self-supervised, knowledge-distillation, and classification losses into a unified VAE-based ADROIT approach. The proposed approach comprises three key components - a unified representation generator (VAE), a state discriminator, and a (proxy) task-learner or classifier. ADROIT learns a latent code using both labeled and unlabeled data, incorporating task-awareness by leveraging labeled data with the proxy classifier. Unlike previous approaches, the proxy classifier additionally employs a self-supervised loss on unlabeled data and utilizes knowledge distillation to align with the target task-learner. The state discriminator distinguishes between labeled and unlabeled data, facilitating the selection of informative unlabeled samples. The dynamic interaction between VAE and the state discriminator creates a competitive environment, with the VAE attempting to deceive the discriminator, while the state discriminator learns to differentiate between labeled and unlabeled inputs. Extensive evaluations on diverse datasets and ablation analysis affirm the effectiveness of the proposed model.
... We aim to retain this performance in smaller, faster models. Drawing inspiration from knowledge distillation [6,12], we propose a method to transfer the performance of a Skelite model into a more compact version. As shown in Figure 3, our distillation strategy takes advantage of the iterative nature of the skeletonization process, providing supervision at each step. ...
Preprint
Full-text available
Skeletonization extracts thin representations from images that compactly encode their geometry and topology. These representations have become an important topological prior for preserving connectivity in curvilinear structures, aiding medical tasks like vessel segmentation. Existing compatible skeletonization algorithms face significant trade-offs: morphology-based approaches are computationally efficient but prone to frequent breakages, while topology-preserving methods require substantial computational resources. We propose a novel framework for training iterative skeletonization algorithms with a learnable component. The framework leverages synthetic data, task-specific augmentation, and a model distillation strategy to learn compact neural networks that produce thin, connected skeletons with a fully differentiable iterative algorithm. Our method demonstrates a 100 times speedup over topology-constrained algorithms while maintaining high accuracy and generalizing effectively to new domains without fine-tuning. Benchmarking and downstream validation in 2D and 3D tasks demonstrate its computational efficiency and real-world applicability.
... However, unregulated adapter tuning may still lead to over-adaptation. We thus integrate knowledge distillation (KD) [27] to regularize adapter updates, suppressing feature drift and preserving stability. Experiments validate that the adapter-KD synergy balances stability-plasticity while enhancing task accuracy (see Section IV-F-3). ...
Preprint
Class-incremental learning (CIL) for time series data faces critical challenges in balancing stability against catastrophic forgetting and plasticity for new knowledge acquisition, particularly under real-world constraints where historical data access is restricted. While pre-trained models (PTMs) have shown promise in CIL for vision and NLP domains, their potential in time series class-incremental learning (TSCIL) remains underexplored due to the scarcity of large-scale time series pre-trained models. Prompted by the recent emergence of large-scale pre-trained models (PTMs) for time series data, we present the first exploration of PTM-based Time Series Class-Incremental Learning (TSCIL). Our approach leverages frozen PTM backbones coupled with incrementally tuning the shared adapter, preserving generalization capabilities while mitigating feature drift through knowledge distillation. Furthermore, we introduce a Feature Drift Compensation Network (DCN), designed with a novel two-stage training strategy to precisely model feature space transformations across incremental tasks. This allows for accurate projection of old class prototypes into the new feature space. By employing DCN-corrected prototypes, we effectively enhance the unified classifier retraining, mitigating model feature drift and alleviating catastrophic forgetting. Extensive experiments on five real-world datasets demonstrate state-of-the-art performance, with our method yielding final accuracy gains of 1.4%-6.1% across all datasets compared to existing PTM-based approaches. Our work establishes a new paradigm for TSCIL, providing insights into stability-plasticity optimization for continual learning systems.
... These outputs can be biased or miscalibrated (Guo et al., 2017). Despite this miscalibration, it has been demonstrated that soft labels improve the generalisation of neural networks (Hinton et al., 2015; Peterson et al., 2019; Uma et al., 2020; Grossmann et al., 2022). Therefore, we propose to use soft labels for aggregating the predictions of an ensemble. ...
Preprint
Full-text available
Ensembling in deep learning improves accuracy and calibration over single networks. The traditional aggregation approach, ensemble averaging, treats all individual networks equally by averaging their outputs. Inspired by crowdsourcing we propose an aggregation method called soft Dawid Skene for deep ensembles that estimates confusion matrices of ensemble members and weighs them according to their inferred performance. Soft Dawid Skene aggregates soft labels in contrast to hard labels often used in crowdsourcing. We empirically show the superiority of soft Dawid Skene in accuracy, calibration and out of distribution detection in comparison to ensemble averaging in extensive experiments.
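For reference, the ensemble-averaging baseline that the abstract compares against simply averages the members' soft labels; a minimal sketch is below (soft Dawid Skene would instead weight members according to their inferred confusion matrices).

```python
# Minimal sketch of the ensemble-averaging baseline over soft labels.
import torch
import torch.nn.functional as F

def average_soft_labels(member_logits):
    # member_logits: list of [batch, num_classes] tensors, one per member.
    soft = [F.softmax(logits, dim=1) for logits in member_logits]
    return torch.stack(soft).mean(dim=0)
```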
... It extracts specific knowledge from a stronger model (i.e., "teacher") and transfers it to a weaker model (i.e., "student") through additional training signals. There has been a large body of work on transferring knowledge with the same modality, such as model compression [48], [49], [50] and domain adaptation [51], [52]. However, the data or labels for some modalities might not be available during training or testing, so it is essential to distill knowledge between different modalities [53], [54]. ...
Preprint
Visual place recognition is a challenging task for autonomous driving and robotics, which is usually considered as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most deep learning-based methods trained in an end-to-end manner cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in a one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
... Knowledge distillation (KD) (Kim & Rush, 2016; Hinton, 2015) is another prominent approach for developing models with the help of enlarged models, wherein a small student model learns to mimic a large teacher model. Recent research (Team et al., 2024; Liang et al., 2023; Gunter et al.) applies them together to obtain high-performance models from existing pretrained models. ...
Preprint
Full-text available
Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when the model is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarged model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method helps to mitigate the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enables effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. It demonstrates that the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.
... CK4Gen, specifically designed for survival data, employs an autoencoder with clustering to identify latent risk profiles. However, its multi-step training pipeline, reliance on a pre-trained CoxPH model for knowledge distillation [87], and lack of end-to-end training make it complex and inflexible. Like VAEs and GANs, it cannot efficiently generate subgroup-specific data without retraining. ...
Preprint
Full-text available
Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advancements in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods like SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15% and reduced the mean calibration loss by 9% across 10 clinically stratified subgroups, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.
... Knowledge Distillation (Hinton et al., 2015) transfers knowledge from a high-capacity teacher model to a lower-capacity student model. Knowledge Distillation has been employed for domain adaptation, as discussed by Farahani et al. (2020). ...
Preprint
Full-text available
Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data. Prior research has primarily focused on learning domain-invariant features across the source and target domains. However, these methods often require training a model using source domain data, which is time-consuming and can limit model usage for applications with different source data. This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation without the need of source model training, followed by a novel similarity-based knowledge distillation loss. Our extensive experiments on cross-domain text classification reveal that our framework achieves impressive performance, specifically, 2.44% accuracy improvement when compared to the SOTA method.
... The tensor-based model outperforms matrix-based methods in predicting user preferences, especially in cold-start scenarios [86]. It enhances personalization by capturing contextual influences on user behavior [87]. ...
Preprint
Full-text available
Tensor decomposition has emerged as a fundamental tool in machine learning, enabling efficient representation, compression, and interpretation of high-dimensional data. Unlike traditional matrix factorization methods, tensor decomposition extends to multi-way data structures, capturing complex relationships and latent patterns that would otherwise remain hidden. This paper provides a comprehensive overview of tensor decomposition techniques, including CANDECOMP/PARAFAC (CP), Tucker, and Tensor Train (TT) decompositions, and their applications in various machine learning domains. We explore optimization strategies, computational challenges, and real-world case studies demonstrating the effectiveness of tensor methods in areas such as natural language processing, recommender systems, deep learning compression, and biomedical informatics. Furthermore, we discuss emerging trends and future research directions, including the integration of tensor decomposition with deep learning, scalability improvements, and applications in quantum computing. As machine learning continues to evolve, tensor decomposition is poised to play an increasingly critical role in data-driven discovery and model interpretability.
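As a toy illustration of the CP decomposition discussed in this survey abstract, the following NumPy sketch runs a few alternating least squares (ALS) sweeps to factor a small 3-way tensor into rank-R components; the tensor size, rank, and iteration count are arbitrary illustrative choices, not values from the cited work.

```python
import numpy as np

def unfold(T, mode):
    """Matricize a tensor along one mode (rows = fibers of that mode)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(U, V):
    """Column-wise Kronecker product; rows indexed by (row of U, row of V)."""
    r = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, r)

def cp_als(X, rank, n_iter=50, seed=0):
    """A few ALS sweeps fitting a rank-R CP model to a 3-way tensor X."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((dim, rank)) for dim in X.shape)
    for _ in range(n_iter):
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Illustrative check: factor a synthetic rank-3 tensor and measure the fit.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((d, 3)) for d in (8, 9, 10))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=3)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```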
... However, in practical clinical settings, acquiring multiple modalities of medical images simultaneously (for example, the BraTS dataset series for brain tumor segmentation, which includes FLAIR, T1, T1ce, and T2 MRI modalities) is challenging, and missing modalities are common. Knowledge distillation [30] has emerged as one of the most popular approaches for medical image segmentation in the presence of missing modalities, as it has demonstrated significant effectiveness in transferring knowledge across modalities [31]–[34]. Inspired by the knowledge distillation approach, we use the combination of the 7T-specific encoder and decoder as our teacher network and the combination of the 3T-specific encoder and decoder as our student network. ...
Article
Full-text available
Accurate segmentation of hippocampal subfields in MRI scans is crucial for aiding the diagnosis of various neurological diseases and for monitoring brain states. However, due to limitations of imaging systems and the inherent complexity of hippocampal subfield delineation, achieving accurate hippocampal subfield delineation on routine 3T MRI is highly challenging. In this paper, we propose a novel Guided Learning Network (GLNet) that leverages 7T MRI to enhance the accuracy of hippocampal subfield segmentation on routine 3T MRI. GLNet aligns and learns shared features between 3T and 7T MRI through a modeling approach based on domain-specific and domain-shared feature learning, using 7T MRI features to guide the learning of 3T MRI features. In this process, we also introduce a Multi-Feature Attention Fusion (MFAF) block to integrate both specific and shared features from each modality. By leveraging an attention mechanism, MFAF adaptively focuses on relevant information between the specific and shared features within the same modality, thereby reducing the impact of irrelevant information. Additionally, we propose an Online Knowledge Distillation (OLKD) method to distill detailed knowledge from 7T MRI into 3T MRI, enhancing the feature representation capability and robustness of the 3T MRI segmentation model. Our method was validated on the PAIRED 3T-7T HIPPOCAMPAL SUBFIELD DATASET, and the experimental results demonstrate that GLNet outperforms other competitive methods.
... The COMEDIAN approach combines self-supervised learning and knowledge distillation [15] to initialize spatiotemporal transformers for action spotting. Knowledge Distillation (KD) is a model compression technique where a smaller, simpler model (called the student) is trained to replicate the behavior of a larger, more complex model (called the teacher). ...
Article
Full-text available
In the rapidly advancing field of computer vision, the application of multimodal models—specifically, vision-language frameworks—has shown substantial promise for complex tasks such as video-based action spotting. This paper introduces Soccer-CLIP, a vision-language model specially designed for soccer action spotting. Soccer-CLIP incorporates an innovative domain-specific prompt engineering strategy, leveraging large language models (LLMs) to refine textual representations for precise alignment with soccer-specific actions. Our model integrates both visual and textual features to enhance recognition accuracy of critical soccer events. With the temporal augmentation techniques devised for input videos, Soccer-CLIP builds upon existing methodologies to address the inherent challenges of temporally sparse event annotations within video sequences. Evaluations on the SoccerNet Action Spotting benchmark demonstrate that Soccer-CLIP outperforms previous state-of-the-art models, demonstrating our model’s capacity to capture domain-specific contextual nuances. This work represents a significant advancement in automated sports analysis, providing a robust and adaptable framework for broader applications in video recognition and temporal action localization tasks.
... Knowledge distillation is a model compression technique that transfers knowledge from a large, complex neural network (the "teacher") to a smaller, more efficient "student" network, enabling the creation of lightweight models that retain the predictive power of their larger counterparts while significantly reducing computational and memory demands [13], [14]. The training process combines standard supervised learning with a distillation objective, aligning the student's outputs with the teacher's softened probabilities, which carry nuanced class relationship information [15]. ...
Preprint
Superconducting qubits are among the most promising candidates for building quantum information processors. Yet, they are often limited by slow and error-prone qubit readout -- a critical factor in achieving high-fidelity operations. While current methods, including deep neural networks, enhance readout accuracy, they typically lack support for mid-circuit measurements essential for quantum error correction, and they usually rely on large, resource-intensive network models. This paper presents KLiNQ, a novel qubit readout architecture leveraging lightweight neural networks optimized via knowledge distillation. Our approach achieves around a 99% reduction in model size compared to the baseline while maintaining a qubit-state discrimination accuracy of 91%. KLiNQ facilitates rapid, independent qubit-state readouts that enable mid-circuit measurements by assigning a dedicated, compact neural network for each qubit. Implemented on the Xilinx UltraScale+ FPGA, our design can perform the discrimination within 32ns. The results demonstrate that compressed neural networks can maintain high-fidelity independent readout while enabling efficient hardware implementation, advancing practical quantum computing.
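The citing excerpt above describes the standard distillation objective: a supervised cross-entropy term combined with a term that aligns the student's outputs to the teacher's temperature-softened probabilities. Below is a minimal PyTorch sketch of that generic objective; the temperature, mixing weight, and toy tensors are illustrative assumptions, not values taken from any of the cited works.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD: hard-label cross-entropy plus KL on softened outputs.

    T (temperature) and alpha (mixing weight) are illustrative values here.
    """
    # Softened distributions; the T**2 factor keeps gradient magnitudes
    # comparable across temperatures, as in the original formulation.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Toy usage with random logits for a batch of 8 examples and 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```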
... Model quantization [589] is a technique employed to reduce model size, offering faster inference at the cost of some accuracy. Other techniques, such as pruning redundant weights [590] and knowledge distillation [591], serve similar purposes. Further experimentation is needed to explore the trade-off between reasoning capability, model size, and inference speed. ...
Preprint
As the potential for autonomous vehicles to be integrated on a large scale into modern traffic systems continues to grow, ensuring safe navigation in dynamic environments is crucial for smooth integration. To guarantee safety and prevent collisions, autonomous vehicles must be capable of accurately predicting the trajectories of surrounding traffic agents. Over the past decade, significant efforts from both academia and industry have been dedicated to designing solutions for precise trajectory forecasting. These efforts have produced a diverse range of approaches, raising questions about the differences between these methods and whether trajectory prediction challenges have been fully addressed. This paper reviews a substantial portion of recent trajectory prediction methods and devises a taxonomy to classify existing solutions. A general overview of the prediction pipeline is also provided, covering input and output modalities, modeling features, and prediction paradigms discussed in the literature. In addition, the paper discusses active research areas within trajectory prediction, addresses the posed research questions, and highlights the remaining research gaps and challenges.
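The citing excerpt above mentions model quantization [589] as a technique for reducing model size and speeding up inference at some accuracy cost. The following NumPy sketch shows post-training 8-bit affine quantization of a single weight tensor; the bit width, per-tensor granularity, and toy weights are simplifying assumptions for illustration, not details from any cited work.

```python
import numpy as np

def quantize_affine_int8(w):
    """Per-tensor affine (asymmetric) quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)
    scale = max(scale, 1e-12)                     # guard against constant tensors
    zero_point = int(round(qmin - w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Toy usage: quantize random weights and report the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale, zp = quantize_affine_int8(w)
w_hat = dequantize(q, scale, zp)
print("max abs error:", float(np.abs(w - w_hat).max()))
```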
... A wide swath has high coverage and low capture latency, so we reasonably do not consider scenarios where compute latency exceeds the frame deadline. Prior techniques on model compression [40], quantization [38], and tiling [31] reduce compute latency and are complementary. ...
Preprint
Full-text available
Advancements in nanosatellite technology lead to more Earth-observation satellites in low-Earth orbit. We explore using nanosatellite constellations to achieve low-latency detection for time-critical events, such as forest fires, oil spills, and floods. The detection latency comprises three parts: capture, compute and transmission. Previous solutions reduce transmission latency, but we find that the bottleneck is capture latency, accounting for more than 90% of end-to-end latency. We present a measurement study on how various satellite and ground station design factors affect latency. We offer design guidance to operators on how to choose satellite orbital configurations and design an algorithm to choose ground station locations. For six use cases, our design guidance reduces end-to-end latency by 5.6 to 8.2 times compared to the existing system.
Article
Full-text available
Coke image segmentation is a crucial step in coke particle size control of the sintering process. However, due to the complexity of model architecture and the dense distribution of coke particles in the images, existing segmentation methods fail to satisfy the efficiency and accuracy requirements for coke image segmentation in industrial scenarios. To address these challenges, this paper proposes a two-stage distillation-aware adaptive segment anything model, referred to as TsDa-ASAM, to balance efficiency and accuracy in coke image particle size segmentation. In the first stage, knowledge distillation methods are employed to distill the Segment Anything Model (SAM) into a lightweight model, explicitly focusing on enhancing segmentation efficiency. In the second stage, a domain knowledge injection strategy is formulated, which incorporates domain knowledge into the distillation model to effectively enhance the accuracy. Moreover, an adaptive prompt point selection algorithm is introduced to address the redundancy issue of prompt points in SAM, improving the efficiency of TsDa-ASAM. The effectiveness of TsDa-ASAM is validated through extensive experiments on the publicly available dataset SA-1B and the coke image dataset from industrial sites. After distillation and fine-tuning, the segmentation accuracy of the proposed model improved by 10%, and the segmentation efficiency of TsDa-ASAM was enhanced by 2 to 3 times with the integration of the adaptive prompt point selection algorithm. The experimental results demonstrate the potential of the proposed model in balancing accuracy and efficiency.
Article
Full-text available
X-ray crystallography reconstruction, which transforms discrete X-ray diffraction patterns into three-dimensional molecular structures, relies critically on accurate Bragg peak finding for structure determination. As X-ray free electron laser (XFEL) facilities advance toward MHz data rates (1 million images per second), traditional peak finding algorithms that require manual parameter tuning or exhaustive grid searches across multiple experiments become increasingly impractical. While deep learning approaches offer promising solutions, their deployment in high-throughput environments presents significant challenges in automated dataset labeling, model scalability, edge deployment efficiency, and distributed inference capabilities. We present an end-to-end deep learning pipeline with three key components: (1) a data engine that combines traditional algorithms with our peak matching algorithm to generate high-quality training data at scale, (2) a modular architecture that scales from a few million to hundreds of millions of parameters, enabling us to train large expert-level models offline while deploying smaller, distilled models at the edge, and (3) a decoupled producer-consumer architecture that separates the specialized data-source layer from model inference, enabling flexible deployment across diverse computing environments. Using this integrated approach, our pipeline achieves accuracy comparable to traditional methods tuned by human experts while eliminating the need for experiment-specific parameter tuning. Although current throughput requires optimization for MHz facilities, our system's scalable architecture and demonstrated model compression capabilities provide a foundation for future high-throughput XFEL deployments.