Applied Intelligence (2024) 54:11804–11844
https://doi.org/10.1007/s10489-024-05747-w
A comprehensive review of model compression techniques in machine
learning
Pierre Vilar Dantas1
·Waldir Sabino da Silva Jr1
·Lucas Carvalho Cordeiro2
·Celso Barbosa Carvalho1
Accepted: 5 August 2024 / Published online: 2 September 2024
© The Author(s) 2024
Abstract
This paper critically examines model compression techniques within the machine learning (ML) domain, emphasizing their
role in enhancing model efficiency for deployment in resource-constrained environments, such as mobile devices, edge com-
puting, and Internet of Things (IoT) systems. By systematically exploring compression techniques and lightweight design
architectures, this review provides a comprehensive understanding of their operational contexts and effectiveness. The synthesis of
these strategies reveals a dynamic interplay between model performance and computational demand, highlighting the bal-
ance required for optimal application. As ML models grow increasingly complex and data-intensive, the
demand for computational resources and memory has surged accordingly. This escalation presents significant challenges for
the deployment of artificial intelligence (AI) systems in real-world applications, particularly where hardware capabilities are
limited. Therefore, model compression techniques are not merely advantageous but essential for ensuring that these models can
be utilized across various domains, maintaining high performance without prohibitive resource requirements. Furthermore,
this review underscores the importance of model compression in sustainable AI development. The
introduction of hybrid methods, which combine multiple compression techniques, promises to deliver superior performance
and efficiency. Additionally, the development of intelligent frameworks capable of selecting the most appropriate compression
strategy based on specific application needs is crucial for advancing the field. The practical examples and engineering applica-
tions discussed demonstrate the real-world impact of these techniques. By optimizing the balance between model complexity
and computational efficiency, model compression ensures that the advancements in AI technology remain sustainable and
widely applicable. This comprehensive review thus contributes to the academic discourse and guides innovative solutions for
efficient and responsible machine learning practices, paving the way for future advancements in the field.
Keywords Lightweight design approaches ·Neural network compression ·Architectural innovations ·Computational
efficiency ·Model generalization ·Technological evolution in machine learning
1 Introduction
1.1 Background and significance of machine
learning (ML) and deep learning (DL)
The evolution of ML and deep learning (DL) has been punc-
tuated by a series of landmark models and technologies,
each representing a significant leap in the field. The percep-
tron, developed in 1958 [1], laid the early groundwork for
deep neural networks (DNNs) in pattern recognition. In the
1990s, support vector machines (SVMs) [2] gained promi-
nence for their ability to handle high-dimensional data in
classification and regression tasks. Long short-term mem-
ory (LSTM) [3], introduced in 1997, became essential for
sequential data processing in language modeling and speech
recognition. LeNet-5, introduced in 1998 [4], was one of the first convolutional neural networks (CNNs), pioneering digit recognition and setting the stage for future CNN developments. (LeNet-5 itself is not an acronym but a name: the 'Le' is derived from one of its developers, Yann LeCun, the 'Net' refers to the fact that it is a neural network (NN), and the '5' denotes that this was the fifth iteration of the model.) In the early 2000s, ensemble methods such as random forest emerged [5, 6], enhancing the capabilities of classification and regression. Deep belief network (DBN), unveiled in 2006 [7], reignited interest in DNNs,
ushering in the modern era of DL. A major milestone was
achieved with the advent of AlexNet in 2012 [8], a DNN
that dominated the ImageNet challenge and brought DL into
the AI spotlight. The development of generative adversarial
networks (GANs) in 2014 [9] introduced a novel genera-
tive modeling approach, impacting unsupervised learning
and image generation. The introduction of the transformer
model in 2017 [10], and subsequently bidirectional encoder
representations from transformers (BERT) in 2018 [11], rev-
olutionized natural language processing (NLP), setting new
performance benchmarks and highlighting the significance
of context in language understanding. These milestones not
only mark critical points in AI but also showcase the diverse
methodologies and increasing sophistication in ML and DL.
Numerous practical advantages have been offered by DL,
revolutionizing various fields. One of the primary benefits
is its ability to automatically extract features from raw data,
significantly reducing the need for manual feature engineer-
ing. This capability is particularly impactful in domains
with complex data structures, such as image and speech
recognition, where traditional methods struggle to achieve
high accuracy [7,8]. In healthcare, DL models are used
for analyzing medical images to detect diseases like can-
cer, providing early and accurate diagnoses that are crucial
for effective treatment [8]. For instance, CNNs have been
successfully applied to mammography images to identify
breast cancer with higher precision than traditional meth-
ods [8]. In industrial applications, DL enhances quality
control processes by detecting defects in products on assem-
bly lines, thus improving efficiency and reducing waste.
Additionally, in the automotive industry, DL is a corner-
stone of autonomous driving technology, enabling vehicles
to interpret and respond to their environment in real-time.
Beyond these specialized applications, DL also impacts the
daily lives of citizens through various consumer technolo-
gies. Mobile phone applications, such as virtual assistants
(e.g., Siri, Google Assistant), rely on DL to understand and
process natural language commands, providing users with
convenient and hands-free interaction with their devices [12].
Furthermore, facial recognition technology, powered by DL,
is used for secure authentication in smartphones, enhancing
both security and user experience [12]. Personalized recom-
mendations on platforms like Netflix, Amazon, and Spotify
also utilize DL algorithms to analyze user behavior and
preferences, delivering tailored content and improving user
satisfaction [13]. These practical examples highlight how DL
not only pushes the boundaries of AI but also provides sig-
nificant improvements and solutions to real-world problems
across various sectors, making everyday life more efficient
and convenient.
The expanding frontiers of ML and DL expose a paradox-
ical combination of advancement and limitation [14, 15].
The exponential growth in training compute for large-scale
ML and DL models since the early 2010s marks a signifi-
cant evolution in computational technology [16]. Surpassing
the traditional bounds of Moore’s law, training compute has
doubled approximately every six months, introducing the
large-scale era around late 2015 [17]. This era, marked by the
need for 10 to 100 times more compute power for ML mod-
els, has significantly increased the demand for computational
resources and expertise in advanced ML systems [14–16,
18]. One of the most noted manifestations of this growth is
the expansion of the largest dense models in DL. Since the start of the 2020s, these models have grown from one hundred mil-
lion parameters to over one hundred billion, primarily due to
advancements in system technology, including model paral-
lelism, pipeline parallelism, and zero redundancy optimizer
(ZeRO) optimization techniques [17]. These advances have made it possible to train larger and more capable models, reshaping how ML is practiced. As computational capabilities continue
to expand, the increase in graphics processing unit (GPU)
memory, from 16 GB to 80 GB, struggles to keep pace with
the exponential growth in the computational demands of ML
models [14]. This gap tests the limits of current hardware
and magnifies the importance of more efficient utilization
of available resources. The integration of ML with high-
performance computing (HPC), or high-performance data
analytics (HPDA), has been pivotal in this context, enabling
faster ML algorithms and facilitating the resolution of more
complex problems [14,16,18]. Advanced techniques like
DeepSpeed [15] and ZeRO-Infinity [17] further demonstrate
how innovative system optimizations can push the bound-
aries of DL model training.
Nevertheless, the continued increase in model size and
complexity underscores the need for model optimization [19–
23]. Compressing ML models emerges as a vital approach,
reducing the disparity between escalating computational
demands and inadequate memory expansion by adjusting
models to be compressed without significantly affecting per-
formance. This approach encompasses several techniques, such as pruning, quantization, and knowledge distillation [24–29]. Model
compression not only addresses the challenge of deploying
AI systems in resource-constrained environments, such as
mobile devices and embedded systems, but also improves
the efficiency and speed of these models, making them more
accessible and scalable. For instance, in mobile applica-
tions, compressed models enable faster inference times and
lower power consumption, which are critical for enhancing
user experience and extending battery life. Additionally, in
edge computing scenarios, where computational resources
are limited, compressed models facilitate real-time data pro-
cessing and decision-making, enabling a wide range of
applications from smart home devices to autonomous drones.
In essence, model compression becomes not just a beneficial
strategy, but a necessity for the practical deployment of AI
systems, particularly in environments where resources are
inherently limited. By optimizing the balance between model
complexity and computational efficiency, model compres-
sion ensures that the advancements in AI technology remain
sustainable and widely applicable across various domains
and industries [24–29].
In conclusion, the rapid advancement in training com-
pute for DL models marks a remarkable era of technological
progress, juxtaposed with significant challenges. The stark
contrast between the exponential demands of these models
and the more modest growth in GPU memory capacity under-
scores a pivotal issue in the field of ML. It is this imbalance
that necessitates innovative approaches like model compres-
sion and system optimization. As we proceed, this paper will
delve deeper into these challenges, exploring the intricacies
of model compression techniques and their critical role in
optimizing large-scale ML and DL models. We will exam-
ine how these techniques not only address the limitations of
current hardware but also open new avenues for efficient,
practical deployment of AI systems in various real-world
scenarios.
1.2 Main contributions and novelty
This paper makes significant contributions to the field of
model compression techniques in ML, focusing on their
applicability and effectiveness in resource-constrained envi-
ronments such as mobile devices, edge computing, and
internet of things (IoT) systems. The main contributions of
this paper are as follows:
1. Comprehensive review of model compression tech-
niques: we provide an in-depth review of various model
compression strategies, including pruning, quantization,
low-rank factorization, knowledge distillation, trans-
fer learning, and lightweight design architectures. This
review not only covers the theoretical underpinnings
of these techniques but also evaluates their practical
implementations and effectiveness in different opera-
tional contexts.
2. Highlighting the balance between performance and
computational demand: our synthesis reveals the dyna-
mic interplay between model performance and compu-
tational requirements. We emphasize the necessity for a
balanced approach that optimizes both aspects, crucial
for the sustainable development of AI.
3. Identification of research gaps: by examining the cur-
rent state of model compression research, we identify
critical gaps, highlighting the need for more research on
integrating digital twins, physics-informed residual net-
works (PIResNet), advanced data-driven methods like
gated recurrent units for better predictive maintenance
of industrial components, predictive maintenance using
DL in smart manufacturing, and reinforcement learning
(RL) in supply chain optimization.
4. Future research directions: the paper advocates for
future studies to focus on hybrid compression methods
that combine multiple techniques for enhanced effi-
ciency. Additionally, we suggest the development of
autonomous selection frameworks that can intelligently
choose the most suitable compression strategy based on
the specific requirements of the application.
5. Practical examples and applications: to bridge the gap
between theory and practice, we provide practical exam-
ples and case studies demonstrating the application of
model compression techniques in real-world scenarios.
These examples illustrate how model compression can
lead to significant improvements in computational effi-
ciency without compromising model accuracy.
6. Innovative solutions for efficient ML: we propose
innovative solutions for improving the efficiency and
effectiveness of ML models in resource-constrained envi-
ronments. This includes the development of lightweight
model architectures and the integration of advanced com-
pression techniques to facilitate the deployment of ML
models in practical, real-world applications.
The novelty of this paper lies in its approach to under-
standing and advancing model compression techniques. By
synthesizing existing knowledge and identifying critical
research gaps, we provide a comprehensive roadmap for
future research in this domain. Our focus on practical applica-
tions and innovative solutions further enhances the relevance
and impact of this work, making it a valuable resource for
both researchers and practitioners in the field of ML.
1.3 Emerging areas and research gaps
The existing literature provides a comprehensive overview
of various model compression techniques and their applica-
tions across different domains. However, there is a noticeable
gap in addressing the specific challenges and advancements
in many ML applications. Key areas where further research
is necessary and emerging areas that leverage the latest
developments in ML and model compression techniques to
enhance performance, efficiency, and reliability include:
1. Digital twin-driven intelligent systems: digital twins,
virtual replicas of physical systems, offer significant
potential for real-time monitoring and predictive mainte-
nance. Current research lacks a thorough exploration of
how digital twins can be integrated with advanced ML
models. Integrating model compression techniques can
further enhance their efficiency by reducing the compu-
tational burden during real-time monitoring [30–33].
2. PIResNet: traditional ML models have been extensively
studied, but the incorporation of physical laws into these
models, such as in PIResNet, remains underexplored.
This approach can enhance model accuracy and reliabil-
ity by embedding domain-specific knowledge. Applying
model compression techniques can optimize PIResNet
for deployment in resource-constrained environments
without sacrificing diagnostic accuracy [34–36].
3. Gated recurrent units (GRU): there is a need for inno-
vative data-driven approaches that leverage multi-scale
fused features and advanced recurrent units, like GRUs.
Existing studies often focus on conventional methods,
missing the potential benefits of these sophisticated tech-
niques. Incorporating model compression techniques can
further enhance the applicability of these approaches by
reducing memory and computational requirements [37–
40].
4. Predictive maintenance using DL in smart manu-
facturing: predictive maintenance involves using ML
models to predict equipment failures before they occur,
allowing for timely maintenance and reducing downtime.
Current research gaps include optimizing DL models
for deployment in smart factories by integrating them
with IoT devices for continuous monitoring and real-
time analysis. Applying model compression techniques
can make these models more efficient for real-time data
processing [41,42].
5. RL in supply chain optimization: RL algorithms learn
optimal policies through interactions with the environ-
ment, making them well-suited for dynamic and complex
systems like supply chains. Current research gaps include
optimizing various aspects such as inventory manage-
ment, demand forecasting, and logistics by simulating
different scenarios and learning from outcomes. To make
RL models more feasible for real-time application in sup-
ply chain operations, model compression techniques can
be utilized to reduce the model’s complexity and enhance
operational efficiency, facilitating faster decision-making
processes [43–45].
1.4 Material and methods
A systematic literature search was conducted across several
databases, including IEEE Xplore [46], ScienceDirect [47],
and Google Scholar [48]. Keywords related to model com-
pression techniques such as pruning, quantization, knowledge distillation, transfer learning, and lightweight model
design were used. The search was limited to papers published
in the last decade to ensure relevance and innovation in the
fields of ML and AI, supplemented by classical papers. Studies were
included based on the following criteria: detailed discussion
on model compression techniques; empirical evaluation of
compression methods on ML models; availability of perfor-
mance metrics like compression ratio, speedup, and accuracy
retention; and relevant real-world applications. Exclusion
criteria involved: papers not in English, reviews without
original research, and studies focusing solely on theoretical
aspects without empirical validation.
Data extracted from the selected studies included author
names, publication year, compression technique evaluated,
model used, datasets, performance metrics (e.g., compres-
sion ratio, inference speedup, accuracy), and key findings.
A thematic synthesis approach was used to categorize the
compression techniques and summarize their effectiveness
across different applications and model architectures. The
synthesis involved comparing and contrasting the effective-
ness of different model compression techniques, highlighting
their advantages and limitations. The impact of these tech-
niques on computational efficiency, model size reduction,
and performance metrics was analyzed to identify trends and
potential areas for future research.
1.5 Paper organization
The structure of this paper has been systematically designed
to guide the reader through a comprehensive exploration of
model compression techniques in ML. The sections are orga-
nized as follows:
Section 1. Introduction: the significance of model com-
pression in enhancing the efficiency of ML models, espe-
cially in resource-constrained environments, is introduced.
An overview of the main contributions and the novelty of
this paper is provided.
Section 2. Challenges in machine learning (ML) and
deep learning (DL): the historical context and evolution of
ML and DL are discussed, highlighting key milestones and
the exponential growth in computational demands.
Section 3. Common model compression approaches:
key model compression techniques such as pruning, quan-
tization, low-rank factorization, knowledge distillation, and
transfer learning are delved into. Detailed explorations of
each technique, including theoretical foundations, practical
implementation considerations, and their impact on model
performance, are provided.
Section 4. Lightweight model design and synergy with
model compression techniques: an overview of lightweight
model architectures, such as SqueezeNet, MobileNet, and
EfficientNet, is presented. The design principles and the
synergy with model compression techniques to achieve
enhanced efficiency and performance are discussed.
Section 5. Performance evaluation criteria: the crite-
ria for evaluating the performance of compressed models,
including metrics like compression ratio, speed-up rate, and
robustness metrics, are discussed. The importance of bal-
ancing model performance with computational demand is
emphasized.
Section 6. Model compression in various domains:
recent innovations in model compression are highlighted,
and case studies demonstrating the application of these
techniques in various domains are presented. The signifi-
cant improvements in computational efficiency achieved by
compressed models without compromising performance are
illustrated.
Section 7. Innovations in model compression and
performance enhancement: The applications of model
compression techniques across various fields are explored,
demonstrating their implementation in real-world scenar-
ios such as mobile devices, edge computing, IoT systems,
autonomous vehicles, and healthcare. Specific examples
illustrate the practical benefits and challenges of deploying
compressed models in these environments.
Section 8. Challenges, strategies, and future directions:
future research directions in model compression are out-
lined. Potential advancements and innovations that could
enhance the efficiency and applicability of model compres-
sion techniques are discussed, including hybrid methods
and autonomous selection frameworks. This section aims to
inspire further research to address existing gaps.
Section 9. Discussion: the findings from the compre-
hensive review of model compression techniques are syn-
thesized. The implications for future research and practical
applications are evaluated, research gaps are identified, and
future directions are suggested.
Section 10. Conclusion: the paper concludes with an
exploration of recent innovations in model compression and
performance enhancement. The ongoing advancements in the
field and the potential for future research to optimize ML
models are underscored.
Appendix A. Comprehensive summary of the refer-
ences used in this paper: summary of references used in
this paper, categorized by their specific application areas.
This table provides a comprehensive overview of the key
publications that have been referenced throughout the study,
offering insights into the foundational and recent advance-
ments in each area.
This organization ensures a logical progression from
introducing the importance of model compression to explor-
ing specific techniques, discussing their applications, and
concluding with future research directions. The structure
provides a clear roadmap for readers, facilitating a deeper
understanding of the topic.
2 Challenges in machine learning (ML) and
deep learning (DL)
2.1 Computational demands vs. computational
memory growth
A model serves as a mathematical construct that represents
a system or process. This construct is primarily used for the
purpose of prediction or decision-making based on the anal-
ysis of input data. Typically, DL models are DNNs, which
consist of numerous interconnected nodes or neurons. These
nodes collectively process incoming data to produce output
predictions or decisions, as depicted in Fig. 1. DL models can
be implemented using a variety of programming frameworks,
including TensorFlow [49] and PyTorch [50].
The training process of these models involves the use of
substantial datasets, aiming to refine their predictive accuracy
and enhance their generalization capabilities across unseen
data. The training of a DL model is a critical process where
large volumes of data are employed to iteratively adjust the
model’s internal parameters, such as weights and biases. This
adjustment process, known as backpropagation, involves the
computation of the gradient of the loss function (a measure of model error) with respect to the network's parameters.
The optimization of these parameters is executed through
algorithms like stochastic gradient descent (SGD), aiming to
minimize the loss function and thereby improve the model’s
performance. Various programming environments, such as
TensorFlow and PyTorch, provide sophisticated application
programming interfaces (APIs) that support the development
and training of complex DNN architectures. These environ-
ments also offer access to a range of pre-trained models,
which can be directly applied or further fine-tuned for tasks
in diverse domains, including image recognition, NLP, and
beyond.
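To make this training procedure concrete, the sketch below shows a minimal PyTorch training loop in which a small feed-forward network is fitted by backpropagation and SGD. The two-layer architecture, learning rate, and synthetic data are illustrative assumptions rather than a setup taken from the reviewed works.

```python
# Minimal sketch of the training process described above: a small
# feed-forward NN whose weights and biases are adjusted by
# backpropagation and stochastic gradient descent (SGD).
# Architecture and synthetic data are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(              # input layer -> hidden layer -> output layer
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()                          # loss function (model error)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(256, 20)                                 # synthetic input data
y = torch.randint(0, 2, (256,))                          # synthetic labels

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                          # forward pass
    loss.backward()                                      # backpropagation: gradient of the loss
    optimizer.step()                                     # SGD update of the parameters
```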
Fig. 1 A NN commonly used in DL scenarios. The illustration show-
cases the network’s architecture, highlighting the input layer, hidden
layers, and output layer. Each node represents a neuron, and the con-
nections between them indicate the pathways through which data flows
and weights are adjusted during training
2.2 Model size and complexity
Model compression in DL is a technique aimed at reducing
the size of a model without significantly compromising its
predictive accuracy. This process is vital in the context of
deploying DL models on resource-constrained devices, such
as mobile phones or IoT devices. By compressing a model,
it becomes feasible to utilize advanced DL capabilities in
environments where computational power and storage are
limited.
The performance of a DL model is fundamentally its
ability to make accurate predictions or decisions when
confronted with new, unseen data. This performance is
quantitatively measured through various metrics, including
accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC). The selec-
tion of these metrics is contingent upon the specific nature
of the problem being addressed and the type of model in use,
ensuring a comprehensive evaluation of the model’s effec-
tiveness in real-world applications.
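As a simple illustration of these metrics, the following sketch computes accuracy, precision, recall, and F1-score for a binary classification task; the label vectors are hypothetical examples, not results from the reviewed studies.

```python
# Illustrative computation of common binary-classification metrics
# (accuracy, precision, recall, F1-score) from hypothetical labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75 for this example
```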
We have conducted a comprehensive analysis that delin-
eates the performance retention, model size reduction, and
other critical dimensions across different compression tech-
niques. This comparison elucidates the nuanced distinctions
between the methods, counteracting the impression that all
techniques yield similar outcomes in performance mainte-
nance and size reduction. In Table 1is encapsulated the com-
parative analysis of these methods, addressing the strengths
and drawbacks of each. This table provides a nuanced view
of how each model compression method balances between
model size reduction and performance retention, along with
their computational efficiency and application suitability.
This comparison should clarify the unique attributes and
trade-offs of each model compression technique, offering a
more refined understanding of their individual and compar-
ative impacts [51–55].
Table 2 presents an overview of model compression
approaches applied across various ML application domains.
It summarizes the most suitable techniques for specific fields,
such as image and speech analysis, highlighting the bene-
fits and limitations of each approach. This comprehensive
comparison aims to illustrate the effectiveness of pruning,
quantization, low-rank factorization, knowledge distillation,
and transfer learning in reducing model size, retaining per-
formance, and enhancing computational efficiency.
2.3 Resource allocation and efficiency
Balancing model performance with computational demand is
a critical consideration in the development and deployment of
ML models, especially in resource-constrained environments
such as mobile devices, edge computing, and IoT systems.
This balance ensures that models are not only accurate but
also efficient enough to be deployed in real-world applica-
tions.
Achieving this balance involves making trade-offs between
the size, speed, and accuracy of the models. Techniques
such as pruning, quantization, low-rank factorization, and
knowledge distillation are pivotal in this regard. For instance,
pruning reduces the number of parameters by eliminating less
significant ones, which can decrease computational require-
ments while maintaining performance [89,90]. Quantization
further enhances efficiency by reducing the precision of
model parameters, thereby decreasing memory usage and
accelerating computation [23,25]. Low-rank factorization
decomposes large weight matrices into smaller matrices,
which can capture essential information with fewer param-
eters [91,92]. Knowledge distillation involves training a
smaller model to replicate the behavior of a larger, well-
trained model, effectively transferring knowledge while
reducing computational complexity [93,94]. This technique
is particularly useful for deploying models in environments
with limited resources without significantly sacrificing accu-
racy.
For instance, lightweight models like MobileNet and
SqueezeNet are designed to operate efficiently on mobile
devices, with MobileNet using depthwise separable con-
volutions to reduce computational load while maintaining
accuracy [95], and SqueezeNet achieving AlexNet-level
accuracy with significantly fewer parameters through the use
of fire modules [96]. In edge computing scenarios, models
must balance performance with the limited computational
capacity of edge devices, utilizing techniques such as quan-
tization and pruning to ensure real-time inference without
excessive latency [97]. For IoT applications, model compres-
sion is crucial for deploying intelligent analytics on devices
with stringent power and memory constraints, with tech-
niques like low-rank factorization and knowledge distillation
creating compact models that can operate efficiently in such
environments [86].
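The depthwise separable convolution underlying MobileNet-style architectures can be sketched as follows: a depthwise convolution (one filter per input channel) followed by a 1x1 pointwise convolution, which together use far fewer weights than a standard convolution of the same shape. The channel sizes below are illustrative assumptions.

```python
# Sketch of a MobileNet-style depthwise separable convolution block:
# depthwise conv (groups = in_channels) followed by a 1x1 pointwise conv.
# Channel sizes are illustrative.
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),       # depthwise
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

block = depthwise_separable_conv(32, 64)
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(block), params(standard))  # roughly 2.5k vs. 18.4k weights
```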
In conclusion, the interplay between model performance
and computational demand is a dynamic challenge that neces-
sitates a balanced approach. By leveraging various model
compression techniques, it is possible to develop efficient
models that are suitable for deployment in a variety of
resource-constrained environments, thus advancing the prac-
tical application of AI technologies.
3 Common model compression approaches
This section delves into key model compression techniques
in DNNs. Each technique addresses the challenge of deploy-
ing advanced DNNs in scenarios with limited computational
power, such as mobile devices and edge computing plat-
forms, highlighting the trade-offs between model size reduc-
Table 1 Various model compression methods are evaluated, detailing their compression ratios, performance retention, computational efficiency, key strengths, and potential drawbacks, providing
insights into their suitability for different applications
Technique              | Key Strength                           | Drawbacks
Pruning                | Reduces overhead effectively           | Risk of over-pruning
Quantization           | Increases speed                        | Errors can impact performance
Low-rank factorization | Efficient in redundancy reduction      | Limited in non-redundant models
Knowledge distillation | Smaller models perform well            | Possible performance gap
Transfer learning      | Saves resources, improves performance  | Risk of negative transfer
Table 2 A comprehensive overview of model compression techniques applied across various application domains in ML
Application Domain | References | Rationale
Image classification | [53, 56–58] | Efficient for reducing model size and speeding up inference, leveraging redundancy in CNNs without major accuracy loss.
Speech recognition | [59–62] | Enhances real-time processing; knowledge distillation simplifies complex models for edge deployment.
NLP | [63–65] | Manages large matrix operations efficiently, crucial for maintaining performance in translation and sentiment analysis.
Real-time applications | [66, 67] | Minimizes latency and resource use on constrained devices, essential for immediate responses.
Domain-specific tasks | [68–70] | Adapts pre-trained models to new environments efficiently, optimizing for performance and efficiency.
Model deployment on edge devices | [67, 71, 72] | Balances model complexity and deployment feasibility on devices with limited resources.
Autonomous vehicles | [73–76] | Benefits from reduced model sizes and faster inference times for real-time decision-making.
Augmented/virtual reality | [77–79] | Ensures high-speed processing for immersive experiences through efficient computation and reduced model sizes.
Recommender systems | [62, 80, 81] | Efficiently captures essential information from vast amounts of sparse data, enhancing speed and performance.
Medical image analysis | [82–85] | Quickly adapts existing models to specific medical tasks and optimizes them for efficient analysis without sacrificing accuracy.
IoT applications | [70, 86–88] | Produces lightweight models enabling smarter, real-time analytics at the edge with stringent power and computational constraints.
Each domain is paired with commonly used compression approaches and their underlying rationale, highlighting the benefits and potential limitations in terms of model size reduction, performance retention, and computational efficiency.
tion and performance retention. This exploration underlines
the importance of innovative approaches to model compres-
sion, essential for the practical application of DNNs across
various domains. Readers are guided through a detailed
exploration of model compression techniques in DNNs, espe-
cially in scenarios where resources are constrained. Each
technique is explained, encompassing theoretical founda-
tions, practical implementation considerations, and their
direct impact on model performance, including accuracy,
inference speed, and memory utilization.
Pruning systematically removes less significant parame-
ters from a DNN to reduce size and computational complex-
ity while maintaining performance. Quantization reduces the
precision of model parameters to lower-bit representations,
decreasing memory usage and speeding up computation,
which is ideal for constrained devices. Low-rank factoriza-
tion decomposes large weight matrices into smaller, low-
rank matrices, capturing essential information and reducing
model size and computational demands. Knowledge dis-
tillation transfers knowledge from a larger, well-trained
teacher model to a smaller student model, retaining high
accuracy with fewer parameters. Transfer learning lever-
ages pre-trained models on extensive datasets to adapt to
new tasks, minimizing the need for extensive data collec-
tion and training. Lightweight design architectures, such
as SqueezeNet and MobileNet, are engineered with fewer
parameters and lower computational requirements with-
out significantly compromising accuracy. Collectively, these
techniques address the challenge of deploying advanced
ML models in resource-constrained environments, balanc-
ing model performance with computational demand, and
highlighting their importance in efficient and sustainable AI
development.
3.1 Pruning
Pruning is a process for enhancing ML model efficiency
and effectiveness. By systematically removing less signifi-
cant parameters of a DNN, pruning reduces the model’s size
and computational complexity without substantially com-
promising its performance [19–22,98,99]. This practice is
especially vital in contexts where storage and computational
resources are limited. Fig. 2 depicts an illustration of a pruned DNN.
Pruning involves the selective removal of network param-
eters (weights and neurons) that contribute the least to the
network’s output [100]. This process leads to a compressed
and efficient model, facilitating faster inference times and
reduced energy consumption [101]. The common types of
pruning include neuron pruning, which involves removing
entire neurons or filters from the network [102]. It is com-
monly used in CNNs and targets neurons that contribute less
to the network’s ability to model the problem. By remov-
ing these neurons, the network’s complexity is reduced,
potentially leading to faster inference times [59]. Weight
pruning, focused on eliminating individual weights within
a DNN [103], involves identifying weights with minimal
impact (typically those with the smallest absolute values)
and setting them to zero. This process creates a sparse
weight matrix, which can significantly reduce the model’s
size and computational requirements [104]. Structured prun-
ing focuses on removing larger structural components of a
network, such as entire layers or channels [105]. It is aligned
with hardware constraints and optimizes for computational
efficiency and regular memory access patterns.
The parameters of a DNN are selected for pruning based on their impact
on the output. Techniques like sensitivity analysis or heuris-
tics are often used to identify these parameters [106]. Various
algorithms, like magnitude-based pruning or gradient-based
approaches, are employed to determine and execute the
removal of parameters. These methods frequently involve
iteratively pruning and testing the network to find an optimal
balance between size and performance [107]. After pruning,
it’s essential to evaluate the pruned model’s performance to
ensure that accuracy or other performance metrics are not
significantly compromised [108]. Re-training or fine-tuning
the pruned network is typically required to recover any loss
in accuracy [56]. Post-pruning, it is crucial to validate the
model on a relevant dataset to ensure that its accuracy and
efficiency meet the required standards [109].
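As an illustration of this workflow, the sketch below applies magnitude-based (L1) weight pruning to a single linear layer using PyTorch's pruning utilities. The layer size and 50% pruning ratio are illustrative choices, and in practice the pruned network would subsequently be fine-tuned.

```python
# Minimal sketch of magnitude-based weight pruning with PyTorch's
# pruning utilities: the 50% of weights with the smallest absolute
# values are zeroed, producing a sparse weight matrix.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)

prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero the smallest |w|
prune.remove(layer, "weight")                            # make the pruning mask permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")   # ~0.50
# The pruned model is then typically fine-tuned (re-trained) to
# recover any accuracy lost during pruning.
```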
Pruning is emphasized as a vital technique for remov-
ing excess in oversized models [90, 110]. However, the
main challenge arises from over-pruning, which can result
in the loss of crucial information, adversely affecting the
model’s performance [90,110]. Researchers have argued for
the necessity of optimized DNN approaches that meticu-
lously avoid the negative consequences of over-pruning [111,
112]. The conversation extends to the impact of over-pruning
on cloud-edge collaborative inference, with suggestions for a
more conservative approach to network pruning to maintain
model effectiveness [97,113]. This reflects a consensus on
the need to preserve essential information while streamlining
models for efficiency. Moreover, the optimization challenges
of pruning a distributed CNN for IoT performance enhance-
ment are illustrated through a case study, emphasizing the
complexity of achieving optimal pruning without compro-
mising model integrity [114]. These discussions collectively
underscore the importance of research focusing on devel-
oping pruning methodologies that reduce model size and
computational demands and safeguard against the loss of
essential information.
Pruning allows for the creation of more efficient and com-
pressed ML models. While it involves a trade-off between
model size and performance, with careful implementation,
it can significantly enhance computational efficiency. Ongo-
ing research in this field continues to refine and develop new
pruning techniques, making it a dynamic and essential aspect
of DNN optimization.
3.2 Quantization
Quantization serves as a pivotal technique for model com-
pression, playing a key role in enhancing computational
efficiency and reducing storage requirements [23–27,69].
This process is particularly critical in deploying DNN mod-
els on devices with limited resources. For example, most
modern DNNs are made up of billions of parameters, and the smallest large language models (LLMs) have around 7B parameters [115]. If every parameter is stored as a 32-bit value, then (7 × 10^9) × 32 = 2.24 × 10^11 bits (224 Gbit, or 28 GB) are needed just to store the parameters on disk. This implies that large models are not readily accessible
on a conventional computer or on an edge device. Quan-
tization refers to the process of reducing the precision of
the DNN’s parameters (weights and activations), simplify-
ing the model, leading to decreased memory usage and faster
computation, without significantly compromising model per-
formance. Quantization aims to reduce the total number of
bits required to represent each parameter, usually converting
floating-point numbers into integers [116].
Uniform quantization involves mapping input values to
equally spaced levels. It typically converts floating-point rep-
resentations into lower-bit representations, like 8-bit integers.
Uniform quantization simplifies computations and reduces
model size, but it must be carefully managed to avoid signifi-
cant loss in model accuracy [117]. Non-uniform quantization
uses unevenly spaced levels, which are often optimized for
the specific distribution of the data. Techniques like log-
arithmic or exponential scaling are used to allocate more
levels where the data is denser. Non-uniform quantization
can be more efficient in representing complex data distri-
butions, potentially leading to better preservation of model
accuracy [118]. Post-training quantization involves applying
quantization to a model after it has been fully trained. It sim-
plifies the process as it doesn’t require retraining; however, it
may require calibration on a subset of the dataset to maintain
accuracy [119].
Fig. 2 The process of weight pruning in a DNN. (a) Shows the original
DNN with all nodes and connections intact. (b) Highlights the nodes
that have been pruned, indicating the parts of the network identified as
non-essential. (c) Displays the pruned connections with dashed lines,
illustrating the streamlined network structure after the less significant
weights have been removed
Selecting the right parameters (like weights and activa-
tions) to quantize is crucial. The selection is based on their
impact on output and the potential for computational savings.
Techniques include linear quantization, which maintains a
linear relationship between the quantized and original val-
ues, and non-linear quantization, which can better adapt to
data distribution [70]. These methods often require additional
consideration to ensure minimal impact on the model’s per-
formance. It is crucial to assess the model after quantization to
ensure that there is no significant loss in accuracy or effi-
ciency. In some cases, fine-tuning the quantized model can
help regain any lost accuracy. Methods include retraining
the model with a lower learning rate or using techniques
like knowledge distillation [120]. It is also important to validate the quantized model on a representative dataset to confirm that it remains accurate and fast [121]. Quantization effec-
tively compresses DNN models, enabling their deployment
in resource-constrained environments. It makes DNNs simpler and faster by reducing their computational
requirements [122]. Advancements in quantization methods
continue to focus on maintaining model performance while
maximizing compression [71].
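A minimal sketch of the uniform (affine) post-training quantization described above is given below: a trained FP32 weight tensor is mapped to 8-bit integers using a scale and zero-point derived from its value range, then dequantized for comparison. The tensor and bit-width are illustrative assumptions, not a reproduction of any specific method cited above.

```python
# Sketch of uniform (affine) post-training quantization of a weight
# tensor to 8-bit integers, followed by dequantization.
import torch

w = torch.randn(64, 64)                       # trained FP32 weights (illustrative)

qmin, qmax = -128, 127                        # int8 range
scale = (w.max() - w.min()) / (qmax - qmin)   # step between quantization levels
zero_point = qmin - torch.round(w.min() / scale)

w_int8 = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.int8)
w_dequant = (w_int8.float() - zero_point) * scale      # approximate reconstruction

print("max quantization error:", (w - w_dequant).abs().max().item())
print("memory: 32-bit", w.numel() * 4, "bytes -> 8-bit", w.numel(), "bytes")
```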
Quantization plays a significant role in reducing model size and improving inference speed, making it suitable for mobile and edge com-
puting and real-time applications [121,123–125]. Despite
its benefits, it carries the risk of introducing quantization
errors that can significantly impair model accuracy, espe-
cially in complex DL models [121,123]. This concern
is well-documented in the literature, with several studies
addressing the impact of quantization on model accuracy
and proposing various methods to mitigate these effects.
For instance, research has highlighted the effectiveness of
higher-bit integer quantization in achieving a balance
between reduced model size and maintained accuracy [126].
Additionally, the risks associated with quantization errors
have been explicitly discussed, emphasizing the negative
impact these errors can have on model accuracy [127]. More-
over, the development of methodologies such as sharpness-
and quantization-aware training (SQuAT) has been shown to
mitigate the challenges of quantization and enhance model
performance [128]. These studies collectively underscore
the critical need for continued innovation in quantization
techniques, aiming to minimize the adverse effects of quan-
tization errors while leveraging the efficiency gains offered
by this approach in the field of DL.
3.3 Low-rank factorization
Low-rank factorization is a technique for making NNs smaller and simpler without significantly reducing their effectiveness. This method focuses on
decomposing large, dense weight matrices found in DNN
into two smaller, lower-rank matrices [28,29,129]. The
essence of low-rank factorization is its ability to combine
these two resulting data matrices to approximate the orig-
inal, thereby achieving compression. This method reduces
the model size and data processing demands, making CNNs
more suitable for applications in resource-limited environ-
ments [91]. This process aims to capture the most significant
information in the network’s weights, allowing for a more
compact representation with minimal loss in performance.
Matrix decomposition and tensor decomposition are com-
monly applied techniques. Matrix decomposition in low-rank
factorization involves breaking down large weight matri-
ces into simpler matrix forms. Singular value decomposition
(SVD) is a common method, where a matrix is decom-
posed into three smaller matrices, capturing the essential
features of the original matrix. This reduces the number of
parameters in DNN models, leading to less storage and com-
putational requirements, while striving to maintain model
performance [130]. Extending beyond matrix decompo-
sition, tensor decomposition deals with multidimensional
arrays (tensors) in DNNs. Techniques like canonical polyadic
decomposition (CPD) or Tucker decomposition are used,
which factorize a tensor into a set of smaller tensors [131].
Tensor decomposition is particularly effective for com-
pressing CNNs, often achieving higher compression rates
compared to matrix decomposition.
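The matrix-decomposition variant can be sketched as follows: the weight matrix of a fully connected layer is approximated by a truncated SVD of rank r, and the layer is replaced by two thinner linear layers whose product reproduces the low-rank approximation. The layer sizes and the rank r are illustrative assumptions.

```python
# Sketch of low-rank factorization of a fully connected layer via
# truncated SVD: the m x n weight matrix is approximated by two
# smaller matrices of rank r, replacing one large layer with two
# thinner ones.
import torch
import torch.nn as nn

layer = nn.Linear(512, 512, bias=False)
W = layer.weight.data                       # shape (512, 512)
r = 64                                      # target rank (illustrative)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                        # (512, r): left factors scaled by singular values
B = Vh[:r, :]                               # (r, 512): right factors

factorized = nn.Sequential(                 # y = A @ (B @ x) ~= W @ x
    nn.Linear(512, r, bias=False),
    nn.Linear(r, 512, bias=False),
)
factorized[0].weight.data = B
factorized[1].weight.data = A

orig, new = W.numel(), A.numel() + B.numel()
print(f"parameters: {orig} -> {new} ({new / orig:.0%} of original)")
```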
Layers with larger weight matrices or those contributing
less to the output variance are prime candidates for factor-
ization. Techniques like sensitivity analysis can help identify
these layers [132]. The model’s accuracy and dimension need
to be assessed after the factorization process. Metrics like
accuracy, inference time, and model size are key consider-
ations. Fine-tuning the factorized network can help recover
any loss in accuracy due to the compression [133]. This might
involve continued training with a reduced learning rate or
applying techniques like knowledge distillation. It is recom-
mended to validate the factorized model on a relevant dataset
to ensure that it still meets the required performance stan-
dards.
Low-rank factorization efficiently reduces redundancies
in models, particularly in fully connected layers [28,29,134].
However, it faces challenges in its broad appli-
cability, despite its effectiveness in reducing redundancies
in large-scale models. Its suitability is somewhat limited to
scenarios with significant redundant information in fully con-
nected layers, indicating a constraint in its versatility across
different types of models [28,29]. For instance, the effec-
tiveness of approaches like Kronecker tensor decomposition
in compressing weight matrices and reducing the parameter
dimension in CNN highlights the potential of low-rank fac-
torization techniques in specific contexts [135]. However, the
literature indicates that the applicability of low-rank factor-
ization may be somewhat constrained, reflecting limitations,
especially in models with varying architectural complex-
ities or those not characterized by significant redundancy
in their fully connected layers [136,137]. Such challenges
underscore the necessity for ongoing research to expand the
scope and efficacy of low-rank factorization methods, poten-
tially through approaches that can be applicable to a broader
spectrum of DL architectures without compromising model
performance or efficiency.
Low-rank factorization is an effective approach for com-
pressing DNNs, particularly useful in environments with
limited computational resources. The compromise between
model size and precision is inevitable, but careful optimization can mitigate the resulting performance dips. Ongoing
research in this area continues to explore more efficient fac-
torization techniques and their applications in various types
of NNs [138].
3.4 Knowledge distillation
Knowledge distillation is primarily utilized in the domain
of DNNs for model compression and optimization [63,93,
94,105]. It works by transferring experience from a large-
scale model (teacher) to a smaller-scale model (student),
enhancing the latter’s performance without the computa-
tional intensity of the former [139]. At its core, knowledge
distillation is about extracting the informative aspects of a
large model’s behavior and instilling this knowledge into a
smaller model. This approach allows for the retention of high
accuracy in the student model while significantly reducing
its size and complexity.
In a teacher-student model framework, a large-scale, well-
trained model is employed to guide the training of a
smaller-scale model. The large-scale network provides guid-
ance to the small-scale network [116]. The small-scale model
aims to mimic the large-scale model’s output while having
fewer parameters and computational complexity. The small-
scale model is optimized both to predict the correct class and to replicate the large-scale model's output (predic-
tions or intermediate features). The small-scale model can be
trained to match the softmax output of the large-scale model,
or to match its feature representations. Common loss functions measure how closely the small-scale out-
puts match the large-scale outputs [140]. Distillation loss,
for example, helps the small-scale model to learn the behav-
ior of the large-scale model, going beyond mere class prediction. Knowledge distillation is especially effective at
simplifying models’ complexity, making it suited for appli-
cations in limited-resource systems. The small-scale model’s
performance is similar to the large-scale model, but it requires
fewer computational resources.
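The teacher-student objective described above is commonly implemented as a weighted combination of the ordinary cross-entropy loss and a KL-divergence term on temperature-softened softmax outputs, as in the sketch below; the temperature T, the weight alpha, and the random logits are illustrative assumptions.

```python
# Sketch of a distillation objective: the student is trained on a
# weighted sum of cross-entropy with the ground-truth labels and a
# KL-divergence term matching the teacher's softened softmax output.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard_loss = F.cross_entropy(student_logits, labels)       # task loss
    soft_loss = F.kl_div(                                      # match the teacher
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                                # standard gradient scaling
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Hypothetical logits for a batch of 8 samples and 10 classes:
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```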
While it may seem that the overall performance of
small-scale models would decrease compared to large-scale
models, the primary goal of knowledge distillation is not to
achieve identical performance across all tasks but rather to
maintain similar performance on specific tasks while reduc-
ing model size and computational complexity. Indeed,
these large models often come with significant computational
costs and memory requirements, making them impracti-
cal for deployment in resource-constrained environments or
real-time applications. By distilling the knowledge from a
large-scale model into a smaller counterpart, the goal is to
retain the essential information and decision-making capabil-
ities necessary for specific tasks while reducing the model’s
size and computational demands. While it is true that small-
scale models may not match the performance of large-scale
models across all tasks, the focus is on achieving compara-
ble performance on targeted tasks of interest while benefiting
from the efficiency and speed advantages of smaller mod-
els. The aim is not to replicate the exact performance of
the large model but to strike a balance between model size,
computational efficiency, and task-specific performance,
making knowledge distillation a valuable technique for
model compression and optimization in practical applica-
tions.
Distillation techniques can significantly enhance the per-
formance of smaller models, often outperforming models
trained in standard ways [141]. Knowledge distillation has
been successfully adopted in areas like computational vision,
NLP, and speech recognition, demonstrating its versatility
and effectiveness. The choice of large and small-scale mod-
els is crucial. Too complex a teacher can make the distillation
process less effective. The architecture of the large and small-
scale models also has a significant impact on the distillation
process’s success. Tuning hyperparameters such as tempera-
ture in softmax and the weight of distillation loss is vital for
achieving optimal performance in the student model [142].
Knowledge distillation, well-known for its capacity to
encapsulate the functionalities of larger models into forms
that are suited to deployment in resource-restricted set-
tings, faces a series of intricate challenges and drawbacks.
A notable issue is the deployment of extensive pre-trained
language models (PLM) on devices with limited memory,
which necessitates a delicate balance to optimize perfor-
mance without overwhelming system resources [143]. Fur-
thermore, the generalization capacity of distilled models may
be compromised when utilizing public datasets that differ
from the training datasets, diluting the model’s relevance
and accuracy [144]. The constraints of existing knowledge
distillation-based approaches in federated learning under-
score the need for innovative solutions to address the scarcity
of cross-lingual alignments for knowledge transfer and
the potential for unreliable temporal knowledge discrepan-
cies [145,146]. Additionally, the sparsity, randomness, and
varying density of point cloud data in light detection and
ranging (LiDAR) semantic segmentation present challenges
that can yield inferior results when traditional distillation
approaches are directly applied [147]. These challenges high-
light the necessity for continuous exploration and refinement
of knowledge distillation techniques to ensure they can effec-
tively reduce model size and complexity while maintaining
or even enhancing performance across a broad spectrum of
applications.
Knowledge distillation stands as a powerful tool in the
realm of CNN, offering an efficient way to compress mod-
els and enhance the performance of smaller networks. As
the demand for deploying sophisticated models in resource-
limited environments grows, knowledge distillation will
continue going in a fundamental area of knowledge and
development, paving the way for more efficient and accessi-
ble AI applications [148].
3.5 Transfer learning
Transfer learning is a technique in the domain of DNNs that
enables models to leverage pre-existing knowledge for new,
often related tasks. This methodology significantly reduces
the need for extensive data collection and training from
scratch. In essence, transfer learning involves taking a model
established for one purpose and repurposing it for a different
but related task. The assumption behind this strategy is that
the knowledge gained by a model in learning one task can be
beneficial in learning another, especially when the tasks are
similar [149–155].
In the feature extractor approach, a model pre-trained on a large dataset is applied as a fixed feature extractor. The pre-trained layers capture general features that are
applicable to a diverse set of tasks. Common in image and
speech recognition tasks, this method is beneficial when
there’s limited training data for the new task [156]. It allows
for leveraging complex features learned by the model without
extensive retraining. Fine-tuning involves adjusting a pre-
conditioned model by continuing the learning process on a
different dataset. This approach often involves modifying the
model design in order to better suit the upcoming task, and
then training these layers (or the entire model) on the new
data. Fine-tuning can lead to more tailored and accurate mod-
els for specific tasks.
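As an illustration, the following PyTorch-style sketch contrasts the two approaches using a torchvision ResNet-18 backbone; the weight identifier, the number of target classes, and the choice of layers to unfreeze are assumptions made for the example:

import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (the string-based weights argument
# is assumed to be supported by the installed torchvision version).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Feature-extractor approach: freeze every pre-trained layer.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classification head for the new task (10 classes assumed).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Fine-tuning instead unfreezes part (or all) of the backbone, for example:
# for param in backbone.layer4.parameters():
#     param.requires_grad = True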
Transfer learning drastically reduces the time and resources
required to develop effective models, as the initial learning
phase is bypassed. Models can achieve higher accuracy, espe-
cially in tasks where training data is scarce, by building upon
pre-learned patterns and features [157]. Transfer learning
has seen successful applications in areas like medical image
analysis [158], NLP [159], and autonomous vehicles [160],
showcasing its versatility. The selection of the pre-trained model should reflect the nature of the target task. Factors
like the similarity of the datasets and the complexity of the
model need consideration. Careful adjustment of the model
is required to avoid overfitting to the new task or underfitting
due to insufficient training. Regularization techniques and
data augmentation can be helpful in this regard [161].
Transfer learning is not without its challenges and draw-
backs. One such challenge is the need for large and diverse
datasets to effectively train models, coupled with the lim-
ited interpretability of DL models [162]. In the context of
face recognition with masks, the reduction in visible features
due to masks poses a significant challenge to maintaining
model performance, highlighting the complexity of adapt-
ing transfer learning to new and evolving scenarios [163].
Furthermore, the application of transfer learning in breast
cancer classification underscores the technique’s dependency
on domain-specific data to achieve state-of-the-art (SOTA)
performance, suggesting limitations in its versatility across
different domains [164]. Moreover, scenarios with limited
resources emphasize the need for optimized transfer learning
models [165]. The selection of appropriate transfer learning
algorithms for practical applications in industrial scenar-
ios presents another layer of complexity, underscoring the
challenge of applying transfer learning to varied real-world
applications [166]. Additionally, hypothesis transfer learn-
ing in binary classification highlights the balance required
between leveraging existing knowledge and adapting to
new tasks, further complicating the deployment of transfer
learning in applications reliant on big data [167]. These ref-
erences collectively underscore some challenges associated
with transfer learning, from dataset and interpretability issues
to computational constraints and the risk of negative trans-
fer, highlighting the need for further research and development to extend its applicability to a broader range of applications.
Transfer learning represents a significant leap in training
DNNs, offering a practical and efficient pathway to model
development and deployment. It accelerates the training pro-
cess and opens up possibilities for tasks with limited data
availability. As AI continues to evolve, transfer learning is
poised to play an increasingly vital role [168].
4 Lightweight model design and synergy
with model compression techniques
The quest for efficient and effective NN architectures is
paramount. Two critical approaches emerge in this pur-
suit: lightweight model design and model compression. Both
methodologies aim to enhance the ease of deployment and
performance of DNN, especially in resource-constrained
environments [57,169]. This section delves into the concept
of lightweight model design, exemplified by groundbreaking
architectures, and draws connections to model compression,
illustrating how these strategies collectively drive advance-
ments in the ML domain.
Lightweight model design focuses on constructing DNN
from the ground up, with an emphasis on minimalism and
efficiency. This approach often involves innovating architec-
tural elements, such as the fire modules in SqueezeNet and the
low-rank separable convolutions in SqueezeNext, to reduce
the model’s scale and computational needs without signif-
icantly compromising its performance. The objective is to
create inherently efficient models that can operate effectively
on devices with reduced computational capacity and mem-
ory, such as smartphones, IoT appliances, and embedded
systems [169]. In contrast, model compression methodologies are applied to pre-existing, often more complex, DNN models. The goal is to make these already trained models smaller and easier to use in limited-resource settings, balancing the reduction in scale against the need to keep the network effective [170].
4.1 Overview of lightweight model architectures
4.1.1 SqueezeNet architecture
SqueezeNet represents a significant advancement in design-
ing NN models [96,171]. Developed with an emphasis
on minimizing model size without compromising accuracy,
SqueezeNet stands as an example of lightweight model
design in ML.
At the heart of SqueezeNet is the use of fire modules,
which are small, carefully designed convolutional building blocks that drastically
reduce the number of parameters without affecting per-
formance [21,172]. This design aligns with the growing
need for deployable DL models in limited-resource appli-
cations, such as smartphones and embedded systems. The
compact nature of SqueezeNet also offers significant ben-
efits in terms of reduced memory requirements and faster
computational speeds, making it ideal for real-time applica-
tions [96]. SqueezeNet’s architecture has also been influential
in the realm of model compression. Its highly efficient design
makes it an excellent baseline for applying further com-
pression techniques. These methods enhance SqueezeNet’s
ease of deployment, particularly in scenarios where computa-
tional resources are limited. The adaptability of SqueezeNet
to various compression techniques exemplifies its versatility
and robustness as a DL model [89].
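A minimal PyTorch-style sketch of a fire module, following the squeeze-then-expand structure described above (the channel sizes are illustrative), is:

import torch
import torch.nn as nn

class Fire(nn.Module):
    # A 1x1 "squeeze" convolution followed by parallel 1x1 and 3x3 "expand"
    # convolutions whose outputs are concatenated along the channel axis.
    def __init__(self, in_ch, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat(
            [self.relu(self.expand1x1(x)), self.relu(self.expand3x3(x))], dim=1
        )

Keeping the squeeze layer narrow is what limits the number of parameters fed into the larger expand filters.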
The application of SqueezeNet extends beyond theoretical
research, finding practical use in areas including media anal-
ysis and mobile applications. Its influence has also paved the
way for future research in lightweight NN design, inspiring
the development of subsequent architectures like MobileNet
and SqueezeNext. These models build on the foundational
principles established by SqueezeNet, further pushing the
boundaries of efficiency in NN design [95,173].
4.1.2 SqueezeNext architecture
SqueezeNext is an advanced lightweight CNN architecture [173].
model [173]. Building upon the principles of SqueezeNet,
SqueezeNext integrates novel design elements to achieve
even greater efficiency in model size and computation.
SqueezeNext stands out for its innovative architectural
choices, which include low-rank separable convolutions and
optimized layer configurations. These features enable it
to maintain high accuracy while drastically reducing the
model’s size and computational demands. This efficiency
is particularly beneficial for deployment in environments
with stringent memory and processing constraints, such as
mobile devices and edge computing platforms. The design
of SqueezeNext demonstrates the progress made in crafting
models that are both lightweight and capable [173].
SqueezeNext’s design also contributes significantly to the
field of model compression. Its inherent efficiency provides
one foundation for applying additional compression tech-
niques. These methods further enhance the model’s suitabil-
ity for deployment in resource-limited settings, showcasing
SqueezeNext’s versatility in various application scenarios.
The architecture serves as a benchmark in the study of model
compression, providing insights into achieving an optimal
balance between model size, speed, and accuracy [21]. The
impact of SqueezeNext extends to practical applications in
areas like image processing, real-time analytics, and IoT
devices.
4.1.3 MobileNetV1 architecture
MobileNetV1, introduced by researchers at Google, marks
a significant milestone in the development of efficient DL
architectures [95]. It is specifically engineered for mobile
and embedded vision applications, offering a perfect blend
of compactness, speed, and accuracy. The core innovation of
MobileNetV1 lies in its use of depth-wise separable convolu-
tions. This design reduces the computational cost and model
size compared to conventional CNNs. Depthwise separable
convolutions split the standard convolution into two layers -
a depthwise convolution and a pointwise convolution - which
substantially decreases the number of parameters and opera-
tions required. This architectural choice makes MobileNetV1
exceptionally suited for mobile devices, where computa-
tional resources and power are limited [21,95].
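The following PyTorch-style sketch illustrates a depthwise separable block; the batch normalization and ReLU placement follow common practice and are not an exact reproduction of the original architecture:

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # A per-channel (depthwise) 3x3 convolution followed by a 1x1 (pointwise)
    # convolution that mixes information across channels.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))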
MobileNetV1’s efficient design has significantly impacted
the deployment of DL models on mobile and edge devices.
Its ability to deliver high performance with low latency and
power consumption has enabled a wide range of applications,
from real-time image and video processing to complex ML
tasks on handheld devices. This breakthrough has opened up
new possibilities in the field of mobile computing, where the
demand for powerful yet efficient AI models is constantly
growing [174].
MobileNetV1 not only stands as a remarkable achieve-
ment in its own right, but also lays the groundwork for future
advancements in lightweight DL models. It has inspired a
series of subsequent architectures, like MobileNetV2 and
MobileNetV3, each iterating on the initial design to achieve
even greater efficiency and performance. The principles
established by MobileNetV1 continue to influence the design
of NN aimed at edge computing and IoT devices [175].
4.1.4 MobileNetV2 architecture
MobileNetV2, an evolution of its predecessor MobileNetV1,
further refines the concept of efficient NN design for
mobile and edge devices. Introduced by Google researchers,
MobileNetV2 incorporates novel architectural features to
enhance performance and efficiency, making it a standout
choice in the landscape of lightweight DL models [174].
MobileNetV2 introduces the concept of inverted residu-
als and linear bottlenecks, which are key to its improved
efficiency and accuracy. These innovations involve using
lightweight, depth-wise separable convolutions to filter fea-
tures in the intermediate expansion layer, and then projecting
them back to a low-dimensional space. This approach reduces
the computational burden and preserves important informa-
tion flowing through the network. The result is a model that
offers higher accuracy and efficiency, particularly in appli-
cations where latency and power consumption are critical
considerations [174].
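A simplified PyTorch-style sketch of an inverted residual block with a linear bottleneck is shown below; the expansion ratio of 6 is a typical illustrative value:

import torch.nn as nn

class InvertedResidual(nn.Module):
    # Expand with a 1x1 convolution, filter with a depthwise 3x3 convolution,
    # then project back with a linear (activation-free) 1x1 bottleneck.
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),  # linear bottleneck: no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out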
MobileNetV2’s enhanced efficiency has significant impli-
cations for mobile and edge computing. Its ability to deliver
high-performance ML with minimal resource usage has
broadened the scope of applications possible on mobile
devices. This includes advanced image and video process-
ing tasks, real-time object detection, and augmented/virtual
reality (AR/VR) - all on devices with limited computa-
tional capabilities. MobileNetV2’s architecture has set a new
benchmark for developing AI models that are both power-
ful and resource-efficient [175]. Its architectural innovations
have been foundational in the creation of more advanced
models like MobileNetV3 and beyond, which continue to
push the boundaries of efficiency and performance in NN
design. The legacy of MobileNetV2 is evident in the ongo-
ing efforts to optimize DL models for the increasingly diverse
requirements of mobile and edge AI [176].
4.1.5 MobileNetV3 architecture
MobileNetV3 represents a further refinement in the develop-
ment of efficient and compact DL models tailored for mobile
and edge devices. Developed by Google, MobileNetV3
builds upon the foundations laid by its predecessors,
MobileNetV1 and MobileNetV2, incorporating several novel
architectural innovations to enhance performance while
maintaining efficiency [177]. One of the key innovations
in MobileNetV3 is the use of a neural architecture search
(NAS) to optimize the network structure. This automated
design process identifies the most efficient network config-
urations, balancing the trade-offs between latency, accuracy,
and computational cost. Additionally, MobileNetV3 intro-
duces squeeze-and-excitation modules, which adaptively
recalibrate channel-wise feature responses by explicitly mod-
eling interdependencies between channels. This improves the
model’s representational power without a significant increase
in computational burden [177].
MobileNetV3 also incorporates a combination of hard
swish (h-swish) activation functions and new efficient build-
ing blocks, such as the MobileNetV3 blocks, which include
lightweight depthwise convolutions and linear bottleneck
structures. These architectural features collectively reduce
the computational load and enhance the model’s perfor-
mance on mobile and edge devices [177]. The efficiency
and high performance of MobileNetV3 make it particularly
suitable for real-time applications, such as image classifi-
cation, object detection, and other vision-related tasks on
resource-constrained devices. Its compact design ensures low
latency and reduced power consumption, enabling deploy-
ment in diverse environments, from smartphones to IoT
devices [177].
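The h-swish activation and the squeeze-and-excitation recalibration described above can be sketched in PyTorch style as follows; the reduction ratio and the hard-sigmoid gating are illustrative choices rather than an exact reproduction of MobileNetV3:

import torch.nn as nn
import torch.nn.functional as F

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a cheap piecewise approximation of swish.
    return x * F.relu6(x + 3.0) / 6.0

class SqueezeExcite(nn.Module):
    # Pool to a per-channel descriptor, pass it through a small bottleneck,
    # and rescale the feature map channel-wise with the resulting weights.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)   # squeeze to shape (N, C, 1, 1)
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))    # excitation weights in [0, 1]
        return x * s                      # channel-wise recalibration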
The principles and techniques introduced in MobileNetV3
have been adopted and extended in various new architec-
tures, further advancing the SOTA in lightweight and efficient
model design. These developments continue to push the
boundaries of what is achievable in the context of mobile
and edge AI applications, ensuring that high-performance
DL models remain accessible and practical for real-world
use [178,179].
4.1.6 ShuffleNetV1 architecture
ShuffleNetV1 marks a significant advancement in the field of
efficient NN architectures. Developed to cater to the increas-
ing demand for computational efficiency in mobile and edge
computing, ShuffleNetV1 introduces an innovative approach
to designing lightweight DL models. The defining feature
of ShuffleNetV1 is its use of pointwise group convolutions
and channel shuffle operations. These techniques dramati-
cally reduce computational costs while maintaining model
accuracy. Point-wise group convolutions divide the input
channels into groups, reducing the number of parameters and
computations. The channel shuffle operation then allows for
the cross-group information flow, ensuring that the grouped
convolutions do not weaken the network’s representational
capabilities. This unique combination of features enables
ShuffleNetV1 to offer a highly efficient network architec-
ture, particularly suitable for scenarios where computational
resources are limited [180].
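A channel shuffle can be implemented with a simple reshape and transpose, as in the following PyTorch-style sketch:

import torch

def channel_shuffle(x, groups):
    # Interleave channels across groups so that information can flow between
    # the grouped convolutions of consecutive layers.
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and channel axes
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)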
ShuffleNetV1’s efficiency and high performance make it
a valuable asset in mobile and edge computing applications.
Its design addresses the challenges of running complex DL
models on devices with constrained processing power and
memory, such as smartphones and IoT devices. The archi-
tecture has been widely adopted for tasks like real-time
image classification and object detection, offering a practical
solution for deploying advanced AI capabilities in resource-
limited environments [181].
The introduction of ShuffleNetV1 has had a significant
impact on the research and development of efficient NN mod-
els. Its approach to reducing computational demands with-
out compromising accuracy has contributed to subsequent
architectures, including ShuffleNetV2. These developments
continue to explore and expand the opportunities of what is
possible in the realm of lightweight and efficient DL mod-
els [180].
4.1.7 ShuffleNetV2 architecture
ShuffleNetV2 represents a progression in the evolution of
efficient NN architectures, building upon the foundations laid
by its predecessor, ShuffleNetV1. The ShuffleNetV2 archi-
tecture was specifically designed to address the limitations
and challenges observed in previous lightweight models,
particularly in the context of computational efficiency and
practical deployment on mobile and edge devices. By intro-
ducing a series of novel design principles and techniques,
ShuffleNetV2 achieves a superior balance between speed and
accuracy, making it highly effective for real-world applica-
tions [181].
The core innovation of ShuffleNetV2 lies in its strat-
egy to optimize the network’s computational graph through
a more refined use of channel operations. Unlike its pre-
decessor, ShuffleNetV2 focuses on addressing the issues
of memory access cost and network fragmentation. The
architecture introduces an enhanced channel split opera-
tion, where each layer’s input is split into two branches:
one that undergoes a pointwise convolution and another
that remains unchanged, significantly reducing the compu-
tation and memory footprint. Additionally, ShuffleNetV2
employs an improved channel shuffle operation that ensures
an even and efficient mixing of information across fea-
ture maps, thereby enhancing the network’s representational
power without introducing substantial computational over-
head [181]. ShuffleNetV2 outperforms its predecessor and
other contemporary lightweight models in terms of speed
and accuracy on various benchmarks. It achieves a favorable
trade-off between model size and computational efficiency,
making it particularly well-suited for deployment in scenar-
ios with stringent resource constraints, such as mobile and
edge AI applications [181].
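The channel split operation itself amounts to a few lines; combined with a channel shuffle (as in the earlier sketch), it forms the core of a ShuffleNetV2 unit:

def channel_split(x):
    # Split the input channels into two halves: one branch is transformed by
    # pointwise and depthwise convolutions, the other is passed through
    # unchanged and concatenated back before the channel shuffle.
    half = x.shape[1] // 2
    return x[:, :half], x[:, half:]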
The impact of ShuffleNetV2 extends beyond its immedi-
ate performance benefits. Its introduction has influenced the
broader field of efficient NN design, inspiring subsequent
research and development efforts aimed at further optimizing
lightweight architectures. By addressing the critical bottle-
necks in mobile and edge AI deployment, ShuffleNetV2 has
set a new standard for what is achievable in terms of balancing
efficiency and accuracy in DL models. This has paved the way
for more sophisticated applications in real-time image pro-
cessing, object detection, and other AI-driven tasks, ensuring
that high-performance DL remains accessible and practical
for a wide range of real-world uses [181].
4.1.8 EfficientNet architecture
EfficientNet, a groundbreaking series of CNN, represents
a significant advancement in the efficient scaling of DL
models. Developed with a focus on balanced scaling of
network dimensions, EfficientNet has set new standards
for achieving SOTA accuracy with remarkably efficient
resource utilization. The key innovation of EfficientNet is
its systematic approach to scaling, called compound scaling.
Unlike traditional methods that independently scale network
dimensions (depth, width, or resolution), EfficientNet uses
a compound coefficient to uniformly scale these dimensions
in a principled manner. This balanced scaling method allows
EfficientNet to achieve higher accuracy without an expo-
nential increase in computational complexity. The network
efficiently utilizes resources, making it highly effective for
both high-end and resource-constrained environments [176].
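In the compound scaling rule, depth, width, and input resolution are scaled as α^φ, β^φ, and γ^φ for a single compound coefficient φ, with α · β² · γ² kept close to 2 so that each unit increase of φ roughly doubles the FLOPs. The small sketch below illustrates the idea; the coefficient values only approximate those reported for the original EfficientNet:

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    # Scale depth, width, and input resolution together with one coefficient.
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # wider layers (more channels)
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# Example: phi = 3 increases all three dimensions together, targeting roughly
# a 2**3 increase in FLOPs over the baseline network.
print(compound_scale(3))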
EfficientNet’s performance sets a benchmark for vari-
ous ML challenges, especially in image classification tasks.
The network’s ability to scale efficiently across differ-
ent computational budgets makes it adaptable for a wide
range of applications, from mobile devices to cloud-based
servers. EfficientNet has demonstrated superior performance
in tasks requiring high accuracy and efficiency, such as object
detection, image segmentation, and transfer learning across
different domains [176]. Its principles have been adopted and
adapted in subsequent research, pushing the limits of what is
possible in terms of efficiency and performance in NNs [182,
183].
4.1.9 EfficientNetV2 architecture
EfficientNetV2 represents a significant advancement in the
field of efficient DL models, building upon the success of the
original EfficientNet architecture. Developed by researchers
at Google, EfficientNetV2 introduces several novel tech-
niques to further enhance performance and efficiency, mak-
ing it one of the leading models for mobile and edge device
applications [184].
EfficientNetV2 incorporates a new scaling method called
progressive learning, which adjusts the size of the model dur-
ing training to improve both accuracy and efficiency. This
technique begins training with smaller resolutions and sim-
pler augmentations, progressively increasing the resolution
and complexity as training progresses. This method not only
speeds up the training process but also helps the model
achieve higher accuracy. Another key innovation in Effi-
cientNetV2 is the use of fused convolutional blocks, which
combine the efficiency of depthwise convolutions with the
accuracy benefits of regular convolutions. These blocks help
reduce the overall computational cost while maintaining high
performance. Additionally, EfficientNetV2 employs various
training-aware optimizations, such as improved data aug-
mentations and regularization techniques, which contribute
to its superior performance [184].
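A self-contained sketch of such a schedule is shown below; the specific resolutions and augmentation magnitudes are purely illustrative and do not reproduce the settings used by EfficientNetV2:

def progressive_schedule(total_epochs, min_size=128, max_size=300,
                         min_magnitude=5, max_magnitude=15):
    # Image resolution and augmentation strength grow together as training
    # advances, so early epochs are cheap and later epochs are harder.
    for epoch in range(total_epochs):
        frac = epoch / max(total_epochs - 1, 1)
        image_size = int(min_size + frac * (max_size - min_size))
        magnitude = min_magnitude + frac * (max_magnitude - min_magnitude)
        yield epoch, image_size, magnitude

for epoch, size, magnitude in progressive_schedule(10):
    print(f"epoch {epoch}: train at {size}x{size}, augmentation magnitude {magnitude:.1f}")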
The architecture of EfficientNetV2 is designed to be
versatile, performing well across a wide range of tasks,
including image classification, object detection, and segmen-
tation. Its balanced approach to scaling and optimization
allows it to deliver SOTA accuracy with significantly reduced
computational resources, making it ideal for deployment in
environments with limited processing power and memory,
such as mobile devices and IoT platforms [184]. Efficient-
NetV2 has set new benchmarks in the field of DL, influencing
subsequent research and inspiring new directions in the
development of efficient NN models. The principles and
techniques introduced in EfficientNetV2 have been adopted
and further refined in various other architectures, pushing the
boundaries of what is possible in efficient model design for
real-world applications [185].
4.1.10 Overview of lightweight model architectures
Some of the key lightweight model architectures are summa-
rized in the Table 3, with their year of launch, key features,
and impact on various applications highlighted.
4.2 Integration with compression techniques
The concept of lightweight design focuses on architecturally
optimizing DNNs to minimize their demand on compu-
tational resources without significantly undermining their
efficacy. Innovations such as efficient convolutional lay-
ers [95,174] introduce structural efficiencies that lower the
parameter count and computational load. These innovations
are crucial for enabling the deployment of high-performing
DNNs on devices with limited computational capacity, like
smartphones and IoT devices. The fusion of lightweight
design and model compression in DNNs represents a cru-
cial advancement for deploying advanced ML models under
computational and resource constraints [186–188].
The synergy between lightweight model design and model
compression represents a comprehensive approach to opti-
mizing CNN. While the former approach is proactive,
building efficiency into the model’s architecture, the lat-
ter is reactive, refining and streamlining models that have
already been developed. Together, they address the diverse
challenges in deploying advanced ML models, from the ini-
tial design phase through to post-training optimization [170].
This section will explore how SqueezeNet, SqueezeNext, and
similar architectures embody the principles of lightweight
design and how their integration with model compression
techniques exemplifies the broader strategy of NN optimiza-
tion in ML.
Lightweight model design and model compression, though
related, represent distinct approaches in DL. Lightweight
model design focuses on creating architectures that are inher-
ently optimized for performance and low resource consump-
tion, while maintaining satisfactory accuracy. This involves
techniques like employing smaller convolutional filters and
depthwise separable convolutions to reduce the number of
parameters and computational intensity of each layer [189].
In contrast, model compression is the process of downsiz-
ing an existing model to diminish its size and computational
demands without significantly compromising accuracy. The
objective here is to adapt a pre-trained model for more effi-
cient deployment on specific hardware platforms [190]. Both
Table 3 Summary of lightweight model architectures: year of launch, key features, impact, and applications

SqueezeNet (2016). Key features: fire modules to reduce parameters; compact design; efficient for low-resource devices. Impact and applications: significant model size reduction; used in smartphones and embedded systems; baseline for further compression techniques; practical in media analysis and mobile applications.

SqueezeNext (2018). Key features: low-rank separable convolutions; optimized layers; enhanced efficiency. Impact and applications: greater model size and computation efficiency; useful in mobile devices and edge computing; benchmark for model compression.

MobileNetV1 (2017). Key features: depth-wise separable convolutions to reduce computational cost. Impact and applications: suitable for mobile and edge devices; real-time image and video processing; inspired subsequent architectures like MobileNetV2 and V3.

MobileNetV2 (2018). Key features: inverted residuals; linear bottlenecks; depth-wise separable convolutions. Impact and applications: higher efficiency and accuracy; broadened scope for mobile applications; influenced further research in efficient NN design.

MobileNetV3 (2019). Key features: NAS for optimal network structure; squeeze-and-excitation modules; h-swish activation. Impact and applications: enhanced performance for mobile devices; low latency and power consumption; influenced new architectures in efficient model design.

ShuffleNetV1 (2018). Key features: pointwise group convolutions; channel shuffle operations. Impact and applications: highly efficient for mobile and edge computing; practical for real-time image classification and object detection.

ShuffleNetV2 (2018). Key features: optimized channel operations; enhanced channel split and shuffle operations. Impact and applications: superior speed and accuracy; well-suited for resource-constrained environments; set new standards for lightweight NN design.

EfficientNet (2019). Key features: compound scaling to balance network dimensions. Impact and applications: SOTA accuracy with efficient resource utilization; adaptable for mobile to cloud applications; influenced model scaling techniques.

EfficientNetV2 (2021). Key features: progressive learning; fused convolutional blocks; training-aware optimizations. Impact and applications: improved training efficiency and accuracy; versatile for various tasks; set benchmarks in DL, influencing new efficient architectures.
methods aim to produce models that are well-suited for
deployment on devices with limited resources. The following
subsections highlight some prominent lightweight models
designed to have fewer parameters and lower computational
requirements compared to traditional DNNs [191,192].
Recent studies have contributed to the field by proposing
novel approaches [69,193,194], exploring various com-
pression techniques for DNNs, including compact models,
tensor decomposition [131,195], data quantization [122,
196], and network sparsification [197,198]. These methods
are instrumental in the design of NN accelerators, facilitat-
ing the deployment of efficient ML models on constrained
devices. A noteworthy application in video coding [199] sug-
gested a lightweight model achieving up to 6.4% average bit
reduction compared to high efficiency video coding (HEVC),
showcasing the potential of architecturally optimized DNNs
in real-world applications. Additionally, a novel and lightweight model for efficient traffic classification [200] was
developed, utilizing thin modules and multi-head attention
to significantly reduce parameter count and running time,
demonstrating the practical utility of lightweight designs
in enhancing running efficiency. A pruning algorithm to
decrease the computational cost and improve the accuracy
of action recognition in CNNs [201] reduces the model size
and also decreases overfitting, leading to enhanced perfor-
mance on large-scale datasets. Finally, a hardware/software
co-design approach for a NN accelerator focuses on model
compression and efficient execution [202]. A two-phase fil-
ter pruning framework was proposed for model compression,
optimizing the execution of DNNs. This co-design approach
exemplifies how integration of hardware and software can
enhance the performance and efficiency of DNNs in practi-
cal applications.
4.2.1 Combined impact on performance and efficiency
Innovative techniques, such as pruning depthwise separa-
ble convolution networks [203], highlight the potential for
improving speed and maintaining accuracy, emphasizing the
importance of structural efficiency in lightweight design.
Meanwhile, the work on adaptive tensor-train decomposi-
tion [204] showcases the significant reduction in parameters
and computation, further underscoring the advancements
in model compactness and efficiency for mobile devices.
Cyclic sparsely connected (CSC) architectures provide structurally sparse designs for both fully connected and
convolutional layers in CNNs [205]. Unlike traditional prun-
ing methods that require indexing, CSC architectures are
designed to be inherently sparse, reducing memory and com-
putation complexity to O(N log N), where N denotes the number of connections present in a layer. This technique
demonstrates an innovative way to achieve model com-
pactness and computational efficiency without the overhead
associated with conventional sparsity methods. An efficient
evolutionary algorithm was introduced for NAS [206]. This
method enhances the search efficiency for task-specific NN
models, illustrating how evolutionary strategies can automate
the design of efficient and effective CNN architectures for
various tasks. These examples collectively underscore the
diverse and innovative strategies being explored to make
DNNs more efficient and adaptable, reflecting the ongo-
ing commitment within the research community to push the
boundaries of what is possible in ML efficiency.
The combination of model compression techniques and their impact on model performance reveals a complex land-
scape. Integrating various model compression techniques to
avoid compromising the original model’s effectiveness is a
well-acknowledged challenge in the field [207,208]. Com-
bining different compression methods can indeed lead to
increased efficiency. However, it presents the challenge of
balancing improvements in memory usage and computa-
tional efficiency against the potential for accuracy reduction
and the introduction of noise. This variability underscores the
need for application-specific evaluation and adaptation [56,
60]. Moreover, the complexity of optimizing these meth-
ods for specific applications highlights an ongoing research
area, necessitating innovation to address factors like fairness
and bias and to explore hardware advancements for fur-
ther enhancement. This includes developing strategies that
can effectively leverage the strengths of each compression
approach while mitigating their drawbacks, ensuring that the
resulting models are efficient and suitable for deployment in
limited-resource settings and capable of performing close to established standards [123,209].
4.2.2 Synergies between model compression and
explainable artificial intelligence (XAI)
When discussing model compression, it is crucial to also con-
sider the role of explainable artificial intelligence (XAI) as
a complementary tool in the process [210,211]. XAI pro-
vides insights into how ML models make decisions, which is
particularly beneficial during the compression process. By
understanding which parts of the model are most impor-
tant for making accurate predictions, developers can make
more informed decisions about which components to prune
or quantize. This targeted approach can help maintain the
model’s performance while reducing its size. Furthermore, XAI can help identify potential biases or errors introduced during
compression, ensuring that the compressed model remains
robust and reliable [212–214]. Integrating XAI with model
compression techniques not only enhances the interpretabil-
ity of the compressed models but also aids in fine-tuning the
balance between model size and performance. This synergy
is essential for developing efficient, scalable, and trustwor-
thy AI systems capable of operating effectively in diverse
and resource-limited environments.
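As a toy illustration of this idea, the sketch below ranks the output channels of a convolutional layer by a simple importance proxy (mean absolute weight); a genuine XAI workflow would replace this proxy with attribution-based scores, but the principle of letting importance estimates guide pruning decisions is the same:

import torch
import torch.nn as nn

def channel_importance(conv: nn.Conv2d):
    # Crude importance proxy: mean absolute weight per output channel.
    return conv.weight.detach().abs().mean(dim=(1, 2, 3))

def least_important_channels(conv: nn.Conv2d, fraction=0.3):
    # Indices of the channels that a targeted pruning step could remove first.
    scores = channel_importance(conv)
    k = int(fraction * scores.numel())
    return torch.argsort(scores)[:k]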
In looking towards future directions for the advancement of CNNs, a multidisciplinary approach emerges across various domains. The potential of automated ML (AutoML) [215] lies in streamlining model optimization by simplifying the search for efficient architectures, thus making the model design process easier. Meanwhile, the imperative of energy efficiency takes center stage [216], with a push for greener practices in CNN development and the adoption of energy-efficient models to mitigate environmental impact.
sion techniques underscores a significant stride in optimizing
DNN architectures for efficient deployment on devices with
constrained resources. Lightweight design approaches proac-
tively embed efficiency into the model’s architecture, while
model compression methods reactively refine existing mod-
els to reduce their size and computational demands. This dual
strategy addresses the diverse challenges encountered from
the initial design phase to post-training optimization. The
integration of these techniques exemplifies a comprehensive
approach to NN optimization, balancing performance and
resource efficiency. Studies have demonstrated the practical
utility of these approaches in various applications, including
video coding, traffic classification, and action recognition,
highlighting their impact on enhancing model performance
and efficiency. The ongoing research and innovations in this
field continue to push the boundaries of what is achiev-
able in ML efficiency, ensuring that advanced models can
be effectively deployed in real-world scenarios with limited
computational capacity.
5 Performance evaluation criteria
This section delves into the methodologies and metrics used
to assess the efficacy of model compression techniques.
Key aspects of performance evaluation, such as accuracy,
model size, computational speed, and energy efficiency,
are discussed. The trade-offs between maintaining high
accuracy and achieving significant compression rates are
explored, highlighting the challenges and breakthroughs in
this domain. Additionally, this section discusses the practical
implications of model compression in real-world applica-
tions, emphasizing the need for robust and efficient models
that can operate under computational constraints.
5.1 Compression ratio
The compression ratio α can be determined by calculating the ratio between the original and compressed model sizes [21,90]. For example, if the original model size is 100 MB and the compressed model size is 10 MB, the compression ratio is 10:1 (100:10). Alternatively, it can be expressed as the ratio between the total number of parameters in the original model and in the simplified model [217], as in the following expression:

α(M, M*) = a / a*
where a is the number of parameters in the initial model M and a* is the number of parameters in the simplified model M*. The compression ratio α(M, M*) of M* over M is the proportion of the total number of parameters in M to the total number of parameters in M*. In addition, a commonly used benchmark is the space-saving index β, defined as:
β(M,M∗)=a−a∗