An Integrated Framework for Bearing Fault
Diagnosis: Convolutional Neural Network Model
Compression Through Knowledge Distillation
Jun Ma, Wei Cai, Yuhao Shan, Yuting Xia, and Runtong Zhang, Senior Member, IEEE
Abstract—The industrial application of rolling bearing
fault diagnosis necessitates achieving high classification
accuracy while minimizing the number of model parame-
ters to reduce the computational resources and storage
space required for the model. To meet this requirement, this
study proposes a knowledge distillation convolutional neural
network-deep forest (KDCNN-DF) hybrid model framework.
The proposed method integrates the continuous wavelet
transform (CWT) for signal data processing, a convolutional
neural network (CNN) optimized by knowledge distillation
(KD) for feature extraction, and a simplified multigranular
scanning (MGS) process using deep forest (DF) for fault
classification. In addition, during the construction of the stu-
dent models, this study found that the arrangement order
of kernel sizes in the CNN convolutional layers significantly
impacts the extraction of bearing fault features. Experimental
validation confirmed that an architecture with a smaller kernel
size preceding a larger kernel size in shallow-level models
is more effective. This effect is particularly pronounced after
the KD process and adoption in hybrid models, resulting in higher classification accuracy. The proposed KD method
reduces the parameter count of the CNN model to 5% of the original number while maintaining relatively high accuracy
and significantly reducing computing time. In addition, the modeling architecture of DF has been simplified by adopting
a streamlined MGS process. The proposed model achieves the highest accuracy on the original Case Western Reserve
University (CWRU) datasets, with 99.75% on the 48-kHz dataset, 99.90% on the 12-kHz dataset, and a perfect 100% on the
Ottawa dataset. These results surpass the accuracy of existing methods.
Index Terms—Continuous wavelet transform (CWT), deep forest (DF), intelligent bearing fault diagnosis, knowledge
distillation (KD), neural network compression.
Received 17 June 2024; revised 8 October 2024; accepted 8 October 2024. Date of publication 21 October 2024; date of current version 27 November 2024. The associate editor coordinating the review of this article and approving it for publication was Dr. Chao Hu. (Corresponding author: Runtong Zhang.)

Jun Ma, Yuhao Shan, Yuting Xia, and Runtong Zhang are with the School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China (e-mail: 21711056@bjtu.edu.cn; 21711003@bjtu.edu.cn; 21711071@bjtu.edu.cn; rtzhang@bjtu.edu.cn).

Wei Cai is with the School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: 21116005@bjtu.edu.cn).

Digital Object Identifier 10.1109/JSEN.2024.3481298

I. INTRODUCTION

ROTATING machinery plays an important role in modern industries, such as energy, transportation, and aerospace [1], [2], [3]. Among its key components, rolling bearings have a significant impact on the safety and reliability of rotating machinery [4]. In fact, rolling bearing faults can lead to financial losses and even casualties [5]. Therefore, it is essential to accurately diagnose and monitor the health status of rolling bearings.
In the past, fault diagnosis of rolling bearings primar-
ily relied on manually converting vibration signals and
extracting fault features for analysis. However, manual fault
feature extraction methods have significant shortcomings.
These methods demand substantial expert knowledge and are
time-consuming and labor-intensive for massive datasets [6].
Fortunately, with the integration of artificial intelligence
methods into the field of bearing fault diagnosis, intelligent
fault diagnosis of rolling bearings has become prominent.
In the early development of intelligent bearing fault diagnosis,
shallow machine learning (SML) emerged, relying on three
sequential steps: data collection, artificial feature extraction,
and fault identification [7]. Due to their easily understandable
and simple architectures, SML models, such as artificial neural
network (ANN) [8] and support vector machine (SVM) [9],
employ these steps through pattern recognition to achieve
fault diagnosis, making them widely used. Nonetheless, it has
been increasingly recognized that SML methods may struggle
with large and complex datasets, leading to less accurate
predictions.
Consequently, deep learning (DL) methods, exempli-
fied by the convolutional neural network (CNN), have been
employed to extract subtle fault features that are not easily
detectable, utilizing vast amounts of data to provide more
accurate diagnostic results [10],[11]. Sun et al. [12] proposed
a hybrid fault diagnosis method that leverages the capabil-
ities of CNN to integrate diverse frequency and sequence
features. They combined a gap-gated recurrent unit network
with adaptive batch normalization to refine feature extrac-
tion and enhance robustness. Zhang et al. [13] enhanced the
extraction of fault-related features from nonstationary bearing
signals by incorporating a cascaded multiscale information
fusion layer into the 2-D CNN architecture. In these studies,
researchers have discovered that stacking layers in a CNN
model enables the extraction of more complex patterns and
features, leading to improved model performance. Deeper
CNN network models consequently exhibit higher accuracy
and stronger generalization capabilities compared to shallow
network models. However, layer stacking also introduces sev-
eral new challenges.
The increase in the number of layers in a CNN sig-
nificantly inflates the model parameters, directly leading to
high computational and time costs as well as increased
memory requirements. As a result, the limited computa-
tional power and storage capacity of onboard equipment or
wearable devices typically prevent the practical deployment
of deep CNN models [14]. Given these constraints, model
compression is essential to reduce computational and memory
demands, enabling the practical deployment of deep CNN
models in resource-limited environments [15],[16]. Recent
research on model compression highlights the predominant
methods of model pruning [17], parameter quantization [18],
low-rank decomposition [19], lightweight model design [20],
and knowledge distillation (KD) [21]. Model pruning is a
technique that reduces the size of neural networks by removing
less important parameters. However, it can lead to perfor-
mance degradation and pose challenges in implementation,
particularly when addressing irregular sparsity. Parameter
quantization reduces the model size by lowering parameter
precision, and low-rank decomposition minimizes storage by
decomposing tensors to reconstruct convolutional kernels.
However, both parameter quantization and low-rank decom-
position necessitate extensive model fine-tuning or retraining,
which is inefficient. Although the lightweight model design
approach is easy to train and deploy, it suffers from poor
generalization ability and significant challenges in integrating
with other models. KD is a technique in which a pretrained
large teacher network transfers prior knowledge to a smaller
student network, enabling the student model to mimic the
teacher model and improve its performance. By utilizing a
large teacher model to transfer knowledge to a smaller student
model, KD achieves model compression without significantly
degrading accuracy, thereby facilitating easier deployment
on resource-constrained mobile devices [16]. Because knowledge is transferred from the teacher to the student model rather than enforced through a fixed architecture, the KD method places few restrictions on the network design and lends itself well to the construction of hybrid models.
The primary KD methods, including response-based KD,
feature-based KD, and relation-based KD, focus on enhancing
model compression ability and improving model accu-
racy [16],[22]. Zhong et al. [23] proposed a method combining
KD and generative adversarial networks to address the issue
of bearing fault diagnosis with small sample sizes. The model
compression is achieved through dense distillation of the
student network by multiple teacher networks. Gong et al. [24]
utilized a lightweight model with a channel-wise CNN as the
student network to compress a complex residual neural net-
work model, thereby enhancing the model’s output efficiency.
Yang et al. [25] proposed a two-stage edge-side fault diagnosis
method based on double KD, resulting in a lightweight model
for accurate and efficient fault diagnosis. Although these
methods enhance the data feature extraction capabilities of
small models through knowledge transfer, KD is not always
effective. When the architecture of the student network differs
significantly from that of the teacher network or when the
student network has far fewer parameters than the teacher
network, the student model struggles to mimic the teacher
model, rendering knowledge transfer nearly ineffective [26].
In addition, simplifying the student network architecture while
reducing parameters and training time can lead to overfitting.
Therefore, KD inevitably faces a tradeoff between simplifying
the model architecture and enhancing model performance and
generalization capability.
Numerous studies have demonstrated that the strength of
CNN lies in its feature extraction capabilities through con-
volutional layers. However, their fully connected layers and
softmax output layers are not ideal for recognizing and clas-
sifying fault features. In other words, using hybrid models to
replace the CNN’s output layer with more effective classifiers
can more efficiently identify the features extracted by the
CNN. Various models, such as SVM, XGBoost, and RNN,
have been shown to improve the classification performance
significantly [27]. Xu et al. [28] employed a deep forest
(DF) to replace the classifier in the original 2-D CNN model,
achieving higher accuracy in bearing fault diagnosis than the
standalone CNN model. The DF model outperforms other
models as a classifier because its core structure, the cascade
forest, builds diverse ensembles of trees iteratively in an
adaptive manner, enabling more reliable multiscale feature
recognition.
Motivated by the research and theories above, we propose
a KDCNN-DF hybrid model framework to compensate for
the accuracy decline of using the KD method through a
hybrid model. This framework achieves an effective model
compression by training an extremely compact student CNN
model through KD and enhances the model performance by
using the DF model as a classifier. Specifically, refining the
multigranular scanning (MGS) process in the DF model to
streamline the existing model architecture, combined with the
cascade forest, can mitigate the accuracy degradation caused
by the limited parameters of the student model. In addition,
by rearranging the kernel sizes in the convolutional layers,
we aim to achieve more effective feature extraction on a
single scale, thereby counteracting potential overfitting and
generalization problems. The contributions of this work are threefold.
1) To compress the model and improve classification accu-
racy, a novel hybrid model framework named KDCNN-DF for
bearing fault diagnosis is proposed. This framework processes
data using continuous wavelet transform (CWT), extracts fault
features with KDCNN, and employs DF as the classifier.
Using KD to assist student model training, the CNN model
parameters are compressed to 5% of the original, significantly
reducing computational resource consumption.
2) To enhance the performance of CNN models, a con-
struction method of placing a smaller-sized kernel before a larger-sized kernel is proposed. This strategic arrangement
of convolutional kernel sizes in shallow CNN student models
significantly improves the identification of fault features. The
discovery is then utilized to improve the feature extraction
performance of KDCNN.
3) To simplify the DF model architecture, a method to
simplify the MGS process is proposed. This method facilitates
seamless interfacing with subsequent cascaded forests using
solely a sliding window approach. Its application in the hybrid
model enables the classification accuracy to surpass that of
existing methods.
The remaining sections of this article are organized as
follows. In Section II, we provide foundational background
knowledge on the basic models. Section III introduces the
framework of the proposed models. In Section IV, we conduct
experiments on publicly available bearing datasets from the
two testbeds, CWRU and Ottawa, analyzing the experimental
results in three distinct cases. Section V concludes this article.
II. MODEL BACKGROUND
A. Convolutional Neural Network
As a classic model in the DL field, a basic CNN model
might contain several convolutional layers, pooling layers, and
fully connected layers. To achieve feature extraction, CNN
relies on filters to progressively transmit input data across
successive layers. As a result, CNNs are highly efficient and
scalable for various image recognition tasks, making them
a popular choice in DL applications. The most prominent
characteristics of CNNs are the sparse connections between
adjacent layers and the shared kernel weights within each
layer. Specifically, sparse connections reduce the number of
parameters in a CNN by ensuring that each convolutional
kernel is connected only to a localized region of the input [29].
This approach decreases the computational load and mitigates
overfitting by focusing on local features rather than global
patterns. Moreover, shared kernel weights further decrease
the parameter count by using the same set of weights for
each kernel as it slides across the input feature map. This
promotes weight reuse and consistency across the network,
thereby reducing the likelihood of overfitting and improving
generalization.
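To make the savings from sparse connections and shared kernels concrete, the following back-of-the-envelope sketch (our illustration, with layer sizes chosen only for demonstration) compares the parameter count of a shared-weight convolutional layer with that of a dense layer producing the same output volume.

```python
# Parameter count of a shared-weight convolutional layer versus a dense
# layer producing the same output volume (sizes are illustrative only).
in_h = in_w = 64                  # single-channel 64x64 input
k, out_ch = 3, 8                  # one shared 3x3 kernel per output channel

conv_params = out_ch * (k * k * 1) + out_ch            # weights + biases
dense_params = (in_h * in_w) * (in_h * in_w * out_ch)  # every input to every output

print(conv_params)    # 80
print(dense_params)   # 134217728
```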
B. Decision Tree-Based Ensemble Models
Decision tree-based ensemble classifier models leverage
decision trees as weak classifiers and utilize ensemble learning
techniques to mitigate the instability and overfitting issues
associated with individual decision tree models. Among these
models, the random forest classifier (RFC) and extreme gradi-
ent boosting (XGBoost) are widely recognized for their high
accuracy in handling high-dimensional data. RFC trains multi-
ple decision tree subclassifiers based on predefined conditions
and generates the final classification result through majority
voting. By employing bagging, which combines bootstrapping
and aggregation, multiple subclassifiers contribute to the clas-
sification results of the RFC ensemble model, enhancing model
accuracy and reducing overfitting.
Similarly, using decision trees as weak classifiers, XGBoost
continuously optimizes the model by incorporating a regu-
larized objective function during iterations. During training,
it minimizes the loss function through gradient descent and
utilizes second-order derivatives to optimize tree growth and
splitting. This allows XGBoost to incrementally add weak
classifiers over multiple iterations, thereby enhancing the over-
all performance of the model.
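For reference, the short sketch below fits both ensemble styles on synthetic data; the tree counts and elastic-net penalties (45 trees for the RFC, 30 trees with α = 0.8 and λ = 0.6 for XGBoost) anticipate the settings reported in Section IV-B, while the dataset itself is an arbitrary stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Bagging: independently trained trees whose class votes are aggregated.
rfc = RandomForestClassifier(n_estimators=45, random_state=0).fit(X, y)

# Boosting: trees added iteratively against a regularized objective
# (reg_alpha and reg_lambda are the elastic-net penalty terms).
xgb = XGBClassifier(n_estimators=30, reg_alpha=0.8, reg_lambda=0.6).fit(X, y)

print(rfc.predict_proba(X[:1]))
print(xgb.predict_proba(X[:1]))
```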
III. PROPOSED FAULT DIAGNOSIS FRAMEWORK
We propose a bearing fault diagnosis method, referred to
as the hybrid model architecture KDCNN-DF, which involves
three sequential implementation steps: 1) data segmentation
and CWT data preprocessing; 2) training the KDCNN feature
extraction model using KD from the teacher model to the
student model; and 3) fault classification using a DF model
with an adopted simplified MGS process.
A. Raw Data Segmentation and CWT
Through scale and translation operations, CWT demon-
strates adaptability to signals with varying frequencies over
time, providing high time resolution in high-frequency regions
and high frequency resolution in low-frequency regions. CWT
achieves its localized analysis through a combination of scal-
ing and translation operations. Scaling involves stretching or
compressing the wavelet function, while translation involves
shifting it along the time axis. These operations are applied
iteratively to refine the analysis progressively
$$\mathrm{CWT}_x(a,b) = \left\langle x(t), \psi_{a,b}(t) \right\rangle = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \overline{\psi}\!\left(\frac{t-b}{a}\right) dt. \tag{1}$$
The CWT process is expressed in (1), where $\mathrm{CWT}_x(a,b)$ denotes the inner product of $x(t)$ and $\psi_{a,b}(t)$, and $x(t)$ represents the normalized data value at time variable $t$. Moreover, $\psi_{a,b}$ is the wavelet basis function, and $\overline{\psi}_{a,b}$ is its complex conjugate. The scale parameter $a$ governs the scaling or stretching of the wavelet function, with smaller values corresponding to higher frequency components and larger values to lower frequency components. The translation parameter $b$ controls the wavelet function's movement along the time axis. In CWT, the selection of the wavelet scale is a critical step, and the choice of the wavelet generating function is even more
important in determining fault feature extraction [28]. In particular,
Chopra and Marfurt [30] and Kocahan et al. [31] indepen-
dently reported superior performance of the Morlet wavelet
compared to other wavelets in their respective applications.
The Morlet wavelet is characterized by a single-frequency sine
function under a Gaussian envelope, providing symmetry
crucial for signal analysis. According to the Morlet wavelet
expressed in (2), researchers noted that the amplitude–phase
representation associated with the composite Morlet wavelet
shows distinct information features in the presence of bearing
faults [32]. Therefore, the Morlet function aligns well with the
impact characteristics generated by bearing defects
$$\psi(t) = \cos(5t) \cdot \exp\left(-\frac{t^2}{2}\right) \tag{2}$$

$$N = \left\lfloor \frac{L - W}{S} \right\rfloor + 1. \tag{3}$$
In this study, a sliding window of length $W = 1024$ is used to intercept data fragments. For an original vibration signal of available length $L$ and a sliding step size of $S$, the rounded-down number of windows $N$ is given by (3).
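A minimal sketch of this preprocessing step is given below, using the PyWavelets library and the Case 1 settings (window length 1024, step 963, Morlet wavelet, scales from Table IV); the step-16 column subsampling that reduces the (64, 1024) coefficient array to 64 × 64 is our assumption, since the text only states that the coefficients are reshaped.

```python
import numpy as np
import pywt

def segment(signal, window=1024, step=963):
    """Slide a fixed-length window over the raw signal; N follows (3)."""
    n = (len(signal) - window) // step + 1        # rounded-down window count
    return np.stack([signal[i * step : i * step + window] for i in range(n)])

def cwt_image(fragment, scales=np.arange(1, 2049, 32)):
    """CWT with the Morlet mother wavelet, reduced to a 64x64 matrix."""
    coeffs, _ = pywt.cwt(fragment, scales, "morl")  # shape: (64, 1024)
    return coeffs[:, ::16]                          # assumed subsampling -> (64, 64)

sig = np.random.randn(482_000)        # stand-in for one CWRU 48-kHz record
windows = segment(sig)                # 500 fragments of length 1024
print(windows.shape, cwt_image(windows[0]).shape)   # (500, 1024) (64, 64)
```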
B. Knowledge Distillation
KD is widely used to transfer knowledge from large and
deep teacher models to smaller student models. The underlying
assumption of model compression based on the KD approach
is that the constructed deep teacher model outperforms the
shallow student model in terms of classification accuracy. The
KD process can thus be guided by the teacher model to target
the shallower, less-parameterized student model for classifi-
cation accuracy improvement. This enhancement is mainly
achieved by assisting the training of the student model through
the soft-labeling transfer of knowledge from the teacher model,
thus improving the accuracy of the student model perfor-
mance. As shown in Fig. 1, the process starts with training a
high-accuracy teacher model and transferring its classification
results, using both soft labels (probability distribution vectors)
and hard labels (labels for the original dataset) to form
the training set for the student model. KD further adjusts
the student model’s loss function, progressively aligning it
with the teacher model’s loss during the process, ultimately
improving the student model’s classification performance
$$L_{\text{distill}} = \alpha \times L_{\text{task}} + (1 - \alpha) \times \mathrm{KL}(P \,\|\, Q) \tag{4}$$

$$L_{\text{task}} = -\sum_{i}^{N} t_i \log(p_i) \tag{5}$$

$$\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \cdot \log \frac{P(x)}{Q(x)} \tag{6}$$

$$P(x) = \frac{1}{T^2}\, \mathrm{Softmax}\!\left(\frac{\mathrm{logits}_{\text{student}}}{T}\right) \tag{7}$$

$$Q(x) = \frac{1}{Z}\, \mathrm{Softmax}\!\left(\frac{\mathrm{logits}_{\text{teacher}}}{T}\right). \tag{8}$$
Fig. 1. Designing CNN architectures for KD: teacher–student training approach.

This study employs a KD loss function that is controlled by $\alpha$ and temperature $T$. In (4), $L_{\text{distill}}$ is calculated by
combining the task-specific loss with the Kullback–Leibler
(KL) divergence between the teacher model’s predictions P
and the student model’s predictions Q, with a weighting
factor $\alpha$ used to balance the contributions of each term. The
term Ltask in (5) represents the task-specific loss, typically
calculated using the cross-entropy loss function, where $t_i$ is the binary indicator (taking values 0 or 1) and $p_i$ is the predicted
probability for class i. The KL divergence, as denoted by (6),
serves to measure the discrepancy between two probability
distributions. Minimizing KL divergence between the smooth
probability distributions of the teacher and student models
facilitates knowledge transfer, aiding the student in capturing
the teacher’s knowledge and improving performance. In this
context, $P(x)$ and $Q(x)$ delineate the probability distributions of the teacher and student models, respectively. Specifically, $P(x)$ characterizes the probability distribution of the teacher model's soft labels, while $Q(x)$ represents the anticipated
probability distribution of the student model. Equations (7)
and (8), respectively, give the corresponding terms of the
loss function of the student model and the teacher model
in the divergence formula. In (7), $P(x)$ is computed from $\mathrm{logits}_{\text{student}}$, which represents the predictive output of the student model given an input. In (8), $Q(x)$ includes $\mathrm{logits}_{\text{teacher}}$, representing the teacher model's predictive output for the same raw input. To be specific, both $\mathrm{logits}_{\text{student}}$ and $\mathrm{logits}_{\text{teacher}}$
are raw outputs of the models, measuring their confidence in
various bearing fault categories. KD involves comparing these
raw outputs before applying softmax, enabling knowledge
transfer from teacher to student. By converting the logits of
the student model’s classification output into vector form using
the softmax function, (8) generates normalized classification
probability distributions for each image input, ensuring that
their sum is 1 with the normalization factor Z.
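A minimal PyTorch sketch of the loss in (4)-(8) follows; the $T^2$ scaling of the KL term corresponds to the $1/T^2$ factor in (7) and to common KD practice for restoring gradient magnitude, and the default α and T anticipate the values reported in Section IV-B.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.75, T=2.5):
    # L_task in (5): cross-entropy between student predictions and hard labels.
    task = F.cross_entropy(student_logits, targets)
    # Temperature-softened distributions, as in (7) and (8).
    log_q = F.log_softmax(student_logits / T, dim=1)   # student
    p = F.softmax(teacher_logits / T, dim=1)           # teacher soft labels
    # KL(P || Q) in (6); T^2 restores the gradient scale lost by softening.
    kl = F.kl_div(log_q, p, reduction="batchmean") * T ** 2
    # L_distill in (4): weighted combination controlled by alpha.
    return alpha * task + (1 - alpha) * kl
```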
TABLE I
TEACHER CNN MODEL ARCHITECTURE

1) Teacher CNN Model: The teacher model is structured,
as shown in Table I, which includes four convolutional lay-
ers, four max-pooling layers, a batch normalization layer,
a flattening layer, and two fully connected layers. Conv1 uses
a 5 × 5 kernel, while Conv2, Conv3, and Conv4 use a 3 ×
3 kernel with ReLU activation and a stride of 1. The padding
ensures consistent channel sizes for each convolutional layer.
All four max-pooling layers share a 2 × 2 kernel with a stride of 1. Table I lists the channel numbers for each
feature extraction block, and each of the two consecutive fully
connected layers contains 128 neurons.
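A sketch consistent with this description is shown below; the channel widths (16/32/64/64) are placeholders, since the exact values appear only in Table I, and stride-2 pooling is assumed here to keep the flattened feature small, although the text specifies a stride of 1.

```python
import torch.nn as nn

# Sketch only: channel widths are assumed, and MaxPool2d defaults to
# stride 2 here, whereas the text above specifies a stride of 1.
teacher = nn.Sequential(
    nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # Conv1: 5x5
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # Conv2: 3x3
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # Conv3: 3x3
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # Conv4: 3x3
    nn.BatchNorm2d(64),
    nn.Flatten(),
    nn.LazyLinear(128), nn.ReLU(),   # two consecutive 128-neuron FC layers
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),              # ten-class CWRU output
)
```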
2) Student CNN Model: The student CNN model features
a simplified architecture with fewer layers and parameters,
prioritizing computational efficiency for both training and
deployment to ensure optimal performance within resource constraints. This study introduces a shallow CNN framework for effectively extracting coefficient matrix representations after CWT transformation.
The convolutional layer serves as a foundational element
in CNNs, with factors such as the size of the convolutional
kernel, the number of channels, and the stride greatly influ-
encing the model’s classification outcomes. Li et al. [33]
and Zhang et al. [34] have elucidated that the pairing of
convolutional layers and activation functions during training
is crucial, as it affects the overall parameter training of the
model through backpropagation, consequently impacting the
classification performance. Jia et al. [35] have provided insight
by demonstrating the role of the initial convolutional kernel
as a filter for extracting key information, with subsequent
kernels providing varying degrees of activation. It is evident
that filter size and activation of subsequent convolutional
kernels are pivotal factors influencing the performance of
shallow CNN student models. Therefore, this article explores
the effect of the arrangement order of different convolution kernel sizes on the classification accuracy of shallow neural networks through experiments.
In this study, we adopt the model construction of the
remaining layers from [28], focusing specifically on the con-
figurations of Conv1 and Conv2 convolutional layers. Under
the premise of common model architectures, we investigate
how different settings of convolutional kernel sizes in the pre-
ceding and subsequent convolutional layers affect the student
model’s classification performance. Common CNN convolu-
tional kernel sizes used in bearing fault diagnosis, specifically
5 × 5 and 3 × 3, are employed with padding to maintain
consistency in channel data size between input and output.
The four models shown in Table II are named Student_33 (a),
Student_35 (b), Student_53 (c), and Student_55 (d) according
to the order in which the different convolution kernel sizes
are arranged in the two convolutional layers. Conv1 and Conv2 are both convolutional layers with eight channels. Max-pooling
layer kernel sizes, Max1 and Max2, are set to 4 × 4. After a
flattened layer transforms the 2-D data into 1-D form, a fully
connected layer FC1 with 64 neurons is constructed.
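As a concrete instance, a sketch of the Student_35 variant (a 3 × 3 kernel before a 5 × 5 kernel) is given below; the pooling stride is assumed equal to the 4 × 4 kernel, so the exact parameter totals in Table VI may differ slightly from this reading.

```python
import torch.nn as nn

# Student_35: smaller 3x3 kernel first, larger 5x5 kernel second,
# eight channels each, 4x4 pooling, FC1 with 64 neurons.
student_35 = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),  # 64x64 -> 16x16
    nn.Conv2d(8, 8, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),  # 16x16 -> 4x4
    nn.Flatten(),                                               # 8*4*4 = 128
    nn.Linear(128, 64), nn.ReLU(),                              # FC1: 64 neurons
    nn.Linear(64, 10),                                          # ten fault classes
)
# Swapping the two kernel sizes yields Student_53; repeating one size
# yields Student_33 or Student_55.
```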
TABLE II
STUDENT CNN MODEL WITH VARYING KERNEL SIZES
Fig. 2. Simplified MGS process: data flattening, sliding window data
extraction, and concatenation.
C. Deep Forest
The DF comprises two fundamental components: MGS and
Cascade Forest Classifier. The simplified MGS process used in
this study is illustrated in Fig. 2. First, segments of length $M^2$ are extracted as inputs for MGS. These segments are reshaped into 2-D matrices with dimensions $M \times M$. A sliding window of dimensions $J \times J$, with a stride of $S$, is then used to extract data samples, resulting in $N$ samples of dimensions $J \times J$, as determined by (9). These data samples undergo flattening and concatenation operations, ultimately forming a 1-D vector of length $N \times J^2$, which serves as the input for the Cascaded Forest

$$N = \left(\frac{M - J}{S} + 1\right)^2. \tag{9}$$
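A NumPy sketch of this simplified MGS follows; the defaults M = 8, J = 6, and S = 1 reproduce the Section IV-B configuration, in which the length-64 FC1 vector of the KDCNN becomes a length-324 cascade-forest input (S = 1 is implied there by the output length rather than stated).

```python
import numpy as np

def simplified_mgs(features, M=8, J=6, S=1):
    """Reshape a length-M^2 vector to MxM, slide a JxJ window with
    stride S, then flatten and concatenate the patches, per (9)."""
    grid = np.asarray(features).reshape(M, M)
    patches = np.lib.stride_tricks.sliding_window_view(grid, (J, J))[::S, ::S]
    return patches.reshape(-1)               # N * J^2 values

vec = np.arange(64, dtype=float)             # stand-in for the FC1 output
print(simplified_mgs(vec).shape)             # (324,): N = ((8-6)/1 + 1)^2 = 9
```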
The architecture of the Cascade Forest is illustrated in [36],
taking the features extracted through the MGS process as data
input and further training the model to classify the input data.
The DF model, initially proposed by Zhou and Feng [36],
comprises two forest classifiers at each layer. During training,
for the same data input, each layer separately trains these two
forest classifiers. This study employs XGBoost and RFC as
subclassifiers within the model based on Zhou’s model design.
For the Cascade Forest, each layer’s forest classifier consists
of a specified number of decision trees. These decision trees
generate probability distribution vectors corresponding to each
classification category through node partitioning. Subclassi-
fiers integrate multiple decision trees through average voting
to form the output results of the subclassifiers. When these
two subclassifiers in the previous layer finish training, the
subclassifier in the next layer inherits the probability distribu-
tion vectors outputted by the previous layer. Through several
iterations, layer by layer, the cascade process is implemented,
which allows the subclassifiers in the subsequent layer to
readjust parameters to refine the model's classification weights, thereby enhancing classification accuracy.
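The sketch below gives one structural reading of this cascade: each level trains an RFC and an XGBoost subclassifier and appends their class-probability vectors to the features passed to the next level. It omits the cross-validated probability generation of full DF implementations, so it should be read as an illustration rather than the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def cascade_predict(X_train, y_train, X_test, n_levels=2):
    """Structural sketch of a cascade forest with RFC + XGBoost per level."""
    tr, te = X_train, X_test
    for _ in range(n_levels):
        rfc = RandomForestClassifier(n_estimators=45, random_state=0).fit(tr, y_train)
        xgb = XGBClassifier(n_estimators=30, reg_alpha=0.8,
                            reg_lambda=0.6).fit(tr, y_train)
        p_tr = np.hstack([rfc.predict_proba(tr), xgb.predict_proba(tr)])
        p_te = np.hstack([rfc.predict_proba(te), xgb.predict_proba(te)])
        # The next level inherits the probability vectors as extra features.
        tr, te = np.hstack([tr, p_tr]), np.hstack([te, p_te])
    half = p_te.shape[1] // 2
    return (p_te[:, :half] + p_te[:, half:]).argmax(axis=1)  # average voting
```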
IV. EXPERIMENTAL SETUP AND RESULT ANALYSIS
The computer configuration used in this study comprises an
Intel Core i5-12500H CPU with a base speed of 2.50 GHz,
12 cores, and 16 logical processors. The system is equipped
with 16 GB of RAM operating at 4800 MHz and utilizing all
eight available memory slots.
The effectiveness of the proposed methodology framework
is evaluated using datasets from two distinct testbeds: the Case
Western Reserve University (CWRU) bearing dataset [37],
widely regarded as a benchmark, and the University of
Ottawa’s rolling-element dataset [38], which serves as a more
recent complement to the CWRU dataset. It is worth noting
that for the CWRU dataset, we selected all available frequen-
cies, namely, the 12- and 48-kHz sampling frequencies. The
12-kHz dataset was utilized due to its widespread use as a
benchmark for robust comparison of the method’s effective-
ness, while the 48-kHz dataset, representing a relatively higher
sampling frequency, was employed as a complementary test
to further validate the model’s performance. In addition, the
Ottawa testbed, configured with a 42-kHz sampling rate, offers
a dataset that not only contains more fault categories but also
bridges the two selected CWRU frequencies, providing further
experimental validation.
A. Dataset Description and Data Segmentation
In the CWRU bearing dataset, single-point faults were
induced in SKF test bearings through electrodischarge machin-
ing, as shown in Fig. 3(a). The experimental fault diameters of
bearings included 0.007, 0.014, and 0.021 in. These faults were
separately introduced at the inner raceway (IR), ball (BA),
and outer raceway (OR). The compromised bearings were
reinstalled into the test motor, and vibration data were recorded
under motor loads of 0, 1, 2, and 3 hp. Accordingly, under
each load condition, the faulty bearings corresponded to three
fault types and three fault diameters, in addition to normal
bearings, forming a ten-class bearing fault diagnosis problem.
To ensure consistent data length under each load, only loads
of 1, 2, and 3 hp were selected. Specifically, in the following
sections, Case 1 used the 48-kHz data, and Case 2 used the 12-kHz data.

Fig. 3. Rolling bearing test stands. (a) CWRU bearing apparatus. (b) Ottawa rolling-element apparatus.
1) Case 1—CWRU 48-kHz Dataset: In Case 1, the vibration
dataset from CWRU with a sampling rate of 48 kHz com-
prises over 482 000 data points for each fault type. The first
482 000 data points were selected for data segmentation to
ensure consistent data quantity across all categories. A sliding
window of length 1024 data points was employed to extract
500 samples with a sliding step size of 963 data points.
As depicted in Table III, for each category, 500 data segments
were stratified and sampled in a 6:4 ratio. Consequently,
300 samples were designated for the training set, and 200 sam-
ples were allocated for the testing set.

TABLE III
DATA DESCRIPTION
2) Case 2—CWRU 12-kHz Dataset: For this case, the vibra-
tion dataset from CWRU with a sampling rate of 12 kHz
was utilized. Considering the smaller data volume in the
12-kHz dataset, the first 120 000 data points were selected for
segmentation. A sliding window of length 1024 data points
was employed, with a sliding step size of 396 data points.
As a result, 300 labeled data segments were obtained. For load
levels of 1, 2, and 3 hp, a stratified sampling approach was
implemented to partition the data into training and testing sets.
As a result, 200 data segments were designated for the training
set, and the remaining 100 data segments were allocated to
the test set, as outlined in Table III. Taking advantage of the
smaller data volume, this case was used to train the model
with a reduced proportion of data, allowing for a comparison
of the model’s performance against state-of-the-art methods.
In the Ottawa case, raw data were collected at a constant
nominal speed and load, sampled at a rate of 42 kHz, as shown
in Fig. 3(b). The Ottawa dataset encompassed four fault types,
namely, inner race, outer race, ball, and cage. Each bearing was
subjected to data collection in three distinct states: healthy,
developing fault, and faulty. Importantly, this study selected
four bearings representing the faulty types and one bearing
representing the healthy type, forming a five-class bearing
fault diagnosis problem. Specifically, the four faulty bearings
corresponded to datasets labeled as I_1_2, O_6_2, B_11_2,
and C_16_2, representing inner race, outer race, ball, and cage
faults, respectively. Conversely, the bearing representing the
healthy type was labeled as H_2_0.
3) Case 3—Ottawa 42-kHz Dataset: In Case 3, all 420 000 vibration data points were selected for each
fault category. The sliding window approach mentioned above
was adopted with a window size of 1024 data points and a
sliding step of 839 data points. Subsequently, 500 samples
were obtained. The dataset was then partitioned into a training
set with 300 samples and a test set with 200 samples using a
6:4 ratio.
For Cases 1–3, the segmented data fragments obtained via
sliding window segmentation were all transformed into 2-D
arrays using CWT. During this process, a set of discrete values
formed an arithmetic progression, as depicted in Table IV.
These values were employed as the scales for the CWT.
Specifically, for Cases 1 and 2 corresponding to the CWRU
dataset, the scale parameter for the CWT ranged from 1 to
2049 with a step size of 32. Conversely, for Case 3 cor-
responding to the Ottawa dataset, a scale sequence ranging
from 1 to 2049 with a step size of 64 was utilized for the
transformation. These data, following the CWT using the
“Morlet” mother wavelet, underwent a reshaping procedure,
resulting in coefficient matrices of size 64 × 64. Taking the
vibration sensor data from the CWRU 48 kHz with a 1-hp load
as an example, Fig. 4 illustrates the numerical distribution and
visualization of the wavelet coefficient matrix formed after the
transformation of data segments with a length of 1024 through
CWT.

TABLE IV
SCALE PARAMETERS USED BY CWT IN THREE CASES

Fig. 4. Visualization of the wavelet coefficient matrices for segmented vibration signals from the inner race (IR), ball (BA), and outer race (OR) of bearings with 0.007-in faults under the 1-hp load condition in the CWRU 48-kHz dataset after the CWT process.
B. Parameter Settings
All models designed for the KD process, including the
teacher model, the KDCNN model, and the original student
model used for comparison, were trained with a batch size
of 128 and a learning rate of 0.01, employing ReLU as the
activation function. For the selection of the loss function,
the teacher model and all original student models used a
cross-entropy loss, while the KDCNN model incorporated a
distillation loss computed from (4). The distillation parameter $\alpha$ varied across datasets, with a value of 0.75 for the CWRU
48-kHz and CWRU 12-kHz 1-hp load datasets, and a value of
0.45 for the CWRU 12-kHz 2- and 3-hp loads, as well as the
Ottawa datasets. For the temperature parameter T, a uniform
setting of 2.5 was applied in all cases. Training epochs were
set at 35 for the teacher model and 40 for both the original
student and KDCNN models.
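Wiring these settings to the loss sketch of Section III-B yields the hypothetical loop below; only the learning rate, batch size, and epoch counts come from the text, while the SGD optimizer choice and the stand-in data are our assumptions (teacher, student_35, and distillation_loss refer to the earlier sketches).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 64x64 CWT images with ten fault classes (batch size 128).
data = TensorDataset(torch.randn(512, 1, 64, 64), torch.randint(0, 10, (512,)))
train_loader = DataLoader(data, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(student_35.parameters(), lr=0.01)  # optimizer type assumed
teacher.eval()
for epoch in range(40):                       # 40 epochs for the KDCNN student
    for x, y in train_loader:
        with torch.no_grad():
            t_logits = teacher(x)             # frozen teacher supplies soft labels
        loss = distillation_loss(student_35(x), t_logits, y, alpha=0.75, T=2.5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```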
In the DF process, the study extracted the output of the
first fully connected layer from the KDCNN model, resulting
in a vector of length 64. Each vector underwent a simplified
MGS process involving a 2-D sliding window of size 6 × 6,
ultimately yielding a 1-D vector of length 324, which served
as the input for the cascaded forest. The cascaded forest model
employed an XGBoost classifier and an RFC. For XGBoost,
the regularization elastic net parameter $\alpha$ was set to 0.8, and the parameter $\lambda$ was set to 0.6. XGBoost utilized 30 trees,
while the RFC used 45 trees. With regard to the depth of
the cascade forest, in Case 1, the number of cascade levels was set to 3 for the 1- and 2-hp loads and to 2 for the 3-hp load. In Cases 2 and 3, the number of cascade levels was set to 2.
C. Analysis of Kernel Size Sequencing in KDCNN
The experiments in this section reveal and elucidate the
feature extraction characteristics of the CNN by exploring
the arrangement of common 2-D convolutional kernel sizes,
specifically 3 × 3 and 5 × 5, across two convolutional layers in the student models. Table V presents the accuracy
of the test datasets after ten rounds of training. Notably, the
suffixes indicated by underscores “_” represent the order of the
kernel sizes in the model’s convolutional layers. For example,
“KDCNN Student_53” denotes a KD model with the first
convolutional kernel size being 5 × 5 and the second being 3 × 3.

TABLE V
TEST SET RESULTS OF TEACHER MODEL, STUDENT MODEL, AND DISTILLED STUDENT MODEL AFTER TEN TRAINING ITERATIONS
The shallow CNN model “Original Student_35”, where
smaller convolutional kernels precede larger ones, was trained
under the KD method to obtain the “KDCNN Student_35”
model. It demonstrated higher average classification accuracy
and relatively significant standard deviation in most cases
compared to the teacher CNN, despite having a significantly
reduced parameter count. The “KDCNN Student_35” model shown in Table VI does not have the highest parameter count among the four student models, with a count of 9048, which suggests that high accuracy is not solely determined by the model's parameter count. Indeed, the arrangement of
convolutional kernel sizes plays a more decisive role than the
number of parameters.

TABLE VI
PARAMETER SCALES AND COMPUTING TIMES OF DIFFERENT MODELS
The more likely explanation for this phenomenon is that an
earlier placement of smaller 3 ×3 kernels can initially capture
subtle fault features more effectively. In contrast, larger kernels
ensure diversity in further fault feature extraction by leveraging
a larger receptive field. By reducing the distillation loss, the
KDCNN model achieved higher classification accuracy than
the original student model. Visualizations in Fig. 5, including
confusion matrices and CNN feature maps, reveal that the
student model extracted fewer features after two convolutional
layers, while the distillation process enhanced the feature
extraction in its second convolutional layer to be more aligned
with the output of the teacher model’s final convolutional layer.
The experimental results demonstrate that although the KD
model compression method reduces the model’s parameters to
5% of the original, its accuracy does not significantly decrease
and remains at a high level compared to the original model
without KD training.

Fig. 5. Comparative analysis of classification accuracy between (a) the teacher model, (b) the Original Student_35 model, (c) the Original Student_55 model, (d) the KDCNN Student_35 model, and (e) the KDCNN Student_55 model, along with visualization of feature maps in various convolutional layers; a case study using the CWRU 48-kHz 1-hp load dataset.
D. Comparison of Computing Time
Once the KDCNN model is trained, it can be indepen-
dently applied to resource-constrained devices for bearing
fault diagnosis tasks. Therefore, the effectiveness of model
compression can be evaluated by the reduction in parameters
and the computation time required. CPU time is the total processor time the model spends on computation, while elapsed time also includes the time the model spends interacting with the operating system on top of the CPU time.
This part of the study compared the elapsed time and the
CPU time required for the teacher model and various KDCNN
models on the 1-hp test set of the CWRU 48-kHz dataset
in Case 1 to assess the effectiveness of the KD method in
reducing computation time. Importantly, to simulate limited
computational resources, this part of the study exclusively
employed a single CPU logical processor operating in a single-
thread mode.
The results presented in Table VI show that in contrast to
the teacher model, the shallow networks trained through KD
significantly reduced the computation time on a single CPU
logical process due to the simpler model structures and fewer
parameters. Among these models, the “KDCNN Student_33”
model has the fewest parameters, with a CPU time of only
1.42 s, making it the shortest among all models. In comparison,
the CPU time and elapsed times of the “KDCNN Student_35”
model are also short, and its accuracy is significantly higher.
Considering both accuracy and model compression efficiency,
we selected the “KDCNN Student_35” model for feature
extraction in the final hybrid model, which required only 1.47 s
of CPU time and 1.55 s of elapsed time, significantly less than
the teacher model. Due to the simplified model architecture,
the number of parameters was reduced from 213 344 to 9944,
achieving a compression rate exceeding 95%.
E. Result Analysis and Comparison With
State-of-the-Art Findings
In this study, the KDCNN model with the highest classifi-
cation accuracy was selected for feature extraction, combining
the DF classifier to form the final hybrid model. The highest
classification accuracy results on the test datasets are pre-
sented in Table VII. In line with the previous results, the
“KDCNN_35” model, which features a small convolutional
kernel size in the first layer and a larger one in the second,
achieved the highest classification accuracy. Furthermore, the
rightmost column of Table VII shows that the final hybrid
model consistently outperformed the standalone KDCNN
model, confirming that the DF classifier is more effective
than the CNN network alone. In addition, Fig. 6presents the
confusion matrix for bearing fault diagnosis on the correspond-
ing CWRU datasets. Under consistent training and testing
conditions, the results on the test set indicate that the proposed
method achieved classification accuracies exceeding 98.75%
on the CWRU 48-kHz dataset in Case 1 and reached over
98.90% on the CWRU 12-kHz dataset in Case 2. Moreover,
the t-distributed stochastic neighbor embedding (t-SNE) visu-
alization displayed in Fig. 7 shows only a small amount of
overlap between the faulty bearing classes. When it comes to
Case 3, as shown in Table VII, the KDCNN-DF hybrid model
achieved 100% classification accuracy on the Ottawa dataset.
The confusion matrix and t-SNE visualization results for the
hybrid model on the Ottawa dataset are presented in Fig. 8(a)
and (b), respectively.

TABLE VII
TEN-TRIAL HIGHEST ACCURACY: STUDENT MODELS, DISTILLED STUDENT MODELS, TEACHER MODELS, AND HYBRID MODELS

Fig. 6. Confusion matrix visualization of the CWRU dataset classification results, including matrices representing the (a)-(c) 48-kHz and (d)-(f) 12-kHz sampling-rate datasets, both covering 1-3-hp operating conditions.
This study compares the proposed method with current
relevant approaches in the field. To make the results more valuable for reference, we selected the dataset load condition used by most methods: the CWRU 12-kHz dataset
with a 2-hp load. The compared approaches include DF [28],
CWT-CNN-DF [28], Weighted XGBoost [39], 2DCNN-RF [40], federated transfer learning and KD (FTLKD) [41],
time series transformer [42], and multiassistant KD with
decreasing threshold channel pruning (DTCP-MAKD) [43].
Specifically, DF [28] and CWT-CNN-DF [28] are similar
to the proposed method but do not use the simplified MGS
method. Moreover, the CWT-CNN-DF [28] method does not
adopt a KD model compression method as well. Weighted
XGBoost [39] employs a Decimation-In-Time combined with
fast Fourier transform for signal preprocessing, which is then
used to train the XGBoost model. 2DCNN-RF [40] converts
input data into 2-D images for feature extraction using CNN
and then employs RFC as the classifier. FTLKD [41] combines
a federated transfer learning strategy with a multisource KD
method, using multiple teacher CNN networks to train a stu-
dent CNN model under different working conditions. The time
series transformer [42] generates token sequences from 1-D
data using a proposed time-series tokenizer and combines them
with a transformer model for bearing fault diagnosis. DTCP-
MAKD [43] bridges the gap between teacher and student
networks using multiple medium-sized auxiliary networks to
facilitate the KD process. It also applies model pruning to
remove channels that are not beneficial to the KD process,
further achieving model compression.
We adopted the data preprocessing techniques and image
input sizes used in the original methods. Specifically, we mod-
ified the model architectures that employ DF as their base
model, including DF [28] and CWT-CNN-DF [28]. We applied
the simplified MGS method proposed in this study to simplify their DF model architectures. In addition, we maintained
the original experimental designs for all other methods for
comparison. According to the results shown in Table VIII, the
proposed hybrid model method achieved high classification
accuracy. Notably, the hybrid model outperformed the indi-
vidual DF, XGBoost, and CNN models. Compared to the
CWT-CNN-DF method without KD-based model compres-
sion, the proposed method achieved higher accuracy due to
the improvements in CNN feature extraction facilitated by the
KD method.

TABLE VIII
COMPARISON WITH STATE-OF-THE-ART METHODS

Fig. 7. t-SNE visualization of the CWRU dataset classification results. (a)-(c) 48-kHz sampling rate, 1-3-hp operating condition dataset. (d)-(f) 12-kHz sampling rate, 1-3-hp operating condition dataset.

Fig. 8. Final classification results for the Ottawa dataset, including (a) the distilled CNN-DF hybrid model confusion matrix and (b) the t-SNE visualization result.
F. Generalization Ability Under Noisy Test Conditions
To test the model’s generalization ability and simulate
the noise interference commonly encountered in industrial
scenarios, this part of the study introduced Gaussian white noise into the vibration signals. Specifically, noise was added
only to the test set, requiring the model to resist interference
from features it had never encountered during training, which
is a challenging task. The signal-to-noise ratio (SNR) was used as an indicator to measure noise intensity, where $P_{\text{signal}}$ is the signal power and $P_{\text{noise}}$ is the noise power, as shown in (10).
Gaussian white noise with SNR values of 0, 2, 4, and 6 dB was selected for this experiment. The performance of the teacher model, the “KDCNN Student_35” model, the “KDCNN_53” model, and the hybrid model was evaluated under different SNR levels on the CWRU 48-kHz 1-hp dataset
$$\mathrm{SNR} = 10 \log_{10}\left(P_{\text{signal}} / P_{\text{noise}}\right). \tag{10}$$
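A small sketch of the noise-injection step implied by (10) is given below; it is our implementation, assuming zero-mean Gaussian noise scaled to a target SNR in dB.

```python
import numpy as np

def add_white_noise(signal, snr_db):
    """Add zero-mean Gaussian white noise at a target SNR (dB), per (10)."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # invert (10) for the noise power
    noise = np.random.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

noisy = add_white_noise(np.sin(np.linspace(0, 100, 48_000)), snr_db=0)
```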
The results in Fig. 9 show that the models can still achieve
a high classification accuracy of 98.60% under low noise
conditions. However, as the noise becomes more significant,
the classification accuracy of the models noticeably decreases.
Overall, while the hybrid model can improve classification
accuracy to some extent in the presence of noise in the test set,
the improvement is very limited under strong noise conditions.

Fig. 9. Performance of each model under different SNRs.
V. CONCLUSION
This study proposes a KDCNN-DF hybrid model for fault
diagnosis using KD to compress the model and reduce
computational resource consumption effectively. Initially, data
segmentation is performed using a sliding window followed by
a wavelet transform to prepare the data for model input. Next,
KD is used to transfer knowledge from the proposed teacher
model to the student model via soft labels, achieving model
compression. Finally, the output from the fully connected layer
of the KDCNN model is used as input for the DF model,
which enhances classification accuracy through a simplified
MGS process and a cascade forest.
The experimental findings on the arrangement of convolutional kernel sizes contributed significantly to the hybrid model's enhanced fault feature recognition. Additional
experiments evaluated the model’s generalization capabilities.
By compressing the model and improving classification accu-
racy, this framework facilitates the deployment of intelligent
bearing fault diagnosis in resource-constrained industrial set-
tings. Future research could explore transfer learning and
improvements in KD to further enhance model generalization.
REFERENCES
[1] E. Estupian, D. Espinoza, and A. Fuentes, “Energy losses caused by
misalignment in rotating machinery: A theoretical, experimental and
industrial approach,” International Journal Comadem, vol. 11, no. 2,
pp. 1–26, 2008.
[2] Z. Chen, S. Tian, X. Shi, and H. Lu, “Multiscale shared learning for
fault diagnosis of rotating machinery in transportation infrastructures,”
IEEE Trans. Ind. Informat., vol. 19, no. 1, pp. 447–458, Jan. 2023.
[3] K. S. Haran et al., “High power density superconducting rotating
machines—Development status and technology roadmap,” Superconduc-
tor Sci. Technol., vol. 30, no. 12, Dec. 2017, Art. no. 123002.
[4] W. Li, X. Zhong, H. Shao, B. Cai, and X. Yang, “Multi-mode data
augmentation and fault diagnosis of rotating machinery using modified
ACGAN designed with new framework,” Adv. Eng. Informat., vol. 52,
Apr. 2022, Art. no. 101552.
[5] Y. Xiao, H. Shao, S. Han, Z. Huo, and J. Wan, “Novel joint transfer net-
work for unsupervised bearing fault diagnosis from simulation domain to
experimental domain,” IEEE/ASME Trans. Mechatronics, vol. 27, no. 6,
pp. 5254–5263, Dec. 2022.
[6] X. Wang, C. Shen, M. Xia, D. Wang, J. Zhu, and Z. Zhu, “Multi-scale
deep intra-class transfer learning for bearing fault diagnosis,” Rel. Eng.
Syst. Saf., vol. 202, Oct. 2020, Art. no. 107050.
[7] J. Cen, Z. Yang, X. Liu, J. Xiong, and H. Chen, “A review of data-
driven machinery fault diagnosis using machine learning algorithms,”
J. Vib. Eng. Technol., vol. 10, pp. 2481–2507, Oct. 2022.
[8] Y. S. Wang, N. N. Liu, H. Guo, and X. L. Wang, “An engine-
fault-diagnosis system based on sound intensity analysis and wavelet
packet pre-processing neural network,” Eng. Appl. Artif. Intell., vol. 94,
Sep. 2020, Art. no. 103765.
[9] M. Wang et al., “Roller bearing fault diagnosis based on integrated fault
feature and SVM,” J. Vibrat. Eng. Technol., vol. 10, no. 3, pp. 853–862,
Apr. 2022.
[10] B. Zhao, X. Zhang, H. Li, and Z. Yang, “Intelligent fault diagnosis
of rolling bearings based on normalized CNN considering data imbal-
ance and variable working conditions,” Knowl.-Based Syst., vol. 199,
Jul. 2020, Art. no. 105971.
[11] W. Cai, X. Zhu, K. Bai, A. Ye, and R. Zhang, “An explainable dual-mode
convolutional neural network for multivariate time series classification,”
Knowl.-Based Syst., vol. 299, Sep. 2024, Art. no. 112015.
[12] L. Sun, X. Zhu, J. Xiao, W. Cai, Q. Ma, and R. Zhang, “A hybrid fault
diagnosis method for rolling bearings based on GGRU-1DCNN with
AdaBN algorithm under multiple load conditions,” Meas. Sci. Technol.,
vol. 35, no. 7, Jul. 2024, Art. no. 076201.
[13] L. Zhang, Y. Lv, W. Huang, and C. Yi, “Bearing fault diagnosis
under various operation conditions using synchrosqueezing transform
and improved two-dimensional convolutional neural network,” Meas.
Sci. Technol., vol. 33, no. 8, Aug. 2022, Art. no. 085002.
[14] J. Zhang, W. Wang, C. Lu, J. Wang, and A. K. Sangaiah, “Lightweight
deep network for traffic sign classification,” Ann. Telecommun., vol. 75,
nos. 7–8, pp. 369–379, Aug. 2020.
[15] J. Sun, Z. Liu, J. Wen, and R. Fu, “Multiple hierarchical compression
for deep neural network toward intelligent bearing fault diagnosis,” Eng.
Appl. Artif. Intell., vol. 116, Nov. 2022, Art. no. 105498.
[16] Z. Li, H. Li, and L. Meng, “Model compression for deep neural
networks: A survey,” Computers, vol. 12, no. 3, p. 60, Mar. 2023.
[17] J. Guo, D. Xu, and W. Ouyang, “Multidimensional pruning and its
extension: A unified framework for model compression,” IEEE Trans.
Neural Netw. Learn. Syst., vol. 2, no. 5, pp. 1–15, Jun. 2023.
[18] H. Zhang, B. Yao, W. Shao, X. Xu, L. Liu, and Y. Peng, “Mixed pre-
cision quantized neural network accelerator for remote sensing images
classification,” in Proc. IEEE 16th Int. Conf. Electron. Meas. Instrum.
(ICEMI), vol. 30, Harbin, China, Aug. 2023, pp. 172–176.
[19] R. Dian, Y. Liu, and S. Li, “Spectral super-resolution via deep low-rank
tensor representation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 21,
no. 1, pp. 1–11, Jun. 2024.
[20] Z. Meng, C. Luo, J. Li, L. Cao, and F. Fan, “Research on fault diagnosis
of rolling bearing based on lightweight model with multiscale features,”
IEEE Sensors J., vol. 23, no. 12, pp. 13236–13247, Jun. 2023.
[21] K. Sun, L. Bo, H. Ran, Z. Tang, and Y. Bi, “Unsupervised domain
adaptation method based on domain-invariant features evaluation and
knowledge distillation for bearing fault diagnosis,” IEEE Trans. Instrum.
Meas., vol. 72, pp. 1–10, 2023.
[22] J. Liu, T. Zheng, G. Zhang, and Q. Hao, “Graph-based knowledge distil-
lation: A survey and experimental evaluation,” 2023, arXiv:2302.14643.
[23] H. Zhong, S. Yu, H. Trinh, Y. Lv, R. Yuan, and Y. Wang, “A novel small-
sample dense teacher assistant knowledge distillation method for bearing
fault diagnosis,” IEEE Sensors J., vol. 23, no. 20, pp. 24279–24291,
Oct. 2023.
[24] R. Gong, C. Wang, J. Li, and Y. Xu, “Lightweight fault diagnosis method
in embedded system based on knowledge distillation,” J. Mech. Sci.
Technol., vol. 37, no. 11, pp. 5649–5660, Nov. 2023.
[25] Y. Yang, Y. Long, Y. Lin, Z. Gao, L. Rui, and P. Yu, “Two-stage edge-
side fault diagnosis method based on double knowledge distillation,”
Comput., Mater. Continua, vol. 76, no. 3, pp. 3623–3651, 2023.
[26] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and
H. Ghasemzadeh, “Improved knowledge distillation via teacher assis-
tant,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5191–5198.
[27] M. El Khayati, I. Kich, and Y. Taouil, “CNN-based methods for
offline Arabic handwriting recognition: A review,” Neural Process. Lett.,
vol. 56, no. 2, p. 115, Mar. 2024.
[28] Y. Xu, Z. Li, S. Wang, W. Li, T. Sarkodie-Gyan, and S. Feng, “A
hybrid deep-learning model for fault diagnosis of rolling bearings,”
Measurement, vol. 169, Feb. 2021, Art. no. 108502.
[29] L. Wen, X. Li, L. Gao, and Y. Zhang, “A new convolutional neural
network-based data-driven fault diagnosis method,” IEEE Trans. Ind.
Electron., vol. 65, no. 7, pp. 5990–5998, Jul. 2018.
[30] S. Chopra and K. J. Marfurt, “Choice of mother wavelets in CWT
spectral decomposition,” in Proc. SEG Tech. Program Expanded Abstr.,
vol. 24, Aug. 2015, pp. 2957–2961.
[31] O. Kocahan, E. Tiryaki, E. Coskun, and S. Ozder, “Determination of
phase from the ridge of CWT using generalized Morse wavelet,” Meas.
Sci. Technol., vol. 29, no. 3, Mar. 2018, Art. no. 035203.
[32] C. Malla, A. Rai, V. Kaul, and I. Panigrahi, “Rolling element bearing
fault detection based on the complex Morlet wavelet transform and
performance evaluation using artificial neural network and support vector
machine,” Noise Vibrat. Worldwide, vol. 50, nos. 9–11, pp. 313–327,
Oct. 2019.
[33] J. Li et al., “Explainable CNN with fuzzy tree regularization for
respiratory sound analysis,” IEEE Trans. Fuzzy Syst., vol. 30, no. 6,
pp. 1516–1528, Jun. 2022.
[34] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu, “Interpreting CNNs via
decision trees,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2019, pp. 6254–6263.
[35] F. Jia, Y. Lei, N. Lu, and S. Xing, “Deep normalized convolutional
neural network for imbalanced fault classification of machinery and its
understanding via visualization,” Mech. Syst. Signal Process., vol. 110,
pp. 349–367, Sep. 2018.
[36] Z.-H. Zhou and J. Feng, “Deep forest,” Nat. Sci. Rev., vol. 6, no. 1,
pp. 74–86, Jan. 2019.
[37] W. A. Smith and R. B. Randall, “Rolling element bearing diagnostics
using the Case Western Reserve University data: A benchmark study,”
Mech. Syst. Signal Process., vols. 64–65, pp. 100–131, Dec. 2015.
[38] M. Sehri, P. Dumond, and M. Bouchard, “University of Ottawa constant
load and speed rolling-element bearing vibration and acoustic fault
signature datasets,” Data Brief, vol. 49, Aug. 2023, Art. no. 109327.
[39] C. Xiang, Z. Ren, P. Shi, and H. Zhao, “Data-driven fault diagnosis for
rolling bearing based on DIT-FFT and XGBoost,” Complexity, vol. 2021,
pp. 1–13, May 2021.
[40] S. Yang et al., “A 2DCNN-RF model for offshore wind turbine
high-speed bearing-fault diagnosis under noisy environment,” Energies,
vol. 15, no. 9, p. 3340, May 2022.
[41] Y. Zhou, J. Wang, and Z. Wang, “Bearing faulty prediction method based
on federated transfer learning and knowledge distillation,” Machines,
vol. 10, no. 5, p. 376, May 2022.
[42] Y. Jin, L. Hou, and Y. Chen, “A time series transformer based method
for the rotating machinery fault diagnosis,” Neurocomputing, vol. 494,
pp. 379–395, Jul. 2022.
[43] H. Zhong, S. Yu, H. Trinh, Y. Lv, R. Yuan, and Y. Wang, “Multiassistant
knowledge distillation for lightweight bearing fault diagnosis based on
decreasing threshold channel pruning,” IEEE Sensors J., vol. 24, no. 1,
pp. 486–494, Jan. 2024.
Jun Ma was born in Beijing, China, in 2002.
He is currently pursuing the bachelor’s degree
in management information systems with Beijing
Jiaotong University, Beijing, China.
His primary research interests lie in machine
learning and deep learning.
Wei Cai received the B.S. degree from South-
west Jiaotong University, Chengdu, China,
in 2021. He is currently pursuing the Ph.D.
degree in mechanical engineering with the
School of Mechanical, Electronic and Control
Engineering, Beijing Jiaotong University, Beijing,
China.
His current research interests include deep
learning, explainable artificial intelligence, and
intelligent fault diagnosis.
Yuhao Shan is currently pursuing the bachelor’s
degree in management information systems with
Beijing Jiaotong University, Beijing, China.
His main research interests include machine
learning, data analytics, and predictive model-
ing.
Yuting Xia is currently pursuing the B.Sc.
degree in management information systems with
Beijing Jiaotong University, Beijing, China.
Her research interests focus on machine learn-
ing, data analytics, and predictive modeling.
Runtong Zhang (Senior Member, IEEE)
received the B.S. degree in computer science
and automation from Dalian Maritime University,
Dalian, China, in 1985, and the Ph.D. degree
in production engineering and management
from the Technical University of Crete, Chania,
Greece, in 1996.
He was with the Swedish Institute of Computer
Science, Stockholm, Sweden, as a Senior
Researcher, and with the Port of Tianjin Authority as an
Engineer. He is currently a Professor and
a Department Head at Beijing Jiaotong University, Beijing, China.
He has authored or co-authored more than 400 papers in refereed
journals and conferences and 40 books. He has been a PI for more
than 100 research projects and holds 16 patents. His current
research interests include big data analysis, operations research, and
artificial intelligence.
Dr. Zhang has been the General Chair or Co-Chair of more than ten
IEEE-sponsored international conferences.