Article
Coal and Gangue Detection Networks with Compact and
High-Performance Design
Xiangyu Cao 1, Huajie Liu 1, Yang Liu 1,2, Junheng Li 1 and Ke Xu 1,*
1 Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing, Beijing 100083, China; clovey_cxy@163.com (X.C.); lhjk2s@163.com (H.L.); leosea88@gmail.com (Y.L.); lijunheng0906@163.com (J.L.)
2 Hebei Puyang Iron & Steel Co., Ltd., East of Yangyi Town, Wu'an City 056305, China
* Correspondence: xuke@ustb.edu.cn
Abstract: The efficient separation of coal and gangue remains a critical challenge in modern coal
mining, directly impacting energy efficiency, environmental protection, and sustainable development.
Current machine vision-based sorting methods face significant challenges in dense scenes, where label
rewriting problems severely affect model performance, particularly when coal and gangue are closely
distributed in conveyor belt images. This paper introduces CGDet (Coal and Gangue Detection),
a novel compact convolutional neural network that addresses these challenges through two key
innovations. First, we proposed an Object Distribution Density Measurement (ODDM) method to
quantitatively analyze the distribution density of coal and gangue, enabling optimal selection of
input and feature map resolutions to mitigate label rewriting issues. Second, we developed a Relative
Resolution Object Scale Measurement (RROSM) method to assess object scales, guiding the design
of a streamlined feature fusion structure that eliminates redundant components while maintaining
detection accuracy. Experimental results demonstrate the effectiveness of our approach; CGDet
achieved superior performance with AP50 and AR50 scores of 96.7% and 99.2% respectively, while
reducing model parameters by 46.76%, computational cost by 47.94%, and inference time by 31.50%
compared to traditional models. These improvements make CGDet particularly suitable for real-time
coal and gangue sorting in underground mining environments, where computational resources are
limited but high accuracy is essential. Our work provides a new perspective on designing compact
yet high-performance object detection networks for dense scene applications.
Keywords: coal–gangue detection; object distribution density measurement (ODDM); relative resolution
object scale measurement (RROSM); label rewriting problem; compact neural network
1. Introduction
Coal, as a cornerstone of global economic development, serves dual roles as an es-
sential energy source and critical chemical raw material [1]. The imperative to address
environmental concerns while maintaining coal’s economic utility has led to the emergence
of green and intelligent mining technologies [2]. These technologies represent a significant
advancement in sustainable mining practices, integrating underground gangue separation
with sophisticated backfilling techniques to minimize environmental impact and prevent
mining-induced geological hazards [3]. Such integration not only reduces surface pollution
from coal preparation facilities but also effectively mitigates the risk of ground subsidence,
marking a substantial advancement in sustainable mining practices. A critical challenge
in implementing green mining technologies lies in the spatial constraints of underground
operations, which preclude the use of conventional surface processing equipment [4,5]. The
dimensional limitations and operational complexities of traditional surface equipment pose
significant barriers to underground deployment, necessitating innovative solutions for in
situ coal processing. While intelligent sorting robots offer a promising solution due to their
compact design [6], their effectiveness fundamentally depends on accurate machine vision
systems for real-time coal and gangue discrimination [7]. The development of efficient and
accurate machine vision methods is crucial for enabling these robots to perform swift and
precise gangue removal during raw coal transportation, thereby facilitating the transition
toward environmentally sustainable coal production practices.
Recent developments in convolutional neural networks have demonstrated remark-
able potential in object detection tasks [8–10], particularly in coal–gangue discrimination
applications [11]. Bounding boxes are used by object detection algorithms based on convo-
lutional neural networks to identify the category and location of objects in images [12–14].
The evolution of deep learning architectures has revolutionized machine vision capabilities,
enabling unprecedented accuracy in object detection and classification tasks. However, the
application of these technologies in underground mining environments presents unique
challenges that current solutions have yet to adequately address. Existing approaches
primarily fall into two categories: two-stage detectors and single-stage detectors. Two-
stage detectors, exemplified by Faster R-CNN [15], CG-RPN [16], and FCCN [17], achieve
impressive accuracy through sophisticated proposal generation mechanisms but suffer
from substantial computational overhead that impedes real-time processing capabilities [7].
These algorithms, while effective in controlled environments, face significant challenges in
meeting the speed requirements of online sorting applications. Conversely, single-stage
detectors like YOLO [18] offer enhanced processing efficiency but frequently compromise
on detection accuracy [19]. Various optimization attempts, including cascaded architectures
combining YOLOv3 with support vector machines [20] and implementations of deformable
convolutions [21], have been proposed to address these limitations. However, these ap-
proaches have not fully resolved the fundamental speed-accuracy trade-off, particularly in
dense scene detection scenarios.
The subsequent evolution of YOLOv3 variants, incorporating more sophisticated
feature extraction networks and optimized training methodologies, has yielded significant
improvements in coal and gangue perception accuracy [22–25]. These advanced models
demonstrate enhanced detection capabilities but are characterized by large parameter
spaces and substantial computational requirements, presenting significant challenges for
deployment on edge devices with limited resources. This limitation has catalyzed research
interest in lightweight object detection models [26,27]. Current model lightweighting ap-
proaches can be broadly categorized into two types [28]. The first is network architecture
design, which includes manual design [29–32] and AutoML design [33]. The second is
model compression, achieved through techniques such as network pruning [34], low-rank decomposition [35,36], low-bit quantization [37], and knowledge distillation [38]. Contem-
porary approaches to model compaction have primarily focused on network substitution
strategies, such as replacing heavyweight networks with lighter alternatives. Common
approaches include substituting DarkNet53 with ResNet18 [39] or MobileNetV3 [20] in YOLOv3 implementations, or replacing VGG16 with GhostNet [40] or MobileNetV1 [41] in SSD (Single Shot MultiBox Detector) architectures [42]. SSD is a single-stage object detection
algorithm optimized based on the VGG16 architecture. It leverages feature maps at multi-
ple levels for multi-scale detection, employing convolutional operations and predefined
anchor boxes. While these modifications effectively reduce model parameters and compu-
tational demands, they often result in compromised detection performance, particularly
in challenging dense-scene scenarios. Attention mechanisms are often added to improve the performance of lightweight models, but they do not help solve the label rewriting problem. Using a larger input image or feature map resolution for
detection can alleviate the label rewriting problem, but blindly increasing the resolution of
the input image and feature map will increase the computational complexity and seriously
weaken the inference speed of the lightweight model [40,43]. In compact convolutional
neural networks designed for recognizing coal and gangue, many high-performing models
have emerged [44–48]. However, these models often overlook the dense distribution of
coal and gangue, as well as the relatively small proportion of pixels occupied by coal and
gangue in the images. Therefore, when compacting convolutional neural networks, an important research question is how to keep paring the network down while maintaining high performance and speed in dense scenarios and avoiding label rewriting.
Recent advances in multi-scale feature learning and multi-view analysis have pro-
vided valuable insights for dense object detection. Wang et al. proposed a progressive
learning strategy with multi-scale attention network, demonstrating the importance of scale-
adaptive feature extraction [49]. Similarly, Wang et al. introduced a bi-consistency guided
approach for incomplete multi-view clustering, highlighting the significance of consistent
feature representation [50]. Furthermore, Wang et al. developed a graph-collaborated
auto-encoder framework for multi-view clustering, offering novel perspectives on feature
fusion [51]. Building upon these works, our CGDet advances the field by introducing
ODDM and RROSM methods specifically designed for dense coal–gangue detection, while
maintaining computational efficiency through optimized feature fusion strategies.
The current state of research faces three critical challenges that previous studies have
failed to adequately address: (1) Performance Degradation in Dense Scenes: existing
lightweight models struggle to maintain detection accuracy when confronted with densely
distributed objects, a common scenario in coal–gangue sorting applications. (2) Computa-
tional Overhead: the integration of attention mechanisms and other performance-enhancing
features often introduces significant computational burden, contradicting the primary goal
of achieving a lightweight model. (3) Label Rewriting Issues: the problem of label rewriting
becomes particularly acute in high-density scenes, where multiple objects compete for
detection resources within limited spatial regions.
To address these fundamental challenges, this paper introduces CGDet, a novel
lightweight convolutional neural network specifically designed for dense coal–gangue
detection. Our approach introduces two innovative methodologies: (1) Object Distribution
Density Measurement (ODDM): A systematic approach for analyzing and optimizing object
detection in dense distributions. This methodology enables precise calibration of input
image and feature map resolutions while maintaining high performance and computational
efficiency. (2) Relative Resolution Object Scale Measurement (RROSM): A novel technique
for characterizing object scale variations and optimizing feature fusion structures. This
approach facilitates the development of efficient multi-scale detection strategies while
minimizing computational requirements.
The primary objective of this research was to develop a high-performance, computa-
tionally efficient object detection system capable of accurate coal–gangue discrimination in
dense underground mining environments. Specifically, we aimed to: (1) design a lightweight
model architecture that maintains high detection accuracy in dense scenes while minimiz-
ing computational requirements. (2) Develop novel methodologies for optimizing input
resolution and feature fusion based on object distribution characteristics. (3) Demonstrate
the effectiveness of our approach through comprehensive experimental validation.
The primary innovations and contributions of this research are synthesized into three
interconnected aspects: (1) The Object Distribution Density Measurement (ODDM) method-
ology is proposed, enabling optimal resolution selection, circumventing label rewriting
issues, and providing a systematic analytical framework for object distribution patterns,
with both theoretical foundations and practical guidance established for parameter op-
timization in dense detection scenarios. (2) The Relative Resolution Object Scale Mea-
surement (RROSM) technique is introduced, facilitating the optimization of feature fusion
design through precise quantification of object scale variations, with model complexity
significantly reduced and detection accuracy maintained. (3) Based on these innovations,
the CGDet architecture is constructed, integrating the advantages of ODDM and RROSM
within a unified framework.
Optimal performance is achieved in dense coal–gangue detection tasks, while compu-
tational efficiency is preserved for edge device deployment, offering a viable solution for
practical engineering applications. Together, these innovations form a cohesive technical
system, establishing a new paradigm for lightweight object detection in dense environments.
2. Materials and Methods
2.1. YOLOX Object Detector
YOLOX [52] is one of the state-of-the-art single-stage object detectors known for its
fast detection speed and high accuracy. It has been widely utilized in object detection tasks
and it is advantageous for real-time and high-precision perception of coal and gangue in
dense scenes. The YOLOX detector still follows the YOLO detection paradigm, which
involves gridding the image, and if an object’s center is within a grid cell, that grid cell
is responsible for detecting the object. The structure of the YOLOX-s model is shown in
Figure 1. As shown in Figure 1, the YOLOX-s model encompasses three main parts: the
backbone is used to extract features from the image, the neck is used to fuse feature maps
at different scales, and the heads use the feature maps generated by the neck for detection.
The backbone of the YOLOX-s model primarily consists of the Focus, CBS, CSP1_X, and
SPP (Spatial Pyramid Pooling) modules. Through the Focus layer, the image is sliced,
resulting in a reduction of its resolution. The CBS module amalgamates 2D convolution,
batch normalization, and SiLU activation functions, encompassing an effective combination.
The bottleneck layer incorporates CBS modules and assumes the responsibility of extracting
features. The CSP1_X module serves as a feature extraction unit, where X denotes the
count of bottleneck layers. For instance, CSP1_3 comprises three bottleneck layers, while
CSP2_1 contains one CBS module. The PAN [53] (Path Aggregation Network) structure is an extension of the FPN [54] (Feature Pyramid Network) and is integrated into the neck
of the YOLOX-s model. Feature fusion is achieved through CSP2_X, with X implying
the number of CBS modules utilized. The heads of the YOLOX-s model encompass three
decoupled heads, namely head 1, head 2, and head 3. These heads are responsible for
generating the detection outputs.
Figure 1. Structure of the YOLOX-s model.
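To make the Focus and CBS descriptions above concrete, the following is a minimal PyTorch sketch of these two building blocks. It is an illustrative reimplementation rather than the authors' code; class names, channel counts, and kernel sizes are our own choices.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv2d + BatchNorm + SiLU, as described for the CBS module."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice the image into four interleaved sub-images, concatenate them
    along the channel axis, then apply a CBS; this halves the resolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = CBS(4 * c_in, c_out, k, 1)

    def forward(self, x):
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))
```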
2.2. Definition of the Label Rewriting Problem
When a large number of objects are densely distributed in an image, the label rewriting
problem occurs if the centers of two actual ground truth bounding boxes are located within
the same image grid cell. As shown in Figure 2, the yellow color represents labels that
have the label rewriting problem, while the blue color represents normal labels. The label
rewriting problem can lead to a decrease in the detection performance of the CGDet model.
Consider a set B = {b1, . . ., bt} consisting of the center coordinates of the ground truth bounding boxes in the image, where t represents the total number of ground truth bounding boxes within the image. The label will undergo rewriting when the following condition is fulfilled:
\[
\exists\,(b_i, b_j \in B):\quad b_i^x \,\%\, w - b_j^x \,\%\, w = 0,\qquad b_i^y \,\%\, h - b_j^y \,\%\, h = 0 \tag{1}
\]
where b_i represents the center coordinate of the i-th ground truth bounding box, b_j represents the center coordinate of the j-th ground truth bounding box, b_i^x and b_i^y denote the x and y coordinates of the center of the i-th ground truth bounding box, b_j^x and b_j^y denote the x and y coordinates of the center of the j-th ground truth bounding box, w is the number of columns in the grid, and h is the number of rows in the grid.
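As a concrete check of the condition in Equation (1), the sketch below counts how many ground-truth boxes would have their labels rewritten on a given detection grid. It expresses the same-cell test through integer division by the grid stride rather than the modulo notation above; the function and argument names are illustrative, not taken from the paper.

```python
import numpy as np

def count_label_rewrites(centers, stride):
    """Count ground-truth boxes whose centers share a grid cell.

    centers: (t, 2) array of box centers (x, y) in input-image pixels.
    stride: grid cell size, i.e. the downsampling stride of the feature
    map used for detection (e.g. 8 for P3).
    """
    cells = np.stack([centers[:, 0] // stride,
                      centers[:, 1] // stride], axis=1).astype(int)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    # every box beyond the first in a cell loses its label to rewriting
    return int((counts - 1).sum())
```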
Figure 2. Illustration of CGDet model meshing and label rewriting.
Label rewriting introduces several negative impacts: (1) detection performance degra-
dation, as some objects may be missed; (2) reduced localization accuracy due to competition
among multiple objects within the same grid cell; (3) lower classification accuracy caused
by feature interference between adjacent objects; and (4) unstable training performance, as
label reassignment affects loss function computation. Therefore, it is essential to mitigate
these impacts.
2.3. Measure the Distribution Density of Objects in Images
To mitigate the adverse effects of label rewriting on CGDet’s performance, the problem
of label rewriting is transformed into a problem of measuring object distribution density.
By selecting images with low object distribution density and low resolution for detec-
tion, label rewriting can be prevented. Using the ODDM method [55] to calculate the
object distribution density in images allows the detector to determine high-performance,
low-computation input image resolution, and feature map resolution, thereby improving
detector performance while reducing computational load. The calculation formula for
ODDM is shown in Equation (2).
\[
\beta = \frac{1}{n}\sum_{i=1}^{n}\frac{img_i^g - img_i^d}{img_i^g} \tag{2}
\]
where n denotes the number of images, img_i^g denotes the number of objects in the i-th image, img_i^d represents the number of grid cells that contain objects on the feature map of the i-th image, and β represents the density level. A larger β value leads to poorer detection performance of the model.
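A minimal sketch of Equation (2) is given below, interpreting img_i^d as the number of grid cells of the chosen feature map that contain at least one object center. The function signature and argument names are our own, assumed for illustration.

```python
import numpy as np

def oddm_beta(centers_per_image, stride, grid_w, grid_h):
    """Estimate the ODDM density beta of Eq. (2) over a set of images.

    centers_per_image: list with one (m_i, 2) array of ground-truth box
    centers (x, y) per image, in input-image pixel coordinates.
    stride: downsampling factor of the feature map (8, 16, 32 for P3, P4, P5).
    grid_w, grid_h: number of grid columns and rows of that feature map.
    """
    ratios = []
    for centers in centers_per_image:
        g = len(centers)                                  # img_i^g
        if g == 0:
            continue                                      # skip empty images
        cols = np.clip(centers[:, 0] // stride, 0, grid_w - 1).astype(int)
        rows = np.clip(centers[:, 1] // stride, 0, grid_h - 1).astype(int)
        d = len(set(zip(cols.tolist(), rows.tolist())))   # img_i^d: occupied cells
        ratios.append((g - d) / g)
    return float(np.mean(ratios))
```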
2.4. Measure the Scale of an Object in an Image
To optimize the neck structure based on the object scale within the image, the scale
of the objects has to be measured. The COCO dataset [56] employs an absolute image resolution to measure object scale, which, unfortunately, leads to inaccurate measurements when dealing with images of varying resolutions. Therefore, RROSM [55] was introduced in this study to accurately capture object scales, which aligns with the scale classification criteria utilized in the COCO dataset. RROSM is calculated as follows.
\[
S = [s_1, s_2, \cdots, s_n],\qquad
X = \left[\frac{1}{x_1}, \frac{1}{x_2}, \cdots, \frac{1}{x_n}\right],\qquad
G = \left(g_{ij}\right)_{n\times m} \tag{3}
\]
\[
\alpha = X \cdot S \cdot G \tag{4}
\]
For n images, the multiplication of the width and height of the input resolution of each image is computed to obtain the vector S. At the original resolution of each image, the reciprocal of the multiplication of width and height produces the vector X. The matrix G is composed of the areas of the actual bounding boxes for objects in all images, where i ∈ {1, 2, . . ., n} and j ∈ {1, 2, . . ., m}. The maximum value of object quantity in the images is represented by m. Each element g_ij symbolizes the multiplication of the width and height of the actual bounding box of a specific object, and a_ij denotes the corresponding relative-resolution area obtained from Equation (4). If 0 < a_ij ≤ 32², the object is classified as a small object. If 32² < a_ij ≤ 96², it is classified as a medium object, and if a_ij > 96², it is classified as a large object. When the majority of objects are small, using only the P3 feature map for detection can yield good results.
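The sketch below applies Equations (3) and (4) element-wise to count small, medium, and large objects at a chosen input resolution: each ground-truth box area is rescaled by the ratio of the input resolution to the original image resolution and then compared against the COCO-style 32² and 96² thresholds. The names and the exact data layout are assumptions made for illustration.

```python
import numpy as np

def rrosm_scale_counts(input_wh, orig_whs, box_whs):
    """Count small/medium/large objects at a given input resolution.

    input_wh: (W, H) input resolution used for training/detection.
    orig_whs: list of (w, h) original resolutions, one per image.
    box_whs: list, one per image, of (m_i, 2) arrays with ground-truth box
    widths and heights measured at the original resolution.
    """
    counts = {"small": 0, "medium": 0, "large": 0}
    s = input_wh[0] * input_wh[1]                  # element of S
    for (w, h), boxes in zip(orig_whs, box_whs):
        x = 1.0 / (w * h)                          # element of X
        areas = s * x * boxes[:, 0] * boxes[:, 1]  # relative-resolution areas a_ij
        counts["small"] += int(np.sum(areas <= 32 ** 2))
        counts["medium"] += int(np.sum((areas > 32 ** 2) & (areas <= 96 ** 2)))
        counts["large"] += int(np.sum(areas > 96 ** 2))
    return counts
```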
2.5. The Structure of the CGDet Model
Through theoretical analysis, although YOLOX-s exhibits relatively small parameters
and computational costs, its structure still presents optimization potential. Based on
feature representation theory in deep learning, model performance demonstrates significant
correlation with feature map resolution and object distribution density. Consequently, we
propose the density theory-based ODDM method to calculate β values, quantitatively evaluating the impact of object distribution on feature extraction. When β values decrease
significantly, indicating optimal object distribution density, the corresponding resolution
is selected for training and detection, effectively mitigating label rewriting issues and
enhancing feature learning quality.
Guided by multi-scale feature representation theory, we employ the RROSM method
to analyze object scale distribution characteristics. Experimental evidence indicates that in
the absence of large-scale objects, high-level feature maps (such as P5) contribute minimally
to feature representation, justifying the removal of corresponding detection heads. When
datasets predominantly contain medium and small-scale objects, according to feature pyra-
mid theory, utilizing only the high-resolution P3 feature map suffices for comprehensive
feature representation capability. Based on these theoretical analyses, we propose the
CGDet model through objective-oriented reconstruction of YOLOX-s.
CGDet maintains the original backbone network structure due to its demonstrated
efficacy in feature extraction. Quantitative analysis through RROSM reveals relatively
concentrated object scale distribution in our application scenario, negating the necessity
for complex feature fusion mechanisms. Therefore, based on information entropy theory,
we eliminate the computationally redundant PAN structure, adopting a more concise FPN
for feature fusion. ODDM density analysis demonstrates optimal feature representation
achievable on the P3 feature map, justifying the retention of only head3 for object detection—
a design that ensures detection performance while significantly reducing computational
complexity. The backbone network’s CSP1_1, CSP1_3, and CSP2_1 layers constitute a
multi-level feature extraction structure, with outputs {C2, C3, C4, C5} and downsampling
strides {4, 8, 16, 32} forming progressive feature abstraction levels.
As illustrated in Figure 3, regarding feature fusion, the neck network employs an en-
hanced FPN structure. Following feature pyramid theory, high-level features (C5) processed
through the CBS layer generate semantically rich P5 feature maps. Through upsampling
and feature concatenation operations, P5 merges with C4 through CSP2_1 processing to
generate P4, achieving effective integration of high and low-level features. Similarly, P4
fuses with C3 to generate P3, establishing a progressive multi-scale feature fusion mechanism. The final feature maps {P3, P4, P5} maintain spatial consistency with {C3, C4, C5}.
To further optimize computational efficiency, based on depthwise separable convolution
theory, we replace the second CBS in the CSP2_X bottleneck layer with depthwise separable
convolution, constructing the FPND structure [32]. This enhancement significantly reduces
parameter count and computational complexity while maintaining feature representation
capability. The notation X = 1 or X = 3 indicates different levels of feature extraction depth,
enabling flexible feature optimization at various levels.
Figure 3. Structure of the CGDet model.
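As a sketch of the FPND modification described above, the block below replaces a 3 × 3 CBS with a depthwise separable equivalent (a depthwise 3 × 3 convolution followed by a pointwise 1 × 1 convolution, each with batch normalization and SiLU). The exact ordering of normalization and activation in the authors' FPND may differ; this is an illustrative PyTorch module.

```python
import torch.nn as nn

class DWSeparableCBS(nn.Module):
    """Depthwise separable replacement for a 3x3 CBS block: depthwise 3x3
    convolution followed by pointwise 1x1 convolution, each with
    BatchNorm and SiLU (an assumed layout)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, 1, 1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.SiLU(),
            nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.block(x)
```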
This architectural design, underpinned by solid theoretical foundations, demonstrates
the following advantages: (1) optimal resolution selection based on density distribution theory;
(2) scale-adaptive feature extraction guided by multi-scale representation theory; (3) efficient
feature fusion mechanism supported by information entropy theory; and (4) computational
optimization through advanced convolution theories. The integration of these theoretical
foundations with practical architectural innovations results in a model that achieves both
computational efficiency and detection accuracy in dense object detection scenarios.
3. Experiment
3.1. Experimental Environment Settings and Dataset
The experimental hardware resources include an AMD Ryzen 5 3600 CPU and NVIDIA
RTX 2060 graphics card. The experiments were conducted on a system running Ubuntu
22.04 LTS with PyTorch 1.9.1, CUDA 10.2, and Python 3.8. In the experiments, the input
resolution of the images was set to 512 × 704. The batch size was set to eight, and the initial
learning rate was set to 0.003125. The YOLOXWarmCos method was used to update the
learning rate during training. The default training duration was set to 300 epochs.
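For reference, the snippet below sketches a warm-up plus cosine learning-rate schedule in the spirit of YOLOXWarmCos, using the initial learning rate stated above. The warm-up shape, minimum-rate ratio, and function name are illustrative assumptions rather than details taken from the YOLOX implementation.

```python
import math

def warm_cos_lr(step, total_steps, warmup_steps,
                base_lr=0.003125, min_lr_ratio=0.05):
    """Warm-up followed by cosine decay, approximating YOLOXWarmCos."""
    if step < warmup_steps:
        # quadratic warm-up from 0 to base_lr (assumed shape)
        return base_lr * (step / max(1, warmup_steps)) ** 2
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = base_lr * min_lr_ratio
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```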
In the experiment, anthracite and claystone gangue were utilized as the experiment
materials. The dataset employed in the experiment is displayed in Figure 4. Image ac-
quisition was carried out using KinectV2 under various lighting conditions, resulting in
significant variations in brightness between different images. The contrast between the
coal and gangue in the images is relatively low, accompanied by minor differences in
surface textures. Moreover, there is a substantial disparity in the distribution of coal and
gangue within the images. Whether the objects in an image are densely distributed is
determined by calculating the distribution density β using Equation (2) in Section 2.3. A dense distribution is defined as β > 1.5 × 10⁻⁴, and a sparse distribution as β < 1.5 × 10⁻⁴.
It was observed that 55% of the images in the dataset exhibited dense distributions of coal
and gangue, while 45% exhibited sparse distributions. Via random sampling, 400 images
from 608 images were taken as the training set, 100 images were taken as the validation set
and finally, the remaining 108 images were taken as the test set. All these images were of
1470 × 1080 pixels resolution.
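The density split and the random train/validation/test split described above can be reproduced with a few lines; the threshold comes from the text, while the random seed and the use of Python's random module are illustrative choices.

```python
import random

DENSE_THRESHOLD = 1.5e-4  # beta threshold stated in the text

def label_density(beta):
    """Tag an image as densely or sparsely distributed from its ODDM beta."""
    return "dense" if beta > DENSE_THRESHOLD else "sparse"

def split_dataset(image_paths, seed=42):
    """Random 400/100/108 train/val/test split of the 608 images."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    return paths[:400], paths[400:500], paths[500:608]
```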
Figure 4. Images of coal and gangue in the dataset.
The evaluation of the proposed model encompasses several metrics, including the
parameter count, GFLOPs (Giga Floating-point Operations), AP (Average Precision), and
AR (Average Recall). AP and AR are computed using the COCO API [52]. Specifically,
AP50 and AR50 denote the AP and AR values, respectively, corresponding to an Inter-
section over Union (IOU) threshold of 0.5. Higher values of AP and AR signify superior
model performance.
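AP and AR can be computed with the COCO API as sketched below. The file paths are placeholders; AP50 is the second entry of the summary statistics, while AR at a single IoU threshold of 0.5 has to be read from the accumulated recall array rather than the printed summary.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def coco_ap50(gt_json, det_json):
    """Evaluate bounding-box detections with the COCO API.

    gt_json: ground-truth annotations in COCO format (placeholder path).
    det_json: detection results in COCO results format (placeholder path)."""
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()           # prints AP@[.5:.95], AP50, AR@100, ...
    ap50 = ev.stats[1]       # stats[1] is AP at IoU = 0.50
    # AR at IoU = 0.50 is not in ev.stats; it can be read from the first
    # IoU-threshold slice of ev.eval["recall"] after accumulate().
    return ap50
```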
4. Results, Discussion, and Analysis
4.1. Ablation Experiments with Different Components
The AP50 and AR50 in this chapter are the results obtained by the model on the test
set. The performance comparison of the model on the test set is shown in Table 1.
Table 1. Ablation experiments with different components.

Model     FPN   FPND   Head   AP50 (%)   AR50 (%)   mAP50 (%)   Parameters (M)   GFLOPs   Inference Time (ms)
YOLOX-s   -     -      3      93.8       99.5       69.6        8.94             23.55    19.87
A         ✓     -      3      96.5       99.0       98.0        6.72             21.04    16.11
B         -     ✓      3      96.7       99.3       97.9        5.00             12.66    15.57
CGDet     -     ✓      1      96.7       99.2       98.3        4.76             12.26    13.61

(In Table 1, the ✓ symbol indicates that the corresponding module is used or integrated within the model; the Head column gives the number of detection heads.)
As shown in Table 1, YOLOX-s had the lowest AP50 among the trained models, along with the most parameters, computational workload, and inference time. Model A was derived by substituting
the original PAN (Path Aggregation Network) structure in YOLOX-s with FPN (Feature
Pyramid Network), while Model B was developed through the integration of depthwise
separable convolution into the network architecture. Quantitative analysis demonstrates
that Model A achieved significant performance improvements over the baseline YOLOX-s
architecture: a 28.4% enhancement in mAP50 (mean Average Precision at IoU threshold
0.5), while concurrently reducing parameter count by 24.83%, computational complexity by
10.66%, and inference latency by 3.76 ms. Model B achieved a 44.07% reduction in parame-
ters and computation, along with a 4.3 ms faster inference time. Compared to the YOLOX-s
baseline, the proposed CGDet demonstrated substantial improvements across multiple
performance metrics by achieving a 28.7% increase in mAP50, while significantly reduc-
ing model parameters and computational complexity by 46.76% and 47.94%, respectively.
Furthermore, the model exhibited a 2.9% enhancement in AP50 while decreasing inference
latency to 13.61 ms. Using a single detection head on the P3 feature map further reduced
parameters, computation, and inference time while maintaining the model’s high-precision
detection capability.
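The reduction percentages quoted above can be verified directly from the Table 1 values; the short script below reproduces them.

```python
# Quick check of the reduction percentages quoted for CGDet versus YOLOX-s,
# computed directly from the Table 1 values.
yolox_s = {"parameters_M": 8.94, "gflops": 23.55, "inference_ms": 19.87}
cgdet   = {"parameters_M": 4.76, "gflops": 12.26, "inference_ms": 13.61}

for key in yolox_s:
    reduction = (1 - cgdet[key] / yolox_s[key]) * 100
    print(f"{key}: {reduction:.2f}% reduction")
# parameters_M: 46.76%, gflops: 47.94%, inference_ms: 31.50%
```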
4.2. Using ODDM to Measure the Distribution Density of Objects in Images
To investigate the distribution density of coal and gangue in different resolution
feature maps, while maintaining the aspect ratio of the images, the object distribution
density in different resolution feature maps was calculated based on Equation (2). The
results are shown in Figure 5. The feature maps, denoted as P3, P4, and P5, were obtained
by downsampling the input image by 8, 16, and 32 times, respectively. In Figure 5, the
height of the image is represented by the vertical axis, which varies from 32 × 224 to 896 × 1088 in resolution. Comparing the feature maps at the same input resolution, the
object distribution density was highest in P5 due to its lower resolution, while P3 had the
lowest distribution density because of its higher resolution. The object distribution density
in P4 fell between that of P3 and P5 since its resolution lay between the two.
Figure 5. Distribution density of objects in different input resolution images in different resolution
feature maps.
As shown in Figure 5, when the input image resolution rose, the density of object distribution in the P3, P4, and P5 feature maps steadily decreased. For the P3 feature map, the density gradually decreased beyond an input resolution of 256 × 448. Once the input resolution surpassed 416 × 608, the distribution density of objects decreased at a stable rate. In the zoomed-in section of Figure 5, the object distribution density in P3 at an input resolution of 384 × 576 was 1.64 × 10⁻⁴, which was an order of magnitude higher than the distribution density of 7.81 × 10⁻⁵ at an image resolution of 416 × 608. For input resolutions of 416 × 608, 448 × 640, 480 × 672, and 512 × 704, the object distribution density in P3 remained constant. Within the range of 416 × 608 to 512 × 704, the object distribution density was lower in the P3 feature map than in the P4 feature map. The distribution density in the P4 feature map decreased slowly for input image resolutions higher than 640 × 832. P5 consistently showed a decreasing density as the input resolution increased from 32 × 224 to 896 × 1088. To counter the impact of densely distributed objects, the CGDet model selects the P3 feature map for detection based on the density results in Figure 5. CGDet uses an image resolution of 512 × 704 for training and detection because the object distribution density is low at this resolution.
4.3. Using RROSM to Measure the Scale of Objects in Images
To enhance the accuracy of measuring the scale of coal and gangue in images, RROSM
was employed specifically for the training set. The outcomes obtained from this measure-
ment approach are illustrated in Figure 6, which presents the results derived from the
utilization of RROSM. The vertical axis represents the quantities of small, medium, and
large objects in the dataset, while the horizontal axis represents the input image resolutions.
The image resolution ranges from 32 × 224 to 896 × 1088, and the aspect ratio of the images was maintained during the measurement. Figure 6 depicts the relationship between image resolution and object sizes in the dataset. As resolution increases, small objects
decrease while medium objects increase. Small objects transform into medium objects as
their resolution on the image grows. Before the 416 × 608 resolution, small objects dominate (over 50%), but afterward, medium objects become more prominent. When resolution exceeds 800 × 992, medium objects start transforming into large objects, resulting in a
decrease in the number of medium objects and an increase in large objects.
Figure 6. The scale of objects in the training set.
As shown in Figure 6, the y-axis represents the distribution density of objects. Figure 6 depicts the correlation between the input image resolution and the density of object distribution in the P3 feature map. As resolution increases, the distribution density decreases, along with the number of small objects. When the resolution is below 256 × 448, the density is higher, and the dataset is mostly composed of small objects. Model performance is limited by both object distribution density and the presence of small objects at resolutions below 256 × 448.
The distribution density of objects in the P3 feature map decreases as image resolution
goes from 256 × 448 to 416 × 608, but small objects still dominate the dataset. In the 416 × 608 to 736 × 928 resolution range, the lowest object distribution density is found in the P3 feature map. At these resolutions, the model's performance is less affected by object distribution density, reducing the impact of small objects on perception. Resolutions above 736 × 928 have almost no small objects and negligible object distribution density. While higher resolutions improve model performance, training and inference at such resolutions are costly. CGDet uses a 512 × 704 input resolution for detection. At this resolution, coal and gangue, measured in terms of relative object scale, are mainly composed of medium and small objects. While deep features are helpful for large object recognition, they are not very helpful for small and medium objects [53]. Therefore, CGDet chooses to use
FPN for feature fusion, discarding the path enhancement part in PAN.
To provide additional evidence of the benefits associated with the relative object scale
measurement method, Table 2 provides pertinent data regarding the model's performance
and allows for a comparison of the performance between the PAN, FPN, and FPND models.
Table 2. Performance comparison of different neck structures.

Neck   AP50 (%)   AR50 (%)   mAP50 (%)   mAR50 (%)   Parameters (M)   GFLOPs   Inference Time (ms)
PAN    96.2       98.9       98.0        99.6        8.94             23.55    19.87
FPN    96.5       99.0       98.0        99.6        6.72             21.04    16.11
FPND   96.7       99.3       97.9        99.6        5.00             12.66    15.57
When the input image resolution is 512 × 704, removing the path enhancement module in PAN, which propagates shallow features to deeper layers, results in no change in the model's mAP50. This suggests that the path enhancement component in PAN is redundant. Eliminating this redundant module improves AP50 by 0.3% and AR50 by 0.1%, reduces the number of parameters by 25%, decreases computational cost by 11%, and shortens inference time by 19%. FPND employs depthwise separable convolutions within the FPN, achieving
reductions in parameter count, computational cost, and inference time at the expense
of a slight 0.1% decrease in mAP50. Compared to PAN, FPND reduces the number of
parameters by 44%, decreases computational cost by 46%, and improves inference speed by
22%. These results further demonstrate that accurately measuring object scale in an image
based on relative resolution provides valuable guidance for model architecture design.
4.4. Elimination of Redundant Detection Heads via ODDM
To illustrate the negative impact of object distribution density on model perception
performance, detection experiments were conducted using feature maps with varying
object distribution densities, as shown in the experimental results in Table 3. The P5 feature
map, which had the highest object distribution density, impeded the model’s perceptual
capability, resulting in AP50, AR50, mAP
50
, and mAR
50
values all below 90%. In contrast,
the P4 feature map, with higher resolution and lower object distribution density, enhanced
the model’s perception, increasing AP50 and AR50 by 10.7% and 8.2%, respectively, com-
pared to P5. Furthermore, the model's mAP50 and mAR50 improved by 10.7% and 20%,
respectively, when using the P4 feature map compared to P5. Increasing the resolution
further, the P3 feature map, which had the lowest object distribution density, provided an
additional boost to the model’s AP50 and AR50 by 2.9% and 3.3%, respectively, compared
to P4. The model's mAP50 and mAR50 also increased by 3.5% and 2.8%, respectively,
relative to using the P4 feature map. However, using the P3 feature map for detection
required additional convolutional layers to fuse the feature maps, leading to increased
model parameters and computational complexity. As the resolution of the P5, P4, and
P3 feature maps gradually increased, the object distribution density within the feature
maps progressively decreased, and the model's AP50, AR50, mAP50, and mAR50 steadily
improved. Nevertheless, there was a diminishing marginal effect between the improvement
in model perception performance and the increase in feature map resolution.
Table 3. The impact of feature maps with different levels of density on model performance.

Feature Map   AP50 (%)   AR50 (%)   mAP50 (%)   mAR50 (%)   Parameters (M)   GFLOPs
P5            83.1       87.7       85.1        77.0        4.35             9.80
P4            93.8       95.9       95.8        97.0        4.68             10.76
P3 (CGDet)    96.7       99.2       98.3        99.8        4.76             12.26
While increasing the input resolution of images or enhancing the resolution of feature
maps used for detection can reduce the density of object distribution, the impact of increas-
ing input image resolution and feature map resolution on improving model perceptual
performance gradually diminishes. Since CGDet uses the P3 feature map for detection,
which has a resolution eight times lower than that of the input image, we conducted
experiments to investigate the relationship between input image resolution and model
performance. During model training and testing, experiments were conducted with images
of different resolutions ranging from 32 × 224 to 896 × 1088, and the results are shown in Figure 7. In Figure 7, as the image resolution increased, the model's mAP50, mAR50, and computational load gradually increased. When the image resolution was below 224 × 416, increasing the image resolution significantly improved the model's mAP50 and mAR50. When the image resolution ranged from 256 × 448 to 512 × 704, the contribution of increasing image resolution to improving the model's mAP50 and mAR50 gradually decreased. Once the image resolution exceeded 512 × 704, the model's perceptual performance stabilized; further increasing the image resolution hardly improved the model's performance. Figure 7 reveals a noticeable diminishing return on model performance with increasing
image resolution. While higher resolution input images are advantageous in reducing
object density and enhancing the resolution of small objects, excessively increasing the
input image resolution is counterproductive, leading to a significant increase in redun-
dant computational load. Estimating the density of object distribution in images through
methods like object density estimation allows for the determination of high-performance,
low-computation image resolutions, thereby reducing the computational burden while
maintaining high model performance.
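A resolution sweep like the one behind Figure 7 can be scripted roughly as follows. The evaluation function is a hypothetical placeholder standing in for a full train/test cycle, and the candidate resolutions and stopping threshold are illustrative values chosen within the range reported above.

```python
# Sketch of selecting an input resolution at the knee of the accuracy/cost curve.
# evaluate_at_resolution(res) is a hypothetical stand-in for training and testing
# the detector at a given input size; it should return (mAP50, GFLOPs).

CANDIDATE_RESOLUTIONS = [(224, 416), (256, 448), (320, 512), (384, 576),
                         (448, 640), (512, 704), (640, 832), (896, 1088)]

def pick_resolution(evaluate_at_resolution, min_gain=0.2):
    """Stop enlarging the input once the mAP50 gain per step drops below `min_gain` points."""
    history = []
    for res in CANDIDATE_RESOLUTIONS:
        map50, gflops = evaluate_at_resolution(res)
        history.append((res, map50, gflops))
        if len(history) >= 2 and (history[-1][1] - history[-2][1]) < min_gain:
            # Diminishing returns: keep the previous, cheaper resolution.
            return history[-2]
    return history[-1]
```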
Figure 7. mAP50, mAR50, and GFLOPs obtained for images with different input resolutions.
4.5. Visualization and Analysis of Results
The AP50 and AR50 of CGDet quantitatively represent the performance of the detector, but they do not show what the detections actually look like. Therefore, CGDet was used to detect images in the test set, and the results are visualized in Figure 8.
Figure 8. Visualization of CGDet’s detection results on the test set. (a) Predicted Bounding Boxes
for Gangue (Blue) and Coal (Yellow); (b) Redundant Predictions with the Same Class Label (Coal);
(c) Redundant Predictions with Different Class Labels (Coal and Gangue).
In Figure 8a, the blue predicted bounding boxes represent gangue, while the yellow
predicted bounding boxes represent coal. Most of the predicted boxes cover the correspond-
ing objects in the image, but there are also a few instances where the object is redundantly
predicted by two bounding boxes. There are two cases of redundant predictions. In one
case, the two boxes with redundant predictions have the same class. As shown in Figure 8b,
a piece of coal in the image is simultaneously predicted by two bounding boxes with the
class label ‘coal’. In the other case, the two boxes with redundant predictions have different
classes. As shown in Figure 8c, the same piece of gangue in the image is predicted as both
‘coal’ and ‘gangue’ by two different bounding boxes. Because coal and gangue occupy only a small area in the images, the discernible surface information available for each object is limited. This makes the two classes harder to distinguish and leads to redundant predictions.
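One common remedy for such duplicate boxes, not part of CGDet itself, is class-agnostic non-maximum suppression applied after inference, so that overlapping ‘coal’ and ‘gangue’ predictions on the same object compete and only the higher-scoring one survives. The sketch below uses torchvision's NMS operator; the score and IoU thresholds are illustrative values.

```python
import torch
from torchvision.ops import nms

def class_agnostic_nms(boxes, scores, labels, iou_thr=0.5, score_thr=0.3):
    """Suppress overlapping predictions regardless of their class label.

    boxes:  (N, 4) float tensor in (x1, y1, x2, y2) format
    scores: (N,)  confidence scores
    labels: (N,)  integer class ids (e.g., 0 = coal, 1 = gangue)
    """
    keep = scores >= score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    # Ignoring the labels here means a 'coal' box and a 'gangue' box on the
    # same object compete with each other, and only one is retained.
    kept = nms(boxes, scores, iou_thr)
    return boxes[kept], scores[kept], labels[kept]
```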
4.6. Comparison of the Performance of Different Detectors for Detecting Coal and Gangue
To demonstrate the advantage of CGDet in perceiving coal and gangue in low-
resolution dense scenes, comparative experiments were conducted using MMDetection3.
The Faster R-CNN, YOLOF, and AutoAssign detectors were utilized for the experiments,
all of which employed ResNet50 as the backbone and FPN structure. The batch size in
the experiment was eight, and the input resolution of the images was set to 512 × 704, using the same dataset and evaluation metrics as CGDet. The results of the comparison experiment are shown in Table 4. Faster R-CNN is a strong two-stage object detector, yet it showed no clear performance advantage in detecting coal and gangue. YOLOF, like CGDet, performs detection on a single feature map, but its AP50 and AR50 were 0.3% and 2.1% lower than those of CGDet, respectively. AutoAssign employs a dynamic label assignment strategy, but in this experiment its AP50 and AR50 were 5.7% and 2.4% lower than those of CGDet, respectively. While YOLOV8n has fewer parameters and lower computational requirements than CGDet, its performance lagged behind that of CGDet. YOLOV8s, despite having substantially more parameters and higher computational demands than CGDet, did not exhibit superior performance either. In this comparison with the Faster R-CNN, YOLOF, AutoAssign, and YOLOV8 detectors, CGDet demonstrated a clear advantage, achieving AP50 and AR50 values of 96.7% and 99.2%, respectively, while requiring an order of magnitude fewer parameters than the ResNet50-based detectors and a significantly lower computational cost.
Table 4. Performance comparison of different detectors.
Model AP50 (%) AR50 (%) Parameters (M) GFLOPs
Faster R-CNN 96.4 97.1 41.35 81.66
YOLOF 96.2 98.4 42.36 34.49
AutoAssign 91.0 96.8 36.25 69.54
YOLOV8n 95.6 99.7 3.2 8.7
YOLOV8s 95.6 99.4 11.2 28.6
CGDet 96.7 99.2 4.76 12.26
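For reference, a baseline run of this kind can be reproduced with MMDetection 3.x roughly as sketched below. The config fragment follows MMDetection's general conventions, but the dataset paths, file names, and config layout are placeholders rather than the paper's actual files, and individual keys may need adjusting to the installed version.

```python
# Illustrative MMDetection 3.x config fragment for the Faster R-CNN baseline
# (dataset paths and file names are placeholders, not the paper's actual files).
_base_ = './faster-rcnn_r50_fpn_1x_coco.py'   # from mmdetection/configs/faster_rcnn/

# Two classes: coal and gangue.
metainfo = dict(classes=('coal', 'gangue'))
model = dict(roi_head=dict(bbox_head=dict(num_classes=2)))

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(704, 512), keep_ratio=True),  # 512 x 704 input
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackDetInputs'),
]

train_dataloader = dict(
    batch_size=8,                                 # batch size used in the comparison
    dataset=dict(
        metainfo=metainfo,
        ann_file='annotations/train.json',        # placeholder path
        data_prefix=dict(img='images/train/'),    # placeholder path
        pipeline=train_pipeline,
    ),
)

# Training is then launched with MMDetection's standard entry point, e.g.:
#   python tools/train.py configs/coal_gangue/faster-rcnn_coal-gangue.py
```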
4.7. Comparison of Different Coal and Gangue Perception Methods
Many excellent convolutional neural network models have been developed for the
classification and localization of coal and gangue in images, but their performances vary.
To provide a rough comparison of the performance of these outstanding models, this study
selected convolutional neural network models that perceive coal and gangue using color
images for comparison. Since these models use different datasets and the source code is not
publicly available, the comparative results in Table 5 can only reflect the overall progress in the field.
As shown in Table 5, the proposed CGDet achieved the highest AP50, indicating that
CGDet is highly competitive in the perception of coal and gangue. At the same time, CGDet
had minimal inference time, indicating its ability to quickly perceive coal and gangue in
images. Furthermore, CGDet has fewer parameters and computations, demonstrating
its lightweight nature. A comparison across the models listed in the table shows that many of them are either lightweight but underperforming, or accurate but insufficiently compact and slower at inference. CGDet strikes a balance between compactness and performance, exhibiting outstanding accuracy and efficiency in perceiving densely distributed coal and gangue in images.
Table 5. Comparison of different coal and gangue perception methods.
Reference AP50 (%) Parameters (M) GFLOPs Inference Time (ms)
Q. Liu [23] 96.45 - - 30.67
D. Yang [25] 91.90 6.64 14.30 -
P. Yan [24] 96.00 - - 19.00
G. Xue [39] 96.27 - - 21.97
J. Liu [40] 78.50 - - 28.41
Y. Liu [43] 80.24 5.97 6.83 11.12
B. Zhang [41] 91.33 - - 40.00
Z. Lv [20] 88.54 - - 30.20
CGDet 96.70 4.76 12.26 11.96
5. Conclusions
This paper presents CGDet, a compact convolutional neural network model specifi-
cally designed for the perception of coal and gangue in dense scenes. Through extensive
experimental validation, the following key scientific and practical findings were established:
Model Performance and Efficiency: CGDet operates with only 4.76 million parameters
and 12.26 GFLOPs of computational load, achieving an AP50 of 96.7% and an AR50 of
99.2%. This demonstrates that incorporating object distribution density and scale consider-
ations allows for significant model lightweighting without sacrificing performance, thereby
informing the design of efficient deep learning models.
Input Image and Feature Map Selection: The Object Distribution Density Measurement (ODDM) method determined an optimal input image resolution of 512 × 704, with P3 used as the feature map for detection. These configurations yielded excellent performance in dense scenarios, underscoring the importance of tailored input and feature map resolutions for mitigating issues associated with label rewriting.
Structural Optimization and Cost Reduction: By employing the Relative Resolution
Object Scale Measurement (RROSM) method to assess object scale and optimizing the
model’s neck structure, CGDet achieved a 46.76% reduction in parameters and a 47.94%
decrease in computational costs, while slightly enhancing both AP50 and AR50. This
indicates that the RROSM method is effective at evaluating object scale, playing a crucial
role in structural design and the elimination of redundant parameters.
Practical Recommendations: For the specific task of detecting densely distributed coal
and gangue, it is advisable for designers and mechanical engineers to develop customized
object detection models, as these may outperform general-purpose detectors. Despite
CGDet’s superior performance, it remains susceptible to duplicate detections. Future work
should focus on addressing this issue, potentially through the integration of fine-grained
classification methods to enhance detection accuracy.
Author Contributions: Conceptualization, H.L. and X.C.; methodology, H.L. and Y.L.; software, X.C.,
Y.L. and J.L.; validation, H.L. and X.C.; investigation, X.C. and K.X.; resources, K.X.; data curation,
J.L., Y.L. and X.C.; writing—original draft preparation, H.L. and X.C.; writing—review and editing,
K.X. and J.L.; funding acquisition, K.X. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was sponsored by the Fundamental Research Funds for the Central Universities
(No. FRF-BD-23-02), and the Beijing Science and Technology Planning Project (No. Z221100005822012).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Wang, X.-P.; Zhang, Z.-M.; Guo, Z.-H.; Su, C.; Sun, L.-H. Energy Structure Transformation in the Context of Carbon Neutralization: Evolutionary Game Analysis Based on Inclusive Development of Coal and Clean Energy. J. Clean. Prod. 2023, 398, 136626. [CrossRef]
2. Zhang, J.X.; Zhang, Q.; Spearing, A.J.S.; Miao, X.X.; Guo, S.; Sun, Q. Green Coal Mining Technique Integrating Mining-Dressing-Gas Draining-Backfilling-Mining. Int. J. Min. Sci. Technol. 2017, 27, 17–27. [CrossRef]
3. Wei, Y.; Zhang, W.X.; Lin, B.Q.; Si, G.Y.; Zhang, J.G.; Wang, J.L. Integration of Protective Mining and Underground Backfilling for Coal and Gas Outburst Control: A Case Study. Process Saf. Environ. Prot. 2022, 157, 273–283.
4. Sotoudeh, F.; Nehring, M.; Kizil, M.; Knights, P. Integrated Underground Mining and Pre-Concentration Systems; a Critical Review of Technical Concepts and Developments. Int. J. Mining Reclam. Environ. 2021, 35, 153–182. [CrossRef]
5. Liu, H.; Xu, K. Recognition of Gangues from Color Images Using Convolutional Neural Networks with Attention Mechanism. Measurement 2023, 206, 112273. [CrossRef]
6. Luo, X.; He, K.; Zhang, Y.; He, P.; Zhang, Y. A Review of Intelligent Ore Sorting Technology and Equipment Development. Int. J. Miner. Met. Mater. 2022, 29, 1647–1655. [CrossRef]
7. Yang, J.; Peng, J.; Li, Y.; Xie, Q.; Wu, Q.; Wang, J. Gangue Localization and Volume Measurement Based on Adaptive Deep Feature Fusion and Surface Curvature Filter. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [CrossRef]
8. Wang, J.; Zhao, M.; Xia, C. An Improved Classification Diagnosis Approach for Cervical Images Based on Deep Neural Networks. Pattern Anal. Appl. 2024, 27, 79. [CrossRef]
9. Campos, S.; Zamora, J.; Allende, H. Block-Wise Imputation EM Algorithm in Multi-Source Scenario: ADNI Case. Pattern Anal. Appl. 2024, 27, 44. [CrossRef]
10. Akbaba, E.E.; Gurkan, F.; Gunsel, B. Boosting Person ReID Feature Extraction via Dynamic Convolution. Pattern Anal. Appl. 2024, 27, 80. [CrossRef]
11. Zou, L.; Yu, X.; Li, M.; Lei, M.; Yu, H. Nondestructive Identification of Coal and Gangue via Near-infrared Spectroscopy Based on Improved Broad Learning. IEEE Trans. Instrum. Meas. 2020, 69, 8043–8052. [CrossRef]
12. Li, C.; Wang, J. Remote Sensing Image Location Based on Improved Yolov7 Target Detection. Pattern Anal. Appl. 2024, 27, 50. [CrossRef]
13. Bao, W.; Zhang, H.; Ding, Y.; Shen, F.; Li, L. EdgeNet: A Low-Power Image Recognition Model Based on Small Sample Information. Pattern Anal. Appl. 2024, 27, 82. [CrossRef]
14. Kim, S.; Jang, I.-S.; Ko, B.C. Domain-Free Fire Detection Using the Spatial–Temporal Attention Transform of the Yolo Backbone. Pattern Anal. Appl. 2024, 27, 45. [CrossRef]
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
16. Li, D.; Zhang, Z.; Xu, Z.; Xu, L.; Meng, G.; Li, Z.; Chen, S. An Image-Based Hierarchical Deep Learning Framework for Coal and Gangue Detection. IEEE Access 2019, 7, 184686–184699. [CrossRef]
17. Lei, S.; Xiao, X.; Zhang, M.; Dai, J. Visual Classification Method Based on CNN for Coal-Gangue Sorting Robots. In Proceedings of the 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE), Dalian, China, 19–20 September 2020; pp. 543–547.
18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
19. Li, D.; Ren, H.; Wang, G.; Wang, S.; Wang, W.; Du, M. Coal Gangue Detection and Recognition Method Based on Multiscale Fusion Lightweight Network SMS-YOLOv3. Energy Sci. Eng. 2023, 11, 1783–1797. [CrossRef]
20. Lv, Z.; Wang, W.; Xu, Z.; Zhang, K.; Lv, H. Cascade Network for Detection of Coal and Gangue in the Production Context. Powder Technol. 2021, 377, 361–371. [CrossRef]
21. Li, D.; Wang, G.; Zhang, Y.; Wang, S. Coal Gangue Detection and Recognition Algorithm Based on Deformable Convolution Yolov3. IET Image Process. 2022, 16, 134–144. [CrossRef]
22. Yan, P.; Sun, Q.; Yin, N.; Hua, L.; Shang, S.; Zhang, C. Detection of Coal and Gangue Based on Improved Yolov5.1 Which Embedded Scse Module. Measurement 2022, 188, 110530. [CrossRef]
23. Liu, Q.; Li, J.G.; Li, Y.S.; Gao, M.W. Recognition Methods for Coal and Coal Gangue Based on Deep Learning. IEEE Access 2021, 9, 77599–77610. [CrossRef]
24. Yan, P.; Kan, X.; Zhang, H.; Zhang, X.; Chen, F.; Li, X. Target Recognition of Coal and Gangue Based on Improved Yolov5s and Spectral Technology. Sensors 2023, 23, 4911. [CrossRef] [PubMed]
25. Yang, D.; Miao, C.; Li, X.; Liu, Y.; Wang, Y.; Zheng, Y. Improved Yolov7 Network Model for Gangue Selection Robot for Gangue and Foreign Matter Detection in Coal. Sensors 2023, 23, 5140. [CrossRef] [PubMed]
26. Xu, S.; Zhou, Y.; Huang, Y.; Han, T. Yolov4-Tiny-Based Coal Gangue Image Recognition and FPGA Implementation. Micromachines 2022, 13, 1983. [CrossRef] [PubMed]
27. Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of Research on Lightweight Convolutional Neural Networks. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC 2020), Chongqing, China, 12–14 June 2020; pp. 1713–1720.
28. Chen, F.; Li, S.; Han, J.; Ren, F.; Yang, Z. Review of Lightweight Deep Convolutional Neural Networks. Arch. Comput. Methods Eng. 2024, 31, 1915–1937. [CrossRef]
29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
31. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-V4, Inception-Resnet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
32. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
33. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
34. Vysogorets, A.; Kempe, J. Connectivity Matters: Neural Network Pruning through the Lens of Effective Sparsity. J. Mach. Learn. Res. 2021, 24, 1–23. [CrossRef]
35. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv 2014, arXiv:1405.3866.
36. Zhang, X.; Zou, J.; He, K.; Sun, J. Accelerating Very Deep Convolutional Networks for Classification and Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1943–1955. [CrossRef]
37. Guo, Y.; Yao, A.; Zhao, H.; Chen, Y. Network Sketching: Exploiting Binary Structure in Deep Cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
38. Hinton, G. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
39. Xue, G.; Li, S.; Hou, P.; Gao, S.; Tan, R. Research on Lightweight Yolo Coal Gangue Detection Algorithm Based on Resnet18 Backbone Feature Network. Internet Things 2023, 22, 100762. [CrossRef]
40. Liu, J.; Qiao, H.; Yang, L.; Guo, J. Improved Lightweight Yolov4 Foreign Object Detection Method for Conveyor Belts Combined with Cbam. Appl. Sci. 2023, 13, 8465. [CrossRef]
41. Zhang, B.; Zhang, H.-B. Coal Gangue Detection Method Based on Improved SSD Algorithm. In Proceedings of the 2021 International Conference on Intelligent Transportation, Big Data Smart City, Xi'an, China, 27–28 March 2021; pp. 634–637.
42. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
43. Liu, Y.; Wang, X.; Zhang, Z.; Deng, F. LOSN: Lightweight Ore Sorting Networks for Edge Device Environment. Eng. Appl. Artif. Intell. 2023, 123, 106191. [CrossRef]
44. Cao, Z.; Li, Z.; Fang, L.; Li, J. Lightweight Coal and Gangue Detection Algorithm Based on Improved Yolov7-Tiny. Int. J. Coal Prep. Util. 2024, 44, 1773–1792. [CrossRef]
45. Yan, P.; Zhang, H.; Kan, X.; Chen, F.; Wang, C.; Liu, Z. Lightweight Detection Method of Coal Gangue Based on Multispectral and Improved Yolov5s. Int. J. Coal Prep. Util. 2024, 44, 399–414. [CrossRef]
46. Wang, S.; Zhu, J.; Li, Z.; Sun, X.; Wang, G. Gdps-Yolo: An Improved Yolov8s for Coal Gangue Detection. Int. J. Coal Prep. Util. 2024. [CrossRef]
47. Xin, F.; Jia, Q.; Yang, Y.; Pan, H.; Wang, Z. A High Accuracy Detection Method for Coal and Gangue with S3DD-Yolov8. Int. J. Coal Prep. Util. 2024, 1–19. [CrossRef]
48. Yan, P.; Wang, W.; Li, G.; Zhao, Y.; Wang, J.; Wen, Z. Detection of Coal Gangue Based on Spectral Technology and Enhanced Lightweight Yolov7-tiny. Int. J. Coal Prep. Util. 2024, 44, 1843–1863. [CrossRef]
49. Wang, Y.; Peng, J.; Wang, H.; Wang, M. Progressive Learning with Multi-Scale Attention Network for Cross-Domain Vehicle re-Identification. Sci. China Inf. Sci. 2022, 65, 160103. [CrossRef]
50. Wang, H.; Yao, M.; Chen, Y.; Xu, Y.; Liu, H.; Jia, W.; Fu, X.; Wang, Y. Manifold-Based Incomplete Multi-View Clustering via Bi-Consistency Guidance. IEEE Trans. Multimed. 2024, 26, 10001–10014. [CrossRef]
51. Wang, H.; Yao, M.; Jiang, G.; Mi, Z.; Fu, X. Graph-Collaborated Auto-Encoder Hashing for Multiview Binary Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10121–10133. [CrossRef]
52. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
53. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
54. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
55. Liu, H.; Wang, D.; Xu, K.; Zhou, P.; Zhou, D. Lightweight Convolutional Neural Network for Counting Densely Piled Steel Bars. Autom. Constr. 2023, 146, 104692. [CrossRef]
56. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft Coco: Common Objects in Context. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.