WCAY object detection of fractures for X-ray images of multiple sites

Peng Chen1, Songyan Liu1, Wenbin Lu1,2, Fangpeng Lu1,2 & Boyang Ding1

1Heilongjiang University, Harbin 150080, China. 2These authors contributed equally: Wenbin Lu and Fangpeng Lu. email: liusongyan@hlju.edu.cn
The WCAY (weighted channel attention YOLO) model, which is meticulously crafted to identify fracture features across diverse X-ray image sites, is presented herein. This model integrates novel core operators and an innovative attention mechanism to enhance its efficacy. Initially, leveraging the benefits of dynamic snake convolution (DSConv), which is adept at capturing elongated tubular structural features, we introduce the DSC-C2f module to augment the model's fracture detection performance by replacing a portion of C2f. Subsequently, we integrate the newly proposed weighted channel attention (WCA) mechanism into the architecture to bolster feature fusion and improve fracture detection across various sites. Comparative experiments were conducted to evaluate the performance of several attention mechanisms. These enhancement strategies were validated through experimentation on public X-ray image datasets (FracAtlas and GRAZPEDWRI-DX). Multiple experimental comparisons substantiated the model's efficacy, demonstrating its superior accuracy and real-time detection capabilities. According to the experimental findings, on the FracAtlas dataset, our WCAY model exhibits a notable 8.8% improvement in mean average precision (mAP) over the original model. On the GRAZPEDWRI-DX dataset, the mAP reaches 64.4%, with a detection accuracy of 93.9% for the "fracture" category alone. Compared with other state-of-the-art object detection models, the proposed model represents a substantial improvement over the original algorithm. The code is publicly available at https://github.com/cccp421/Fracture-Detection-WCAY.
Keywords Fracture detection, Deep learning, Attention mechanism, YOLO
Bone trauma, arising from incidents such as jostling, falls, and car accidents, is a prevalent occurrence in modern life. It encompasses a range of injuries, including fractures, cracks, tears, and compression injuries. Symptoms typically manifest as pain, swelling, and restricted movement, potentially leading to complications such as nonunion and infection1. Timely diagnosis and appropriate treatment are crucial in managing bone trauma, given the unpredictable nature of injury occurrence and variations in medical expertise among treating physicians2. The advent of artificial intelligence offers promising solutions to the clinical complexities associated with orthopedic trauma3.
Deep learning, a pivotal subset of artificial intelligence, has garnered significant attention for its applications in fracture detection and as a supplementary tool for clinician diagnostics4. Fracture detection primarily utilizes X-ray and computed tomography (CT) images, with X-ray image research being particularly prevalent5. Consequently, fracture detection within deep learning frameworks can be conceptualized as an object detection task6.
Object detection algorithms serve the purpose of identifying both the location and class of targets within an image7. These algorithms predominantly rely on convolutional neural networks and are categorized into two main types: two-stage models and single-stage models8. Two-stage models typically involve the generation of candidate regions from the input image, followed by classification and regression9. Examples include R-CNN10, Fast R-CNN11, and Faster R-CNN12, which are known for their higher detection accuracy. In contrast, single-stage models simplify the problem by treating object detection as a regression task and performing global regression-based classification13. Models such as the You Only Look Once (YOLO)14 series and RetinaNet15 directly extract class and location information without the need for candidate region generation.
Furthermore, the ongoing pursuit of improved detection performance in neural network models for object detection remains a prominent research focus. Enhancement strategies primarily revolve around data augmentation and network architecture modifications16. Of particular interest in network structure enhancement is the integration of attention mechanisms, a current area of active exploration and research16. The attention mechanism is a unique structure embedded in machine learning models that automatically captures the contribution of input data to output data17. The basic principle of attention mechanisms in computer vision is to find the correlations within the raw data and then emphasize certain key features18, such as the
squeeze-and-excitation (SE) attention method19, convolutional block attention module (CBAM)20, global attention mechanism (GAM)21, and coordinate attention (CA)22.
Therefore, we use the FracAtlas X-ray dataset23, a collection of X-ray scans of multiple body parts, including the hand, shoulder, leg, and foot, to design a generalized X-ray fracture detection model. We introduce the dynamic snake convolution C2f (DSC-C2f) operator, which is designed to efficiently extract slender fracture features. In addition, we introduce a novel WCA attention mechanism to improve the detection accuracy. Leveraging insights from the YOLO family of single-stage detection algorithms, we develop the WCAY fracture detection model. To improve the overall efficacy of the proposed model, we provide it in different model sizes following the YOLO convention, including the Nano, Small and Medium models. To validate the feasibility of the model, we also trained it on the GRAZPEDWRI-DX public dataset24. The contributions of this paper are summarized as follows:
• Leveraging dynamic snake convolution (DSConv)25, we introduce a learning residual module, DSC-C2f, capable of capturing tubular structures.
• We propose a weighted channel attention (WCA) mechanism.
• We propose a new object detection network called weighted channel attention YOLO (WCAY) that incorporates the WCA and DSC-C2f modules proposed in this paper.
• The feasibility of DSC-C2f, WCA and WCAY was verified on several datasets.
Related work
Fracture detection, a critical aspect of medical imaging, has seen widespread application. Guan et al.26 utilized the R-CNN model on the MURA dataset27, achieving an average accuracy of 62.04%. Yahalomi et al.28 demonstrated the effectiveness of a Faster R-CNN model in localizing distal radius fractures, surpassing radiologists' performance and offering promise in rare disease identification. Wang et al.29 introduced ParallelNet, an R-CNN network with a TripleNet backbone, for thigh fracture detection in a dataset comprising 3842 X-ray images. Similarly, Krogue et al.30 employed a RetinaNet model utilizing DenseNet169 for automatic detection, localization, and classification of hip fractures.
While these two-stage algorithms boast high accuracy, their speed remains a concern, so achieving a balance between accuracy and speed is imperative. Single-stage object detection algorithms, exemplified by the YOLO family, have emerged as significant contributors in this realm. Li et al.31 applied the YOLOv3 model to vertebral fracture detection, demonstrating its effectiveness. Yuan et al.32 innovatively integrated external attention and 3D feature fusion into YOLOv5 to detect skull fractures in CT images. Warin et al.33 leveraged YOLOv5 to detect maxillofacial fractures in a substantial dataset, classifying conditions into frontal, midfacial, and jaw fractures and no fracture. Mushtaq et al.34 demonstrated the proficiency of the YOLOv5 model in lumbar vertebrae localization, achieving an impressive average accuracy of 0.975. Furthermore, in pediatric wrist fracture detection, Dibo et al.35 enhanced YOLOv7 with the CBAM attention mechanism, achieving improved performance on the GRAZPEDWRI-DX dataset. Moreover, Ju et al.36 utilized the YOLOv8 model for wrist fracture detection, presenting an application tailored for this purpose.
However, due to the difficulties in establishing a high-quality fracture image dataset and the subjective nature of doctors' image annotations, a completely uniform standard does not exist, and deep learning-based fracture diagnosis studies are usually conducted for specific fracture types37. Therefore, it is particularly important to develop a deep learning model for fracture detection that is applicable to various types of images and different fracture sites.
Proposed method
YOLOv8 Architecture
Redmon et al.14 introduced the YOLO architecture in 2015 for real-time detection, aiming to address target detection as a regression challenge. This approach involves directly mapping coordinates and class probabilities from image pixels to bounding boxes using a single neural network model. YOLOv838, the latest iteration proposed by Glenn Jocher, represents a significant improvement over YOLOv539. Notably, YOLOv8 replaces the C3 module with the more efficient C2f module, which features a CSP bottleneck with two convolutions instead of three, along with adjustments to the number of channels. Moreover, the head section is modified to employ the decoupled head technique, separating classification and detection tasks.
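For context, the following is a minimal PyTorch sketch of the C2f wiring just described (split the features, chain bottlenecks, concatenate every intermediate output, and fuse with a 1 × 1 convolution). It simplifies the Ultralytics implementation: plain 3 × 3 convolution pairs stand in for the full residual Bottleneck blocks, so treat it as an illustration rather than the released code.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic YOLOv8 building block."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C2f(nn.Module):
    """Sketch of YOLOv8's C2f: split the features, run n bottlenecks on one
    half, and concatenate every intermediate output before a 1x1 fuse."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        # each "bottleneck" is reduced here to a 3x3 convolution pair
        self.m = nn.ModuleList(
            nn.Sequential(ConvBNSiLU(self.c, self.c, 3),
                          ConvBNSiLU(self.c, self.c, 3))
            for _ in range(n)
        )

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two halves
        y.extend(m(y[-1]) for m in self.m)      # chain the bottlenecks
        return self.cv2(torch.cat(y, dim=1))    # fuse all branches
```

The split-and-concatenate topology shown here is exactly what DSC-C2f, introduced below, inherits when its bottlenecks are replaced.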
Weighted-channel-attention-YOLO fracture detection network
To address issues such as inaccurate fracture detection, excessive model parameters, large model sizes, and
limited detection sites in traditional networks, this study introduces a novel X-ray fracture detection model
named WCAY (shown in Fig.1). Leveraging YOLOv8s as the baseline network, we incorporated the DSC-C2f
core operator into the network backbone to enhance the model’s sensitivity to elongated and curved tubular
structures typical of fractures. Additionally, we integrate a self-developed attention module (WCA) into the neck
network to enable the model to prioritize abnormal regions while suppressing non anomalous areas, thereby
enhancing overall performance.
DSC-C2f Module
e YOLOv8s network architecture incorporates numerous C2f modules, which are primarily tasked with
learning residual features. erefore, the network’s performance is heavily reliant on the eectiveness of these C2f
module features. Given the signicant variations in fracture morphology, location, and size—particularly with
crack-like fractures, which exhibit diverse shapes and sizes—the original C2f module may struggle to adequately
extract such small, localized features. To address this limitation and further bolster the network’s ability to learn
fracture features, this paper introduces DSConv from the dynamic snake convolution network (DSCNet). Subsequently, a new module, termed the DSC-C2f module, is designed around it.

In 2023, Yaolei Qi et al.25 developed the DSCNet network, which is specifically tailored for tubular structure segmentation tasks. Within DSCNet, DSConv emerged as a convolutional module offering a novel alternative to traditional convolution. As illustrated in Fig. 2, DSConv has distinctive operational characteristics. To effectively extract local features of tubular structures and enable the convolutional kernel to focus on intricate geometric features, DSConv introduces deformation offsets. By sequentially examining each target to be processed, DSConv maintains consistent attention, and constraining the deformation offsets prevents the receptive field from spreading too extensively, resulting in an output feature map resembling a "snake" shape.
Figure 3 illustrates the structure of DSC-C2f. The DySnakeConv module is formed by linking two initial DSConv layers with a convolution module (ConvM) layer. Initially, the first ConvM layer increases the number of channels in the expansion layer. Subsequently, the DySnakeConv module is applied to the feature map, followed by a second ConvM layer that reduces the number of channels in the output feature map to match the input channels. Finally, the feature obtained in the preceding stage is merged with the residual edge for feature fusion, constituting the dynamic snake convolution bottleneck (DSC-Bottleneck) module. The newly designed DSC-C2f module replaces all the bottleneck components of the original C2f module in the network model with DSC-Bottleneck modules. The DSC-C2f module thus combines the multiscale feature extraction capabilities of the original C2f module with DSConv's ability to adaptively attend to slender and curvilinear features.
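To make this data flow concrete, the following is a structural PyTorch sketch of the DSC-Bottleneck under two stated assumptions: plain 3 × 3 convolutions stand in for the two DSConv layers (the real DSConv additionally learns per-position deformation offsets25), and the two snake branches are fused by summation, which the text does not specify.

```python
import torch
import torch.nn as nn

class DSCBottleneck(nn.Module):
    """Structural sketch of the DSC-Bottleneck: expand channels (first ConvM),
    apply the two snake-convolution branches (DySnakeConv), reduce channels
    (second ConvM), then fuse with the residual edge."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(                 # first ConvM layer
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        # Placeholders for the x-axis and y-axis DSConv layers; the real
        # modules deform their kernels along learned offsets (DSCNet, ref. 25).
        self.snake_x = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.snake_y = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.reduce = nn.Sequential(                 # second ConvM layer
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x):
        y = self.expand(x)
        y = self.snake_x(y) + self.snake_y(y)   # two DSConv branches, fused
        return x + self.reduce(y)               # residual edge for feature fusion
```

In DSC-C2f, modules of this form simply take the place of the plain bottlenecks inside the C2f split-and-concatenate topology shown earlier.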
Weighted Channel Attention mechanism
The attention mechanism plays a crucial role in capturing the focal regions of the whole image, further enhancing the model's focus on image features of the abnormal bone region and improving the model's generalizability. However, it is important to note that an attention mechanism also has the disadvantage of increased computational effort, leading to increased computational cost. We design a new channel attention mechanism, weighted channel attention (WCA), inspired by the CA (coordinate attention)22 module, as shown in Fig. 4.
Fig. 1. Model structure of WCAY.
This WCA module can be viewed as a computational unit designed to improve the representation of the features learned by the network. It can take as input any intermediate feature tensor $X \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels and $H$ and $W$ denote the spatial dimensions of the input features. To clearly describe the proposed WCA, we first revisit how CA embeds positional information into channel attention, as shown in (a) in Fig. 5.
CA decomposes the original input tensor $X$ into two parallel one-dimensional feature encoding vectors for modeling cross-channel dependencies with spatial location information. The following two formulas each produce a one-dimensional vector from one-dimensional global average pooling along one spatial dimension, so each can be viewed as a collection of positional information along the other dimension. The one-dimensional global average pooling that encodes global information along the horizontal dimension for channel $C$ at height $H$ can be expressed as Eq. (1). Similarly, the output of the pooling for channel $C$ at width $W$ can be expressed as Eq. (2).

$$Z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H, i) \quad (1)$$

$$Z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} x_c(j, W) \quad (2)$$
Here, $x_c$ denotes the input feature in channel $c$. Through such an encoding process, CA captures long-distance dependencies along the horizontal dimension and preserves exact positional information along the vertical dimension. The model uses input feature encoding to synthesize global information, helping it capture spatial global features. It then generates two parallel 1D vectors for feature coding and permutes the shape of one of the vectors before merging the two. Immediately after, these parallel encoded vectors are shared with a channel-reducing 1 × 1 convolution. Coordinate attention (CA) then decomposes the 1 × 1 convolution output into two parallel 1D feature encoding vectors. Each path contains a 1 × 1 convolution and a nonlinear sigmoid function. Finally, the attention weights of the two paths are applied to the original feature map to produce the final output. This approach preserves accurate spatial details while efficiently exploiting long-range dependencies through inter-channel and spatial information coding.
Fig. 3. Structure of DSC-C2f.

Fig. 2. Schematic of how DSConv works. Dynamic snake convolution (DSConv) learns deformations based on input feature maps and adaptively focuses on elongated and tortuous local features based on an understanding of the morphology of tubular structures25.
Although CA embeds precise positional information into the channels, and exploiting this spatial capture of long-distance interactions improves the model's concentration on fracture features40, an excess of long-range information causes the model to miss crucial feature details during multiscale fusion, leading to overfitting. As a result, fracture feature localization becomes diffuse and unconstrained, with the model attending to a wide range of focal points beyond the pre-labeled bounding box in the image. To solve this problem of concentration diffusion, we designed the WCA module, whose overall structure is shown in (b) in Fig. 5.
Fig. 5. Comparison of different attention modules: (a) CA module; (b) WCA module.

Fig. 4. Principle of the WCA. Here, "X avg pool" represents 1D horizontal global pooling, and "Y avg pool" indicates 1D vertical global pooling22.
Specifically, given the aggregated feature maps produced by Eq. (1) and Eq. (2), we first concatenate them and send them to a 3 × 3 convolutional transform function $F_{3 \times 3}$ to obtain

$$f = F_{3 \times 3}\left(\left[Z^H, Z^W\right]\right) \quad (3)$$

where $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension and $f \in \mathbb{R}^{C \times 1 \times (W+H)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions. We then split $f$ into two separate tensors $f^H \in \mathbb{R}^{C \times H \times 1}$ and $f^W \in \mathbb{R}^{C \times 1 \times W}$ along the spatial dimension. Then, to obtain the feature weights of the two tensors in the vertical and horizontal dimensions, we feed $f$ into a 1 × 1 convolutional transform to obtain

$$w = \sigma(F_{1 \times 1}(f)) \quad (4)$$

where $\sigma$ is the sigmoid function. Similarly, we split $w$ along the spatial dimension into two separate feature weights $w^H \in \mathbb{R}^{C \times H \times 1}$ and $w^W \in \mathbb{R}^{C \times 1 \times W}$. We then aggregate the direction tensors and weights via simple multiplication to obtain Eq. (5) and Eq. (6):

$$a^H = f^H \times w^H \quad (5)$$

$$b^W = f^W \times w^W \quad (6)$$

Finally, by multiplying the outputs of the two parallel routes with the original input feature map, the output $y$ of our WCA module can be written as

$$y_c(i, j) = x_c(i, j) \times \sigma\left(a_c^H(i)\right) \times \sigma\left(b_c^W(j)\right) \quad (7)$$
In contrast to channel attention, which solely recalibrates the significance of various channels, our WCA block not only incorporates spatial information encoding but also amplifies constraints, prioritizing spatial details. As elucidated earlier, weighted attention is applied concurrently along both the horizontal and vertical directions of the input tensor. Each element within these attention maps signifies whether the object of interest is present in the corresponding row and column. This encoding mechanism enables our WCA to precisely pinpoint the exact position of an object, thereby facilitating improved recognition by the overall model.
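To make the computation above concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(7). It keeps the channel count at $C$ throughout, consistent with $f \in \mathbb{R}^{C \times 1 \times (W+H)}$; the absence of normalization inside $F_{3 \times 3}$ and $F_{1 \times 1}$ is an assumption of this sketch, not a detail taken from the text.

```python
import torch
import torch.nn as nn

class WCA(nn.Module):
    """Sketch of Weighted Channel Attention following Eqs. (1)-(7)."""
    def __init__(self, channels: int):
        super().__init__()
        self.f3 = nn.Conv2d(channels, channels, 3, padding=1)  # F_3x3 in Eq. (3)
        self.f1 = nn.Conv2d(channels, channels, 1)             # F_1x1 in Eq. (4)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        # Eqs. (1)/(2): 1D global average pooling along width and height
        z_h = x.mean(dim=3, keepdim=True).permute(0, 1, 3, 2)  # (b, c, 1, h)
        z_w = x.mean(dim=2, keepdim=True)                      # (b, c, 1, w)
        # Eq. (3): concatenate along the spatial dim, apply the 3x3 transform
        f = self.f3(torch.cat([z_h, z_w], dim=3))              # (b, c, 1, h+w)
        # Eq. (4): feature weights from a 1x1 convolution and a sigmoid
        wgt = self.sigmoid(self.f1(f))
        # split both tensors back into vertical / horizontal parts
        f_h, f_w = torch.split(f, [h, w], dim=3)
        w_h, w_w = torch.split(wgt, [h, w], dim=3)
        # Eqs. (5)/(6): weight each direction by elementwise multiplication
        a_h = (f_h * w_h).permute(0, 1, 3, 2)                  # (b, c, h, 1)
        b_w = f_w * w_w                                        # (b, c, 1, w)
        # Eq. (7): apply both attention maps to the original input
        return x * self.sigmoid(a_h) * self.sigmoid(b_w)

# quick shape check
attn = WCA(64)
print(attn(torch.randn(1, 64, 32, 48)).shape)  # torch.Size([1, 64, 32, 48])
```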
Experiments and discussion
Dataset and image preprocessing
The FracAtlas and GRAZPEDWRI-DX datasets were used in this study. The FracAtlas dataset is composed of 4083 X-ray images of bone fractures from all major parts of the human body, collected from three major hospitals in Bangladesh, as shown in Fig. 6. This dataset was manually annotated with the help of two radiologists and an orthopedic surgeon and contains 717 images with 922 fracture instances23. The GRAZPEDWRI-DX dataset, shown in Fig. 7, was collected by a number of pediatric radiologists at the Department of Pediatric Surgery of the University Hospital Graz. It comprises 10,643 wrist studies with 20,327 image samples from 6,091 unique pediatric patients24. The dataset was annotated by a group of pediatric radiologists. There are nine different types of annotation objects, and each image can be associated with multiple objects35,36.
In addition, the restricted image diversity observed in low-feature X-ray images poses a challenge, as models trained solely on such data may exhibit suboptimal performance when applied to other X-ray images. To enhance the robustness of these models, we employ data augmentation techniques aimed at improving image quality. Specifically, we implement online data augmentation on the training dataset, leveraging methods such as mosaic and mixup. Additionally, we fine-tune image brightness and contrast to further enhance model quality utilizing Albumentations41, an open-source Python library renowned for its image enhancement capabilities.
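As an illustration, a minimal Albumentations pipeline for the brightness/contrast adjustment described above might look as follows; the limits and probability are illustrative assumptions rather than the authors' exact settings, and mosaic/mixup are applied by the training framework itself.

```python
import albumentations as A
import numpy as np

# Brightness/contrast jitter with YOLO-format bounding boxes carried through.
transform = A.Compose(
    [A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5)],
    bbox_params=A.BboxParams(format="yolo", label_fields=["labels"]),
)

image = np.zeros((640, 640, 3), dtype=np.uint8)   # placeholder X-ray image
out = transform(image=image, bboxes=[(0.5, 0.5, 0.2, 0.1)], labels=[0])
print(out["image"].shape, out["bboxes"])          # boxes survive the transform
```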
Ethics approval
This research does not involve human participants and/or animals. All methods complied with the guidelines and relevant regulations.
Fig. 6. FracAtlas dataset, showing scans containing various parts of the arm, leg, waist and shoulder. Each fracture instance has its own mask and bounding box, and the scans also have a global label for the classification task, which is set to "fractured".
Experimental environment
This experiment was conducted on an Ubuntu 18.04 system equipped with an Intel(R) Xeon(R) Platinum 8255C CPU and an NVIDIA GeForce RTX 3090 GPU, utilizing torch version 1.11. During training, the input image resolution was set to 640 × 640 pixels. The model was trained for 300 epochs with a patience of 50, a batch size of 32, and a learning rate of 0.01, using the SGD optimizer. Each dataset was randomly divided into three subsets (training, validation, and test), comprising approximately 70%, 20%, and 10% of the original dataset, respectively.
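For reference, the sketch below maps this configuration onto the Ultralytics training API38. The dataset YAML name is hypothetical, and a stock YOLOv8s definition stands in for the WCAY model; the released WCAY code is in the repository linked in the abstract.

```python
from ultralytics import YOLO

# Baseline model definition; swap in the WCAY model YAML to reproduce the paper.
model = YOLO("yolov8s.yaml")
model.train(
    data="fracatlas.yaml",   # hypothetical config: paths + class names for the split
    imgsz=640,               # 640 x 640 input resolution
    epochs=300, patience=50, # 300 epochs with early-stopping patience of 50
    batch=32, lr0=0.01,      # batch size and initial learning rate
    optimizer="SGD",
)
```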
Evaluation indicators
The key evaluation metrics for object detection algorithms include detection accuracy, model complexity, and detection speed. We use the key metrics of precision, recall and mAP to evaluate the model detection accuracy. Precision and recall are calculated via Eq. (8) and Eq. (9):

$$P_{\text{precision}} = \frac{TP}{TP + FP} \quad (8)$$

$$R_{\text{recall}} = \frac{TP}{TP + FN} \quad (9)$$
In the evaluation of target detection algorithms, true positives (TP) represent correctly detected positive samples, false positives (FP) represent negative samples incorrectly identified as positive, and false negatives (FN) represent positive samples erroneously identified as negative. A precision-recall (P-R) curve is generated for each category during the performance assessment, plotting precision against recall42. The area under this curve, spanning between the curve and the horizontal axis, is the average precision (AP) of the category. The mAP value of the model is computed as the average of the AP values across all categories. Typically, mAP is assessed using two metrics: mAP50, which considers predictions with at least 50% IoU overlap with ground-truth boxes as correct, and mAP50:95, which averages over IoU thresholds ranging from 0.5 to 0.95.
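The sketch below illustrates these definitions in Python; the trapezoidal integration is one common way to approximate the area under the P-R curve and is an assumption here, not necessarily the exact protocol of the evaluation toolkit used.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Eqs. (8)/(9): precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(precisions, recalls) -> float:
    """Area under the P-R curve for one category (trapezoidal approximation)."""
    p, r = np.asarray(precisions), np.asarray(recalls)
    order = np.argsort(r)
    return float(np.trapz(p[order], r[order]))

# mAP is the mean of the per-category AP values; mAP50 fixes the IoU threshold
# at 0.5, while mAP50:95 averages AP over thresholds 0.5, 0.55, ..., 0.95.
print(precision_recall(tp=90, fp=10, fn=30))  # (0.9, 0.75)
```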
The complexity of an object detection algorithm is gauged by factors such as model size, parameter count, and computational demands; elevated values in these respects correlate with increased model complexity. This study assesses model complexity through evaluation metrics encompassing computational load and model size. The computational load, which is indicative of time complexity, is quantified in floating-point operations (FLOPs), where one GFLOP equals one billion floating-point operations. Higher computational demands signify greater computational resource requirements.
Ablation study
To demonstrate the effectiveness of WCAY, we chose YOLOv8s as the baseline network (Baseline), added the DSC-C2f module to the backbone network, and equipped the neck network with our WCA attention mechanism. We performed ablation studies mainly on the FracAtlas dataset, testing different combinations of the improved modules.
Comparative experiments of DSC-C2f modules
To demonstrate the effectiveness of DSC-C2f in the detection task and the effect of DSC-C2f at different positions in the network on detection performance, we conducted a series of positional substitution comparison experiments for DSC-C2f on the FracAtlas dataset under uniform settings.
As seen in Fig. 1, a C2f layer is set up in the P2, P3, P4 and P5 layers of the original network backbone to extract features from the input image, and we replace the C2f of each layer with the DSC-C2f module in turn. As shown in Table 2, the accuracy of the model improves to different degrees after replacing a C2f layer in the original model with DSC-C2f, which reflects the excellent ability of the DSC-C2f module to extract tubular fracture features. In addition, different placements of the same number of modules produce different results. When we replace the C2f module in the P5 layer with DSC-C2f, the model detection accuracy improves the most, by 5.7%, from 47.9% mAP50 in the baseline model to 53.6%, compared with replacement at positions P2, P3, and P4. Although the parameter count increases, the corresponding gain in accuracy is the largest.
Fig. 7. The GRAZPEDWRI-DX dataset, showing the wrist fracture conditions in children from this dataset. Because there are few images in the metal category, we merged the metal category into the foreign body category to guarantee the convergence of the dataset. The dataset categories are "fracture", "text", "periostealreaction", "pronatorsign", "softtissue", "foreignbody", "boneanomaly", and "bonelesion".
Figure 8 illustrates the impact of DSC-C2f on model accuracy at various locations. Throughout training, the precision and recall curves consistently surpass the baseline curves. Notably, the most effective strategy, yielding the maximum mAP, is replacing C2f with DSC-C2f at layer P5. This approach maintains the model's initial precision and recall levels while enhancing accuracy, thereby influencing the mAP positively.

Figure 9 shows a comparison plot of the effective receptive field visualization for each C2f module in the network backbone, for which we use the effective receptive field (erf) visualization method43,44. As shown in the figure, we compare the effective erf sizes of the original C2f modules in each layer of the network backbone with those of our DSC-C2f modules; for the replaced DSC-C2f modules, the erf is smaller than in the baseline network. Generally, the smaller the receptive field is, the more local and detailed the features tend to be. Consequently, our DSC-C2f module excels at capturing local features of the input image, enhancing the network's ability to discern local patterns and structures.
Fig. 8. Comparison of precision and recall when the DSC-C2f module is at different positions in the network structure.
Model Parameters/M GFLOPs Precision Recall mAP50(%)
Baseline 11.13 28.4 79.8 41.4 47.9
+DSC-C2f*P2 11.15 29.1 70.0 46.6 50.7
+DSC-C2f*P3 11.26 29.6 72.4 45.2 49.8
+DSC-C2f*P4 11.65 29.5 65.0 44.8 49.0
+DSC-C2f*P5 12.14 28.9 62.9 48.3 53.6
Table 2. Comparative experiments of DSC-C2f modules at different locations in the network structure.
Methods DSC-C2f WCA Parameters/M GFLOPs mAP50 (%) mAP50:95(%)
Baseline 11.13 28.4 47.9 17.8
a √ 12.14 28.9 53.6 23.0
b √ 12.44 28.5 53.3 23.5
c √ √ 13.45 29.0 56.7 23.1
Table 1. Ablation experiments on the FracAtlas dataset.

Table 1 shows the experimental results on the FracAtlas dataset. After the C2f module in the P5 layer of the baseline network backbone is replaced with the DSC-C2f module, the mAP50 improves from 47.9 to 53.6%, and the mAP50:95 improves from 17.8 to 23.0%. Thus, the proposed DSC-C2f method is effective for fracture feature extraction from images. When the WCA mechanism is added to the model alone, it improves the mAP50 by 5.4%, and the mAP50:95 also increases from 17.8 to 23.5%. This demonstrates that the proposed WCA can capture a wider range of global information, allowing the network to focus more on features of the skeletal disease region in the image. When both the DSC-C2f and WCA modules are added, the WCAY model achieves an improvement of 8.8% in mAP50 and an increase from 17.8 to 23.1% in mAP50:95. The experimental results demonstrate that the WCAY model containing these improvements makes significant progress on all evaluation metrics compared with the original YOLOv8s model, validating the efficacy of the improved modules.
Comparative experiments of WCA modules
In this section, we conduct comparative experiments on different attention mechanisms embedded in the network model to further validate the effectiveness of the proposed WCA module.

We chose YOLOv8s as the benchmark model to assess the contribution of the X/Y weights in the WCA module. The experimental results are shown in Table 3. Since our WCA module was designed with inspiration from the CA module, the two perform almost identically when no weights are added. The performance of the model improves significantly when weights in the horizontal (X) and vertical (Y) directions are added separately, as well as when both are added. The visualization results are shown in Fig. 10: with the addition of the X weights alone, the model is more sensitive to the horizontal direction, and the activation values of the heat map are significantly higher, indicating that the region receives more attention in the x-direction. Similarly, with the addition of the Y weights, the model has higher activation values in the vertical regions of the heat map.
Fig. 10. Comparison of heat map results for WCA with different directional weights added. The heatmaps were created by Grad-CAM45. It is clear that with the addition of horizontal (X) and vertical (Y) direction weights, WCA pays more attention to fracture features.
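For readers who wish to reproduce such heatmaps, a minimal sketch using the open-source pytorch-grad-cam package is shown below; the classifier and target layer are stand-ins for illustration, not the WCAY model itself.

```python
import numpy as np
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

# Stand-in network; for a detector one would hook a late backbone/neck layer.
model = resnet18(weights=None)
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

x = torch.randn(1, 3, 224, 224)                 # placeholder input tensor
grayscale_cam = cam(input_tensor=x)[0]          # (224, 224) saliency map in [0, 1]

rgb = np.random.rand(224, 224, 3).astype(np.float32)  # float image in [0, 1]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)
print(overlay.shape)                            # (224, 224, 3) heat map overlay
```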
Method Parameters/M GFLOPs Precision Recall mAP50(%)
Baseline 11.13 28.4 79.8 41.4 47.9
+CA 11.15 28.4 72.4 46.6 50.6
+WCA* 12.76 28.5 74.1 46.1 50.6
+X weight 12.44 28.5 77.4 45.4 52.8
+Y weight 12.44 28.5 69.0 47.1 51.6
+WCA 12.44 28.5 72.3 47.7 53.3
Table 3. Comparison results of WCA modules with different directional weights added. Here, WCA* denotes the WCA module without the horizontal (X) and vertical (Y) weights.
Fig. 9. Comparison of effective receptive fields (erf). Visual comparison of the effective receptive field of the DSC-C2f module and the C2f module.
Meanwhile, we compare WCA with different attention mechanisms, further verifying its effectiveness by adding SE19, CBAM20, GAM21, and CA22 to the baseline. The experimental results are presented in Table 4. The parameter growth of the model after integrating GAM or CBAM fails to satisfactorily improve the detection accuracy. On the other hand, SE and CA achieve significant accuracy improvements with minimal parameter increments. However, their efficacy in capturing fracture features is somewhat limited, as shown in the heat maps in Fig. 11: SE occasionally fails to capture certain features, while CA, due to its intrinsic properties, sometimes exceeds the specified concentration range. In contrast, despite the increase in parameters and a negligible increase in computational cost, the accuracy of WCA improves significantly, by 5.4% compared with the baseline network without an attention mechanism.
In addition, we conducted comparative experiments on the different attention mechanisms in the benchmark network after adding the DSC-C2f module, as shown in Table 5. The experiments show that after feature extraction with the DSC-C2f module, only the model with the WCA module improves further, with a 3.1% gain in mAP, whereas the mAP values of the models with the other modules decrease. In terms of the precision and recall metrics, the metrics of all models improve except for the model with the GAM module. Overall, the WCA attention mechanism, which outperforms the other attention mechanisms on all metrics, is better suited to fracture detection tasks.
Comparative experiments of the WCAY algorithm
To demonstrate the eectiveness of the proposed WCAY algorithm for fracture detection in X-ray images,
we conducted a series of comparative experiments. We have selected several state-of-the-art object detection
Fig. 11. Results of our heatmap visualization of different attention mechanisms on the FracAtlas and GRAZPEDWRI-DX datasets. It is clear that our WCA can localize objects of interest more accurately than other attention methods.
Method Parameters/M GFLOPs Precision Recall mAP50(%)
Baseline 11.13 28.4 79.8 41.4 47.9
+GAM 17.68 33.7 56.7 44.3 43.5
+CBAM 11.39 28.4 70.4 42.3 47.2
+SE 11.16 28.4 76.0 42.0 50.2
+CA 11.15 28.4 72.4 46.6 50.6
+WCA 12.44 28.5 72.3 47.7 53.3
Table 4. Comparison of different attention methods on the FracAtlas dataset.
As seen from the results in Table 6, our algorithm has a positive effect on detection performance, with the highest mAP at each model size. For the nano size, the mAP of WCAY-n reaches 47.2%, which is 4.9% higher than the 42.3% of YOLOv8-n, the highest mAP among the other nano models. Our model also achieves the best results among models with parameter counts of 30M or more, and it attains a higher mAP than the DETR series despite those models having more than 33% higher parameter counts and computational costs. Our model likewise performs best among the single-stage detectors. Notably, among the small models, our WCAY-s achieves the highest mAP of all models, 56.7%, which is 5.9%, 5.7%, and 7.2% higher than YOLOv8, RT-DETR49, and FreeAnchor48, respectively, the best performers in the other algorithm series.
It is essential to highlight that transitioning from the small to the medium size leads to a decrease in mAP. This decline can be attributed to the larger model size necessitating higher-resolution input images and larger datasets. Given the standardized input image size of 640, medium-sized and larger models are susceptible to overfitting on our dataset. This is particularly relevant for the DETR family of models, which is why pre-trained weights need to be used during training. Consequently, the small model size is the most suitable for our detection task.
To validate the versatility of our model, we conducted comparative experiments across multiple categories using the GRAZPEDWRI-DX dataset. The results, depicted in Figs. 12 and 13, reveal WCAY's superior mAP across various real-time detection algorithms. However, our algorithm exhibits slightly lower accuracy in detecting the "boneanomaly" and "softtissue" categories. Nonetheless, for categories such as "fracture", "text", "foreignbody", "periostealreaction", and "pronatorsign", our algorithm demonstrates the highest mAP. Notably, the "bonelesion" category consistently maintains a high AP value across different models, particularly the nano and small models, providing remarkable detection results.
Method Parameters/M GFLOPs Precision Recall mAP50(%)
YOLO Series
YOLOv5-n 2.50 7.1 66.8 33.3 38.9
YOLOv5-s 9.11 23.8 70.6 40.8 48.4
YOLOv5-m 25.05 64.0 64.0 48.3 50.5
YOLOv6-n 4.23 11.8 68.7 35.6 39.1
YOLOv6-s 16.30 44.0 61.9 36.8 40.5
YOLOv6-m 51.98 161.1 72.4 37.4 39.7
YOLOv8-n 3.01 8.1 60.7 41.4 42.3
YOLOv8-s 11.13 28.4 79.8 41.4 47.9
YOLOv8-m 25.84 78.7 67.5 45.4 50.8
DETR Series
DETR-R5046 41.56 74.5 42.2 43.0 42.2
DAB-DETR-R5050 43.70 79.7 44.2 46.4 44.2
Conditional-DETR-R5051 43.45 78.4 50.1 36.3 50.1
RT-DETR-R5049 41.94 125.6 51.1 48.3 51.1
Other single-stage models
FreeAnchor-R5048 36.33 159.0 49.5 34.5 49.5
TOOD-R5047 32.02 153.0 39.6 33.7 39.6
Our models
WCAY-n 3.60 8.2 70.9 43.1 47.2
WCAY-s 13.45 29.0 76.2 51.7 56.7
WCAY-m 30.06 80.0 65.9 49.4 51.9
Table 6. Comparison of WCAY with different detection algorithms on the FracAtlas dataset. The detection methods compared include the YOLO series, the DETR series, and other single-stage detection algorithms.
Method Parameters/M GFLOPs Precision Recall mAP50(%)
Baseline + DSC-C2f 12.14 28.9 73.3 43.7 53.6
+CBAM 12.40 28.9 67.4 47.6 52.3
+GAM 18.70 34.2 76.4 39.7 48.0
+SE 12.17 28.9 63.7 49.4 51.5
+CA 12.16 28.9 72.6 48.3 52.6
+WCA 13.45 29.0 76.2 51.7 56.7
Table 5. Comparison of different attention methods on the FracAtlas dataset with the addition of DSC-C2f.
In conclusion, our algorithm consistently outperforms the other models in detection accuracy across both datasets, despite the increased number of parameters and computational load required to maintain this accuracy. Our experiments demonstrate the robust performance of WCAY compared with other object detection networks, underscoring its strong generalizability and effectiveness in tackling X-ray image fracture detection.
Fig. 13. Comparison of the mAP results of different real-time detection algorithms on the GRAZPEDWRI-DX dataset.

Fig. 12. Comparison of the detection results of different real-time detection algorithms, in different categories, on the GRAZPEDWRI-DX dataset.
Qualitative results
To clearly demonstrate the efficacy of the WCAY model, in addition to performing inference on the two X-ray fracture detection datasets, FracAtlas and GRAZPEDWRI-DX, we also performed inference on public datasets with similarities to X-ray images, NEU-DET52 and SSDD53. The WCAY model detects objects well in images from different domains and viewing angles, including objects with random orientations and varying scales. The detection results are visualized in Fig. 14.

As seen from the figure, on the FracAtlas dataset, our model can clearly detect and localize the fracture region in the X-ray image, and the detection results show a high confidence level; on the GRAZPEDWRI-DX dataset, our model detects the features of skeletal disorders in addition to the fracture features; and on the NEU-DET and SSDD datasets, our model also detects the corresponding targets well. The accurate localization and identification in the displayed images demonstrate the effectiveness of the WCAY algorithm in various types of challenging image detection.
Conclusion
In this paper, we propose a new algorithm, WCAY, for fracture detection in different parts of X-ray images. To improve the accuracy of the model in detecting fracture features, we introduce the DSConv module to improve the C2f module and propose a new core operator, DSC-C2f. In addition, to further improve the model's performance, we design a new channel attention mechanism (WCA), which is more effective at capturing long-range dependency information. The experimental results of the proposed WCAY model on the X-ray fracture detection datasets show that it has advantages over several mainstream real-time object detection methods. It performs well on evaluation metrics such as precision, recall, and mAP, reaching the state-of-the-art (SOTA) level. Specifically, for the small model size, the WCAY model improves the mAP on the FracAtlas dataset by 8.8% compared with the baseline model, while on the GRAZPEDWRI-DX dataset the all-category mAP improves by 1.1% and the mAP for the fracture category reaches 93.9%, demonstrating its excellent capability for fracture detection in X-ray imaging.
Fig. 14. The figure shows some qualitative results of the WCAY algorithm proposed in this paper on four datasets.
Data availability
The datasets analyzed during the current study are available at Figshare under https://figshare.com/articles/dataset/The_dataset/22363012 (FracAtlas) and https://figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193 (GRAZPEDWRI-DX). Both datasets are licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). The implementation code and trained models for this study can be found on GitHub at https://github.com/cccp421/Fracture-Detection-WCAY, where the provenance of the datasets used in this experiment can also be found.
Received: 17 April 2024; Accepted: 25 October 2024
References
1. Forriol, F. & Mazzola, A. Bone fractures: Generalities. Textbook of Musculoskeletal Disorders. https://doi.org/10.1007/978-3-031-20987-1_28 (2023).
2. Venneri, F. et al. Safe surgery saves lives. Textbook of Patient Safety and Clinical Risk Management. https://doi.org/10.1007/978-3-030-59403-9_14 (2021).
3. Lisacek-Kiosoglous, A. B. et al. Artificial intelligence in orthopedic surgery: exploring its applications, limitations, and future direction. Bone Joint Res. 12, 447–454. https://doi.org/10.1302/2046-3758.127.BJR-2023-0111.R1 (2023).
4. Xu, F. et al. Deep learning-based artificial intelligence model for classification of vertebral compression fractures: A multicenter diagnostic study. Front. Endocrinol. https://doi.org/10.3389/fendo.2023.1025749 (2023).
5. Ju, R. Y. & Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. https://doi.org/10.1038/s41598-023-47460-7 (2023).
6. Thian, Y. L. et al. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiology: Artificial Intelligence. https://doi.org/10.1148/ryai.2019180001 (2019).
7. Zhao, Z. Q., Zheng, P., Xu, S. T. & Wu, X. D. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems. 30, 3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865 (2019).
8. Jiao, L. et al. A survey of deep learning-based object detection. IEEE Access. 7, 128837–128868. https://doi.org/10.1109/ACCESS.2019.2939201 (2019).
9. Arkin, E., Yadikar, N., Muhtar, Y. & Ubul, K. A survey of object detection based on CNN and transformer. in IEEE International Conference on Pattern Recognition and Machine Learning (PRML) 99–108. https://doi.org/10.1109/PRML52754.2021.9520732 (2021).
10. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Preprint at https://arxiv.org/abs/1311.2524 (2014).
11. Girshick, R. Fast R-CNN. in IEEE International Conference on Computer Vision (ICCV) 1440–1448. Preprint at https://arxiv.org/abs/1504.08083 (2015).
12. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Toward real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. Preprint at https://arxiv.org/abs/1506.01497 (2015).
13. Hou, L., Lu, K. & Xue, J. Refined one-stage oriented object detection method for remote sensing images. IEEE Transactions on Image Processing. 31, 1545–1558. https://doi.org/10.1609/aaai.v33i01.33018577 (2022).
14. Redmon, J. et al. You Only Look Once: Unified, real-time object detection. Preprint at https://arxiv.org/abs/1506.02640 (2016).
15. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Preprint at https://arxiv.org/abs/1708.02002 (2018).
16. Niu, Z., Zhong, G. & Yu, H. A review on the attention mechanism of deep learning. Neurocomputing. 452, 48–62. https://doi.org/10.1016/j.neucom.2021.03.091 (2021).
17. Galassi, A., Lippi, M. & Torroni, P. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems. 32, 4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893 (2021).
18. Wan, D. H. et al. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 123. https://doi.org/10.1016/j.engappai.2023.106442 (2023).
19. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 7132–7141. Preprint at https://arxiv.org/abs/1709.01507v4 (2019).
20. Woo, S. et al. CBAM: Convolutional block attention module. in European Conference on Computer Vision (ECCV) 3–19. Preprint at http://arxiv.org/abs/1807.06521 (2018).
21. Liu, Y., Shao, Z. & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. Preprint at https://arxiv.org/abs/2112.05561v1 (2021).
22. Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design. in IEEE/CVF Conference on Computer Vision and Pattern Recognition 13713–13722. Preprint at https://arxiv.org/abs/2103.02907v1 (2021).
23. Abedeen, I. et al. FracAtlas: A dataset for fracture classification, localization and segmentation of musculoskeletal radiographs. Sci. Data. 10, 521. https://doi.org/10.1038/s41597-023-02432-4 (2023).
24. Nagy, E. et al. A pediatric wrist trauma X-ray dataset (GRAZPEDWRI-DX) for machine learning. Sci. Data. 9, 222. https://doi.org/10.1038/s41597-022-01328-z (2022).
25. Qi, Y., He, Y., Qi, X., Zhang, Y. & Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. in IEEE/CVF International Conference on Computer Vision (ICCV) 6047–6056. https://doi.org/10.1109/ICCV51070.2023.00558 (2023).
26. Guan, B., Zhang, G., Yao, J., Wang, X. & Wang, M. Arm fracture detection in X-rays based on improved deep convolutional neural network. Comput. Electr. Eng. 81. https://doi.org/10.1016/j.compeleceng.2019.106530 (2020).
27. Rajpurkar, P. et al. MURA dataset: Toward radiologist-level abnormality detection in musculoskeletal radiographs. Preprint at https://arxiv.org/abs/1712.06957v4 (2017).
28. Yahalomi, E., Chernofsky, M. & Werman, M. Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN. Intell. Syst. Comput. 997. https://doi.org/10.1007/978-3-030-22871-2_69 (2019).
29. Wang, M. et al. ParallelNet: Multiple backbone network for detection tasks on thigh bone fracture. Multimedia Systems. 27, 1091–1100. https://doi.org/10.1007/s00530-021-00783-9 (2021).
30. Krogue, J. D. et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol. Artif. Intell. 2. https://doi.org/10.1148/ryai.2020190023 (2020).
31. Li, Y.-C. et al. Can a deep-learning model for the automated detection of vertebral fractures approach the performance level of human subspecialists? Clinical Orthopaedics and Related Research. 479, 1598–1612. https://doi.org/10.1097/CORR.0000000000001685 (2021).
32. Yuan, G., Liu, G., Wu, X. & Jiang, R. An improved YOLOv5 for skull fracture detection. Exploration of novel intelligent optimization algorithms. Communications in Computer and Information Science 1590. https://doi.org/10.1007/978-981-19-4109-2_17 (2022).
33. Warin, K. et al. Maxillofacial fracture detection and classification in computed tomography images using convolutional neural network-based models. Sci. Rep. 13, 3434. https://doi.org/10.1038/s41598-023-30640-w (2023).
34. Fatima, J. et al. Vertebrae localization and spine segmentation on radiographic images for feature-based curvature classification for scoliosis. Concurrency and Computation: Practice and Experience. 34. https://doi.org/10.1002/cpe.7300 (2022).
35. Dibo, R. et al. DeepLOC: Deep learning-based bone pathology localization and classification in wrist X-ray images. Analysis of Images, Social Networks and Texts. 14486. https://doi.org/10.1007/978-3-031-54534-4_14 (2024).
36. Ju, R. Y. & Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. 13, 20077. https://doi.org/10.1038/s41598-023-47460-7 (2023).
37. Tanzi, L., Vezzetti, E., Moreno, R. & Moos, S. X-ray bone fracture classification using deep learning: A baseline for designing a reliable approach. Applied Sciences. 10, 1507. https://doi.org/10.3390/app10041507 (2020).
38. Jocher, G. et al. Ultralytics YOLO. GitHub https://github.com/ultralytics/ultralytics (2023).
39. Jocher, G. et al. YOLOv5 by Ultralytics. GitHub. https://doi.org/10.5281/zenodo.3908559 (2020).
40. Ouyang, D. et al. Efficient multi-scale attention module with cross-spatial learning. in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096516 (2023).
41. Buslaev, A. et al. Albumentations: Fast and flexible image augmentations. Information. 11, 125. https://doi.org/10.3390/info11020125 (2020).
42. Boyd, K., Eng, K. H. & Page, C. D. Area under the precision-recall curve: Point estimates and confidence intervals. Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. 8190. https://doi.org/10.1007/978-3-642-40994-3_29 (2013).
43. Luo, W. et al. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 29. https://proceedings.neurips.cc/paper/2016/hash/c8067ad1937f728f51288b3eb986afaa-Abstract.html (2016).
44. Shi, D. TransNeXt: Robust foveal visual perception for vision transformers. Preprint at https://arxiv.org/abs/2311.17132 (2023).
45. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. in Proceedings of the IEEE International Conference on Computer Vision 618–626. Preprint at https://arxiv.org/abs/1610.02391v4 (2017).
46. Carion, N., Massa, F., Synnaeve, G. et al. End-to-end object detection with transformers. Computer Vision – ECCV 2020 (ECCV 2020). vol 12346. https://doi.org/10.1007/978-3-030-58452-8_13 (2020).
47. Feng, C., Zhong, Y., Gao, Y. et al. TOOD: Task-aligned one-stage object detection. in International Conference on Computer Vision (ICCV) 3490–3499. https://doi.org/10.1109/ICCV48922.2021.00349 (2021).
48. Zhang, X., Wan, F., Liu, C. et al. FreeAnchor: Learning to match anchors for visual object detection. Advances in Neural Information Processing Systems. 32. https://doi.org/10.48550/arXiv.1909.02466 (2019).
49. Zhao, Y., Lv, W., Xu, S. et al. DETRs beat YOLOs on real-time object detection. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) 16965–16974. https://doi.org/10.48550/arXiv.2304.08069 (2024).
50. Liu, S., Li, F., Zhang, H. et al. DAB-DETR: Dynamic anchor boxes are better queries for DETR. Preprint at https://doi.org/10.48550/arXiv.2201.12329 (2022).
51. Meng, D., Chen, X., Fan, Z. et al. Conditional DETR for fast training convergence. in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021) 3651–3660. https://doi.org/10.48550/arXiv.2108.06152 (2021).
52. Zhao, W. D. et al. A new steel defect detection algorithm based on deep learning. Computational Intelligence and Neuroscience 1–13. https://doi.org/10.1155/2021/5592878 (2021).
53. Wang, Y. Y. et al. A SAR dataset of ship detection for deep learning under complex backgrounds. Remote Sensing. 11, 765. https://doi.org/10.3390/rs11070765 (2019).
54. Li, C. Y. et al. YOLOv6 by Meituan. GitHub https://github.com/meituan/YOLOv6 (2022).
Author contributions
P.C. is mainly responsible for writing the manuscript and conducting experiments throughout the entire research. S.L. is responsible for the overall direction and supervision of the paper. W.L. and F.L. are responsible for the overall layout of the paper and embellishment. B.D. is responsible for project management and coordination to ensure that the project schedule meets expectations. All authors reviewed the manuscript.
Declarations
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-024-77878-6.
Correspondence and requests for materials should be addressed to S.L.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

© The Author(s) 2024