An improved Yolov10n for detection of
bronchoalveolar lavage cells
PEIHE JIANG1, SHAOQI LI1, YANFEN LU2, XIAOGANG SONG2
1School of Physics and Electronic Information, Yantai University, Yantai 264005, China
2Department of Pulmonary and Critical Care Medicine, Yantai Yuhuangding Hospital, Yantai, 264000, China
Corresponding author: YANFEN LU (luyanfen71@163.com), XIAOGANG SONG (yhdsxg@163.com).
ABSTRACT Bronchoalveolar lavage fluid (BALF) is a liquid sample that reflects the biological status of
lung tissues, containing a wealth of components such as cells and proteins. These components provide a non-
invasive method to obtain pathological information about the lungs, serving as a powerful complement to
traditional lung biopsies. However, the similarity in morphology and function of cells in BALF, combined
with the diversity of sample processing and analysis methods, can lead to confusion in recognizing and
distinguishing these cellular features. This study presents an improved Yolov10 method for the detection and
classification of BALF cells, specifically targeting macrophages, lymphocytes, neutrophils, and eosinophils.
The backbone network incorporates the PLWA module in place of the PSA module to enhance the acquisition
of useful information, and the C2f-DC module replaces the C2f module to improve image feature extraction
capabilities. Furthermore, the head network employs the Cross-Attention Fusion (CAF) module to enhance
the retrieval of image information. Experimental results demonstrate that the model achieves a mean Average
Precision (mAP) of 86.5% and a recall rate of 79.1%, confirming the model's effectiveness.
INDEX TERMS Bronchoalveolar lavage fluid, classification, detection, machine learning.
I. INTRODUCTION
Bronchoalveolar lavage fluid (BALF) is a sample of alveolar
surface lining fluid collected by instilling sterile saline
solution into the lung segments or subsegments below the
bronchi through a fiberoptic bronchoscope and then
recovering the fluid [1], [2]. BALF cell classification and
counting is a form of BALF cell morphology analysis, used
to detect the types and proportions of macrophages,
lymphocytes, neutrophils, eosinophils, and other cells in the
alveolar surface lining fluid. The reference values for BALF
cell classification differ between smokers and non-smokers.
Reference values for healthy non-smoking adults are as
follows: total nucleated cell count (90-260)×10⁶/L, alveolar
macrophages 85%-96%, lymphocytes 6%-15%, neutrophils
≤3%, eosinophils < 1%, and squamous/ciliated columnar
epithelial cells ≤5% [3], [4], [5]. These indicators provide
important clues for the diagnosis and treatment of lung
diseases and can serve as objective markers for monitoring
therapeutic efficacy and assessing prognosis. Clinically,
increased lymphocyte counts and proportions in BALF are
commonly seen in sarcoidosis, hypersensitivity pneumonitis,
non-specific interstitial pneumonia, and viral infections.
Increased eosinophil counts and proportions are often seen
in allergic asthma, eosinophilic lung diseases, and fungal
infections. Increased neutrophil counts and proportions
indicate active alveolitis, commonly seen in purulent
infections, acute lung injury, and connective tissue diseases
[6], [7]. However, traditional BALF cell counting relies on
labor-intensive and time-consuming cytological techniques.
It requires trained professionals to manually count and
classify cells using a biological oil immersion lens with
1000x magnification, and different physicians may have
varying judgment criteria, leading to strong subjectivity.
Since the 1990s, the rapid development of computer vision
and machine learning technologies has brought new
opportunities to medical image analysis. In particular, the
emergence of deep learning techniques, especially the
widespread application of convolutional neural networks
(CNNs), has revolutionized object detection in medical
images. Early medical image analysis mainly relied on
traditional image processing techniques, such as edge
detection and feature extraction, which often required expert
knowledge and extensive manual adjustments. However, with
the advancement of deep learning technologies—especially
following the introduction of deep belief networks (DBN) by
Hinton et al. in 2006 [8] and the success of AlexNet in the field
of image recognition in 2012 [9] —research in medical image
analysis has gradually shifted towards deep learning methods.
In medical image analysis, deep learning techniques have
been applied to a variety of scenarios. For example, in X-ray
image analysis, deep learning has been used to detect signs of
diseases such as tuberculosis and pneumonia. S. Hwang et al.
[10] proposed the first tuberculosis screening system based on
deep CNN and used transfer learning techniques. Additionally,
researchers have utilized generative adversarial networks
(GANs) to enhance COVID-19 detection in X-ray images [11].
In CT image analysis, deep learning has been applied to the
detection and classification of lung nodules, such as the deep
CNN-based lung nodule classification method proposed by Li
[12], and the detection of COVID-19 from CT images using
deep learning by Fan et al. [13]. In microscopic image analysis,
convolutional networks have been applied to the recognition
of bacterial morphotypes for diagnosing bacterial vaginosis,
nasal cytology interpretation, microscopic stool image
recognition, and the automatic detection of malaria parasites
[14].
This paper addresses the problem of detecting and
classifying cells in bronchoalveolar lavage fluid (BALF) by
proposing an improved method based on the Yolov10n
network. The core idea of the Yolo algorithm is to perform
regression analysis on the entire image using a single
convolutional neural network, thereby obtaining the positions
of bounding boxes and the probability of the class they belong
to. The Yolov10 network model offers high detection
accuracy and fast inference speed. Building on this, this paper
further improves the Yolov10 network (see FIGURE 1) to
enhance its performance in detecting BALF cells. The main
contributions of this paper are as follows:
(1) A dual-channel Bottleneck is adopted to enhance the
acquisition of target information, improving the accuracy of
cell detection.
(2) The Cascaded Group Attention (CGA) module is used
within the local window, which provides different splits of
complete features for various attention heads, enhancing the
diversity of attention.
(3) A Cross-Attention Fusion (CAF) module is
proposed to integrate image features from different levels,
further improving the model's accuracy.
(4) The Focal-CIoU bounding box regression loss function
is employed to speed up the regression process and increase
the precision of the cell recognition algorithm.
[Figure 1 diagram: the overall improved Yolov10n architecture (backbone with Conv, C2f-DC, SCDown, SPPF, and PLWA modules; head with Upsample, C2f, CAF, and C2fCIB modules feeding three V10Detect heads), together with the internal structures of the C2f-DC block built from DualBottleneck and LightConv units, the PLWA module with Cascaded Group Attention, and the CAF module.]
FIGURE 1. Improved Yolov10 network framework.
II. Methods and Materials
A. Dataset
In this experiment, the BALF dataset was constructed using
microscope-scanned images provided by doctors from Yantai
Yuhuangding Hospital. There were no exclusion criteria, such
as gender, age, or race. BALF was sampled in accordance with
the standard procedure recommended in the American
Thoracic Society clinical practice guideline [15]. The samples
were processed immediately after collection: they were mixed, filtered,
and centrifuged; the supernatant was discarded and the cell sediment was smeared onto slides, air-dried, and fixed with formaldehyde; the smears were stained with hematoxylin-eosin, then baked, dried, and sealed with resin.
The sample slides were scanned using a microscope slide
scanner (3DHISTECH Pannoramic SCAN Ⅱ) in brightfield
scanning mode. All materials used in this study were approved
by the Ethics Committee of the Yantai Yuhuangding Hospital.
LabelImg is an open-source annotation tool developed by Tzutalin that is widely used in object detection tasks to simplify image annotation; it supports the PASCAL VOC and YOLO formats. The tool provides a simple, intuitive graphical interface in which users draw bounding boxes with the mouse to select and annotate regions of interest. In this experiment, the collected bronchoalveolar lavage fluid cell dataset was accurately labeled in this way: each label file contains the class of every cell (the target category) and the coordinates of its bounding box (the position of the target in the image), and the YOLO-format annotation files were used for subsequent training. To ensure accuracy, the annotation followed strict labeling specifications, and each completed annotation was reviewed to guarantee its quality. Table I summarizes the dataset details.
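As a point of reference for the annotation files described above, a YOLO-format label file stores one object per line as a class index followed by the normalized center coordinates and box size. The class-index mapping shown below is illustrative, not necessarily the mapping used in this dataset:

```
# One object per line: class_id x_center y_center width height (all coordinates normalized to [0, 1])
# Hypothetical class indices: 0 = eosinophil, 1 = lymphocyte, 2 = macrophage, 3 = neutrophil
2 0.4812 0.3621 0.0750 0.0834
3 0.7210 0.5543 0.0412 0.0468
```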
B. Image preprocessing
In this experiment, 1200 collected images (some of which were obtained by diagonal flipping of the original images) were divided into two groups at a 9:1 ratio: 1080 images were used as the training set for model training and parameter optimization, while the remaining 120 images formed an independent test set for performance validation and evaluation. Because training benefits from large-scale image data, various image enhancement techniques were applied to improve the model's generalization ability and prevent overfitting.
The image enhancement methods used in this experiment include adjusting the brightness level of the image (both increasing and decreasing it) to simulate different lighting conditions, vertical flipping to simulate different spatial orientations, and rotation at various angles to improve the model's adaptability to multiple viewing angles. These enhancement strategies are intended to simulate diverse real-world scenarios, thereby improving the adaptability and robustness of the model.
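As a rough sketch of the augmentations described above, the following PyTorch/torchvision snippet applies brightness adjustment, vertical flipping, and rotation to a cell image; the parameter values and file path are illustrative assumptions rather than the exact settings used in this experiment:

```python
import torchvision.transforms as T
from PIL import Image

# Illustrative augmentation pipeline (parameter values are assumptions, not the paper's settings).
augment = T.Compose([
    T.ColorJitter(brightness=0.4),    # brighten/darken to simulate different lighting conditions
    T.RandomVerticalFlip(p=0.5),      # simulate different spatial orientations
    T.RandomRotation(degrees=15),     # simulate different viewing angles
])

img = Image.open("balf_sample.png").convert("RGB")   # hypothetical image path
augmented = augment(img)
augmented.save("balf_sample_aug.png")
```

Note that in a detection setting the bounding-box annotations must be transformed consistently with the geometric augmentations, which is typically handled inside the training framework rather than on the images alone.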
This experiment strictly ensured the data independence
between the training set and the test set, that is, there is no
intersection between the two sets of data. This measure helps
to eliminate the risk of data leakage and ensure the fairness
and credibility of experimental results.
TABLE I
BALF CELL COUNT AND ANNOTATION

Cell Type    | Cells Annotated | Training Set | Validation Set
Eosinophil   | 1256            | 1100         | 156
Lymphocyte   | 1234            | 1106         | 128
Macrophage   | 5536            | 4924         | 612
Neutrophil   | 11994           | 10771        | 1223
Total        | 20020           | 17901        | 2119
III. Improvement of Yolov10 Network Structure
A. Dual-channel Bottleneck
This experiment designs an innovative DualBottleneck structure whose core idea is to acquire different feature information through two parallel paths, thereby handling complex information scenarios more effectively.
In the proposed network architecture, the first path uses
traditional convolutional layers (conv) for feature extraction,
followed by another convolutional layer to further enhance
feature expression. The second path introduces a lightweight
convolution (Lightconv), designed to extract richer feature
information while maintaining computational efficiency.
After the Lightconv layer, a convolutional layer (conv) is also
connected to optimize feature representation.
By processing these two paths in parallel, the network can
capture feature information from the input data from different
perspectives. A feature fusion strategy is then employed to
integrate the outputs of both paths, further strengthening the
network's ability to acquire feature information. This fusion
strategy not only improves the network's performance in
handling complex information scenarios but also reduces the
computational load to some extent, enhancing the model's
practicality and efficiency.
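The following PyTorch sketch illustrates one plausible reading of the DualBottleneck described above: a standard convolutional path and a lightweight-convolution path processed in parallel and fused by addition, with an optional residual connection. It is a minimal interpretation of the text and Figure 1, not the authors' exact implementation; the layer sizes and activation choices are assumptions:

```python
import torch
import torch.nn as nn

class LightConv(nn.Module):
    """Lightweight convolution: 1x1 pointwise conv followed by a depthwise conv (an assumed form)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.dw = nn.Conv2d(c_out, c_out, k, padding=k // 2, groups=c_out, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.dw(self.pw(x))))

class DualBottleneck(nn.Module):
    """Two parallel paths (Conv->Conv and LightConv->Conv) fused by addition,
    with an optional residual connection, following the description in Section III.A."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.path1 = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
        )
        self.path2 = nn.Sequential(
            LightConv(c, c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
        )
        self.shortcut = shortcut

    def forward(self, x):
        y = self.path1(x) + self.path2(x)      # feature fusion of the two parallel paths
        return x + y if self.shortcut else y

x = torch.randn(1, 64, 80, 80)
print(DualBottleneck(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```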
B. Hierarchical local attention
In convolutional neural networks (CNNs), the focus during
image feature extraction is typically on local neighborhood
information, often neglecting global features. In contrast, the
self-attention mechanism (Vaswani et al., 2017) can capture
contextual information from images and learn richer semantic
features [16]. For single-head self-attention, the output for each position is calculated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ are the query, key, and value projections of the input features and $d_k$ is the dimension of the key vectors.
As can be seen from the formula, the self-attention
mechanism primarily focuses on the relationship between
each pixel and other pixels in an image, thereby enabling the
capture of global information. The core principle of the self-
attention mechanism lies in filtering out redundant
information while highlighting key information. However,
compared to CNNs, the self-attention mechanism lacks the
ability to leverage certain priors like scale invariance,
translation invariance, and feature locality inherent in CNNs,
which results in its underperformance on small datasets [17].
However, the multi-head attention mechanism has
significant computational redundancy, resulting in low
computational efficiency. Inspired by group convolution in
efficient CNNs, this paper adopts the Cascaded Group
Attention (CGA) mechanism [18]. CGA splits the input for
each head into different parts, thereby distributing the
attention calculation across different heads. Each group
independently computes self-attention, generating its own
Query, Key, and Value. Formally, the group attention within
each group can be represented as
$$\tilde{X}_{ij} = \mathrm{Attn}\big(X_{ij}W^{Q}_{ij},\; X_{ij}W^{K}_{ij},\; X_{ij}W^{V}_{ij}\big), \quad 1 \le j \le h,$$
where $X_{ij}$ is the $j$-th split of the input feature $X_i$, $h$ is the number of heads, and $W^{Q}_{ij}$, $W^{K}_{ij}$, and $W^{V}_{ij}$ are the projection layers that map the split into the query, key, and value subspaces.
After calculating the intra-group attention for each group, the output of each group is further added to the input of the next group before its attention is computed. The attention output of each group is considered a high-level feature representation, and the outputs of all groups are finally concatenated and projected:
$$X'_{i(j+1)} = X_{i(j+1)} + \tilde{X}_{ij}, \qquad \tilde{X}_{i+1} = \mathrm{Concat}\big[\tilde{X}_{ij}\big]_{j=1:h}\, W^{P}_{i},$$
where $X'_{i(j+1)}$ is the refined input of the $(j{+}1)$-th head and $W^{P}_{i}$ is the projection layer that restores the concatenated outputs to the input dimension.
By using feature splitting instead of feeding in full features,
computational efficiency is improved, and the Q, K, V layers
are encouraged to learn projections on more informative
features. The cascaded design, as shown in Figure 1, optimizes
feature representation by progressively accumulating the
outputs from each attention head. This design not only
increases the network's depth and enhances model capacity but
does so without introducing extra parameters, leading to only
slight computational overhead.
Inspired by the local window design of the Swin
Transformer (Liu et al., 2021) [19], this paper applies the
Cascaded Group Attention within local windows. Specifically,
the input feature map is divided into multiple fixed-size
windows, and self-attention is computed independently within
each window. This approach significantly reduces the
computational complexity of global self-attention while
improving runtime efficiency.
As illustrated in FIGURE 2, the left image represents a standard multi-head attention module, while the right image shows how the local window operates. Each square represents a pixel, and the local window method divides the feature map into windows of a fixed size, as shown in the figure, with self-attention calculated independently for each window.
FIGURE 2. Ordinary multi-head attention and local window attention.
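A condensed sketch of this mechanism is given below, assuming window partitioning as in [19] followed by cascaded group attention as in [18]; the head count, window size, and projection details are illustrative assumptions, not the exact configuration of the PLWA module:

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Each head attends over its own channel split; the output of head j is added to the
    input split of head j+1 (the cascade), and head outputs are concatenated and projected."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.ModuleList(nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads))
        self.proj = nn.Linear(dim, dim)
        self.scale = self.head_dim ** -0.5

    def forward(self, x):                        # x: (B, N, C) tokens inside one window
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for j in range(self.num_heads):
            xj = splits[j] + carry               # cascade: add the previous head's output
            q, k, v = self.qkv[j](xj).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            out = attn.softmax(dim=-1) @ v
            outs.append(out)
            carry = out
        return self.proj(torch.cat(outs, dim=-1))

def local_window_attention(feat, attn, win=7):
    """Partition a (B, C, H, W) feature map into win x win windows and apply attention per window."""
    B, C, H, W = feat.shape
    assert H % win == 0 and W % win == 0, "H and W are assumed divisible by the window size"
    x = feat.view(B, C, H // win, win, W // win, win)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, C)   # (B * num_windows, win*win, C)
    x = attn(x)
    x = x.view(B, H // win, W // win, win, win, C).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(B, C, H, W)

feat = torch.randn(1, 64, 28, 28)
print(local_window_attention(feat, CascadedGroupAttention(64), win=7).shape)  # (1, 64, 28, 28)
```

Computing attention inside w-by-w windows instead of over the whole H-by-W feature map reduces the quadratic attention cost from roughly (HW)^2 to HW * w^2, which is the main source of the efficiency gain.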
C. Cross-Attention Fusion Module
In this experiment, a Cross-Attention Fusion Module (CAF) is
used to improve the performance of the head network. The
main function of the CAF module is to achieve feature fusion
and enhancement through the following steps:
First, the CAF module concatenates the input features from
two branches. Next, the concatenated features are fed into a
Channel Attention Module, which consists of convolutional
layers, to generate the corresponding attention weights. These
weights are then applied to the original features through
element-wise multiplication, thereby weighting the original
features. Afterward, the weighted features are added to the
original features from another branch to enhance their
representation capabilities.
Further, the enhanced features are concatenated along the
channel dimension and fed into a Spatial Attention Module,
which generates the final fusion weights. As shown in Figure
1, the mathematical expression for this process is as follows:
$$W_c = \mathrm{CA}\big(\mathrm{Concat}(X_1, X_2)\big),$$
$$F_1 = X_1 \otimes W_c + X_2, \qquad F_2 = X_2 \otimes W_c + X_1,$$
$$w = \mathrm{SA}\big(\mathrm{Conv}(\mathrm{Concat}(F_1, F_2))\big), \qquad Y = w \otimes F_1 + (1 - w) \otimes F_2,$$
where $Y$ represents the output of the Cross-Attention Fusion Module, $W_c$ is the output of the channel attention $\mathrm{CA}$, $X_1$ and $X_2$ are the input features from the two branches, $\mathrm{SA}$ denotes the Spatial Attention Module, $F_1$ and $F_2$ represent the results of fusing the inputs with the channel-attention output, and $\mathrm{Conv}$ denotes the convolutional layer.
By introducing the Cross-Attention Fusion Module, the
performance of the head network is effectively enhanced,
providing a more robust feature representation for subsequent
tasks.
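A minimal PyTorch sketch consistent with the steps above is given below: channel attention computed from the concatenated inputs, cross-branch weighted addition, and a spatial attention map that produces the final fusion weight. The specific layer choices (pooling, kernel sizes, channel reduction) are assumptions and not the authors' implementation:

```python
import torch
import torch.nn as nn

class CAF(nn.Module):
    """Cross-Attention Fusion sketch: channel-attention weights from the concatenated inputs,
    cross-branch weighted addition, then a spatial weight w blending the two enhanced branches."""
    def __init__(self, c):
        super().__init__()
        # Channel attention: global average pooling followed by two 1x1 convs (assumed form).
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, c // 2, 1), nn.SiLU(),
            nn.Conv2d(c // 2, c, 1), nn.Sigmoid(),
        )
        # Spatial attention: convs over the concatenated fused features, sigmoid -> weight map w.
        self.sa = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x1, x2):
        wc = self.ca(torch.cat([x1, x2], dim=1))   # channel attention weights
        f1 = x1 * wc + x2                          # weighted branch 1 enhanced by branch 2
        f2 = x2 * wc + x1                          # weighted branch 2 enhanced by branch 1
        w = self.sa(torch.cat([f1, f2], dim=1))    # final spatial fusion weight
        return w * f1 + (1 - w) * f2

x1, x2 = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
print(CAF(64)(x1, x2).shape)   # torch.Size([1, 64, 40, 40])
```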
D. Focaler-CIoU
Yolov10 uses CIoU (Complete Intersection over Union) [20] to evaluate the difference between the predicted bounding boxes and the real bounding boxes. The specific formula is as follows:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^{2}\big(b, b^{gt}\big)}{c^{2}} + \alpha v,$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v},$$
where $b$ and $b^{gt}$ represent the centroids of the predicted box and the ground truth box, respectively, $\rho(\cdot)$ represents the Euclidean distance between the two centroids, and $c$ represents the diagonal length of the smallest enclosing box that contains both the predicted box and the ground truth box. $w$ and $w^{gt}$ represent the widths of the predicted box and the ground truth box, respectively, and $h$ and $h^{gt}$ represent their heights. $IoU$ is the ratio of the intersection and union between the predicted box and the ground truth box.
In order to effectively alleviate the issue of imbalanced BALF cell samples, the experiment utilizes the Focaler-IoU loss [21], which optimizes bounding box regression by focusing more on hard samples. The principle of Focaler-IoU lies in improving the detector's performance across different detection tasks by focusing on different regression samples. The design of this loss function takes into account the impact of the distribution of easy and hard samples on the regression results: by adopting specific strategies to focus on samples that are difficult to regress correctly, the model pays more attention to learning these samples during training, thereby improving its overall localization accuracy. The formula is as follows:
$$IoU^{Focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases}$$
where $IoU^{Focaler}$ is the reconstructed Focaler-IoU, $IoU$ is the original IoU value, and $[d, u] \in [0, 1]$ is the interval that controls which regression samples the loss focuses on.
In this experiment, the Focaler-IoU formulation is applied to the CIoU bounding box loss, as shown in the following formula, which enables the network to focus more on hard samples and reduces the bounding box regression issues caused by the imbalance of BALF samples:
$$\mathcal{L}_{Focaler\text{-}CIoU} = \mathcal{L}_{CIoU} + IoU - IoU^{Focaler}.$$
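For clarity, a minimal sketch of this loss for axis-aligned boxes in (x1, y1, x2, y2) format is given below, following the formulas above; the interval bounds d and u are illustrative defaults rather than the values used in this work:

```python
import math
import torch

def focaler_ciou_loss(pred, target, d=0.0, u=0.95, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns the Focaler-CIoU loss per box."""
    # Intersection over Union
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Center distance and diagonal of the smallest enclosing box
    cpx, cpy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    ctx, cty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    ex1, ey1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term of CIoU
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    ciou_loss = 1 - iou + rho2 / c2 + alpha * v

    # Focaler reconstruction of IoU, linearly rescaled inside [d, u]
    iou_focaler = ((iou - d) / (u - d)).clamp(0, 1)
    return ciou_loss + iou - iou_focaler
```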
IV. Experimental results
A. Experimental setup
The model is trained on Windows using the PyTorch deep learning framework. The software environment includes CUDA 12.1, cuDNN 8.2, and Python 3.10.12. The CPU used for training is a 12th-generation Intel(R) Core(TM) i9-12900H at 2.5 GHz with 32 GB of RAM, and the GPU is a GeForce RTX 3060 Laptop GPU with 6 GB of VRAM.
B. Evaluating indicator
To evaluate the performance of the trained bronchoalveolar
lavage fluid (BALF) cell recognition model, objective
evaluation metrics such as Precision, Recall, and mean
Average Precision (mAP) were used.
Intersection over Union (IoU) is a crucial metric for measuring the localization accuracy of object detection algorithms. IoU is defined as the ratio of the overlapping area between the predicted box and the ground truth box to the area of their union; the larger the IoU value, the higher the localization accuracy of the object detection algorithm. Specifically, IoU can be calculated as follows:
$$IoU = \frac{\mathrm{Area}(B_{p} \cap B_{gt})}{\mathrm{Area}(B_{p} \cup B_{gt})},$$
where $B_{p}$ and $B_{gt}$ denote the predicted box and the ground truth box, respectively.
Precision is defined as the proportion of true positive samples among the samples predicted as positive by the detector. Its calculation formula is as follows:
$$Precision = \frac{TP}{TP + FP},$$
where $TP$ represents the number of true positive cases correctly classified, and $FP$ represents the number of false positive cases incorrectly classified.
The calculation formula for Recall is as follows:
$$Recall = \frac{TP}{TP + FN},$$
where $FN$ represents the number of false negative cases incorrectly classified.
The mean Average Precision (mAP) is then calculated as the average, over all $C$ object categories, of the weighted sum of the product of precision and recall change across the thresholds:
$$mAP = \frac{1}{C}\sum_{c=1}^{C}\sum_{k=1}^{N} P_{c}(k)\,\Delta R_{c}(k),$$
where $N$ represents the number of thresholds, $k$ represents a specific threshold, $P_{c}(k)$ and $R_{c}(k)$ represent the precision and recall of category $c$ at threshold $k$, and $\Delta R_{c}(k)$ is the change in recall between consecutive thresholds.
These metrics allow for a comprehensive evaluation of the
model's recognition performance.
mAP@0.5 is defined as the mean of the Average Precision (AP) over all categories when the IoU threshold is set to 0.5. mAP@0.5:0.95, on the other hand, refers to the mean AP averaged over IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. This metric provides a more comprehensive reflection of the model's performance across different IoU thresholds.
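As a small worked illustration of these definitions, the snippet below counts TP, FP, and FN at a single IoU threshold and computes Precision and Recall; a full mAP computation additionally sweeps confidence thresholds per class, averages AP over the four cell categories, and, for mAP@0.5:0.95, averages over the ten IoU thresholds:

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground truths at a single IoU threshold."""
    matched, tp = set(), 0
    for p in preds:                                   # predictions assumed sorted by confidence
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j not in matched and box_iou(p, g) > best_iou:
                best_iou, best_j = box_iou(p, g), j
        if best_iou >= iou_thr:
            tp += 1
            matched.add(best_j)
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall

# Toy example: two predictions, two ground-truth boxes.
preds = [(10, 10, 50, 50), (60, 60, 90, 90)]
gts = [(12, 11, 52, 49), (200, 200, 230, 230)]
print(precision_recall(preds, gts, iou_thr=0.5))  # (0.5, 0.5)
```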
C. Comparison results with Yolov10n and visualization
of results
This article provides a detailed evaluation of the
performance differences between the proposed method and
Yolov10 through comparative experiments, and intuitively
presents the results using visualization techniques.
The left image of FIGURE 3 shows the detection results of
the original Yolov10 version for alveolar lavage fluid cells,
while the right image presents the detection results of the
optimized model introduced in this paper. It is evident that
under the improved network architecture, the detection
performance for lymphocytes has been enhanced, and the
issue of bounding box overlap has been alleviated.
FIGURE 3. Comparison of detection performance visualization.
The heatmap [22] shown in FIGURE 4 further reveals the
differences in the network's attention to the targets. The left
image reflects the original network's level of attention to the
targets (the redder the color, the higher the attention), while
the right image displays the improved network's level of
attention. Through comparative analysis, it is evident that the
improved network shows an increased focus on the target
areas, thereby validating the effectiveness of the optimization
strategy.
FIGURE 4. Comparison of heatmaps. The darker the red area, the higher the level of attention. The left image reflects the level of attention of the
original network, while the right image reflects the level of attention of the improved network.
FIGURE 5 provides a detailed comparison of the
benchmark model (Yolov10) and the model presented in this
paper for individual categories on the test set. The study found
that for all four cell categories, the proposed model achieved
higher average precision (AP) values compared to the
benchmark model. Notably, the original version had the
lowest detection accuracy for lymphocytes, while the
optimized model demonstrated significant performance
improvements and outperformed Yolov10.
FIGURE 5. Comparison of average precision for different cell types.
E. Ablation experiment
Through a series of ablation experiments, the effectiveness
of the proposed improvements was thoroughly investigated.
The relevant results are shown in Table II, with the following
detailed analysis.
First, the experiments using the DualBottleneck structure
demonstrated a stable increase in both mAP and recall under
the condition of dual-channel feature extraction. Compared to
the traditional single-channel bottleneck, the DualBottleneck
network showed enhanced capability in extracting complex
information, which helps the model capture detailed
information more accurately.
Furthermore, the introduction of local windows and
cascaded attention mechanisms helped to some extent avoid
the extraction of redundant information by the network,
allowing it to focus more on the extraction of target features.
This improvement led to enhanced mAP and recall metrics,
validating its effectiveness.
Additionally, after integrating the CAF feature fusion
module, the network's ability to fuse feature information was
significantly strengthened, enabling the model to extract more
complex target features, as fully reflected in the experimental
results.
When all improvements were applied to the network, it was
observed that the model's ability to focus on target features
was significantly enhanced. Specifically, the mAP metric
increased by 3.2%, and the recall metric improved by 2.6%,
further confirming the effectiveness of the optimization
strategy.
TABLE II
COMPARISON OF DIFFERENT IMPROVEMENT POINTS

Model                | mAP@0.5 | mAP@0.5:0.95 | Recall
Yolov10n             | 83.3%   | 69.9%        | 76.5%
Yolov10n+C2f-DW      | 83.7%   | 69.8%        | 78.0%
Yolov10n+PLWA        | 84.6%   | 69.6%        | 78.6%
Yolov10n+CAF         | 83.4%   | 69.6%        | 81.5%
Yolov10n+Focal-CIoU  | 84.5%   | 69.6%        | 79.1%
Yolov10n+All         | 86.5%   | 71.1%        | 79.1%
F. Comparison with other models
To verify the effectiveness and competitiveness of the
proposed model, comparative tests were conducted with
various classic object detection algorithms. The specific
experimental results are shown in Table III.
YOLOv3 [23], proposed by Redmon in 2018, employs the
Darknet-53 network structure for feature extraction, replacing
the previous Darknet-19 and introducing a feature pyramid
network structure to achieve multi-scale detection.
Additionally, YOLOv3 uses logistic regression as its
classification method instead of softmax, ensuring both the
practicality and accuracy of object detection.
YOLOv5, on the other hand, utilizes the improved
CSPDarknet53 based on Darknet-53 as its backbone network,
introducing a CSP structure to enhance the model's expressive
ability and feature reuse efficiency. At the same time,
YOLOv5 integrates a PANet structure, enhancing the model's
ability to detect small objects.
YOLOv8 incorporates a new architecture in its backbone
design to improve the network's ability to extract image
features and enhances the detection head by adopting an
anchor-free detection method, further improving accuracy and
flexibility.
YOLOv9 [24] enhances the model's flexibility in learning
object features by introducing Programmable Gradient
Information (PGI) technology and using a General Efficient
Layer Aggregation Network (GELAN) architecture based on
gradient path planning, which helps maintain high efficiency
while improving detection accuracy.
As shown in Table III, the proposed model improves the
mean Average Precision (mAP@0.5) by 4%, 3.2%, 2%, and
1.9% compared to YOLOv3, YOLOv5, YOLOv8, and
YOLOv9, respectively. These results fully demonstrate the
performance advantages of the model and further highlight its
application value in the field of object detection.
TABLE III
COMPARISON OF CROSS VALIDATION EXPERIMENTS

Model       | mAP@0.5 | mAP@0.5:0.95 | Recall
Yolov3-tiny | 82.5%   | 67.3%        | 81.0%
Yolov5n     | 83.2%   | 69.3%        | 78.8%
Yolov8n     | 84.5%   | 70.5%        | 80.1%
Yolov9t     | 84.6%   | 70.6%        | 77.9%
Our model   | 86.5%   | 71.1%        | 79.1%
V. Discussion
This experiment aims to accurately quantify macrophages,
lymphocytes, neutrophils, and eosinophils in bronchoalveolar
lavage fluid (BALF) by improving the Yolov10 model,
significantly enhancing its accuracy and efficiency. Compared
to other network models, the proposed model demonstrates
exceptional performance in terms of accuracy.
In clinical practice, BALF cytomorphological examination
is crucial for diagnosing lung inflammation, tuberculosis,
tumors, and parasitic infections. Previous studies have
indicated that when conventional medical history, physical
examinations, laboratory tests, pulmonary function
assessments, and imaging studies do not provide sufficient
diagnostic information, BALF cytology often provides critical
diagnostic clues. This is particularly important for the
differential diagnosis of interstitial lung diseases, where the
immunophenotyping of lymphocytes in BALF is key.
Currently, the standard detection method for lymphocyte
phenotyping is the peroxidase-antiperoxidase technique, but
this method is time-consuming and highly reliant on the
operator's experience. Additionally, with the increasing
volume of samples, the sorting and counting of BALF cells
still mainly depend on manual efforts by technicians, which is
not only cumbersome and time-consuming but also exhibits
poor consistency between different operators. Even
experienced professionals can see significant variations in cell
counting accuracy due to different slide preparation methods.
The detection method proposed in this paper has the
following significant advantages in clinical applications:
First, traditional BALF cytomorphological examinations
are labor-intensive and time-consuming, and the shortage of
professionals limits their widespread application. By
automating cell classification and counting through deep
learning methods, the workload on professionals is alleviated,
while detection efficiency is improved.
Second, the ability to quickly and accurately sort and count
cells frees pathologists from the burden of extensive image-reading tasks, allowing them to focus more on patient
diagnosis and treatment decisions.
Finally, subjective differences in image interpretation
among pathologists can lead to fluctuations in diagnostic
accuracy. Deep learning methods can assist pathologists in
making accurate judgments, helping to reduce misdiagnosis
rates and improve the consistency and reliability of diagnoses.
VI. Conclusion
This paper constructs and optimizes a bronchoalveolar lavage
fluid (BALF) cell detection and classification model based on
an improved Yolov10, effectively addressing challenges in
algorithm application in this field. The main research findings
and contributions are as follows:
1) By introducing a dual-channel bottleneck structure, the
model's ability to extract information features is
significantly enhanced, leading to improved cell
detection accuracy. Although this increases
computational costs, the notable gain in accuracy
provides a more reliable foundation for subsequent cell
recognition.
2) The Cascaded Group Attention mechanism is employed
within local windows, which, while somewhat reducing
the capture of global information, improves
computational efficiency.
3) The introduction of a Cross-Attention Fusion Module
(CAF) in the bottleneck layer further optimizes the
feature fusion extraction process, enhancing the model's
capability to recognize cell features.
4) The use of Focal-CIoU bounding box regression loss
function accelerates bounding box regression speed and
improves the accuracy of cell recognition algorithms.
However, this method still has certain limitations in
practical applications. First, the dataset used in this study is
sourced from a single center's microscopic images of alveolar
lavage fluid. Future work should consider collecting more
high-quality image data from multiple centers for a more
comprehensive evaluation and validation of the model.
Second, in cell classification training and testing, the similarity
in features between lymphocytes and small macrophages, as
well as variations in slide preparation or observation methods,
can affect classification accuracy.
Future research directions include:
1) Exploring the possibility of reducing model parameters
and introducing new attention mechanisms or alternative
backbone networks to further enhance network
performance.
2) Investigating new feature extraction and fusion
strategies to address the issue of feature similarity and
reduce misclassification rates.
3) Expanding the sources of datasets and increasing sample
diversity to enhance the model's generalization ability.
REFERENCES
[1] Liu Z, Yan J, Tong L, et al. The role of exosomes from BALF in
lung disease[J]. Journal of cellular physiology, 2022, 237(1): 161-
168.
[2] Leroy B, Falmagne P, Wattiez R. Sample preparation of
bronchoalveolar lavage fluid[J]. 2D PAGE: Sample Preparation and
Fractionation, 2008: 67-75.
[3] Meyer K C, Raghu G, Baughman R P, et al. An official American
Thoracic Society clinical practice guideline: the clinical utility of
bronchoalveolar lavage cellular analysis in interstitial lung
disease[J]. American journal of respiratory and critical care
medicine, 2012, 185(9): 1004-1014.
[4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai,
X., Unterthiner, T.,. & et al. (2020). An image is worth 16×16
words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929.
[5] Du Rand I A, Blaikley J, Booton R, et al. British Thoracic Society
guideline for diagnostic flexible bronchoscopy in adults: accredited
by NICE[J]. Thorax, 2013, 68(Suppl 1): i1-i44.
[6] Feng H, Yan L, Zhao Y, et al. Neutrophils in bronchoalveolar
lavage fluid indicating the severity and relapse of pulmonary
sarcoidosis[J]. Frontiers in Medicine, 2022, 8: 787681.
[7] Costabel U, Guzman J, Bonella F, Oshimo S. Bronchoalveolar lavage in other interstitial lung diseases[J]. Seminars in Respiratory and Critical Care Medicine, 2007, 28: 514-524.
[8] Hinton G E. Deep belief networks[J]. Scholarpedia, 2009, 4(5):
5947.
[9] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with
deep convolutional neural networks[J]. Communications of the ACM,
2017, 60(6): 84-90.
[10] Hwang S, Kim H E, Jeong J, et al. A novel approach for tuberculosis
screening based on deep convolutional neural networks[C]//Medical
imaging 2016: computer-aided diagnosis. SPIE, 2016, 9785: 750-757.
[11] Loey M, Smarandache F, M. Khalifa N E. Within the lack of chest
COVID-19 X-ray dataset: a novel detection model based on GAN
and deep transfer learning[J]. Symmetry, 2020, 12(4): 651.
[12] Li W, Cao P, Zhao D, et al. Pulmonary nodule classification with
deep convolutional neural networks on computed tomography
images[J]. Computational and mathematical methods in medicine,
2016, 2016(1): 6215085.
[13] Fan D P, Zhou T, Ji G P, et al. Inf-net: Automatic covid-19 lung
infection segmentation from ct images[J]. IEEE transactions on
medical imaging, 2020, 39(8): 2626-2637.
[14] Rajaraman S, Jaeger S, Antani S K. Performance evaluation of deep
neural ensembles toward malaria parasite detection in thin-blood
smear images[J]. PeerJ, 2019, 7: e6977.
[15] Meyer K C, Raghu G, Baughman R P, et al. An official American Thoracic Society clinical practice guideline: the clinical utility of bronchoalveolar lavage cellular analysis in interstitial lung disease[J]. American Journal of Respiratory and Critical Care Medicine, 2012, 185(9): 1004-1014.
[16] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[17] Chen H, Wang Y, Guo T, et al. Pre-trained image processing
transformer[C]//Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2021: 12299-12310.
[18] Liu X, Peng H, Zheng N, et al. Efficientvit: Memory efficient vision
transformer with cascaded group attention[C]//Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2023: 14420-14430.
[19] Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision
transformer using shifted windows[C]//Proceedings of the
IEEE/CVF international conference on computer vision. 2021:
10012-10022.
[20] Zheng Z, Wang P, Liu W, et al. Distance-IoU loss: Faster and better
learning for bounding box regression[C]//Proceedings of the AAAI
conference on artificial intelligence. 2020, 34(07): 12993-13000.
[21] Zhang H, Zhang S. Focaler-IoU: More Focused Intersection over
Union Loss[J]. arXiv preprint arXiv:2401.10525, 2024.
[22] Selvaraju R R, Cogswell M, Das A, et al. Grad-cam: Visual
explanations from deep networks via gradient-based
localization[C]//Proceedings of the IEEE international conference
on computer vision. 2017: 618-626.
[23] Redmon J. Yolov3: An incremental improvement[J]. arXiv preprint
arXiv:1804.02767, 2018.
[24] Wang C Y, Yeh I H, Liao H Y M. Yolov9: Learning what you want
to learn using programmable gradient information[J]. arXiv preprint
arXiv:2402.13616, 2024.
PEIHE JIANG received the B.S. degree, M.S.
degree, and Ph.D. degree from Harbin Institute of
Technology, China, in 2011, 2013, and 2018
respectively. He is currently an Associate Professor
and a Master Tutor with the School of Physics and
Electronic Information, Yantai University, Yantai,
China. His research interests include image
processing, embedded software and hardware
design, signal processing technology, robotics and
autonomous navigation technology.
SHAOQI LI received the B.S. degree from Yantai University, Yantai, China, in 2023. He is
currently pursuing the M.S. degree in
communication engineering with Yantai University,
China. His research interests include image
processing and deep learning.
YANFEN LU received the B.S. degree from Qingdao University. She is currently with the
Department of Pulmonary and Critical Care
Medicine of Yantai Yuhuangding Hospital as an
associate chief physician.
XIAOGANG SONG received the M.S. degree
from China Medical University. He is currently
with the Department of Pulmonary and Critical
Care Medicine of Yantai Yuhuangding Hospital as
an attending physician. He is a member of
Respiratory Group of Allergology Branch of
Shandong Medical Association.