Acute Lymphoblastic Leukemia Diagnosis Employing YOLOv11,
YOLOv8, ResNet50, and Inception-ResNet-v2 Deep Learning Models
Alaa Awad§and Salah A. Aly†‡
§CS & IT Department, E-Japan University of Science and Tech., Alexandria, Egypt
†Faculty of Computing and Data Science, Badya University, Giza, Egypt
‡CS & Math Section, Faculty of Science, Fayoum University, Fayoum, Egypt
Abstract—Thousands of individuals succumb annually to
leukemia alone. As artificial intelligence-driven technologies con-
tinue to evolve and advance, the question of their applicability
and reliability remains unresolved. This study aims to utilize
image processing and deep learning methodologies to achieve
state-of-the-art results for the detection of Acute Lymphoblastic
Leukemia (ALL) using data that best represents real-world
scenarios. ALL is one of several types of blood cancer, and it is an
aggressive form of leukemia. In this investigation, we examine the
most recent advancements in ALL detection, as well as the latest
iteration of the YOLO series and its performance. We address
the question of whether white blood cells are malignant or
benign. Additionally, the proposed models can identify different
ALL stages, including early stages. Furthermore, these models
can detect hematogones despite their frequent misclassification
as ALL. By utilizing advanced deep learning models, namely,
YOLOv8, YOLOv11, ResNet50 and Inception-ResNet-v2, the
study achieves accuracy rates as high as 99.7%, demonstrating
the effectiveness of these algorithms across multiple datasets and
various real-world situations.
Index Terms—Lymphoblastic Leukemia, YOLOv11 and ResNet50 Deep Learning Models
I. INTRODUCTION
Cancer is one of the most lethal diseases known to the
human race, and it is embodied in several forms, like leukemia.
The incidence rate of leukemia across the globe is estimated at
487,294 with a mortality number of 305,405 annually, accord-
ing to the International Agency for Research on Cancer, which
is part of the World Health Organization (WHO) [1]. These
statistics highlight the urgency behind the need to establish a
better understanding of leukemia and provide effective health
services to people affected by this disease.
Leukemia, also known as blood cancer, is one of many types of cancer that originate in the blood and bone marrow. It arises from the abnormal and rapid production of white blood cells. Leukemia can be categorized as chronic or acute according to the speed of the disease's progression, and as lymphocytic or myelogenous based on the type of cells affected: lymphocytic leukemia arises in lymphocyte white blood cells, while myelogenous leukemia arises in myeloid cells, which mature into red blood cells, white blood cells, or platelets [2].
Nowadays, we live in a technology-themed era in which
computer science is exploited to mimic human intelligence
for more accurate and faster decision-making. Therefore, many
researchers have tried to tackle the leukemia detection problem
through artificial intelligence using different deep learning
methodologies, such as MobileNetV2 [3], attention mecha-
nism [4] and YOLO [5]. Numerous datasets were used in
different studies, like the ALL image dataset [6] and the C-
NMC 2019 dataset [7].
Most of the research done relies on single-cell datasets
to train the AI models; however, in a practical environment,
a model should be expected to encounter multi-cell images
and still perform accurately. The purpose of this paper is to
overcome this gap in research by exposing the proposed model
to multi-cell samples.
Fig. 1. Workflow Diagram
This study employs image processing techniques, such
as segmentation, to prepare the dataset. Additionally, it
utilizes transfer learning and fine-tuning methodologies on
YOLOv11 [8], YOLOv8 [9], and ResNet50 [10], achieving
accuracies ranging from 97% to 99%; see our initial results [11].
The contributions of this study can be summarized as
follows:
1) To the best of our knowledge, this is the first research conducted to exploit YOLOv11 in blood cancer detection.
2) The integration of two datasets enhances the model’s
generalization across diverse samples.
3) The crucial question of whether white blood cells are malignant or benign is addressed.
4) The models demonstrate the capability to detect cases of hematogones, despite their frequent misclassification as ALL.
5) A comparative analysis of our findings with previous
related studies is provided.
The structure of the paper is as follows: The datasets and data collection are presented in section II. Section III goes
through the methodologies and several deep learning models
used. The performance metrics are discussed in section IV, and
the results of YOLOv11, YOLOv8, ResNet50, and Inception-ResNet-v2 are found in sections V, VI, VII, and VIII, respectively. Section IX discusses
the related work done in this field. Finally, our results are
compared with other works in section X, followed by the
conclusion in section XI.
II. DATASETS DESCRIPTION AND DATA COLLECTION
Available datasets are divided into two types: single-cell and
multi-cell datasets. Single-cell datasets typically contain im-
ages with a single white blood cell per image, whereas multi-
cell datasets depict multiple cells within each sample. Since
multi-cell datasets better represent real-life scenarios when
working with blood cells, we chose to focus on them. The two
datasets selected for this study are the Acute Lymphoblastic
Leukemia (ALL) image dataset from Kaggle [6] and ALL-
IDB1 [12], both of which contain multiple white blood cells
per sample. We opted for these two datasets because most of the others are single-cell or dedicated solely to classifying the subtypes of malignant cancer.
On one hand, Table I summarizes the statistics of the ALL
image dataset, which contains 3,256 images in total, divided
into four categories: Benign, Early, Pre, and Pro. The benign
class includes Hematogones, a condition where lymphoid cells
accumulate in a pattern similar to ALL but are non-cancerous
and generally harmless. The dataset consists of 504 benign images and 2,752 malignant images, further categorized into 985 early-stage, 963 pre-stage, and 804 pro-stage samples.
TABLE I
SAMPLE DISTRIBUTION PER CLASS FOR THE ALL IMAGE DATASET

Class    Samples Per Class    Percentage
Benign   504                  15.5%
Early    985                  30.2%
Pre      963                  29.6%
Pro      804                  24.7%
Total    3256                 100%
On the other hand, Table II refers to the ALL-IDB1 dataset,
which contains microscopic images. It includes a total of
108 images, divided into 59 normal blood samples and 49
cancerous ones. This balance between normal and cancerous
samples is crucial for the model to effectively learn the
distinguishing features of ALL cells.
TABLE II
SAMPLE DISTRIBUTION PER CLASS FOR THE ALL-IDB1 DATASET

Class    Samples Per Class    Percentage
Normal   59                   54.6%
Cancer   49                   45.4%
Total    108                  100%
We aim to expose the models to diverse datasets and
various states of the blast cells for more practical and precise
classification. Thus, we merged the normal cells from the
ALL-IDB1 dataset with the benign cells from the ALL image
dataset under one category called Normal. In addition, we
combined the samples from the Early, Pre and Pro classes from
the ALL image dataset with the Cancer class samples from the
ALL-IDB1 dataset into one class called Cancer. Eventually, the
final dataset we will use to train the models will consist of only
two classes, Normal and Cancer. Table III below clarifies how
the integration between the two datasets was done to produce
the two final classes in order to ensure that the models would
be able to differentiate between healthy and malignant cells
even in early stages.
TABLE III
SAMPLE DISTRIBUTION PER CLASS FOR THE FINAL MERGED DATASET

Class    Samples Per Class                        Total    Percentage
Normal   Normal (ALL-IDB1): 59,                   563      16.7%
         Benign (ALL image dataset): 504
Cancer   Cancer (ALL-IDB1): 49,                   2801     83.3%
         Early (ALL image dataset): 985,
         Pre (ALL image dataset): 963,
         Pro (ALL image dataset): 804
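As a rough sketch of this merge step, the following Python snippet copies the class folders of the two datasets into the final Normal and Cancer categories; all directory names and the image extension are assumptions about how the downloaded datasets are laid out.

```python
# Hypothetical merge of ALL-IDB1 and the ALL image dataset into two classes.
# Source paths below are assumed; adjust to the actual dataset locations.
import shutil
from pathlib import Path

MERGED = Path("merged_dataset")

sources = [
    ("ALL-IDB1/normal",          "Normal"),   # 59 samples
    ("ALL_image_dataset/Benign", "Normal"),   # 504 samples
    ("ALL-IDB1/cancer",          "Cancer"),   # 49 samples
    ("ALL_image_dataset/Early",  "Cancer"),   # 985 samples
    ("ALL_image_dataset/Pre",    "Cancer"),   # 963 samples
    ("ALL_image_dataset/Pro",    "Cancer"),   # 804 samples
]

for src, cls in sources:
    dst = MERGED / cls
    dst.mkdir(parents=True, exist_ok=True)
    for img in Path(src).glob("*.jpg"):       # assumed extension
        # Prefix files with their source folder to avoid name collisions.
        shutil.copy(img, dst / f"{Path(src).name}_{img.name}")
```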
III. MODELS AND METHODOLOGIES
The implementation is divided into several phases, as
illustrated in Fig. 1. The first phase involved data preparation,
where image segmentation techniques were applied to isolate
the relevant elements. Additionally, image rescaling was performed to ensure that each model receives the correct input shape, while data augmentation was used to increase data variety and improve convergence. Next, the pretrained YOLOv11, YOLOv8, ResNet50, and Inception-ResNet-v2 models were loaded, and the last 10 layers were unfrozen for the latter two, as illustrated. In the final phase, the models are trained to fine-tune the pretrained weights for the specific task at hand.
Fig. 2. The implementation process
A. Dataset Preparation
The dataset was preprocessed using image processing tech-
niques to remove redundant elements and improve the models’
performance. First, unnecessary elements, such as varying
backgrounds and unrelated blood components like platelets,
were removed from the images. This step was essential to
enable the model to focus on the white blood cells, which
are the main area of concern, and reduce potential confusion.
To achieve this, image segmentation techniques were applied
using OpenCV. The images were converted to the HSV color
space (hue, saturation, value), and upper and lower thresholds
were set for the purple hue of the white blood cells to create
a binary mask. This mask was then applied to the original
images, allowing the segmentation of the white blood cells, as illustrated in Fig. 3 below.
Fig. 3. Data samples before and after image segmentation: (a) original data sample before segmentation and (b) segmented data sample after segmentation.
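The segmentation step described above can be sketched with OpenCV as follows; the purple-hue thresholds are illustrative assumptions rather than the exact values used in this study.

```python
# Minimal sketch of HSV-based white blood cell segmentation with OpenCV.
import cv2
import numpy as np

img = cv2.imread("sample.jpg")                    # BGR image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)        # convert to HSV color space

# Assumed lower/upper bounds for the purple stain of white blood cells
# (OpenCV hue range is 0-179).
lower = np.array([110, 50, 50])
upper = np.array([170, 255, 255])

mask = cv2.inRange(hsv, lower, upper)             # binary mask of purple pixels
segmented = cv2.bitwise_and(img, img, mask=mask)  # keep only the masked cells
cv2.imwrite("sample_segmented.jpg", segmented)
```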
To introduce more robustness and reduce potential overfit-
ting, the data was augmented using random flipping, rotation,
and zoom for ResNet50 and InceptionResNetV2. Meanwhile,
the training configuration for the YOLO models incorporated
several augmentation techniques to enhance robustness and
generalization. First, the augmentation parameter was enabled
to allow training data augmentation. Mosaic augmentation,
which combines four images into one during training, was
applied with a probability of 1.0, meaning it was used 100%
of the time. We allowed images to be randomly rotated
by up to 45 degrees and added a 50% chance of flipping
them horizontally. Moreover, we applied a scaling factor of
0.5, meaning images could be resized or zoomed in/out by
up to 50%, either enlarging or shrinking them. Finally, the
dataset was divided into three sets—training, validation, and
testing—with respective ratios of 70%, 15%, and 15%.
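For the ResNet50 and Inception-ResNet-v2 pipeline, the random flip, rotation, and zoom augmentations mentioned above can be expressed as a small Keras preprocessing stack; the specific factors below are assumptions, not the study's exact settings.

```python
# Sketch of a Keras augmentation stack (random flip, rotation, zoom).
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.125),  # +/- 0.125 * 2*pi rad = +/- 45 deg
    tf.keras.layers.RandomZoom(0.2),        # zoom in/out by up to 20%
])
```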
B. Image Classification
The detection of blast cells can be done in numerous
ways; one of them is image classification. Countless deep
learning architectures have been introduced to support this
task. Convolutional Neural Networks (CNNs) are the most
common architectures used when working with image and
video datasets. Many models have been built based on CNNs
such as VGG [13], ResNet [10], AlexNet [14] and GoogleNet
(Inception) [15]. This paper focuses on two versions of YOLO, namely v8 and v11, in addition to ResNet50. To train and optimize the models' performance, we used the following two techniques:
Transfer Learning: It is a method adopted in the deep
learning field. It aims to reduce the training time and cost for
sophisticated models and reuse the massive number of trained
weights. Additionally, it is used to overcome the shortage in
the available datasets needed to train the models. The core idea
behind transfer learning is to allow the model to transfer its
knowledge from one task to another. This mechanism works
by training a large deep learning model and then exploiting the trained parameters in another task, so the model does not have to learn from scratch or reinitialize random weights. Instead, it builds on the prior knowledge and adapts it to fit the new task [16].
Fine-tuning: It is a transfer learning approach where some of the architecture's layers are frozen and the others are unfrozen for further training on a certain task's data. Usually, the unfrozen layers are the last few layers in the model, so the model can adapt to the new task while retaining the general knowledge it acquired during its initial training [16].
YOLOv8: One of the recent advancements in computer vision is YOLO [5], which is based on CNNs. YOLOv8 [9] is one of the recently released versions in the series of YOLO object detection algorithms.
detection algorithms. The architecture of this deep learning
model consists of two main components: the backbone net-
work and the detection head. The backbone network, based on
EfficientNet [17], extracts rich, multi-scale features from the
input image, while the detection head, built using NAS-FPN,
merges these features to generate high-quality predictions for
object detection. Several new features enhance YOLOv8’s
performance. These include the Focal Loss function, which
focuses on hard-to-classify examples, reducing the impact of
easier cases. Mixup is a new data augmentation technique that
blends images and their labels to improve generalization and
robustness. Additionally, a new evaluation metric, Average
Precision Across Scales (APAS), evaluates object detection
accuracy across different object sizes, providing a more com-
prehensive performance measure than standard metrics.
YOLOv11: YOLOv11 [8] is the latest release of YOLO
by Ultralytics to continue the legacy of the YOLO series.
YOLOv11 introduces several key advancements over previous
versions in the YOLO series. It features an enhanced backbone
and neck architecture for improved feature extraction, boosting
accuracy in object detection tasks. The model is optimized
for both speed and efficiency, maintaining high performance
with fewer parameters compared to earlier versions. YOLOv11
supports a wide range of computer vision tasks, including
object detection, image classification, segmentation, and pose
estimation, and is adaptable for deployment across diverse
platforms, including edge devices and cloud environments.
ResNet50: ResNet50 [10] is a deep convolutional neural network architecture with 50 layers. It was inspired by VGG nets [13], and its design consists of a plain network and a residual network. It was developed to
address the vanishing gradient problem common in very
deep networks by introducing residual blocks—layers that
skip connections to allow gradients to flow more effectively
during training. These residual connections enable the model
to focus on learning new features without overwriting already-
learned patterns. Moreover, this type of skip link has the
advantage of allowing regularization to bypass any layer that
impairs architecture performance. ResNet50’s architecture has
demonstrated remarkable effectiveness, especially in image
classification tasks, and has become a foundation for more
advanced models in computer vision.
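The following sketch illustrates the kind of skip-connection block that such an architecture stacks; the layer sizes are illustrative and do not reproduce ResNet50's exact bottleneck configuration.

```python
# Sketch of a residual block: the input bypasses two conv layers via a skip
# connection, so gradients can flow around them during backpropagation.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                   # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # assumes x already has `filters` channels
    return layers.ReLU()(y)
```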
Inception-ResNet: In the residual versions of the Inception
networks, Inception blocks are simplified and optimized to
reduce computational costs. Each block is followed by a filter-
expansion layer, a 1x1 convolution without activation, which
adjusts the filter bank’s depth to match the input dimensions
after each block. This approach compensates for any reduction
in dimensionality introduced by the Inception blocks. Two
main versions, Inception-ResNet-v1 and Inception-ResNet-v2,
were developed; the former aligns with Inception-v3’s compu-
tational demands, while the latter matches Inception-v4’s. A
key design choice in Inception-ResNet was the selective use
of batch normalization, applied only to traditional layers and
not to summation points, which helped manage GPU memory
usage and allowed for more Inception blocks. This decision
aimed to keep the models trainable on a single GPU, with the
goal of eventually revisiting this compromise as computing
efficiency improves [18].
Methodology: First, transfer learning was leveraged in this
implementation, where we imported a pretrained YOLOv8
model after installing the Ultralytics package, then trained it
on our customized segmented dataset. We experimented with
different optimizers and hyperparameters; however, the final
result is based on 50 epochs of training using SGD optimizer
and a learning rate of 0.001. The batch and image sizes were
set to 8 and 224, respectively.
Second, YOLOv11, which is the latest version of the series,
was trained on our custom data as well. In this step, we
experimented with both the nano and small versions of the
model to observe the performance on the different complexity
levels of YOLOv11’s architecture, eventually deciding which
one is more suitable for this problem. The best results were
chosen based on training the model with 50 epochs, SGD
optimizer, 0.001 learning rate, and 32 batch size.
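A sketch of this transfer-learning setup through the Ultralytics API is shown below; the checkpoint names and dataset path are assumptions, while the hyperparameters follow the text above (some augmentation arguments are primarily intended for the detection trainer and may be ignored in classification mode).

```python
# Sketch of training a YOLO classifier on the merged two-class dataset.
from ultralytics import YOLO

model = YOLO("yolov8s-cls.pt")        # or "yolo11s-cls.pt" for YOLOv11s (assumed names)
model.train(
    data="merged_dataset",            # folder with Normal/ and Cancer/ subfolders
    epochs=50,
    imgsz=224,
    batch=8,                          # 32 was used for the YOLOv11 runs
    optimizer="SGD",
    lr0=0.001,
    augment=True,
    mosaic=1.0,                       # mosaic applied 100% of the time
    degrees=45,                       # random rotation up to 45 degrees
    fliplr=0.5,                       # 50% horizontal-flip probability
    scale=0.5,                        # rescale by up to +/- 50%
)
metrics = model.val()                 # evaluate on the validation split
```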
Besides deploying different YOLO versions, ResNet50
and Inception-ResNet-v2 were also exploited. The segmented dataset was loaded and prepared using image_dataset_from_directory from TensorFlow, with the batch
size set to 8 for better generalization. The images were
resized to (224, 224, 3) and (299, 299, 3) to match the input
layers of ResNet50 and Inception-ResNet-v2, respectively.
For both models, pre-trained versions with ImageNet weights
were imported from Keras. Although the top layers were
frozen, the last 10 layers of the architecture were unfrozen
for fine-tuning to better customize the models for the current
task. Additionally, the initial learning rate was set to 0.001,
and a ReduceLROnPlateau scheduler was added to lower
the learning rate as training progressed, helping to better
control convergence. Many experiments were conducted with larger batch sizes and different numbers of unfrozen layers, but they yielded lower accuracies, less stable performance, or overfitting. Therefore, the parameters mentioned above led to the best performance.

Fig. 4. YOLOv8, YOLOv11, ResNet50 and Inception-ResNet-v2 Models Training and Evaluation

Require: ALL-IDB1 and ALL image dataset on CNN models.
Ensure: Accuracy, loss, and saved models.
1: Step 1: Model Initialization
2: - Load pre-trained YOLOv8, YOLOv11, ResNet50 and Inception-ResNet-v2.
3: - Unfreeze the last 10 layers in both ResNet50 and Inception-ResNet-v2.
4: - Initialize parameters for each model, including learning rate, optimizer, loss and metric.
5: Step 2: Data Preprocessing and Analysis
6: - Perform image segmentation on ALL-IDB1 and ALL image dataset.
7: - Rescale and apply data augmentation properly on the datasets.
8: Step 3: Model Training and Evaluation
9: - Train YOLOv8 and YOLOv11 models on the preprocessed data through transfer learning.
10: - Fine-tune ResNet50 and Inception-ResNet-v2 on the preprocessed data.
11: - Validate performance and tune the parameters using the validation dataset.
12: - Evaluate performance of the models on the test dataset.
13: Step 4: Model Evaluation
14: - Calculate the accuracy, recall, precision, F1-score and specificity.
15: - Generate the training and validation accuracy and loss graphs.
16: - Generate the confusion matrix.
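A condensed sketch of this fine-tuning recipe for ResNet50 is given below; Inception-ResNet-v2 follows the same pattern with 299x299 inputs. The dataset paths, optimizer choice, and scheduler settings are assumptions beyond the batch size and initial learning rate stated above.

```python
# Sketch of fine-tuning ResNet50 with the last 10 layers unfrozen.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "merged_dataset/train", image_size=(224, 224), batch_size=8)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "merged_dataset/val", image_size=(224, 224), batch_size=8)

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = True
for layer in base.layers[:-10]:           # freeze everything but the last 10 layers
    layer.trainable = False

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # Normal vs. Cancer
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # optimizer choice assumed
    loss="binary_crossentropy",
    metrics=["accuracy"])

# Lower the learning rate when the validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3)
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[reduce_lr])
```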
IV. PERFORMANCE METRICS
Numerous evaluation methods and metrics have been de-
veloped to assess different types of tasks in deep learning. In
this study, we have used several metrics to illustrate and com-
prehend the efficiency of the models used. These calculations
provided us with a clear explanation for the results obtained
from each model during the experimentation phase, which
assisted and guided us when tweaking the methodologies for
optimal results. For this purpose, accuracy, F1-score, precision, recall, specificity, and the confusion matrix were computed.
Accuracy is an overall indicator of how well the model
performs, considering the number of correctly identified sam-
ples out of all the given samples. It is represented by the
sum of true positives and true negatives divided by the total
number of examples, which includes True Positive (TP), True
Negative (TN), False Positive (FP), and False Negative (FN),
as expressed in Equation 1:
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)
\]
Equation 2 presents the precision, defined as the ratio of correctly classified positive instances to the total number of instances classified as positive.

\[
\text{Precision} = \frac{TP}{TP + FP} \quad (2)
\]
Recall, or sensitivity, is calculated as the ratio of correctly identified positive instances to the total number of actual positive instances, as described in Equation 3.

\[
\text{Recall} = \frac{TP}{TP + FN} \quad (3)
\]
Another significant metric that contributed to our results is the F1-score. It is obtained by calculating the harmonic mean of precision and recall, as illustrated in Equation 4.

\[
\text{F1-score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)
\]
In addition to the previous indicators, we calculated the specificity using the formula in Equation 5. It refers to the proportion of correctly identified negative instances among all actual negative cases and reflects the model's ability to accurately classify instances from the opposite disease class.

\[
\text{Specificity} = \frac{TN}{TN + FP} \quad (5)
\]
Lastly, we generated the confusion matrix, which provides
a comprehensive summary of the model’s performance and
the essential components required to derive the previously
mentioned metrics. This matrix compares the predicted classes
of samples to their actual classes by displaying the counts for
True Negatives (TN), False Positives (FP), True Positives (TP),
and False Negatives (FN). It can be represented as in (6):

\[
\begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix} \quad (6)
\]
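For reference, a small helper that computes Equations (1)-(5) directly from these confusion-matrix counts might look as follows.

```python
# Compute the metrics of Equations (1)-(5) from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (1)
    precision = tp / (tp + fp)                           # Eq. (2)
    recall = tp / (tp + fn)                              # Eq. (3), sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (4)
    specificity = tn / (tn + fp)                         # Eq. (5)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}
```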
V. YOLOV11 PERFORMANCE RESULTS
In this section, we evaluate the performance of our various
trained models using the metrics mentioned in section IV.
Starting with YOLOv11, as mentioned before, we employed
two versions of the model to test different complexities and
determine the most appropriate for this problem. Both the
nano and small models were trained for 50 epochs, with
YOLOv11n achieving 98% validation accuracy and 97.3%
testing accuracy, while YOLOv11s attained a higher testing
accuracy of 98.2%.
The accuracy graph for the nano version of YOLOv11, shown in Fig. 5, demonstrates steady improvement in accuracy with some fluctuations at the beginning. These variations decrease gradually as the number of epochs increases, until the curve becomes more stable. It is important to
mention that this evaluation was based on SGD optimizer and
a learning rate of 0.001.
It can be seen in Figures 6 and 7 that both the training and validation losses declined as the training advanced. The training loss decreased steadily, while the validation loss experienced more variation during its decline.
Analyzing the confusion matrix in Fig. 8 provides com-
prehensive and direct insights about the model’s performance
Fig. 5. YOLOv11n accuracy with SGD
Fig. 6. YOLOv11n train loss with SGD
Fig. 7. YOLOv11n validation loss with SGD
because it allows us to identify which classes the model
can detect better and when it usually makes poor decisions.
Eventually, it helps us decide what to fix and optimize to boost
the model’s behavior. The matrix for YOLOv11n clarifies how
the model obtained its high accuracy and validates the model’s
efficient overall performance. It seems that the trained model
performed remarkably well when detecting cancer in all its stages; however, it made some mistakes when identifying the healthy white blood cells, misclassifying 10% of them as cancerous cells.
Fig. 8. Normalized confusion matrix for YOLOv11n with SGD.
We also conducted experiments with different optimizers
and hyperparameters. Figures 9, 10, and 11 exhibit the model’s
performance when leveraging AdamW optimizer and the de-
fault 0.000714 YOLO learning rate during 100 epochs on
batches of size 16. AdamW optimizer shows more unsteady
performance than SGD in the accuracy and validation loss
graphs. Furthermore, it can be noticed that more epochs lead
to slower convergence, where the validation loss stabilizes
later after around 60 epochs, with more frequent oscillations
even in later epochs. Despite this experiment achieving a
validation accuracy of 98.2%, which is higher than that of
SGD optimizer, the model’s behavior here shows a greater
risk of overfitting. Thus, we decided to choose the YOLOv11n
with SGD as the best performance achieved, taking into
consideration not only the accuracy but also the stability and
efficiency.
YOLOv11s exhibited a similar but enhanced behavior compared to YOLOv11n. The optimal performance on YOLOv11s was reached with the SGD optimizer, a batch size of 32, and a 0.001 learning rate, and the model was trained for 50 epochs. Fig. 12 clarifies how the accuracy value followed the same trend as YOLOv11n, rising with some random variations at the beginning, indicating that the model was still learning the new data patterns. The graph became more stable by the end of the training process, achieving 98.6% validation accuracy. In addition, the test accuracy reached 98.2%, surpassing that of YOLOv11n by 0.9 percentage points. Figures 13 and 14 demonstrate the
Fig. 9. YOLOv11n accuracy with AdamW
Fig. 10. YOLOv11n train loss with AdamW
Fig. 11. YOLOv11n validation loss with AdamW
considerable decline in both the training and validation losses.
Fig. 12. YOLOv11s accuracy with SGD
Fig. 13. YOLOv11s train loss with SGD
There was a slight improvement in the confusion matrix as well, as illustrated in Fig. 15: the rate of healthy white blood cell images misclassified as cancer fell from 0.10 for YOLOv11n to 0.07.
On the other hand, Figures 16, 17, and 18 visualize the model's performance when trained using the AdamW optimizer. Although it attained a higher validation accuracy of 98.8%, the training process showed considerable fluctuations, reflecting instability and a possible risk of overfitting. Consequently, we opted for the model trained with SGD, as it demonstrated superior generalization and more stable performance.
The experiments with YOLOv11 revealed several important
insights. Training with the SGD optimizer produced smoother
training and validation curves, whereas the AdamW optimizer
achieved marginally higher accuracy. Moreover, larger batch
sizes contributed to improved accuracy. However, increasing
the number of epochs beyond 50 caused a drop in performance,
with 100 or more epochs proving detrimental. As a result, 50
epochs were chosen as the optimal configuration.
Fig. 14. YOLOv11s validation loss with SGD
Fig. 15. Normalized confusion matrix for YOLOv11s with SGD.
Fig. 16. YOLOv11s accuracy with AdamW
Fig. 17. YOLOv11s train loss with AdamW
Fig. 18. YOLOv11s validation loss with AdamW
VI. YOLOV8S PERFORMANCE RESULTS
The visual representation of YOLOv8's behavior is shown in Figures 19, 20, and 21, in which the accuracy on the validation dataset was 96.6%, peaking at 98% when evaluated on
the testing dataset. Compared to YOLOv11s, YOLOv8’s small
version achieves a slightly lower accuracy while it outperforms
the nano version. The accuracy, training loss, and validation
loss graphs of YOLOv8 presented follow similar patterns
as those of YOLOv11. In comparing optimizers, YOLOv8
demonstrated greater stability when using SGD rather than
AdamW. For batch size, we experimented with 8, 16, 32, and
64, finding that smaller batch sizes yielded better performance.
Consequently, a batch size of 8 was chosen for the final model.
The confusion matrix in Fig. 22 illustrates that the number of misclassified normal cells exceeds that of the small and nano versions of YOLOv11; otherwise, the model performs considerably well and efficiently.
VII. RESNET50 PERFORMANCE RESULTS
In this section, we evaluate ResNet50’s behavior and per-
formance metrics. Using fine-tuning, the model's training accuracy peaked at 99.2%. Meanwhile, the validation
Fig. 19. YOLOv8s accuracy with SGD
Fig. 20. YOLOv8s train loss with SGD
Fig. 21. YOLOv8s validation loss with SGD
Fig. 22. Confusion matrix for YOLOv8s with SGD
accuracy grew to 99%, and the test dataset evaluation achieved
99%. The visualization of the training and validation accuracy
curves in Fig. 23 highlights the learning enhancement and
growth along the epochs. There were some variations at the
beginning of the validation curve that were reduced as the
training process progressed, and there was almost no gap
between the training and validation by the hundredth epoch,
which also applies to the training and validation losses in
Fig. 24.
Fig. 23. Training and Validation Accuracy
Fig. 24. Training and Validation Loss
On a different note, the confusion matrix of ResNet50 in Fig. 25 shows that a percentage of healthy white blood samples was misidentified as cancerous; however, all of the blast cells were identified correctly.
Fig. 25. Confusion Matrix for ResNet50
VIII. INCEPTION-RESNET-V2 PERFORMANCE RESULTS
Finally, we analyze the performance of the fine-tuned
Inception-ResNet-v2 model, where the model’s behavior is
represented in figures 26, 27 and 28. The training and val-
idation accuracy graph illustrates the training progress across
epochs, which was initially unstable for the validation but
eventually converged smoothly, recording training, validation,
and testing accuracies of 99.7%, 98%, and 99.7%, respectively.
Meanwhile, the training and validation losses exhibited a
similar trend but in the opposite direction, declining over
time with some sharp variations for the validation at the
start of the process. It is worth noting that, compared to
ResNet50, Inception-ResNet-v2 demonstrated a more stable
training process.
Fig. 26. Training and Validation Accuracy
The confusion matrix of the model is visualized in Fig. 28,
which highlights the model’s strong capability in distinguish-
ing between cancerous and normal samples with very few
misclassifications.
Fig. 27. Training and Validation Loss
Fig. 28. Confusion Matrix for Inception-ResNet-v2
Table IV highlights the key findings of this study by
evaluating the algorithms on the test dataset using various
evaluation metrics. To sum up, it is notable that Inception-
ResNet-v2 comes first in both accuracy and specificity among
the five models with 99.7% and 100% accuracy and specificity,
respectively. YOLOv11s comes second in specificity with
93%, while YOLOv8s comes in last place. It is evident that
YOLOv8s attained lower scores than the rest of the tested
algorithms for the F1 and precision metrics too. However,
YOLOv11n falls at the bottom of the list when considering
the accuracy since it has only 97.3%.
TABLE IV
PERFORMANCE METRICS FOR DIFFERENT MODELS

Model                  Accuracy    F1       Precision    Recall    Specificity
YOLOv11n               97.3        98.8     97.9         99.8      89.5
YOLOv11s               98.2        99.2     98.6         99.8      93
YOLOv8s                98          98       96.5         99.6      82.6
ResNet50               99          99.3     98.6         100       92.8
Inception-ResNet-v2    99.7        98.96    100          97.94     100
IX. RELATED WORK
Extensive research has been conducted on the application
of AI for Leukemia detection. This section sheds light on the
previous work done on Leukemia detection using different
Deep Learning architectures, along with their strengths and
weaknesses.
Hosseini et al. [19] strove to detect B-cell acute lym-
phoblastic leukemia (B-ALL) cases, including the subtypes,
using a deep CNN. After leveraging K-means clustering and
segmentation for image preprocessing on a dataset consisting
of benign and malignant B-ALL cases, they compared the
efficiency of three lightweight CNN models (EfficientNetB0,
MobileNetV2, and NASNet Mobile) using the training and
testing data. Eventually, segmented and original images were
combined and fed as inputs through two channels to extract
maximum feature space, which enhanced the models’ accu-
racy. MobileNetV2 was selected for achieving 100% accuracy
and having the smallest size, making it suitable for implemen-
tation on mobile devices.
Talaat et al. [20] exploited the attention mechanism to
detect and classify leukemia cells. The A2M-LEUK algorithm
involved image preprocessing, feature extraction using CNN,
and an attention mechanism-based machine learning algorithm
applied to the extracted features. The C-NMC 2019 [7]
dataset was utilized, and classifiers such as SVM or a neural
network were used in the proposed algorithm. After evaluating
the precision, recall, accuracy, and specificity of the A2M-
LEUK algorithm against KNN, SVM, Random Forest, and
Naïve Bayes, the proposed model demonstrated superior per-
formance, achieving nearly 100% in all four metrics. However,
the paper did not specify which classification model was used
with A2M-LEUK to achieve that accuracy.
Yan [21] presented a study on the single-cell dataset C-
NMC 2019 [7] to classify normal and cancerous white blood
cells using three different models: YOLOv4, YOLOv8, and
a CNN model. Data augmentation was applied to the CNN
and YOLOv4 models. The CNN model, consisting of con-
volutional and max pooling layers, fully connected layers,
and ReLU activation functions, achieved 93% accuracy, while
YOLOv4 and YOLOv8 both achieved accuracies above 95%.
Devi et al. [22] utilized a combination of custom-designed
and pretrained CNN architectures to detect ALL in the ALL
image dataset [6] after applying augmentation. The custom-
designed CNN was used to extract hierarchical features, while
VGG-19 was used to extract high-level features. VGG-19
performed the classification task, and the proposed model
achieved 97.85% accuracy. On the other hand, [23] applied
image processing and the Fuzzy Rule-Based inference system
to tackle the same topic.
Rahmani et al. [24] opted for the C-NMC 2019 dataset,
where the data was preprocessed using methods such as
grayscaling and masking, followed by feature extraction
through transfer learning with models like VGG19, ResNet50,
ResNet101, ResNet152, EfficientNetB3, DenseNet-121, and
DenseNet-201. Feature selection was then applied using Ran-
dom Forest, Genetic Algorithms, and the Binary Ant Colony
Optimization metaheuristic algorithm. The classification was
conducted through a multilayer perceptron, achieving an ac-
curacy slightly above 90%.
Kumar et al. [25] contributed to the classification of different
types of blood cancer in white blood cells, such as ALL and
Multiple Myeloma. They applied preprocessing and augmenta-
tion methods to the data, followed by feature selection. The
study used the SelectKBest class to select K specific features.
The proposed model consisted of two blocks, each containing a
convolutional layer and a max pooling layer, followed by fully
connected layers and a classification layer. This architecture
achieved 97.2% accuracy.
Saikia et al. [26] introduced VCaps-Net, a fine-tuned
VGG16 model combined with a capsule network for ALL
detection. Two datasets were used: ALL-IDB1 [12] and a
private dataset. The proposed model integrates the powerful
structure of VGG16 with a capsule network, which represents
unit positions in images using vectors to maintain spatial rela-
tionships often lost due to max pooling. VCaps-Net achieved
an accuracy of 98.64%.
The ALL-IDB dataset was also used in a study by Al-
saykhan et al. [27] to detect ALL using a hybrid algorithm.
The approach combined support vector machine (SVM) and
particle swarm optimization algorithms to optimize the results
by selecting the best parameters to minimize errors. As a
result, an accuracy of 97% was achieved.
In [28], Abhishek et al. classified different types of
leukemia, including CLL, ALL, CML, and AML. They utilized
the transfer learning approach, freezing the initial layers of
pretrained CNNs as feature extractors (a process known as
fine-tuning). The feature extractors used were ResNet152V2,
MobileNet, DenseNet121, VGG16, InceptionV3, and Xcep-
tion, which were trained on ImageNet [29]. These extractors
were then combined with classifiers such as Support Vector
Machines, Random Forest, and new fully connected layers to
improve classification performance. The accuracies for various
combinations of classifiers ranged from 74% to 84%.
Vogado et al. [30] conducted a study using multiple datasets
of different natures, focusing on multi-cell and single-cell
images. CNNs were used for feature extraction from the orig-
inal images, and SVM was applied for classification without
prior image segmentation. The pre-trained models included
AlexNet [14], CaffeNet [31], and VGG-f [32]. The feature
vectors were then passed to the selected classifier for final
predictions.
X. COMPARISON ANALYSIS WITH OTHER RESULTS
In this section, we compare our image classification methods
for Acute Lymphoblastic Leukemia (ALL) with existing ap-
proaches as shown in Table V. Our approach using YOLOv11s
achieved an accuracy of 98.2% on the merged dataset,
outperforming Yan’s YOLOv8 model, which reached 96%
on the C-NMC 2019 dataset. Additionally, our YOLOv8s
model attained 98%, while the Inception-ResNet-v2 model
achieved 99.7% accuracy, outperforming other custom CNN-
based studies, such as Devi et al., which achieved 97.85%
on the ALL dataset. Our results also demonstrate a higher
accuracy compared to the VCaps-Net model from Saikia
et al., which obtained 98.64% on ALL-IDB1, as well as
the DenseNet-201 model by Rahmani et al., which achieved
90.55% accuracy on the C-NMC 2019 dataset.
TABLE V
COMPARISON OF DIFFERENT APPROACHES FOR DETECTING ACUTE LYMPHOBLASTIC LEUKEMIA (ALL)

Study                   Methodology                        Accuracy                        Dataset
Yan [21]                YOLOv4, YOLOv8, CNN                YOLOv4: 98%, YOLOv8: 96%,       C-NMC 2019
                                                           CNN: 92%
Devi et al. [22]        Custom + pretrained CNN            97.85%                          ALL dataset
Rahmani et al. [24]     DenseNet-201 + RF-GA-BACO          90.55%                          C-NMC 2019
Saikia et al. [26]      VCaps-Net                          98.64%                          ALL-IDB1
Kumar et al. [25]       Custom CNN                         97.2%                           Custom dataset
Abhishek et al. [28]    SVM (VGG16 with LTCL fine-tuned    84%                             Custom dataset
                        along with SVM)
Alsaykhan et al. [27]   SVM + PSO                          97%                             ALL-IDB1 + ALL-IDB2
Our study [11]          YOLOv11n, YOLOv11s, YOLOv8s,       YOLOv11n: 97.3%,                ALL-IDB1 +
                        ResNet50, Inception-ResNet-v2      YOLOv11s: 98.2%,                ALL dataset
                                                           YOLOv8s: 98%, ResNet50: 99%,
                                                           Inception-ResNet-v2: 99.7%
XI. CONCLUSION
In conclusion, the integration of AI in the medical field is
a massive step in the advancement of the health system and
services provided to patients. In this research, computer vision
techniques and several deep learning models were utilized
to detect the absence or presence of Acute Lymphoblastic
Leukemia in different stages. To achieve this goal, YOLOv11,
YOLOv8, ResNet50, and Inception-ResNet-v2 were fine-tuned
on multi-cell datasets to differentiate between healthy and can-
cerous white blood cells where this disease is usually found.
In order to help the models learn diverse features, we collected
images from ALL-IDB1 and ALL image datasets, then trained
the models on the final integrated datasets. Consequently,
the models exhibited high performances where the accuracies
peaked at 97.3%, 98.2%, 98%, 99%, and 99.7% for YOLOv11n, YOLOv11s, YOLOv8s, ResNet50, and Inception-ResNet-v2, respectively.
To the best of our knowledge, this is the first study to utilize
the most recent and 11th version of the YOLO series. The
comparison between YOLOv8 and YOLOv11 showed only
slight differences in performance and proved that a model’s
complexity can play a noticeable role in its behavior. However,
with data augmentation, other models such as ResNet50 and
Inception-ResNet-v2 can perform better, especially as the
network goes deeper.
For future work, we plan to collect more images from
other datasets to boost the robustness of the models when
encountering various features. This would assist the models
in facing computer vision challenges pertaining to different
settings and image preparations. We also aim to make our
models suitable for deployment on different devices with
consideration for the small hardware architectures.
REFERENCES
[1] World Health Organization, “Cancer today,” https://gco.iarc.who.int/today/en/dataviz, 2024, accessed: 2024-09-25.
[2] American Society of Hematology, “Leukemia,” https://www.hematology.org/education/patients/blood-cancers/leukemia, 2023, accessed: 2024-09-25.
[3] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” 2019. [Online].
Available: https://arxiv.org/abs/1801.04381
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
[Online]. Available: https://arxiv.org/abs/1706.03762
[5] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016.
[6] M. Ghaderzadeh, M. Aria, A. Hosseini, F. Asadi, D. Bashash, and
H. Abolghasemi, “A fast and efficient cnn model for b-all diagnosis
and its subtypes classification using peripheral blood smear images,”
International Journal of Intelligent Systems, Nov 2021.
[7] S. Mourya, S. Kant, P. Kumar, A. Gupta, and R. Gupta, “All challenge
dataset of isbi 2019 (c-nmc 2019) (version 1),” The Cancer Imaging
Archive, 2019.
[8] Ultralytics, “Yolov11 - key features,” 2024, accessed: October 8,
2024. [Online]. Available: https://docs.ultralytics.com/models/yolo11/
#key-features
[9] R. Varghese and S. M., “Yolov8: A novel object detection algorithm
with enhanced performance and robustness,” in 2024 International
Conference on Advances in Data Engineering and Intelligent Computing
Systems (ADICS), Chennai, India, 2024, pp. 1–6.
[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385
[11] A. Awad and S. A. Aly, “Early diagnosis of acute lymphoblastic
leukemia using yolov8 and yolov11 deep learning models,” in IEEE
JAC-ECC, International Japan-Africa Conference on Electronics com-
munications and Computations, Alex, Egypt, 16-18 December, 2024.
[12] A. Genovese, V. Piuri, K. N. Plataniotis, and F. Scotti, “DL4ALL:
Multi-Task Cross-Dataset Transfer Learning for Acute Lymphoblastic
Leukemia Detection,” IEEE Access, vol. 11, pp. 65 222–65 237, 2023.
[13] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 2015. [Online]. Available: https:
//arxiv.org/abs/1409.1556
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems, F. Pereira, C. Burges, L. Bottou,
and K. Weinberger, Eds., vol. 25. Curran Associates, Inc.,
2012. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” 2014. [Online]. Available: https://arxiv.org/abs/1409.4842
[16] M. Iman, H. R. Arabnia, and K. Rasheed, “A review of deep
transfer learning and recent advancements,” Technologies, vol. 11,
no. 2, p. 40, Mar. 2023. [Online]. Available: http://dx.doi.org/10.3390/
technologies11020040
[17] N. Mehla, Ishita, R. Talukdar, and D. K. Sharma, “Object detection
in autonomous maritime vehicles: Comparison between yolo v8 and
efficientdet,” in International Conference on Data Science and Network
Engineering. Singapore: Springer Nature Singapore, 2023, pp. 125–
141.
[18] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning,”
2016. [Online]. Available: https://arxiv.org/abs/1602.07261
[19] A. Hosseini et al., “A mobile application based on efficient lightweight
cnn model for classification of b-all cancer from non-cancerous cells: A
design and implementation study,” Informatics in Medicine Unlocked,
vol. 39, pp. 101 244–101 244, Jan 2023.
[20] F. M. Talaat and S. A. Gamel, “A2m-leuk: attention-augmented algo-
rithm for blood cancer detection in children,” Neural Computing and
Applications, vol. 35, no. 24, pp. 18 059–18 071, Jun 2023.
[21] E. Yan, “Detection of acute myeloid leukemia using deep learning
models based systems,” in IFMBE Proceedings, Jan 2024, pp. 421–431.
[22] J. R. Devi, P. S. Kadiyala, S. Lavu, N. Kasturi, and L. Kosuri, “Enhanc-
ing acute lymphoblastic leukemia classification with a rapid and effective
cnn model,” in 2024 Third International Conference on Distributed
Computing and Electrical Circuits and Electronics (ICDCECE), Ballari,
India, 2024, pp. 1–6.
[23] M. A. Khosrosereshki and M. B. Menhaj, “A fuzzy based classifier
for diagnosis of acute lymphoblastic leukemia using blood smear image
processing,” in 2017 5th Iranian Joint Congress on Fuzzy and Intelligent
Systems (CFIS), Qazvin, Iran, 2017, pp. 13–18.
[24] A. M. Rahmani et al., “A diagnostic model for acute lymphoblastic
leukemia using metaheuristics and deep learning methods,” arXiv.org,
2024, accessed: Sep. 25, 2024. [Online]. Available: https://arxiv.org/
abs/2406.18568
[25] D. Kumar, N. Jain, A. Khurana, S. Mittal, S. C. Satapathy, R. Senkerik,
and J. D. Hemanth, “Automatic detection of white blood cancer from
bone marrow microscopic images using convolutional neural networks,”
IEEE Access, vol. 8, pp. 142 521–142 531, 2020.
[26] R. Saikia, A. Sarma, K. M. Singh, and S. S. Devi, “Vcaps-net: Fine-
tuned vgg16 with capsule network for acute lymphoblastic leukemia
detection on a diverse dataset,” in 2024 6th International Conference on
Energy, Power and Environment (ICEPE), 2024, pp. 1–6.
[27] L. K. Alsaykhan and M. S. Maashi, “A hybrid detection model for acute
lymphocytic leukemia using support vector machine and particle swarm
optimization (svm-pso),” Scientific Reports, vol. 14, p. 23483, 2024.
[28] A. Abhishek, R. K. Jha, R. Sinha, and K. Jha, “Automated detection
and classification of leukemia on a subject-independent test dataset
using deep transfer learning supported by grad-cam visualization,”
Biomedical Signal Processing and Control, vol. 83, p. 104722, 2023.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/
S1746809423001556
[29] O. Russakovsky, J. Deng, H. Su et al., “Imagenet large scale visual
recognition challenge,” International Journal of Computer Vision, vol.
115, no. 3, pp. 211–252, 2015.
[30] L. H. Vogado, R. M. Veras, F. H. Araujo, R. R. Silva, and K. R.
Aires, “Leukemia diagnosis in blood slides using transfer learning
in cnns and svm for classification,” Engineering Applications of
Artificial Intelligence, vol. 72, pp. 415–422, 2018. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0952197618301039
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM International
Conference on Multimedia, ser. MM ’14. New York, NY, USA:
Association for Computing Machinery, 2014, p. 675–678. [Online].
Available: https://doi.org/10.1145/2647868.2654889
[32] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of
the devil in the details: Delving deep into convolutional nets,” 2014.
[Online]. Available: https://arxiv.org/abs/1405.3531