Rice Leaf Disease Detection: A Comparative Study
Between CNN, Transformer and Non-neural
Network Architectures
Samia Mehnaz[1] Md. Touhidul Islam[2]
[1]Viqarunnisa Noon College, Dhaka, Bangladesh
Email: samiamehnaz1003@gmail.com
[2]Department of Electrical and Electronic Engineering,
Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Email: 1806177@eee.buet.ac.bd
Abstract—In nations such as Bangladesh, agriculture plays a
vital role in providing livelihoods for a significant portion of
the population. Identifying and classifying plant diseases early
is critical to prevent their spread and minimize their impact on
crop yield and quality. Various computer vision techniques can be
used for such detection and classification. While CNNs have long
been dominant in image classification tasks, vision transformers
have recently become equally capable. In this paper, we study
various computer vision techniques for Bangladeshi rice leaf
disease detection. We use Dhan-Shomadhan, a Bangladeshi rice
leaf disease dataset, to experiment with several CNN and ViT
models, and we compare the performance of these deep neural
network architectures with a traditional machine learning method,
the Support Vector Machine (SVM). We leverage transfer learning
for better generalization with a limited amount of training data.
Among the models tested, ResNet50 exhibited the best performance
over the other CNN and transformer-based models, making it the
optimal choice for this task.
Index Terms—CNN, Transformer Architecture, SVM, Classification, Rice Leaf Disease
I. INTRODUCTION
Rice is the staple food for the majority of the population in
Bangladesh, contributing significantly to the country’s econ-
omy and food security. With over 75% of the agricultural
land dedicated to rice cultivation, ensuring healthy rice crops
is crucial for sustaining livelihoods and meeting national
food demands. However, rice crops are highly vulnerable to
various diseases such as bacterial blight, brown spots, and
sheath blight, which can lead to significant yield losses if
not addressed promptly. In Bangladesh, traditional methods of
rice disease identification often rely on manual labor, where
farmers visually inspect the fields for signs of disease. This
process is not only subjective and error-prone but also energy-
intensive, requiring considerable physical effort to cover large
fields. Farmers must repeatedly monitor crops throughout the
growing season, which demands significant time and labor
resources. These challenges are exacerbated in rural areas
where access to expert advice and modern tools is limited.
The reliance on manual labor can delay timely interventions,
leading to the rapid spread of diseases and substantial crop
losses.
The integration of modern technologies, such as image-processing-based detection, offers a promising solution to these
challenges. Automated disease detection systems can reduce
the need for labor-intensive monitoring, providing farmers
with precise, real-time diagnostics. This not only minimizes
physical effort but also ensures faster and more accurate
identification of diseases, allowing for timely and targeted
treatments.
Over the years, various paradigms have emerged for image
processing based classification. Early methods relied on tech-
niques like Support Vector Machine (SVM), which provided
a foundational approach to the task. With advancements in
machine learning, Convolutional Neural Networks (CNNs)
gained popularity for their superior ability to extract spatial
features, leading to significant improvements in classification
accuracy. More recently, transformer-based architectures have
begun to dominate the field due to their ability to model
long-range dependencies and achieve state-of-the-art results
in many domains. In this research, we explored the relevance
of all three paradigms—SVM, CNN, and transformer—in the
context of Bangladeshi rice disease classification. We have
experimented on the Dhan-Shomadhan dataset [1], providing
a comprehensive evaluation of the strengths and applicability
of CNN, transformer, and non-neural network architectures
such as SVM in practical agricultural solutions.
II. LITERATURE REVIEW
The detection and classification of plant diseases using
machine learning (ML) techniques have gained significant
attention in recent years, particularly in agriculture, where early
disease detection is critical for crop management. Traditional
machine learning models, such as Support Vector Machines
(SVM), were initially employed to identify rice leaf diseases
based on handcrafted features. In these early approaches, SVM
was often used in conjunction with various feature extraction
techniques such as color histograms and texture analysis [2].
For instance, Sethy et al. [2] developed a model using SVM
for detecting different rice leaf diseases, including paddy leaf
rot. While the model showed promising results, the reliance
on handcrafted features limited its ability to capture complex
patterns in the images.
With the increasing availability of large datasets, deep learn-
ing techniques have emerged as more powerful alternatives.
Convolutional Neural Networks (CNNs), which are designed
to automatically learn spatial hierarchies of features, have
become the standard for image classification tasks. CNN-based
models, such as ResNet and Inception, have demonstrated
exceptional performance in various plant disease classification
tasks [3]. In [3], the authors applied CNN models such
as VGG16, VGG19, ResNet, and Xception for classifying plant
diseases. This shift from traditional machine learning to deep
learning marked a significant improvement in the ability to
detect rice leaf diseases with minimal human intervention.
The next evolution in the application of neural networks to
plant disease detection came with the advent of deeper and
more sophisticated CNN architectures. For example, Krish-
namoorthy et al. [4] employed a transfer learning approach
using InceptionResNetV2 to diagnose rice leaf diseases, in-
cluding bacterial blight, brown patches, and leaf blast. The
authors leveraged a dataset of 5200 images and demonstrated
the power of transfer learning for classifying rice leaf diseases
with greater precision. Their work further emphasized the
importance of pre-trained networks in dealing with small
datasets, where training models from scratch may not be
feasible.
In recent years, attention mechanisms have gained promi-
nence as they allow models to focus on important features
in the input data. One such architecture is MaxViT, a hybrid
model combining the strengths of both Convolutional Neu-
ral Networks and Transformer-based attention mechanisms.
MaxViT’s ability to capture both local and global features
through its multi-stage attention layers has shown promise in
tasks requiring fine-grained feature extraction and contextual
understanding [5]. This is particularly relevant in the context
of rice leaf disease detection, where the intricate patterns of
disease symptoms can benefit from global attention. MaxViT
has shown impressive performance in various image classifi-
cation tasks, making it a promising candidate for plant disease
classification.
III. METHODOLOGY
This research presents a comparative analysis of rice leaf
disease detection using CNNs, Transformer-based architec-
tures, and non-neural network models. The methodology con-
sists of data preprocessing, data augmentation, and detailed
model implementation.
A. Data Preprocessing
The dataset comprised images of rice leaves exhibiting var-
ious disease symptoms. Each image was resized to a uniform
dimension of 448×448 pixels to match the input requirements
of the models. Pixel intensity normalization was applied to
standardize the dataset, ensuring consistency in input image
values.
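For illustration, a minimal preprocessing pipeline of this form is sketched below using torchvision. The paper does not report the normalization statistics, so the ImageNet mean and standard deviation are assumed here for consistency with the ImageNet-pretrained backbones used later.

```python
from torchvision import transforms

# Deterministic preprocessing sketch: resize to 448x448 and normalize pixel
# intensities. The mean/std values are an assumption (ImageNet statistics).
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),  # converts a PIL image to a tensor scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```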
Fig. 1: An image with all the augmentation transformations applied
B. Data Augmentation
To increase data variability, enhance model generalization,
and avoid overfitting, the following data augmentation techniques
were utilized (a sketch of such a pipeline is shown after the list):
• Random Resize: Images were randomly resized to simulate variations in scale.
• Random Crop: Random cropping introduced spatial variability in the input data.
• Random Perspective: Perspective transformations mimicked distortions typical of real-world imaging.
• Random Gaussian Blur: Blurring effects were applied to replicate noise conditions.
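A hedged sketch of such a training-time pipeline in torchvision follows. The exact parameters are not reported in the paper, so the crop scale, distortion strength, and blur settings below are assumptions, and RandomResizedCrop stands in for the random resize and random crop steps.

```python
from torchvision import transforms

# Augmentation sketch; all parameter values are assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(448, scale=(0.7, 1.0)),        # random resize + crop
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),  # perspective distortion
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # random Gaussian blur
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```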
C. Model Implementation
Transfer learning was employed for all neural network
models due to the relatively small size of the dataset and
the challenges in acquiring large-scale labeled data in prac-
tical scenarios. All neural network models were pretrained
on the ImageNet dataset, allowing them to leverage learned
features and improve generalization. This approach signifi-
cantly reduces the computational cost and training time while
improving model performance. However, transfer learning was
not applied to the SVM models, as they were either used
directly or combined with features extracted from ResNet-
based models.
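As a concrete sketch of this transfer-learning setup for one of the backbones, the snippet below loads an ImageNet-pretrained ResNet50 from torchvision and replaces its head for the five disease classes. The freezing strategy shown is an assumption; the paper does not specify which layers were fine-tuned.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # Rice Blast, Rice Tungro, Sheath Blight, Brown Spot, Leaf Scald

# Load ImageNet-pretrained weights and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze the earlier stages so only layer4 and the new head are
# fine-tuned (an assumption, not stated in the paper).
for name, param in model.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        param.requires_grad = False
```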
1) ResNet50: ResNet50 is a deep convolutional neural net-
work with 50 layers that employs residual learning to address
the vanishing gradient problem commonly encountered in deep
networks. By introducing skip connections, ResNet50 allows
gradients to flow more effectively during backpropagation,
making it possible to train very deep networks. Each residual
block consists of identity mappings combined with convolu-
tional operations, which enables the network to learn both low-
level and high-level features efficiently. ResNet50 is widely
known for its strong performance in image classification tasks
due to its balance between depth and computational efficiency.
Fig. 2: ResNet50 Model Architecture [6]
In this study, ResNet50 was used to extract hierarchical
features from rice leaf images, leveraging its ability to detect
subtle patterns indicative of disease symptoms. Figure 2 shows
the ResNet50 architecture.
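To make the skip-connection mechanism concrete, a minimal residual block sketch in PyTorch is given below. Note that ResNet50 itself uses deeper three-layer bottleneck blocks; this two-convolution variant only illustrates how the identity path lets gradients bypass the convolutional stack.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified residual block: the identity skip connection adds the input
    back to the convolutional output (illustrative, not ResNet50's exact block)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity skip connection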
2) ResNet152: ResNet152 is a deeper variant of ResNet,
consisting of 152 layers, and is designed to learn more
complex and fine-grained features from input data. Simi-
lar to ResNet50, it incorporates residual connections, but
the increased depth allows it to model intricate patterns in
large datasets. The additional layers in ResNet152 make it
particularly suitable for tasks that require capturing subtle
variations in the data, such as differentiating between rice leaf
diseases with similar visual symptoms. However, the increased
depth comes with a higher computational cost. Despite this,
ResNet152 has been shown to achieve superior accuracy in
many classification tasks, making it a strong candidate for
disease detection in this study.
3) Inception-V3: Inception-V3 utilizes inception modules,
which apply convolutional operations with multiple kernel
sizes in parallel. This allows the network to capture features
at various scales within a single layer. The architecture also
incorporates auxiliary classifiers, which serve as intermediate
supervision points during training, improving gradient flow
and enhancing convergence. Additionally, Inception-V3 em-
ploys techniques such as factorized convolutions and label
smoothing to improve training efficiency and model robust-
ness. This architecture is particularly effective for handling
complex datasets with diverse feature scales, making it suitable
for rice leaf disease detection. In this study, Inception-V3
was applied to identify disease-specific patterns in the input
images.
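The multi-branch idea behind the inception module can be sketched as parallel convolutions of different kernel sizes concatenated along the channel axis. The toy module below is an illustrative simplification; Inception-V3 additionally factorizes large kernels (e.g., a 5x5 into stacked 3x3 convolutions), so this is not the exact Inception-V3 block.

```python
import torch
import torch.nn as nn

class MiniInceptionModule(nn.Module):
    """Toy inception-style module: parallel branches with different receptive
    fields, concatenated channel-wise to capture features at several scales."""

    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
```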
4) MaxViT: MaxViT is a Transformer-based architecture
that combines local and global attention mechanisms for
efficient feature extraction. It integrates convolutional layers
for localized context and uses two attention mechanisms:
Block Attention and Grid Attention.
Block Attention divides the feature map into non-
overlapping blocks and computes attention within each block,
enabling the model to capture fine-grained details with reduced
computational overhead. Grid Attention complements this by
rearranging features into a grid-based structure, allowing for
long-range dependency modeling across the image.
A key component of MaxViT is the MBConv block (Mobile
Inverted Bottleneck Convolution), which is used for efficient
feature extraction. The MBConv block incorporates depthwise
separable convolutions and squeeze-and-excitation (SE) layers
to reduce computational complexity while enhancing feature
representation. This block ensures that the model remains
computationally efficient even for high-resolution images.
Each MaxViT block includes convolutional embedding,
Layer Normalization, Multi-Head Self-Attention (MHSA)
modules for Block and Grid Attention, and Feed-Forward
Networks (FFNs) with skip connections for effective training.
Figure 4 illustrates the overall MaxViT architecture.
MaxViT-L was employed in this study to evaluate the
effectiveness of Transformer-based architectures in rice leaf
disease detection, leveraging its ability to balance local detail
extraction with global contextual understanding.
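The partitioning behind the two attention types can be sketched with plain tensor reshapes, as below. This is a simplified illustration only: it omits MBConv, relative position bias, and normalization, and the window/grid size of 4 is an arbitrary toy value rather than MaxViT-L's actual configuration.

```python
import torch
import torch.nn as nn

def block_partition(x, p):
    # x: (B, H, W, C) -> (B * H/p * W/p, p*p, C); tokens within one p x p window
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)

def grid_partition(x, g):
    # x: (B, H, W, C) -> (B * H/g * W/g, g*g, C); each sequence gathers one token
    # from every g x g grid position, giving sparse, image-wide attention
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, g * g, C)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feat = torch.randn(2, 16, 16, 64)            # toy feature map (B, H, W, C)
local_tokens = block_partition(feat, p=4)    # Block Attention operates on these
global_tokens = grid_partition(feat, g=4)    # Grid Attention operates on these
local_out, _ = attn(local_tokens, local_tokens, local_tokens)
global_out, _ = attn(global_tokens, global_tokens, global_tokens)
```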
5) Support Vector Machine (SVM): The SVM algorithm
is a classical machine learning approach that separates data
points of different classes using an optimal hyperplane. It is
particularly effective for tasks where the data is linearly or
near-linearly separable. In this study, SVM was applied as
a baseline model for disease detection. The algorithm was
trained on handcrafted features extracted from the dataset,
providing a benchmark to compare the performance of deep
learning models.
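A hedged sketch of such an SVM baseline is given below, using simple per-channel color histograms as the handcrafted features. The specific features, kernel, and hyperparameters used in the paper are not reported, so those choices, along with the X_imgs and y variables, are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def color_histogram(img, bins=32):
    # img: H x W x 3 uint8 array; per-channel intensity histograms as features
    feats = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    return np.concatenate(feats).astype(np.float32)

# X_imgs and y are assumed to hold the loaded training images and integer labels.
X = np.stack([color_histogram(im) for im in X_imgs])
svm_baseline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
svm_baseline.fit(X, y)
```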
6) SVM with ResNet Feature Extractor: To leverage the
strengths of both deep learning and classical machine learning,
features were extracted from the penultimate layer of the
ResNet50 model and used as input to an SVM classifier.
This hybrid approach combines the robust feature extraction
capabilities of ResNet50 with the classification power of
SVM. By using deep, hierarchical features as input, the SVM
classifier was able to achieve better performance compared
to using raw image features. This method serves as a bridge
Fig. 3: Inception-V3 Model Architecture [7]
Fig. 4: MaxViT Model Architecture [5]
between traditional and modern machine learning techniques,
offering insights into their complementary strengths.
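A minimal sketch of this hybrid pipeline follows, assuming a standard PyTorch DataLoader named train_loader that yields preprocessed image batches; the RBF kernel is an assumption. Dropping the classification head makes the forward pass return the 2048-dimensional penultimate-layer features, which are then fed to the SVM.

```python
import torch
from torchvision import models
from sklearn.svm import SVC

# ImageNet-pretrained ResNet50 with its classification head removed, so the
# forward pass returns the penultimate-layer (global-pooled) features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for imgs, lbls in loader:  # loader is assumed to yield preprocessed batches
        feats.append(backbone(imgs))
        labels.append(lbls)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

X_train, y_train = extract_features(train_loader)  # train_loader assumed defined
hybrid_svm = SVC(kernel="rbf").fit(X_train, y_train)
```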
IV. EXPERIMENTAL RESULTS
We performed our experiments on the Dhan-Shomadhan
dataset [1]. The dataset contains images of 5 classes: Rice Blast,
Rice Tungro, Sheath Blight, Brown Spot, and Leaf Scald, captured
against both white and field backgrounds. All the images were
collected from the Dhaka division of Bangladesh. Extensive
experimentation was performed on an NVIDIA P100 GPU. A
multistep learning rate schedule was used, and the learning rate
was decreased at optimal epochs. The dataset was split in an
80:10:10 ratio into train, test, and validation partitions.
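A sketch of the multistep learning-rate schedule described above is shown below. The optimizer, learning rate, milestone epochs, epoch count, and the train_one_epoch/validate helpers are assumptions, as the paper does not report these details.

```python
import torch

# Training-setup sketch: SGD with a multi-step schedule that decays the
# learning rate at chosen epochs (all hyperparameters are assumptions).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 35], gamma=0.1)

for epoch in range(50):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical training helper
    validate(model, val_loader)                      # hypothetical validation helper
    scheduler.step()
```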
The evaluation metrics used are listed below. Here TP, FP, TN,
and FN stand for True Positive, False Positive, True Negative,
and False Negative, respectively. As we are dealing with
multiclass classification, we worked with the macro average of
all the metrics.
Macro Average Precision:
\[
\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Precision}_i,
\qquad
\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}
\]

Macro Average Recall:
\[
\mathrm{Recall}_{\mathrm{macro}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Recall}_i,
\qquad
\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}
\]

Macro Average F1 Score:
\[
\mathrm{F1}_{\mathrm{macro}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{F1}_i,
\qquad
\mathrm{F1}_i = \frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}
\]
TABLE I: Performance comparison of various models
Model Validation Acc. Recall Precision F1-Score
ResNet50 [8] 91.50% 90.1% 92.3% 91.2%
Inception-V3 [7] 87.50% 86.23% 88.11% 87.50%
ResNet152 [8] 90.00% 89.67% 91.2% 90.23%
MaxViT [5] 75.00% 73.2% 76.4% 74.9%
SVM [9] 62.45% 60.21% 55.10% 55.43%
SVM+ResNet 68.45% 61.1% 58.2% 59.89%
Note: All metrics have been macro averaged for the 5 classes.
Average Accuracy:
\[
\mathrm{Avg\ Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i + TN_i}
\]
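In practice, such macro-averaged metrics can be computed with scikit-learn, as in the hedged sketch below; y_true and y_pred are assumed to hold the test-set labels and predictions, and accuracy_score reports overall accuracy rather than the per-class average defined above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred are assumed to hold the test labels and model predictions.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)  # overall accuracy
print(f"macro P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}  acc={accuracy:.3f}")
```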
Extensive experimentation was conducted, testing various
epochs, batch sizes, and learning rates to identify a robust
model for accurate predictions. Table I shows the experimental
results for the various models.
From the table, we can see that ResNet50 achieved the
highest macro F1-score of 91.2%, followed by ResNet152 with
90.23% and Inception-V3 with 87.50%. MaxViT, however,
demonstrated relatively lower performance, achieving a macro-
averaged F1-score of 74.9%. Notably, the three best-performing
models are all CNN-based, suggesting that local context is
more important for this leaf disease classification task. SVM
underperformed severely with an F1-score of 55.43%. Using
ResNet as a feature extractor for SVM, creating a hybrid model
between SVM and ResNet, improved the performance to an
F1-score of 59.89%; however, that is still far below the
F1-scores of the CNN networks.
These results indicate that CNN models are superior for the
crop disease classification task, while traditional non-neural-
network machine learning methods are ineffective for it.
ResNet50 is particularly effective for rice crop disease
detection, offering a balance between accuracy and training
efficiency.
Figure 5 shows the loss curve for the ResNet50 training. The
validation loss converges to a stable value at around epoch 40,
and early stopping was adopted during training to avoid
overfitting. Figure 6 shows a comparison of accuracy between
the neural network architectures.
V. CONCLUSION
Protecting crops in agriculture is a complex endeavor that
demands a deep comprehension of the crops, pests, diseases,
and weeds involved. In this research, we utilized deep learning
models, particularly CNN architectures, to identify various rice
diseases. The findings indicate that utilizing deeper models,
such as ResNet152, typically enhances performance; however,
ResNet50 provided a commendable balance between accuracy
and computational efficiency. Our research highlights the
capabilities of CNNs in automating the identification of rice
diseases, simplifying the process for users to detect illnesses
through visual input from cameras. By providing real-time
feedback, farmers can take early action to safeguard their
Fig. 5: Epoch vs train and validation loss graph for ResNet50
training
Fig. 6: Comparison of macro averaged accuracy between the
models
crops, thereby boosting productivity. Additional technologies
like drones fitted with cameras and sensors can be integrated
to improve the scalability and effectiveness of the disease
detection system. This will further enable farmers to carry
out preventive measures swiftly, ensuring healthier crops and
enhanced yields.
REFERENCES
[1] M. F. Hossain, Dhan-Shomadhan: A dataset of rice leaf disease classification for Bangladeshi local rice, 2023. arXiv: 2309.07515 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2309.07515.
[2] P. K. Sethy, N. K. Barpanda, A. K. Rath, and S. K. Behera, “Deep feature based rice leaf disease identification using support vector machine,” Computers and Electronics in Agriculture, vol. 175, p. 105527, 2020.
[3] P. Tejaswini, P. Singh, M. Ramchandani, Y. K. Rathore, and R. R. Janghel, “Rice leaf disease classification using CNN,” in IOP Conference Series: Earth and Environmental Science, IOP Publishing, vol. 1032, 2022, p. 012017.
[4] N. Krishnamoorthy, L. N. Prasad, C. P. Kumar, B. Subedi, H. B. Abraha, and V. Sathishkumar, “Rice leaf diseases prediction using deep neural networks with transfer learning,” Environmental Research, vol. 198, p. 111275, 2021.
[5] Z. Tu, H. Talebi, H. Zhang, et al., “MaxViT: Multi-axis vision transformer,” in ECCV, 2022.
[6] N. Al-Humaidan and M. Prince, “A classification of arab ethnicity based
on face image using deep learning approach,” IEEE Access, vol. PP, pp.
1–1, Mar. 2021. DOI: 10.1109/ACCESS.2021.3069022.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826. DOI: 10.1109/CVPR.2016.308.
[8] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image
recognition, 2015. arXiv: 1512.03385 [cs.CV]. [Online]. Available:
https://arxiv.org/abs/1512.03385.
[9] N. Cristianini and E. Ricci, “Support vector machines,” in Encyclopedia of Algorithms, M.-Y. Kao, Ed. Boston, MA: Springer US, 2008, pp. 928–932, ISBN: 978-0-387-30162-4. DOI: 10.1007/978-0-387-30162-4_415. [Online]. Available: https://doi.org/10.1007/978-0-387-30162-4_415.