The Visual Computer
https://doi.org/10.1007/s00371-021-02116-3
ORIGINAL ARTICLE
Automatic detection ofoil palm fruits fromUAV images using
animproved YOLO model
MohamadHaniJunos1· AnisSalwaMohdKhairuddin1· SubbiahThannirmalai2· MahidzalDahari1
Accepted: 22 March 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Manual harvesting of loose fruits in the oil palm plantation is both time consuming and physically laborious. An automatic harvesting system is an alternative solution for precision agriculture, which requires accurate visual information of the targets. Current state-of-the-art one-stage object detection methods provide excellent detection accuracy; however, they are computationally intensive and impractical for embedded systems. This paper proposes an improved YOLO model to detect oil palm loose fruits from unmanned aerial vehicle images. In order to improve the robustness of the detection system, the images are augmented by brightness, rotation, and blurring to simulate the actual natural environment. The proposed improved YOLO model adopts several improvements: a densely connected neural network for better feature reuse, the swish activation function, multi-layer detection to enhance detection of small targets, and prior box optimization to obtain accurate bounding box information. The experimental results show that the proposed model achieves an outstanding average precision of 99.76% with a detection time of 34.06 ms. In addition, the proposed model is also light in size and requires less training time, which is significant in reducing hardware costs. The results exhibit the superiority of the proposed improved YOLO model over several existing state-of-the-art detection models.
Keywords Deep learning · Machine vision · Object detection · Precision agriculture · Improved YOLO
1 Introduction
At present, Malaysia contributes significantly to the world's palm oil production (39%) and exports (44%) [1]. Hence, Malaysia plays a vital role in fulfilling the growing global demand for palm oil. During the fresh fruit bunch harvesting process, a significant number of loose fruits detach and scatter on the ground. In general, loose fruits contain a high amount of oil, so their collection is important to maximise production and contribute to a higher national Oil Extraction Rate. The opportunity costs of the loose fruit collection process are that (1) not all of the loose fruits are collected, (2) loose fruits are bruised during collection, which lowers their quality, and (3) a longer time is taken to send them to the mill for processing, lowering the oil quality further. This is mainly due to the adoption of manual techniques that use equipment such as the sickle and chisel [2]. This technique is time consuming, practically inefficient, and extremely physically demanding. Additionally, human labour productivity is relatively stagnant over time, leading to a high direct cost of human workers. Thus, with the advancement of artificial intelligence, this work can be done more effectively by using a machine vision-based automatic harvesting system [3]. It is imperative to develop an accurate computer vision system to identify accurate information on the location of the fruits, as this is the initial step towards developing an automatic harvesting system, before developing the manipulation and grasping system [4]. Over the years, the performance of fruit detection systems has improved significantly; however, they are still far from practical application. The basic challenge in developing such fruit detection systems is the unrestrained and complex environment of the orchards.
Traditional machine learning classifiers, which adopt a hand-crafted features approach for crop detection, offer
* Anis Salwa Mohd Khairuddin
anissalwa@um.edu.my
1 Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
2 Advanced Technologies and Robotics, Sime Darby Technology Centre Sdn Bhd, 43400 Serdang, Selangor, Malaysia
considerably good detection accuracy [5–8]. However, these approaches are not robust towards highly challenging and complex conditions and require a tedious feature extraction process. Over the years, the advance of deep learning technology, particularly convolutional neural networks (CNNs), has remarkably improved the state of the art in object detection [9]. Deep learning enables the automatic extraction of multi-scale image features through self-learning using large amounts of data [10]. In general, there are two types of deep learning-based object detection methods: candidate region-based models and regression-based models. A two-stage detection model generates regions of interest in the first stage; subsequently, features are extracted from each candidate box for the bounding box regression and classification tasks [11]. A significant milestone in two-stage object detection was the development of the Region-based CNN (R-CNN) method [12]. Afterwards, more methods were introduced in an attempt to achieve better detection accuracy and faster detection time, for instance Fast R-CNN [13], Faster R-CNN [14], Mask R-CNN [15], and others. In a single-stage detection model, object detection is addressed as a single regression problem, where region detection and classification take place simultaneously in the network. Some of these methods are the Single Shot Detector (SSD) [16], You Only Look Once (YOLO) [17], YOLOv2 [18], YOLOv3 [19], RetinaNet [20], EfficientDet [21], and others. Generally, two-stage models are known for good detection accuracy, while one-stage models achieve faster detection speed. Currently, YOLOv3 is the state of the art for single-stage object detection. In addition, a lighter version of the model, YOLOv3 tiny, was developed for mobile device applications.
The visual information of oil palm loose fruits is important in developing an automatic harvesting system for precision agriculture. In order to achieve this task, a deep learning-based technique is employed to accurately and effectively detect the oil palm loose fruits based on image analysis. On top of that, the weight of the model is a key factor for embedded system and mobile device applications. Hence, this paper proposes an improved YOLO network for loose fruit detection in oil palm plantations, taking into consideration the significant advantages of the YOLOv3 tiny model. The main objective is to develop a robust and accurate detection system that requires low computational cost and is applicable in constrained environments. The main contributions of this work are as follows. First, a novel dataset of oil palm loose fruits was developed, acquired from an unmanned aerial vehicle (UAV) and a mobile camera. Then, the network structure of YOLOv3 tiny was improved in terms of the integration of a densely connected convolutional network, the swish activation function, additional prediction scale layers, and optimization of the prior boxes. The improved YOLO model achieved better detection performance and faster computation time, as well as a lighter model size.
The outline of this paper is arranged as follows: Sect. 2 presents the related works on fruit detection. Section 3 describes the development of the loose fruit dataset and explains the proposed methodology in detail. In Sect. 4, the experimental results are presented with a detailed discussion of the comparison analysis. Finally, Sect. 5 draws the conclusion and future work.
2 Related works

Convolutional neural networks (CNNs) are extensively used in real-life object detection applications such as crack detection [22, 23], medical abnormality detection [24, 25], face detection [26, 27], license plate detection [28, 29], traffic light detection [30], and others. More recently, CNNs have been extensively implemented in agricultural research, particularly in crop classification and detection. Substantial research has been proposed to develop accurate and robust crop detection systems. Chen et al. [31] proposed blob detectors based on fully convolutional neural networks for the extraction of candidate regions and segmentation of object areas, and adopted a CNN counting algorithm to calculate the number of fruits. In another study, fully convolutional neural networks were employed for the automatic detection of weeds [32]; this research considered images in which the leaves were occluded. Besides, a CNN and support vector machine (SVM) method was proposed for the automatic extraction of apple blossom features in a complex background [33]. It is proven that CNN-based detection models have achieved outstanding performance in object detection; however, there is still room for improvement, especially for detection under complex and occluded conditions. Therefore, several CNN-based algorithms were developed based on region-based and regression-based models.
Among R-CNN models, Faster R-CNN is commonly used for crop detection and classification. Sa et al. [34] proposed a Faster R-CNN network for accurate capsicum and rockmelon detection, and impressive results were achieved. In another study, Faster R-CNN was integrated with a novel multi-sensor framework and a multiple-viewpoint approach for the detection of mangoes [35]; excellent results were achieved for yield estimation. Furthermore, various studies have incorporated multiple types of image information with Faster R-CNN. For instance, Chen et al. [36] utilized RGB aerial orthoimages and Faster R-CNN with a ResNet-50 network to develop a strawberry detection system for yield estimation, while Gené-Mola et al. [37] used the radiometric capabilities of multi-modal images obtained from RGB-D sensors for apple detection. The two-stage detection
method is known for excellent localization and classification accuracy; nonetheless, its detection speed is slow, which is unsuitable for real-time fruit/crop detection.

Considering the advantage of one-stage detectors, which produce high detection speed, various studies have incorporated YOLO-based models for crop detection. In a study conducted by Koirala et al. [38], a novel YOLO model was developed by integrating features of YOLOv3 and YOLOv2 tiny; this model achieved high accuracy and speed for mango detection. Liu and Wang [39] proposed an improved YOLO method for tomato disease and pest detection utilizing multi-scale feature detection based on an image pyramid in order to increase the number of feature maps. The developed model is robust towards different target sizes and image resolutions. Furthermore, the DenseNet network was utilized to enhance the performance of the YOLOv3 model for apple [40], apple lesion [41], and tomato [42] detection; the implementation of feature reuse resulted in a significant increase in detection performance. Despite its excellent performance in real-time object detection, YOLOv3 generates a heavyweight model and requires a longer training process due to its network complexity. Hence, it is not applicable to small-scale embedded devices. Conversely, YOLOv3 tiny is a lightweight model that contains far fewer layers, which greatly improves the detection speed. However, its detection accuracy is considerably lower than that of the original version.
3 Materials and methods
The pipeline of the proposed methodology for loose fruit detection is shown in Fig. 1. The loose fruit detection model is developed in three main stages. Firstly, the loose fruit images were collected from a mobile camera and a UAV platform and went through data augmentation to form the loose fruit dataset; objects in the images were annotated with bounding boxes. Then, the YOLO network was optimized and trained on the developed datasets, and evaluation metrics were computed to validate the detection performance. Finally, the best model was selected for loose fruit detection on UAV images.
3.1 Dataset
3.1.1 Data acquisition
In this paper, loose fruit images were collected using two different cameras: a 5 MP camera with a 1/5″ CMOS sensor attached to a Tello UAV, and a 12 MP mobile camera. The dataset was acquired during the harvesting season in a Sime Darby oil palm plantation located in Selangor, Malaysia. The oil palm loose fruit images were captured at distances within the range of 15–100 cm from the ground and at various angles within the range of 30°–90° from the ground. The collection periods included 9 a.m., 1 p.m., and 5 p.m., during sunny and cloudy weather conditions. In order to improve the richness of the training dataset, all the images were taken under several conditions, which include occlusion, single and distributed fruits, dense overlap, and various illumination. There are 700 images, of which 200 were acquired from the UAV and 500 from the mobile camera. The dataset was then expanded to 6300 images by using data augmentation, where 70% of the dataset was used as the training set and 30% as the validation set. In addition, another 100 images comprising only UAV images were used as the test dataset to evaluate the performance of the proposed detection system.
Fig. 1 Overall architecture of the proposed methodology (dataset development: data acquisition, augmentation, and annotation; YOLO network development and optimization; model training, evaluation, and testing for loose fruit detection)
3.1.2 Data pre-processing
Data augmentation is an approach to enrich the variation of the training data by synthetically expanding the dataset. This process improves the network's capacity to generalize and mitigates overfitting. In this paper, four augmentation techniques were used: brightness transformation, rotation, blurring, and motion blur.

Brightness adjustment was used to simulate the actual conditions of an oil palm plantation under various lighting intensities; two numeric scalar values of +25 and −25 were selected to adjust the brightness of the images in the training set. The robustness of the training dataset was further improved by using a rotational technique: the original images were rotated by 90°, 180°, and 270° and horizontally mirrored. In practice, some of the collected images are likely to be blurry due to incorrect focus and camera movement. Therefore, Gaussian blur and motion blur were applied to simulate indistinct images. A Gaussian filter with a standard deviation of 1.5 was applied, while a numeric scalar of 10, defining the length of pixel motion at a 0° angle of motion, was used to simulate the motion blur effect. The complete augmented dataset is shown in Table 1.
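To make this pipeline concrete, the following is a minimal sketch of the augmentation step, assuming OpenCV and NumPy; the helper name and structure are illustrative rather than taken from the paper's code, but the parameter values mirror those stated above.

```python
import cv2
import numpy as np

def augment(image):
    """Produce the augmented variants described above for one 8-bit image."""
    variants = []
    # Brightness transformation with numeric scalars of +25 and -25.
    for delta in (25, -25):
        variants.append(cv2.convertScaleAbs(image, alpha=1.0, beta=delta))
    # Rotation by 90, 180 and 270 degrees, plus horizontal mirroring.
    variants.append(cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE))
    variants.append(cv2.rotate(image, cv2.ROTATE_180))
    variants.append(cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE))
    variants.append(cv2.flip(image, 1))
    # Gaussian blur with a standard deviation of 1.5 (kernel size derived).
    variants.append(cv2.GaussianBlur(image, (0, 0), sigmaX=1.5))
    # Motion blur: pixel motion of length 10 at a 0-degree angle,
    # implemented as an averaging kernel along the middle row.
    kernel = np.zeros((10, 10), dtype=np.float32)
    kernel[10 // 2, :] = 1.0 / 10
    variants.append(cv2.filter2D(image, -1, kernel))
    return variants
```

Each image yields eight augmented variants, which together with the originals is consistent with the 6300-image dataset in Table 1.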
The original dimensions of the training and validation set images were 1280 × 720 pixels. The images were then rescaled to 1000 × 750 pixels for the annotation process. The images were annotated using the open-source LabelImg tool, where bounding boxes of the loose fruits were drawn and classified manually. In this process, loose fruit samples that were rotten or in bad condition were not labelled.
3.2 The proposed improved YOLO-P network model

YOLOv3 tiny significantly improves detection speed due to its shallow backbone network. However, its detection accuracy is substantially lower than that of other algorithms. Therefore, this paper proposes a modified YOLO model based on the YOLOv3 tiny model with the aim of improving the detection and computation performance for use in embedded systems. In the proposed model, the network structure and depth of the YOLOv3 tiny model are modified. The proposed method is composed of several components: a feature extractor based on a densely connected neural network (DenseNet), the Swish function as the activation function, multi-scale target detection, and clustering for optimization of the anchor box sizes.
3.2.1 Densely connected neural network (DenseNet) as feature extractor
In a deeper convolutional neural network, the path for transmitting information from the input layer to the output layer grows longer, leading to the loss of feature information. DenseNet connects every layer to one another in a feed-forward architecture, thus providing maximum and strong gradient flow [43]. DenseNet operates through feature reuse, which produces more diverse features and richer patterns, helping to improve efficiency. Thus, the extracted features can be used effectively, particularly by the convolution layers at the later stages. Each layer $l$ obtains the feature maps from all preceding layers, as described in Eq. 1:

$$x_l = h_l\left(\left[x_0, x_1, \ldots, x_{l-1}\right]\right) \tag{1}$$

where $\left[x_0, x_1, \ldots, x_{l-1}\right]$ specifies the concatenation of the feature maps formed in layers $0, \ldots, l-1$, and $h_l$ represents the function that processes the spliced feature maps for the nonlinear transformation of the $x_0, x_1, \ldots, x_{l-1}$ layers, which involves a convolution layer (Conv), batch normalization (BN), and the swish activation function.
A DenseNet network consists of dense blocks, transition blocks, and dense layers. Within a dense block $D_n$, the width and height of a volume remain constant; however, the volume of the network expands at every layer due to the concatenation of the feature maps. Figure 2 gives the feature map dimensions at the bottom of each dense block. The final number of feature maps $k_l$ in each dense block is calculated using Eq. 2:

$$k_l = k_0 + k \times l \tag{2}$$

where $k$ is the growth rate, which gives the size of the information added to the subsequent layer, $k_0$ is the number of input feature maps, and $l$ refers to the number of dense layers in that specific dense block. A growth rate of 32 was used in this network. The concatenation of the new information onto the previous volume embodies the principle of feature reuse.

In addition, the layer between dense blocks is referred to as a transition block $T_n$, where a 1 × 1 convolution with 128 filters and 2 × 2 average pooling with a stride of 2 are performed. This consequently downsamples the number of feature maps and the size of the volume. A dense layer consists of Conv-BN-Swish (1 × 1) with 128 filters and Conv-BN-Swish (3 × 3) with 32 filters, where the volume of feature maps in each layer is increased by the growth rate. In this study, the DenseNet architecture was adopted as the network backbone, replacing the original feature extractor of YOLOv3 tiny. The detailed network architecture of the improved YOLO network is depicted in Fig. 2. The DenseNet structure is made up of 4 dense blocks composed of 6, 12, 44, and 16 dense layers, respectively.

Table 1 Loose fruit dataset expansion by using data augmentation methods

Dataset          Original data   Brightness   Rotation   Blur   Motion blur   Total
UAV              200             400          800        200    200           1800
Mobile camera    500             1000         2000       500    500           4500
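As an illustration of Eqs. 1 and 2, the following is a minimal PyTorch sketch of a single dense layer as described above; the paper's models were built in Darknet, so this module is illustrative only.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Conv-BN-Swish (1 x 1, 128 filters) then Conv-BN-Swish (3 x 3,
    growth-rate filters), with the input concatenated to the output."""
    def __init__(self, in_channels, growth_rate=32, bottleneck=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.SiLU(),  # SiLU is the swish activation f(x) = x * sigmoid(x)
            nn.Conv2d(bottleneck, growth_rate, 3, padding=1, bias=False),
            nn.BatchNorm2d(growth_rate),
            nn.SiLU(),
        )

    def forward(self, x):
        # Eq. 1: concatenate the new maps with all preceding feature maps,
        # so each layer adds the growth rate k = 32 channels (Eq. 2).
        return torch.cat([x, self.block(x)], dim=1)
```

Stacking 6, 12, 44, and 16 such layers, separated by the transition blocks, reproduces the channel counts annotated in Fig. 2 (e.g. 64 + 32 × 6 = 256 after the first block).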
3.2.2 Multi-scale target detection

In this study, multi-scale target detection based on a Feature Pyramid Network (FPN) was applied to enhance the detection performance of the proposed model, particularly for the detection of small targets. Two additional prediction scales were added to the network to extract the location information of small targets with fine-grained features. The feature maps generated by the DenseNet network were passed to the FPN.

At the early stages of a neural network, the convolution layers contain weak object information consisting only of low-level features from the input image. Hence, feature maps of the same scale from the early- and later-stage convolution layers are connected, which remarkably improves the detection of objects of different sizes. Moreover, an upsampling operation is utilized to effectively merge feature maps of different dimensions from layers at multiple stages; the features at each scale are upsampled by a factor of two. This allows the deep features and the characteristics of the hidden layers to be extracted by the full connection layer. Subsequently, 3 × 3 and 1 × 1 convolutional layers are applied to the combined feature maps to integrate the features from the earlier-stage layers, and the final feature map is then obtained by using a BN layer. These network structures were implemented in each of the prediction layers.

Four detection scales with various map sizes are proposed for the improved YOLO model: 13 × 13 for the large scale, 26 × 26 for the medium scale, 52 × 52 for the small scale, and 104 × 104 for the smallest scale. The detailed network structure of the improved YOLO model is illustrated in Fig. 3, where it is compared with the original YOLOv3 tiny model. Overall, the improved YOLO model contains 267 layers, while YOLOv3 tiny has only 23.
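The merge step in each prediction layer can be sketched as follows, again as an illustrative PyTorch module rather than the Darknet implementation: the deeper map is upsampled by a factor of two, concatenated with the early-stage map of the same spatial size, and fused by the 3 × 3 and 1 × 1 convolutions followed by a BN layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleMerge(nn.Module):
    """Fuse a deep feature map with an early-stage map of matching size."""
    def __init__(self, deep_ch, early_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(deep_ch + early_ch, out_ch, 3, padding=1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, deep, early):
        # Double the spatial size of the deep map so the scales match,
        # then concatenate along the channel dimension and fuse.
        up = F.interpolate(deep, scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([up, early], dim=1))
```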
3.2.3 Swish activation function
The Swish activation function was designed based on the use of the sigmoid function for gating in long short-term memory and highway networks [44]. The function adopts a self-gating mechanism, where the same value is used to gate itself. Unlike normal gating, which needs multiple scalar inputs, self-gating requires only a single scalar input. This allows Swish to replace single scalar-input activation functions, for example the Rectified Linear Unit (ReLU) function, without changing the number of parameters or the hidden size.

The Swish activation function integrates the input and the sigmoid activation function; it was found through an automatic search method based on reinforcement learning. The Swish function is defined as:

$$f(x) = x \cdot \mathrm{sigmoid}(x) \tag{3}$$

where the sigmoid function is described as:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
Fig. 2 Structure diagram of the improved YOLO network. In the dense layers, the size of the feature map grows by the growth rate of 32. In the transition layers, the number of feature maps and the size of the volume are downsampled by half
The general properties of the Swish function include being bounded below, unbounded above, non-monotonic, and smooth. The smoothness characteristic is a crucial factor in achieving better optimization and generalization during the training of deep learning architectures. The non-monotonicity of Swish enhances the gradient flow, which offers some robustness to various initializations and learning rates; this feature differentiates Swish from other common activation functions. The swish activation function was used in the feature extractor and the YOLO detection layers of the proposed model.
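A tiny NumPy illustration of Eqs. 3 and 4 shows the non-monotonic dip below zero that distinguishes Swish from ReLU (printed values are approximate):

```python
import numpy as np

def swish(x):
    """Swish activation, Eq. 3: f(x) = x * sigmoid(x), with the sigmoid
    of Eq. 4 inlined."""
    return x / (1.0 + np.exp(-x))

# Unlike ReLU, small negative inputs give small negative outputs.
print(swish(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# approximately [-0.033, -0.269, 0.0, 0.731, 4.967]
```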
3.2.4 Clustering of anchor box dimensions

The existing YOLO algorithm initializes the dimensions of the anchor boxes based on the MS-COCO dataset [45], which may not be suitable for a different dataset. As a result, it is challenging to obtain accurate bounding box information. Hence, it is critical to determine appropriate anchor boxes that fit the loose fruit dataset. In order to obtain appropriate dimensions for the candidate boxes, the k-means clustering algorithm was implemented on the training set. The developed clusters demonstrate the sample distribution in the dataset, which assists the network to learn easily and perform better predictions. The average intersection over union (IoU) was employed as the objective function to evaluate the clustered boxes. The IoU ratio and average IoU are defined in Eqs. 5 and 6:

$$\mathrm{IoU}(\mathrm{box}, \mathrm{centroid}) = \frac{B_{gt} \cap B_c}{B_{gt} \cup B_c} \tag{5}$$

$$\mathrm{Avg\ IoU} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_k} \mathrm{IoU}(\mathrm{box}, \mathrm{centroid})}{n} \tag{6}$$
Fig. 3 Network structure of YOLOv3 tiny and the improved YOLO. Each detection network is connected to the convolutional layer with the same feature map dimension in the earlier stage for feature extraction
IoU describes the overlap area between the ground truth box $B_{gt}$ and the clustered bounding box $B_c$. The centroid represents the centre of a cluster, while the box is the ground truth of the target. $k$ denotes the overall number of samples, while $n$ denotes the number of clusters, and $n_k$ refers to the number of samples in the $k$th cluster centre.
The average IoU plotted against the number of clusters is shown in Fig. 4. The objective function increased and became more stable as the number of clusters increased. A greater number of anchor boxes results in greater overlap between the anchor boxes and the bounding boxes; however, it directly increases the number of convolution layers in the prediction layers, producing a bigger network along with a higher computation time. For this reason, based on the average IoU and the number of prediction layers in the improved model, twelve anchor boxes were selected, with each prediction layer assigned three box sizes. At twelve anchor boxes, the average IoU was 79.8%. The widths and heights of the anchor boxes assigned to each of the prediction scales are shown in Table 2.
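A minimal sketch of this prior-box optimization is given below, assuming NumPy; the function names are illustrative. Following the standard anchor-box k-means, all (width, height) pairs are treated as boxes sharing a common corner, and 1 − IoU serves as the distance measure.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """Eq. 5 for (width, height) pairs that share a common corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=12, iters=100, seed=0):
    """Cluster ground-truth box sizes into k anchor priors."""
    boxes = np.asarray(boxes, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = iou_wh(boxes, centroids).argmax(axis=1)
        for i in range(k):
            if np.any(assign == i):  # keep the old centroid if empty
                centroids[i] = boxes[assign == i].mean(axis=0)
    avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()  # cf. Eq. 6
    return centroids, avg_iou
```

Run on the annotated training boxes with k = 12, a procedure of this kind yields twelve priors such as those in Table 2, assigned three per prediction scale.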
4 Experiments and discussion

In this section, multiple extensive experiments were carried out to validate the reliability of the proposed improved YOLO model for loose fruit detection in an oil palm plantation. Firstly, the experimental settings are introduced. Then, the evaluation metrics are explained. Finally, the comparative experimental results are discussed and analysed comprehensively.
4.1 Experimental setup
In this study, all of the experiments were performed using an Intel Core i5-9300H @ 2.4 GHz processor with 8 GB RAM on a Windows 10 64-bit operating system, and an NVIDIA GeForce GTX 1050 Ti graphics card with 4 GB of VRAM. The YOLO-based models were trained in the Darknet framework, while Faster R-CNN and SSD were trained in the TensorFlow environment. In order to enable a fair comparison between the results of all the experimental configurations, the hyperparameters for all of the YOLO-based models were standardized. The input images to the network were set to 416 × 416 pixels. The batch size was set to 64 with a subdivision of 32. Moreover, a momentum of 0.9 was used to adjust the network parameters, while a weight decay of 0.0005 was utilized to prevent overfitting. The initial learning rate was set to 0.001, and the models were trained for up to 200 epochs. Considering the number of images in the dataset and the number of images processed in one batch, about 100 iterations per epoch were performed during the training process. A steps policy was adopted, where the learning rate is reduced to 0.0001 and 0.00001 at 160 and 180 epochs, respectively.
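For reference, these settings correspond to a [net] section of a Darknet configuration roughly like the sketch below; the field names follow standard Darknet conventions, and max_batches and steps are derived from the stated ~100 iterations per epoch rather than quoted from the paper.

```
[net]
width=416
height=416
batch=64
subdivisions=32
momentum=0.9
decay=0.0005
learning_rate=0.001
max_batches=20000     # ~200 epochs x ~100 iterations per epoch
policy=steps
steps=16000,18000     # i.e. 160 and 180 epochs
scales=.1,.1          # learning rate -> 0.0001, then 0.00001
```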
Fig. 4 Clustering for optimization of anchor box dimensions (average IoU (%) versus number of clusters)
Table 2 Anchor box sizes at the four prediction scales

Scale   Dimension    Anchor box sizes
1       104 × 104    (7, 10), (12, 12), (9, 16)
2       52 × 52      (13, 20), (19, 21), (17, 30)
3       26 × 26      (27, 35), (30, 59), (47, 44)
4       13 × 13      (44, 66), (51, 91), (68, 112)
4.2 Evaluation metrics

Several indicators were applied to evaluate the effectiveness of the proposed loose fruit detection model in terms of detection and computational performance. Average precision (AP) is defined as the area under the precision-recall curve over various detection thresholds, as described in Eq. 7:

$$AP = \int_0^1 Pr(Rc)\, dRc \tag{7}$$

Pr and Rc are the precision and recall, respectively. Precision measures the correct predictions when the ground truth boxes are matched by the predicted bounding boxes. On the other hand, recall measures the probability of correct detection of the ground truth objects. The F1-score describes the combined measure of recall and precision through their harmonic mean, where an F1-score of 1 represents the best value. The precision, recall, and F1-score are computed using Eqs. 8, 9, and 10:

$$Pr = \frac{TP}{TP + FP} \tag{8}$$

$$Rc = \frac{TP}{TP + FN} \tag{9}$$

$$F1 = \frac{2 \times Pr \times Rc}{Pr + Rc} \tag{10}$$

A true positive (TP) is a correct classification result where a loose fruit is correctly detected with an IoU over the 0.5 threshold. A false positive (FP) refers to a falsely detected loose fruit, while a missed loose fruit is denoted as a false negative (FN). IoU refers to the overlap between the ground truth and the bounding box predicted by the proposed model, as described in Eq. 5, while the average IoU is determined over the number of images. Moreover, the time required to perform detection per image is measured to evaluate the detection speed. These metrics were applied to investigate the detection performance of the proposed model.

On the other hand, the computational performance of the proposed model is validated by conducting a comparative analysis on several metrics: floating point operations (FLOPs), number of parameters, model size, and computation time. FLOPs are used to estimate the amount of calculation in the network layers of a CNN model [46]. This metric is closely related to the computation time: the larger the FLOPs, the longer it takes to train a model. Additionally, the model size refers to the storage space of the trained model, which is associated with the number of parameters generated by the network.
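As a concrete illustration of Eqs. 8 to 10, the short sketch below computes the three scores from raw detection counts; the counts in the example are made up.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1-score (Eqs. 8-10) from detection counts.
    A detection counts as TP when its IoU with a ground truth box
    exceeds the 0.5 threshold."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 98 correct detections, 2 false alarms, 2 misses.
print(detection_metrics(98, 2, 2))  # (0.98, 0.98, 0.98)
```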
4.3 Experimental results

In this section, the proposed improved YOLO model is compared with several detection models: the one-stage detection methods YOLOv3, YOLOv3 tiny, YOLOv2, and SSD with MobileNet, and the state-of-the-art two-stage detection method, Faster R-CNN with ResNet101. The performance of the proposed model is validated in terms of detection and computation performance.
4.3.1 Comparison of detection performance

The loss curves during the training stage of the proposed improved YOLO method and the other YOLO-based methods are compared in Fig. 5 (left). Experimental results show that the loss decreased rapidly and started to converge at 40 epochs for the YOLOv3, YOLOv2, and improved YOLO methods, with final average losses of approximately 0.2466, 0.3178, and 0.1603, respectively. On the other hand, YOLOv3 tiny converged more slowly, with large fluctuations, and achieved a final average loss of 1.1348.
Fig. 5 Validation of training performance for the YOLO-based models. Left: comparison of loss curves (loss versus epochs). Right: comparison of AP curves (average precision (%) versus epochs)
It can be noted that the proposed improved YOLO model reduced the final average loss by 0.9745 compared with the original YOLOv3 tiny.

The trained model is further validated in terms of its AP. The comparison of the AP curves of the YOLO-based models is shown in Fig. 5 (right). Results show that the highest AP is achieved by the proposed improved YOLO model, followed by the YOLOv3, YOLOv3 tiny, and YOLOv2 models. Despite producing a higher average loss, YOLOv3 tiny achieved a higher AP than the YOLOv2 model. Both models suffer in the detection of small objects, leading to lower AP values.

The comparative detection performance of the models is highlighted in Table 3. It is worth noting that the AP is enhanced by the proposed improved YOLO model compared with the other models. The model obtained a maximum AP of 99.76% at 0.5 IoU, an increase of 0.14% over the conventional YOLOv3 model (99.62%). This shows the effectiveness of using the swish activation function and feature reuse, where richer information is obtained from every layer. Moreover, the use of feature maps connected at different levels has enhanced the ability of the model to detect loose fruits of different sizes. On the other hand, the YOLOv3 tiny and YOLOv2 models achieved APs of 88.70% and 79.04%, mainly owing to problems in detecting small loose fruits due to their lower numbers of convolution and detection layers. Faster R-CNN-ResNet101 achieved a considerably good AP of 94.46%, while SSD-MobileNet obtained the lowest AP of 66.70% due to low accuracy in the detection of small objects. At IoU greater than 0.75, the improved YOLO model showed the lowest percentage reduction in AP, 7.79%, compared with the other models: YOLOv3 (8.27%), YOLOv2 (32.57%), YOLOv3 tiny (52.48%), SSD-MobileNet (72.86%), and Faster R-CNN-ResNet101 (27.69%). The proposed model also obtained the highest AP of 74.72% at IoU = 0.5:0.95, which demonstrates its effectiveness in aligning the bounding boxes closely with the ground truth at higher IoU thresholds.
In addition, the proposed model yielded the highest average IoU of 83.26%, showing that its bounding boxes overlapped with the ground truth at a higher percentage. In terms of F1-score, the proposed model achieved the same value as the YOLOv3 model at 0.98, demonstrating excellent overall precision and recall performance. The precision-recall curves of the detection models are illustrated in Fig. 6. In terms of average detection time, overall, the one-stage detection methods (YOLO-based and SSD) are faster than the two-stage detection method (Faster R-CNN). Despite having a higher number of convolutional layers, the proposed model achieved a faster detection speed than the YOLOv3 model: the Darknet53 backbone used in YOLOv3 is a complex network with an excessively large number of parameters, leading to slow detection. Conversely, the improved YOLO model is slower than YOLOv3 tiny, YOLOv2, and SSD-MobileNet owing to the higher number of concatenation layers in its feature extractor. The YOLOv3 tiny and YOLOv2 methods incorporate shallow structures, hence their faster detection speed. Faster R-CNN generates region proposals and performs object detection in two discrete stages, hence its slow detection time. On the contrary, SSD uses various feature maps for the prediction of the boundary boxes and the classes in one single stage. The detection time of SSD-MobileNet is faster than the full versions of YOLOv2 and YOLOv3 as well as the proposed model; however, its AP is the lowest among all the models. Consequently, both the Faster R-CNN-ResNet101 and SSD-MobileNet models are not applicable for the detection of loose fruits. The results obtained for the one-stage detection methods suggest that the increase in detection speed comes at the expense of reduced detection accuracy. Considering the improved detection time achieved by the proposed model, it is able to fully meet the requirements of real-time application.
4.3.2 Comparison of computational performance

The comparative results of the computational performance are shown in Table 4. In general, it can be observed that the proposed improved YOLO model achieved better computational performance than the state-of-the-art YOLOv3 model.
Table 3 Comparison of detection performance for several detection models

Method                    AP0.5 (%)   AP0.75 (%)   AP0.5:0.95 (%)   Average IoU (%)   F1-score   Detection time (ms)
SSD-MobileNet             66.70       18.10        27.60            52.41             0.65       13.58
Faster R-CNN-ResNet101    94.46       68.30        62.00            75.38             0.93       107.10
YOLOv3 tiny               88.70       42.15        46.13            67.53             0.89       5.08
YOLOv2                    79.04       53.30        51.12            67.28             0.81       18.01
YOLOv3                    99.62       91.38        73.74            83.24             0.98       40.63
Improved YOLO             99.76       91.99        74.72            83.26             0.98       34.06

The AP values are obtained at three IoU thresholds: IoU = 0.5, IoU = 0.75, and the average over IoU from 0.5 to 0.95 with a 0.05 interval. The average IoU, F1-score, and detection time are evaluated at 0.5 IoU
The proposed model generated 30.926 billion FLOPs (BFLOPs), about 52.64% less than the YOLOv3 model. This demonstrates the suitability of the proposed model for operation in computability-constrained environments, which directly affects the computational time: the proposed model computed 1.68 times faster than YOLOv3. Furthermore, the proposed model has a model size of 58 MB, which is 4.1, 3.3, and 3.1 times smaller than YOLOv3 (235 MB), YOLOv2 (193 MB), and Faster R-CNN-ResNet101 (181 MB), respectively. The use of the DenseNet structure reduced the number of parameters in the network, resulting in a smaller model size; developing a smaller model is imperative for reducing the cost of the embedded system. The implementation of the DenseNet architecture as the network backbone together with the multiple detection layers significantly increased the number of layers in the network; hence, the BFLOPs grew by a factor of about 5.7 over the conventional YOLOv3 tiny model. Nonetheless, the size of the trained model increased by only 1.8 times. YOLOv3 and YOLOv2 produced bigger trained models due to the higher numbers of parameters generated by their complex feature extractor networks, namely Darknet53 and Darknet19. On the other hand, YOLOv3 tiny incorporates a shallow network structure, resulting in a faster computational time and a small model size. Despite requiring a short training time, Faster R-CNN-ResNet101 is too slow for real-time applications, while SSD-MobileNet is less accurate. The remarkable improvement in the computation time and model size of the proposed model allows it to be used on low-configuration hardware with less memory and lower CPU performance.
4.3.3 Detection on UAV test images

In order to further investigate the reliability of the proposed improved YOLO model, a total of 100 UAV test images containing 1117 ground truth loose fruits were used. Figure 7 presents the results for 3 randomly selected test images that contain big, medium, and small loose fruit sizes. The difference in fruit sizes was due to the distances between the UAV and the ground during the image acquisition process. The experimental results demonstrate that the proposed model is able to handle object detection at different scale variations, where most of the loose fruits were successfully detected. In real conditions, the loose fruits may be occluded under various circumstances, which directly affects the performance of the detection model.
Fig. 6 Precision-recall curves of the detection methods. The AP is determined at IoU = 0.5
Table 4 Computational performance of detection models

Method                    BFLOPS   Average runtime   Parameters (millions)   Model size (MB)
SSD-MobileNet             –        2 h 30 min        5.49                    22.0
Faster R-CNN-ResNet101    –        2 h 45 min        62.11                   181.0
YOLOv3 tiny               5.451    8 h 20 min        8.25                    33.0
YOLOv2                    29.342   16 h 20 min       48.25                   193.0
YOLOv3                    65.304   36 h 20 min       58.75                   235.0
Improved YOLO             30.926   21 h 40 min       14.5                    58.0
Results show that loose fruits occluded by grass, small plants, or fronds were successfully detected by the proposed model, as shown in Fig. 8. Additionally, the proposed model also performed well under different lighting conditions, as depicted in Fig. 9. The experimental results exhibit the effectiveness of the proposed model in detecting loose fruits under different scenarios with good classification accuracy.
4.3.4 Robustness test

Loose fruit detection in an oil palm plantation using a UAV platform is a challenging task due to several factors such as the complex environment and the camera configuration. In order to verify the robustness of the proposed method, a series of image augmentations was conducted to create synthetic test images. Three different augmentation methods were selected, namely brightness transformation (Br), Gaussian blur (Gb), and motion blur (Mb), replicating the factors that may affect the visual quality of the images captured at the oil palm plantation. Three levels of brightness transformation were incorporated to simulate various illumination conditions: numeric scalars of +25 and +50 were used to represent brighter conditions (Br1 and Br2), while a numeric scalar of −25 was used to represent a darker condition (Br3). Besides, loss of focus during the image acquisition process may cause the image to be blurred; therefore, three levels of Gaussian filter with standard deviations of 1 (Gb1), 2 (Gb2), and 3 (Gb3) were applied to the test images to simulate the blur effect. In order to simulate the motion blur effect, a zero-degree angle of motion and numeric scalars of 5 (Mb1), 10 (Mb2), and 15 (Mb3) specifying the length of pixel motion were applied.
Fig. 7 Detection results of loose fruits using the proposed improved YOLO with the transfer learning method. Left: detection of large targets. Middle: detection of medium targets. Right: detection of small targets
Fig. 8 Detection results under occlusion condition
Fig. 9 Detection results under different lighting condition
The comparison of the experimental results on the synthetic test images is illustrated in Fig. 10. It can be noted that the F1-score under brightness transformation is slightly reduced compared with the original test images. In the case of Gaussian and motion blur, both effects show a similar trend, where the F1-score decreases as the blur effect on the images increases. This is mainly attributable to the higher number of missed loose fruits, clearly shown by the substantial decrease in the recall value. On the other hand, the number of falsely detected loose fruits increased only slightly under both blurring methods, thus only slightly reducing the precision value. Under challenging conditions due to the lighting, blurring, and motion blur effects, the average precision values are 0.967, 0.961, and 0.957, respectively. Despite the reduction in precision and recall on the augmented test images, the overall average F1-score is 0.928, which is considered feasible for hardware implementation. This shows the robustness of the proposed detection model on challenging UAV-based images, an important aspect for the development of an accurate and robust loose fruit detection system for an automated harvesting system.
4.3.5 Performance evaluation on a public dataset

In this section, the proposed improved YOLO model was trained and validated on a drone-based public dataset, VisDrone 2018-Det [47]. The dataset provides various types of vehicles acquired by drone platforms against various backgrounds and at various heights. The VisDrone 2018-Det dataset contains 6471 images in the training set and 548 images in the validation set. The dataset is made up of images at various resolutions ranging between 960 × 540 and 2000 × 1500 pixels. The 10 categories of objects contained in this dataset are pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.
In order to demonstrate the reliability of the proposed model, it was compared with several benchmark models. The input images for the training process were set to 416 × 416 pixels. The experimental results for detection on the VisDrone 2018-Det dataset are shown in Table 5. It is notable that the proposed model outperforms the other detection models with an mAP of 33.33% and an F1-score of 0.51. The mAP was improved by 3.66% and 69.52% over the state-of-the-art YOLOv3 and the original YOLOv3 tiny models, respectively. In addition, the improved YOLO model obtained the highest precision value, while its recall value is slightly lower than that of YOLOv3. Mixed YOLOv3-lite utilizes a shallow backbone with the addition of residual blocks and parallel high-to-low subnetworks to
Fig. 10 Detection results on synthetic test images (precision, recall, and F1-score for the original and each augmented test set)
form shallow and deep structures [48], while SlimYOLOv3 [49] incorporates channel pruning on the convolutional layers to obtain a slim detector. Both models were trained at a higher input size of 832 × 832 pixels; however, the mAP achieved is lower than that of the improved YOLO and YOLOv3 models.
5 Conclusion

In this paper, an improved YOLO model is proposed for the accurate detection of oil palm loose fruits under various natural conditions. Improvements were made by using a DenseNet backbone to enhance feature propagation between the network layers, the swish activation function, four-scale feature detection based on a feature pyramid network, and optimized selection of the initial bounding boxes. The experimental results show that the proposed improved YOLO model exhibits outstanding detection and computational performance. The proposed model achieved an AP of 99.76%, an F1-score of 0.98, and an average IoU of 83.26%, while its average detection time is faster than that of the state-of-the-art YOLOv3 model. In addition, the proposed model was also tested on a public dataset, VisDrone 2018-Det, which demonstrated its effectiveness over the previous detection models. In general, the developed improved YOLO model is suitable and applicable for an oil palm loose fruit automatic detection system. The proposed method could provide visual information for the development of an automatic loose fruit harvesting system.
6 Future scope

Future work will concentrate on the optimization of the proposed model in order to reduce the detection time while maintaining the overall accuracy. Moreover, visual information on the loose fruit condition, including damaged, rotten, and unripe loose fruits, will be incorporated in order to develop a more robust loose fruit detection model. Besides, the research will be further extended to the development of a counting algorithm for yield estimation in real-time applications.
Acknowledgements The research funding was provided by RU
Grant—Faculty Programme by Faculty of Engineering, University of
Malaya with Project No. GPF042A-2019 and Industry-driven Inno-
vation Grant (IDIG) with Project No.: PPSI-2020-CLUSTER-SD01.
Declarations
Conflict of interest The authors declare no conflict of interest.
References
1. MPOC: Malaysian Palm Oil Council, http://www.mpoc.org.my, accessed 15 September 2020
2. Idrees, A.: Malaysia Palm Oil Industry, http://www.mpoc.org.my/Malaysian_Palm_Oil_Industry.aspx, accessed 15 September 2020
3. Zhao, Y., Gong, L., Huang, Y., Liu, C.: A review of key techniques
of vision-based control for harvesting robot. Comput. Electron.
Agric. 127, 311–323 (2016)
4. Mairon, R., Edan, Y.: Computer vision for fruit harvesting
robots—state of the art and challenges ahead. Int. J. Comput. Vis.
Robot. 3, 4–34 (2012)
5. Yamamoto, K., Guo, W., Yoshioka, Y., Ninomiya, S.: On plant
detection of intact tomato fruits using image analysis and machine
learning methods. Sensors 14(7), 12191–12206 (2014)
6. Maldonado, W., Barbosa, J.C.: Automatic green fruit counting in
orange trees using digital images. Comput. Electron. Agric. 127,
572–581 (2016)
7. Qureshi, W.S., Payne, A., Walsh, K.B., Linker, R., Cohen, O.,
Dailey, M.N.: Machine vision for counting fruit on mango tree
canopies. Precis. Agric. 18, 224–244 (2016)
8. Hamza, R., Chtourou, M.: Design of fuzzy inference system for
apple ripeness estimation using gradient method. IET Image Pro-
cess. 14, 561–569 (2020)
9. Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X.,
Pietikäinen, M.: Deep learning for generic object detection: A
survey. Int. J. Comput. Vis. 128, 261–318 (2020)
10. Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with
deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst.
30, 3212–3232 (2019)
11. Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L., Feng, Z., Qu, R.: A
survey of deep learning-based object detection. IEEE Access. 7,
128837–128868 (2019)
12. Girshick, R., Donahue, J., Darrell, T., Malik, J., Berkeley, U.C.:
Rich feature hierarchies for accurate object detection and semantic
segmentation. In: 2014 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 580–587 (2014)
Table 5 Detection results on VisDrone 2018-Det dataset

Model                      Input size   Precision   Recall   F1-score   mAP (%)
Mixed YOLOv3-lite [48]     832          0.39        0.38     0.38       28.50
SlimYOLOv3-spp3-50 [49]    416          0.39        0.24     0.30       15.70
SlimYOLOv3-spp3-50 [49]    832          0.46        0.36     0.39       25.80
YOLOv3 tiny [4]            416          0.33        0.23     0.27       10.16
YOLOv3 [4]                 416          0.54        0.47     0.50       32.11
Improved YOLO              416          0.68        0.40     0.51       33.33

The evaluation metrics are determined at IoU = 0.5
13. Girshick, R.: Fast R-CNN. In: IEEE International Conference on
Computer Vision Fast, pp. 1440–1448 (2015)
14. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN : Towards
real-time object detection with region proposal networks. IEEE
Trans. Pattern Anal. Mach. Intell. 36, 1–14 (2017)
15. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In:
Proceedings of the IEEE international conference on computer
vision, pp. 2980–2988 (2017)
16. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.,
Berg, A.C.: SSD : single shot multibox detector. In: European
Conference on Computer Vision, pp. 21–37 (2016)
17. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only
look once : unified, real-time object detection. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 779–788
(2016)
18. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: YOLO9000:
Better, faster, stronger. In: IEEE conference on Computer Vision
and Pattern Recognition, pp. 6517–6525 (2017)
19. Redmon, J., Farhadi, A.: YOLOv3 : An incremental improvement.
In: IEEE Conference on Computer Vision and Pattern Recognition
(2018)
20. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal Loss for
Dense Object Detection. In: IEEE transactions on pattern analysis
and machine intelligence. pp. 318–327 (2020)
21. Tan, M., Pang, R., Le, Q. V.: EfficientDet: Scalable and efficient
object detection. In: Proceedings of the IEEE computer soci-
ety conference on computer vision and pattern recognition, pp.
10778–10787 (2020)
22. Li, Y., Han, Z., Xu, H., Liu, L., Li, X., Zhang, K.: YOLOv3-lite:
a lightweight crack detection network for aircraft structure based
on depthwise separable convolutions. Appl. Sci. 9, 3781 (2019)
23. Park, S.E., Eem, S.H., Jeon, H.: Concrete crack detection and
quantification using deep learning and structured light. Constr.
Build. Mater. 252, 119096 (2020)
24. Xi, P., Guan, H., Shu, C., Borgeat, L., Goubran, R.: An integrated
approach for medical abnormality detection using deep patch con-
volutional neural networks. Vis. Comput. 36, 1869–1882 (2020)
25. Ozturk, T., Talo, M., Yildirim, E.A., Baloglu, U.B., Yildirim, O.,
Rajendra Acharya, U.: Automated detection of COVID-19 cases
using deep neural networks with X-ray images. Comput. Biol.
Med. 121, 103792 (2020)
26. Villamizar, M., Sanfeliu, A., Moreno-Noguer, F.: Online learning
and detection of faces with low human supervision. Vis. Comput.
35, 349–370 (2019)
27. Chen, W., Huang, H., Peng, S., Zhou, C., Zhang, C.: YOLO-face:
a real-time face detector. Vis. Comput. 37, 1–9 (2020)
28. Min, W., Li, X., Wang, Q., Zeng, Q., Liao, Y.: New approach to
vehicle license plate location based on new model YOLO-L and
plate pre-identification. IET Image Process. 13, 1041–1049 (2019)
29. Hendry, R.C.: Automatic license plate recognition via sliding-
window darknet-YOLO deep learning. Image Vis. Comput. 87,
47–56 (2019)
30. Lee, E., Kim, D.: Accurate traffic light detection using deep neural
network with focal regression loss. Image Vis. Comput. 87, 24–36
(2019)
31. Chen, S.W., Shivakumar, S.S., Dcunha, S., Das, J., Okon, E., Qu,
C., Taylor, C.J., Kumar, V.: Counting apples and oranges with
deep learning : A data driven approach. IEEE Robot. Autom. Lett.
2, 781–788 (2017)
32. Dyrmann, M., Jørgensen, R.N., Midtiby, H.S.: RoboWeedSupport
- Detection of weed locations in leaf occluded cereal crops using
a fully convolutional neural network. Adv. Anim. Precis. Agric.
8, 842–847 (2017)
33. Dias, P.A., Tabb, A., Medeiros, H.: Apple flower detection using
deep convolutional networks. Comput. Ind. 99, 17–28 (2018)
34. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: DeepFruits: a fruit detection system using deep neural networks. Sensors 16(8), 1222 (2016)
35. Stein, M., Bargoti, S., Underwood, J.: Image based mango fruit detection, localisation and yield estimation using multiple view geometry. Sensors 16(11), 1915 (2016)
36. Chen, Y., Lee, W.S., Gan, H., Peres, N., Fraisse, C., Zhang, Y., He, Y.: Strawberry yield prediction based on a deep neural network using high-resolution aerial orthoimages. Remote Sens. 11, 1–21 (2019)
37. Gené-Mola, J., Vilaplana, V., Rosell-Polo, J.R., Morros, J.R., Ruiz-Hidalgo, J., Gregorio, E.: Multi-modal deep learning for Fuji apple detection using RGB-D cameras and their radiometric capabilities. Comput. Electron. Agric. 162, 689–698 (2019)
38. Koirala, A., Walsh, K.B., Wang, Z., McCarthy, C.: Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of MangoYOLO. Precis. Agric. 20, 1107–1135 (2019)
39. Liu, J., Wang, X.: Tomato diseases and pests detection based on improved Yolo V3 convolutional neural network. Front. Plant Sci. 11, 1–12 (2020)
40. Tian, Y., Yang, G., Wang, Z., Wang, H., Li, E., Liang, Z.: Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 157, 417–426 (2019)
41. Tian, Y., Yang, G., Wang, Z., Li, E., Liang, Z.: Detection of apple lesions in orchards based on deep learning methods of CycleGAN and YOLOV3-Dense. J. Sensors 2019, 1–13 (2019)
42. Liu, G., Nouaze, J.C., Mbouembe, P.L.T., Kim, J.H.: YOLO-tomato: a robust algorithm for tomato detection based on YOLOv3. Sensors 20(7), 2145 (2020). https://doi.org/10.3390/s20072145
43. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
44. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function. In: Neural and Evolutionary Computing, pp. 1–12 (2017). arXiv:1710.05941
45. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Computer Vision—ECCV 2014. Lecture Notes in Computer Science, pp. 740–755 (2014)
46. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015). arXiv:1506.02626v3
47. Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: past, present and future. In: Computer Vision and Pattern Recognition, pp. 1–20 (2020). arXiv:2001.06303
48. Zhao, H., Zhou, Y., Zhang, L., Peng, Y., Hu, X., Peng, H., Cai, X.: Mixed YOLOv3-LITE: a lightweight real-time object detection method. Sensors 20(7), 1861 (2020)
49. Zhang, P., Zhong, Y., Li, X.: SlimYOLOv3: narrower, faster and better for real-time UAV applications. In: 2019 International Conference on Computer Vision Workshop, pp. 37–45 (2019)
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Automatic detection ofoil palm fruits fromUAV images using animproved YOLO model
1 3
Mohamad Haniff Junos was born in Perak, Malaysia. He received the B.S. degree in Aerospace Engineering from Universiti Sains Malaysia in 2013 and the M.S. degree from Universiti Teknologi Malaysia in 2016. He is currently pursuing the Ph.D. degree in Electrical Engineering at the University of Malaya. His research interests include the application of deep learning to agricultural image analysis and to the vision systems of agricultural robots.
Anis Salwa Mohd Khairuddin received the Bachelor of Electrical Engineering from Universiti Tenaga Nasional, Malaysia, in 2008, and the Master of Computer Engineering from the Royal Melbourne Institute of Technology (RMIT), Australia, in 2010. She received her Ph.D. degree in Electrical Engineering from Universiti Teknologi Malaysia in 2014. She is currently a senior lecturer at the Department of Electrical Engineering, Faculty of Engineering, University of Malaya, Malaysia. Her research interests are in the areas of expert systems (machine learning, optimization), agriculture robotics and automation (classification, control systems, intelligent systems) and image processing. Her detailed CV is available at https://umexpert.um.edu.my/anissalwa.
Subbiah Thannirmalai received the M.E. degree in Electrical and Electronic Engineering from Imperial College London in 2013. He is a scientist with Autonomous Technologies and Robotics at Sime Darby Research, a subsidiary of Sime Darby Plantation. He has been conducting research in the palm oil industry for the past six years, with a focus on precision agriculture and oil palm estate automation. His research projects include an autonomous loose fruit collector, estate worker productivity monitoring, worker assistive exoskeleton devices and an autonomous bunch weighing system for mini-tractor grabbers.
Mahidzal Dahari is currently an Associate Professor in the Department of Electrical Engineering, Faculty of Engineering, University of Malaya. He is an established academician and a practicing engineer who has published more than 50 articles to date in journals and conference proceedings, both locally and internationally. His research interests include instrumentation, automation, robotics and control strategy, especially related to manufacturing technology, industrial robots and oil and gas applications. Dr. Mahidzal is actively involved in many consultancy projects, especially in the fields of fluid flow, natural gas and hydrogen technology.