# Detecting and Counting Sheep with a Convolutional Neural Network

## Abstract and Figures

Counting livestock is generally done only during major events, such as drenching, shearing or loading, and thus farmers get stock numbers sporadically throughout the year. More accurate and timely stock information would enable farmers to manage their herds better. Additionally, prompt response to any stock in distress is extremely valuable, both in terms of animal welfare and the avoidance of financial loss. In this regard, the evolution of deep learning algorithms and Unmanned Aerial Vehicles (UAVs) is forging a new research area for remote monitoring and counting of different animal species under various climatic conditions. In this paper, we focus on detecting and counting sheep in a paddock from UAV video. Sheep are counted using a model based on Region-based Convolutional Neural Networks and the results are then compared with other techniques to evaluate their performance.
Farah Sarwar1,2, Anthony Griffin1,2, Priyadharsini Periasamy2, Kurt Portas3 and Jim Law3
1High Performance Computing Research Lab, 2Electrical and Electronic Engineering Department
School of Engineering, Computer and Mathematical Sciences
Auckland University of Technology, New Zealand
farah.sarwar@aut.ac.nz, anthony.griffin@aut.ac.nz, peripriya86@gmail.com
3Palliser Ridge Limited, Pirinoa, New Zealand
kurt tp@hotmail.com, jimalaw@yahoo.com
1. Introduction
Many industries want more data about the operation of their business, and livestock farmers are no different. With the advent of precision agriculture [9], many farmers have more data than ever before, such as ground temperatures, soil moisture, pasture covers, and individual stock identification to monitor growth rates and welfare. However, precise and timely information about the numbers and location of their animals is hard to come by, particularly on larger farms. Indeed, counting livestock remains a manual task that is labour-intensive and prone to minor but significant errors. It can also be disruptive to the animals, as they must be run through a drafting race or a narrow choke point. Currently, many farmers only count their animals once a month or so [14], and some only count their stock when it is being loaded on or off a truck, as it arrives at or leaves the farm. Thus, many farmers are interested in monitoring their livestock using a robust automated system. Monitoring the distribution and population of animal species over time is also a key ingredient of successful nature conservation [21]. Animal observation can be used for multiple tasks such as monitoring animal growth, monitoring animal distress, and conducting surveys. This range of applications has motivated researchers to develop sophisticated yet faster ways to achieve this target, but this is still an emerging field as little work has been done so far. This research area poses many challenges, such as diversity of background, species-specific characteristics, and spatial clustering of animals [18]. Researchers use different statistical and biological methods [6, 10] to cope with these challenges. Some have adopted different approaches and tried to solve similar tasks using techniques like a template matching algorithm [22], an AdaBoost classifier [1], power spectral based techniques [13], the Deformable Part-based Model (DPM), Support Vector Machines (SVMs) [21] and Convolutional Neural Networks (CNNs) [2] for automatic processing, with some good results. However, all of them have tested their techniques on high-resolution images where the animals are few in number and occupy a majority of the image. We wish to deal with images containing hundreds of small animals.
In this work we focus on counting sheep from Unmanned Aerial Vehicle (UAV) video. Although this is easier than counting cattle or other animals, due to the relatively uniform colour of sheep, it is not a trivial task, particularly when dealing with hundreds or even thousands of sheep. The goal of this work is to provide farmers with extremely accurate information about how many sheep are in each of their paddocks, using a commonly-available UAV to count the sheep in situ. This involves no disruption to the animals, as the UAV flies so high that they rarely notice it.
Figure 1. Example test image, taken at 80 m from the ground, containing 139 sheep.
| Data set | Training: no. of images | Training: total sheep | Testing: no. of images | Testing: total sheep |
| --- | --- | --- | --- | --- |
| Sunny | 2 | 267 | 4 | 582 |
| Overcast | 2 | 319 | 4 | 520 |
| Mixed | 4 | 586 | 8 | 1102 |

Table 1. Details of data sets used for both methods. All images are (2048 × 1080 × 3) pixels in size.
Counting different objects in images was initially performed using hand-crafted representations that need feature extraction for the respective items [19], and then shifted to deep learning algorithms to increase speed and accuracy. Currently, researchers are focusing on solving object detection using CNNs [5, 11, 12, 17]. In the last few years, many researchers have proposed modified CNN-based models and delivered outstanding results in classification tasks, such as Ciresan et al. [3], who demonstrated very good performance on the NORB and CIFAR-10 datasets. Krizhevsky et al. [7] beat the classification record on ImageNet 2012. Since then, extended versions of CNNs have been proposed, such as Region-based CNN (R-CNN) [5], Fast R-CNN [4], Faster R-CNN [16], You Only Look Once (YOLO) [15] and the Single Shot MultiBox Detector (SSD) [8]. Amongst these techniques, YOLO is the fastest but makes more localization errors than Faster R-CNN and is not suitable for the detection of tiny objects [15]. Faster R-CNN provides the best results so far but is prone to false positives in the background.
We now discuss our methodologies in Section 2, and
Section 3 will discuss the experimental results and compare
two different methods to detect and count sheep in aerial
images. Conclusions are presented in Section 4.
2. Methodology
In this section we present two different methods for sheep detection in a frame. Method A is based on a basic R-CNN architecture, while Method B uses hand-crafted techniques. The results of both techniques are compared and discussed in later sections.

| Data set | No. of images | Total sheep |
| --- | --- | --- |
| Sunny | 146 | 884 |
| Overcast | 169 | 811 |
| Mixed | 318 | 1706 |

Table 2. Details of augmented training data sets for Method A. All images are (250 × 250 × 3) pixels in size.
Our main task is to detect and count the sheep using images of a paddock, which were taken from UAV video recorded at Palliser Ridge Ltd, a sheep and beef farm in the Pirinoa region of New Zealand.
There are commonly hundreds of sheep in a 2048 × 1080 RGB video frame, and each sheep is roughly 10 × 20 pixels. The video was recorded at a height of 80 m above the ground. Both methods use the data sets shown in Table 1 for training and testing, with a total of 4 images used for training and 8 for testing. There is no overlap between the images used for training and testing. Figure 1 shows an example of one of the training images.
2.1. Method A: R-CNN
Method A is mainly composed of 2 parts: designing
training data, and applying R-CNN for detection.
2.1.1 Designing Training Data
The training images were very similar to each other, so in order to augment the training data for Method A we cropped out sub-images of 250 × 250 pixels and rotated them to different angles. From the 4 full-size images, we created almost 400 small images to complete the training data set. Among these 400 images, 82 contained no object of interest and have only background. We then designed three different training data sets using the remaining 318 images: sunny, overcast, and a mix of overcast and sunny images. Table 2 shows the number of images and the total number of sheep in the augmented training data sets for Method A.
Each training image is 250 × 250 × 3 in size and all the sheep in it are labelled manually. This was quite a lengthy process, as the number of sheep per training image varies in the range [0, 18]. We used RGB colour images to keep this algorithm applicable to different animal species in our later research. There are a total of 1706 objects (sheep), or Regions of Interest (ROIs), in our complete training data set. Objects of interest that were not completely inside an image, or that were at its boundaries, were not labelled. Keeping such objects would only increase false positive detections on the background and thus degrade our results. Each ground truth bounding box has four components, (x, y, w, h), where x and y represent the starting coordinates of the respective ROI, and w and h give its width and height respectively. Examples of overcast, sunny and partly-sunny images with labelled sheep are shown in Figure 2.
Figure 2. Annotated training images.
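The labelling rule for cropped sub-images can be illustrated with a short sketch. This is not the authors' tooling: it assumes boxes are stored as (x, y, w, h) tuples in full-frame pixel coordinates, and the function name `boxes_in_crop` and its parameters are hypothetical. It keeps only boxes that lie strictly inside a 250 × 250 crop and shifts them into crop coordinates; rotation of the crops is not covered here.

```python
def boxes_in_crop(boxes, crop_x, crop_y, size=250):
    """Keep boxes strictly inside the crop and shift them to crop coordinates.

    boxes: iterable of (x, y, w, h) in full-frame pixels (assumed layout).
    crop_x, crop_y: top-left corner of the square crop in the full frame.
    """
    kept = []
    for x, y, w, h in boxes:
        # Boxes that touch or cross the crop boundary are left unlabelled.
        inside = (
            x > crop_x
            and y > crop_y
            and x + w < crop_x + size
            and y + h < crop_y + size
        )
        if inside:
            kept.append((x - crop_x, y - crop_y, w, h))
    return kept


# Hypothetical example: three boxes, one crop with its corner at (0, 0).
labels = boxes_in_crop([(10, 10, 10, 20), (245, 100, 10, 20), (0, 50, 10, 20)], 0, 0)
```

In this example only the first box survives: the second crosses the right edge of the crop and the third touches its left edge.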
2.1.2 Applying R-CNN for Detection
The R-CNN based object detection algorithm consists of
three different modules [5]:
1. Region proposals
2. Convolutional Neural Network
3. Class-specific linear Support Vector Machines (SVMs)
Girshick et al. [5] used the selective search algorithm [20] for defining region proposals. The input images of the training data set are warped to 277 × 277 × 3 pixels internally before the computation of region proposals. Afterwards, features are extracted from each region proposal using the CNN. The output of this layer is then fed to an SVM classifier as well as a simple linear bounding box regressor to get a classification confidence value for each bounding box. The centroid of each sheep is given by the centre of its bounding box.
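As a small illustration of this last step, the centre of an (x, y, w, h) box can be computed as below. The helper name `box_centre` is an assumption made for this sketch, as is the convention that (x, y) is the top-left corner of the box.

```python
def box_centre(box):
    """Return the (x, y) centre of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


# Hypothetical detections, one box per sheep.
detections = [(100, 40, 10, 20), (30, 60, 20, 10)]
centroids = [box_centre(b) for b in detections]
```

The number of detected sheep is then simply `len(centroids)`.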
The network used by Girshick et al. [5] has 4 convolutional layers and 2 fully connected (FC) layers. We used different architectures to check the impact of different combinations of convolutional and FC layers. The main algorithm for processing the input data remains the same but is applied to different network architectures. Figure 3 shows the starting network structure. We then systematically eliminated convolutional layers (from right to left) to analyze the impact of a smaller architecture on our results. Surprisingly, the results were not affected to a very large extent. A decrease in precision of 5% and an increase in recall of 6% were observed when going down to a network with 1 initial convolutional layer and 1 fully connected layer of 2 neurons followed by a softmax layer.
Figure 3. Main architecture of R-CNN network.
2.2. Method B: Expert System
Method B is a hand-crafted technique that takes advantage of the fact that in all the images, the sheep are quite uniformly white, and generally the brightest objects in an image. We used blob analysis on a brightness-thresholded greyscale image. Our detection algorithm generates $\hat{c}_i$, the estimated centroids of the sheep in the $i$-th image, as follows:
1. Generate the greyscale image $G_i$ based on the RGB image from the video.
2. Generate a binary image $B_i$ whose pixels are 1 if the corresponding pixel in $G_i$ is above a minimum brightness threshold $\beta$ and 0 otherwise.
3. Find and label all the 8-connected blobs in $B_i$.
4. Remove the blobs with an area less than a minimum area threshold $\alpha$.
5. Record the centroids of the remaining blobs as $\hat{c}_i$.
6. Calculate the mean of the blobs' areas as $\bar{A}$.
7. Record the blobs whose area is greater than $\gamma \bar{A}$ as two-sheep blobs, by duplicating these centroids.
Note that $\beta$ ranges from 0 to 1, $\alpha$ is an integer number of pixels, $\gamma$ is a number a little less than 2, and $\hat{c}_i$ is an $N_i \times 2$ matrix, whose $j$-th row is denoted by $\hat{c}_i^{(j)}$ and contains the $x$ and $y$ pixel locations of the centroid of the $j$-th sheep. $N_i$ is the number of sheep detected in the $i$-th image.
Although sheep tend to spread out when left alone in a paddock, often there will be a group of two or more sheep that are quite close together. Computing the centroids using only the first five steps of the algorithm can fail to identify multiple sheep in such groups, so the last two steps were added to improve the counting accuracy.
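The seven steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the image is assumed to be a nested list of (r, g, b) tuples with values in 0..255, the greyscale conversion is assumed to be the channel mean scaled to [0, 1] (the text does not say which conversion was used), and the function name `detect_sheep` is hypothetical. The default thresholds are the values reported in Section 3.2.

```python
from collections import deque


def detect_sheep(rgb, beta=0.75, alpha=35, gamma=1.6):
    """Return estimated sheep centroids as a list of (x, y) tuples.

    rgb is a list of rows; each row is a list of (r, g, b) tuples in 0..255.
    """
    height, width = len(rgb), len(rgb[0])

    # Steps 1-2: greyscale in [0, 1] (assumed: channel mean), then threshold.
    binary = [[sum(px) / (3 * 255) > beta for px in row] for row in rgb]

    # Step 3: collect 8-connected blobs with a breadth-first flood fill.
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the 3 x 3 neighbourhood, clipped to the image.
                for ny in range(max(y - 1, 0), min(y + 2, height)):
                    for nx in range(max(x - 1, 0), min(x + 2, width)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(pixels)

    # Step 4: drop blobs smaller than the minimum area alpha.
    blobs = [b for b in blobs if len(b) >= alpha]
    if not blobs:
        return []

    # Step 5: centroid of each remaining blob, as (x, y).
    centroids = [
        (sum(x for _, x in b) / len(b), sum(y for y, _ in b) / len(b))
        for b in blobs
    ]

    # Steps 6-7: blobs larger than gamma times the mean area count twice.
    mean_area = sum(len(b) for b in blobs) / len(blobs)
    doubles = [c for c, b in zip(centroids, blobs) if len(b) > gamma * mean_area]
    return centroids + doubles
```

The estimated count for a frame is `len(detect_sheep(frame))`. For full 2048 × 1080 frames a library routine for connected-component labelling would be much faster than this pure-Python flood fill; the sketch favours readability.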
In order to test the accuracy of this algorithm we manually labelled the centroids of sheep in the training image. Let these ground truth centroids be denoted by $c_i$. We then wish to compare $c_i$ and $\hat{c}_i$ to get a measure of the error; however, $c_i$ and $\hat{c}_i$ may contain differing numbers of rows and their rows are likely to be in different orders. Furthermore, the most similar rows of $c_i$ and $\hat{c}_i$ are unlikely to be exactly equal due to differences in human and machine processing.
For each row $\hat{c}_i^{(j)}$ of $\hat{c}_i$ we calculate the following:
$$d = \min_k \left\| \hat{c}_i^{(j)} - c_i^{(k)} \right\|_2 \qquad (1)$$
$$k' = \arg\min_k \left\| \hat{c}_i^{(j)} - c_i^{(k)} \right\|_2 \qquad (2)$$
We say that $\hat{c}_i^{(j)}$ matches $c_i^{(k')}$ if $d < \delta$, where $\delta$ is an error distance threshold, set at 10 pixels. We must ensure that we don't match $c_i^{(k')}$ to another row of $\hat{c}_i$, so we remove the $k'$-th row of $c_i$ before proceeding to the next row of $\hat{c}_i$. Our error metric is the number of unmatched rows of $c_i$ plus the number of unmatched rows of $\hat{c}_i$.
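The matching procedure of equations (1) and (2) can be sketched as a greedy nearest-neighbour loop. This is an illustrative sketch: the function name `match_centroids` and the representation of centroids as (x, y) tuples are assumptions, while the 10-pixel threshold is the value given in the text.

```python
import math


def match_centroids(estimated, truth, delta=10.0):
    """Greedily match estimated centroids to ground truth centroids.

    Returns (matched, unmatched_estimated, unmatched_truth).
    """
    remaining = list(truth)  # ground truth rows not yet matched
    matched = 0
    unmatched_estimated = 0
    for est in estimated:
        if remaining:
            # Equations (1) and (2): nearest remaining ground truth row.
            k = min(range(len(remaining)), key=lambda i: math.dist(est, remaining[i]))
            if math.dist(est, remaining[k]) < delta:
                remaining.pop(k)  # each ground truth row is matched at most once
                matched += 1
                continue
        unmatched_estimated += 1
    return matched, unmatched_estimated, len(remaining)


# Hypothetical example with three estimates and three labelled centroids.
matched, extra, missed = match_centroids(
    [(10, 10), (50, 50), (200, 200)],
    [(12, 11), (55, 52), (100, 100)],
)
error = extra + missed  # the error metric described above
```

Here two estimates fall within 10 pixels of a labelled centroid, one estimate has no partner and one labelled centroid is left over, so the error is 2. Because the matching is greedy, the result can depend on the order of the estimated rows.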
In order to test the sensitivity of this detection method to the two parameters $\alpha$ and $\beta$, we varied them both over a large range and calculated the error described above. Figure 4 shows the error for the example frame in Figure 1, which has bright, white sheep and varying illumination. It is clear that there are quite wide ranges of $\alpha$ and $\beta$ that perfectly detect the sheep.
3. Results
In order to measure the performance of the two detection methods, we followed the procedure in Section 2.2 to compute the overlap between the true centroids and the estimated centroids of the sheep. We calculated the precision and recall, defined as:
$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}} \qquad (3)$$
$$\text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}} \qquad (4)$$
where $N_{TP}$, $N_{FP}$ and $N_{FN}$ are the numbers of true positive, false positive and false negative sheep detections, respectively. The ideal is a system that has a precision and recall of 100%. However, in a counting context, recall is more important as it measures how many of the sheep were detected, whereas precision measures how confident you can be that a detected sheep is actually a sheep.
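Equations (3) and (4) translate directly into code. The sketch below uses a hypothetical function name, `precision_recall`, and made-up counts; returning 0.0 when a denominator is zero is a convention chosen for this example, not something the paper specifies.

```python
def precision_recall(n_tp, n_fp, n_fn):
    """Compute (precision, recall) from detection counts."""
    detected = n_tp + n_fp  # everything the detector reported
    actual = n_tp + n_fn    # every sheep really present
    precision = n_tp / detected if detected else 0.0
    recall = n_tp / actual if actual else 0.0
    return precision, recall


# Made-up counts: 90 correct detections, 10 spurious, 30 sheep missed.
precision, recall = precision_recall(90, 10, 30)
```

With these counts the precision is 0.9 and the recall is 0.75: most reported sheep are real, but a quarter of the flock was missed, which matters more when counting.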
3.1. Method A
Figure 4. Sensitivity matrix for the frame in Figure 1 for Method B. The colours represent the number of detection errors, ranging from white (no errors) to black (10 or more errors).
Tuning most of the hyperparameters for a successful implementation of R-CNN relies on empirical evidence and testing. We experimented with different settings to determine the combination of parameters that provides the best results. While testing different training options we observed that a mini-batch size outside a certain range did not give good results for our data set: the algorithm took too long to find a minimum error for batch sizes above 32, and needed too many epochs for batch sizes below 10. Similarly, the results varied when we tried a negative overlap region between 0.1 and 0.5. Girshick et al. [5] used a value of 0.3 in their experiments as it best suited their data set, but it did not work in our case. When we decreased the value to 0.2, recall improved by 7%, which was a huge difference. Another important parameter is the initial learning rate, which also depends on the training data set and the type of object under observation. Our network works well with this value around $10^{-5}$; above this value it gets stuck in a local minimum, and below this value it just oscillates and takes too long to settle. So, we used a mini-batch size of 15, an initial learning rate of $10^{-5}$ and a negative overlap region of 0.2 while training on our data sets.
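One way to read the negative overlap setting is as an upper bound on the intersection-over-union (IoU) between a region proposal and any ground truth box, below which the proposal is treated as a background example. That reading is an assumption made for this sketch, as are the function names `iou` and `is_negative` and the (x, y, w, h) box layout; the value 0.2 is the one reported above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_negative(proposal, truth_boxes, max_overlap=0.2):
    """True if the proposal overlaps every ground truth box by less than max_overlap."""
    return all(iou(proposal, t) < max_overlap for t in truth_boxes)


# Hypothetical 10 x 20 pixel sheep box and two proposals.
sheep = (0, 0, 10, 20)
half_shifted = (5, 0, 10, 20)    # shares half its width with the sheep box
barely_touching = (9, 0, 10, 20)  # shares a 1-pixel-wide strip
```

Under this reading, `half_shifted` (IoU of one third) is not used as a background example, while `barely_touching` is. Lowering the bound from 0.3 to 0.2 therefore removes proposals that partially cover a sheep from the background class.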
Along with adjusting the options for training, we also
checked results using different architectures for the net-
work. These architectures are as follows:
1. 96 filters with kernel sizes of 5 × 5, 11 × 11, 21 × 21, 25 × 25 and 29 × 29 in a network of 1 convolutional layer followed by 1 FC layer and a softmax layer.
2. R-CNN networks with 2, 3, 4, 5 and 6 layers, based on the 6-layer network structure shown in Figure 3.
3. A 6-layer R-CNN network using different kernel sizes in each convolutional layer.
The graphs of precision and recall using the previously discussed hyperparameters and the above-mentioned architectures are shown in Figure 5. It can be seen that larger kernel sizes of around 21 × 21 and 25 × 25 give good results when compared to other options. These filters were tested only in the smallest network, as different filter combinations in larger networks did not give significant variations, as can be observed from the bottom right graph in the same figure. Smaller filters usually capture minute details of objects, and our current object is an oval-shaped white blob. Thus, having only one oval-shaped white blob to detect removes the necessity for larger network structures.
Figure 5. Precision and recall curves using different settings. All figures show three curves of precision and three curves of recall for three test data sets. The top left figure was obtained using a 2-layer R-CNN network (1 conv + 1 FC + softmax) with the shown kernel sizes for the convolutional layer. The bottom left figure shows the respective curves for 2-, 3-, 4-, 5- and 6-layer R-CNN networks. The bottom right figure shows results with a 6-layer R-CNN network using different kernel sizes in each network.
Different training data sets also contribute towards the accuracy of the results. It was observed that although the training set with only sunny images has half the number of images of the full data set, it was able to train the network almost equally well. The results were quite poor when using a training data set of only overcast images. The details of the test images, which were used for testing all the networks discussed, were presented in Table 1. Results presented in Figure 5 were computed using the training data set only.
It can be seen that the best recall is observed for cloudy test images, as there is less illumination variation in them. The highest recall for both types of test data sets is observed using a network of 2 layers with 96 filters of size 21 × 21 in the convolutional layer. One such resultant test image is shown in Figure 6.
3.2. Method B
Once the thresholds $\alpha$, $\beta$ and $\gamma$ had been tuned on the 4 training images, Method B was applied to the 8 test images. The values chosen were
$$\alpha = 35, \quad \beta = 0.75, \quad \gamma = 1.6 \qquad (5)$$
The precision and recall for Method B are shown in Table 3. This method performs perfectly on the training images, and shows only slight degradation on the test images.

| Data set | Precision | Recall |
| --- | --- | --- |
| Training | 100.0% | 100.0% |
| Testing | 95.6% | 99.5% |

Table 3. Precision and Recall for Method B
3.3. Discussion
While it is clear that Method B outperforms Method A, the CNN method does show some promise. Indeed, there is perhaps a way to combine the two methods to achieve even better detection performance. When Method B fails, it is due to an increase in false negatives, particularly around background objects such as trees. Method A tends to have very few false negatives, so perhaps this information can be used to help Method B improve its discrimination.
4. Conclusion
In this work we have taken an important step forward in the previously unexplored area of computer vision techniques to detect and count sheep from UAV video. We have used two totally different approaches to handle the task, which both work well, although it can be seen from the results that more work is required for deep learning algorithms to perform effective object detection, especially if the objects are very small compared to the background. The hand-crafted method proposed here shows great promise for sheep detection and counting.
Figure 6. 189 detected sheep with 9 false positive results in an overcast test image having true count of 199, for Method A.
Nonetheless, there are many areas still open for further work, such as speeding up the algorithms, detecting the fence lines, improving the robustness of the algorithms to parameter selection, and the use of tracking to cope with small mistakes in object detection. CNN-based techniques may be of use in background discrimination, as well as detecting less uniform livestock such as cattle, which can be a variety of colours, or even multi-coloured. We hope that this work will inspire other researchers to tackle some of these challenges.
References
[1] T. Burghardt and J. Calic. Real-time face detection and tracking of animals. In Neural Network Applications in Electrical Engineering, 2006. NEUREL 2006. 8th Seminar on, pages 27–32. IEEE, 2006.
[2] P. Chamoso, W. Raveane, V. Parra, and A. González. UAVs applied to the counting and monitoring of animals. In Ambient Intelligence-Software and Applications, pages 71–80. Springer, 2014.
[3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
[4] R. Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[6] J. Ingram and J. Anderson. Tropical soil biology and fertility. A handbook of methods. 2nd ed. CAB Int., Wallingford, UK, 1993.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[9] A. McBratney, B. Whelan, T. Ancev, and J. Bouma. Future directions of precision agriculture. Precision Agriculture, 6(1):7–23, 2005.
[10] J. McKinlay, C. Southwell, and R. Trebilco. Integrating count effort by seasonally correcting animal population estimates (ICESCAPE): A method for estimating abundance and its uncertainty from count data using Adélie penguins as a case study. CCAMLR Science, 17:213–227, 2010.
[11] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE, 2014.
[12] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
[13] M. Parikh, M. Patel, and D. Bhatt. Animal detection using template matching algorithm. International Journal of Research in Modern Engineering and Emerging Technology, 1(3):26–32, 2013.
[14] K. Portas. Personal communication, December 2015.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[16] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[18] G. Sileshi. The excess-zero problem in soil animal count data and choice of appropriate models for statistical inference. Pedobiologia, 52(1):1–17, 2008.
[19] V. A. Sindagi and V. M. Patel. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters, 107:3–16, 2017.
[20] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[21] J. C. van Gemert, C. R. Verschoor, P. Mettes, K. Epema, L. P. Koh, and S. Wich. Nature conservation drones for automatic localization and counting of animals. In Workshop at the European Conference on Computer Vision, pages 255–270. Springer, 2014.
[22] F. A. Wichmann, J. Drewes, P. Rosas, and K. R. Gegenfurtner. Animal detection in natural scenes: critical features revisited. Journal of Vision, 10(4):1–27, 2010.
... Frequentemente, as indústrias possuem dados extras armazenados sobre as operações em suas empresas e buscam a otimização dos seus processos, e na agricultura isso não é diferente. Com o advento da agricultura de precisão, muitos agricultores possuem informações extras disponíveis que foram coletadas por variados sensores, tais como temperatura do solo, componentes do solo, cobertura, e identificadores e monitores de crescimento (Sarwar et al., 2018). Entretanto, informações precisas e atuais sobre a plantação, números e localização de cada planta são mais desafiadoras em grandes áreas, pois é um trabalho demorado, intensivo e caro (Karami et al., 2020). ...
... Também destaca-se que muitos autores utilizaram métodos de CNN que podem ser utilizados para detecção de objetos, já realizando o particionamento, extração de características e classificação. Sarwar et al. (2018) usaram imagens no espaço de cor RGB além de R-CNN com o intuito de detecção e contagem de ovelhas, tendo obtido resultados próximos a 100%. Wang et al. (2018) utilizaram Fast CNN, versão melhorada da R-CNN, para detecção de objetos em áreas urbanas, com um resultado de 86,52%. ...
... Kalantar et al. (2020) utilizaram, para anotar a caixa delimitadora de cada fruta manualmente, a ferramenta "labelImg (version 1.8.1)". Sarwar et al. (2018) também anotaram imagens de 250 × 250 pixels, tendo selecionado patches que contêm ou não objeto. ...
Article
Full-text available
... Moreover, its area under the curve (AUC) shows the effect of input resolution resize when the aim is the resolution optimization. In [55], they evaluate the performance of the CNN-based detection models using the precision-recall curve showing the disjoint of the actual centroids and the estimated centroids of the sheep. Similarly, [39], [56] assessed fifteen different CNN architectures using four global performance metrics, accuracy, precision, recall, and F1Score. ...
... The main phases of UAV-based object detection methods include three stages information region selection, feature extraction, and classification [43]. As an example, UAV-based video data fitted into DL (e.g., CNN) and image processing methods facilitated livestock monitoring of sheep [55], [77], and calves [78]. However, video sequencing is challenging for several reasons, each of which causes a specific issue. ...
... Another issue of dealing with hundreds of small animals per image still lacks information and reliable processing methods. In this regard, [55] explored two sheep detection and counting approaches based on R-CNNs and an expert system using blob analysis from UAV video. While the proposed expert system technique indicated great potential for the intended application, the CNN technique required more work to detect practical objects, especially when dealing with small objects in the background. ...
Article
Full-text available
With the ever-increasing importance of dairy and meat production, precision livestock farming (PLF) using advanced information technologies is emerging to improve farming production systems. The latest automation, connectivity, and artificial intelligence developments open new horizons to monitor livestock in the pasture, controlled environments, and open environments. Due to the significance of livestock detection and tracking, this systematic review extracts and summarizes the existing deep learning (DL) techniques in PLF using unmanned aerial vehicles (UAV). In the context of livestock recognition studies, UAVs are receiving growing attention due to their flexible data acquisition and operation in different conditions. This review examines the implemented DL architectures and scrutinizes the broadly exploited evaluation metrics, attributes, and databases. The classification of most UAV livestock monitoring systems using DL techniques is in three categories: detection, classification, and localization. Correspondingly, this paper discusses the future benefits and drawbacks of these DL-based PLF approaches using UAV imagery. Additionally, this paper describes alternative methods used to mitigate issues in PLF. The aim of this work is to provide insights into the most relevant studies on the development of UAV-based PLF systems focused on deep neural network-based techniques.
... In the deep learning based object detection method, deep convolutional neural networks (CNNs) are capable of learning features from training-image data automatically, thus boosting accuracy without relying on artificial features. Sarwar et al. proposed a model based on region-based CNN to detect and count sheep in a paddock from unmanned aerial vehicle (UAV) video [6]. Soares et al. proposed a method for counting cattle in aerial images obtained by UAVs, based on CNNs and a graph-based optimization to remove duplicated animals detected in overlapped images [7]. ...
Conference Paper
Full-text available
Chicken counting is an essential task in large scale farming management. Due to dense distribution, uneven illumination, and partial occlusion, accurate chicken counting remains challenging. In this paper, an automated chicken counting algorithm based on You Only Look Once (YOLO) v5x model is implemented. The intersection over union (IoU) threshold is set by analyzing the width and height of the ground truth (GT) boxes of the training images. Three objective-oriented data enhancements, i.e., Mosaic, horizontal flipping combined with lightness changing, and test time augmentation (TTA), are applied to diversify the training data. To validate the efficiency of our proposed method, extensive experiments are conducted on a well annotated dataset collected from a real farm with 1,100 images and 170,906 chickens in total. Our implementation achieves the average accuracy of 95.87% and inference speed of 23 ms per image, even if chickens are partially occluded in extremely uneven illumination perspectives.
... In contrast, object detection identifies objects of the same type in an image and pinpoints their locations [7]. The main purpose of object counting applications is to calculate the number of cars on a road or parking lot [8][9][10][11], the number of people in a crowded area [12][13][14], the number of goats and sheep in an livestock farm [15,16], and the number of apples in an cultivated area [17]. ...
Article
Full-text available
Simple Summary: This study employs Fully Convolutional Regression Networks (FCRN) and U-Shaped Convolutional Network for Image Segmentation (U-Net) architectures tailored to the da-taset containing dropping images of dairy cows collected from three different private dairy farms in Nigde. The main purpose of this study is to detect the number of undigested grains in dropping images in order to give some useful feedback to raiser. It is a novel study that uses two different regression neural networks on object counting in dropping images. To our knowledge, it is the first study that counts objects in dropping images and provides information of how effectively dairy cows digest their daily rations. Abstract: Deep learning algorithms can now be used to identify, locate, and count items in an image thanks to advancements in image processing technology. The successful application of image processing technology in different fields has attracted much attention in the field of agriculture in recent years. This research was done to ascertain the number of indigestible cereal grains in animal feces using an image processing method. In this study, a regression-based way of object counting was used to predict the number of cereal grains in the feces. For this purpose, we have developed two different neural network architectures based upon Fully Convolutional Regression Networks (FCRN) and U-Net. The images used in the study were obtained from three different dairy cows enterprises operating in Nigde Province. The dataset consists of the 277 distinct dropping images of dairy cows in the farm. According to findings of the study, both models yielded quite acceptable prediction accuracy with U-Net providing slightly better prediction with a MAE value of 16.69 in the best case, compared to 23.65 MAE value of FCRN with the same batch.
... Nowadays, one application of UAVs is in agriculture, for farm and livestock management. UAVs are used in this field for monitoring, behavior recognition, counting, detection, tracking, and livestock identification [1][2][3][4][5][6][7]. These applications are sometimes achieved by harnessing other technologies such as Flying Ad hoc Networks (FANET) [8] and wireless sensor networks [9]. ...
Article
... Wang et al. [10] used an acoustic approach that can reliably distinguish sheep activity, which permits the use of more factors obtained from sheep behavior to estimate intake and to classify and identify behavior. Moreover, [11] shows that sheep may be counted using an R-CNN-based system on an unmanned aerial vehicle (UAV). Ma et al. [12] investigated sheep identification and location and suggested a Faster-FCNN neural network model based on the Soft-NMS algorithm, which can monitor and identify sheep in complicated raising situations. ...
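The Soft-NMS algorithm mentioned above can be illustrated with a short sketch. This is a generic Gaussian Soft-NMS in plain Python, not the implementation used by any of the cited authors. The box format `(x1, y1, x2, y2)`, the `sigma` and `score_threshold` defaults, and the sample detections are assumptions made for illustration. Instead of discarding boxes that overlap a higher-scoring box, it decays their scores according to the overlap.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in the order selected."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select the highest-scoring remaining box.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of the others according to their overlap with it.
        decayed = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        remaining = decayed
    return kept


# Hypothetical detections: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
# Count detections whose decayed score is still above a confidence cut-off.
count = sum(1 for _, s in kept if s >= 0.5)
```

In this sketch the overlapping second box is kept but with a reduced score, so a later confidence cut-off decides whether it counts as a separate animal.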
Preprint
Sheep management and production enhancement are challenging for farmhouses due to the lack of dynamic sheep behaviors. Many researchers have conducted machine learning-based studies to automate the sheep behavior monitoring process instead of relying on manual assessment. However, there is a lack of utilization of sheep behaviors, which degrades model performance after a few months. Moreover, behavior challenges, parameters, and analysis must be considered if one needs to conduct a machine learning-based study. In this paper, we present the different challenges, identify the parameters of sheep behaviors, and discuss how to analyze sheep behaviors so that automated machine learning systems remain helpful in the long term. We also review different studies on precision-based animal welfare and monitoring that aim at higher production and better management capabilities.
Chapter
Most sheep farms are established in places far away from urban areas, which makes centralized management inconvenient, and taking inventory is an essential task for centralized management. However, the traditional way of inventory management, which mainly relies on manual counting, is time-consuming and laborious. To address the above problems, inspired by the crowd density counting model SFANet, we propose a sheep counting algorithm for surveillance video that uses VGG16 as the front-end feature extractor and a dual-path multiscale fusion network as the back-end for generating density maps and attention maps. Attaching the ASPP module and CAN module to the back-end density map path not only extracts the multi-scale features of sheep in the image, but also handles their multi-scale variations based on contextual information. By adding the channel attention module SE to the attention map path, the SE channel attention mechanism enhances the channel where the sheep is located according to the importance of each feature channel, thereby making the network more focused on the target. The results show that the network has a mean absolute error of 1.28 and a mean square error of 1.70 on the outdoor sheep farm dataset, while the indoor environment is more complex, with a mean absolute error of 5.82 and a mean square error of 7.36. The experimental results show that the approach is helpful for sheep counting under surveillance video.
Keywords: Convolutional neural networks; Sheep counting; Surveillance videos
Article
Estimating count and density maps from crowd images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. In addition, techniques developed for crowd counting can be applied to related tasks in other fields of study such as cell microscopy, vehicle counting and environmental survey. The task of crowd counting and density map estimation is riddled with many challenges such as occlusions, non-uniform density, intra-scene and inter-scene variations in scale and perspective. Nevertheless, over the last few years, crowd count analysis has evolved from earlier methods that are often limited to small variations in crowd density and scales to the current state-of-the-art methods that have developed the ability to perform successfully on a wide range of scenarios. The success of crowd counting methods in the recent years can be largely attributed to deep learning and publications of challenging datasets. In this paper, we provide a comprehensive survey of recent Convolutional Neural Network (CNN) based approaches that have demonstrated significant improvements over earlier methods that rely largely on hand-crafted representations. First, we briefly review the pioneering methods that use hand-crafted representations and then we delve in detail into the deep learning-based approaches and recently published datasets. Furthermore, we discuss the merits and drawbacks of existing CNN-based approaches and identify promising avenues of research in this rapidly evolving field.
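The link between a density map and a count can be shown with a small sketch. It assumes point annotations given as `(row, col)` pixel positions and a fixed Gaussian kernel. The kernel size, `sigma`, image size, and sample points are illustrative choices, not values from the survey. Each annotated object contributes a Gaussian that sums to 1, so summing the whole map recovers the number of objects.

```python
import math


def gaussian_kernel(size, sigma):
    """Square Gaussian kernel (odd size) normalised so its entries sum to 1."""
    center = (size - 1) / 2.0
    kernel = [
        [math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, size=15, sigma=4.0):
    """Add one unit-mass Gaussian per annotated point (row, col)."""
    half = size // 2
    kernel = gaussian_kernel(size, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for pr, pc in points:
        for kr in range(size):
            for kc in range(size):
                r, c = pr + kr - half, pc + kc - half
                # Kernel mass falling outside the image is dropped.
                if 0 <= r < height and 0 <= c < width:
                    dmap[r][c] += kernel[kr][kc]
    return dmap


def count_from_density(dmap):
    """The estimated count is the integral (sum) of the density map."""
    return sum(sum(row) for row in dmap)


# Three hypothetical annotations, all far enough from the image border.
points = [(20, 30), (40, 10), (32, 50)]
dmap = density_map(points, height=64, width=64)
estimated = count_from_density(dmap)
```

A counting network of this kind is trained to regress such a map from the image, and its predicted count is the sum of its output. Points close to the border lose some kernel mass in this sketch, so their contribution is slightly below 1.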
Conference Paper
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
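The idea of default boxes over several aspect ratios and scales at each feature map location can be sketched as follows. This is a simplified illustration, not the released SSD code. The linear scale rule and the values `s_min=0.2` and `s_max=0.9` follow the SSD paper as I understand it, while the function names, the aspect ratio set, and the 4 × 4 feature map are assumptions. The extra box that SSD adds for aspect ratio 1 is omitted.

```python
import math


def scale_for_layer(k, m, s_min=0.2, s_max=0.9):
    """Default-box scale for the k-th of m feature maps (k is 1-based)."""
    if m == 1:
        return s_min
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Boxes as (cx, cy, w, h) in coordinates relative to the image size."""
    boxes = []
    for i in range(fmap_size):        # feature map row
        for j in range(fmap_size):    # feature map column
            # Box centre is the centre of the feature map cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width and height vary with aspect ratio, area stays scale**2.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical setting: the third of six feature maps, with a 4 x 4 grid.
scale = scale_for_layer(k=3, m=6)
boxes = default_boxes(fmap_size=4, scale=scale)
print(len(boxes), "default boxes at scale", round(scale, 2))
```

At prediction time the network would output, for each of these boxes, class scores and four offsets that adjust the box toward the object. Coarser feature maps get larger scales, which is how objects of different sizes are covered.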
Conference Paper
This paper is concerned with nature conservation by automatically monitoring animal distribution and animal abundance. Typically, such conservation tasks are performed manually on foot or after an aerial recording from a manned aircraft. Such manual approaches are expensive, slow and labor intensive. In this paper, we investigate the combination of small unmanned aerial vehicles (UAVs or "drones") with automatic object recognition techniques as a viable solution to manual animal surveying. Since no controlled data is available, we record our own animal conservation dataset with a quadcopter drone. We evaluate two nature conservation tasks: i) animal detection and ii) animal counting, using three state-of-the-art generic object recognition methods that are particularly well-suited for on-board detection. Results show that object detection techniques for human-scale photographs do not directly translate to a drone perspective, but that lightweight automatic object detection techniques are promising for nature conservation tasks.
Article
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objects of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
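Predicting object bounds "at each position" is done relative to a fixed set of reference boxes (anchors) tiled over the feature map. A minimal sketch of that tiling follows, assuming three scales and three aspect ratios (nine anchors per position) and a feature stride of 16 pixels, which I believe matches the Faster R-CNN paper but is not stated in the abstract above. The function names and the cell-centre placement are illustrative choices, and implementations differ in the exact centre convention.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors centred on the origin as (x1, y1, x2, y2); ratio = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while changing the aspect ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def anchors_for_feature_map(fmap_h, fmap_w, stride=16):
    """Shift the base anchors to the centre of every feature map cell."""
    base = base_anchors()
    out = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx = j * stride + stride / 2
            cy = i * stride + stride / 2
            for x1, y1, x2, y2 in base:
                out.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return out


# Hypothetical 2 x 3 feature map: 2 * 3 * 9 = 54 anchors in image coordinates.
anchors = anchors_for_feature_map(fmap_h=2, fmap_w=3)
print(len(anchors), "anchors")
```

For every anchor the RPN would output an objectness score and four regression offsets, and the top-scoring refined boxes become the proposals passed to the detection network.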