Detection of Big Animals on Images with Road Scenes using Deep Learning
Dmitry Yudin, Laboratory of Intelligent Transport, Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Russia, yudin.da@mipt.ru
Anton Sotnikov, Laboratory of Intelligent Transport, Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Russia, sotnikov.ad@phystech.edu
Andrey Krishtopik, Laboratory of Cognitive Dynamic Systems, Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Russia, andrey.krishtopik@mail.ru
Abstract: The recognition of big animals in images of road scenes has received little attention in modern research, and there are very few specialized data sets for this task. Popular open data sets contain many images of big animals, but most of them do not correspond to road scenes, which is necessary for the on-board vision systems of unmanned vehicles. The paper describes the preparation of such a specialized data set based on the Google Open Images and COCO datasets. The resulting data set contains about 20000 images of big animals of 10 classes: "Bear", "Fox", "Dog", "Horse", "Goat", "Sheep", "Cow", "Zebra", "Elephant", "Giraffe". Deep learning approaches to detecting these objects are researched in the paper. The authors trained and tested the modern neural network architectures YOLOv3, RetinaNet R-50-FPN, Faster R-CNN R-50-FPN, and Cascade R-CNN R-50-FPN. To compare the approaches, the mean average precision (mAP) was determined at IoU≥50%, and their speed was measured for an input tensor size of 640x384x3. The highest quality metrics are demonstrated by the YOLOv3 architecture for both ten-class (0.78 mAP) and single joint class (0.92 mAP) detection, with a speed of more than 35 fps on an NVidia Tesla V100 32GB video card. On the same hardware, the RetinaNet R-50-FPN architecture provided a recognition speed of more than 44 fps with a 13% lower mAP. The software implementation was done using the Keras and PyTorch deep learning libraries and NVidia CUDA technology. The proposed data set and neural network approach to recognizing big animals in images have shown their effectiveness and can be used in the on-board vision systems of driverless cars or in driver assistance systems.
Keywords: image recognition, detection, big animals, road scene, data set, deep learning, neural network, software.
I. INTRODUCTION
Reliable detection of big animals in images is a serious challenge for the computer vision systems of unmanned cars. This is especially important because of the relatively high number of road accidents involving wild animals [1].
At an early stage, approaches to solving this problem used detectors based on hand-crafted features: Haar features, HOG (Histogram of Oriented Gradients), and LBP (Local Binary Patterns) [2, 3]. However, such approaches were not reliable enough.
Modern research in the field of big animal detection in images is associated mainly with the use of deep convolutional neural networks. The recognition of animals is investigated as a solution to the problems of classification [4], detection [5], and segmentation [6] of objects. Some works are devoted to the detection of animals in images obtained from unmanned aerial vehicles, for example, paper [7].
The appearance of animals on the road is a relatively rare event; at the same time, sufficiently large and varied data sets are needed to train neural network systems for their detection. Table I shows the most popular modern open data sets containing images for the detection of big animals. There are also closed data sets created on the basis of images and videos from the Internet, for example, LADSet [3], but there is little information about their contents.
The IWildCam [8], Animal Image [9], The Oxford-IIIT Pet [10], and STL-10 [11] datasets have the disadvantage that they contain a small number of labeled images in the training set and a limited number of animal classes. The largest image database, ImageNet [12], currently contains many labeled images of a huge number of types and subtypes of big animals, but the vast majority of them do not correspond to road scenes.
TABLE I. OPEN DATA SETS FOR THE BIG ANIMAL DETECTION PROBLEM

Data set                      Total number of images in the data set
IWildCam [8]                  ~200k
Animal Image [9]              3k
The Oxford-IIIT Pet [10]      7.5k
STL-10 [11]                   100k
ImageNet [12]                 14M
COCO [13]                     330k
Google's Open Images [14]     1.9M
The COCO [13] and Google's Open Images [14] data sets are more promising for use in this research area, as they contain not only bounding boxes but also polygons of object segments. In Section III of the present article, we consider the formation, on their basis, of a data set for the detection of big animals in road scenes.
In addition, special attention is paid to the use of modern object detectors based on deep convolutional neural networks, and the results of experiments using the created data set are analyzed.
II. PROBLEM DEFINITION
This article solves the problem of detecting and classifying animals in images of road scenes. We need to investigate methods based on deep neural networks for detecting big animals of 10 widespread classes: "Bear", "Fox", "Dog", "Horse", "Goat", "Sheep", "Cow", "Zebra", "Elephant", "Giraffe". The task also includes studying the detection of a single joint class. To train and test various neural network architectures, an appropriate data set should be generated. Then we need to determine the best architecture for this task using the AP (average precision) [15] quality metric per class and the overall mAP (mean average precision) [16]. Another important indicator is the inference time for one image (without taking into account the time to load the image into memory and prepare it for feeding to the network input).
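For reference, the quality metrics used below can be illustrated with the following minimal sketch. It shows how the IoU between a predicted and a ground-truth box can be computed and how a PASCAL VOC-style AP can be obtained from a precision-recall curve built by matching detections to ground truth at IoU≥0.5; this is our illustration of the standard formulas, not the evaluation code used in the experiments, and the function names are assumptions.

```python
# Illustrative sketch of IoU and VOC-style AP; mAP is the mean of per-class AP values.
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (detections sorted by confidence,
    counted as true positives when IoU with an unmatched ground truth is >= 0.5)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically decreasing, then integrate over recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```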
III. DATA SET PREPARATION
To obtain specific results, we created our own data set based on COCO [13] and Google's Open Images V5 [14]. The following classes of large animals were selected from the COCO data set: "Dog", "Horse", "Sheep", "Cow", "Bear", "Elephant", "Zebra", "Giraffe". Although there are almost no representatives of the last three classes in the area under consideration, they were added to improve the quality of the future detector by recognizing them in road scenes. Open Images V5 provided the previous classes and two additional classes of large animals: "Fox" and "Goat". Annotations to the images are stored in COCO format, i.e. they are contained in a .json file. Let us consider in more detail which fields are included in it:
"segmentation": contains the polygon's coordinates;
"area": shows the area of the object;
"iscrowd": shows how many objects are present, '0' for one object, '1' for more than one;
"bbox": contains the coordinates of the ground truth bounding box;
"category_id": shows the class of the object; in this case, all classes belong to the one supercategory "animal";
"id": the unique number of each image.
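For illustration, a single object annotation in this format might look as follows. This is a hypothetical example with made-up values, shown as a Python dictionary; the actual files may contain additional fields defined by the standard COCO layout.

```python
# Hypothetical COCO-style annotation entry (all values are illustrative).
annotation = {
    "segmentation": [[312.4, 180.1, 355.9, 182.7, 354.2, 240.0, 310.8, 238.5]],  # polygon coordinates
    "area": 2512.6,                       # area of the object in pixels
    "iscrowd": 0,                         # 0 - single object, 1 - group of objects
    "bbox": [310.8, 180.1, 45.1, 59.9],   # ground-truth box as [x, y, width, height]
    "category_id": 3,                     # class id; all classes belong to supercategory "animal"
    "image_id": 1742,                     # id of the image the object belongs to
    "id": 90311,                          # unique id of this annotation entry
}
```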
Table II below provides summary statistics on the number of images of each class in the developed data set. A fragment of the data set is shown in Fig. 1.
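Statistics such as those in Table II can be obtained directly from the COCO-format annotation file. The following short sketch is an illustration of that counting; the file name and the exact structure of the category list are assumptions, not the authors' data preparation code.

```python
import json
from collections import Counter

# Count annotated boxes and images per class in a COCO-style annotation file
# ("animals_train.json" is a hypothetical placeholder).
with open("animals_train.json") as f:
    data = json.load(f)

id_to_name = {c["id"]: c["name"] for c in data["categories"]}
boxes_per_class = Counter(id_to_name[a["category_id"]] for a in data["annotations"])
images_per_class = {
    name: len({a["image_id"] for a in data["annotations"]
               if id_to_name[a["category_id"]] == name})
    for name in id_to_name.values()
}
print(boxes_per_class)
print(images_per_class)
```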
IV. DEEP LEARNING APPROACH TO DETECTION
To solve this problem, we chose four neural network architectures based on the successful experience of their application to similar tasks [17, 18]:
1) YOLOv3 [19]: This is a one-stage neural network architecture that achieves high-speed image processing at the cost of slightly lower quality. The feature extractor consists of 3x3 and 1x1 convolutional layers with shortcut connections. YOLOv3 [19] predicts boxes at three different scales, using a concept similar to feature pyramid networks. For classification, independent logistic classifiers are used instead of softmax. The bounding box predictor uses anchor boxes.
TABLE II. NUMBER OF IMAGES BY CLASS

Class       Training sample: Images   Testing sample: Images   Testing sample: Boxes
Dog         4385                      177                      218
Horse       2941                      128                      273
Sheep       1529                      65                       361
Cow         1968                      87                       380
Elephant    2143                      89                       255
Bear        960                       49                       71
Zebra       1916                      85                       268
Giraffe     2546                      101                      232
Fox         460                       10                       12
Goat        274                       14                       34
Total       19122                     805                      2104
Fig. 1. Fragment of the proposed data set.
2) RetinaNet R-50-FPN [20]: This one-stage network was developed to test a new loss function, the focal loss, created to improve the effectiveness of training. The focal loss adds a factor (1 − p_t)^γ to the standard cross-entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > 0.5), putting more focus on hard, misclassified examples (a small sketch of this loss is given after Fig. 2). The network is quite simple: it uses an FPN (Feature Pyramid Network) on top of the ResNet-50 [21] architecture as the feature extractor.
3) Faster R-CNN R-50-FPN [22]: This two-stage architecture uses ResNet-50 with FPN to extract feature maps. The difference between Faster R-CNN and Fast R-CNN [23] is that region proposals are obtained with the Region Proposal Network (RPN) [22] instead of selective search, which improves network performance by about 10 times.
4) Cascade R-CNN R-50-FPN [24]: Cascade R-CNN is a multi-stage object detection architecture (Fig. 2). A specialty of this network is cascaded bounding box regression, as shown in the figure. In Fig. 2, "I" is the input image, ResNet-50 with FPN is the backbone, "pool" denotes region-wise feature extraction, "H" a network head, "AB" an animal bounding box, and "AC" animal classification. "AB0" denotes the proposals in all architectures.
Fig. 2. Architecture of the Cascade R-CNN R-50-FPN neural network.
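For concreteness, the focal loss mentioned in the RetinaNet description above can be sketched in PyTorch as follows. This is our illustration of the published formula, not the training code used in this work; the function name, the alpha balancing factor, and the binary (per-class sigmoid) formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)  # class balancing factor
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```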
YOLOv3 was trained using the Keras neural network library [25] (running on top of TensorFlow [26]). The remaining architectures were trained using the PyTorch library [27]. For training on our data set, pre-trained models were used.
The YOLOv3 model was pre-trained on ImageNet; we used only the pre-trained backbone (DarkNet53 [19]). Since we did not use the entire network, but only the backbone, the rest of the network was initialized with random weights. Because of this, during the first several epochs the network was trained with a frozen backbone, so that the randomly initialized weights were trained first. Only after that was the entire network included in the training.
The remaining models were pre-trained on the COCO 2017 train set [13]. Unlike YOLOv3, we used the whole pre-trained network. However, since there are 80 classes in the COCO data set, we removed the extra classes from the models before training.
The training was carried out with an input image tensor size of 640x384x3 and a batch of 8 images. The learning rate was initially 0.01 and was automatically decreased during the learning process when needed.
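The staged training procedure described above (frozen backbone first, then full fine-tuning, with an automatically decreasing learning rate) can be sketched in Keras roughly as follows. The variables model, backbone_layers, yolo_loss, train_gen, and val_gen are assumed to be defined elsewhere, and the epoch counts and callback settings are illustrative assumptions, not the exact configuration used in the experiments.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Automatically decrease the learning rate when the validation loss stops improving.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, verbose=1)

# Phase 1: freeze the pre-trained DarkNet53 backbone and train only the
# randomly initialized detection layers.
for layer in backbone_layers:
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-2), loss=yolo_loss)
model.fit(train_gen, validation_data=val_gen, epochs=5, callbacks=[reduce_lr])

# Phase 2: unfreeze all layers and fine-tune the whole network.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=Adam(learning_rate=1e-3), loss=yolo_loss)
model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=[reduce_lr])
```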
V. EXPERIMENTAL RESULTS
The calculations were performed using NVidia CUDA technology on the graphics processor of a Tesla V100 graphics card with 32 GB of memory, an Intel Xeon Gold 6154 CPU (16 cores, 3.00 GHz), and 128 GB of RAM.
Table III shows the results of animal detection and classification on the test sample using the YOLOv3, RetinaNet, Faster R-CNN, and Cascade R-CNN architectures.
TABLE III. QUALITY OF BIG ANIMAL DETECTION ON THE TESTING SAMPLE (10 CLASSES)

Quality metric   Cascade R-CNN R-50-FPN   Faster R-CNN R-50-FPN   RetinaNet R-50-FPN   YOLOv3
AP_dog           0.81                     0.81                    0.83                 0.92
AP_horse         0.75                     0.76                    0.77                 0.88
AP_sheep         0.68                     0.67                    0.65                 0.75
AP_cow           0.65                     0.66                    0.60                 0.80
AP_elephant      0.82                     0.83                    0.84                 0.88
AP_bear          0.81                     0.87                    0.89                 0.95
AP_zebra         0.84                     0.88                    0.88                 0.91
AP_giraffe       0.87                     0.86                    0.87                 0.91
AP_fox           0.21                     0.18                    0.19                 0.18
AP_goat          0.39                     0.44                    0.41                 0.58
mAP              0.68                     0.70                    0.69                 0.78
As we can see from the table above, the YOLOv3 network has the best mAP score. As for the per-class AP, YOLOv3 is slightly inferior to the Cascade R-CNN network only in the "Fox" class; in all other classes, YOLOv3 is noticeably ahead of the other architectures, which showed roughly the same results among themselves.
RetinaNet has the highest speed (Table V), while the slowest architecture is Cascade R-CNN.
We also trained models for detecting animals as one joint class, that is, without classification. The quality of this detection is presented in Table IV.
TABLE IV. QUALITY OF BIG ANIMAL DETECTION ON THE TESTING SAMPLE (ONE JOINT CLASS)

Quality metric   Cascade R-CNN R-50-FPN   Faster R-CNN R-50-FPN   RetinaNet R-50-FPN   YOLOv3
mAP              0.81                     0.81                    0.83                 0.92
When detecting without classification, the mAP is higher. YOLOv3 again has the best result, and the remaining architectures are at about the same level. Table V shows the FPS (frames per second) performance metric for the architectures for both joint class and 10-class detection.
We can see that the speed is slightly higher than for 10-class detection. RetinaNet has the highest speed, and the slowest architecture is Cascade R-CNN.
TABLE V. PERFORMANCE OF BIG ANIMAL DETECTION

Performance metric                  Cascade R-CNN R-50-FPN   Faster R-CNN R-50-FPN   RetinaNet R-50-FPN   YOLOv3
FPS for one joint class detection   27.5                     40.9                    50.0                 39.8
FPS for 10 classes detection        26.8                     39.6                    44.6                 35.4
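The per-image inference time underlying the FPS figures (excluding image loading and preprocessing, as stated in Section II) might be measured along the following lines. This is an illustrative PyTorch sketch with assumed names, not the benchmarking code used for Table V.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, device="cuda"):
    """Average FPS over already-preprocessed input tensors (e.g. 3x384x640)."""
    model.eval().to(device)
    batches = [img.unsqueeze(0).to(device) for img in images]  # preload inputs to the GPU
    # Warm-up runs so one-time CUDA initialization does not distort the timing.
    for x in batches[:5]:
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for x in batches:
        model(x)
    torch.cuda.synchronize()
    return len(batches) / (time.time() - start)
```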
VI. CONCLUSION
The paper presents research on deep learning approaches to detecting 10 classes of big animals on a data set of about 20000 images: "Bear", "Fox", "Dog", "Horse", "Goat", "Sheep", "Cow", "Zebra", "Elephant", "Giraffe". The authors trained and tested several modern neural network architectures: YOLOv3, RetinaNet R-50-FPN, Faster R-CNN R-50-FPN, and Cascade R-CNN R-50-FPN. To compare the approaches, the mAP metric was determined at IoU≥50%, and their speed was measured for an input tensor size of 640x384x3. The highest quality metrics are demonstrated by the YOLOv3 architecture for both ten-class (0.78 mAP) and single joint class (0.92 mAP) detection, with a speed of more than 35 fps on an NVidia Tesla V100 32GB video card. On the same hardware, the RetinaNet R-50-FPN architecture provided a recognition speed of more than 44 fps with a 13% lower mAP. The proposed data set and neural network approach to recognizing big animals in images have shown their effectiveness and can be used in the on-board vision systems of driverless cars or in driver assistance systems.
For further study of this topic, it is necessary to increase the volume of the training and testing samples for all classes, especially for night and poorly lit road scenes. This can be done, for example, by using image augmentation or by the usual addition of new labeled images.
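As an example of the augmentation direction mentioned above, poorly lit scenes could be approximated by darkening existing images. The following torchvision-based sketch is our assumption about one possible augmentation, not part of the published work; the brightness and contrast ranges are illustrative.

```python
from torchvision import transforms

# Randomly darken and reduce contrast of images to roughly imitate night or
# poorly lit road scenes (hypothetical augmentation, illustrative ranges).
night_like = transforms.Compose([
    transforms.ColorJitter(brightness=(0.2, 0.6), contrast=(0.7, 1.0)),
    transforms.ToTensor(),
])
```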
ACKNOWLEDGMENT
This study was carried out under the contract with the
Scientific-Design Bureau of Computing Systems (SDB CS)
and supported by the Government of the Russian Federation
(Agreement No 075-02-2019-967).
REFERENCES
[1] W. Saad and A. Alsayyari, "Loose Animal-Vehicle Accidents Mitigation: Vision and Challenges," 2019 International Conference on Innovative Trends in Computer Engineering (ITCE), 2019.
[2] D. Zhou, "Real-time animal detection system for intelligent vehicles," 2014.
[3] A. Mammeri, D. Zhou, and A. Boukerche, "Animal-Vehicle Collision Mitigation System for Automated Vehicles," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, iss. 9, 2016, pp. 1287-1299.
[4] G. K. Verma and P. Gupta, "Wild Animal Detection Using Deep Convolutional Neural Networks," Second International Conference on Computer Vision & Image Processing (CVIP-2017), vol. 704, 2017.
[5] Z. Zhang, Z. He, G. Cao, and W. Cao, "Animal Detection From Highly Cluttered Natural Scenes Using Spatiotemporal Object Region Proposals and Patch Verification," IEEE Transactions on Multimedia, vol. 18, iss. 10, 2016, pp. 2079-2092.
[6] K. Saleh, M. Hossny, and S. Nahavandi, "Kangaroo Vehicle Collision Detection Using Deep Semantic Segmentation Convolutional Neural Network," 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016.
[7] B. Kellenberger, M. Volpi, and D. Tuia, "Fast animal detection in UAV images using convolutional neural networks," IGARSS 2017 - 2017 IEEE International Geoscience and Remote Sensing Symposium, 2017.
[8] S. Beery, D. Morris, and P. Perona, "The iWildCam 2019 Challenge Dataset," arXiv:1907.07617, 2019.
[9] Animal Image Dataset (DOG, CAT and PANDA), https://www.kaggle.com/ashishsaxena2209/animal-image-datasetdog-cat-and-panda.
[10] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, "Cats and Dogs," IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[11] A. Coates, H. Lee, and A. Y. Ng, "An Analysis of Single Layer Networks in Unsupervised Feature Learning," AISTATS, 2011.
[12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: A large-scale hierarchical image database," CVPR, pp. 248-255, 2009.
[13] COCO. Common Objects in Context, http://cocodataset.org.
[14] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale," arXiv:1811.00982, 2018.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, 88(2), pp. 303-338, 2010.
[16] S. M. Beitzel, E. C. Jensen, and O. Frieder, "MAP," in L. Liu and M. T. Özsu (eds.), Encyclopedia of Database Systems, Springer, Boston, MA, 2009.
[17] D. A. Yudin, A. Skrynnik, A. Krishtopik, I. Belkin, and A. I. Panov, "Object Detection with Deep Neural Networks for Reinforcement Learning in the Task of Autonomous Vehicles Path Planning at the Intersection," Optical Memory & Neural Networks (Information Optics), vol. 28, no. 4, 2019.
[18] D. Yudin, A. Ivanov, and M. Shchendrygin, "Detection of a Human Head on a Low-Quality Image and its Software Implementation," International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 42, 2/W12, 2019.
[19] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, 2018.
[20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal Loss for Dense Object Detection," The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980-2988.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv:1512.03385, 2015.
[22] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Neural Information Processing Systems, 2015.
[23] R. Girshick, "Fast R-CNN," The IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.
[24] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving Into High Quality Object Detection," The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6154-6162.
[25] Keras library, https://github.com/keras-team/keras.
[26] TensorFlow library, https://github.com/tensorflow/tensorflow.
[27] PyTorch library, https://github.com/pytorch/pytorch.