
Dog Breed Identification Using Deep Learning


Zalán Ráduly
Babeș–Bolyai University
Cluj-Napoca, Romania
Csaba Sulyok
Babeș–Bolyai University
Cluj-Napoca, Romania
Zsolt Vad´
Cluj-Napoca, Romania
Attila Z¨
Cluj-Napoca, Romania
Abstract—The current paper presents a fine-grained image
recognition problem, one of multi-class classification, namely
determining the breed of a dog in a given image. The presented
system employs innovative methods in deep learning, including
convolutional neural networks. Two different networks are trained
and evaluated on the Stanford Dogs dataset.
The usage/evaluation of convolutional neural networks is
presented through a software system. It contains a central server
and a mobile client, which includes components and libraries
for evaluating a neural network in both online and offline
environments.
Index Terms—convolutional neural networks, dog breed
identification, fine-grained image recognition, image classification,
inception resnet v2, mobile trained model, stanford dog dataset
Nowadays, convolutional neural networks (CNN) [1] are
popular in different topics of deep learning: image recogni-
tion [1], detection [2], speech recognition [3], data genera-
tion [4], etc.
Several traditional image recognition methods are known:
feature descriptors such as the Scale-Invariant Feature
Transform (SIFT) [5] and the Histogram of Oriented Gradients
(HoG) [6], and attribute classification with classifiers such as
the Support Vector Machine (SVM) or Softmax with a Cross-Entropy
loss. However, CNNs have also gained significant traction in
this field in recent years, mostly because general recurring
architectures are viable for solving many different problems.
The current paper presents the methodology and results of
fine-tuning CNNs for two different architectures, using the
Stanford Dogs dataset [7]. This constitutes a classification
problem, but also one of fine-grained image recognition, where
there are few and minute differences separating two classes.
Convolutional neural networks are very similar to Artificial
Neural Networks [8], which have learnable weights and biases.
The difference lies in the filters, which are applied across the
whole image and are effective in image recognition and
classification problems. Deep CNNs are viable on large datasets [9]
and are even accurate in large-scale video classification [10].
Fine-tuning methods and learning results for the Inception-
Resnet V2 [11] and NASNet-A mobile [12] architectures are
presented. Furthermore, the usage of the trained convolutional
neural networks is visualized through a separate software
system employing modern technologies. This system is able to
determine the breed of a dog in an image provided by the user,
and also displays detailed information about each recognized
breed. It consists of two main components: a mobile client
and a centralized web server.
The remainder of the document is structured as follows:
Section II provides an overview of similar approaches in the
literature, while Section III presents the used and preprocessed
Stanford Dogs dataset. Section IV details the learning of
two different CNNs, with Section V encapsulating the re-
sults thereof. Providing a practical edge to these CNNs, the
accompanying software system is described in Section VI.
Conclusions are drawn and further development plans are
proposed in Section VII.
The current section presents previous attempts at addressing
the problems tackled by the current research. Abdel-Hamid et
al. [3] solve a similar problem, one of speech recognition,
using traditional methods, making use of the size and position
of each local part, and PCA, while Sermanet et al. [13] and
Simon et al. [14] mention convolutional neural networks with
different architectures.
Liu et al. [15] present alternative learning methods in 2016
using attention localization, while Howard et al. in 2017 [16]
present the training of a CNN using the MobileNet architecture
and the Stanford Dogs dataset extended with noisy data.
Similar fine-grained image recognition problems are solved
by detection. For example, Zhang et al. [17] generalize R-
CNN to detect different parts of an image, while Duan et
al. [18] discover localized attributes. Angelova et al. [19] use
segmentation and object detection to tackle the same issue.
Chen et al. [20] use selective pooling vector for fine-grained
image recognition.
The current research is based on fine-tuning CNNs and the
results thereof are reproducible on the original Stanford Dogs
dataset using the presented methods.
The presented CNN learning methodology revolves around
the Stanford Dogs [7] dataset. It contains 120 different dog
breeds and is a subset of the more general ImageNet [21]. It
is separated into training and test datasets. Both sets contain
images of different sizes, and every image is given a label
representing the depicted dog breed. The training dataset
contains 12,000 images with roughly 100 per breed; the test
data consists of 8,580 unevenly distributed images.
The first step of the preprocessing is to split the training data
into training folds and a validation fold for experimental tuning
of the learning hyperparameters. Before splitting the data, the
dog images are resized to 256x256 pixels (NASNet-A mobile input)
and 299x299 pixels (Inception-Resnet V2 input).
For experimenting with the hyperparameters of the CNNs,
fine-tuning and 5-fold cross-validation are used, which in the
end produces 5 different training and validation subsets. Each
of these splits contains 9,600 training images and 2,400
validation images. After finding the best hyperparameters,
the entire Stanford Dogs training dataset is used for training,
while the test data is exclusively used for evaluation.
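
The 5-fold split described above can be sketched in plain Python as follows. This is only an illustration: the function name `k_fold_indices`, the fixed seed and the simple random shuffle are assumptions, since the text does not state how the folds were drawn (for example, whether they were stratified by breed).

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Assumption: a plain random shuffle; the paper does not describe the split.
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples is not divisible by k.
        end = start + fold_size if fold < k - 1 else n_samples
        val_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, val_idx

# 12,000 training images split into 5 folds.
folds = list(k_fold_indices(12000, k=5))
```

With 12,000 images and k = 5, every fold pairs 9,600 training indices with 2,400 validation indices, matching the counts given in the text.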
After obtaining the necessary formatted and resized data, the
next step is fine-tuning the convolutional neural networks. This
section presents the applied technologies, CNN architectures,
methods and hyperparameters, as well as the use of the trained
models as frozen graphs.
The presented problem fits into the category of fine-grained
image recognition, since the differences linking any sample
image to a certain class are few and minuscule; the CNN must
consider small key features to resolve ambiguity. For example,
the husky and the malamute breeds present small enough
differences to make differentiating them difficult even
for trained eyes.
A. CNN Architectures
Transfer learning [22] provides a performance boost by not
requiring full training from scratch. Instead, it uses pre-trained
models which have already learned general recurring features.
These models are often trained on the ImageNet [21] dataset,
which has a competition every year, and some pre-trained
models are published. Fine-tuning continues the training of such
a model on the given dataset, starting from the learned weights
and biases. The current research fine-tunes two different public
pre-trained convolutional neural networks:
NASNet-A mobile [12] and Inception-Resnet V2 [11].
The NASNet-A architecture is created by Google AutoML [24],
based on the approach of the Neural Architecture Search (NAS)
framework [23].
The Inception-Resnet V2 is a very deep architecture containing
over 300 layers, created by Google developers.
B. Data augmentation
The most common method to reduce overfitting on training
data is to use different transformations before the feedforward
pass during the training; this is called data augmentation [9].
During the learning of the CNN models, a further preprocessing
step, augmentation, is applied. The Inception-Resnet V2 and the
NASNet-A mobile architectures both use Inception preprocessing,
which is the following function:

$$f(x) = \left(\frac{x}{255.0} - 0.5\right) \cdot 2.0$$

where x is the image. Before the Inception preprocessing, the
image is randomly reflected and cropped by TensorFlow's
distorted bounding box algorithm.
Fig. 1. The adaptive learning rate during the training sessions decays
exponentially by 10% every 3 epochs. The blue curve is the learning rate
of the Inception-Resnet V2 with an initial value of 0.1, while the red
curve is the NASNet-A learning rate with an initial value of 0.029.
For the evaluation of the validation or the test dataset, the
applied augmentation steps include an 87.5% central crop and
the Inception preprocessing.
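
The evaluation-time steps above can be illustrated with a small plain-Python sketch. The helper names `inception_preprocess` and `central_crop_box` are hypothetical, and truncating the crop size to an integer is an assumption; the sketch only shows the arithmetic, not the TensorFlow routines used in the experiments.

```python
def inception_preprocess(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1.0, 1.0]: (x / 255 - 0.5) * 2."""
    return [(p / 255.0 - 0.5) * 2.0 for p in pixels]

def central_crop_box(height, width, fraction=0.875):
    """Return (top, left, crop_height, crop_width) of a centered crop.

    Assumption: the crop size is truncated to an integer number of pixels.
    """
    crop_h = int(height * fraction)
    crop_w = int(width * fraction)
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, crop_h, crop_w

# Black and white pixels land on the ends of the target range.
scaled = inception_preprocess([0, 255])
# An 87.5% central crop of a 299x299 Inception-Resnet V2 input.
box = central_crop_box(299, 299)
```

For a 299x299 image this keeps a centered 261x261 region, and the scaling sends 0 to -1.0 and 255 to 1.0.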
C. Learning and hyperparameters
The fine-tuning experiments of the convolutional neural net-
works are performed on a personal computer with a GeForce
GTX 1080 GPU, an Intel Core i5-6400 CPU and 64 GB RAM.
During the learning, a Softmax Cross-Entropy loss function
and the Nesterov momentum [25] optimizer are used for the
fine-tuning of both the NASNet-A mobile model and the
Inception-Resnet V2 model. During the training, the last
fully-connected layer (logits) is unfrozen and fine-tuned, while
the other layers remain frozen and unchanged. This makes use of
the benefits of the pre-trained models.
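
As a rough illustration of the loss and optimizer named above, the following plain-Python sketch computes a softmax cross-entropy for one sample and performs one Nesterov momentum update. The function names are hypothetical, and the momentum coefficient of 0.9 is an assumption, since the text does not report it; the experiments themselves rely on TensorFlow's implementations.

```python
import math

def softmax_cross_entropy(logits, true_class):
    """Cross-entropy of softmax(logits) against a single true class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[true_class]

def nesterov_step(weights, velocity, grad_fn, lr, momentum=0.9):
    """One Nesterov momentum update; grad_fn returns the gradient at a point."""
    # Evaluate the gradient at the look-ahead position, not at the current weights.
    lookahead = [w + momentum * v for w, v in zip(weights, velocity)]
    grads = grad_fn(lookahead)
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# With 120 equally scored breeds, the loss equals log(120).
uniform_loss = softmax_cross_entropy([0.0] * 120, true_class=3)
# One step on f(w) = w^2, whose gradient is 2w.
w, v = nesterov_step([1.0], [0.0], lambda ws: [2.0 * x for x in ws], lr=0.1)
```

Because only the final logits layer is trainable during fine-tuning, updates of this kind would apply to that layer's weights alone.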
The first part of the CNN training involves hyperparameter
tuning using cross-validation. During the experiments the
hyperparameters are chosen empirically. After finding the
appropriate parameters, another phase of learning begins on
the entire Stanford Dogs training dataset, with the model
evaluated on the test dataset in this case.
Both convolutional neural networks are trained with the
following common hyperparameters: a batch size of 64, an
exponentially decreasing learning rate with a different initial
value for each network, where the rate decays by 10% every
3 epochs (approximately 563 steps), a weight decay of 0.0001,
and training for 30,000 steps. The NASNet-A mobile architecture
is fine-tuned with an initial learning rate of 0.029, the
Inception-Resnet V2 with an initial learning rate of 0.1
(see Figure 1). Training the mobile model takes roughly one third
of the time needed for the Inception Resnet V2 model.
Further hyperparameters experimented with include: fixed
learning rates (0.01, 0.001), exponentially adaptive learning rates
(with initial values of 0.031, 0.035), the default weight decay
(0.0004), different numbers of training steps (15,000,
20,000, 20,500) and a different optimizer (RMSprop).
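
The learning-rate schedule of the final training runs can be sketched as below. Two points are assumptions rather than statements from the text: "decays 10%" is read as multiplying the rate by 0.9, and the decay is applied in discrete steps (staircase) every 563 training steps. The function names are hypothetical.

```python
def steps_per_epoch(n_images=12000, batch_size=64):
    """Number of training steps needed to see every training image once."""
    return n_images / batch_size

def learning_rate(step, initial_lr, decay_steps=563, decay_factor=0.9):
    """Staircase exponential decay: multiply by decay_factor every decay_steps."""
    return initial_lr * decay_factor ** (step // decay_steps)

# 3 epochs correspond to 562.5 steps, which the text rounds to 563.
three_epochs = 3 * steps_per_epoch()
# Learning rates at the start and after the first decay, for both networks.
nasnet_rates = [learning_rate(s, 0.029) for s in (0, 563)]
inception_rates = [learning_rate(s, 0.1) for s in (0, 563)]
```

Under these assumptions, 30,000 training steps contain 53 decay events, so the final rate is the initial value multiplied by 0.9 to the power of 53.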
Fig. 2. The accuracy of the trained NASNet-A and the Inception Resnet V2
models on the train and test datasets.
Fig. 3. Normalized confusion matrices depicting the bottom 10 ranking classes
from the test dataset: (a) NASNet-A mobile bottom 10 classes, (b) Inception
Resnet V2 bottom 10 classes. The y axis holds the actual labels, while the
x axis shows the predicted labels in equivalent order.
D. Frozen graph
To use the trained CNN for evaluation through an API, the
model must first be frozen. This step involves freezing, saving
and exporting the model of the neural network to a single binary
file that includes both the structure and the parameters.
The trained models are evaluated every 10 minutes on the
training and test datasets. In this section, we present the relevant
metrics for evaluation: accuracy (see Figure 2), precision,
recall (see Table I) and confusion matrices (see Figure 3). The
evaluation points are evenly distributed in time, not in number
of steps, hence the uneven step distribution.
The accuracy is monitored on the training and test datasets;
this metric represents the percentage of correctly classified
images in a dataset. The NASNet-A mobile architecture
achieves 85.06% accuracy on the train dataset and 80.72% on
the test dataset. The deeper CNN shows better results:
the Inception Resnet V2 network achieves 93.66%
accuracy on the train dataset and 90.69% on the test dataset
(see Figure 2). The superior performance of the latter is
not surprising, since it is a much deeper network.
Precision and recall are also measured during the evaluation
of the training and test datasets; results are presented in Table I.
The precision and the recall of the trained Inception Resnet
V2 model are better: the deeper model extracts more features
from an image and classifies better than the trained NASNet-A
mobile model.
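
For reference, the three metrics can be derived from a confusion matrix as in the following plain-Python sketch. It assumes rows hold the actual labels and columns the predicted ones, and it averages precision and recall over classes with equal weight (macro averaging); the text does not state which averaging was used, so this is an assumption, as is the function name.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of images of actual class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls = [], []
    for c in range(n):
        predicted_c = sum(cm[i][c] for i in range(n))  # column sum
        actual_c = sum(cm[c])                          # row sum
        precisions.append(cm[c][c] / predicted_c if predicted_c else 0.0)
        recalls.append(cm[c][c] / actual_c if actual_c else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,        # macro average (assumption)
    }

# Toy two-class example with 10 images per class.
toy = metrics_from_confusion([[8, 2], [1, 9]])
```

When every class has the same number of images, macro-averaged recall coincides with accuracy, as in the toy example.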
Furthermore, confusion matrices are built in the last step of
training for both CNNs using the test data, representing the 10
best and 10 worst scoring classes (the bottom 10 are visible in
Figure 3). Examining the results, both CNNs have difficulties
differentiating between certain classes, e.g. the eskimo dog and
siberian husky, or the toy poodle and miniature poodle. Even
from the confusion matrices, the accuracies can be observed
to be lower for the NASNet-A mobile than for the Inception-Resnet
V2.

Table I. Precision and recall of the trained models.
Metric     Model                Training dataset  Test dataset
Precision  NASNet-A             85.27%            80.03%
Precision  Inception Resnet V2  93.86%            90.18%
Recall     NASNet-A             85.06%            79.86%
Recall     Inception Resnet V2  93.65%            89.99%

The top ranking classes from the confusion matrices are also
analyzed. The Inception-Resnet V2 trained model classifies
10 different dog breeds¹ with 100% accuracy. On the other
hand, the NASNet-A trained model classifies only 1 breed
with 100% accuracy, namely the sealyham terrier.
To gain further insight from the confusion matrices, a
statistical analysis is performed, counting how many dog
breeds are classified correctly within each percentage interval;
the resulting histogram is visible in Figure 4. The histogram
shows that in the case of the Inception Resnet V2 trained
model, most classes from the train dataset are classified in the
(95%, 100%] interval, while most classes from the test dataset
fall in the (90%, 95%] range. The test dataset
is classified well, with some inaccuracies remaining. The
per-class accuracies of the NASNet-A model on the train and test
datasets have a wider spread, with a majority of the classes
falling in the (50%, 80%] range. The large accuracy difference
between the two models is understandably correlated with
the size and complexity of the architectures.
The mentioned accuracies using the NASNet-A architecture
are achieved using the Nesterov momentum optimizer. For
comparison, an alternative optimizer, RMSprop, is also tested;
the results are similar (see Table II).
A comparison to related work on the Stanford Dogs dataset is
given in Table III.

¹ brittany spaniel, chow, afghan hound, african hunting dog, keeshond,
schipperke, bedlington terrier, sealyham terrier and curly

Fig. 4. Histogram of classified class percentages for both CNN architectures
on both datasets. A bar represents the count of classes within a range of

Table II. Accuracy of the NASNet-A mobile model with different optimizers.
Optimizer          Training accuracy  Test accuracy
Nesterov momentum  85.06%             80.72%
RMSprop            84.94%             80.57%

Table III. Comparison to related work on the Stanford Dogs dataset.
Method                       Accuracy
Chen et al. [20]             52.00%
Simon et al. [14]            68.61%
GoogLeNet [13]               75.00%
Krause et al. [26]           82.6%
Liu et al. [15] (ResNet-50)  88.9%
Ours, NASNet-A mobile        80.72%
Ours, Inception Resnet V2    90.69%

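
The interval analysis behind the histogram can be sketched as follows. The bin width of 5 percentage points is an assumption inferred from intervals such as (90%, 95%] and (95%, 100%] mentioned in the text, and the function names are hypothetical.

```python
import math

def per_class_accuracy(cm):
    """Fraction of correctly classified images per actual class (row of cm)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def interval_histogram(accuracies, bin_width=5):
    """Count classes per half-open percentage interval (lo, lo + bin_width]."""
    counts = {}
    for acc in accuracies:
        # Rounding guards against floating-point noise exactly at bin edges.
        pct = round(acc * 100.0, 6)
        # Ceiling puts 95 < pct <= 100 into (95, 100]; 0% goes to the first bin.
        upper = max(bin_width, math.ceil(pct / bin_width) * bin_width)
        key = (upper - bin_width, upper)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Five hypothetical per-class accuracies.
hist = interval_histogram([1.0, 0.97, 0.95, 0.93, 0.5])
```

Each key is a (lower, upper) pair in percent; a class with exactly 95% accuracy is counted in (90, 95], in line with the half-open intervals used in the text.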
After the evaluation of each trained convolutional neural
network, a Grad-CAM [27] heat map visualization is made
for the NASNet-A mobile trained model (see Figure 5). The
last convolutional block "pays attention" mostly to the heads
of the dogs in an image.
Examining the heat map images, the NASNet-A model
mostly focuses on the head of the dogs. In the second image,
which shows a German Shepherd in a different position, the
CNN also pays attention to the body. If an image contains
more than one dog, the network attends to all of the
recognized dogs to varying degrees, and evaluates the image
taking each breed that appears into consideration. For an accurate
classification, the evaluated image must contain different
parts of the dog, including the head. The
displayed images are evaluated correctly by the NASNet-A
mobile trained model.
Using different data augmentation (random image rotation
between 0 and 15 degrees, random zooming or 87.5% central
cropping) does not improve the accuracy of the trained
models.
The alternative hyperparameters presented in Section IV-C
(fixed vs. adaptive learning rates, weight decay and step
counts) do not improve the accuracy or other relevant
metrics after training the CNNs.
Fig. 5. Grad-CAM visualization with heat maps for the last layers of
the NASNet-A mobile trained model. The last convolutional block "pays
attention" to the warm colored parts of the images (mostly to the heads of
the dogs), while the cold colors represent the parts of the image that
receive less attention. These pictures are not part of the Stanford Dogs dataset.
The usage of the trained convolutional neural networks
is presented through a software system called Sniff!. Its
associated mobile application allows users to take a photo or
select an existing one from the gallery; it not only classifies
the image, but also displays detailed information about each
evaluated breed. The displayed data serves educational and
informative purposes.
The software system consists of two components: a central
server written in the Go programming language, and a mobile
client realized in React Native. The components communicate
via HTTP requests/responses.
The server contains a classifying module using the TensorFlow
Go library; it loads the trained convolutional neural
network, then preprocesses and evaluates images. The Inception
Resnet V2 and NASNet-A models are both runnable on the
server, since a desktop machine can make use of more
CPU/GPU resources for a faster evaluation. By default, the
Inception Resnet V2 model is used, since it reaches higher accuracies.
The evaluation of an image yields a result for every class,
with most classes receiving a score of 0%. To avoid reporting
wrong classifications with low percentages, a threshold is
set on the server and also on the mobile client.
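
The thresholding step can be illustrated with a short plain-Python sketch. The function name, the threshold value of 0.1 and the limit of three returned breeds are all hypothetical; the text only states that a threshold exists on both the server and the client, not its value. The server itself is written in Go, so this shows the idea rather than the actual implementation.

```python
def filter_predictions(probabilities, labels, threshold=0.1, top_k=3):
    """Keep the top_k (label, probability) pairs at or above the threshold."""
    ranked = sorted(zip(labels, probabilities), key=lambda lp: lp[1], reverse=True)
    return [(label, p) for label, p in ranked[:top_k] if p >= threshold]

# Hypothetical scores for four breeds; low-scoring classes are dropped.
result = filter_predictions(
    [0.25, 0.7, 0.05, 0.0],
    ["eskimo dog", "siberian husky", "malamute", "chow"],
)
```

Only the breeds whose score reaches the threshold are returned, ordered from most to least likely.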
The mobile client can take an image with the camera of the
smartphone or import one from its gallery, and submit it for
classification. The process can happen online, using the central
server for a faster classification, or offline, using the phone's
resources when no network connection is available. Offline
evaluation is facilitated by the React Native Tensorflow [28]
wrapper library. Figure 6 shows the usage of the Sniff!
application.
The app uses the NASNet-A mobile trained model, which
is loaded into the memory of the device for every evaluation;
loading depends on the resources of the phone. The memory is
freed after the evaluation.
The app displays detailed information about the detected
breeds, with data web scraped from A-Z Animals² and dog-
² Source: https://a-z-
(a) Camera (b) Crop (c) Result
Fig. 6. Main components in the Sniff! application.
Two different convolutional neural network architectures
have been presented: the NASNet-A mobile architecture and
the Inception Resnet V2 deep architecture.
The architectures have been tested on a niche image
classification problem: that of recognizing dog breeds. The
pre-trained networks are fine-tuned using the Stanford Dogs
dataset.
Results are promising even for the smaller, mobile-friendly
CNN, whose test accuracy is only about 10 percentage points
lower than that of the deep Inception Resnet V2 model.
We have also presented a usage of the fine-tuned
convolutional neural networks through a software system called Sniff!:
a mobile application which can determine the breed of a dog
from an image (even without an Internet connection).
The convolutional neural networks can be further developed
by: using Generative Adversarial Nets (GAN) [4] to extend the
training dataset, using other loss functions such as center loss
[29], training other convolutional neural network architectures,
expanding the dataset with other popular dog breeds, using
detectors for locating multiple dogs in an image, and optimizing
the server and mobile classification.
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” in Advances in Neural
Information Processing Systems 25. Curran Associates, Inc., 2012,
pp. 1097–1105.
[2] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards
real-time object detection with region proposal networks,” CoRR, vol.
abs/1506.01497, 2015.
[3] O. Abdel-Hamid, A. r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu,
“Convolutional neural networks for speech recognition,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 22,
no. 10, pp. 1533–1545, Oct 2014.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems, 2014, pp. 2672–
[5] T. Lindeberg, “Scale Invariant Feature Transform,” Scholarpedia, vol. 7,
no. 5, p. 10491, 2012, revision #153939.
[6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp.
[7] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset
for fine-grained image categorization,” in First Workshop on Fine-
Grained Visual Categorization, IEEE Conference on Computer Vision
and Pattern Recognition, Colorado Springs, CO, June 2011.
[8] R. J. Schalkoff, Artificial neural networks. McGraw-Hill New York,
1997, vol. 1.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks.”
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[11] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-
resnet and the impact of residual connections on learning,” CoRR, vol.
abs/1602.07261, 2016.
[12] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning trans-
ferable architectures for scalable image recognition,” CoRR, vol.
abs/1707.07012, 2017.
[13] P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained catego-
rization,” CoRR, vol. abs/1412.7054, 2014.
[14] M. Simon and E. Rodner, “Neural activation constellations: Unsuper-
vised part model discovery with convolutional networks,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
[15] X. Liu, T. Xia, J. Wang, and Y. Lin, “Fully convolutional attention
localization networks: Efficient attention localization for fine-grained
recognition,” CoRR, vol. abs/1603.06765, 2016.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural
networks for mobile vision applications,” CoRR, vol. abs/1704.04861,
[17] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns
for fine-grained category detection,” in Computer Vision – ECCV 2014,
D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 834–849.
[18] K. Duan, D. Parikh, D. Crandall, and K. Grauman, “Discovering local-
ized attributes for fine-grained recognition,” in 2012 IEEE Conference
on Computer Vision and Pattern Recognition, June 2012, pp. 3474–3481.
[19] A. Angelova and S. Zhu, “Efficient object detection and segmentation for
fine-grained recognition,” in Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 811–818.
[20] G. Chen, J. Yang, H. Jin, E. Shechtman, J. Brandt, and T. X. Han,
“Selective pooling vector for fine-grained recognition,” in 2015 IEEE
Winter Conference on Applications of Computer Vision, Jan 2015, pp.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[22] Transfer learning and the art of using pre-trained models in deep
learning. [Online]. Available:
2017/06/transfer-learning-the-art- of-fine- tuning-a- pre-trained- model/
[23] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement
learning,” CoRR, vol. abs/1611.01578, 2016.
[24] Google automl. [Online]. Available:
2017/05/using-machine- learning-to- explore.html
[25] S. Ruder, “An overview of gradient descent optimization algorithms,”
CoRR, vol. abs/1609.04747, 2016.
[26] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin,
and L. Fei-Fei, “The unreasonable effectiveness of noisy data for fine-
grained recognition,” in Computer Vision – ECCV 2016, B. Leibe,
J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International
Publishing, 2016, pp. 301–320.
[27] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh,
and D. Batra, “Grad-cam: Why did you say that? visual explana-
tions from deep networks via gradient-based localization,” CoRR, vol.
abs/1610.02391, 2016.
[28] React native tensorflow. [Online]. Available:
[29] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature
learning approach for deep face recognition,” in European Conference
on Computer Vision. Springer, 2016, pp. 499–515.
... However, very few methods exist for breed identification of animals based on images despite its tremendous utility. Such an attempt has been carried out for dog breed classification [7], [8] using deep neural networks. In this work, the focus is on breed identifiers associated with pigs. ...
... In this paper, the focus was on qualifying the muzzle of a pig as a robust breed descriptor. Since the available methods in literature related to breed classification in animals based on visual biometrics uses Deep Neural Networks [7], [8], a direct technology transfer is not possible in our case due to the availability of very limited data. Apart from developing a segmentation procedure for picking out the internal details of the muzzle, visual descriptors for computing the feature vectors were identified. ...
Full-text available
Non-intrusive and automated detection of pig breeds, particularly from visual standpoint, is important from a food quality tracking perspective, both from the point of view of the vendors as well as the buyers. Colour as well as texture based visual descriptors from muzzle images have been identified, which, serve as breed-identifiers to separate four common pig-breeds: Duroc, Ghungroo, Hampshire and Yorkshire. While these handcrafted visual descriptors by themselves are fairly robust and discriminative, it is recognized that by controlling the decision space either by choosing the feature-type based on colour or texture or by combining features and also selecting the order in which particular breeds are siphoned, classification accuracies can be improved. In that light, is proposed a stable, relatively data-independent, breed-specific, hierarchical tree synthesis and feature selection procedure, based on a breed-pair cluster separation table, aided by some secondary statistics. The proposed approach has been compared with the state of the art Phylogenetic distance based Hierarchical Agglomerative Clustering algorithm (AGNES) and also with the standard decision tree classification algorithm. When completely different sets of pigs were used for training and testing (50-50 split), the proposed algorithm reported relatively high mean classification accuracies of 86.45% for Duroc, 93.02% for Ghungroo, 86.91% for Hampshire and 98.54% for Yorkshire.
... The behavior and identification of dogs have been studied in various domains that include: veterinarian science [3,4], human-computer interaction [5,6], and computer vision [7][8][9][10][11]. We find that many existing works focus on capturing the behavior of dogs with the inclusion of wearable sensors. ...
... Tasks that utilize computer vision exist for works similar to dog breed classification and analyzing sleeping behaviors, however; they are often developed for systems with large resource availability. For example, Raduly et al. [7] use the camera on a smart phone to capture images to classify dog breeds. The NASNet-A network architecture is used for on device inference, but achieves a low accuracy relative to a more complex network. ...
In this paper we outline the development methodology for an automatic dog treat dispenser which combines machine learning and embedded hardware to identify and reward dog behaviors in real-time. Using machine learning techniques for training an image classification model we identify three behaviors of our canine companions: "sit", "stand", and "lie down" with up to 92% test accuracy and 39 frames per second. We evaluate a variety of neural network architectures, interpretability methods, model quantization and optimization techniques to develop a model specifically for an NVIDIA Jetson Nano. We detect the aforementioned behaviors in real-time and reinforce positive actions by making inference on the Jetson Nano and transmitting a signal to a servo motor to release rewards from a treat delivery apparatus.
... This study road towards that system; any breed detection system involves two parts one is building a dataset to train the model; other is the development of model that can be properly trained to recognize the different classes of objects. Like Zalán Ráduly et al., use benchmark dataset of Stanford dog breeds for the development of a dog breed identification model [8]. This study basically targets the first part i.e. building a cow breed dataset. ...
... classification problem i.e. recognizing dog breeds a fine-grained image recognition problem[8]. Lingyun Li et al., use (bird breeds) dataset to check the classification accuracy of fine-tuning algorithms. ...
Full-text available
Fine-grained visual categorization (FGVC) dealt with objects belonging to one class with intra-class differences into subclasses. FGVC is challenging due to the fact that it is very difficult to collect enough training samples. This study presents a novel image dataset named Cowbreefor FGVC. Cowbree dataset contains 4000 images belongs to eight different cow breeds. Images are properly categorized under different breed names (labels) based on different texture and color features with the help of experts. While evidence shows that the existing dataset are of low quality, targeting few breeds with less number of images. To validate the dataset, three state of the art classifiers sequential minimal optimization (SMO), Multiclass classifier and J48 were used. Their results in term of accuracy are 68.81%, 55.81% and 57.45% respectively. Where results shows that SMO out performed with 68.81% accuracy, 68.4% precision and 68.8% recall.
The social structure of urban India has been changed and most pet lovers choose the dog over any other kind of pet. The population of adopted dogs is projected at 31.5 million approximately by 2023. With the increase in demand, the fraud cases of selling the right breed are rising day by day. With the demand for different dog breeds, recognizing the correct breed in time by their physical ability, instinct, interaction, and behavior, the body structure is necessary. Recent developments of artificial intelligence have already proven its superiority over the human capability for image classification tasks. The present work has built a Convolutional Neural Network (CNN)-based model to construct a highly accurate dog breed image classifier. In this paper, various state-of-the-art deep CNN models have been applied, and a modified-Xception model has been proposed for improving the overall accuracy. For evaluating the overall classification performance of our proposed methodology, the Kaggle Dog Breed Identification dataset has been used and throughout the experiment, our modified-Xception model has achieved 87.40%, the highest overall accuracy.
Dogs are one of the most common domestic animals. The large number of dogs raises several issues, such as population control, reducing outbreaks of diseases such as rabies, vaccination control, and legal ownership. At present, there are over 180 dog breeds, each with specific characteristics and health conditions. To provide appropriate treatment and training, it is essential to identify individuals and their breeds. Machine learning makes it possible to train models that handle classification and prediction based on information derived from raw data. Convolutional neural networks (CNNs) are among the most frequently used methods for image classification and detection. In this work, we define a CNN-based approach for detecting dogs in potentially complex images and then consider the identification of the dog's breed. The experimental analysis, supported by standard metrics and graphical representation, confirms that the CNN gives good accuracy on all tested datasets.
In this paper, an attempt has been made to develop a model that decides with precision the breed identity of an individual goat from its image. For image-based multi-class classification tasks, CNNs have been found to be the best tool, but selecting the most efficient CNN model for a particular classification scenario is a very difficult job. To find an optimal CNN model for goat breed prediction, we have compared two of the most popular pre-trained deep-learning-based CNN models (VGG-16 and Inception-v3) based on their performance. Both models have been fine-tuned using transfer learning on the goat breed database. This database has been created from goat images of six different breeds, captured at different organized registered goat farms in India; almost two thousand digital images of individual goats have been captured without imposing stress on the animals. It has been observed that Inception-v3 outperformed VGG-16 with higher accuracy and lower training time. To measure the prediction performance of the fine-tuned Inception-v3 model, it has been applied to a test set of pure-breed goat images, and standardized classification performance metrics have been used to evaluate the prediction results. The results establish that the proposed method is able to classify (recognize) goat breeds with high accuracy. Finally, a comparison has been made with the prediction accuracies of different technologies used for identification of domestic animals.
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large scale geo-localization.
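To illustrate why depth-wise separable convolutions lead to lighter networks, the following minimal sketch compares multiply-add counts for a single layer. It is not taken from the abstract above: the function names, the layer sizes, and the assumption of stride 1 with an output feature map the same size as the input are all illustrative choices.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a standard convolution (assumes stride 1, square maps)."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def depthwise_separable_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a depth-wise convolution followed by a 1x1 point-wise one."""
    # Depth-wise step: one kernel x kernel filter per input channel.
    depthwise = kernel * kernel * in_ch * fmap * fmap
    # Point-wise step: a 1x1 convolution that mixes the channels.
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
    k, m, n, f = 3, 64, 128, 56
    std = standard_conv_mult_adds(k, m, n, f)
    sep = depthwise_separable_mult_adds(k, m, n, f)
    # The ratio simplifies to 1/out_ch + 1/kernel^2.
    print(f"cost ratio (separable / standard): {sep / std:.4f}")
```

Under these assumptions the cost ratio reduces to 1/out_ch + 1/kernel², so a 3x3 separable layer needs roughly an eighth to a ninth of the computation of its standard counterpart.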
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
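The minimax game described above is commonly written as the following value function. The notation is an assumption added here for illustration: p_data is the data distribution, p_z the noise prior fed to G, and p_g the distribution of generated samples; none of these symbols appear in the abstract itself.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]

% For a fixed G, the optimal discriminator is
D^{*}_{G}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}
```

When p_g equals p_data, the optimal discriminator evaluates to 1/2 for every x, which matches the unique solution stated above.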
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
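As a rough sketch of how top-1 and top-5 error rates of this kind are computed, the helper below counts a sample as an error when its true label is missing from the k highest-scoring classes. The function name and the toy scores are hypothetical and do not come from the cited work.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k best-scored classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


if __name__ == "__main__":
    # Toy example: three samples, three classes (illustrative values only).
    scores = [
        [0.1, 0.7, 0.2],
        [0.5, 0.3, 0.2],
        [0.2, 0.3, 0.5],
    ]
    labels = [1, 1, 0]
    print("top-1 error:", top_k_error(scores, labels, 1))
    print("top-2 error:", top_k_error(scores, labels, 2))
```

Top-k error can only decrease as k grows, which is why the top-5 figures reported above are lower than the top-1 figures.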
Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability. We demonstrate its efficacy on four fine-grained datasets, greatly exceeding existing state of the art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories. Quantitatively, we achieve top-1 accuracies of 92.3% on CUB-200-2011, 85.4% on Birdsnap, 93.4% on FGVC-Aircraft, and 80.8% on Stanford Dogs without using their annotated training sets. We compare our approach to an active learning approach for expanding fine-grained datasets.
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Developing state-of-the-art image classification models often requires significant architecture engineering and tuning. In this paper, we attempt to reduce the amount of architecture engineering by using Neural Architecture Search to learn an architectural building block on a small dataset that can be transferred to a large dataset. This approach is similar to learning the structure of a recurrent cell within a recurrent network. In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.
Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.84, which is only 0.1 percent worse and 1.2x faster than the current state-of-the-art model. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art.