Dog Breed Identification Using Deep Learning
Zalán Ráduly
Babeș–Bolyai University
Cluj-Napoca, Romania
raduly.zalan@gmail.com

Csaba Sulyok
Babeș–Bolyai University
Cluj-Napoca, Romania
csaba.sulyok@gmail.com

Zsolt Vadászi
Codespring
Cluj-Napoca, Romania
vadaszi.zsolt@codespring.ro

Attila Zölde
Codespring
Cluj-Napoca, Romania
zolde.attila@codespring.ro
Abstract—The current paper presents a fine-grained image recognition problem, one of multi-class classification: determining the breed of a dog in a given image. The presented system employs innovative methods in deep learning, including convolutional neural networks. Two different networks are trained and evaluated on the Stanford Dogs dataset.
The usage and evaluation of the convolutional neural networks are presented through a software system. It contains a central server and a mobile client, which includes components and libraries for evaluating a neural network in both online and offline environments.
Index Terms—convolutional neural networks, dog breed identification, fine-grained image recognition, image classification, Inception-Resnet V2, mobile trained model, Stanford Dogs dataset
I. INTRODUCTION
Nowadays, convolutional neural networks (CNN) [1] are popular in different topics of deep learning: image recognition [1], detection [2], speech recognition [3], data generation [4], etc.
Several traditional image recognition methods are known: Scale-Invariant Feature Transform (SIFT) [5], Histogram of Oriented Gradients (HoG) [6], and attribute classification with classifiers such as the Support Vector Machine (SVM) or Softmax with a cross-entropy loss. However, CNNs have also gained significant traction in this field in recent years, mostly due to general recurring architectures being viable for solving many different problems.
The current paper presents the methodology and results of
fine-tuning CNNs for two different architectures, using the
Stanford Dogs dataset [7]. This constitutes a classification
problem, but also one of fine-grained image recognition, where
there are few and minute differences separating two classes.
Convolutional neural networks are similar to artificial neural networks [8] in that they have learnable weights and biases. The difference lies in the convolutional filters, which slide over the whole image and are effective in image recognition and classification problems. Deep CNNs are viable on large datasets [9] and are accurate even in large-scale video classification [10].
Fine-tuning methods and learning results for the Inception-
Resnet V2 [11] and NASNet-A mobile [12] architectures are
presented. Furthermore, the usage of the trained convolutional
neural networks is visualized through a separate software
system employing modern technologies. This system is able to
determine the breed of a dog in an image provided by the user,
and also displays detailed information about each recognized
breed. It consists of two main components: a mobile client
and a centralized web server.
The remainder of the document is structured as follows: Section II provides an overview of similar approaches in the literature, while Section III presents the used and preprocessed Stanford Dogs dataset. Section IV details the learning of two different CNNs, with Section V encapsulating the results thereof. Providing a practical edge to these CNNs, the accompanying software system is described in Section VI. Conclusions are drawn and further development plans are proposed in Section VII.
II. RELATED WORKS
The current section presents previous attempts at addressing
the problems tackled by the current research. Abdel-Hamid et
al. [3] solve a similar problem, one of speech recognition,
using traditional methods, making use of the size and position
of each local part, and PCA, while Sermanet et al. [13] and
Simon et al. [14] mention convolutional neural networks with
different architectures.
Liu et al. [15] present alternative learning methods in 2016
using attention localization, while Howard et al. in 2017 [16]
present a learning of a CNN using the MobileNet architecture
and the Stanford Dogs dataset extended with noisy data.
Similar fine-grained image recognition problems are solved
by detection. For example, Zhang et al. [17] generalize R-
CNN to detect different parts of an image, while Duan et
al. [18] discover localized attributes. Angelova et al. [19] use
segmentation and object detection to tackle the same issue.
Chen et al. [20] use selective pooling vector for fine-grained
image recognition.
The current research is based on fine-tuning CNNs and the
results thereof are reproducible on the original Stanford Dogs
dataset using the presented methods.
III. DATASET AND PREPROCESSING
The presented CNN learning methodology revolves around the Stanford Dogs [7] dataset. It contains 120 different dog breeds and is a subset of the more general ImageNet [21]. It is separated into training and test datasets. Both sets contain images of different sizes, and every image is given a label representing the embodied dog breed. The training dataset contains 12,000 images with roughly 100 per breed; the test data consists of 8,580 unevenly distributed images.
The first step of the preprocessing is to split the training data into training folds and a validation fold for experimental tuning of the learning hyperparameters. Before splitting the data, the dog images are resized to 256x256 pixels (the NASNet-A mobile input) and 299x299 pixels (the Inception-Resnet V2 input).
For experimenting with the hyperparameters of the CNNs, fine-tuning and 5-fold cross-validation are used, which produces 5 different training and validation subsets. Each of these subsets contains 9,600 training images and 2,400 validation images. After obtaining the best fine-tuned hyperparameters, the entire Stanford Dogs training dataset is used for training, while the test data is exclusively used for evaluation.
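A minimal sketch of this splitting scheme using scikit-learn is given below; the placeholder file names, the stratification by breed, and the random seed are illustrative assumptions rather than details taken from the original setup.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # Illustrative placeholders: paths of the 12,000 resized training images
    # and their integer breed labels (roughly 100 per breed).
    image_paths = np.array(["img_%05d.jpg" % i for i in range(12000)])
    labels = np.repeat(np.arange(120), 100)

    # 5-fold split: each fold yields 9,600 training and 2,400 validation
    # images, used only for the empirical hyperparameter tuning.
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths, labels)):
        train_paths, train_labels = image_paths[train_idx], labels[train_idx]
        val_paths, val_labels = image_paths[val_idx], labels[val_idx]
        # ... fine-tune the CNN on this fold, validate on the held-out part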
IV. EXPERIMENTS
After obtaining the necessary formatted and resized data, the next step is fine-tuning the convolutional neural networks. This section presents the applied technologies, CNN architectures, methods, hyperparameters, and the use of the trained models as frozen graphs.
The presented problem fits into the category of fine-grained image recognition, since the differences linking any sample image to a certain class are few and minuscule; the CNN must consider small key features to resolve ambiguity. For example, the husky and the malamute breeds present small enough differences between them to make differentiation difficult even for trained eyes.
A. CNN Architectures
Transfer learning [22] provides a performance boost by not requiring a full training from scratch. Instead, it uses pre-trained models which have already learned general recurring features. These models are often trained on the ImageNet [21] dataset, which hosts a yearly competition, and some of the pre-trained models are published. Training such a model means fine-tuning it on the given dataset, starting from the learned weights and biases. The current research fine-tunes two different publicly available pre-trained convolutional neural networks: NASNet-A mobile [12] and Inception-Resnet V2 [11].
The NASNet-A architecture is created by Google AutoML [24], based on the approach of the Neural Architecture Search (NAS) framework [23].
The Inception-Resnet V2 is a very deep architecture containing over 300 layers, created by the Google developer team.
B. Data augmentation
The most common method to reduce overfitting on training data is to apply different transformations before the feedforward pass during training; this is called data augmentation [9]. During the learning of the CNN models, this additional preprocessing step is applied. Both the Inception-Resnet V2 and the NASNet-A mobile architectures use Inception preprocessing, defined by the following function:
f(x) = (x / 255.0 - 0.5) * 2.0
where x is the image. Before the Inception preprocessing, the image is randomly reflected and cropped by TensorFlow's distorted bounding box algorithm.
Fig. 1. The adaptive learning rates during the training sessions decay exponentially by 10% every 3 epochs. The blue curve is the learning rate of the Inception-Resnet V2 with an initial value of 0.1, while the red curve is the NASNet-A learning rate with an initial value of 0.029.
For the evaluation of the validation or the test dataset, the
applied augmentation steps include an 87.5% central crop and
the Inception preprocessing.
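A minimal TensorFlow sketch of these preprocessing steps follows; the original work relies on the TF-Slim Inception preprocessing pipeline, so the function names here are illustrative.

    import tensorflow as tf

    def inception_preprocess(image):
        # Map pixel values from [0, 255] to [-1, 1]: f(x) = (x / 255.0 - 0.5) * 2.0
        image = tf.cast(image, tf.float32)
        return (image / 255.0 - 0.5) * 2.0

    def eval_preprocess(image, target_size):
        # Evaluation-time pipeline: 87.5% central crop, resize to the network's
        # input size, then Inception preprocessing.
        image = tf.image.central_crop(image, central_fraction=0.875)
        image = tf.image.resize(image, [target_size, target_size])
        return inception_preprocess(image)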
C. Learning and hyperparameters
The fine-tuning experiments of the convolutional neural net-
works are performed on a personal computer with a GeForce
GTX 1080 GPU, an Intel Core i5-6400 CPU and 64 GB RAM.
During the learning, a softmax cross-entropy loss function and Nesterov momentum [25] optimizers are used for the fine-tuning of the NASNet-A mobile model and the Inception-Resnet V2 model. During training, the last fully-connected layer (logits) is unfrozen and fine-tuned, while the other layers remain frozen and unchanged. This makes use of the benefits of the pre-trained models.
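The following tf.keras sketch illustrates an equivalent setup under stated assumptions: the original work fine-tunes TF-Slim checkpoints rather than Keras models, and the momentum value of 0.9 is not taken from the paper.

    import tensorflow as tf

    # Pre-trained backbone (ImageNet weights) stays frozen.
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet",
        input_shape=(299, 299, 3), pooling="avg")
    base.trainable = False

    # Only the last fully-connected (logits) layer is trainable.
    logits = tf.keras.layers.Dense(120, activation="softmax", name="logits")
    model = tf.keras.Sequential([base, logits])

    # Softmax cross-entropy loss with a Nesterov momentum optimizer.
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1,
                                          momentum=0.9, nesterov=True),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])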
The first part of the CNN training involves hyperparameter tuning using cross-validation; the hyperparameters are chosen empirically. After settling on the appropriate parameters, another phase of learning begins on the entire Stanford Dogs training dataset, with the model evaluated on the test dataset.
Both convolutional neural networks are trained with the following common hyperparameters: a batch size of 64, an exponentially decreasing learning rate with different initial values, where the rate decays by 10% every 3 epochs (approximately 563 steps), a 0.0001 weight decay, and training for 30,000 steps. The NASNet-A mobile architecture is fine-tuned with a learning rate with an initial value of 0.029 (see Figure 1). The Inception-Resnet V2 is fine-tuned with a learning rate with an initial value of 0.1 (see Figure 1). Training the mobile model takes about a third of the time needed for the Inception Resnet V2 model.
Other hyperparameter settings experimented with include: fixed learning rates (0.01, 0.001), exponentially adaptive learning rates (with initial values of 0.031, 0.035), the default weight decay (0.0004), different numbers of training steps (15,000, 20,000, 20,500) and a different optimizer (RMSprop).
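The learning rate schedule can be expressed with TensorFlow's exponential decay, as sketched below; whether the decay is applied in a staircase fashion is an assumption based on Figure 1, and the momentum value is illustrative.

    import tensorflow as tf

    # The rate is multiplied by 0.9 (a 10% decay) roughly every 3 epochs,
    # i.e. about every 563 steps with a batch size of 64 on 12,000 images.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.029,  # NASNet-A mobile; 0.1 for Inception-Resnet V2
        decay_steps=563,
        decay_rate=0.9,
        staircase=True)

    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule,
                                        momentum=0.9, nesterov=True)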
Fig. 2. The accuracy of the trained NASNet-A and the Inception Resnet V2 models on the train and test dataset.
(a) NASNet-A mobile bottom 10 classes (b) Inception Resnet V2 bottom 10 classes
Fig. 3. Normalized confusion matrices depicting the bottom 10 ranking classes from the test dataset. The y axis holds the actual labels, while the x axis shows the predicted labels in equivalent order.
D. Frozen graph
To use the taught CNN for evaluation through an API, it is
necessary in preamble to freeze the model. This step involves
freezing, saving and exporting the model of the neural network
to a single binary file including the structure and parameters.
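A sketch of this export step using the TensorFlow 1.x graph utilities is shown below; the checkpoint path and the output node name are illustrative, not taken from the original code.

    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()

    # Restore the trained checkpoint and bake the variables into constants,
    # producing a single protobuf with the graph structure and parameters.
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph("model.ckpt.meta")
        saver.restore(sess, "model.ckpt")
        frozen_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=["output/predictions"])
        with tf.gfile.GFile("frozen_model.pb", "wb") as f:
            f.write(frozen_graph_def.SerializeToString())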
V. RESULTS
The trained models are evaluated every 10 minutes on the training and test datasets. In this section, we present the relevant metrics for evaluation: accuracy (see Figure 2), precision, recall (see Table I) and confusion matrices (see Figure 3). The evaluation points are linearly distributed in time, not in number of steps, hence the uneven step distribution.
The accuracy is monitored on the training and test datasets; this metric represents the mean percentage of correctly classified images in a dataset. The NASNet-A mobile architecture achieves 85.06% accuracy on the train dataset and 80.72% on the test dataset. A deeper CNN shows better results: the Inception Resnet V2 network achieves 93.66% accuracy on the train dataset and 90.69% on the test dataset (see Figure 2). The superior performance of the latter is not surprising, since it is a much deeper network.
Precision and recall are also measured during the evaluation of the training and test datasets; results are presented in Table I. The precision and recall of the trained Inception Resnet V2 model are better; the deeper model extracts more features from an image and classifies better than the trained NASNet-A mobile model.
Furthermore, confusion matrices are built in the last step of training for both CNNs using the test data, representing the 10 best and 10 worst scoring classes (the bottom 10 are visible in Figure 3).
TABLE I
PRECISION AND RECALL
Metric     Model                 Training dataset   Test dataset
Precision  NASNet-A              85.27%             80.03%
Precision  Inception Resnet V2   93.86%             90.18%
Recall     NASNet-A              85.06%             79.86%
Recall     Inception Resnet V2   93.65%             89.99%
Examining the results, both CNNs have difficulties differentiating between certain classes, e.g. the eskimo dog and siberian husky, or the toy poodle and miniature poodle. The confusion matrices also show that the accuracies are lower for the NASNet-A mobile than for the Inception-Resnet V2.
The top ranking classes from the confusion matrices are also analyzed. The Inception-Resnet V2 trained model classifies 10 different dog breeds¹ with 100% accuracy. On the other hand, the NASNet-A trained model classifies only 1 breed with 100% accuracy, namely the sealyham terrier.
¹ brittany spaniel, chow, afghan hound, african hunting dog, keeshond, schipperke, bedlington terrier, sealyham terrier and curly
Fig. 4. Histogram of classified class percentages for both CNN architectures on both datasets. A bar represents the count of classes within a range of accuracies.
TABLE II
NASNET-A MOBILE ACCURACY WITH DIFFERENT OPTIMIZERS
Optimizer           Training accuracy   Test accuracy
Nesterov momentum   85.06%              80.72%
RMSprop             84.94%              80.57%
TABLE III
PERFORMANCE OF SPECIES CATEGORIZATION USING STANFORD DOGS
Method                          Accuracy
Chen et al. [20]                52.00%
Simon et al. [14]               68.61%
GoogLeNet [13]                  75.00%
Krause et al. [26]              82.6%
Liu et al. [15] (ResNet-50)     88.9%
Ours, NASNet-A mobile           80.72%
Ours, Inception Resnet V2       90.69%
To gain further insight from the confusion matrices, a statistical analysis is performed, counting how many dog breeds are classified correctly within each percentage interval; the resulting histogram is visible in Figure 4. The histogram shows that in the case of the Inception Resnet V2 trained model, most classes from the train dataset are classified in the (95%, 100%] interval, while most classes from the test dataset fall in the (90%, 95%] range. The test dataset is classified well, with some inaccuracies remaining. The classes from the train and test datasets for the NASNet-A model have a wider spread, with a majority of the classes falling in the (50%, 80%] range. The large accuracy difference between the two models understandably correlates with the size and complexity of the architectures.
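The per-class analysis can be reproduced from a row-normalized confusion matrix, as in the following sketch; the label arrays below are placeholders standing in for the test-set predictions.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Placeholders for the true and predicted breed indices on the test set;
    # in practice they come from evaluating the trained CNN.
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 0])

    # Row-normalizing the confusion matrix gives per-class accuracies;
    # binning these values produces a histogram like the one in Figure 4.
    cm = confusion_matrix(y_true, y_pred)
    cm_normalized = cm / cm.sum(axis=1, keepdims=True)
    per_class_accuracy = np.diag(cm_normalized)

    counts, bin_edges = np.histogram(per_class_accuracy,
                                     bins=np.arange(0.0, 1.05, 0.05))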
The mentioned accuracies using the NASNet-A architecture are achieved with the Nesterov momentum optimizer. For comparison, an alternative optimizer, RMSprop, is also tested; the results are similar (see Table II).
A comparison to related work on the Stanford Dogs dataset is given in Table III.
After the evaluation of each trained convolutional neural network, a Grad-CAM [27] heat map visualization is made for the NASNet-A mobile trained model (see Figure 5). The last convolutional block “pays attention” mostly to the heads of the dogs in an image.
Examining the heat map images, the NASNet-A model mostly focuses on the heads of the dogs. In the second image, which shows a German Shepherd in a different position, the CNN also pays attention to the body. If an image contains more than one dog, the network is interested in all of the recognized dogs to varying degrees, and evaluates the image taking each appearing breed into consideration. For an accurate classification, the network must therefore evaluate regions covering different parts of the dog, including the head. The displayed images are evaluated correctly by the NASNet-A mobile trained model.
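A minimal tf.keras sketch of the Grad-CAM computation is given below; the original work applies it to the trained NASNet-A graph, so the Keras-style model and the layer-name argument are assumptions for illustration.

    import numpy as np
    import tensorflow as tf

    def grad_cam(model, image, last_conv_layer_name, class_index=None):
        # Model mapping the input image to the last convolutional block's
        # activations and to the final class predictions.
        grad_model = tf.keras.models.Model(
            inputs=model.inputs,
            outputs=[model.get_layer(last_conv_layer_name).output, model.output])

        with tf.GradientTape() as tape:
            conv_output, predictions = grad_model(image[np.newaxis, ...])
            if class_index is None:
                class_index = int(tf.argmax(predictions[0]))
            class_score = predictions[:, class_index]

        # Per-channel weights: gradients of the class score with respect to
        # the feature maps, averaged over the spatial dimensions.
        grads = tape.gradient(class_score, conv_output)
        weights = tf.reduce_mean(grads, axis=(0, 1, 2))

        # Weighted sum of the feature maps, rectified and normalized to [0, 1];
        # upsampled and overlaid on the input image, this is the heat map.
        cam = tf.nn.relu(tf.reduce_sum(conv_output[0] * weights, axis=-1))
        cam = cam / (tf.reduce_max(cam) + 1e-8)
        return cam.numpy()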
Using other data augmentations, such as random image rotation between 0 and 15 degrees, random zooming or an 87.5% central crop, does not improve the accuracy of the trained models.
The alternative hyperparameters presented in Section IV-C (fixed vs. adaptive learning rates, weight decay and step counts) do not improve the accuracy or other valuable metrics after training the CNNs.
Fig. 5. Grad-CAM visualization with heat maps for the last layers of the NASNet-A mobile trained model. The last convolutional block “pays attention” to the warm-colored parts of the images (mostly the heads of the dogs), while the cold colors represent the parts of the image of less interest. These pictures are not part of the Stanford Dogs dataset.
VI. THE SOFTWARE SYSTEM
The usage of the trained convolutional neural networks is presented through a software system called Sniff!. Its associated mobile application gives users the opportunity to take a photo or select an existing one from the gallery; the application not only classifies the image, but also displays detailed information about each evaluated breed. The displayed data serves educational and informative purposes.
The software system consists of two components: a central server written in the Go programming language, and a mobile client realized in React Native. The components communicate via HTTP requests/responses.
The server contains a classifying module using the TensorFlow Go library; it loads the trained convolutional neural network, then preprocesses and evaluates images. The Inception Resnet V2 and NASNet-A models are both runnable on the server, since a desktop machine can make use of more CPU/GPU resources for faster evaluation. By default, the Resnet model is used, since it reaches higher accuracies.
The evaluation of an image returns a score for every class, with most classes receiving a score of 0%. To avoid incorrect classifications with low confidence, a threshold is set on the server and also on the mobile client.
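The thresholding logic is sketched below; the actual modules are written in Go (server) and React Native (client), and the threshold value shown here is illustrative.

    # Keep only breeds whose softmax probability exceeds a threshold,
    # ordered from most to least likely.
    CONFIDENCE_THRESHOLD = 0.05

    def filter_predictions(probabilities, breed_names,
                           threshold=CONFIDENCE_THRESHOLD):
        kept = [(name, float(p))
                for name, p in zip(breed_names, probabilities)
                if p >= threshold]
        return sorted(kept, key=lambda item: item[1], reverse=True)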
The mobile client can take an image with the camera of the smartphone or import one from its gallery, and submit it for classification. The process can happen online, using the central server for a faster classification, or offline, using the phone's resources in case of a lacking network connection. Offline evaluation is facilitated by the React Native TensorFlow [28] wrapper library. Figure 6 shows the usage of the Sniff! application.
The app uses the NASNet-A mobile trained model, which is loaded into the memory of the device for every evaluation; this depends on the resources of the phone. The memory is freed after the evaluation.
The app displays detailed information about the detected breeds, with data web-scraped from A-Z Animals² and dogtime.com³.
² Source: https://a-z-animals.com/
³ Source: http://dogtime.com/
(a) Camera (b) Crop (c) Result
Fig. 6. Main components in the Sniff! application.
VII. CONCLUSIONS AND FUTURE WORK
Two different convolutional neural network architectures
have been presented: the NASNet-A mobile architecture and
the Inception Resnet V2 deep architecture.
The architectures have been tested on a niche image
classification problem: that of recognizing dog breeds. The
pre-trained networks are fine-tuned using the Stanford Dogs
dataset.
Results are promising even for the smaller, mobile-friendly CNN, which reaches an accuracy only 10% lower than the deep Inception Resnet V2 model.
We have also presented a usage of the fine-tuned convolutional neural networks through a software system called Sniff!: a mobile application which can determine the breed of a dog from an image (even without an Internet connection).
The convolutional neural networks can be further developed by: using Generative Adversarial Nets (GAN) [4] to extend the training dataset, using other loss functions such as center loss [29], training other convolutional neural network architectures, expanding the dataset with other popular dog breeds, using detectors for locating multiple dogs in an image, and optimizing the server and mobile classification.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” in Advances in Neural
Information Processing Systems 25. Curran Associates, Inc., 2012,
pp. 1097–1105.
[2] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards
real-time object detection with region proposal networks,” CoRR, vol.
abs/1506.01497, 2015.
[3] O. Abdel-Hamid, A. r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct 2014.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems, 2014, pp. 2672–
2680.
[5] T. Lindeberg, “Scale Invariant Feature Transform,” Scholarpedia, vol. 7,
no. 5, p. 10491, 2012, revision #153939.
[6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp.
886–893.
[7] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset
for fine-grained image categorization,” in First Workshop on Fine-
Grained Visual Categorization, IEEE Conference on Computer Vision
and Pattern Recognition, Colorado Springs, CO, June 2011.
[8] R. J. Schalkoff, Artificial neural networks. McGraw-Hill New York,
1997, vol. 1.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks.”
[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[11] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-
resnet and the impact of residual connections on learning,” CoRR, vol.
abs/1602.07261, 2016.
[12] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning trans-
ferable architectures for scalable image recognition,” CoRR, vol.
abs/1707.07012, 2017.
[13] P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained catego-
rization,” CoRR, vol. abs/1412.7054, 2014.
[14] M. Simon and E. Rodner, “Neural activation constellations: Unsuper-
vised part model discovery with convolutional networks,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
1143–1151.
[15] X. Liu, T. Xia, J. Wang, and Y. Lin, “Fully convolutional attention
localization networks: Efficient attention localization for fine-grained
recognition,” CoRR, vol. abs/1603.06765, 2016.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural
networks for mobile vision applications,” CoRR, vol. abs/1704.04861,
2017.
[17] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns
for fine-grained category detection,” in Computer Vision – ECCV 2014,
D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 834–849.
[18] K. Duan, D. Parikh, D. Crandall, and K. Grauman, “Discovering local-
ized attributes for fine-grained recognition,” in 2012 IEEE Conference
on Computer Vision and Pattern Recognition, June 2012, pp. 3474–3481.
[19] A. Angelova and S. Zhu, “Efficient object detection and segmentation for
fine-grained recognition,” in Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 811–818.
[20] G. Chen, J. Yang, H. Jin, E. Shechtman, J. Brandt, and T. X. Han,
“Selective pooling vector for fine-grained recognition,” in 2015 IEEE
Winter Conference on Applications of Computer Vision, Jan 2015, pp.
860–867.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[22] Transfer learning and the art of using pre-trained models in deep learning. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/
[23] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement
learning,” CoRR, vol. abs/1611.01578, 2016.
[24] Google automl. [Online]. Available: https://research.googleblog.com/2017/05/using-machine-learning-to-explore.html
[25] S. Ruder, “An overview of gradient descent optimization algorithms,” CoRR, vol. abs/1609.04747, 2016.
[26] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin,
and L. Fei-Fei, “The unreasonable effectiveness of noisy data for fine-
grained recognition,” in Computer Vision – ECCV 2016, B. Leibe,
J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International
Publishing, 2016, pp. 301–320.
[27] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh,
and D. Batra, “Grad-cam: Why did you say that? visual explana-
tions from deep networks via gradient-based localization,” CoRR, vol.
abs/1610.02391, 2016.
[28] React native tensorflow. [Online]. Available: https://github.com/reneweb/react-native-tensorflow
[29] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature
learning approach for deep face recognition,” in European Conference
on Computer Vision. Springer, 2016, pp. 499–515.
Article
Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.84, which is only 0.1 percent worse and 1.2x faster than the current state-of-the-art model. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art.