Deep Learning for Practical Image Recognition: Case Study on
Kaggle Competitions
Xulei Yang*
Institute for Infocomm Research
yang_xulei@i2r.a-star.edu.sg
Zeng Zeng
Institute for Infocomm Research
zengz@i2r.a-star.edu.sg
Sin G. Teo
Institute for Infocomm Research
teosg@i2r.a-star.edu.sg
Li Wang
Institute for Infocomm Research
wang_li@i2r.a-star.edu.sg
Vijay Chandrasekhar
Institute for Infocomm Research
vijay@i2r.a-star.edu.sg
Steven Hoi
Singapore Management University
chhoi@smu.edu.sg
ABSTRACT
In past years, deep convolutional neural networks (DCNNs) have achieved great success in image classification and object detection, as demonstrated on ImageNet in the academic field. However, some unique practical challenges remain for real-world image recognition applications, e.g., small object sizes, imbalanced data distributions, and limited labeled data samples. In this work, we make an effort to deal with these challenges through a computational framework that incorporates the latest developments in deep learning. By means of a two-stage detection scheme, pseudo labeling, data augmentation, cross-validation and ensemble learning, the proposed framework aims to achieve better performance on practical image recognition applications than standard deep learning methods. The proposed framework has recently been deployed as the key kernel for several image recognition competitions organized by Kaggle. The performance is promising: our final private scores ranked 4th out of 2,293 teams for fish recognition in the challenge "The Nature Conservancy Fisheries Monitoring" and 3rd out of 834 teams for cervix recognition in the challenge "Intel & MobileODT Cervical Cancer Screening", among several others. We believe that by sharing the solutions, we can further promote the application of deep learning techniques.
KEYWORDS
Image recognition, deep learning, object detection, and image classification
ACM Reference Format:
Xulei Yang*, Zeng Zeng, Sin G. Teo, Li Wang, Vijay Chandrasekhar, and Steven Hoi. 2018. Deep Learning for Practical Image Recognition: Case Study on Kaggle Competitions. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3219819.3219907
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD '18, August 19–23, 2018, London, United Kingdom
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5552-0/18/08. . . $15.00
https://doi.org/10.1145/3219819.3219907
1 INTRODUCTION
Deep Neural Networks (DNNs) have emerged as powerful machine learning models that exhibit major differences from traditional approaches for image classification [13, 23]. DNNs with deep architectures have the capacity to learn complex models and allow for learning powerful object representations without the need for hand-designed features. This has been empirically demonstrated on the ImageNet classification task across thousands of classes.
Compared to image classification, object detection is a more challenging task that requires more sophisticated methods to solve. In this context, we focus not only on classifying images, but also on precisely estimating the classes and locations of objects within the images. Object detection is one of the hardest problems in computer vision and data engineering. Owing to improvements in object representations and machine learning methods, many major advances in object detection have been achieved, such as Faster R-CNN [20], which has achieved excellent object detection accuracy by using DNNs to classify object proposals.
In real-world applications, it may be required to classify a given image based on the object(s) contained within that image. For example, we may want to classify an image into different categories based on the fish types within the image, as shown in Fig. 1. There are some unique practical challenges for this kind of image recognition problem. Firstly, the objects are very small compared to the background. Standard CNN-based methods like ResNet [7] and Faster R-CNN [20] may learn the features of the boats (background) but not of the fish (objects). Therefore, they will fail when presented with images containing new boats. Secondly, imbalanced datasets exist widely in the real world, and they pose great challenges for classification tasks. CNN models might be biased towards majority classes with many training samples, and might have trouble classifying classes with very few training samples. Thirdly, in real-world applications, acquiring data is expensive and involves a time-consuming and labour-intensive process; e.g., ground truth has to be labeled and confirmed by multiple domain experts. How to achieve good performance with a very limited training dataset remains a big challenge in both academia and industry.
In this paper, we address the aforementioned challenges by presenting a computational framework based on the latest developments in deep learning. Firstly, we propose a two-stage detection scheme to handle small object recognition. The framework combines the advantages of both object detection and image classification methods. We first use a state-of-the-art object detection method like Faster R-CNN [20] to identify the location of the object within the full image. Based on this localization, a small Region of Interest (ROI) is cropped from the image. Finally, deep convolutional networks, like ResNet [7], are employed to perform classification on the cropped image. Secondly, data augmentation and cross-validation are used to deal with the imbalanced dataset problem. We oversample the rare samples by horizontal flipping, slight shifting and rotation, as well as adding color variance. We also use stratified K-fold splitting to divide the training set into different folds for cross-validation so that the model is able to "see" enough of the rare classes in each fold. In this way, the negative effect of data imbalance is reduced to a minimum. Lastly, we use the pseudo-labeling approach to utilize the unlabeled data in the testing dataset. We first train a model on the training dataset. The model is then used to perform prediction on the testing dataset. After that, the most confident samples from the testing dataset are added into the training dataset. We then train a new model with the larger training dataset. This scheme can be iterated to generate more training samples.
Figure 1: Example of image recognition based on the object types within the image. The object in this example is the fish indicated by the red rectangle.
The main contributions of this paper are threefold. First, we propose a deep learning framework to handle several unique challenges of practical image recognition applications, e.g., small object sizes, imbalanced data distributions, and limited labeled training samples. Second, the proposed framework has been deployed for several image recognition competitions organized by Kaggle. The performance is promising, as our final scores ranked in the top 1% of the private leaderboard for all the competitions. Lastly, we publicly share the source code of the implementation of our case studies for fish recognition on the Kaggle challenge "The Nature Conservancy Fisheries Monitoring" [3], as well as cervix recognition on the Kaggle challenge "Intel & MobileODT Cervical Cancer Screening" [2]. Interested readers may re-use the source code for their own image recognition applications. The rest of this paper is organized as follows: Section 2 reviews related work on object detection and image classification methods. Our proposed framework for image recognition is presented in Section 3. Case studies of our proposed framework on two Kaggle image recognition tasks are illustrated in Section 4. Lastly, the conclusion is given in Section 5.
2 RELATED WORK
2.1 Image Classification Methods
In recent years, deep convolutional neural network (DCNN) architectures, specifically VGG [24], Inception [26], and ResNet [7], have been widely used for image classification tasks and have seen great success in many computer vision challenges like ImageNet. More recently published deep neural networks include DenseNet [9] and Dual Path Networks [1]. Interested readers may refer to [17] for a comprehensive review of DCNNs for image classification.
VGG. The VGG network architecture was introduced by Simonyan and Zisserman [24]. This network is characterized by its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. The reduction of volume size is handled by max pooling. Two fully-connected layers, each with 4,096 nodes, are then followed by a softmax classifier. Due to its depth and number of fully-connected nodes, VGG is over 533 MB for VGG16 and 574 MB for VGG19. This makes deploying VGG a tiresome task. However, VGG is still popularly used in many deep learning image classification problems due to its simplicity.
Inception. The "Inception" micro-architecture was first introduced by Szegedy et al. [26]. The goal of the inception module is to act as a "multi-level feature extractor" by computing 1×1, 3×3, and 5×5 convolutions within the same module of the network. The outputs of these filters are then stacked along the channel dimension before being fed into the next layer in the network. The original version of this architecture was named GoogLeNet, but after subsequent improvements, it is simply called Inception VN, where N refers to the version number put out by Google, such as Inception V3 [28] and Inception V4 [25].
ResNet. First introduced by He et al. [7], the ResNet architecture has become a seminal work, demonstrating that extremely deep networks can be trained using standard stochastic gradient descent (SGD) and a reasonable initialization function, through the use of residual modules. Further accuracy improvement can be obtained by updating the residual module to use identity mappings, as demonstrated in their follow-up publication [8]. Even though ResNet is much deeper than VGG, the model size is actually substantially smaller due to the use of global average pooling rather than fully-connected layers: this reduces the model size to 102 MB for ResNet50.
2.2 Object Detection Methods
Signicant progress has been achieved in recent years on object
detection due to the development of convolutional neural networks
(CNNs). Among them, Faster R-CNN [
20
], YOLO (You Only Look
Once) [
18
], and SSD (Single Shot Multibox Detector) [
15
] are the
most well-known. Other modern convolutional object detectors
include R-FCN (region-based fully convolutional networks) [
4
] and
multibox [
27
]. The interested readers may refer to [
10
] for the
speed/accuracy trade-os for various object detectors.
Faster R-CNN. Faster R-CNN is an improved version of Region-based CNN (R-CNN) [22] and its descendant Fast R-CNN [21]. The basic idea is to break down object detection into two separate stages. In the first stage, regions within the image that are likely to contain ROIs are identified. In the second stage, a convolutional neural network runs on each proposed region and outputs the object category score and the corresponding bounding box coordinates that may contain objects. Faster R-CNN is the state-of-the-art model with the best mAP scores on the VOC and COCO benchmark datasets. However, with a framerate of 7 fps, Faster R-CNN is the slowest among state-of-the-art models such as YOLO and SSD. The latest descendant in this family is called Mask R-CNN [12], which extends such object detection techniques to provide pixel-level segmentation.
YOLO. YOLO first partitions the raw input image into N×N square regions. It then fits a convolutional neural network directly on the input image and outputs M sets of object confidence scores and bounding box coordinates, with M depending on N. The entire model is trained end-to-end. Since YOLO takes the raw image as input and outputs confidence scores and bounding box coordinates directly in a single pass, it is considerably faster than Faster R-CNN, which requires multiple stages for training and inference. However, it is significantly less accurate than Faster R-CNN, and is particularly poor at localizing small objects. YOLO V2 [19] has made several small but important changes inspired by Faster R-CNN to improve detection accuracy.
SSD. SSD is another state-of-the-art object detection model that is known to have a good trade-off between speed and accuracy. Similar to YOLO, it requires only a single step for training and inference, and the entire model is trained end-to-end. The major contribution of SSD compared to other models is that it makes use of feature maps at different scales to make predictions. In contrast, Faster R-CNN and YOLO base their predictions only on a single set of feature maps located at some selected intermediate level (e.g., "conv5") of the convolutional layer hierarchy. Theoretically speaking, using feature maps of varying scales can lead to better detection of objects with different sizes. Similar to YOLO, SSD also suffers from the small-object detection problem. Feature-Fused SSD [6] is one of the latest research works to improve the performance of SSD on small object localization.
3 THE PROPOSED FRAMEWORK
Many real-world applications, e.g., the two Kaggle image recognition competitions studied in Section 4, involve the task of classifying a given image based on the objects contained within that image. The deep learning methods discussed in the last section, whether for object detection or image classification, can be directly applied to this task. Their performance, however, may not meet the desired requirements, due to some unique challenges of practical image recognition as discussed in Section 1, i.e., small object sizes, imbalanced data distributions, and limited labeled data samples. In this section, we propose a computational framework, which consists of a two-stage detection scheme, pseudo labeling, data augmentation, cross-validation and ensemble learning, to tackle these challenging issues.
Overview. The overall workflow of the proposed framework is illustrated in Fig. 2. It consists of two major successive tasks, i.e., object localization and ROI classification, followed by a stacking process to ensemble the results from various models. Fig. 3 shows an example that uses our proposed framework to perform fish recognition on a fishing boat image. The detailed diagram of the framework is shown in Fig. 4. Fig. 4-a shows the diagram of the object localization task, where various object detection models are trained on the original input images to detect the objects within the image. Each model outputs one bounding box with the maximum probability value. Simple averaging or majority voting schemes are then used to select and crop the overlapping region as the final ROI image. In the case where the object detection models cannot find any object within the image, the image is labeled using classification models trained directly on full images. Similarly, Fig. 4-b displays the diagram of the ROI classification task, where various image classification models are trained on the ROI images cropped during object localization. K-fold cross-validation is applied here to train each single model, and the probability vectors of the training set and testing set from each model are then fed into the stacking block for level-2 ensembling. In the stacking process in Fig. 4-c, we use hill-climbing or simple averaging methods to determine the linear combination weights, then apply the weights to the testing results to get the final image recognition result.
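To make the stacking step concrete, below is a minimal sketch of greedy hill-climbing over the linear combination weights, written against a held-out validation fold. The names (`val_preds`, `val_labels`, `hill_climb_weights`) are our own illustrative assumptions, not identifiers from the paper's released code.

```python
import numpy as np
from sklearn.metrics import log_loss

def hill_climb_weights(val_preds, val_labels, n_rounds=100):
    """Greedy hill-climbing: repeatedly add (with replacement) the model
    whose inclusion most reduces the validation log loss.

    val_preds: list of (n_samples, n_classes) probability arrays, one per model.
    val_labels: (n_samples,) integer class labels.
    Returns normalized linear combination weights over the models.
    """
    stacked = np.stack(val_preds)      # (n_models, n_samples, n_classes)
    counts = np.zeros(len(val_preds))
    best_loss = np.inf
    for _ in range(n_rounds):
        trial_losses = []
        for k in range(len(val_preds)):
            trial = counts.copy()
            trial[k] += 1
            blend = np.tensordot(trial / trial.sum(), stacked, axes=1)
            trial_losses.append(log_loss(val_labels, blend))
        k_best = int(np.argmin(trial_losses))
        if trial_losses[k_best] >= best_loss:
            break                      # no single model improves the blend
        best_loss = trial_losses[k_best]
        counts[k_best] += 1
    return counts / counts.sum()
```

The resulting weights are then applied to the models' test-set probability vectors to produce the final prediction.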
Two-stage Detection Scheme for Small Objects. Real-world image recognition applications, like the case demonstrated in Fig. 3, may involve recognizing small objects against a large background. On one hand, image classification methods like ResNet [7] take the whole image as input and extract different layers of features to classify the image. However, such a method will learn the features of the boats but not of the fish. Therefore, it will fail when presented with images containing new boats. On the other hand, object detection methods like Faster R-CNN [20] will locate and classify the fish within the image. However, such a method may misclassify the fish types due to the small size of the objects in low-resolution images.

In our proposed computational framework, as shown in Fig. 2 and Fig. 4, we deploy a two-stage detection scheme to cater for small object recognition. The scheme combines the advantages of both object detection methods (Fig. 4-a) and image classification methods (Fig. 4-b). The basic idea is to develop an automated way to detect and crop out the objects of interest from the images, then apply image augmentation to the cropped regions, and lastly perform image classification on the augmented ROIs. In this way, the proposed framework focuses more on the object itself rather than the background, and thus performs much better than a standard one-stage deep learning approach for small object recognition within a given large image.
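As a schematic illustration of this two-stage scheme, the sketch below wires together detection, ROI cropping, the full-image fallback, and ROI classification. All model objects and their method names (`predict_best_box`, `predict`) are hypothetical stand-ins, and the box-averaging rule is a simplification of the selection step in Fig. 4-a.

```python
import numpy as np

def recognize(image, detectors, classifiers, full_image_classifier):
    """Two-stage recognition: detect the object, crop the ROI, classify the crop."""
    # Stage 1: each detector proposes its highest-probability box, or None.
    boxes = [d.predict_best_box(image) for d in detectors]  # (x0, y0, x1, y1, score)
    boxes = [b for b in boxes if b is not None]
    if len(boxes) <= len(detectors) / 2:
        # Majority of detectors found nothing: fall back to the classifier
        # trained directly on full images.
        return full_image_classifier.predict(image)
    # Average the agreeing boxes and crop the ROI.
    x0, y0, x1, y1 = np.mean([b[:4] for b in boxes], axis=0).astype(int)
    roi = image[y0:y1, x0:x1]
    # Stage 2: ensemble the ROI classifiers by simple probability averaging.
    return np.mean([c.predict(roi) for c in classifiers], axis=0)
```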
Data Augmentation & Stratified K-fold for Imbalanced Data. Many real-world image recognition datasets are imbalanced, which poses a great challenge to machine learning, especially for classification tasks. Both Kaggle image recognition competitions studied in Section 4 provide imbalanced datasets. For example, in the Nature Conservancy fisheries monitoring competition, the training dataset provides 1,719 ALB fish but only 67 LAG fish. Such an imbalanced dataset can still be used to train a model; however, the model can be biased and inaccurate with an imbalanced class distribution. In other words, the model may tend to misclassify the rare classes with fewer samples into the majority classes with many samples.
Figure 2: Overall diagram of the proposed image recognition framework
Figure 3: A sample demonstration using the proposed framework to perform sh classication
Figure 4: The three main steps of the proposed framework for image recognition based on Deep Learning Techniques
To solve this problem, we deploy data augmentation and cross-validation, which can help to increase the accuracy of the classifier. Our method basically consists of i) oversampling the rare samples by horizontal flipping, slight shifting and rotation, and ii) oversampling the rare samples by changing the color channel variance. We also split the training dataset into different folds during cross-validation so that each fold contains enough samples of the classes that are under-represented in the dataset (i.e., stratified K-fold). In this way, our method reduces the negative effect of data imbalance to a minimum.
Pseudo Labeling & Data Augmentation for Limited Training Samples. In many machine learning applications, acquiring data is expensive and involves a time-consuming and labour-intensive process; e.g., ground truth has to be manually labeled and then confirmed by multiple domain experts. Using limited data to train a good machine learning model remains an open and challenging problem. In both Kaggle image recognition competitions studied in Section 4, the number of training samples is only one-fourth to one-third of the number of testing samples, e.g., 3,792 versus 13,153 in the Nature Conservancy fisheries monitoring competition. So a question arises very naturally: can we leverage unlabeled data to further improve the performance of machine learning models? We use the pseudo-labeling approach to utilize the unlabeled data in the testing dataset to improve the performance of deep neural models. We first train a model on the original training dataset, then perform prediction on the testing dataset to select the most confident samples and add them to the original training dataset. We then train a new model with the enlarged training set. The experimental results demonstrate that the pseudo-labeling approach can slightly improve the overall performance of our proposed framework.
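The loop itself can be sketched as below. Here `train_model` is an assumed user-supplied routine that fits and returns a model exposing `predict_proba`, and the 0.95 confidence threshold is illustrative rather than the value used in the competitions.

```python
import numpy as np

def pseudo_label(X_train, y_train, X_test, train_model, n_iters=2, thresh=0.95):
    """Iteratively grow the training set with confident test predictions."""
    for _ in range(n_iters):
        model = train_model(X_train, y_train)
        probs = model.predict_proba(X_test)   # (n_test, n_classes)
        keep = probs.max(axis=1) >= thresh    # most confident samples
        if not keep.any():
            break
        X_train = np.concatenate([X_train, X_test[keep]])
        y_train = np.concatenate([y_train, probs[keep].argmax(axis=1)])
        X_test = X_test[~keep]                # shrink the unlabeled pool
    return train_model(X_train, y_train)
```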
Dealing with Overfitting Issues. The goal of a deep learning model is to generalize from the training data to any data from the problem domain, which allows the model to make predictions on unseen data in the future. In both Kaggle image recognition competitions studied in Section 4, the training dataset and public testing set are much smaller than the private testing dataset. As a result, a model might overfit the training set and public testing set, and consequently perform quite badly on the private testing set even though it achieves good results on the public testing set.

We first deploy a k-fold (k=10 in fish recognition and k=5 in cervix recognition) cross-validation approach to minimize the overfitting problem in both competitions. The output of the model is the average of the k trained sub-models. This model generalization approach increased our competition score significantly on both leaderboards.
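At inference time this averaging is a one-liner, sketched below under the assumption that each of the k fold-wise sub-models exposes a `predict_proba` method.

```python
import numpy as np

def predict_kfold_average(fold_models, X_test):
    """Final output = mean class probabilities over the k trained sub-models."""
    return np.mean([m.predict_proba(X_test) for m in fold_models], axis=0)
```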
We also perform data augmentation to relieve the overfitting problem. To build a good image classifier using limited training data, image augmentation is a method that can help to boost the performance of deep neural networks. We use a combination of multiple processing steps, e.g., random rotations, shifts, shears, scaling and flips, to create artificial images. We also implemented mean-variance normalization, color space transformation and elastic transformation to enhance the image augmentation. Our experimental results show that data augmentation is one of the important methods for achieving good performance in many image classifiers. Different domain applications with limited training data can adopt different data augmentation strategies to build good image classifiers.
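One possible configuration of the Keras ImageDataGenerator covering these transforms is shown below; the parameter values are illustrative assumptions, not the settings used in the competitions (elastic transformation is not built in and would need a custom preprocessing function).

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,                   # random rotations
    width_shift_range=0.1,               # slight horizontal shifts
    height_shift_range=0.1,              # slight vertical shifts
    shear_range=0.1,                     # shear
    zoom_range=0.1,                      # scale
    horizontal_flip=True,                # flips
    channel_shift_range=20.0,            # color channel variance
    samplewise_center=True,              # mean normalization
    samplewise_std_normalization=True,   # variance normalization
)
# model.fit_generator(datagen.flow(X_train, y_train, batch_size=32), ...)
```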
4 CASE STUDIES
Our proposed image recognition framework has been trained and evaluated using image sets from various image recognition applications. Specifically, in this section, we use two case studies on Kaggle image recognition competitions: "The Nature Conservancy Fisheries Monitoring" [3] and "Intel & MobileODT Cervical Cancer Screening" [2]. Both challenges are two-stage competitions. In the first phase, participants train models and submit to a small/temporary leaderboard. In the second phase, they use the same models developed in the first phase to predict on an unseen test set. The spirit of having a second stage is to prevent hand labeling and leaderboard probing of the test data. To achieve this, participants must upload their source code or models with fixed parameters before the second-stage dataset is released. These "codes" or "models" are examined by Kaggle and the competition host to determine eligibility to win the competition.
4.1 Problem Description
The Nature Conservancy Fisheries Monitoring.
Nearly half of the world's population depends on seafood as its main source of protein. The Nature Conservancy reports that 60% of the world's tuna is caught in the Western and Central Pacific. The fishing activity includes many illegal, unreported, and unregulated fishing practices that affect the balance of the marine ecosystem, global seafood supplies, and local livelihoods to some extent.

Many existing electronic fisheries monitoring systems work well and can be deployed easily. However, they generate monitoring data that are expensive to process manually in terms of time and cost. Therefore, the Conservancy seeks to use cameras to dramatically scale up fishing monitoring activities.

In this competition, the task is to develop an algorithm to automatically detect and classify species of the catch from fishing boats, so as to shorten the video review process (Fig. 5).
Figure 5: Image samples for sh recognition competition.
Picture is taken from [3]
Faster review with reliable data brings two major advantages: it becomes easier to i) reallocate human capital to management and ii) enforce fishing regulations. Both have a positive impact on the environment of our planet.
The Conservancy provides three datasets: a training dataset containing 3,792 images, a stage-1 testing dataset containing 1,000 images, and a stage-2 testing dataset containing 12,153 images. The images are taken from fixed cameras mounted on fishing boats. The goal is to develop a model that can detect and classify species of fish into 8 different classes: i) Albacore tuna, ii) Bigeye tuna, iii) Yellowfin tuna, iv) Mahi Mahi, v) Opah, vi) Sharks, vii) Other (fish that cannot be categorized into the above six classes), and viii) No Fish. For illustration, Fig. 6 shows pictures of fish from the first six types. We assume that each image belongs to only one of the 8 classes. However, some given images show more than one class of fish, i.e., fish within one of the above classes together with some other small fish. The small fish are used as bait and are not counted in this competition.
Intel & MobileODT Cervical Cancer Screening.
Cervical cancer can be prevented if its pre-cancerous stage is identified and effective, life-saving treatment is carried out. However, one of the main challenges in cervical cancer treatment is to find and determine the appropriate method due to varying physiological differences among women. In rural areas, women who suffer from cervical cancer often cannot receive appropriate treatment. Even worse, many of them receive wrong treatments that can result in high costs and risk their lives.
Figure 6: Illustration of six sh types. Picture is taken from
[3]
Intel & MobileODT are working together to improve the existing Quality Assurance workflow to help rural healthcare providers make better treatment decisions for cervical cancer. One improvement of the workflow is to allow real-time determination of the cancer treatment based on a woman's cervix type (Fig. 7). In this competition, the task is to develop an algorithm that can correctly identify a woman's cervix type from the given images. Such an algorithm can help to reduce cancer mistreatment and allow healthcare providers to refer cases that need further advanced treatment.
Figure 7: Image samples for cervix recognition competition.
Picture taken from [11]
Intel & MobileODT provide three datasets in this competition: a training dataset containing 1,466 images, a stage-1 testing dataset containing 512 images, and a stage-2 testing dataset containing 3,506 images. The images are taken under various illumination, optical filtering, magnification, etc. The target is to classify a given image into three cervix categories: type-1, type-2, and type-3. Most cervical cancers begin in the cells of the transformation zone. Different transformation zone locations can be used to determine different types of cervical cancer, as shown in Fig. 8. Normally, cervix types 2 and 3 may include hidden lesions that require different treatments.
Figure 8: Illustration of three cervix types. Picture is taken
from [2]
4.2 Our Solutions
The Nature Conservancy Fisheries Monitoring. Basically, the solution can be divided into three stages as follows: i) the first is to carry out classification with VGG16 and ResNet50 directly on full images, ii) the second is to detect and crop fish with Faster R-CNN and SSD, and iii) the last stage is to apply ResNet50 to the detected fish regions of interest (ROIs). The final output is the ensemble of the results from both stage-1 and stage-3. We detail each stage in the following.

Stage-1: Classification on original fish images. This stage can be divided into 5 steps, which we run as follows.

Step 1. Preprocessing - we apply different data augmentations by means of the Keras ImageDataGenerator, deal with the training sample imbalance by horizontally flipping the images of the fish types with fewer samples, and thereby increase the sample number for the under-represented fish types. The original images are resized to 270×480 (the height/width ratio is set to 9:16). For images whose ratio deviates from 9:16, we add artificial borders so that the ratio is close to 9:16; distorting the original relative scale should be avoided, since it does not help fish detection and classification (a padding sketch is given after this stage's steps).

Step 2. Model training - we train 10 VGG16 and 10 ResNet50 models on the original training dataset and then test them on the public testing dataset. After that, we ensemble the results from VGG16 and ResNet50 respectively. The best loss result of 0.9x from VGG16 is slightly worse than the 0.8x of ResNet50 on the public testing dataset, so we choose ResNet50 in the subsequent steps.

Step 3. Data filtering - we then proceed to clean the training dataset by manually checking for and removing mislabeled samples. As a result, the number of original training samples is reduced from 3,792 to 3,702. We further remove 32 uncertain samples from the 3,702 that are grouped into the "Other" class. At the end of Step 3, we generate a new training dataset that contains 3,670 samples.

Step 4. Re-training - we train 10 ResNet50 models on the new training dataset with the 3,670 samples from Step 3, test them on the public testing dataset, ensemble all the results from the models, and obtain the best loss result of 0.7x. We denote the best result from this step as Rst1.

Step 5. Moving to stage-2 - we run many trials in this step. At the end of the trials, the best result is still around 0.7x. Therefore, we move to the next stage, fish detection.
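The padding sketch referenced in Step 1: pad to the 9:16 aspect ratio with black borders before resizing, so the relative scale of the fish is preserved. The use of OpenCV here is an assumption; the paper does not name the image library.

```python
import cv2

def pad_and_resize(image, out_h=270, out_w=480):
    """Pad with black borders to the target aspect ratio, then resize."""
    h, w = image.shape[:2]
    target = out_w / out_h                    # 16:9 width/height ratio
    if w / h < target:                        # too narrow: pad the width
        pad = int(h * target) - w
        image = cv2.copyMakeBorder(image, 0, 0, pad // 2, pad - pad // 2,
                                   cv2.BORDER_CONSTANT, value=0)
    elif w / h > target:                      # too wide: pad the height
        pad = int(w / target) - h
        image = cv2.copyMakeBorder(image, pad // 2, pad - pad // 2, 0, 0,
                                   cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(image, (out_w, out_h))
```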
Stage-2: Fish detection using Faster R-CNN and SSD. This stage can be divided into 5 steps, which we run as follows.

Step 1. Generating label text files - we borrow the annotations from the Kaggle forum discussion to generate the bounding boxes required by Faster R-CNN and SSD for object detection.

Step 2. First object detection models - we train various Faster R-CNN and SSD models using the annotated fish dataset. Our goal in Step 2 is to perform fish detection only. Therefore, we configure only one class (fish) and use the pre-trained VGG16 ImageNet model as the base network to train the models.

Step 3. Cropping fish regions of interest (ROIs) - we use the trained object detection models for fish detection on the public testing image set, and then crop fish ROIs using the bounding box with the highest probability value. Out of the 1,000 given images, a total of 789 fish ROIs are cropped.

Step 4. Second object detection models - we use another set of annotations (we manually generate segmentation masks for each of the fish in the original fish training dataset). In a similar way, we train another set of Faster R-CNN and SSD models, and then crop another set of 804 fish ROIs.

Step 5. Final output - we run the two sets of trained object detection models on the same testing image set using the following condition: if the majority of the models detect fish in the given image, we select the bounding box with the highest probability value and output the fish ROI. At the end of Step 5, the final number of output ROIs is about 746.
Stage-3: Classification on cropped fish ROIs. This stage can be divided into 3 steps, which we run as follows.

Step 1. Training - we train 10 ResNet50 models on the annotated fish ROIs, using the same data augmentation and training strategy as in stage-1.

Step 2. Testing - we test the trained models on the fish ROIs cropped from the public testing dataset, then ensemble and output the classification probability results, denoted as Rst2.

Step 3. Merging - for a given image without any fish detection, we use the result from Rst1. Otherwise, we combine the weighted results of both Rst1 from stage-1 and Rst2 from stage-3. The best loss result of 0.6x is obtained on the public testing dataset.
Intel & MobileODT Cervical Cancer Screening. The overall workflow of our solution is depicted in Fig. 4. The solution can be divided into two stages as follows: i) the first is to perform cervix detection using YOLO and Faster R-CNN models on full images, and ii) the second is to apply various image classifiers to the cropped cervical ROIs. We detail each stage in the following.

Stage-1: Cervix detection on original images. This stage can be divided into 5 steps, which we run as follows.

Step 1. Generating label text files - the bounding boxes required by YOLO and Faster R-CNN for object detection are generated. All images in all image sets are made square by bordering the shorter side of the images with black pixels.

Step 2. Model training - various YOLO and Faster R-CNN models are trained using the annotated cervical datasets with different numbers of anchors.

Step 3. Cropping cervix ROIs - the trained models perform cervix detection on both the training and public testing image sets, and cervical ROIs are cropped using the bounding box with the highest probability value from the above models.

Step 4. Re-training and cropping - the cervical ROIs cropped from the training set are used as refined annotations to re-train various models and crop new ROIs by repeating Steps 2 and 3 above.

Step 5. Resizing ROIs - in this step, all the cropped ROIs are resized to 224×224 for the next stage. At the end of this step, a total of 1,466 and 512 ROIs are generated from the training and public testing sets, respectively.

Stage-2: Cervix classification on cropped ROIs. This stage can be divided into 3 steps, which we run as follows.

Step 1. Preprocessing - image preprocessing is performed in this step, e.g., different data augmentations by means of the Keras ImageDataGenerator, such as rotation, flipping, and shifting. The data imbalance is reduced using horizontal flipping of the images.

Step 2. Model training - various CNNs (VGG16, VGG19, Inception V3, and ResNet50) are trained using the ROIs cropped from the training dataset, and then tested on the ROIs cropped from the public testing dataset.

Step 3. Ensembling - the final result is an ensemble of the probability values from the various CNNs, where the ensemble weights are determined using hill-climbing optimization.
4.3 Numerical Results
Evaluation Metric. Both competitions involve multi-class image recognition tasks. Each image has been labeled with one true class, and we need to submit a set of predicted probabilities for every image. The performance of the proposed framework on the competitions is evaluated using the multi-class log loss, defined as follows:

logloss = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij}),    (1)

where n is the number of images in the test dataset, m is the number of image class labels, y_{ij} is 1 if observation i belongs to class j and 0 otherwise, and p_{ij} is the predicted probability that observation i belongs to class j. The smaller the log loss, the better the performance of our framework.
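Eq. (1) can be computed directly, as in the NumPy sketch below; the clipping of predicted probabilities away from 0 and 1 follows Kaggle's standard evaluation practice, and the epsilon value is an assumption.

```python
import numpy as np

def multiclass_log_loss(y_true, p_pred, eps=1e-15):
    """y_true: (n,) integer labels; p_pred: (n, m) predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)   # renormalize each row
    n = len(y_true)
    # Only the true-class probability contributes, since y_ij is one-hot.
    return -np.sum(np.log(p[np.arange(n), y_true])) / n
```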
The Nature Conservancy Fisheries Monitoring. In this competition, we perform fish localization using Faster R-CNN and SSD, and then determine the fish location using the results from the localization models with the top probabilities. Subsequently, we perform fish classification using VGG16 and ResNet50 models. In the last step, we obtain the final output by ensembling the classification probabilities from each of the deep CNN models.

The final output is evaluated using the multi-class log loss of Eq. 1. Table 1 shows that our proposed framework scores 0.64 and 1.19 on the public and private leaderboards, respectively, which is much better than the values of 0.72 and 1.89 achieved by image classifiers only. In this competition, we ranked 4th out of 2,293 teams.

Method                   Standard Image Classifier   Our Proposed Framework
Score (private/public)   1.89 / 0.72                 1.19 / 0.64
Table 1: Performance on the Kaggle competition "The Nature Conservancy Fisheries Monitoring" using the evaluation metric of Eq. 1, reported as private/public leaderboard scores.
Intel & MobileODT Cervical Cancer Screening. In this competition, we first perform cervical region of interest (ROI) detection using Faster R-CNN and YOLO, then determine the cervix location using the results from the localization models via a majority voting scheme. Subsequently, we classify cervix types using various deep learning models such as VGGNet, Xception, Inception V3, and ResNet50. Finally, we use hill-climbing to determine the linear combination weights for the final ensemble as our output.

The final output is evaluated using the multi-class log loss of Eq. 1. Table 2 compares the scores using image classifiers only and our proposed framework. Although the score of 0.427 for the image classifier on the public leaderboard is better than the 0.458 of our proposed framework, the former seems to overfit the stage-1 testing set: its score on the private leaderboard increases to 0.873, which is much worse than the 0.808 of our proposed framework. The final score of 0.808 achieved by our proposed framework ranks 3rd out of 834 teams.

Method                   Standard Image Classifier   Our Proposed Framework
Score (private/public)   0.873 / 0.427               0.808 / 0.458
Table 2: Performance on the Kaggle competition "Intel & MobileODT Cervical Cancer Screening" using the evaluation metric of Eq. 1, reported as private/public leaderboard scores.
5 CONCLUSIONS AND FUTURE WORKS
In this paper, we propose a computational deep learning framework for image recognition, specifically for image classification based on the object types within the image. The proposed framework incorporates the latest developments in deep learning network architectures for object detection and image classification. The effectiveness of the proposed framework has been demonstrated by its promising performance in various Kaggle image recognition competitions.

To continuously enhance the performance of the proposed framework for image recognition applications, several directions are worth further investigation. Firstly, we are working on how to leverage unlabeled data to further improve performance. One possible solution is to perform semi-supervised learning with Generative Adversarial Networks (GANs) [5]. Promising results have been reported in several research works such as [16]. Secondly, from the experiments, we found that most of the classification failures are caused by hard samples or classes. For example, some classes in the fish recognition competition, namely "ALB", "BET" and "YFT", are extremely similar to each other; it is very hard for the model, and even for humans, to correctly classify them. We are looking into solutions such as focal loss [14] to deal with this issue. Deep learning is evolving very fast; recently, more elegant deep network architectures have been presented for classification and regression problems, such as Inception V4 [25], DenseNet [9], and Dual Path Networks [1]. We are planning to integrate these new CNN architectures into our proposed framework to further increase its performance.
ACKNOWLEDGEMENT
The authors would like to thank Dr. Russ Wolfinger, Mr. Dmytro Poplavskiy, Mr. Gilberto Titericz, and Mr. Joseph Chui for their insightful discussions and suggestions on the Kaggle competitions presented in this study.
REFERENCES
[1] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. 2017. Dual path networks. In Advances in Neural Information Processing Systems. 4470–4478.
[2] Kaggle Competition. 2017. Intel & MobileODT Cervical Cancer Screening. https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening
[3] Kaggle Competition. 2017. The Nature Conservancy Fisheries Monitoring. https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring
[4] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems. 379–387.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[6] Cao Guimei, Xie Xuemei, Yang Wenzhe, Liao Quan, Shi Guangming, and Jinjian Wu. 2017. Feature-fused SSD: Fast detection for small objects. In Ninth International Conference on Graphic and Image Processing. 106151E.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision. Springer, 630–645.
[9] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2261–2269.
[10] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3296–3297.
[11] Health Information. 2015. Approaches of Cervical Cancer. http://magickeepers.blogspot.sg/2015/12/wise-woman-approaches-of-cervical-cancer.html
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. arXiv:1708.02002.
[15] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
[16] Augustus Odena. 2016. Semi-supervised learning with generative adversarial networks. arXiv:1606.01583.
[17] Waseem Rawat and Zenghui Wang. 2017. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation. MIT Press, 2352–2449.
[18] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[19] Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, faster, stronger. arXiv:1612.08242.
[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[21] Ross Girshick. 2015. Fast R-CNN. arXiv:1504.08083.
[22] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2016. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016), 142–158.
[23] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks, 85–117.
[24] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
[25] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI. 4278–4284.
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[27] Christian Szegedy, Scott Reed, Dumitru Erhan, Dragomir Anguelov, and Sergey Ioffe. 2014. Scalable, high-quality object detection. arXiv:1412.1441.
[28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
... Unfortunately, in reality, this is not possible and researchers have to deal with these two problems in almost every task. The research community through conferences and competitions such as ImageNet [15] and Kaggle [16] challenges, and motivates researchers to create models and strategies to solve different tasks in computer vision. They also create and make available many databases [8] that hugely contribute to fostering research in the field. ...
... 16: Training loss curves for two iterations for the best initialization strategy (Init6) with convergence at ≈ 30 epochs. ...
Thesis
Multimodal learning involves the use of multiple senses (touch, visual, auditory, etc.) during the learning process to better understand a phenomenon. In the computational domain, we need systems to understand, interpret, and reason with multimodal data, and while there have been enormous advances in the field, many of the desired capabilities remain beyond our reach. The objective of such systems is to leverage different semantically related data types to output better predictions for a phenomenon of interest. For example, for users with sensory disabilities such as visual, to carry out daily tasks such as making a purchase or finding a place in a city, the visual information of their environment has to be transformed into a different modality with more semantic meaning. A system for this purpose could use auditory information provided by the user that specifies what information is required and that can be easily transformed into textual data, and visual information such as images obtained from its surroundings to help the user make a decision. Therefore, it would be a multimodal system leveraging information from three different modalities: auditory + text + images.When it comes to the computational side, working with multimodal data comes with several challenges. This thesis focuses on advancing multimodal learning research through various scientific contributions: we simplify the creation of deep learning models by proposing frameworks that find a common semantic space for visual and textual modalities using deep learning as the backbone tool; we propose competitive strategies to address the tasks of cross-modal retrieval, scene-text visual question answering, and attribute learning; we address various data-related issues like imbalance and learning when not enough data is annotated. These contributions aim to bridge the gap between humans (such as non-expert users) and artificial intelligence to tackle everyday tasks. Our first contribution aims to evaluate the effectiveness of a multimodal system that receives images and text and retrieves relevant multimodal information. This approach allows us to perform a complete study to evaluate the effectiveness of a cross-modal retrieval system with deep learning as the backbone tool. The cross-modal feature allows the formulation of the queries in the form of images or text and retrieves relevant multimodal data. With this approach, we can evaluate the ability of the model to produce effective multimodal representations and to handle any multimodal query with a single model. Subsequently, in our second contribution, we adapt the system to perform a recent task called scene-text visual question answering (ST-VQA). The aim is to teach traditional VQA models to read the text contained in natural images. This task requires us to perform a semantic analysis between the visual content and the textual information contained in associated questions to give the correct answer. We find this task very relevant in the multimodal context since it truly forces us to jointly develop mechanisms that reason about visual and textual content. Our latest contributions point to data-related issues. Data is one of the most important factors in aiming for good performance. Therefore, we determined that a relevant skill is to understand how to properly clean and analyze data and create strategies that can take advantage of it. We address very common and frequent issues such as noise, imbalance, and insufficient annotated data. 
To evaluate our strategies, we consider the problem of attribute learning. Attribute learning can complement category-level recognition and therefore improve the degree to which machines perceive visual objects. In the first study, we cover two key aspects: imbalance and insufficient labeled data. We propose adaptations to classical imbalanced learning strategies that cannot be directly applied when using multi-attribute deep learning models. In the second study, we propose a novel strategy to exploit class-attribute relationships to learn predictors of attributes in a semi-supervised learning way. Semi-supervised learning permits harnessing the large amounts of unlabelled data available in many use cases in combination with typically smaller sets of labeled data.
... A publicly accessible FathomNet Model Zoo (FMZ) 63 contains FathomNet-trained machine learning models contributed by community members to be freely used on other visual data collected in various marine regions with a number of platforms containing many concepts. As FathomNet grows to include additional concepts and imagery, we envision intellectual activities around the dataset similar to ImageNet and Kaggle-style competitions, where baseline datasets and annual challenges could be leveraged to develop state-of-the-art algorithms for future deployment 64,65 . ...
Article
Full-text available
The ocean is experiencing unprecedented rapid change, and visually monitoring marine biota at the spatiotemporal scales needed for responsible stewardship is a formidable task. As baselines are sought by the research community, the volume and rate of this required data collection rapidly outpaces our abilities to process and analyze them. Recent advances in machine learning enables fast, sophisticated analysis of visual data, but have had limited success in the ocean due to lack of data standardization, insufficient formatting, and demand for large, labeled datasets. To address this need, we built FathomNet, an open-source image database that standardizes and aggregates expertly curated labeled data. FathomNet has been seeded with existing iconic and non-iconic imagery of marine animals, underwater equipment, debris, and other concepts, and allows for future contributions from distributed data sources. We demonstrate how FathomNet data can be used to train and deploy models on other institutional video to reduce annotation effort, and enable automated tracking of underwater concepts when integrated with robotic vehicles. As FathomNet continues to grow and incorporate more labeled data from the community, we can accelerate the processing of visual data to achieve a healthy and sustainable global ocean.
... Kaggle competition is a very well-known platform for machine learning researchers where many research agencies share their data to solve different types of research problems. For example, many researchers used data from Kaggle competitions to analyze real-life problems and propose models to solve the problems, such as sentiment analysis, feature detection, diagnosis prediction [10,14,[29][30][31]. ...
... , d, a ̸ = b. However, the Pearson correlation coefficient is a continuous value between [−1, 1] and needs to be encoded for further usage in machine learning or deep learning [45]. ...
Article
Analysis of high dimensional biomedical data such as microarray gene expression data and mass spectrometry images, is crucial to provide better medical services including cancer subtyping, protein homology detection,etc. Clustering is a fundamental cognitive task which aims to group unlabeled data into multiple clusters based on their intrinsic similarities. The K-means algorithm is one of the most widely used clustering heuristics that aims at grouping the data objects into meaningful clusters such that the sum of squared Euclidean distances within each cluster is minimized. Its conceptual simplicity and computational efficiency make it easy to be used for wide applications of different data types. However, all features of data in K-means are considered equally in relevance, which distorts the performance when clustering high-dimensional data such as microarray gene expression data, mass spectrometry images, where there exist many redundant variables and correlated variables. In this paper, we propose a new correlation induced clustering, CoIn, to capture complex correlations among high dimensional data and guarantee the correlation consistency within each cluster. We evaluate the proposed method on a high dimensional mass spectrometry dataset of liver cancer tumor to explore the metabolic differences on tissues and discover the intra-tumor heterogeneity (ITH). By comparing the results of baselines and ours, it has been found that our method produces more explainable and understandable results for clinical analysis, which demonstrates the proposed clustering paradigm has the potential with application to knowledge discovery in high dimensional bioinformatics data.
... Reinforcement learning is another method used to solve the space exploration problem of mobile robots in unknown environments [10,11]. This is due to its characteristics that enable agents to have learning capabilities and recognition, e.g., through image processing. ...
Article
Full-text available
Space exploration is a hot topic in the application field of mobile robots. Proposed solutions have included the frontier exploration algorithm, heuristic algorithms, and deep reinforcement learning. However, these methods cannot solve space exploration in time in a dynamic environment. This paper models the space exploration problem of mobile robots based on the decision-making process of the cognitive architecture of Soar, and three space exploration heuristic algorithms (HAs) are further proposed based on the model to improve the exploration speed of the robot. Experiments are carried out based on the Easter environment, and the results show that HAs have improved the exploration speed of the Easter robot at least 2.04 times of the original algorithm in Easter, verifying the effectiveness of the proposed robot space exploration strategy and the corresponding HAs.
... >In recent years, deep learning-based methods have been widely applied to computer vision including pattern recognition [1], image segmentation [2,3], and object detection [4]. Xiao et al. [5] presented the pipeline image diagnosis algorithm which combines image segmentation with object detection. ...
Article
Full-text available
Estimating the fundamental matrix (F-matrix) is a basic problem in computer vision. The traditional algorithms are highly based on correspondences. By imprecise detecting and matching correspondences, the F-matrix is estimated incredibly. An end-to-end network (F-net) is provided in the present work without detecting and matching correspondences. To ensure estimation of an accurate F-matrix which is rank-2 with 7 degrees of freedom and scale invariance, we used the Improved convolutional block attention module (Improved-CBAM), and two self-define layers in this network. The experiments were conducted on the KITTI dataset. Two metrics, MMABS (Epipolar Constraint with Mean Absolute Value) and MMSQR (Epipolar Constraint with Mean Squared Value) were used to measure how well the epipolar constraint is satisfied by the estimated F-matrix. MMSQR and MMABS of the F-net are 0.21 and 0.11, respectively, and are 95.32 and 37.36 in the eight-point algorithm, respectively. For another end-to-end network, they are 3.48 and 2.77, respectively. F-net outperforms the other algorithms. The results demonstrated that the F-matrix can be successfully estimated by the F-net.
Article
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs, whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing-gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction in parameters; and its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state of the art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100, and 1.59% on SVHN).
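The connectivity pattern is straightforward to express in code: every layer consumes the concatenation of all earlier feature maps and contributes a fixed number of new channels. A minimal PyTorch sketch of one dense block follows; the layer count and growth rate are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all preceding feature
    # maps and adds `growth_rate` new channels to the running stack.
    def __init__(self, in_channels, num_layers=4, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate every earlier output along the channel axis.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)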
Article
In this work, we present a simple, highly efficient, and modularized Dual Path Network (DPN) for image classification, which presents a new internal topology of connection paths. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new-feature exploration, both of which are important for learning good representations. To enjoy the benefits of both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through its dual-path architecture. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365, and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state of the art. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101 (64x4d) with 26% smaller model size, 25% less computational cost, and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single-model performance with more than 3 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet, and the latest ResNeXt model across various applications.
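The dual-path idea can be sketched as a shared convolution whose output is split: one slice is added to a residual path (feature re-usage) and the other is concatenated onto a dense path (new-feature exploration). The PyTorch sketch below is a simplified reading of the block, not the paper's exact architecture.

import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    # Simplified dual-path block: the shared conv output is split into
    # a residual part (added) and a dense part (concatenated).
    def __init__(self, res_channels, dense_channels, growth=16):
        super().__init__()
        in_channels = res_channels + dense_channels
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, res_channels + growth, 3, padding=1),
        )
        self.res_channels = res_channels

    def forward(self, res, dense):
        out = self.conv(torch.cat([res, dense], dim=1))
        # Residual path: element-wise addition enables feature re-usage.
        res = res + out[:, :self.res_channels]
        # Dense path: concatenation enables new-feature exploration.
        dense = torch.cat([dense, out[:, self.res_channels:]], dim=1)
        return res, dense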
Article
Convolutional neural networks (CNNs) have been applied to visual tasks since the late 1980s. However, despite a few scattered applications, they were dormant until the mid-2000s when developments in computing power and the advent of large amounts of labeled data, supplemented by improved algorithms, contributed to their advancement and brought them to the forefront of a neural network renaissance that has seen rapid progression since 2012. In this review, which focuses on the application of CNNs to image classification tasks, we cover their development, from their predecessors up to recent state-of-the-art deep learning systems. Along the way, we analyze (1) their early successes, (2) their role in the deep learning renaissance, (3) selected symbolic works that have contributed to their recent popularity, and (4) several improvement attempts by reviewing contributions and challenges of over 300 publications. We also introduce some of their current trends and remaining challenges.
Article
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which further makes training easy and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10/100, and a 200-layer ResNet on ImageNet.
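The direct-propagation claim can be made explicit. Writing a residual unit with an identity skip connection as x_{l+1} = x_l + F(x_l, W_l) and unrolling from any shallow unit l to any deeper unit L gives

x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),

so the forward signal reaches x_L additively; by the chain rule,

\partial E / \partial x_l = (\partial E / \partial x_L) * (1 + \partial/\partial x_l \sum_{i=l}^{L-1} F(x_i, W_i)),

which shows that the gradient at x_L propagates directly back to any x_l without passing through intermediate weight layers (notation follows the paper's formulation).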
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
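The minimax game can be stated as a single value function, as given in the paper:

min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))],

where D(x) is the probability that x came from the data rather than from G. At the unique equilibrium, G recovers the training data distribution and D(x) = 1/2 everywhere.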
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
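The RPN head itself is compact: a 3x3 convolution slides over the shared feature map, feeding two sibling 1x1 convolutions that output 2k objectness scores and 4k box-regression offsets for k anchors at each position. The following minimal PyTorch sketch assumes the common k = 9 anchor setting and VGG-16-style 512-channel features; it is illustrative, not the released implementation.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    # Region Proposal Network head: at every spatial position of the
    # shared feature map, predict objectness scores and box offsets
    # for k anchors.
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object vs. background
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box deltas

    def forward(self, features):
        x = self.relu(self.conv(features))
        return self.cls(x), self.reg(x)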