Multi-view Vehicle Type Recognition with
Feedback-enhancement Multi-branch CNNs
Zhibo Chen, Senior Member, IEEE, Chenlu Ying, Chaoyi Lin, Sen Liu, Weiping Li, Fellow, IEEE
Abstract—Vehicle type recognition (VTR) is a common requirement and one of the key challenges in real surveillance scenarios such as intelligent traffic and unmanned driving.
Usually coarse-grained and fine-grained VTR are applied in
different applications, and the challenge from multiple viewpoints
is critical for both cases. In this paper, we propose a Feedback-
enhancement Multi-branch CNN (FM-CNN) to solve the chal-
lenge in these two cases. The proposed FM-CNN takes three
derivatives of an image as input and leverages the advantages
of hierarchical details, feedback enhancement, model average
and stronger robustness to translation and mirroring. A single
global cross-entropy loss is insufficient to train such a complex
CNN and so we add extra branch losses to enhance feedbacks
to each branch. While reusing pre-trained parameters, we propose a novel parameter update method to adapt FM-CNN
to task-specific local visual patterns and global information in
new datasets. To test the effectiveness of FM-CNN, we create our own Multi-view VTR (MVVTR) dataset since no such dataset is available. For fine-grained VTR, we use the
CompCars dataset. Compared with state-of-the-art classification
solutions without special preprocessing, the proposed FM-CNN
demonstrates better performance in both coarse-grained and fine-
grained scenarios. For coarse-grained VTR, it achieves 94.9%
Top-1 accuracy on MVVTR dataset. For fine-grained VTR, it
achieves 91.0% Top-1 and 97.8% Top-5 accuracies on CompCars
dataset.
Index Terms—VTR, multi-view, feedback-enhancement, multi-
branch, CNN.
I. INTRODUCTION
SURVEILLANCE-RELATED tasks are hot topics in com-
puter vision domains [1]–[4]. Among these tasks, vehicle
type recognition (VTR) is a key challenge in surveillance data analysis and has many applications. For example, it can help automatically calculate road tolls, which depend on vehicle sizes, without human involvement. In unmanned vehicles, VTR can help determine the safe distances to keep from neighboring vehicles. These two applications focus more on vehicle sizes, which are coarse-grained. Besides, when it comes to searching for a vehicle in surveillance videos, methods based on the number plate usually cannot work since the number plate may be either fake or invisible.
Zhibo Chen, Chenlu Ying, Chaoyi Lin, Sen Liu and Weiping Li are
with University of Science and Technology of China, Hefei, Anhui,
230026, China, (e-mail: chenzhibo@ustc.edu.cn, ying1992@mail.ustc.edu.cn,
lcy1993@mail.ustc.edu.cn, elsen@iat.ustc.edu.cn, wpli@ustc.edu.cn)
This work was supported in part by the National Key Research and
Development Program of China under Grant No. 2016YFC0801001, the
National Program on Key Basic Research Projects (973 Program) under Grant
2015CB351803, NSFC under Grant 61571413, 61632001, 61390514, and Intel
ICRI MNC.
Copyright © 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
In
such situations, VTR can filter out vehicles with unrelated
types and reduce the burden of human searching. Such an
application needs not only coarse-grained information but also
fine-grained information like vehicle models.
(a) Multiple viewpoints
(b) Invisible number plates
Fig. 1. Challenges in our MVVTR dataset. Images are taken from various
viewpoints. Number plates are not shown in most images. These two cases are
common in real scenarios but have not been considered by existing coarse-
grained VTR solutions.
Accordingly, research on VTR can be classified into
two categories. The first category focuses on the coarse-
grained classification and differentiates vehicles mainly by
their sizes, e.g. large, medium or small [5]–[9]. Vehicles
in these researches can be classified into bus, truck, sedan,
minicar and so on. Although these studies have achieved good performance, the challenge of multiple viewpoints does not seem to have been well studied. In contrast, vehicle images taken from various viewpoints are very common in real
scenarios. Therefore, existing coarse-grained VTR solutions
are limited in real applications. In this paper, we try to
solve this multi-view coarse-grained VTR problem by our
proposed FM-CNN. We also create an MVVTR dataset for
this problem. The second category focuses on the fine-grained
classification. In this case vehicles are classified mainly by
their detailed information such as makers and/or models [10]–
[15]. CompCars dataset [14] is created especially for fine-
grained car classification. It contains 431 vehicle models and
30955 web-nature images of various viewpoints. The baseline performance with GoogLeNet [16] in [14] is not good. The result in [15] achieves the state-of-the-art performance with AlexNet [17] on this dataset. In this paper, we also verify
the proposed FM-CNN on CompCars dataset for fine-grained
VTR. Compared with many existing methods, our FM-CNN
handles the multi-view challenge with better performance and
does not have the requirement for specific viewpoints and
specific camera parameters.
Recently, convolutional neural networks (CNNs) have
achieved excellent performance in computer vision tasks like
classification [16]–[20] and detection [21]–[24]. In this paper,
we also utilize the powerful feature extraction & learning
ability of CNNs and design FM-CNN inspired by AlexNet
[17]. FM-CNN is deep and broad. Training such a large CNN
with only one global cross-entropy loss would cause bad
convergence. Therefore, we add local branch loss modules to
enhance feedbacks to each branch. In addition, training such
a large CNN from scratch with MVVTR dataset or CompCars
dataset would cause over-fitting. Therefore, FM-CNN reuses
the pre-trained parameters of AlexNet on the ILSVRC2012
dataset1. Considering the analogy between CNNs and human
visual systems and the fact that neurons in higher cortices
are sensitive to more semantic visual features, we propose
a novel parameter update method to both adapt high level
convolutional kernels to task-specific local patterns and make
fully-connected layers reorganize global information according
to new datasets. With feedback-enhancement branch losses
and the update method, FM-CNN achieves excellent results
on MVVTR dataset and outperforms state-of-the-art solutions
on CompCars dataset.
Our contributions are as follows:
1) We design FM-CNN which has three major character-
istics. The first characteristic is multiple branches. The
FM-CNN has three convolutional branches with three
scale images as input. Multi-scale inputs expose hierarchical details of images. More branches also mean stronger
robustness to translation and mirroring with training
set augmentation. The second characteristic is feedback
enhancement. Training the multi-branch with only one
global cross-entropy loss may be intractable because of
latent interference from multi-scale feature maps. So we
add a local loss module after each branch to enhance the
feedback. Every branch is an independent CNN with its
own branch loss. Therefore, the whole FM-CNN takes advantage of model averaging with such a feedback enhancement scheme. The third characteristic is a parameter update method that renews the parameters in higher convolutional layers and higher fully-connected layers. This update method
is inspired by the similarity between hierarchical layer
architectures of CNNs and human visual systems, and the
fact that neurons in high visual cortices like V4 respond
to more semantic visual patterns. Updating the parameters
in higher convolutional layers changes the selectivity of
convolutional kernels to new task-specific visual patterns.
And updating the parameters in higher fully-connected
layers can reorganize global information and avoid over-
fitting.
2) We create MVVTR dataset for multi-view coarse-grained
VTR problems. This dataset contains 7570 web-nature
images and 7 major vehicle types: bus, minicar, MPV,
sedan, sports car, SUV and truck. Each type has nearly
1000 images. All the images display vehicles from
various viewpoints (Fig.1a) and many of them do not
show number plates (Fig.1b). The multi-view attribute in
1http://www.image-net.org/challenges/LSVRC/2012/
MVVTR dataset invalidates nearly all existing solutions
for coarse-grained VTR.
3) With the extra feedback-enhancement branch losses
and the novel parameter update method, our FM-CNN
achieves 94.9% Top-1 accuracy on MVVTR dataset for
coarse-grained multi-view VTR and 91.0% Top-1 and 97.8% Top-5 accuracies on CompCars dataset for fine-grained multi-view VTR. The performance on CompCars outperforms the state-of-the-art method [15] by 6.2% and 2.4%. The results show our FM-CNN is ef-
fective and efficient for general VTR problems regardless
of viewpoints and granularities.
The remainder of this paper is organized as follows: Section
II reviews works related to VTR and CNNs. Section III
introduces the architecture and advantages of our FM-CNN.
Section IV introduces our MVVTR dataset and experiment
settings as well as analysis of the experiment results on
MVVTR dataset and CompCars dataset. Section V concludes
this paper. MVVTR dataset and the model and weights of our
FM-CNN will be shared on our website: http://staff.ustc.edu.
cn/chenzhibo/resources.html.
Fig. 2. Flowchart of FM-CNN. Images are resized to 3 scales and then fed to the conv branches. The feature maps are concatenated at the conv_sq layer and the last fc layer gives the model probabilities.
II. RELATED WORK
A. Vehicle type recognition
VTR problems have already been studied for nearly 20
years. All existing methods can be summarized into three
categories.
The first category is appearance-based and suits both coarse-grained and fine-grained VTR. These methods first localize number plates as a reference and then extract handcrafted features from vehicle fronts, especially from regions around lamps and number plates [10]–[13], [25]. Appearance-
based methods with powerful visual features like histogram
of oriented gradients (HOG) [26] and scale-invariant feature
transform (SIFT) [27] have achieved good performance and overcome many challenges such as location and luminance variations. Distinct features also help to recognize vehicle types at a fine-grained level. However, an obvious drawback of these methods is that they can only handle front images of vehicles, which is due to the number plate localization and distinct feature extraction steps. We call this drawback the limitation of viewpoints. Because of this limitation, appearance-based methods cannot be widely used in real scenarios since multi-view vehicle images are common.
The second category is geometry-based and suits only coarse-grained VTR. These methods first separate vehicles from backgrounds. Then, by applying the transformation from image coordinates to real-world coordinates, geometry parameters of vehicles, e.g. length, width and height, are estimated to determine the vehicles' rough types [5]–[7]. Geometry-based methods cannot determine vehicles' fine-grained types since vehicles of different types may have similar geometry parameters. Besides, geometry-based methods need camera parameters to apply the coordinate transformation, and these are usually hard to obtain in real scenarios. We call this drawback the limitation of camera parameters. Because of this limitation, geometry-based methods are all camera-specific and need to adjust their parameters for every camera, which limits their application in real scenarios.
The third category is learning-based with CNNs and suits both coarse-grained and fine-grained VTR. For coarse-grained VTR problems, works with CNNs succeed in recognizing vehicles of several types in images that contain more than one vehicle [8], [9]. But the images under study are only taken from one or two specific camera viewpoints without considering the challenge of multiple viewpoints. And the CNN frameworks used in these works are too simple to learn powerful features from vehicles of multiple viewpoints. For
fine-grained VTR problems, [14] offers a large car dataset
named CompCars dataset and gives the baseline performance
with GoogLeNet [16]. The baseline performance is low with
76.7% Top-1 accuracy and 91.7% Top-5 accuracy. The method
in [15] achieves the state-of-the-art performance on CompCars
dataset with 84.8% Top-1 accuracy and 95.4% Top-5 accuracy.
But this method has several drawbacks: firstly it needs three
faces of vehicles to be shown in images (the limitation of
viewpoints); secondly it requires camera parameters to eval-
uate the bounding boxes (the limitation of cameras); thirdly
its preprocessing of bounding box detection is complicated,
compared with the mean reduction preprocessing in many
CNN works [16], [17], [19], [20].
B. Convolutional neural networks
CNNs are hierarchical models that decompose complex non-linear mappings into combinations of many simple non-linear functions. The first named CNN for classification is LeNet [28]. LeNet is successful in recognizing handwritten digits and demonstrates the effectiveness and efficiency of back-propagation training. For a long time after LeNet, research on CNNs made little progress due to hardware limitations. In 2012, Krizhevsky et al. propose a CNN that achieves good performance on the 1000-category ILSVRC2012 dataset [17]. This CNN is called AlexNet and demonstrates the efficiency of many techniques such as dropout, ReLU activations and data augmentation. Since AlexNet, re-
search on CNNs for object classification proceeds quickly. For
example, there are VGG [19], which uses fixed 3 × 3 convolu-
tional kernels, GoogLeNet [16], which combines multi-scale
convolutional kernels into Inception modules, and ResNet
[20], which contains skip connections to learn residues. CNNs
for classification usually have these layers: convolutional layers to learn hierarchical local features; fully-connected layers to use global information; pooling layers to reduce feature map dimensions; activation layers to produce non-linearity; dropout layers to avoid over-fitting; and a classification layer to output probabilities of categories.
In this paper, we design FM-CNN for both coarse-grained and fine-grained VTR problems. In contrast to many of the existing methods mentioned above, our method does not need information on the viewpoints of images or the parameters of cameras. Nor does FM-CNN need complex preprocessing such as scene segmentation or vehicle part detection. Fig. 2 shows how FM-CNN works.
III. FEEDBACK-ENHANCEMENT MULTI-BRANCH CNN
A. Network architecture
Our FM-CNN is based on AlexNet. As shown in Fig. 3, the input size of AlexNet is 227 × 227. It has five convolutional
layers to learn local features and three fully-connected layers
to organize global information. Two local response normaliza-
tion (LRN) layers which are behind the conv1 and conv2 layers
mimic lateral inhibition. Three pooling layers are behind the
lrn1, lrn2 and conv5 layers to reduce feature map dimensions.
Two dropout layers with probability 0.5 are behind the fc6
and fc7 layers to avoid over-fitting. Activations are all ReLU
except the last one, which is pure identity. AlexNet is trained
on the ILSVRC2012 dataset, so the last fc8 layer has 1000
outputs.
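For concreteness, the layer sequence described above can be summarized in code. The following is not the authors' Caffe prototxt; it is a minimal PyTorch-style sketch of the single-branch AlexNet trunk, assuming the standard AlexNet hyper-parameters (kernel sizes, channel counts, strides) and omitting the grouped convolutions shown as parallelism in Fig. 3.

```python
import torch
import torch.nn as nn

# Minimal sketch of the AlexNet-style trunk described above (not the authors' Caffe model).
# Grouped convolutions (the two parallel paths in Fig. 3) are omitted for simplicity.
class AlexTrunk(nn.Module):
    def __init__(self, num_outputs=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),     # conv1
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),                   # lrn1
            nn.MaxPool2d(3, stride=2),                                             # pool1
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),   # conv2
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),                   # lrn2
            nn.MaxPool2d(3, stride=2),                                             # pool2
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv3
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv4
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv5
            nn.MaxPool2d(3, stride=2),                                             # pool5
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # fc6
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # fc7
            nn.Linear(4096, num_outputs),                                          # fc8 (identity activation)
        )

    def forward(self, x):                      # x: (N, 3, 227, 227)
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```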
Our FM-CNN extends the convolutional part to three branches, as shown in Fig. 4. The first branch has the largest input size, 435 × 435, and is called the large branch. We append "_l" after layer names in this branch. The large branch has five convolutional layers, two LRN layers and three pooling layers. The second branch has the same input size, 227 × 227, as AlexNet and is called the medium branch. We append "_m" after layer names in this branch. The medium branch has all the layers in the large branch except the last pooling layer. The third branch has the smallest input size, 117 × 117, and is called the small branch. We append "_s" after layer names in this branch. The small branch only has two convolutional layers, one LRN layer and one pooling layer. All activations in these three branches are ReLU.
Fig. 3. Architecture of AlexNet. It has 5 convolutional layers (blue), 3 pooling layers (yellow) and 3 fully-connected layers (brown). Other layers such as dropout and local response normalization (LRN) are not shown. Parallelism in the conv2, pool2, conv4, conv5 and pool5 layers is due to the limited memory of graphics processing units (GPUs) at the time.
All hyper-parameters of the layers, such as kernel size, kernel number and stride, are the same as their counterparts in AlexNet. For example, the conv1_l, conv1_m and conv1_s layers all correspond to the conv1 layer in AlexNet, and the pool4_l and pool4_m layers both correspond to the pool4 layer in AlexNet. We also add batch normalization layers [29] after the convolutional layers whose parameters are updated.
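As a rough illustration of how the three branches differ, the sketch below truncates the trunk above at the layers named in the text, assuming Caffe-style ceil-mode pooling so that the 435 × 435, 227 × 227 and 117 × 117 inputs all end in 13 × 13 feature maps; LRN and the batch normalization layers mentioned above are omitted for brevity, and the layer hyper-parameters follow AlexNet as stated.

```python
import torch
import torch.nn as nn

def conv_relu(cin, cout, k, stride=1, pad=0):
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride, pad), nn.ReLU(inplace=True))

def pool():
    # Caffe-style ceil-mode pooling so that every branch ends with 13 x 13 feature maps.
    return nn.MaxPool2d(3, stride=2, ceil_mode=True)

def make_branch(kind):
    # "small": conv1, pool1, conv2 (117 -> 13); "medium": five convs without the final pool
    # (227 -> 13); "large": five convs plus the final pool (435 -> 13).
    layers = [conv_relu(3, 96, 11, stride=4), pool(), conv_relu(96, 256, 5, pad=2)]
    if kind != "small":
        layers += [pool(), conv_relu(256, 384, 3, pad=1),
                   conv_relu(384, 384, 3, pad=1), conv_relu(384, 256, 3, pad=1)]
        if kind == "large":
            layers.append(pool())
    return nn.Sequential(*layers)

branch_l, branch_m, branch_s = make_branch("large"), make_branch("medium"), make_branch("small")
for size, branch in [(435, branch_l), (227, branch_m), (117, branch_s)]:
    assert branch(torch.randn(1, 3, size, size)).shape[-2:] == (13, 13)
```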
All three convolutional branches output feature maps of size 13 × 13. These feature maps are concatenated along the depth (channel) dimension and then fed into a convolutional layer named conv_sq. The conv_sq layer uses 1 × 1 convolutional kernels, which is inspired by "network in network" [18] and has the benefits of reducing the number of feature map channels, increasing the non-linearity of the whole model, training each branch with information from the other two branches during back-propagation, and reducing the newly introduced parameters to avoid over-fitting.
After a pooling layer, feature maps are input into the global
loss module, which has three fully-connected layers and two
dropout layers. The output number of the last fully-connected
layer varies based on specific tasks, 7 for coarse-grained VTR
on our MVVTR dataset and 431 for fine-grained VTR on
CompCars dataset. The activations are ReLU for the first two
fully-connected layers and identity for the last fully-connected
layer. The global loss module is important since feedback from
this loss will be propagated back into all three branches.
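The fusion stage can be sketched as follows, continuing the hypothetical PyTorch-style code above: the three 13 × 13 branch outputs are concatenated along the channel dimension, squeezed by the 1 × 1 conv_sq layer, pooled, and passed through the three fully-connected layers of the global loss module. The output channel count of conv_sq (256 here) is an assumption not stated in the text; num_classes would be 7 for MVVTR and 431 for CompCars.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    # Concatenate the three branch feature maps, apply the 1x1 conv_sq layer,
    # pool, and classify with three fully-connected layers (the global loss module).
    def __init__(self, branch_channels=256, sq_channels=256, num_classes=7):
        super().__init__()
        self.conv_sq = nn.Conv2d(3 * branch_channels, sq_channels, kernel_size=1)  # 1x1 squeeze conv
        self.pool = nn.MaxPool2d(3, stride=2)                                       # 13x13 -> 6x6
        self.fc = nn.Sequential(
            nn.Linear(sq_channels * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),                 # fc7
            nn.Linear(4096, num_classes),                                                  # fc8_vtr
        )

    def forward(self, f_l, f_m, f_s):          # each: (N, 256, 13, 13)
        x = torch.relu(self.conv_sq(torch.cat([f_l, f_m, f_s], dim=1)))
        x = self.pool(x)
        return self.fc(torch.flatten(x, 1))
```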
B. Network advantages
1) Multiple branches: FM-CNN has three convolutional
branches which take multi-scale inputs. There are two advantages to utilizing multiple branches.
The first advantage is hierarchical details of images. The input
sizes of the three branches are 435 × 435, 227 × 227 and 117 × 117, nearly 2×, 1× and 0.5× the original size in AlexNet. Multi-scale representation, also called a spatial pyramid, has already been used in traditional handcrafted image descriptors
like SIFT [27] and HOG [26]. Images with the same content
but of different scales display different key points and features.
Combinations of these features are invariant to scale variations and help in understanding image contents better. So compound
features extracted within fixed-size windows from multi-scale
images usually have better representation ability. And varying
input scales by cropping fixed-size patches from multi-scale
training images is also a usual technique to make CNNs scale-
invariant, because CNNs can learn hierarchical information of images [16], [20] through this method. However, in this
way CNNs handle multi-scale inputs with a single convolutional branch, which may degrade the convolutional kernels' learning ability. The reason is that pixel-level contents are far different within fixed-size receptive fields in the case of multi-scale inputs. In contrast, FM-CNN handles multi-scale inputs with multiple branches, and the convolutional kernels in each branch can learn features from images of one specific scale, without being confused.
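As an illustration of the multi-scale input idea (not the authors' exact preprocessing, which is not fully specified here), the three inputs for a single image could be prepared as follows; the resize policy and the mean values are assumptions, since only mean reduction is stated as preprocessing later in the paper.

```python
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def multi_scale_inputs(img: Image.Image, mean=(0.0, 0.0, 0.0)):
    """Resize one image to the three branch scales (435, 227, 117) and subtract a dataset mean."""
    out = []
    for s in (435, 227, 117):
        x = TF.to_tensor(TF.resize(img, [s, s]))      # (3, s, s) tensor in [0, 1]
        x = x - torch.tensor(mean).view(3, 1, 1)      # mean reduction
        out.append(x)
    return out                                        # [large, medium, small]
```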
The second advantage is robustness to translation and
mirroring. A usual technique to make a model robust to a certain variation is to augment the training set so that the variation is reflected in the training images. Randomly cropping patches from larger images and mirroring these patches with some probability (usually 0.5) are widely used to augment the training sets of CNNs. With these augmentation methods, CNNs learn
features that are robust to translation and mirroring. FM-
CNN has three branches and each of these branches can
learn translation-and-mirroring-invariant features. Therefore,
the robustness to translation and mirroring of FM-CNN is stronger than that of single-branch CNNs.
2) Feedback enhancement: Multi-branch (or multi-scale)
CNNs have already been used in scene understanding tasks
[30]–[32]. However, these CNNs can’t be applied directly in
VTR problems for two reasons:
1) In scene understanding tasks, there are no concepts of
foregrounds and backgrounds. Therefore, feature maps
from different branches, which are of different scales in terms of the receptive fields of convolutional kernels, offer hierarchical information with little interference. But in VTR
problems, foregrounds are vehicles and backgrounds are
interference. Feature maps of different scales from different branches may amplify the interference.
Fig. 4. Details of FM-CNN. It has three convolutional branches, which take multi-scale inputs and output feature maps of the same size, 13 × 13. These feature maps are then concatenated and fed into the global loss module. Except for the new conv_sq and fc8_vtr layers, all other layers have their counterparts in AlexNet, so as to reuse the pre-trained parameters of AlexNet. (The fc8_vtr layer has 7 outputs for coarse-grained VTR on our MVVTR dataset and 431 outputs for fine-grained VTR on CompCars dataset.) To suppress the latent interference among feature maps from different branches and enhance the local feedback to each branch, we add three local branch loss modules, each of which contributes to training with weight 0.5. These branch losses are used only during training. The paths of the extra feedback from the branch losses are indicated by the yellow arrowed lines.
2) In scene understanding tasks, the training loss for multi-branch CNNs contains many pixel-level distances and gives enough feedback. But in VTR problems, which are classification tasks, the training loss is usually very small (a cross-entropy loss of less than 10). Such a training loss cannot provide enough feedback to guide the learning of convolutional kernels or to suppress the latent interference from feature maps of different scales.
To improve the performance of multi-branch CNNs for VTR problems, we add extra branch loss modules to enhance the feedback to each branch in our FM-CNN. Each branch loss module has two pooling layers and three fully-connected layers. The three branch losses contribute to the total training loss, each with weight 0.5. Different from GoogLeNet, where local loss modules enhance feedback mainly to solve the gradient vanishing problem, these local loss modules in FM-CNN enhance feedback mainly to suppress the latent interference among different feature maps.
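In other words, the total training objective is the global cross-entropy plus the three branch cross-entropies, each weighted by 0.5. A minimal sketch, assuming the logits of the global module and of the three branch loss modules have already been computed:

```python
import torch.nn.functional as F

def fm_cnn_loss(global_logits, branch_logits, target, branch_weight=0.5):
    """Global cross-entropy plus 0.5-weighted branch cross-entropies (used only during training)."""
    loss = F.cross_entropy(global_logits, target)
    for logits in branch_logits:                    # one logits tensor per branch loss module
        loss = loss + branch_weight * F.cross_entropy(logits, target)
    return loss
```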
Combining multi-branch architecture and extra branch loss
modules, FM-CNN has another advantage: model averaging. If we ignore the global loss module, each branch with its branch loss module is a complete CNN. So the whole FM-CNN amounts to averaging over the feature maps learned by the last convolutional layers of the three branches. Although GoogLeNet also uses local loss modules, it does not have such a model-averaging advantage. Besides, compared with general model averaging algorithms, which weight the final outputs of every model, FM-CNN weights the pixels in the middle-layer feature maps of each branch, which gives FM-CNN more flexibility to decide which pixels are important.
C. Parameter Update
FM-CNN is both deep and broad. Training such a large
CNN needs a huge number of labeled images. If the training
set is small, over-fitting is likely to occur. Both MVVTR dataset and CompCars dataset are not large enough. The former is small in terms of the total image amount, and the latter is small in terms of the average per-category image amount (only 37 in the training set used in [14]).
Fig. 5. Images in our MVVTR dataset. Each row corresponds to a vehicle type. Multiple viewpoints is the key challenge in this dataset.
Therefore, we
can’t train FM-CNN from scratch. Instead, we reuse the pre-
trained parameters of AlexNet on the ILSVRC2012 dataset.
Except for the newly introduced layers (the conv_sq layer, the fc8_vtr layer and the layers in the three branch loss modules), all the parameters in the remaining layers are initialized from their counterparts in AlexNet. Parameters should be updated
since the image contents in the ILSVRC2012 dataset differ greatly from those in the two vehicle datasets, and the input image sizes of FM-CNN are not all the same as the original one of
AlexNet. So, features learned from the ILSVRC2012 dataset
can’t be applied directly for VTR. We update the parameters in
higher convolutional layers and higher fully-connected layers
to use both new local and global information from new vehicle
datasets.
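This initialization can be sketched as copying each pre-trained AlexNet tensor into every branch layer that corresponds to it. The name mapping below is hypothetical and would have to match the actual layer names of the implementation, and `pretrained` stands for a dictionary of AlexNet tensors already converted from the Caffe model.

```python
import torch

def init_from_alexnet(fm_cnn, pretrained):
    """Copy pre-trained AlexNet tensors into the corresponding FM-CNN layers.

    `pretrained` maps AlexNet layer names (e.g. "conv1.weight") to tensors; each branch
    layer conv1_l / conv1_m / conv1_s is initialized from the same conv1 kernels.
    The names and suffixes here are illustrative, not the authors' exact naming.
    """
    own = fm_cnn.state_dict()
    with torch.no_grad():
        for name, tensor in pretrained.items():
            base, _, param = name.rpartition(".")          # e.g. ("conv1", ".", "weight")
            for suffix in ("_l", "_m", "_s", ""):          # three branches plus shared fc layers
                key = f"{base}{suffix}.{param}"
                if key in own and own[key].shape == tensor.shape:
                    own[key].copy_(tensor)
    fm_cnn.load_state_dict(own)
```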
The parameter updating of higher convolutional layers is
inspired by the analogy between CNNs and human visual
systems. Human visual systems have a similar hierarchical
layer architecture and neurons in higher regions respond to
more semantic visual patterns. Neurons in the V1 cortex
respond to edges with various orientations [33]. Neurons
in the V2 cortex respond to a little more complex spatial
structures [34]. And neurons in the V4 cortex respond to more
invariant semantic features [35]. The responses to hierarchical
visual patterns are also found in CNNs by visualizing the
learned features of convolutional kernels [36]. The kernels
in the first convolutional layer are also interested in edges.
And those in higher convolutional layers are interested in
more semantic features like object parts or contours. Kernels
in shallow convolutional layers learn low-level and general
features. And kernels in higher convolutional layers learn high-
level and dataset-specific features. Therefore, with a new small
dataset, the new information should be used for updating the
parameters in higher convolutional layers. In this way, the
selectivity for more semantic visual patterns of these kernels
are updated. Otherwise, retraining shallow convolutional layers may change little about the kernels' selectivity for edges but increases the possibility of over-fitting.
Convolutional layers usually only learn local features. In
order to utilize the global information in new datasets, we also
update the parameters in the second fully-connected layer. The
experiment results in the next section validate the effectiveness
of our parameter updating method.
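In practice this amounts to freezing the reused layers and leaving gradients enabled only for the higher layers (conv4, conv5 and fc7, i.e. scheme A in Section IV-E) and the newly added layers. A sketch follows; the layer-name substrings are assumptions about how the implementation names its layers.

```python
UPDATED = ("conv4", "conv5", "fc7", "conv_sq", "fc8_vtr", "branch")  # assumed name substrings

def configure_updates(fm_cnn, updated=UPDATED):
    """Enable gradients only for the layers whose parameters are updated (scheme A plus new layers)."""
    for name, param in fm_cnn.named_parameters():
        param.requires_grad = any(key in name for key in updated)
```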
IV. EXPERIMENT RESULTS AND ANALYSIS
A. MVVTR dataset
There exist many datasets for VTR problems [8]–[10], [14],
[15], [25], [37], [38]. But none of them applies to the multi-
view coarse-grained case. So we create our own MVVTR
dataset. This dataset contains 7570 web-nature images in total. We collect these images from the Internet and manually label each image. The 7 major coarse-grained vehicle types and
the image amounts of each type are: MPV (1047), sedan (1217), SUV (1152), sports car (777), minicar (901), bus (1316) and truck (1163), around 1000 images for each category.
Fig. 6. Images in CompCars dataset. Each row corresponds to a vehicle model. Similarity among models is the key challenge in this dataset.
TABLE I
COMPARISONS ON MVVTR DATASET

model                                           Top-1 Accuracy
AlexNet [17]                                    0.817
NIN [18]                                        0.864
GoogLeNet [16]                                  0.888
VGG16 [19]                                      0.841
VGG19 [19]                                      0.830
ResNet50 [20]                                   0.907
ResNet101 [20]                                  0.910
ResNet152 [20]                                  0.913
FM-CNN (central crop)  parameter update         0.903
                       +multi-branch            0.926
                       +feedback enhancement    0.942
FM-CNN (10 crops)      parameter update         0.912
                       +multi-branch            0.936
                       +feedback enhancement    0.949
TABLE II
COMPARISONS ON COMPCARS DATASET

model                                           Top-1 Accuracy    Top-5 Accuracy
AlexNet [17]                                    0.425             0.643
AlexNet [15]                                    0.848             0.954
NIN [18]                                        0.535             0.730
GoogLeNet [16]                                  0.508             0.720
GoogLeNet [14]                                  0.767             0.917
VGG16 [19]                                      0.505             0.712
VGG19 [19]                                      0.485             0.707
ResNet50 [20]                                   0.688             0.852
ResNet101 [20]                                  0.677             0.842
ResNet152 [20]                                  0.693             0.853
FM-CNN (central crop)  parameter update         0.792             0.915
                       +multi-branch            0.857             0.951
                       +feedback enhancement    0.892             0.971
FM-CNN (10 crops)      parameter update         0.832             0.935
                       +multi-branch            0.865             0.961
                       +feedback enhancement    0.910             0.978
Most
images in MVVTR dataset contain only one vehicle. All the
images are taken from unconstrained viewpoints. Multiple
viewpoints is the key challenge in MVVTR dataset. In ad-
dition, backgrounds of images in MVVTR dataset are more
diverse than those taken by fixed surveillance cameras. Sample
images in MVVTR dataset are shown in Fig. 5. For multi-
view fine-grained VTR problems, CompCars dataset is used.
Sample images are shown in Fig. 6.
B. Experiment Settings
AlexNet and FM-CNN are both implemented in Caffe [39], and the pre-trained parameters of AlexNet on the ILSVRC2012 dataset are downloaded from the Model Zoo2. We update parameters with Adam [40], a variant of the stochastic gradient descent optimizer. In FM-CNN, the three branch losses contribute to the total training loss, each with weight 0.5. The initial learning rate is 0.001 and we multiply it by 0.1 at a deliberately chosen step size.
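These optimizer settings could look as follows in a PyTorch-style sketch; `fm_cnn` is assumed to be the model configured by the earlier sketches, and the step size is a placeholder for the "deliberately chosen" value, which the paper does not state.

```python
import torch

# Only the parameters marked for update are optimized.
trainable = [p for p in fm_cnn.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)                               # initial learning rate 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
# gamma=0.1 drops the learning rate to 0.1x; step_size=20 (epochs) is a placeholder.
```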
C. Coarse-grained Classification
The experiments for coarse-grained multi-view VTR are
based on our MVVTR dataset. For each vehicle type, we
randomly select 70% images for training and the rest for
testing. So, there are 5302 training images and 2268 testing images in total. The only preprocessing is mean reduction.
2https://github.com/BVLC/caffe/wiki/Model-Zoo
During training, dataset augmentation includes random cropping, mirroring, gamma adjustment with γ = 1.2 and γ = 0.8, and rotation with θ = ±5°. During testing, we test FM-CNN with the central crop and with the average of 10 crops (upper left, upper right, lower left, lower right and center, plus their mirrored versions).
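A hedged torchvision-style sketch of this augmentation and of the 10-crop test protocol is given below. The resize-before-crop size (256) and the exact way gamma and rotation are sampled are assumptions, and the crop size shown is the medium branch's 227 (the same recipe would be applied at each branch scale).

```python
import random
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

train_tf = T.Compose([
    T.Resize([256, 256]),                                           # resize-before-crop size is an assumption
    T.RandomCrop(227),                                              # random cropping (227 = medium branch)
    T.RandomHorizontalFlip(),                                       # mirroring
    T.Lambda(lambda im: TF.adjust_gamma(im, random.choice([0.8, 1.0, 1.2]))),  # gamma 1.2 / 0.8
    T.RandomRotation(5),                                            # rotation, theta = +/- 5 degrees
    T.ToTensor(),
])

test_tf_10crop = T.Compose([
    T.Resize([256, 256]),
    T.TenCrop(227),                                                 # 4 corners + center, plus their mirrors
    T.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])
# At test time the 10 crops are passed through the network and their class probabilities
# are averaged; mean reduction (not shown) is applied to every crop.
```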
The Top-1 results on our MVVTR dataset are shown in
TABLE I. As mentioned in Section II, existing methods for
coarse-grained VTR do not consider the multi-view challenge.
So we only compare the performance with other CNNs that are currently the most popular for classification. These CNNs are
used as feature extractors and classification is done by SVM.
Among these CNN algorithms, ResNet152 [20] achieves the
best performance with 91.3% Top-1 accuracy. For FM-CNN,
“parameter update” means a single branch (i.e. AlexNet) with
parameters in the conv4, conv5 and fc7 layers updated. It gets
91.2% Top-1 accuracy with 10 crops, only 0.1% worse than
ResNet152. Considering that ResNet152 is 146 layers deeper than the single branch, this result demonstrates that our parameter
update method is effective. “+multi-branch” means the param-
eter update method is kept and we extend the network to three
branches without branch losses. In this situation, parameters
in the conv2_s, conv5_m, conv5_l and fc7 layers are updated
with one global loss, and 93.6% Top-1 accuracy is achieved
with 10 crops. The result is better than the single branch
by 2.4%, demonstrating the effectiveness of the multi-branch
architecture. Finally, “+feedback enhancement” means we add
3 extra branch losses in FM-CNN to enhance the feedback to
each branch. These branch losses have weights of all 0.5 in
the total training loss. With four losses, FM-CNN gets 94.9%
Top-1 accuracy with 10 crops, better than the single branch by 3.7% and better than itself with a single global loss by 1.3%. The results in TABLE I demonstrate the contributions of the three techniques
of FM-CNN (parameter update, multi-branch, and feedback
enhancement) to coarse-grained VTR problems.
D. Fine-grained Classification
The experiments for fine-grained multi-view VTR problems
are based on CompCars dataset [14]. The split of the training
and testing set is the same as that in [14]. There are 16016
training images and 14939 testing images. The only preprocessing method is mean reduction. Dataset augmentation is the same as that mentioned above. We still test FM-CNN with the central crop and 10 crops here.
The Top-1 and Top-5 results on CompCars dataset are
shown in TABLE II. The performance of GoogLeNet [14] and AlexNet [15] is cited from their papers. Other CNNs are
still used as feature extractors and combined with SVM.
The baseline model in [14] is GoogLeNet fine-tuned with
76.7% Top-1 and 91.7% Top-5 accuracies. Sochor et al.
retrain AlexNet on CompCars dataset [15] with complicated
preprocessing procedures to generate 3D boxes as input, which
requires camera parameters. They improve the performance
to 84.8% Top-1 and 95.4% Top-5 accuracies. Performance
of other CNNs is not competitive on CompCars dataset. For
FM-CNN, “parameter update” means a single branch (i.e.
AlexNet) with parameters in the conv4, conv5 and fc7 layers
updated. It gets 83.2% Top-1 and 93.5% Top-5 accuracies
with 10 crops, which is better than the baseline GoogLeNet
by 6.5% and 1.8%. Although GoogLeNet is much deeper,
its performance is worse than the single branch with part
of parameters updated. This fact shows that our parameter
update method is efficient for the fine-grained case. “+multi-
branch” means extending the network to 3 branches where
parameters in the conv2_s, conv5_m, conv5_l and fc7 are
updated with a single global loss. The performance in this
situation is 86.5% Top-1 and 96.1% Top-5 accuracies with 10
crops and is already better than the state-of-the-art method by
1.7% and 0.7%, demonstrating the contribution of the multi-
branch architecture. In “+feedback enhancement” situation,
we add 3 extra branch losses. FM-CNN gets performance
improvement and achieves 91.0% Top-1 and 97.8% Top-5
accuracies with 10 crops, outperforming itself with one global
loss by 4.5% and 1.7%. These results in TABLE II demonstrate
the effectiveness of FM-CNN for fine-grained VTR problems.
E. Parameter Update
In order to study which parameters should be updated, we
test 4 update schemes of AlexNet to verify the effectiveness
of the proposed parameter updating method, on both MVVTR
dataset and CompCars dataset. Only the central crop is used
for testing.
The results are displayed in TABLE III and TABLE IV, where "on" means the parameters in this layer are updated and "\" means they are not. We find that updating all parameters in AlexNet on both
datasets can’t reach the best performance (scheme D) and the
best results are achieved by scheme A: update the parameters
in the conv4, conv5 and fc7 layers, which is consistent with our
analysis about similarities between CNNs and human vision
systems. Scheme A achieves the best 90.3% Top-1 accuracy
on MVVTR dataset and the best 79.2% Top-1 and 91.5%
Top-5 accuracies on CompCars dataset. The latter are already
better than the baseline model in [14]. These results in TABLE
III and TABLE IV demonstrate the efficiency of our update
method in two aspects:
• Updating parameters in higher convolutional layers is more important. Compared with scheme A, scheme C updates the parameters in the lowest three convolutional layers. It is obvious that scheme C shows worse performance than scheme A, demonstrating that updating parameters in higher convolutional layers is more efficient.
• Updating parameters in higher fully-connected layers is more efficient. From the difference between scheme A and scheme B, we can see that scheme A does not update the parameters in the fc6 layer but scheme B does. It can be seen from TABLE III and TABLE IV that scheme B shows worse performance with even more complexity; therefore it is more efficient to update only the higher fully-connected layer.
V. CONCLUSION
In this paper, we try to solve the problem of multiple viewpoints in both coarse-grained and fine-grained VTR with the proposed FM-CNN. FM-CNN takes advantage of a multi-branch architecture and enhanced feedback from branch losses. It achieves improved performance on MVVTR dataset and CompCars dataset without the limitations of viewpoints and cameras. FM-CNN takes 128 ms to handle one image on a Tesla K80 GPU. If we make use of the parallelism inside a GPU, the complexity can be further reduced to support real-time applications.
In the future, we may consider combining traditional fea-
tures as well as complex preprocessing steps such as segmen-
tation or vehicle part detection with our FM-CNN and take
into account more challenges in real surveillance applications
like occlusion.
REFERENCES
[1] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale
convolutional networks,” in Neural Networks (IJCNN), The 2011 Inter-
national Joint Conference on. IEEE, 2011, pp. 2809–2813.
[2] R. Qian, Y. Yue, F. Coenen, and B. Zhang, “Traffic sign recognition
with convolutional neural network based on max pooling positions,” in
Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-
FSKD), 2016 12th International Conference on. IEEE, 2016, pp. 578–
582.
[3] Y. Yuan, J. Wan, and Q. Wang, “Congested scene classification via
efficient unsupervised feature learning and density estimation,Pattern
Recognition, vol. 56, pp. 159–169, 2016.
[4] Y. Yuan, Z. Xiong, and Q. Wang, “An incremental framework for
video-based traffic sign detection, tracking, and recognition,IEEE
Transactions on Intelligent Transportation Systems, vol. 18, no. 7, pp.
1918–1929, 2017.
[5] A. H. Lai, G. S. Fung, and N. H. Yung, “Vehicle type classification
from visual-based dimension estimation,” in Intelligent Transportation
Systems, 2001. Proceedings. 2001 IEEE. IEEE, 2001, pp. 201–206.
[6] R. P. Avery, Y. Wang, and G. S. Rutherford, “Length-based vehicle clas-
sification using images from uncalibrated video cameras,” in Intelligent
Transportation Systems, 2004. Proceedings. The 7th International IEEE
Conference on. IEEE, 2004, pp. 737–742.
TABLE III
PERFORMANCE OF PARAMETER UPDATE METHODS ON MVVTR DATASET

scheme     conv1  conv2  conv3  conv4  conv5  fc6  fc7   Top-1
scheme A   \      \      \      on     on     \    on    0.903
scheme B   \      \      \      on     on     on   on    0.894
scheme C   on     on     on     \      \      \    on    0.860
scheme D   on     on     on     on     on     on   on    0.863

TABLE IV
PERFORMANCE OF PARAMETER UPDATE METHODS ON COMPCARS DATASET

scheme     conv1  conv2  conv3  conv4  conv5  fc6  fc7   Top-1   Top-5
scheme A   \      \      \      on     on     \    on    0.792   0.915
scheme B   \      \      \      on     on     on   on    0.717   0.856
scheme C   on     on     on     \      \      \    on    0.583   0.756
scheme D   on     on     on     on     on     on   on    0.613   0.760
[7] G. Zhang, R. Avery, and Y. Wang, “Video-based vehicle detection
and classification system for real-time traffic data collection using
uncalibrated video cameras,” Transportation Research Record: Journal
of the Transportation Research Board, no. 1993, pp. 138–147, 2007.
[8] Z. Dong, M. Pei, Y. He, T. Liu, Y. Dong, and Y. Jia, “Vehicle
type classification using unsupervised convolutional neural network,
in Pattern Recognition (ICPR), 2014 22nd International Conference on.
IEEE, 2014, pp. 172–177.
[9] Z. Dong, Y. Wu, M. Pei, and Y. Jia, “Vehicle type classification using
a semisupervised convolutional neural network,IEEE Transactions on
Intelligent Transportation Systems, vol. 16, no. 4, pp. 2247–2256, 2015.
[10] V. S. Petrovic and T. F. Cootes, “Analysis of features for rigid structure
vehicle type recognition.” in BMVC, 2004, pp. 1–10.
[11] U. Iqbal, S. Zamir, M. Shahid, K. Parwaiz, M. Yasin, and M. Sarfraz,
“Image based vehicle type identification,” in Information and Emerging
Technologies (ICIET), 2010 International Conference on. IEEE, 2010,
pp. 1–5.
[12] A. Psyllos, C.-N. Anagnostopoulos, and E. Kayafas, “Vehicle model
recognition from frontal view image measurements,Computer Stan-
dards & Interfaces, vol. 33, no. 2, pp. 142–151, 2011.
[13] B. Zhang, “Reliable classification of vehicle types based on cascade
classifier ensembles,” IEEE Transactions on Intelligent Transportation
Systems, vol. 14, no. 1, pp. 322–332, 2013.
[14] L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset
for fine-grained categorization and verification,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 3973–3981.
[15] J. Sochor, A. Herout, and J. Havel, “Boxcars: 3d boxes as cnn input
for improved fine-grained vehicle recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2016,
pp. 3006–3015.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[18] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint
arXiv:1312.4400, 2013.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” arXiv preprint arXiv:1512.03385, 2015.
[21] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le-
Cun, “Overfeat: Integrated recognition, localization and detection using
convolutional networks,arXiv preprint arXiv:1312.6229, 2013.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[23] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 1440–1448.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[25] M. S. Sarfraz, A. Saeed, M. H. Khan, and Z. Riaz, “Bayesian prior
models for vehicle make and model recognition,” in Proceedings of the
7th International Conference on Frontiers of Information Technology.
ACM, 2009, p. 35.
[26] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp.
886–893.
[27] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Computer vision, 1999. The proceedings of the seventh IEEE interna-
tional conference on, vol. 2. Ieee, 1999, pp. 1150–1157.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in International
Conference on Machine Learning, 2015, pp. 448–456.
[30] C. Farabet, C. Couprie, L. Najman, and Y. Lecun, “Scene parsing with
multiscale feature learning, purity trees, and optimal covers,Eprint
Arxiv, 2012.
[31] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a
single image using a multi-scale deep network,” Computer Science, pp.
2366–2374, 2014.
[32] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic
labels with a common multi-scale convolutional architecture,” pp. 2650–
2658, 2015.
[33] D. McLaughlin, R. Shapley, M. Shelley, and D. J. Wielaard, “A neuronal
network model of macaque primary visual cortex (v1): Orientation
selectivity and dynamics in the input layer 4cα,” Proceedings of the
National Academy of Sciences, vol. 97, no. 14, pp. 8087–8092, 2000.
[34] L. Liu, L. She, M. Chen, T. Liu, H. D. Lu, Y. Dan, and M.-m. Poo,
“Spatial structure of neuronal receptive field in awake monkey secondary
visual cortex (v2),” Proceedings of the National Academy of Sciences,
vol. 113, no. 7, pp. 1913–1918, 2016.
[35] T. O. Sharpee, “How invariant feature selectivity is achieved in cortex,
Frontiers in Synaptic Neuroscience, vol. 8, 2016.
[36] M. D. Zeiler and R. Fergus, “Visualizing and understanding con-
volutional networks,” in European Conference on Computer Vision.
Springer, 2014, pp. 818–833.
[37] C. Caraffi, T. Vojíř, J. Trefný, J. Šochman, and J. Matas, "A system for real-time detection and tracking of vehicles from a single car-mounted camera," in 2012 15th International IEEE Conference on Intelligent Transportation Systems. IEEE, 2012, pp. 975–982.
[38] H. Huttunen, F. S. Yancheshmeh, and K. Chen, “Car type recognition
with deep neural networks,” arXiv preprint arXiv:1602.07125, 2016.
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia. ACM, 2014, pp. 675–678.
[40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,
arXiv preprint arXiv:1412.6980, 2014.
Zhibo Chen (M'01-SM'11) received the B.Sc. and Ph.D. degrees from the Department of Electrical Engineering, Tsinghua University, in 1998 and 2003, respectively. He is now a professor at the University of Science and Technology of China. Before that, he worked at SONY and Thomson from 2003 to 2012, where he was a principal scientist and research manager in the Thomson Research & Innovation Department.
His research interests include image and video
compression, visual quality of experience assess-
ment, immersive media computing and intelligent media computing. He has
more than 50 granted and over 100 filed EU and US patent applications,
more than 70 publications and standard proposals. He is IEEE senior member,
member of IEEE Visual Signal Processing and Communications Committee,
and member of IEEE Multimedia Communication Committee. He was orga-
nization committee member of ICIP 2017 and ICME 2013, served as TPC
member in IEEE ISCAS and IEEE VCIP.
Chenlu Ying received the B.S. degree in electronic
engineering from University of Science and Tech-
nology of China (USTC), Hefei, China, in 2015. He
is currently pursuing the M.S. degree at USTC. His research interests include deep learning, vehicle type recognition, computer vision and video coding.
Chaoyi Lin received the B.S. degree from the HeFei
University of Technology, Hefei, China, in 2015. He
is currently pursuing the M.S. degree in University
of Science and Technology of China, Hefei, China.
His research interests include image processing, per-
ceptual modeling and computer vision.
Sen Liu received the B.S. degree in computer
science from the Beijing University of Posts and
Telecommunications, Beijing, China, in 2013. Cur-
rently, he is working towards the Ph.D. degree
at School of Information Science and Technology,
University of Science and Technology of China. His
area of interests includes artificial intelligence, deep
learning, video coding, computer vision and pattern
recognition and reinforcement learning.
Weiping Li (S’84-M’87-SM’97-F’00) received the
B.S. degree in electrical engineering from University
of Science and Technology of China (USTC), Hefei,
China, in 1982, and the M.S. and Ph.D. degrees
in electrical engineering from Stanford University,
Stanford, CA, USA, in 1983 and 1988, respectively.
In 1987, he joined Lehigh University, Bethlehem,
PA, USA, as an Assistant Professor with the De-
partment of Electrical Engineering and Computer
Science. In 1993, he was promoted to Associate
Professor with tenure. In 1998, he was promoted to
Full Professor. From 1998 to 2010, he worked in several high-tech companies
in the Silicon Valley (1998-2000, Optivision, Palo Alto; 2000-2002, Webcast
Technologies, Mountain View; 2002-2008, Amity Systems, Milpitas, 2008-
2010, Bada Networks, Santa Clara; all in California, USA). In March 2010, he
returned to USTC and is currently a Professor with the School of Information
Science and Technology.
Dr. Li had served as the Editor-in-Chief of IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and Guest Editor
of the PROCEEDINGS OF THE IEEE. He was the Chair of several Technical
Committees in the IEEE Circuits and Systems Society and IEEE International
Conferences, and the Chair of the Best Student Paper Award Committee
for SPIE Visual Communications and Image Processing Conference. He has
made many contributions to international standards. His inventions on fine
granularity scalable video coding and shape adaptive wavelet coding have
been included in the MPEG-4 international standard. He served as a Member
of the Moving Picture Experts Group (MPEG) of the International Standard
Organization (ISO) and an Editor of MPEG-4 international standard. He
served as a Founding Member of the Board of Directors of MPEG-4 Industry
Forum. As a Technical Advisor, he also made contributions to the Chinese
audio video coding standard and its applications. He was the recipient of the
Certificate of Appreciation from ISO/IEC as a Project Editor in development
of an international standard in 2004, the Spira Award for Excellence in
Teaching in 1992 at Lehigh University, and the first Guo Mo-Ruo Prize for
Outstanding Student in 1980 at USTC.
Video-based traffic sign detection, tracking, and recognition is one of the important components for the intelligent transport systems. Extensive research has shown that pretty good performance can be obtained on public data sets by various state-of-the-art approaches, especially the deep learning methods. However, deep learning methods require extensive computing resources. In addition, these approaches mostly concentrate on single image detection and recognition task, which is not applicable in real-world applications. Different from previous research, we introduce a unified incremental computational framework for traffic sign detection, tracking, and recognition task using the mono-camera mounted on a moving vehicle under non-stationary environments. The main contributions of this paper are threefold: 1) to enhance detection performance by utilizing the contextual information, this paper innovatively utilizes the spatial distribution prior of the traffic signs; 2) to improve the tracking performance and localization accuracy under non-stationary environments, a new efficient incremental framework containing off-line detector, online detector, and motion model predictor together is designed for traffic sign detection and tracking simultaneously; and 3) to get a more stable classification output, a scale-based intra-frame fusion method is proposed. We evaluate our method on two public data sets and the performance has shown that the proposed system can obtain results comparable with the deep learning method with less computing resource in a near-real-time manner.