Multi-view Vehicle Type Recognition with
Feedback-enhancement Multi-branch CNNs
Zhibo Chen, Senior Member, IEEE, Chenlu Ying, Chaoyi Lin, Sen Liu, Weiping Li, Fellow, IEEE
Abstract—Vehicle type recognition (VTR) is a common requirement
and a key challenge in real surveillance scenarios such as
intelligent traffic and unmanned driving. Coarse-grained and
fine-grained VTR are usually applied in different applications,
and the challenge of multiple viewpoints is critical in both cases.
In this paper, we propose a Feedback-enhancement Multi-branch
CNN (FM-CNN) to address this challenge in both cases. The
proposed FM-CNN takes three derivatives of an image as input
and leverages the advantages of hierarchical details, feedback
enhancement, model averaging and stronger robustness to
translation and mirroring. A single global cross-entropy loss is
insufficient to train such a complex CNN, so we add extra branch
losses to enhance the feedback to each branch. While reusing
pre-trained parameters, we propose a novel parameter update
method to adapt FM-CNN to task-specific local visual patterns
and global information in new datasets. To test the effectiveness
of FM-CNN, we create our own Multi-view VTR (MVVTR) dataset,
since no such dataset is available. For fine-grained VTR, we use
the CompCars dataset. Compared with state-of-the-art classification
solutions without special preprocessing, the proposed FM-CNN
demonstrates better performance in both coarse-grained and
fine-grained scenarios. For coarse-grained VTR, it achieves 94.9%
Top-1 accuracy on the MVVTR dataset. For fine-grained VTR, it
achieves 91.0% Top-1 and 97.8% Top-5 accuracy on the CompCars
dataset.
Index Terms—VTR, multi-view, feedback-enhancement, multi-branch, CNN.
I. INTRODUCTION
SURVEILLANCE-RELATED tasks are hot topics in the computer
vision domain [1]–[4]. Among these tasks, vehicle type
recognition (VTR) is a key challenge in surveillance data
analysis and has many applications. For example, it can help
to calculate road tolls, which depend on vehicle sizes,
automatically and without human involvement. In unmanned
vehicles, VTR can help to determine the safe distances to
keep from neighboring vehicles. These two applications focus
mainly on vehicle sizes, which are coarse-grained. Besides,
when it comes to searching for a vehicle in surveillance
videos, methods based on the number plate usually cannot work,
since the number plate may be either fake or invisible. In such
situations, VTR can filter out vehicles of unrelated types and
reduce the burden of human searching. Such an application needs
not only coarse-grained information but also fine-grained
information such as vehicle models.

Zhibo Chen, Chenlu Ying, Chaoyi Lin, Sen Liu and Weiping Li are
with the University of Science and Technology of China, Hefei, Anhui,
230026, China (e-mail: chenzhibo@ustc.edu.cn, ying1992@mail.ustc.edu.cn,
lcy1993@mail.ustc.edu.cn, elsen@iat.ustc.edu.cn, wpli@ustc.edu.cn).
This work was supported in part by the National Key Research and
Development Program of China under Grant No. 2016YFC0801001, the
National Program on Key Basic Research Projects (973 Program) under Grant
2015CB351803, NSFC under Grants 61571413, 61632001 and 61390514, and
Intel ICRI MNC.
(a) Multiple viewpoints
(b) Invisible number plates
Fig. 1. Challenges in our MVVTR dataset. Images are taken from various
viewpoints. Number plates are not shown in most images. These two cases are
common in real scenarios but have not been considered by existing coarse-
grained VTR solutions.
Accordingly, research on VTR can be classified into two
categories. The first category focuses on coarse-grained
classification and differentiates vehicles mainly by their
sizes, e.g., large, medium or small [5]–[9]. Vehicles in these
works can be classified into bus, truck, sedan, minicar and so
on. Although these works have achieved good performance, the
challenge of multiple viewpoints does not seem to have been
well studied. In contrast, vehicle images taken from various
viewpoints are very common in real scenarios, so existing
coarse-grained VTR solutions are limited in real applications.
In this paper, we try to solve this multi-view coarse-grained
VTR problem with our proposed FM-CNN. We also create the MVVTR
dataset for this problem. The second category focuses on
fine-grained classification. In this case, vehicles are
classified mainly by detailed information such as their makes
and/or models [10]–[15]. The CompCars dataset [14] was created
especially for fine-grained car classification. It contains 431
vehicle models and 30955 web-nature images from various
viewpoints. The baseline performance with GoogLeNet [16] in [14]
is not good. The result in [15] achieves the state-of-the-art
performance with AlexNet [17] on this dataset. In this paper,
we also verify the proposed FM-CNN on the CompCars dataset for
fine-grained VTR. Compared with many existing methods, our
FM-CNN handles the multi-view challenge with better performance
and does not require specific viewpoints or specific camera
parameters.
Recently, convolutional neural networks (CNNs) have
achieved excellent performance in computer vision tasks like
classification [16]–[20] and detection [21]–[24]. In this paper,
we also utilize the powerful feature extraction and learning
capability of CNNs and design FM-CNN, inspired by AlexNet
[17]. FM-CNN is both deep and broad. Training such a large CNN
with only one global cross-entropy loss would lead to poor
convergence. Therefore, we add local branch loss modules to
enhance the feedback to each branch. In addition, training such
a large CNN from scratch on the MVVTR dataset or the CompCars
dataset would cause over-fitting. Therefore, FM-CNN reuses
the parameters of AlexNet pre-trained on the ILSVRC2012
dataset1. Considering the analogy between CNNs and human
visual systems and the fact that neurons in higher cortices
are sensitive to more semantic visual features, we propose
a novel parameter update method to both adapt high-level
convolutional kernels to task-specific local patterns and make
fully-connected layers reorganize global information according
to new datasets. With the feedback-enhancement branch losses
and the update method, FM-CNN achieves excellent results on
the MVVTR dataset and outperforms state-of-the-art solutions
on the CompCars dataset.
Our contributions are as follows:
1) We design FM-CNN, which has three major characteristics.
The first characteristic is multiple branches: FM-CNN has
three convolutional branches that take images of three
scales as input. Multi-scale inputs expose hierarchical
details of images, and more branches also mean stronger
robustness to translation and mirroring with training set
augmentation. The second characteristic is feedback
enhancement. Training the multiple branches with only one
global cross-entropy loss may be intractable because of
latent interference from multi-scale feature maps, so we
add a local loss module after each branch to enhance the
feedback. Every branch is an independent CNN with its own
branch loss, so the whole FM-CNN also takes advantage of
model averaging with this feedback-enhancement scheme. The
third characteristic is a parameter update method that
renews the parameters in higher convolutional layers and
higher fully-connected layers. This update method is
inspired by the similarity between the hierarchical layer
architectures of CNNs and human visual systems, and the
fact that neurons in higher visual cortices such as V4
respond to more semantic visual patterns. Updating the
parameters in higher convolutional layers changes the
selectivity of the convolutional kernels to new
task-specific visual patterns, and updating the parameters
in higher fully-connected layers reorganizes global
information and avoids over-fitting.
2) We create the MVVTR dataset for multi-view coarse-grained
VTR problems. This dataset contains 7570 web-nature images
and 7 major vehicle types: bus, minicar, MPV, sedan, sports
car, SUV and truck. Each type has nearly 1000 images. All
the images display vehicles from various viewpoints
(Fig. 1a) and many of them do not show number plates
(Fig. 1b). The multi-view attribute of the MVVTR dataset
invalidates nearly all existing solutions for coarse-grained
VTR.
3) With the extra feedback-enhancement branch losses and the
novel parameter update method, our FM-CNN achieves 94.9%
Top-1 accuracy on the MVVTR dataset for coarse-grained
multi-view VTR, and 91.0% Top-1 and 97.8% Top-5 accuracy on
the CompCars dataset for fine-grained multi-view VTR. The
performance on CompCars outperforms the state-of-the-art
method [15] by 6.2% and 2.4%. The results show that our
FM-CNN is effective and efficient for general VTR problems
regardless of viewpoints and granularities.

1 http://www.image-net.org/challenges/LSVRC/2012/
The remainder of this paper is organized as follows: Section
II reviews works related to VTR and CNNs. Section III
introduces the architecture and advantages of our FM-CNN.
Section IV introduces our MVVTR dataset and experiment settings,
as well as analysis of the experimental results on the MVVTR
dataset and the CompCars dataset. Section V concludes this
paper. The MVVTR dataset and the model and weights of our
FM-CNN will be shared on our website:
http://staff.ustc.edu.cn/~chenzhibo/resources.html.
Fig. 2. Flowchart of FM-CNN. Images are resized to three scales and then fed to the convolutional branches. The feature maps are concatenated in the conv_sq layer and the last fully-connected layer gives the model probabilities.
II. RELATED WORK
A. Vehicle type recognition
VTR problems have already been studied for nearly 20
years. All existing methods can be summarized into three
categories.
The first category is appearance-based and is suitable for both
coarse-grained and fine-grained VTR. These methods first
localize number plates as a reference and then extract
hand-crafted features from vehicle fronts, especially from
regions around lamps and number plates [10]–[13], [25].
Appearance-based methods with powerful visual features such as
the histogram of oriented gradients (HOG) [26] and the
scale-invariant feature transform (SIFT) [27] have achieved
good performance and overcome many challenges such as location
and luminance variations. Distinct features also help to
recognize vehicle types at a fine-grained level. However, an
obvious drawback of these methods is that they can only handle
front images of vehicles, because they depend on number plate
localization and distinct feature extraction. We call this
drawback the limitation of viewpoints. Because of this
limitation, appearance-based methods cannot be widely used in
real scenarios, where multi-view vehicle images are common.
The second category is geometry-based and is only suitable for
coarse-grained VTR. These methods first separate vehicles from
backgrounds. Then, by applying the transformation from image
coordinates to real-world coordinates, geometry parameters of
vehicles, e.g., length, width and height, are estimated to
determine their rough types [5]–[7]. Geometry-based methods
cannot determine vehicles' fine-grained types, since vehicles of
different types may have similar geometry parameters. Besides,
geometry-based methods need camera parameters to apply the
coordinate transformation, and these parameters are usually hard
to obtain in real scenarios. We call this drawback the
limitation of camera parameters. Because of this limitation,
geometry-based methods are all camera-specific and need to
adjust their parameters for every camera, which limits their
application in real scenarios.
The third category is learning-based with CNNs and is suitable
for both coarse-grained and fine-grained VTR. For coarse-grained
VTR problems, works with CNNs succeed in recognizing vehicles of
several types in images that contain more than one vehicle [8],
[9]. But the images under study are only taken from one or two
specific camera viewpoints, without considering the challenge of
multiple viewpoints, and the CNN frameworks used in these works
are too simple to learn powerful features from vehicles of
multiple viewpoints. For fine-grained VTR problems, [14] offers
a large car dataset named the CompCars dataset and gives the
baseline performance with GoogLeNet [16]. The baseline
performance is low, with 76.7% Top-1 accuracy and 91.7% Top-5
accuracy. The method in [15] achieves the state-of-the-art
performance on the CompCars dataset with 84.8% Top-1 accuracy
and 95.4% Top-5 accuracy. But this method has several drawbacks:
first, it needs three faces of a vehicle to be visible in the
image (the limitation of viewpoints); second, it requires camera
parameters to evaluate the bounding boxes (the limitation of
cameras); third, its bounding box detection preprocessing is
complicated, compared with the mean reduction preprocessing in
many CNN works [16], [17], [19], [20].
B. Convolutional neural networks
CNNs are layered models that decompose complex non-linear
mappings into combinations of many simple non-linear functions.
The first named CNN for classification is LeNet [28]. LeNet was
successful in recognizing handwritten digits and demonstrated
the effectiveness and efficiency of back-propagation training.
For a long time after LeNet, research on CNNs did not make much
progress due to hardware limitations. In 2012, Krizhevsky et al.
proposed a CNN that achieved good performance on the
1000-category ILSVRC2012 dataset [17]. This CNN, called AlexNet,
demonstrated the effectiveness of many techniques such as
dropout, ReLU activations and data augmentation. Since AlexNet,
research on CNNs for object classification has proceeded
quickly. For example, there are VGG [19], which uses fixed 3×3
convolutional kernels, GoogLeNet [16], which combines
multi-scale convolutional kernels into Inception modules, and
ResNet [20], which contains skip connections to learn residuals.
CNNs for classification usually have the following layers:
convolutional layers to learn hierarchical local features;
fully-connected layers to use global information; pooling layers
to reduce feature map dimensions; activation layers to produce
non-linearity; dropout layers to avoid over-fitting; and a
classification layer to output the probabilities of the
categories.
In this paper, we design FM-CNN for both coarse-grained and
fine-grained VTR problems. In contrast to many of the existing
methods mentioned above, our method needs neither information on
the viewpoints of images nor camera parameters. FM-CNN also does
not need complex preprocessing such as scene segmentation or
vehicle part detection. Fig. 2 shows how FM-CNN works.
III. FEEDBACK-ENHANCEMENT MULTI-BRANCH CNN
A. Network architecture
Our FM-CNN is based on AlexNet. As shown in Fig. 3, the
input size of AlexNet is 227×227. It has five convolutional
layers to learn local features and three fully-connected layers
to organize global information. Two local response normaliza-
tion (LRN) layers which are behind the conv1 and conv2 layers
mimic lateral inhibition. Three pooling layers are behind the
lrn1, lrn2 and conv5 layers to reduce feature map dimensions.
Two dropout layers with probability 0.5 are behind the fc6
and fc7 layers to avoid over-fitting. Activations are all ReLU
except the last one, which is pure identity. AlexNet is trained
on the ILSVRC2012 dataset, so the last fc8 layer has 1000
outputs.
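For concreteness, the layer ordering described above can be written as the short sketch below. The paper's implementation is in Caffe; this PyTorch rendering is only illustrative, the filter counts and kernel sizes follow the standard AlexNet configuration (an assumption, since the paper does not list them), and the dual-GPU grouped convolutions shown in Fig. 3 are omitted.

# Illustrative AlexNet-style trunk (PyTorch sketch, not the authors' Caffe model).
import torch
import torch.nn as nn

class AlexNetLike(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # conv1
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),           # lrn1
            nn.MaxPool2d(kernel_size=3, stride=2),                                # pool1
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),  # conv2
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),           # lrn2
            nn.MaxPool2d(kernel_size=3, stride=2),                                # pool2
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv3
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv4
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv5
            nn.MaxPool2d(kernel_size=3, stride=2),                                # pool5
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5), # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),        # fc7
            nn.Linear(4096, num_classes),                                         # fc8 (identity activation)
        )

    def forward(self, x):                 # x: (N, 3, 227, 227)
        x = self.features(x)              # -> (N, 256, 6, 6)
        return self.classifier(torch.flatten(x, 1))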
Fig. 3. Architecture of AlexNet. It has five convolutional layers (blue), three pooling layers (yellow) and three fully-connected layers (brown). Other layers such as dropout and local response normalization (LRN) are not shown. The parallelism in the conv2, pool2, conv4, conv5 and pool5 layers was due to the limited memory of graphics processing units (GPUs) at the time.

Our FM-CNN extends the convolutional part to three branches, as
shown in Fig. 4. The first branch has the largest input size,
435×435, and is called the large branch. We append "_l" to layer
names in this branch. The large branch has five convolutional
layers, two LRN layers and three pooling layers. The second
branch has the same input size as AlexNet, 227×227, and is
called the medium branch. We append "_m" to layer names in this
branch. The medium branch has all the layers of the large branch
except the last pooling layer. The third branch has the smallest
input size, 117×117, and is called the small branch. We append
"_s" to layer names in this branch. The small branch only has
two convolutional layers, one LRN layer and one pooling layer.
All activations in these three branches are ReLU.
All hyper-parameters of the layers, such as kernel size, kernel
number and stride, are the same as those of their counterparts
in AlexNet. For example, the conv1_l, conv1_m and conv1_s layers
all correspond to the conv1 layer in AlexNet, and the pool4_l
and pool4_m layers both correspond to the pool4 layer in
AlexNet. We also add batch normalization layers [29] after the
convolutional layers whose parameters are updated.
All three convolutional branches output feature maps of size
13×13. These feature maps are concatenated by depth and then
input into a convolutional layer named conv_sq. The conv_sq
layer uses 1×1 convolutional kernels, which is inspired by
"network in network" [18] and has the benefits of reducing the
number of channels of the feature maps, increasing the
non-linearity of the whole model, training a branch with
information from the other two branches during back-propagation,
and limiting the number of newly introduced parameters to avoid
over-fitting. After a pooling layer, the feature maps are input
into the global loss module, which has three fully-connected
layers and two dropout layers. The output number of the last
fully-connected layer varies with the specific task: 7 for
coarse-grained VTR on our MVVTR dataset and 431 for fine-grained
VTR on the CompCars dataset. The activations are ReLU for the
first two fully-connected layers and identity for the last one.
The global loss module is important since the feedback from this
loss is propagated back into all three branches.
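A minimal sketch of this fusion stage is given below, assuming (as the text above implies but does not state explicitly) that each branch ends in a 256-channel 13×13 feature map and that conv_sq also outputs 256 channels; the module and variable names, and the type of the pooling layer, are illustrative rather than taken from the paper.

# Sketch of the FM-CNN fusion stage (PyTorch, illustrative only).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, branch_channels=256, squeezed_channels=256, num_classes=7):
        super().__init__()
        # conv_sq: 1x1 kernels applied to the depth-concatenated feature maps
        self.conv_sq = nn.Sequential(
            nn.Conv2d(3 * branch_channels, squeezed_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)        # 13x13 -> 6x6 (pool type assumed)
        # global loss module: three fully-connected layers with two dropouts
        self.global_head = nn.Sequential(
            nn.Linear(squeezed_channels * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),    # fc8_vtr: 7 (MVVTR) or 431 (CompCars) outputs
        )

    def forward(self, feat_l, feat_m, feat_s):
        # each feat_*: (N, 256, 13, 13) from the large / medium / small branch
        fused = torch.cat([feat_l, feat_m, feat_s], dim=1)       # concatenate by depth
        fused = self.pool(self.conv_sq(fused))
        return self.global_head(torch.flatten(fused, 1))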
B. Network advantages
1) Multiple branches: FM-CNN has three convolutional branches
that take multi-scale inputs. Utilizing multiple branches has
two advantages.
The first advantage is hierarchical details of images. The input
sizes of the three branches are 435×435, 227×227 and 117×117,
nearly 2×, 1× and 0.5× the original size in AlexNet. Multiple
scales, also called a spatial pyramid, have already been used in
traditional handcrafted image descriptors such as SIFT [27] and
HOG [26]. Images with the same content but at different scales
display different key points and features. Combinations of these
features are invariant to scale variation and help to understand
image content better. So compound features extracted within
fixed-size windows from multi-scale images usually have better
representation ability. Varying the input scale by cropping
fixed-size patches from multi-scale training images is also a
common technique to make CNNs scale-invariant, because CNNs can
learn the hierarchical information of images [16], [20] in this
way. However, in this way CNNs handle multi-scale inputs with a
single convolutional branch, which may degrade the convolutional
kernels' learning ability, because pixel-level contents differ
greatly within fixed-size receptive fields in the case of
multi-scale inputs. In contrast, FM-CNN handles multi-scale
inputs with multiple branches, and the convolutional kernels in
each branch can learn features from images of one specific scale
without being confused.
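As a small illustration, preparing the three scaled inputs for one image could look like the sketch below; the exact resize/crop policy is our assumption, since the paper only states that images are resized to the three scales before being fed to the branches.

# Illustrative preparation of the three branch inputs (PyTorch / torchvision sketch).
from PIL import Image
from torchvision import transforms

BRANCH_SIZES = [435, 227, 117]    # large / medium / small branch input sizes

def multiscale_inputs(path):
    image = Image.open(path).convert("RGB")
    to_tensor = transforms.ToTensor()
    # Resize the same image to each branch's input size (aspect ratio squashed here).
    return [to_tensor(image.resize((s, s), Image.BILINEAR)) for s in BRANCH_SIZES]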
The second advantage is robustness to translation and mirroring.
A common technique to give a model robustness to a certain
variation is to augment the training set so that the variation
is reflected in the training images. Randomly cropping patches
from larger training images and mirroring these patches with a
probability (usually 0.5) are widely used to augment the
training sets of CNNs. Through these augmentation methods, CNNs
learn features that are robust to translation and mirroring.
FM-CNN has three branches, and each of these branches can learn
translation- and mirroring-invariant features. Therefore, the
robustness of FM-CNN to translation and mirroring is stronger
than that of CNNs with a single branch.
2) Feedback enhancement: Multi-branch (or multi-scale)
CNNs have already been used in scene understanding tasks
[30]–[32]. However, these CNNs can’t be applied directly in
VTR problems for two reasons:
1) In scene understanding tasks, there are no concepts of
foregrounds and backgrounds. Therefore, feature maps from
different branches, which are of different scales in terms of
the receptive fields of the convolutional kernels, offer
hierarchical information with little interference. But in VTR
problems, the foregrounds are vehicles and the backgrounds are
interference, and feature maps of different scales from
different branches may amplify this interference.
2) In scene understanding tasks, the training loss for
multi-branch CNNs contains many pixel-level distances and gives
enough feedback. But in VTR problems, which are classification
tasks, the training loss is usually very small (a cross-entropy
loss of less than 10). Such a training loss can neither provide
enough feedback to guide the learning of the convolutional
kernels nor suppress the latent interference among feature maps
of different scales.

Fig. 4. Details of FM-CNN. It has three convolutional branches, which take multi-scale inputs and output feature maps with the same size 13×13. These feature maps are then concatenated and input into the global loss module. Except for the new conv_sq and fc8_vtr layers, all other layers have their counterparts in AlexNet, so the pre-trained parameters of AlexNet can be reused. (The fc8_vtr layer has 7 outputs for coarse-grained VTR on our MVVTR dataset and 431 outputs for fine-grained VTR on the CompCars dataset.) To suppress the latent interference among feature maps from different branches and enhance the local feedback to each branch, we add three local branch loss modules, each contributing to training with a weight of 0.5. These branch losses are used only during training, and the paths of the extra feedback from the branch losses are indicated by the yellow arrowed lines.
To improve the performance of multi-branch CNNs for VTR
problems, we add extra branch loss modules to enhance the
feedback to each branch in our FM-CNN. Each branch loss module
has two pooling layers and three fully-connected layers. The
three branch losses contribute to the total training loss, each
with a weight of 0.5. Different from GoogLeNet, where local loss
modules enhance feedback mainly to alleviate the gradient
vanishing problem, the local loss modules in FM-CNN enhance
feedback mainly to suppress the latent interference among the
different feature maps.
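The way the enhanced feedback enters training can be summarized as the sketch below: the global cross-entropy loss plus the three branch losses, each weighted by 0.5 as stated above. We assume here that the branch losses are also cross-entropy losses over the same labels; the function and variable names are illustrative.

# Sketch of the FM-CNN training objective with feedback-enhancement branch losses.
import torch.nn.functional as F

BRANCH_LOSS_WEIGHT = 0.5   # each branch loss contributes with weight 0.5

def fm_cnn_loss(global_logits, branch_logits, labels):
    """global_logits: (N, C); branch_logits: list of three (N, C) tensors; labels: (N,)."""
    total = F.cross_entropy(global_logits, labels)             # global loss module
    for logits in branch_logits:                               # large / medium / small branch heads
        total = total + BRANCH_LOSS_WEIGHT * F.cross_entropy(logits, labels)
    return total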
Combining the multi-branch architecture and the extra branch
loss modules, FM-CNN has another advantage: model averaging. If
we ignore the global loss module, each branch with its branch
loss module is a complete CNN. So the whole FM-CNN is equivalent
to averaging over the feature maps learned by the last
convolutional layers of the three branches. Although GoogLeNet
also uses local loss modules, it does not have such a model
averaging advantage. Besides, compared with general model
averaging algorithms, which weight the final outputs of every
model, FM-CNN weights the pixels in the middle-layer feature
maps of each branch, which gives it more flexibility to decide
which pixels are important.
C. Parameter Update
FM-CNN is both deep and broad. Training such a large
CNN needs a huge number of labeled images. If the training
set is small, over-fitting easily occurs. Neither the MVVTR
dataset nor the CompCars dataset is large enough. The former is
small in terms of the total image amount, and the latter is
small in terms of the average per-category image amount
(only 37 in the training set used in [14]). Therefore, we
can’t train FM-CNN from scratch. Instead, we reuse the pre-
trained parameters of AlexNet on the ILSVRC2012 dataset.
Except for the newly introduced layers (the conv_sq layer, the
fc8_vtr layer and the layers in the three branch loss modules),
all the parameters in the remaining layers are initialized from
their counterparts in AlexNet. These parameters should then be
updated, since the image contents of the ILSVRC2012 dataset
differ a lot from those of the two vehicle datasets and the
input image sizes of FM-CNN are not all the same as the original
input size of AlexNet, so features learned from the ILSVRC2012
dataset cannot be applied directly to VTR. We update the
parameters in higher convolutional layers and higher
fully-connected layers to use both the new local and the new
global information in the new vehicle datasets.
The parameter updating of the higher convolutional layers is
inspired by the analogy between CNNs and human visual systems.
Human visual systems have a similar hierarchical layer
architecture, and neurons in higher regions respond to more
semantic visual patterns. Neurons in the V1 cortex respond to
edges with various orientations [33]. Neurons in the V2 cortex
respond to slightly more complex spatial structures [34]. And
neurons in the V4 cortex respond to more invariant semantic
features [35]. Responses to hierarchical visual patterns are
also found in CNNs by visualizing the learned features of the
convolutional kernels [36]. The kernels in the first
convolutional layer are likewise tuned to edges, while those in
higher convolutional layers are tuned to more semantic features
such as object parts or contours. Kernels in shallow
convolutional layers learn low-level, general features, and
kernels in higher convolutional layers learn high-level,
dataset-specific features. Therefore, with a new small dataset,
the new information should be used to update the parameters in
the higher convolutional layers. In this way, the selectivity of
these kernels for more semantic visual patterns is updated.
Retraining the shallow convolutional layers, on the other hand,
may change little about the kernels' selectivity for edges but
increases the possibility of over-fitting.
Convolutional layers usually only learn local features. In
order to utilize the global information in new datasets, we also
update the parameters in the second fully-connected layer. The
experiment results in the next section validate the effectiveness
of our parameter updating method.
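In a modern framework, this selective update amounts to freezing the lower layers and exposing only the chosen layers to the optimizer; the sketch below shows the idea for the single-branch case, using the layers that the experiments in Section IV identify as most effective (conv4, conv5 and fc7). The attribute-name prefixes are illustrative; in a Caffe implementation such as the authors', the same effect would typically be obtained with per-layer learning-rate multipliers.

# Sketch of selective parameter updating: freeze everything except the chosen layers.
# Layer-name prefixes are illustrative; the paper's model is implemented in Caffe.
def select_trainable_parameters(model, updated_prefixes=("conv4", "conv5", "fc7")):
    for name, param in model.named_parameters():
        # Keep pre-trained ILSVRC2012 weights fixed unless the layer is selected.
        param.requires_grad = any(name.startswith(p) for p in updated_prefixes)
    # Hand only the enabled parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]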
IV. EXPERIMENT RESULTS AND ANALYSIS
A. MVVTR dataset
There exist many datasets for VTR problems [8]–[10], [14], [15],
[25], [37], [38], but none of them applies to the multi-view
coarse-grained case. So we create our own MVVTR dataset. This
dataset contains 7570 web-nature images in total. We collected
these images from the Internet and manually labeled each image.
The 7 major coarse-grained vehicle types and the image count of
each type are: MPV (1047), sedan (1217), SUV (1152), sports car
(777), minicar (901), bus (1316) and truck (1163), i.e., around
1000 images per category. Most images in the MVVTR dataset
contain only one vehicle. All the images are taken from
unconstrained viewpoints, and multiple viewpoints are the key
challenge in the MVVTR dataset. In addition, the backgrounds of
images in the MVVTR dataset are more diverse than those taken by
fixed surveillance cameras. Sample images from the MVVTR dataset
are shown in Fig. 5. For multi-view fine-grained VTR problems,
the CompCars dataset is used. Sample images are shown in Fig. 6.

Fig. 5. Images in our MVVTR dataset. Each row corresponds to a vehicle type. Multiple viewpoints are the key challenge in this dataset.

Fig. 6. Images in the CompCars dataset. Each row corresponds to a vehicle model. Similarity among models is the key challenge in this dataset.

TABLE I
COMPARISONS ON MVVTR DATASET
model                                          Top-1 Accuracy
AlexNet [17]                                   0.817
NIN [18]                                       0.864
GoogLeNet [16]                                 0.888
VGG16 [19]                                     0.841
VGG19 [19]                                     0.830
ResNet50 [20]                                  0.907
ResNet101 [20]                                 0.910
ResNet152 [20]                                 0.913
FM-CNN (central crop)  parameter update        0.903
FM-CNN (central crop)  +multi-branch           0.926
FM-CNN (central crop)  +feedback enhancement   0.942
FM-CNN (10 crops)      parameter update        0.912
FM-CNN (10 crops)      +multi-branch           0.936
FM-CNN (10 crops)      +feedback enhancement   0.949

TABLE II
COMPARISONS ON COMPCARS DATASET
model                                          Top-1 Accuracy   Top-5 Accuracy
AlexNet [17]                                   0.425            0.643
AlexNet [15]                                   0.848            0.954
NIN [18]                                       0.535            0.730
GoogLeNet [16]                                 0.508            0.720
GoogLeNet [14]                                 0.767            0.917
VGG16 [19]                                     0.505            0.712
VGG19 [19]                                     0.485            0.707
ResNet50 [20]                                  0.688            0.852
ResNet101 [20]                                 0.677            0.842
ResNet152 [20]                                 0.693            0.853
FM-CNN (central crop)  parameter update        0.792            0.915
FM-CNN (central crop)  +multi-branch           0.857            0.951
FM-CNN (central crop)  +feedback enhancement   0.892            0.971
FM-CNN (10 crops)      parameter update        0.832            0.935
FM-CNN (10 crops)      +multi-branch           0.865            0.961
FM-CNN (10 crops)      +feedback enhancement   0.910            0.978
B. Experiment Settings
AlexNet and FM-CNN are both implemented in Caffe [39], and the
pre-trained parameters of AlexNet on the ILSVRC2012 dataset are
downloaded from the Caffe Model Zoo2. We update parameters with
Adam [40], a variant of the stochastic gradient descent
optimizer. In FM-CNN, the three branch losses contribute to the
total training loss, each with a weight of 0.5. The initial
learning rate is 0.001, and we drop it by a factor of 10 at a
deliberately chosen step size.

2 https://github.com/BVLC/caffe/wiki/Model-Zoo
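Restated as a short sketch (PyTorch-style and illustrative only, since the actual implementation is in Caffe; the step size below is a placeholder because the paper does not give its value):

# Sketch of the optimization setup: Adam, initial learning rate 0.001, decayed by 10x.
import torch

def build_optimizer(trainable_params, step_size=10000):   # step size is illustrative
    optimizer = torch.optim.Adam(trainable_params, lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=0.1)
    return optimizer, scheduler

Here trainable_params would be the subset of parameters selected by the update method of Section III-C.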
C. Coarse-grained Classification
The experiments for coarse-grained multi-view VTR are based on
our MVVTR dataset. For each vehicle type, we randomly select 70%
of the images for training and the rest for testing, so there
are 5302 training images and 2268 testing images in total. The
only preprocessing is mean reduction. During training, dataset
augmentation includes random cropping, mirroring, gamma
adjustment with γ = 1.2 and γ = 0.8, and rotation with θ = ±5°.
During testing, we test FM-CNN with the central crop and with
the average over 10 crops (upper left, upper right, lower left,
lower right and center, each also mirrored).
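A rough torchvision rendering of this training-time augmentation (for the medium 227×227 branch) is sketched below; the resize size, the mean values used for mean reduction and the exact way the two gamma values are sampled are our assumptions.

# Illustrative training augmentation: random crop, mirroring, gamma 0.8/1.2,
# rotation of +/- 5 degrees, followed by mean reduction.
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

DATASET_MEAN = [0.485, 0.456, 0.406]   # assumed per-channel mean for mean reduction

def random_gamma(img, gammas=(0.8, 1.0, 1.2)):
    # Apply one of the gamma values at random (1.0 keeps the image unchanged).
    return TF.adjust_gamma(img, random.choice(gammas))

train_transform = transforms.Compose([
    transforms.Resize(256),                    # assumed resize before cropping
    transforms.RandomRotation(5),              # theta = +/- 5 degrees
    transforms.RandomCrop(227),                # random cropping (medium branch size)
    transforms.RandomHorizontalFlip(p=0.5),    # mirroring
    transforms.Lambda(random_gamma),           # gamma adjustment
    transforms.ToTensor(),
    transforms.Normalize(mean=DATASET_MEAN, std=[1.0, 1.0, 1.0]),  # mean reduction only
])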
The Top-1 results on our MVVTR dataset are shown in TABLE I. As
mentioned in Section II, existing methods for coarse-grained VTR
do not consider the multi-view challenge, so we only compare the
performance with other CNNs that are currently the most popular
for classification. These CNNs are used as feature extractors,
and classification is done by an SVM. Among these CNN
algorithms, ResNet152 [20] achieves the best performance, with
91.3% Top-1 accuracy. For FM-CNN, "parameter update" means a
single branch (i.e., AlexNet) with the parameters in the conv4,
conv5 and fc7 layers updated. It gets 91.2% Top-1 accuracy with
10 crops, only 0.1% worse than ResNet152. Considering that
ResNet152 is 146 layers deeper than the single branch, this
result demonstrates that our parameter update method is
effective. "+multi-branch" means the parameter update method is
kept and the network is extended to three branches without
branch losses. In this situation, the parameters in the conv2_s,
conv5_m, conv5_l and fc7 layers are updated with one global
loss, and 93.6% Top-1 accuracy is achieved with 10 crops. The
result is better than the single branch by 2.4%, demonstrating
the effectiveness of the multi-branch architecture. Finally,
"+feedback enhancement" means we add the 3 extra branch losses
in FM-CNN to enhance the feedback to each branch; these branch
losses each have a weight of 0.5 in the total training loss.
With the four losses, FM-CNN gets 94.9% Top-1 accuracy with 10
crops, better than the single branch by 3.7% and better than
itself with one single global loss by 1.3%. The results in
TABLE I demonstrate the contributions of the three techniques of
FM-CNN (parameter update, multi-branch, and feedback
enhancement) to coarse-grained VTR problems.
D. Fine-grained Classification
The experiments for fine-grained multi-view VTR problems are
based on the CompCars dataset [14]. The split of the training
and testing sets is the same as that in [14]. There are 16016
training images and 14939 testing images. The only preprocessing
method is mean reduction. Dataset augmentation is the same as
that mentioned above. We still test FM-CNN with the central crop
and with 10 crops here.
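For reference, averaging predictions over the ten crops (the four corners and the center, each also mirrored) can be sketched with torchvision's TenCrop as below; the crop size shown is the medium-branch size and the resize step is our assumption.

# Illustrative 10-crop test-time averaging (PyTorch / torchvision sketch).
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),        # assumed resize before cropping
    transforms.TenCrop(227),       # 4 corners + center, each also mirrored
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

@torch.no_grad()
def predict_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)                    # (10, 3, 227, 227)
    probs = torch.softmax(model(crops), dim=1)     # per-crop class probabilities
    return probs.mean(dim=0)                       # average over the 10 crops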
The Top-1 and Top-5 results on the CompCars dataset are shown in
TABLE II. The performance of GoogLeNet [14] and AlexNet [15] is
cited from the respective papers. The other CNNs are still used
as feature extractors combined with an SVM. The baseline model
in [14] is a fine-tuned GoogLeNet with 76.7% Top-1 and 91.7%
Top-5 accuracy. Sochor et al. retrain AlexNet on the CompCars
dataset [15] with complicated preprocessing procedures that
generate 3D boxes as input, which requires camera parameters;
they improve the performance to 84.8% Top-1 and 95.4% Top-5
accuracy. The performance of the other CNNs is not competitive
on the CompCars dataset. For FM-CNN, "parameter update" again
means a single branch (i.e., AlexNet) with the parameters in the
conv4, conv5 and fc7 layers updated. It gets 83.2% Top-1 and
93.5% Top-5 accuracy with 10 crops, which is better than the
baseline GoogLeNet by 6.5% and 1.8%. Although GoogLeNet is much
deeper, its performance is worse than that of the single branch
with part of its parameters updated, which shows that our
parameter update method is effective for the fine-grained case.
"+multi-branch" means extending the network to 3 branches, where
the parameters in the conv2_s, conv5_m, conv5_l and fc7 layers
are updated with a single global loss. The performance in this
situation is 86.5% Top-1 and 96.1% Top-5 accuracy with 10 crops,
already better than the state-of-the-art method by 1.7% and
0.7%, demonstrating the contribution of the multi-branch
architecture. In the "+feedback enhancement" situation, we add
the 3 extra branch losses. FM-CNN achieves a further
improvement, reaching 91.0% Top-1 and 97.8% Top-5 accuracy with
10 crops and outperforming itself with one global loss by 4.5%
and 1.7%. These results in TABLE II demonstrate the
effectiveness of FM-CNN for fine-grained VTR problems.
E. Parameter Update
In order to study which parameters should be updated, we test
four update schemes for AlexNet to verify the effectiveness of
the proposed parameter updating method, on both the MVVTR
dataset and the CompCars dataset. Only the central crop is used
for testing.

TABLE III
PERFORMANCE OF PARAMETER UPDATE METHODS ON MVVTR DATASET
scheme     conv1  conv2  conv3  conv4  conv5  fc6  fc7  Top-1
scheme A   \      \      \      on     on     \    on   0.903
scheme B   \      \      \      on     on     on   on   0.894
scheme C   on     on     on     \      \      \    on   0.860
scheme D   on     on     on     on     on     on   on   0.863

TABLE IV
PERFORMANCE OF PARAMETER UPDATE METHODS ON COMPCARS DATASET
scheme     conv1  conv2  conv3  conv4  conv5  fc6  fc7  Top-1  Top-5
scheme A   \      \      \      on     on     \    on   0.792  0.915
scheme B   \      \      \      on     on     on   on   0.717  0.856
scheme C   on     on     on     \      \      \    on   0.583  0.756
scheme D   on     on     on     on     on     on   on   0.613  0.760
The results are displayed in TABLE III and TABLE IV, where "on"
means the parameters in that layer are updated and "\" means
they are not. We find that updating all parameters of AlexNet
(scheme D) does not reach the best performance on either
dataset; the best results are achieved by scheme A, which
updates the parameters in the conv4, conv5 and fc7 layers. This
is consistent with our analysis of the similarities between CNNs
and human visual systems. Scheme A achieves the best 90.3% Top-1
accuracy on the MVVTR dataset and the best 79.2% Top-1 and 91.5%
Top-5 accuracy on the CompCars dataset; the latter is already
better than the baseline model in [14]. The results in TABLE III
and TABLE IV demonstrate the efficiency of our update method in
two aspects:
•Updating parameters in higher convolutional layers is more
important. Compared with scheme A, scheme C updates the
parameters in the lowest three convolutional layers. Scheme C
clearly shows worse performance than scheme A, demonstrating
that updating parameters in higher convolutional layers is more
efficient.
•Updating parameters in the higher fully-connected layer is more
efficient. From the difference between scheme A and scheme B, we
can see that scheme A does not update the parameters in the fc6
layer but scheme B does. It can be seen from TABLE III and
TABLE IV that scheme B shows worse performance with even more
complexity; therefore, it is more efficient to update only the
higher fully-connected layer.
V. CONCLUSION
In this paper, we try to solve the problem of multiple
viewpoints in both coarse-grained and fine-grained VTR with the
proposed FM-CNN. FM-CNN takes advantage of a multi-branch
architecture and enhanced feedback from branch losses. It
achieves improved performance on the MVVTR dataset and the
CompCars dataset without the limitations of viewpoints and
cameras. FM-CNN takes 128 ms to handle one image on a Tesla K80
GPU; if we make use of the parallelism inside a GPU, the
complexity can be further reduced to support real-time
applications.
In the future, we may consider combining traditional features,
as well as complex preprocessing steps such as segmentation or
vehicle part detection, with our FM-CNN, and taking into account
more challenges in real surveillance applications such as
occlusion.
REFERENCES
[1] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale
convolutional networks,” in Neural Networks (IJCNN), The 2011 Inter-
national Joint Conference on. IEEE, 2011, pp. 2809–2813.
[2] R. Qian, Y. Yue, F. Coenen, and B. Zhang, “Traffic sign recognition
with convolutional neural network based on max pooling positions,” in
Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-
FSKD), 2016 12th International Conference on. IEEE, 2016, pp. 578–
582.
[3] Y. Yuan, J. Wan, and Q. Wang, “Congested scene classification via
efficient unsupervised feature learning and density estimation,” Pattern
Recognition, vol. 56, pp. 159–169, 2016.
[4] Y. Yuan, Z. Xiong, and Q. Wang, “An incremental framework for
video-based traffic sign detection, tracking, and recognition,” IEEE
Transactions on Intelligent Transportation Systems, vol. 18, no. 7, pp.
1918–1929, 2017.
[5] A. H. Lai, G. S. Fung, and N. H. Yung, “Vehicle type classification
from visual-based dimension estimation,” in Intelligent Transportation
Systems, 2001. Proceedings. 2001 IEEE. IEEE, 2001, pp. 201–206.
[6] R. P. Avery, Y. Wang, and G. S. Rutherford, “Length-based vehicle clas-
sification using images from uncalibrated video cameras,” in Intelligent
Transportation Systems, 2004. Proceedings. The 7th International IEEE
Conference on. IEEE, 2004, pp. 737–742.
[7] G. Zhang, R. Avery, and Y. Wang, “Video-based vehicle detection
and classification system for real-time traffic data collection using
uncalibrated video cameras,” Transportation Research Record: Journal
of the Transportation Research Board, no. 1993, pp. 138–147, 2007.
[8] Z. Dong, M. Pei, Y. He, T. Liu, Y. Dong, and Y. Jia, “Vehicle
type classification using unsupervised convolutional neural network,”
in Pattern Recognition (ICPR), 2014 22nd International Conference on.
IEEE, 2014, pp. 172–177.
[9] Z. Dong, Y. Wu, M. Pei, and Y. Jia, “Vehicle type classification using
a semisupervised convolutional neural network,” IEEE Transactions on
Intelligent Transportation Systems, vol. 16, no. 4, pp. 2247–2256, 2015.
[10] V. S. Petrovic and T. F. Cootes, “Analysis of features for rigid structure
vehicle type recognition.” in BMVC, 2004, pp. 1–10.
[11] U. Iqbal, S. Zamir, M. Shahid, K. Parwaiz, M. Yasin, and M. Sarfraz,
“Image based vehicle type identification,” in Information and Emerging
Technologies (ICIET), 2010 International Conference on. IEEE, 2010,
pp. 1–5.
[12] A. Psyllos, C.-N. Anagnostopoulos, and E. Kayafas, “Vehicle model
recognition from frontal view image measurements,” Computer Stan-
dards & Interfaces, vol. 33, no. 2, pp. 142–151, 2011.
[13] B. Zhang, “Reliable classification of vehicle types based on cascade
classifier ensembles,” IEEE Transactions on Intelligent Transportation
Systems, vol. 14, no. 1, pp. 322–332, 2013.
[14] L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset
for fine-grained categorization and verification,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 3973–3981.
[15] J. Sochor, A. Herout, and J. Havel, “Boxcars: 3d boxes as cnn input
for improved fine-grained vehicle recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2016,
pp. 3006–3015.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[18] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint
arXiv:1312.4400, 2013.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” arXiv preprint arXiv:1512.03385, 2015.
[21] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le-
Cun, “Overfeat: Integrated recognition, localization and detection using
convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[23] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 1440–1448.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[25] M. S. Sarfraz, A. Saeed, M. H. Khan, and Z. Riaz, “Bayesian prior
models for vehicle make and model recognition,” in Proceedings of the
7th International Conference on Frontiers of Information Technology.
ACM, 2009, p. 35.
[26] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp.
886–893.
[27] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Computer vision, 1999. The proceedings of the seventh IEEE interna-
tional conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in International
Conference on Machine Learning, 2015, pp. 448–456.
[30] C. Farabet, C. Couprie, L. Najman, and Y. Lecun, “Scene parsing with
multiscale feature learning, purity trees, and optimal covers,” Eprint
Arxiv, 2012.
[31] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a
single image using a multi-scale deep network,” Computer Science, pp.
2366–2374, 2014.
[32] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic
labels with a common multi-scale convolutional architecture,” pp. 2650–
2658, 2015.
[33] D. McLaughlin, R. Shapley, M. Shelley, and D. J. Wielaard, “A neuronal
network model of macaque primary visual cortex (v1): Orientation
selectivity and dynamics in the input layer 4cα,” Proceedings of the
National Academy of Sciences, vol. 97, no. 14, pp. 8087–8092, 2000.
[34] L. Liu, L. She, M. Chen, T. Liu, H. D. Lu, Y. Dan, and M.-m. Poo,
“Spatial structure of neuronal receptive field in awake monkey secondary
visual cortex (v2),” Proceedings of the National Academy of Sciences,
vol. 113, no. 7, pp. 1913–1918, 2016.
[35] T. O. Sharpee, “How invariant feature selectivity is achieved in cortex,”
Frontiers in Synaptic Neuroscience, vol. 8, 2016.
[36] M. D. Zeiler and R. Fergus, “Visualizing and understanding con-
volutional networks,” in European Conference on Computer Vision.
Springer, 2014, pp. 818–833.
[37] C. Caraffi, T. Vojíř, J. Trefný, J. Šochman, and J. Matas, "A system for
real-time detection and tracking of vehicles from a single car-mounted
camera," in 2012 15th International IEEE Conference on Intelligent
Transportation Systems. IEEE, 2012, pp. 975–982.
[38] H. Huttunen, F. S. Yancheshmeh, and K. Chen, “Car type recognition
with deep neural networks,” arXiv preprint arXiv:1602.07125, 2016.
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia. ACM, 2014, pp. 675–678.
[40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
Zhibo Chen (M'01-SM'11) received the B.Sc. and Ph.D. degrees
from the Department of Electrical Engineering, Tsinghua
University, in 1998 and 2003, respectively. He is now a
professor at the University of Science and Technology of China.
Before that, he worked at SONY and Thomson from 2003 to 2012,
where he was a principal scientist and research manager in the
Thomson Research & Innovation Department.
His research interests include image and video compression,
visual quality of experience assessment, immersive media
computing and intelligent media computing. He has more than 50
granted and over 100 filed EU and US patent applications, and
more than 70 publications and standard proposals. He is an IEEE
Senior Member, a member of the IEEE Visual Signal Processing and
Communications Committee, and a member of the IEEE Multimedia
Communication Committee. He was an organization committee member
of ICIP 2017 and ICME 2013, and has served as a TPC member for
IEEE ISCAS and IEEE VCIP.
Chenlu Ying received the B.S. degree in electronic engineering
from the University of Science and Technology of China (USTC),
Hefei, China, in 2015. He is currently pursuing the M.S. degree
at USTC. His research interests include deep learning, vehicle
type recognition, computer vision and video coding.
Chaoyi Lin received the B.S. degree from Hefei University of
Technology, Hefei, China, in 2015. He is currently pursuing the
M.S. degree at the University of Science and Technology of
China, Hefei, China. His research interests include image
processing, perceptual modeling and computer vision.
Sen Liu received the B.S. degree in computer science from the
Beijing University of Posts and Telecommunications, Beijing,
China, in 2013. He is currently working towards the Ph.D. degree
at the School of Information Science and Technology, University
of Science and Technology of China. His areas of interest
include artificial intelligence, deep learning, video coding,
computer vision, pattern recognition and reinforcement learning.
Weiping Li (S’84-M’87-SM’97-F’00) received the
B.S. degree in electrical engineering from University
of Science and Technology of China (USTC), Hefei,
China, in 1982, and the M.S. and Ph.D. degrees
in electrical engineering from Stanford University,
Stanford, CA, USA, in 1983 and 1988, respectively.
In 1987, he joined Lehigh University, Bethlehem,
PA, USA, as an Assistant Professor with the De-
partment of Electrical Engineering and Computer
Science. In 1993, he was promoted to Associate
Professor with tenure. In 1998, he was promoted to
Full Professor. From 1998 to 2010, he worked in several high-tech companies
in the Silicon Valley (1998-2000, Optivision, Palo Alto; 2000-2002, Webcast
Technologies, Mountain View; 2002-2008, Amity Systems, Milpitas;
2008-2010, Bada Networks, Santa Clara; all in California, USA). In March 2010, he
returned to USTC and is currently a Professor with the School of Information
Science and Technology.
Dr. Li had served as the Editor-in-Chief of IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and Guest Editor
of the PROCEEDINGS OF THE IEEE. He was the Chair of several Technical
Committees in the IEEE Circuits and Systems Society and IEEE International
Conferences, and the Chair of the Best Student Paper Award Committee
for SPIE Visual Communications and Image Processing Conference. He has
made many contributions to international standards. His inventions on fine
granularity scalable video coding and shape adaptive wavelet coding have
been included in the MPEG-4 international standard. He served as a Member
of the Moving Picture Experts Group (MPEG) of the International Standard
Organization (ISO) and an Editor of MPEG-4 international standard. He
served as a Founding Member of the Board of Directors of MPEG-4 Industry
Forum. As a Technical Advisor, he also made contributions to the Chinese
audio video coding standard and its applications. He was the recipient of the
Certificate of Appreciation from ISO/IEC as a Project Editor in development
of an international standard in 2004, the Spira Award for Excellence in
Teaching in 1992 at Lehigh University, and the first Guo Mo-Ruo Prize for
Outstanding Student in 1980 at USTC.