GROUND-BASED CLOUD IMAGE CATEGORIZATION USING
DEEP CONVOLUTIONAL VISUAL FEATURES
Liang Ye, Zhiguo Cao, Yang Xiao∗, Wei Li
National Key Laboratory of Science and Technology on Multi-spectral Information Processing,
School of Automation, Huazhong University of Science and Technology, P. R. China
{liang ye, zgcao, Yang Xiao, wlee}@hust.edu.cn
ABSTRACT
Ground-based cloud image categorization is an essential and challenging task in the field of automatic sky and cloud observation. It has still not been well addressed by either the meteorology or the image processing community, due to the large variation in cloud appearance. One feasible way to address it is to find a more discriminative visual representation to characterize the different kinds of clouds. Much effort has been devoted to this direction. However, to our knowledge, most existing methods resort only to hand-crafted visual descriptors (e.g., LBP, CENTRIST, and color histograms), and the resulting performance is unfortunately not satisfactory. Inspired by the great success of deep convolutional neural networks (CNN) in large-scale image classification (e.g., the ImageNet challenge), we propose, for the first time, to transfer CNN to our relatively small-scale cloud classification problem. Experiments on two challenging cloud datasets demonstrate that the deep convolutional visual features generated by CNN significantly outperform all the state-of-the-art methods in most cases. Another important contribution of our work is the finding that applying Fisher Vector (FV) encoding to the off-the-shelf CNN features can further boost the performance.
Index Terms— Ground-based cloud classification, convolutional neural networks, Fisher Vector
1. INTRODUCTION
Ground-based cloud observation plays an important role in the observation, recording, and study of weather phenomena. In recent years, automatic ground-based cloud observation systems have been in demand and under development, with cloud classification as one of their significant tasks. Compared with human observers and recorders, such systems can reduce costs significantly, observe more continuously, and record more objectively. Categorization of cloud type is regarded as one of the basic meteorological elements specified by the China Meteorological Administration [1] and has attracted much attention from researchers in recent years. In 2007, Isosalo et al. used local texture information to classify clouds in sky views and found that LBP performed better than LEP [2]. Soon after, statistical texture features were applied by Calbo and Sabburg [3]. In 2010, Heinle et al. presented a cloud classification algorithm based on a set of mainly statistical features describing the color and texture of an image, using a k-nearest-neighbor classifier [4]. Liu et al. found that several structural features extracted from the segmented image and the edge image are useful for distinguishing cirriform, cumuliform, and waveform clouds [5]. The most recent approach, proposed by Zhuo et al., can capture both texture and structure information from a color cloud image [6]. Despite these efforts, the existing methods still lack performance, and their accuracy hardly exceeds 80%.
∗ The corresponding author is Yang Xiao (Yang Xiao@hust.edu.cn). This work is jointly supported by the Chinese Fundamental Research Funds for the Central Universities (HUST: 2014QNRC035 and 2015QN036) and the National High-tech R&D Program of China (863 Program) (Grant No. 2015AA011604).
Image features play a key role in improving the performance of cloud classification. Recently, CNNs have been used to extract image features that have proved very useful for classification. Girshick et al. used CNN features to represent image regions, which were then fed to an SVM classifier [7]. In addition, feature encoding and pooling methods can often improve the quality of features for classification. Fisher Vector (FV) is a state-of-the-art feature encoding technique and has been widely used for image classification, as shown by Sánchez et al. [8]. Recently, Cimpoi et al. proposed a new texture descriptor obtained by Fisher Vector pooling of a convolutional neural network (CNN) filter bank [9]. This approach, which they call D-CNN, combines CNN and FV and achieves good performance on texture and material recognition.
In this paper, we present a novel way to obtain feature descriptors for ground-based cloud image classification. Instead of designing filters and hand-crafted descriptors based on how the characteristics of different clouds appear in images, our approach uses the CNN features of cloud images and their FV encodings as low-level features, and it outperforms the state-of-the-art approaches for cloud classification.
Fig. 1. The main pipeline of ground-based cloud image classification using deep convolutional features.
2. APPROACH AND IMPLEMENTATION
As shown in Fig. 1, the whole process of our approach consists of the following steps: resize the input image to match the input requirement of the CNN; extract features using the Caffe [10] implementation of the CNN configuration described by Simonyan et al. [11]; encode the extracted features using Fisher Vector; and use a linear SVM classifier to categorize the input images.
2.1. Resizing inputs to be compatible with the CNN
According to [10] and [11], the architecture of the CNN requires inputs of a fixed 224×224 pixel size. Therefore, the input images need to be resized to be compatible with the CNN. Following [7], inputs are directly warped to 224×224 pixels. Since the collected cloud images are generally nearly square, the deformation caused by warping is trivial. In addition, before being fed into the CNN, the warped image is normalized using the parameters of the pre-trained CNN model, following [12].
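As a concrete illustration, this preprocessing step might look as follows in MATLAB with MatConvNet [12]. This is a minimal sketch, not the authors' exact code: the model file name, the input file name, and the normalization field layout are assumptions that vary across MatConvNet releases.

% Hedged sketch of the preprocessing step (MatConvNet-style [12]).
% The file names and the normalization field path are assumptions;
% older releases expose net.normalization.averageImage instead.
net = load('imagenet-vgg-m.mat');                 % pre-trained model [13]
im  = imread('cloud_sample.jpg');                 % hypothetical cloud image
im_ = single(im);                                 % network expects single precision
im_ = imresize(im_, [224 224]);                   % warp directly to 224x224
im_ = im_ - net.meta.normalization.averageImage;  % subtract the mean image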
2.2. Extracting features from the CNN
In the R-CNN described by Girshick et al. [7], the output of the penultimate fully connected layer is used as the final feature of an image region. As discussed in [9], R-CNN, an object descriptor, does not perform as well in texture and material recognition as D-CNN, a texture descriptor. For distinguishing cloud categories, texture is more discriminative than structure, because the shapes of clouds vary irregularly. Moreover, for some cloud categories the shapes can also be regarded as image textures, since clouds may not cover the whole image region. Therefore, the assumption that a DCNN-like approach may achieve better cloud classification performance than an RCNN-like approach will be verified by the following experiments.
To obtain these features, a simple and efficient MATLAB toolbox implementing CNNs [12] is used together with the off-the-shelf pre-trained model described in [13], named "imagenet-vgg-m", which has 5 convolutional layers and 3 fully connected layers. The structure of this model is also shown in Fig. 1.
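A minimal sketch of this extraction step with MatConvNet's vl_simplenn [12] is given below; the result-array indices for "conv5" and "fc7" depend on the exact layer list of "imagenet-vgg-m", so the indices shown are illustrative assumptions rather than the model's actual layer numbers.

% Hedged sketch: forward pass, then pick out intermediate responses.
% res(i+1).x holds the output of layer i; indices are assumptions.
res   = vl_simplenn(net, im_);  % forward pass through all layers
conv5 = res(15).x;              % e.g., 13x13x512 conv5 response (DCNN-like)
fc7   = squeeze(res(20).x);     % penultimate FC output (RCNN-like), e.g., 4096-dim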
2.3. Encoding the features via Fisher Vector
If we directly use the features output by the fully connected layer of the CNN, as R-CNN does in [7], Fisher Vector encoding is not needed. If instead we use the features output by a convolutional layer, as D-CNN does in [9], Fisher Vector encoding is applied to obtain the image representation from the CNN features.
Fisher Vector is a method for encoding features. Under the assumption that all descriptors of an image are independent and identically distributed according to a Gaussian mixture model (GMM), the gradient vectors of the log-likelihood function with respect to the GMM parameters describe the direction in which the parameters should be modified to best fit the data. Normalizing these vectors yields the Fisher Vector, which has been shown to be very useful in image classification [8].
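For reference, the two gradient blocks that make up the FV can be written as follows, in the notation of [8], where $\gamma_n(k)$ is the posterior of Gaussian $k$ for descriptor $x_n$ and $(w_k, \mu_k, \sigma_k)$ are the mixture weights, means, and diagonal standard deviations:

$$
\mathcal{G}^X_{\mu,k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\,\frac{x_n - \mu_k}{\sigma_k},
\qquad
\mathcal{G}^X_{\sigma,k} = \frac{1}{N\sqrt{2 w_k}} \sum_{n=1}^{N} \gamma_n(k)\left[\frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1\right].
$$

Concatenating these $2K$ blocks, each of dimension $D$, gives a $2KD$-dimensional Fisher Vector.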
For each cloud image, the output of the 5th convolutional layer of the "imagenet-vgg-m" model is a 13×13×512 tensor. This can be regarded as 169 (13×13) local descriptors, each of which is a 512-dimensional feature. A GMM with 64 components is then used to encode these descriptors via Fisher Vector, resulting in a 65K-dimensional (2×64×512) descriptor per input image. The FV encoding is implemented on top of the open-source VLFeat library [14].
Fig. 2. Images from the 6 different cloud categories.
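A hedged sketch of this encoding step with VLFeat's vl_gmm and vl_fisher [14] is shown below; the variable trainDescs (a 512×M matrix of conv5 descriptors pooled from the training images) is a hypothetical placeholder for the GMM training data.

% Hedged sketch of FV encoding with VLFeat [14]. vl_gmm and vl_fisher
% are the library's entry points; trainDescs is a hypothetical 512 x M
% matrix of conv5 descriptors collected from the training set.
[means, covs, priors] = vl_gmm(trainDescs, 64);      % 64-component GMM
D  = reshape(conv5, [], 512)';                       % 512 x 169 local descriptors
fv = vl_fisher(D, means, covs, priors, 'Improved');  % 2*64*512 = 65536 dims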
2.4. Classification using SVM
A beneficial effect of the dimensionality increase brought by Fisher Vector is that it makes the samples more linearly separable. Therefore, a linear SVM (built with the LIBLINEAR library [15]) is applied for classification on the final features produced by FV encoding of the CNN outputs.
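The classification step can be sketched with LIBLINEAR's MATLAB interface [15]; the solver flag and C value below are illustrative defaults rather than the paper's tuned settings, and trainFVs/testFVs are hypothetical matrices with one FV per row.

% Hedged sketch of linear SVM classification with LIBLINEAR [15].
% '-s 2' selects L2-regularized L2-loss SVC; C = 1 is an assumption.
model = train(trainLabels, sparse(double(trainFVs)), '-s 2 -c 1');
[pred, acc, ~] = predict(testLabels, sparse(double(testFVs)), model);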
3. EXPERIMENTS
To verify our approach combining CNN features and FV, we test it on the challenging ground-based cloud image dataset "6 class HUST cloud" collected by Zhuo et al. [6]. It contains 1231 images from 6 classes: cirrocumulus and altocumulus; cirrus and cirrostratus; cumulus; stratocumulus; stratus and altostratus; and clear sky. Fig. 2 shows sample images from each class; more details can be found in [6].
For comparison with the approach of Zhuo et al. [6], the same experimental setup is used. The experiments are divided into 5 groups, using 5, 10, 20, 40, and 80 training samples per class, respectively; the remaining samples form the testing set. Furthermore, the available samples are split randomly into training and testing sets for 10 rounds. The effectiveness and robustness of the different methods are assessed by the average accuracy and standard deviation over the 10 rounds of testing, for each of the 5 groups with different numbers of training samples.
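This protocol can be summarized with the sketch below, where evaluateSplit is a hypothetical helper that runs the full pipeline (GMM, FV, and SVM training plus testing) on one random split with a given number of training samples per class.

% Hedged sketch of the evaluation protocol; evaluateSplit is hypothetical.
groupSizes = [5 10 20 40 80];
for g = groupSizes
    acc = zeros(10, 1);
    for r = 1:10                                      % 10 random splits
        acc(r) = evaluateSplit(features, labels, g);  % g training samples/class
    end
    fprintf('%2d samples: %.1f (+/- %.1f)\n', g, 100*mean(acc), 100*std(acc));
end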
Owing to the identical experimental setup, the performance figures of the methods proposed by Zhuo et al. [6], Liu et al. [5], Heinle et al. [4], Calbo and Sabburg [3], and Isosalo et al. [2] on "6 class HUST cloud" listed in [6] are cited for comparison with ours.
In addition, following the comparison made in [9], the approach that directly uses the output of the penultimate fully connected layer of the CNN (named "fc7") as features is also employed for comparison in this paper; we refer to it as "RCNN-like". Our approach, inspired by D-CNN, uses the Fisher Vector encoding of the outputs of the 5th convolutional layer (named "conv5") as features, and is referred to as "DCNN-like".
Fig. 3. Image samples of the 9 finer cloud categories derived from the "6 class HUST cloud" dataset.
Table 1 shows the classification results of the different approaches on the "6 class HUST cloud" dataset. The results clearly show that the DCNN-like approach outperforms the state-of-the-art methods in average accuracy over 10 rounds for all numbers of training samples. This demonstrates the effectiveness and generality of our approach for ground-based cloud image classification.
On the other hand, as described in [9], D-CNN is a better texture descriptor than R-CNN, which has been regarded as a good object descriptor. The reason is that the features output by the fully connected layer of a CNN include the spatial location information of local image patches, which can indicate position, shape, and contour; such location information is reasonably less useful for describing texture than for describing objects. R-CNN uses the fully connected layer outputs directly, whereas D-CNN pools the image parts by encoding the convolutional layer outputs via Fisher Vector. This is why D-CNN performs better than R-CNN in texture classification in [9]. For cloud classification, the images usually contain both cloud pixels and sky pixels, and the position of the cloud pixels or patches should not affect the classification of the cloud.
Training samples      5            10           20           40           80
Isosalo et al. [2]    39.8(±3.1)   47.7(±4.8)   58.3(±2.1)   66.5(±1.3)   72.2(±1.0)
Calbo et al. [3]      37.4(±2.4)   43.3(±3.8)   50.9(±2.0)   57.6(±3.0)   63.8(±1.2)
Heinle et al. [4]     34.4(±2.3)   37.7(±3.1)   44.9(±1.9)   51.5(±1.5)   56.8(±1.6)
Liu et al. [5]        29.0(±2.5)   30.6(±2.0)   33.8(±1.5)   38.1(±1.1)   41.1(±1.9)
Zhuo et al. [6]       45.2(±3.6)   53.5(±2.8)   66.2(±2.1)   74.7(±1.0)   79.8(±1.3)
RCNN-like             57.9(±4.5)   65.5(±2.5)   70.7(±1.2)   75.5(±1.2)   80.3(±1.3)
DCNN-like             55.8(±3.4)   65.4(±2.0)   74.4(±2.6)   79.2(±1.1)   83.8(±1.3)
Table 1. Cloud classification results (%) on the "6 class HUST cloud" dataset. Each method is tested on 5 groups (5, 10, 20, 40, or 80 samples per class are selected randomly for training; the remaining samples are used for testing). For each group, 10 rounds of experiments are carried out and the average accuracy over the 10 rounds is reported.
Cloud type           cumulus  cirrus  cirrostratus  cirrocumulus  altocumulus  clear sky  stratocumulus  stratus  altostratus  average
Isosalo et al. [2]   20.0     30.1    23.1          28.1          26.0         48.2       35.9           39.7     67.5         35.4
Calbo et al. [3]     25.2     23.7    46.5          43.9          50.4         61.7       59.7           68.9     66.1         49.6
Heinle et al. [4]    43.7     42.6    43.0          51.3          45.9         57.7       61.3           65.4     73.5         53.8
Liu et al. [5]       42.7     58.3    52.1          60.5          52.1         57.4       62.0           63.8     71.4         57.8
Zhuo et al. [6]      60.5     58.2    60.7          75.1          57.9         72.8       52.5           74.9     64.5         64.1
RCNN-like            77.5     65.0    68.9          90.0          60.5         94.6       63.1           96.8     73.4         76.7
DCNN-like            79.7     74.9    72.1          95.2          68.1         90.1       62.4           98.4     78.6         80.0
Table 2. Cloud classification results (%) under the finer categorization principle on the dataset. Experiments are carried out with 40 randomly selected training samples per class, and the rest of the samples in the dataset are used for testing. The classification accuracy of each cloud category is shown for the different methods.
Since the clouds in our images rarely include the entire shape of a cloud, and the edges between sky and cloud can also be regarded as image texture, it is better to treat cloud classification as texture recognition than as object classification. This is also supported by Table 1, which shows that our DCNN-like approach achieves the best performance, whereas the approach in [5] performs poorly because it relies only on structural features, such as shape and positional relations, while ignoring texture information.
Following Zhuo et al. [6], we refine the "6 class HUST cloud" dataset into 9 individual categories. As shown in Fig. 3, altocumulus and cirrocumulus are treated as separate classes, whereas they are merged into a single class in the "6 class HUST cloud" dataset because the two categories are extremely similar. The same applies to cirrus and cirrostratus, and to stratus and altostratus, so all images in the dataset are separated into 9 classes. This finer categorization principle makes cloud classification more challenging: the average accuracy of the state-of-the-art approach of Zhuo et al. [6] drops by more than ten percentage points, from 74.7% to 64.1%, when 40 samples per class are used for training. Our approach is also tested under the finer categorization principle, and the methods in Table 1 are employed again for comparison. Experiments are carried out under the same conditions as [6]. The classification results for each class and the overall average accuracy are listed in Table 2.
According to Table 2, both our "RCNN-like" and "DCNN-like" approaches outperform the other methods. Furthermore, compared with the results under the 6-class categorization principle in Table 1, the performance of our approach under the 6-class and 9-class principles is comparable. These promising results demonstrate the strong descriptive power of CNN features and the effectiveness of Fisher Vector encoding. Thus, our approach has strong potential for practical cloud classification applications. In fact, excluding the time for loading the CNN model and learning the GMM and the SVM, processing a new test cloud image takes less than 3.1 s (about 30 ms for extracting CNN features and 2 s for FV encoding) on an ordinary computer. This satisfies the requirements of the categorization task in a ground-based automatic cloud observation system.
4. CONCLUSIONS
In this paper, deep convolutional features are used for the first time as a new perception mechanism for ground-based cloud image categorization. They serve as strong low-level visual descriptors for cloud images, and Fisher Vector encoding can further improve the classification performance. According to the experimental results, our approach using deep convolutional features and Fisher Vector significantly outperforms the state-of-the-art methods on the challenging "6 class HUST cloud" dataset, and its advantage is further highlighted under the finer categorization rule (9 classes).
In future work, we intend to extend the ground-based cloud image dataset and test different convolutional neural network models. Moreover, fine-tuning the CNN models on our extended dataset could be considered to further improve the performance of cloud categorization.
5. REFERENCES
[1] China Meteorological Administration, Specification for
Ground Meteorological Observation, cloud. China Me-
teorological Press, 2003.
[2] Antti Isosalo, Markus Turtinen, and Matti Pietikäinen, "Cloud characterization using local texture information," in Proc. Finnish Signal Processing Symposium (FINSIG 2007), Oulu, Finland, 2007.
[3] Josep Calbo and Jeff Sabburg, “Feature extraction from
whole-sky ground-based images for cloud-type recogni-
tion,” Journal of Atmospheric and Oceanic Technology,
vol. 25, no. 1, pp. 3–14, 2008.
[4] Anna Heinle, Andreas Macke, and Anand Srivastav,
“Automatic cloud classification of whole sky images,”
Atmospheric Measurement Techniques Discussions, vol.
3, no. 1, pp. 269–299, 2010.
[5] Lei Liu, Xuejin Sun, Feng Chen, Shijun Zhao, and
Taichang Gao, “Cloud classification based on structure
features of infrared images,” Journal of Atmospheric
and Oceanic Technology, vol. 28, no. 3, pp. 410–417,
2011.
[6] Wen Zhuo, Zhiguo Cao, and Yang Xiao, “Cloud classi-
fication of ground-based images using texture–structure
features,” Journal of Atmospheric and Oceanic Technol-
ogy, vol. 31, no. 1, pp. 79–92, 2014.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jiten-
dra Malik, “Rich feature hierarchies for accurate object
detection and semantic segmentation,” arXiv preprint
arXiv:1311.2524, 2013.
[8] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek, "Image classification with the fisher vector: Theory and practice," International Journal of Computer Vision, vol. 105, no. 3, pp. 222–245, 2013.
[9] Mircea Cimpoi, Subhransu Maji, and Andrea Vedaldi,
“Deep convolutional filter banks for texture recognition
and segmentation,” arXiv preprint arXiv:1411.6836,
2014.
[10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[11] Karen Simonyan and Andrew Zisserman, “Very deep
convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[12] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," CoRR, vol. abs/1412.4564, 2014.
[13] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and
Andrew Zisserman, “Return of the devil in the details:
Delving deep into convolutional nets,” arXiv preprint
arXiv:1405.3531, 2014.
[14] A. Vedaldi and B. Fulkerson, “VLFeat: An open and
portable library of computer vision algorithms,” http:
//www.vlfeat.org/, 2008.
[15] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008, http://www.csie.ntu.edu.tw/~cjlin/liblinear.