Segmenting Sky Pixels in Images
*Cecilia La Place
Arizona State University
Aisha Urooj Khan
University of Central Florida
University of Central Florida
Outdoor scene parsing models are often trained on ideal
datasets and produce quality results. However, this leads to
a discrepancy when applied to the real world. The quality
of scene parsing, particularly sky classiﬁcation, decreases
in night time images, images involving varying weather
conditions, and scene changes due to seasonal weather.
This project focuses on approaching these challenges by us-
ing a state-of-the-art model in conjunction with non-ideal
datasets: SkyFinder and a subset from the SUN database
containing the Sky object. We focus speciﬁcally on sky seg-
mentation, the task of determining sky and not-sky pixels,
and improving upon an existing state-of-the-art model:
RefineNet. As a result of our efforts, we have seen an
improvement of 10-15% in the average MCR compared to the prior
methods on the SkyFinder dataset. We have also improved
upon an off-the-shelf model in terms of average mIOU by
nearly 35%. Further, we analyze our trained models on
images w.r.t. two aspects: time of day and weather, and find
that in spite of facing the same challenges as prior methods,
our trained models significantly outperform them.
Sky segmentation is a part of scene parsing, in which
algorithms seek to label or identify objects in images.
However, because they are trained on ideal datasets, these
algorithms face difficulty in non-ideal conditions. As deep
learning methods become more involved in real-world
applications, it becomes apparent that off-the-shelf methods
are not always effective, and their reliability comes into question.
Inspired by Mihail et al., who compared existing
sky segmentation methods and sought to bring attention to
the problem of ideal datasets, we focused on approaching
the challenges they mentioned through existing models.
*First two authors contributed equally.
Figure 1. SkyFinder dataset. The top 3 rows show sample images for
different times of day. Each column signifies a different time:
12am, 4am, 8am, 12pm, 4pm, and 8pm, across 3 separate cameras.
Rows 4, 5, and 6 show images across different scenes for different
weather types: clear, cloudy, and fog, respectively, whereas the
last row shows different weather conditions (clear, cloudy, hazy,
rain, and sleet) over the same scene.
The challenges of outdoor scene parsing lie in the time of
day, time of year, and varying weather conditions. Their SkyFinder
dataset allowed us to pursue these challenges in order to
obtain improved results. This work highlights the importance
of this problem. We offer an improved model that will aid
in challenging sky segmentation in the real world. In this
work, we adopted a state-of-the-art segmentation model
for the task of pixel-level detection of the sky. Our task
differs from semantic segmentation in the sense that we are
only interested in one object, i.e., sky. Sky, unlike other
objects, can be difficult to segment due to poor lighting (night
time) and weather conditions where even humans are likely
to fail (e.g., dense fog, thunderstorms, etc.). Thus, we
attempt to address this problem in this work.
arXiv:1712.09161v2 [cs.CV] 8 Jan 2018
Figure 2. SUN-Sky dataset. Sample images from the various
locations described in the SUN dataset with sky.
Our contributions are as follows. First, we evaluated
an off-the-shelf state-of-the-art model  to demonstrate
that existing models fail for different weather conditions,
times of day, and other transient attributes . Second,
we fine-tuned the RefineNet-Res101-Cityscapes model on
the SkyFinder dataset and obtained a 12.26% improvement
over the off-the-shelf model in terms of misclassification rate
(MCR). Third, taking advantage of the existing big dataset,
we trained RefineNet-Res101 solely on SkyFinder, which further improved
accuracy and outperformed all baseline methods. Fourth,
to study the cross-dataset performance of our trained
models, we selected a subset from the SUN  dataset with the
'sky' label. We then both fine-tuned and trained the
RefineNet-Res101 model on this subset of the SUN dataset (we refer to
it as the SUN-Sky dataset in this paper), performed
evaluation across models trained on both the SkyFinder and
SUN-Sky datasets, and report the results. Fifth, following ,
we investigate the effect of time of day, weather
conditions, and transient attributes on the performance of our model.
Sixth, we compare our analysis with  in terms of MCR
and report the impact of weather and time of day w.r.t. mIOU
scores. Finally, we determine the impact of image degradations
such as motion blur and Gaussian noise on our model's
performance and report our results on the robustness of our approach.
The rest of this paper is organized as follows. In sec-
tion 2, we present a brief overview of existing works in this
area. In section 3, we describe details of our plan including
utilized datasets, and our models. Section 4 discusses the
performance metrics we used to evaluate the trained mod-
els. Section 5 presents experimental ﬁndings on Sky seg-
mentation task for both datasets and analysis of different at-
tributes, times of day, weather conditions and noise on this
task, followed by discussions in section 6.
2. Related Work
Scene parsing, i.e., assigning each pixel of an
input image to one of the object labels , has evolved
considerably since its beginnings. Scene labeling methods
mainly use local appearance information for
objects, learned from training data . Although
this task was previously addressed by using hand-engineered
features with a classifier, recent methods address it
by learning features using deep neural networks.
Convolutional networks spawned fully convolutional networks,
which moved away from pixel-level algorithms to
whole-image-at-a-time methods . Afterward, the introduction
of skip layers led to deep residual learning . Recently,
residual nets became the backbone of scene parsing
algorithms such as RefineNet , PSPNet , and more.
Speciﬁcally, the sky segmentation task can be helpful for
a diverse variety of applications such as stylizing images us-
ing sky replacement , obstacle avoidance , and 
since sky tells a lot about weather conditions. Current
applications of sky segmentation range from personal to public
use. They are often used in scene parsing
[5, 17], horizon estimation, and geolocation. Other
applications include weather classification , image editing
[7, 15], weather estimation [21, 14], and more.  and 
worked on weather recognition and used camera as weather
sensors. Weather detection has also been used for image
searching  where one can search outdoor scene images
based on weather related attributes.
Dev et al. used color-based methods  for segmentation
of sky/cloud images, whereas  proposed a deep learning
approach for segmenting sky. Like , we used the same
baseline methods (Hoiem et al. , Tighe et al. , and Lu
et al. ) to compare our results. Hoiem et al. use a single
image to produce an estimate of scene geometry for three
classes: ground, sky, and vertical regions, by learning the
underlying geometry of an image via appearance-based models.
Tighe et al. combine region-level features with SVM-based
sliding window detectors for parsing an image into
different class labels, including 'sky'. Lu et al. classify an input
image into two classes: sunny or cloudy. Their work uses
sky detection as an important cue for weather classification.
For detecting sky, they used random forests to produce seed
patches for sky and non-sky, and then used graph cut to
segment sky regions.  uses their sky detector for
reporting results, which we also used for our comparisons on the
SkyFinder dataset. Here we investigate the effectiveness of
existing state-of-the-art segmentation methods for this
specific problem. To select among the best contenders for this
task, we evaluated the off-the-shelf RefineNet  and PSPNet
 methods on the SkyFinder dataset, and chose RefineNet for our
further experiments as it outperforms PSPNet by a large
margin on the SkyFinder dataset. We further take a closer look
at the challenges (such as weather conditions, night time
images, and noisy images) that arise even when robust
methods are used.
Method                          Split 1             Split 2             Split 3             Avg.
                                mIOU(%)   MCR(%)    mIOU(%)   MCR(%)    mIOU(%)   MCR(%)    mIOU(%)   MCR(%)
Hoiem et al.                    -         21.28     -         20.68     -         26.24     -         22.73
Lu et al.                       -         25.38     -         21.67     -         23.32     -         20.38
Tighe et al.                    -         17.48     -         20.33     -         31.58     -         26.21
RefineNet-Res101-Cityscapes     49.31     17.12     46.31     18.00     49.87     21.68     48.5      18.93
RefineNet-Res101-SkyFinder-FT   79.48     5.17      72.18     7.48      85.2      7.37      78.95     6.67
RefineNet-Res101-SkyFinder      87.07     5.08      73.84     7.00      88.05     5.65      82.99     5.90
Table 1. Results from all three testing splits. MCR results for top 3 baselines are used from the SkyFinder metadata in order to compare our
pixel-level sky detector with their methods. We also report mIOU scores. ReﬁneNet-Res101-Cityscapes is the off-the-shelf model trained
on Cityscapes. ReﬁneNet-Res101-SkyFinder-FT shows results when we ﬁne-tuned ReﬁneNet-Res101-Cityscapes model on SkyFinder
dataset. Finally, training ReﬁneNet-Res101 from scratch on SkyFinder dataset (last row) gives best results. For ReﬁneNet, all these
numbers are reported when the model was evaluated at test scale of 0.8.
Our approach is based on adopting semantic segmentation
methods for the task of pixel-level sky detection. We
used two datasets: SkyFinder  and SUN-Sky  for
our work. Our approach outperforms baseline methods in
the task of sky segmentation. We evaluated the generalization
capability of our models by performing cross-dataset
evaluation. We also studied the influence of various factors such as
transient attributes, weather conditions, and noise on our
models' performance.
3.1.1 SkyFinder dataset
The SkyFinder dataset is a subset of the Archive of Many
Outdoor Scenes (AMOS) dataset. Due to the full SkyFinder
dataset not being available, we used 45 of the 53 cameras
shared. This entailed 60K-70K images, with an average of
1500 images per camera. These images were of varying di-
mensions, quality, time of day, season, and weather. How-
ever, some cameras contained images which indicated the
camera was experiencing technical difﬁculties or repairs.
These few images were removed to focus on the challenges
we wished to address. We then changed the sizes of the im-
ages to introduce a form of uniformity and make test eval-
uation faster by resizing them to be within the ranges of
A single segmentation map was associated with each
camera. This is because each camera had to be stationary for
at least a year in order to be included in the dataset.
See ﬁgure 1 for sample images from this dataset.
3.1.2 SUN dataset
The Scene UNderstanding (SUN) dataset comprises a multitude
of different scenes and the objects that make up said
scenes . It is constantly being updated through community
effort, thus adding to its already large size. We
primarily looked at this dataset for its "sky" object
classification, which allowed us to begin comparing against the SkyFinder
dataset and results. Unlike SkyFinder, images are not
grouped by camera; instead, they are grouped by scene, such
as airport terminal or church. Also, SkyFinder focuses
on images taken from stationary cameras, whereas SUN
has a variety of images from a variety of viewpoints. For
the purposes of this research, we focused on the subsection
of the SUN database that had the object "sky" labeled
(about 20,000 images). We resized those images to be
within the range of 320 × 480 for improved test evaluation
speed. In what follows, we refer to this subset as the
SUN-Sky dataset. See figure 2 for a few sample images from this
dataset.
Figure 3. Improvement in night time images. Column 1
shows original image examples, column 2 shows ground truths, and the
last three columns show results for a) off-the-shelf
RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c)
RefineNet-Res101-SkyFinder.
Figure 4. Improvement in weather obscured images. Column 1
shows original image examples, column 2 shows ground truths, and the
last three columns show results for a) off-the-shelf
RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c)
RefineNet-Res101-SkyFinder.
Method                          mIOU(%)   MCR(%)
RefineNet-Res101-Cityscapes     61.69     8.4
RefineNet-Res101-SUNdb-FT       83.1      3.7
RefineNet-Res101-SUNdb          82.36     4.17
Table 2. Performance results of SUNdb Finetune and SUNdb Im-
3.2. Sky Segmentation
The model we used, RefineNet, was created by Lin et al.
. This model seeks to retain detail throughout the
reconstruction of the image and its segmented output, unlike
its predecessor and backbone, ResNet.
3.2.2 Off-the-Shelf ReﬁneNet
In order to establish a baseline with RefineNet, we used
RefineNet's Res101 model that was trained on Cityscapes, a
dataset of European cities for scene segmentation. The
Cityscapes dataset is an ideal dataset for the Sky class, and
as a result the model does well in ideal conditions in the SkyFinder
dataset. After setting up the model on a single
Titan X GPU, we ran each of the 45 cameras through
the model and evaluated solely on its sky classification
ability, i.e., all other classes were construed as non-sky. We
refer to this baseline model as RefineNet-Res101-Cityscapes
in the following.
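As a minimal sketch of treating all other classes as non-sky, the multi-class label map predicted by a scene parsing model can be reduced to a binary mask. The class index below is an assumption made for illustration (the paper does not state the index used); `to_sky_mask` is a hypothetical helper name.

```python
import numpy as np

# Assumed index of the "sky" class in the model's label set (illustrative only).
SKY_CLASS_ID = 10

def to_sky_mask(label_map, sky_id=SKY_CLASS_ID):
    """Return a boolean mask that is True where the label is sky, False elsewhere."""
    return np.asarray(label_map) == sky_id

# Toy 2x3 label map containing several classes.
pred_labels = np.array([[10, 10, 2],
                        [8, 10, 0]])
sky_mask = to_sky_mask(pred_labels)
print(sky_mask.astype(int))
```

The same reduction can be applied to ground-truth annotations, so that predictions and labels are compared as sky versus not-sky only.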
To obtain a proof of concept prior to running the entire dataset,
we focused on a smaller subset of it. We took
10 random cameras and broke them into a train-val-test split.
From each camera we took 75 images for training, 25
images for validation, and between 175 and 300 images for testing.
Following the success of fine-tuning the model on the
subset, we fine-tuned the RefineNet-Res101-Cityscapes
model on the SkyFinder dataset. Being unable to find the
same train-val-test split as Mihail et al., we split the
dataset into our own train-val-test splits. However, to keep
our experiments as consistent as possible with theirs, we
used the same number of test cameras in our experiments.
Hence, our split consisted of 13 cameras used for testing,
4 cameras used for validation, and the remaining cameras
used for training. We then shuffled the cameras in each
section but kept the same number of cameras for each
training, validation, and testing set, and repeated the
aforementioned fine-tuning process two more times for a total of
three fine-tuning trials. We used a learning rate of 5e-5 for
each instance and trained the model for 10 epochs. After 10
epochs, the validation accuracy started leveling out. Thus,
for consistency, we report all results for models trained for
10 epochs. We refer to our fine-tuned models as
RefineNet-Res101-SkyFinder-FT in our results.
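The camera-level splitting described above can be sketched as follows. This is an illustration under assumptions: the camera identifiers are placeholders, and reshuffling with three different seeds stands in for the shuffling procedure, whose exact details the text does not give.

```python
import random

def split_cameras(camera_ids, n_test=13, n_val=4, seed=0):
    """Shuffle cameras and return (train, val, test) lists of camera ids."""
    rng = random.Random(seed)
    ids = list(camera_ids)
    rng.shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Placeholder ids for the 45 cameras (hypothetical names).
cameras = [f"cam_{i:02d}" for i in range(45)]

# Three splits with the same set sizes, one per trial.
splits = [split_cameras(cameras, seed=s) for s in range(3)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Splitting by camera rather than by image keeps every image of a test camera out of training, which matters here because each camera has a single fixed segmentation map.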
3.2.4 Training with SkyFinder dataset
Finally, to take advantage of the large size of the dataset, we
trained RefineNet-Res101 from scratch: Res101 was
initialized with a pre-trained ImageNet model, but
RefineNet was trained solely on the SkyFinder dataset. We used the
same three train-val-test splits as mentioned above (to allow
for fair comparison) and trained at a learning rate of 5e-4 for
over 10 epochs. Due to the flattening of the learning curve,
we used the model at epoch 10 for testing in both instances.
We refer to these models as RefineNet-Res101-SkyFinder
in our results.
3.2.5 Training with Sun-Sky dataset
We broke down the sky-labeled sub-dataset of the SUN
dataset into a randomly shufﬂed 60-20-20 split for our uses.
After resizing the images to 320 ×240 to train and test
quickly, we then generated the ground truth segmentation
masks by keeping only the sky class and treating any other
classes as non-sky. To train and evaluate, we followed
a process similar to the one used for the SkyFinder dataset.
Training entails fine-tuning from the
RefineNet-Res101-Cityscapes model, and training a model initialized from
ImageNet-Res101. Evaluation consists of calculating
the average MCR and mIOU of the dataset. We
fine-tuned the RefineNet-Res101-Cityscapes model using the
SUN-Sky dataset for 10 epochs at a learning rate of 5e-4, and
subsequently for 10 more epochs at a learning rate of 5e-5. For our
second model, we initialized from ImageNet-Res101 and
trained using the same split as we did for the previous model
for 10 epochs at a learning rate of 5e-4. Much like the
previous model, we again trained for another 10 epochs at a lower
learning rate of 5e-5.
Table 2 shows quantitative results for SUN-Sky dataset
on both ﬁne-tuned model (ReﬁneNet-Res101-SUNdb-FT)
and trained model (ReﬁneNet-Res101-SUNdb). See ﬁg.
5.1.7 for qualitative results on this dataset.
Time of day   Split 1         Split 2         Split 3         Avg.
(hour)        mIOU    MCR     mIOU    MCR     mIOU    MCR     mIOU    MCR
0             82.36   5.54    80.64   6.27    85.37   7.46    82.79   6.42
1             76.28   8.25    75.40   7.69    84.88   7.34    78.85   7.76
2             76.12   9.75    76.26   8.31    86.27   6.75    79.55   8.27
3             70.41   7.21    69.99   6.28    87.96   5.97    76.12   6.48
4             71.66   6.65    69.83   5.71    88.26   5.06    76.58   5.81
5             77.33   5.16    71.04   4.84    90.82   3.09    79.73   4.36
6             76.48   5.06    74.20   5.85    91.74   3.94    80.80   4.95
7             76.66   3.65    71.05   8.71    89.62   5.75    79.11   6.04
8             72.05   3.31    69.53   4.15    93.23   3.46    78.27   3.64
9             72.47   3.06    70.65   3.34    93.93   2.94    79.02   3.12
10            74.54   2.44    73.21   2.89    94.51   2.69    80.75   2.68
11            70.97   2.57    68.90   2.82    94.17   2.89    78.01   2.76
12            74.08   2.61    72.69   2.92    94.57   2.57    80.44   2.70
13            74.04   2.12    71.78   2.61    94.60   2.49    80.14   2.41
14            72.58   2.41    70.38   2.70    94.17   2.81    79.04   2.64
15            73.25   2.40    71.08   2.85    94.19   2.84    79.50   2.70
16            71.69   2.59    69.78   2.86    92.77   3.44    78.08   2.96
17            71.73   2.85    69.63   3.13    91.76   3.73    77.71   3.24
18            70.12   3.33    67.50   3.51    90.41   4.37    76.01   3.73
19            69.00   3.96    66.53   4.42    89.38   5.00    74.97   4.46
20            69.43   5.74    64.92   11.19   85.27   8.08    73.21   8.34
21            67.11   5.70    64.85   6.79    86.80   5.88    72.92   6.12
22            68.27   6.58    64.70   6.09    86.50   5.49    73.16   6.05
23            70.86   7.17    62.64   6.48    83.62   6.89    72.37   6.85
Avg.          72.89   4.77    70.30   6.45    90.20   4.62    77.8    5.25
Table 3. Sky segmentation results break down w.r.t different times
of the day on SkyFinder dataset. Numbers are in percentage.
To evaluate the accuracy of our results we used both the
misclassification rate (MCR) defined in Mihail et al.
and the mean intersection over union (mIOU). Using
both of these metrics allowed us to compare our
results to those found in Mihail et al., which
focused primarily on MCR results, and to determine the
overlap accuracy of the segmentation outputs.
\[ \mathrm{MCR} = \frac{FP + FN}{\text{number of pixels}} \tag{1} \]
\[ \mathrm{mIOU} = \sum \frac{TP}{TP + FP + FN} \tag{2} \]
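The two metrics can be sketched for binary sky masks as below. This is an illustration, not the evaluation code used in the paper: the function names are hypothetical, IoU is computed for the sky class on a single image, and the value returned when both masks contain no sky is an assumption, since the formulas above do not cover that case.

```python
import numpy as np

def mcr(pred, gt):
    """Misclassification rate: (FP + FN) / number of pixels."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return float(np.mean(pred != gt))

def sky_iou(pred, gt, empty_value=1.0):
    """Intersection over union for the sky class: TP / (TP + FP + FN)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention when neither mask contains sky.
        return empty_value
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: 6 pixels, TP = 1, FP = 1, FN = 0.
pred = np.array([[1, 1, 0], [0, 0, 0]])
gt = np.array([[1, 0, 0], [0, 0, 0]])
print(mcr(pred, gt), sky_iou(pred, gt))
```

A dataset-level score can then be obtained by averaging these per-image values, which is again an assumption about how the aggregation is done.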
In this section, we discuss our experiments for sky seg-
mentation, and studying impact of different conditions like
times of day, weather situation, transient attributes and
noise on performance of our model. Please note that, al-
though  have done a similar study, we extend their work
and also report mIOU in our experiments.
5.1. SkyFinder dataset
We evaluated RefineNet on the SkyFinder dataset
in three settings: off-the-shelf, fine-tuned, and
trained from scratch. For reference, RefineNet-Res101-Cityscapes
denotes the off-the-shelf RefineNet model,
initialized from Res101 and trained on Cityscapes. We then
further fine-tuned the same off-the-shelf model on the SkyFinder
dataset for all three splits and refer to the resulting model as
RefineNet-Res101-SkyFinder-FT. RefineNet
initialized from ImageNet pre-trained Res101 and trained on
SkyFinder from scratch is referred to as
RefineNet-Res101-SkyFinder. We find that
RefineNet-Res101-SkyFinder outperforms the fine-tuned model, which
we attribute to the former taking advantage of the large
size of this dataset. For comparison with the other baseline
models (Hoiem et al., Lu et al., and Tighe et al.) mentioned
in , we used the MCR scores for all three test splits and
also report the average performance for all methods in table
1. To obtain MCR scores for our baselines on our test
sets, we fetch the MCR scores from the metadata provided with
the SkyFinder dataset, which records the results of evaluating these models.
For results in table 1, we evaluated ReﬁneNet at test scale
of 0.8 (default setting), which performs better than being
evaluated at full scale (scale = 1.0). Please refer to tables
1 and 3 for comparison. We ﬁnd that ReﬁneNet-Res101-
SkyFinder outperforms all baseline methods both in terms
of mIOU and MCR scores. For qualitative evaluation, the
images are selected from a few cameras in the first test set
and do not include visual results from Hoiem et al., Lu et
al., or Tighe et al. While Mihail et al. created their own
ensemble using a combination of the three methods, their
results were not reported for individual images.
Analysis on time of day and weather has been performed
with full scale test evaluation.
5.1.1 Performance on Camera 10917
First, some background on SkyFinder's Camera 10917: this
specific camera is of a location in which the sky does not
peek through anywhere. It depicts a quaint village and trees,
but no sky. When RefineNet-Res101-SkyFinder was trained on the first two
splits, its training data did not contain this camera. As
a result, we consistently witnessed IOU values of
0, which degrades performance for test splits
1 and 2 because the model has not previously seen images
with no sky. However, the model performs decently in terms
of MCR values (below 10%). We believe that the IOU
evaluation method is uninformative in a case such as this.
Therefore, we pay particular attention to the MCR with regard to
this camera. For split 3, this camera has been used for
training our models, thus the model performs better on the test set.
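A small worked example shows why IOU collapses to 0 on a camera with no visible sky while MCR stays low. The 10x10 image and the 5 falsely predicted sky pixels are made-up values for illustration.

```python
import numpy as np

# Ground truth with no sky at all, as for a camera where no sky is visible.
gt = np.zeros((10, 10), dtype=bool)

# Prediction that wrongly marks 5 of the 100 pixels as sky.
pred = gt.copy()
pred[0, :5] = True

tp = np.logical_and(pred, gt).sum()   # sky predicted and present
fp = np.logical_and(pred, ~gt).sum()  # sky predicted but absent
fn = np.logical_and(~pred, gt).sum()  # sky present but missed

iou = tp / (tp + fp + fn)    # no true positives are possible
mcr = (fp + fn) / gt.size    # only the false positives count

print(iou, mcr)
```

Because the ground truth has no sky pixels, any false positive drives the sky IOU to 0 regardless of how few pixels are wrong, whereas MCR only reflects the fraction of misclassified pixels.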
5.1.2 Performance for night time
Fig. 3 shows the visual improvement in results for night time
images when compared with the off-the-shelf
RefineNet-Res101-Cityscapes and the fine-tuned RefineNet-Res101-SkyFinder-FT,
which helps explain why our trained models also win over
the baseline methods in terms of MCR.
Figure 5. Performance analysis of ReﬁneNet-Res101-SkyFinder in
terms of mean intersection over union w.r.t row1) time of day, and
row2) Weather conditions.
Weather         Split 1         Split 2         Split 3         Avg.
                mIOU    MCR     mIOU    MCR     mIOU    MCR     mIOU    MCR
clear           59.88   2.96    66.48   3.54    92.61   4.21    72.99   3.57
cloudy          79.92   3.61    77.02   4.19    90.71   4.07    82.55   3.96
fog             77.82   7.18    71.72   7.75    86.22   5.58    78.59   6.83
hazy            75.74   6.72    72.32   6.71    89.31   4.49    79.12   5.97
mostly cloudy   73.12   3.44    69.10   4.89    90.40   4.06    77.54   4.13
partly cloudy   77.46   4.18    75.79   6.10    89.98   4.61    81.08   4.96
rain            61.88   3.94    58.00   4.32    89.79   4.65    69.89   4.30
sleet           47.66   4.17    50.00   4.75    92.57   4.51    63.41   4.48
snow            29.50   1.42    33.70   4.48    89.03   6.83    50.74   4.24
tstorms         84.50   3.88    80.75   3.32    89.09   4.58    84.78   3.93
unknown         83.36   5.76    68.27   5.29    86.17   3.84    79.27   4.96
blanks          83.09   7.08    83.84   6.38    78.93   8.94    81.95   7.47
Avg.            69.49   4.82    67.25   6.77    88.73   5.03    75.16   5.54
Table 4. Sky segmentation results break down w.r.t weather on
SkyFinder dataset. Numbers are in percentage.
5.1.3 Performance for weather obscured images
In section 5.1.5, our results suggest that sky segmentation
at night is more challenging than during daytime
hours. However, our final trained model on SkyFinder still
improves in terms of mIOU over our established baseline
models (RefineNet-Res101-Cityscapes and
RefineNet-Res101-SkyFinder-FT). Fig. 4 shows that even for images
obscured by dense fog, RefineNet-Res101-SkyFinder was
able to perform reasonably well.
Figure 6. Illustration for performance analysis on time of day. The
x-axis shows the hour of the day and the y-axis represents the
misclassification rate (MCR) averaged over all three test splits we used in our experiment.
Figure 7. Qualitative results for performance analysis w.r.t. times
of day on the SkyFinder dataset using RefineNet-Res101-SkyFinder.
Each example shows the input image, ground truth (GT), and prediction (Pred.).
The first row shows success cases (mIOU > 0.8) for hours 0 (night),
12 (noon), and 18 (evening), and the last row shows failure cases
(mIOU < 0.5) for hours 6 and 0. Our model never failed for images around noon.
5.1.4 Performance analysis for different day times
We also wanted to see how the trained model performed
during different times of day, similar to . As we use
three test splits for reporting our results, we sorted each
of them w.r.t. hour of the day. Then, for each sorted split,
we evaluate the respective model on its test set and
compute mIOU and MCR. We then calculate the average over them
(see table 3). Please note that while evaluating
RefineNet-Res101-SkyFinder on each of its respective test sets, we
used scale = 1.0 (i.e., full scale at test time). We find
a pattern similar to the one discussed in  for
RefineNet as well, i.e., the model achieves good performance
during daytime in terms of MCR, and performance
decreases at the start (early morning) and end of the day (night
time), where MCR increases. In terms of mIOU, we witness
a decrease in performance towards the end of the day, but
overall the performance seems consistent. Fig. 5
illustrates mIOU during different hours of the day. To compare
RefineNet with the results in the baseline paper in this
respect, we fetched MCR scores for each split w.r.t. hour and
report results provided by . Fig. 6 shows that although
our trained model follows the same pattern, our
pixel-level sky detector performs significantly better than all three
baselines.
Figure 8. Illustration of performance analysis on different weather
conditions. The x-axis shows the weather conditions present
in the SkyFinder dataset and the y-axis represents the misclassification
rate (MCR) averaged over all three test splits we used in our experiment.
Please see figure 7 for qualitative results when we tested
RefineNet-Res101-SkyFinder on the test split sorted w.r.t. time.
Row 1 shows success cases, whereas row 2 shows failure cases
for different times of day.
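The per-hour breakdown can be sketched as a group-and-average step followed by an unweighted mean over splits, which matches how the Avg. column of table 3 relates to the three split columns. The record layout, function names, and numbers below are hypothetical; the same grouping would apply to weather labels.

```python
from collections import defaultdict

def mean_by_key(records, key, metric):
    """Average `metric` over all records that share the same value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec[metric]
        counts[rec[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}

def average_over_splits(per_split):
    """Unweighted mean of per-split averages for every group present."""
    groups = set().union(*(d.keys() for d in per_split))
    result = {}
    for g in sorted(groups):
        values = [d[g] for d in per_split if g in d]
        result[g] = sum(values) / len(values)
    return result

# Made-up per-image MCR values for two test splits.
split_a = [{"hour": 0, "mcr": 0.06}, {"hour": 0, "mcr": 0.08}, {"hour": 12, "mcr": 0.02}]
split_b = [{"hour": 0, "mcr": 0.05}, {"hour": 12, "mcr": 0.04}]

per_split = [mean_by_key(s, "hour", "mcr") for s in (split_a, split_b)]
by_hour = average_over_splits(per_split)
print(by_hour)
```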
5.1.5 Performance analysis for weather conditions
Like , we aim to evaluate how well a semantic segmentation
model trained as a sky classifier performs under
different weather conditions. We split the test set with
respect to the weather conditions provided in the
metadata of the SkyFinder dataset and evaluated
RefineNet-Res101-SkyFinder on them. We ran the same experiment for all
three test splits and then computed the average scores for
both mIOU and MCR (see table 4 for quantitative results).
Looking at the average mIOU scores, we find that our model
struggles for a few weather conditions like snow and sleet.
On the other hand, our model performs better when the
weather is cloudy, partly cloudy, or foggy. Interestingly, the
model is able to perform really well during thunderstorms,
both in terms of mIOU and MCR.
The SkyFinder dataset has some images without labels for
weather conditions; we evaluated our model on them and
report the results. We observe that mIOU is lower than
expected for clear sky. This is because a big
share of clear sky images (1200+) are from camera 10917,
where there is no sky visible, which lowers the mIOU at
test time for splits 1 and 2. The model for split 3 has been trained on such
images and therefore learns to tell when there is no sky in
the image (92.61% mIOU for clear sky). Fig. 5 shows mIOU
scores averaged over all three splits for
RefineNet-Res101-SkyFinder. Fig. 8 compares MCR scores across weather conditions for
different methods, and RefineNet significantly outperforms
them. For calculating MCR for baseline methods, we use
the MCR scores from SkyFinder metadata. This figure
reports the average MCR score over all test splits. Please see
figure 9 for qualitative results.
Figure 9. Qualitative results for performance analysis w.r.t. weather
on the SkyFinder dataset using RefineNet-Res101-SkyFinder. The first
four rows show success cases (mIOU > 0.8) for each weather type,
and the last row shows images where the model fails (mIOU < 0.5).
Success cases in the first 4 rows follow the weather order: clear, cloudy,
fog, hazy, misc, mostly cloudy, partly cloudy, rain, sleet, snow,
tstorms, and unknown, respectively. For each weather type, we
show the input image, ground truth, and prediction, respectively. The last
row shows failure cases where the weather type was: tstorms, clear,
5.1.6 Performance analysis w.r.t transient attributes
Inspired by , we aimed to investigate how much
transient attributes affect the performance of our trained model.
Hence, we selected images from one of our test splits for
four transient attributes: gloomy, clouds, cold, and night.
We selected images with a high presence of these attributes,
i.e., thresholded above 0.8, and images with a low chance of their
presence, i.e., a value less than 0.2. We then tested
RefineNet-Res101-SkyFinder on these subsets of images (2 for each
attribute: low and high) and report the performance.
As expected, the model performs well for the images where
the sky is not gloomy, cloudy, cold, or night time. The model
is most robust to clouds and performs well (both in terms
of mIOU and MCR) even when clouds are largely present
in the image. In terms of MCR, the model's performance is
worst when we input a highly gloomy image (MCR = 10.84)
or when it is night time (MCR = 10.25). In terms of mIOU,
our trained model performs slightly better for gloomy
images than night time images (see table 5, last row). We also
report MCR scores from baseline methods for comparison
and find that Tighe et al. performs worst among all of the
baselines in high presence of the above-mentioned transient
attributes. Lu et al. seems the most robust among the baselines, but
our method significantly outperforms all of them in this
experiment (see fig. 10). Figure 11 shows the histogram of
MCR scores when each of these four transient attributes is
high and low in the input images. See figure 12 for
qualitative results of this experiment.
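Selecting the high-presence and low-presence subsets can be sketched as a simple threshold filter. The record layout and the helper name are hypothetical; the thresholds follow the 0.8 and 0.2 values stated above, using the inclusive comparisons shown in table 5.

```python
def split_by_attribute(images, attribute, high=0.8, low=0.2):
    """Return (high_subset, low_subset) for one transient attribute score."""
    high_subset = [im for im in images if im["attributes"][attribute] >= high]
    low_subset = [im for im in images if im["attributes"][attribute] <= low]
    return high_subset, low_subset

# Made-up images with a per-image "night" score in [0, 1].
images = [
    {"name": "a", "attributes": {"night": 0.95}},
    {"name": "b", "attributes": {"night": 0.50}},
    {"name": "c", "attributes": {"night": 0.10}},
]

night_high, night_low = split_by_attribute(images, "night")
print([im["name"] for im in night_high], [im["name"] for im in night_low])
```

Images with intermediate scores fall into neither subset, so the two groups contrast clear presence with clear absence of the attribute.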
                        gloomy                      clouds                      cold                        night
                        >=0.8         <=0.2         >=0.8         <=0.2         >=0.8         <=0.2         >=0.8         <=0.2
Method                  mIOU   MCR    mIOU   MCR    mIOU   MCR    mIOU   MCR    mIOU   MCR    mIOU   MCR    mIOU   MCR    mIOU   MCR
Hoiem et al.            -      40.70  -      19.70  -      18.47  -      18.39  -      24.93  -      19.23  -      41.57  -      18.46
Lu et al.               -      36.42  -      14.88  -      12.74  -      8.35   -      16.44  -      14.98  -      36.60  -      10.03
Tighe et al.            -      50.53  -      13.36  -      56.21  -      23.44  -      50.65  -      17.83  -      48.14  -      31.05
RNet-Res101-SkyFinder   84.16  10.84  92.97  1.94   93.09  4.35   93.65  2.93   90.84  5.35   92.19  2.90   83.26  10.25  94.36  2.72
Table 5. Performance analysis on transient attributes (high thresholded and low thresholded values).
Figure 10. Comparison of ReﬁneNet-Res101-SkyFinder with baseline methods (Tighe et al., Hoiem et al., and Lu et al.) in terms of
MCR given high values(>=0.8) and low values(<=0.2) of transient features(gloomy, clouds, cold, and night). Row 1 shows performance
comparison when each of four attributes has high value i.e., thresholded above 0.8, whereas row 2 shows results when very low value is
observed for each attribute.
5.1.7 Performance analysis w.r.t noisy images
We are also interested in investigating the impact of noise
on the task of sky segmentation. We conducted this
experiment since images can be noisy for various reasons
in real-world settings as well. For this purpose, we selected
50 images for each camera from one of the test splits (650
images in total) and added different types of noise to them:
motion blur, Gaussian noise, Poisson noise, salt &
pepper noise, and speckle noise. We first
evaluated performance without adding any noise and then
compared our results with those on noisy images (see table 6). We
find that our trained model is robust to Poisson noise and
highly sensitive to salt & pepper noise. Motion blur also
affects mIOU (a drop of approx. 3.5%). Similarly,
Gaussian noise and speckle noise also hurt performance (both in
terms of mIOU and MCR). See figure 13 for sample images
after adding different types of noise to input images.
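The five perturbations can be sketched with NumPy as below. This is an illustration under assumptions: the paper does not state the noise parameters or the tooling used, so the strengths (`sigma`, `amount`, `peak`, `length`) and the horizontal box kernel for motion blur are placeholder choices, and images are assumed to be float arrays in [0, 1] with shape (H, W, 3).

```python
import numpy as np

def gaussian_noise(img, rng, sigma=0.1):
    """Additive zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def poisson_noise(img, rng, peak=255.0):
    """Signal-dependent Poisson (shot) noise."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)

def salt_and_pepper(img, rng, amount=0.05):
    """Set a random fraction of pixels to black or white."""
    out = img.copy()
    u = rng.random(img.shape[:2])
    out[u < amount / 2] = 0.0        # pepper
    out[u > 1 - amount / 2] = 1.0    # salt
    return out

def speckle_noise(img, rng, sigma=0.1):
    """Multiplicative noise: img + img * n, with n Gaussian."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def motion_blur(img, length=9):
    """Horizontal motion blur using a box kernel of odd `length`."""
    pad = length // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for k in range(length):
        out += padded[:, k:k + img.shape[1], :] / length
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))  # stand-in for a real input image
noisy = {
    "motion_blur": motion_blur(image),
    "gaussian": gaussian_noise(image, rng),
    "poisson": poisson_noise(image, rng),
    "salt_pepper": salt_and_pepper(image, rng),
    "speckle": speckle_noise(image, rng),
}
```

Each perturbed image keeps the original shape, so the same ground-truth mask can be reused when measuring mIOU and MCR on the noisy inputs.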
        Original  Motion Blur  Gaussian  Poisson  Salt & Pepper  Speckle
mIOU    89.92     86.31        85.26     88.67    81.77          84.70
MCR     5.08      6.65         7.60      5.73     9.32           7.91
Table 6. Performance analysis on the SkyFinder dataset when the input is a noisy image. Column 1 shows results on images without any noise, whereas the remaining columns show results after adding motion blur, Gaussian noise, Poisson noise, salt & pepper noise, and speckle noise to the images, respectively.
Figure 11. Frequency distribution of MCR with RefineNet-Res101-SkyFinder given transient features (gloomy, clouds, cold, and night) using low and high threshold values (0.2 and 0.8, respectively).
Figure 12. Qualitative results for absence (below a threshold, i.e., <=0.2) and presence (above a threshold, i.e., >=0.8) of transient attributes (gloomy, clouds, cold, and night) on the SkyFinder dataset when using the RefineNet-Res101-SkyFinder model. Row 1 shows selected examples for clouds with threshold <=0.2, row 2 shows results for clouds using the high threshold >=0.8, rows 3 and 4 show results for gloomy, rows 5 and 6 show results for cold, and the last two rows show results for night, following the same order as for the first two rows, i.e., absence and presence of each transient attribute, respectively. (Panel columns: Input image, GT, Prediction, repeated twice.)
5.2. SUN-Sky dataset
The SUN-Sky dataset is a subset of images from the SUN dataset of outdoor scenes in which sky is present. Following the approach we used for training on the SkyFinder dataset, we first evaluate the RefineNet-Res101-Cityscapes model on the SUN-Sky dataset, then fine-tune the same model on this dataset (referred to as RefineNet-Res101-SUNdb-FT), and lastly train RefineNet-Res101 on the SUN-Sky dataset from scratch (referred to as RefineNet-Res101-SUNdb). We randomly split the images in a 60%-20%-20% ratio for training, validation, and testing, respectively. When we evaluated these three models on our test set, we found that the off-the-shelf model performs better on this dataset (mIOU = 61.69, MCR = 8.4) than on the SkyFinder dataset (mIOU = 48.5, MCR = 18.93), which suggests that SkyFinder is more challenging in nature than the SUN-Sky dataset. Interestingly, the model fine-tuned on this dataset outperforms the model trained solely on the SUN-Sky dataset by a slight margin (both in terms of mIOU and MCR), mainly because the SUN-Sky dataset is not big enough in size. Overall, we find that both the fine-tuned and the trained models perform reasonably well on this dataset compared to the off-the-shelf RefineNet-Res101-Cityscapes model (see table 3.1.2).
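The random 60%-20%-20% split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the fixed seed, and the file names are all assumptions.

```python
import random

def split_dataset(items, seed=0, train_pct=60, val_pct=20):
    """Shuffle the items and cut them into train/validation/test parts.
    Whatever is left after train and validation becomes the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical file names standing in for the SUN-Sky images.
names = [f"img_{i:04d}.jpg" for i in range(100)]
train, val, test = split_dataset(names)
```

Integer arithmetic is used for the cut points so the part sizes do not depend on floating-point rounding.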
5.3. Cross datasets evaluation
To assess the generalization power of our models, we evaluated them across datasets, i.e., models trained on SkyFinder
Figure 13. Some example images for different types of noise added to the selected subset of images from the SkyFinder dataset. (Panel columns: Image, Motion blur, Gaussian, Poisson, Salt & Pepper, Speckle.)
Figure 14. Segmentation improvement across the RefineNet-Res101-Cityscapes model (a), the RefineNet-Res101-SUN-FT model (b), and the RefineNet-Res101-SUN model (c). (Panel columns: Orig., Gt., a, b, c.)
Datasets              SkyFinder             SUN-Sky
                      mIOU(%)   MCR(%)      mIOU(%)   MCR(%)
RNet-SkyFinder-FT     79.00     6.69        71.70     6.67
RNet-SkyFinder        83.00     5.89        71.57     7.24
RNet-SUNdb-FT         71.49     11.74       83.10     3.70
RNet-SUNdb            70.42     12.37       82.36     4.17
Table 7. Sky segmentation results on two datasets, with models trained on one dataset and tested on the other. Numbers in bold text show the best results on a particular dataset, whereas blue font shows the best results for that dataset during cross-dataset evaluation. Numbers are percentages.
were evaluated on the SUN-Sky dataset and vice versa. Table 7 shows results for both the fine-tuned and the trained models on each dataset and across datasets. Looking at the results, we find that SkyFinder, due to its large size, gives better performance when the model is trained on it from scratch. Interestingly, our fine-tuned models perform better than those trained from scratch when evaluated across datasets. In terms of MCR, the model fine-tuned on SkyFinder generalizes better than the model fine-tuned on the SUN-Sky dataset (6.67 vs. 11.74). Note that, for the SkyFinder dataset, the results shown for all models are averaged over all test splits.
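For readers who want to reproduce the two columns reported per dataset, the sketch below computes both metrics for binary masks. It assumes that MCR is the per-pixel misclassification rate and that mIOU is the intersection-over-union averaged over the sky and not-sky classes; the exact averaging used in the paper may differ, and `evaluate` and `predict` are hypothetical names.

```python
import numpy as np

def mcr(pred, gt):
    """Misclassification rate in percent: share of pixels labelled wrongly."""
    return 100.0 * float(np.mean(pred != gt))

def miou(pred, gt):
    """Mean IoU in percent over the classes not-sky (0) and sky (1).
    A class absent from both masks is skipped."""
    ious = []
    for c in (0, 1):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return 100.0 * float(np.mean(ious))

def evaluate(predict, samples):
    """Average both metrics over (image, gt_mask) pairs for one model.
    `predict` maps an image to a binary mask of the same height and width."""
    mious, mcrs = [], []
    for image, gt in samples:
        pred = predict(image)
        mious.append(miou(pred, gt))
        mcrs.append(mcr(pred, gt))
    return float(np.mean(mious)), float(np.mean(mcrs))

# Toy 2x2 example: one of the two sky pixels is missed.
gt_mask = np.array([[1, 1], [0, 0]])
pred_mask = np.array([[1, 0], [0, 0]])
toy_mcr = mcr(pred_mask, gt_mask)
toy_miou = miou(pred_mask, gt_mask)
```

Cross-dataset evaluation then amounts to calling `evaluate` with a model trained on one dataset and the test samples of the other.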
6. Discussion and Conclusion
We first direct our focus to the original results of RefineNet's Res101 model trained on Cityscapes in Table 1. Both results reflect the pretrained model's inability to properly handle non-ideal images, particularly night images, as addressed earlier. Upon fine-tuning the model on the SkyFinder dataset, we note a drastic improvement in the mIOU and a decrease of just over 10% in the MCR. This is clearly a result of including the non-ideal images. Upon training it on the SkyFinder dataset, we see a less drastic improvement in one split, but an improvement nonetheless.
Despite the averages in Table 1, focusing on a single camera in one of the data splits reveals a drastic improvement. Camera 858 consists of 200 images and was not seen by the fine-tuned or the ImageNet-initialized model during training. The model trained on Cityscapes had an mIOU of 69.21% and an MCR of 18.16%. After fine-tuning the model, however, the mIOU and MCR improved to 91.41% and 4.69%, respectively. Finally, after training on only 28 cameras of the data (again, not including camera 858) and initializing from ImageNet, the results improved slightly further: the mIOU increased to 95.76% and the MCR dropped to 2.30%. Other cameras show similar rates of improvement. There are some cameras, however, that prove difficult to segment for all methods and show a smaller rate of improvement.
The importance of these results lies in the ability to use
these models in real-world applications. Using the original
pre-trained model would result in poor quality segmentation
outside of the ideal circumstances. Off-the-shelf methods
must be modiﬁed in order to be used most effectively in the
To put these results in context, we also incorporate the results that Mihail et al. report for their baseline methods. Their own model, which reported an MCR of 12.96% on their own testing split, has not been evaluated in terms of MCR on our test splits. In spite of this, our results further support the idea that existing models are still effective, so long as they have been modified to suit the task's needs.
We also demonstrated that, overall, even state-of-the-art models struggle with challenging conditions such as night time, variation in weather, and other transient attributes. Nevertheless, our trained models still perform much better than prior methods in terms of MCR.
Following this work, we intend to look at other scene parsing models: to evaluate them off-the-shelf, fine-tune them, and possibly train them on the SkyFinder dataset in order to compare the results to the above. We also plan to develop our own end-to-end sky segmentation model to use for comparison. To possibly improve general results, the dataset may need cleaning, such as removing timestamps. Other future directions include using sky segmentation for applications such as weather classification and weather forecasting. Code, our trained models, and data will be made available for further exploration of this area.
References
W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Camera as weather sensor: Estimating weather information from single images. Journal of Visual Communication and Image Representation,
G. De Croon, C. De Wagter, B. Remes, and R. Ruijsink. Sky segmentation approach to obstacle avoidance. In Aerospace Conference, 2011 IEEE, pages 1–16. IEEE, 2011.
S. Dev, Y. H. Lee, and S. Winkler. Color-based segmentation of sky/cloud images from ground-based cameras. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1):231–242, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2016.
D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image, 2005.
D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 654–661. IEEE, 2005.
P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor
G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmenta-
C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1972–1979. IEEE, 2009.
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation, 2015.
C. Lu, D. Lin, J. Jia, and C. K. Tang. Two-class weather
R. P. Mihail, S. Workman, Z. Bessinger, and N. Jacobs. Sky segmentation in the wild: An empirical study. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–6. IEEE, 2016.
M. Roser and F. Moosmann. Classification of weather situations on single color images. In Intelligent Vehicles Symposium, 2008 IEEE, pages 798–803. IEEE, 2008.
L. Tao, Y. Luo, and X. Zheng. Weather recognition based on images captured by vision system in vehicle, 2009.
L. Tao, L. Yuan, and J. Sun. Skyfinder: attribute-based sky image search. In ACM Transactions on Graphics (TOG), volume 28, page 68. ACM, 2009.
J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. Computer Vision–ECCV 2010, pages 352–365, 2010.
J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels, 2010.
J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3001–3008, 2013.
J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3748–3755, 2014.
Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M.-H. Yang. Sky is not the limit: Semantic-aware sky replacement. ACM Transactions on Graphics (TOG), 35(4):149, 2016.
Wei-Ta Chu, Xiang-You Zheng, and D.-S. Ding. Camera as weather sensor: Estimating weather information from single images. Journal of Visual Communication and Image Representation, 46(1):233–249, 2017.
J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo,
J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
X. Yan, Y. Luo, and X. Zheng. Weather recognition based on images captured by vision system in vehicle. Advances in Neural Networks–ISNN 2009, pages 390–398, 2009.
A. P. Yazdanpanah, E. E. Regentova, A. K. Mandava, T. Ahmad, and G. Bebis. Sky segmentation by fusing clustering with neural networks. In International Symposium on Visual Computing, pages 663–672. Springer, 2013.
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network, 2017.