Segmenting Sky Pixels in Images
Cecilia La Place*
Arizona State University
cecilia.laplace@asu.edu

Aisha Urooj Khan*
University of Central Florida
aishaurooj@gmail.com

Ali Borji
University of Central Florida
aliborji@gmail.com

*First two authors contributed equally.
Abstract
Outdoor scene parsing models are often trained on ideal datasets and produce quality results. However, this leads to a discrepancy when applied to the real world. The quality of scene parsing, particularly sky classification, decreases in night time images, images involving varying weather conditions, and scene changes due to seasonal weather. This project focuses on approaching these challenges by using a state-of-the-art model in conjunction with non-ideal datasets: SkyFinder and a subset from the SUN database containing the Sky object. We focus specifically on sky segmentation, the task of determining sky and not-sky pixels, and improving upon an existing state-of-the-art model: RefineNet. As a result of our efforts, we have seen an improvement of 10-15% in the average MCR compared to the prior methods on the SkyFinder dataset. We have also improved on an off-the-shelf model in terms of average mIOU by nearly 35%. Further, we analyze our trained models on images w.r.t. two aspects, times of day and weather, and find that in spite of facing the same challenges as prior methods, our trained models significantly outperform them.
1. Introduction
Sky segmentation is a part of the scene parsing world in
which algorithms seek to label or identify objects in images.
However, due to being trained on ideal datasets, these algo-
rithms face difficulty in non-ideal conditions [12]. As deep
learning methods become more involved in real-world applications, it becomes apparent that off-the-shelf methods are not always effective, and their reliability comes into question. Inspired by Mihail et al. [12], who compared existing sky segmentation methods and sought to bring attention to the problem of ideal datasets, we focused on approaching the challenges they mentioned through existing models.
The challenges of outdoor scene parsing lie in the time of
day, year, and varying weather conditions. Their SkyFinder
dataset allowed us to pursue these challenges in order to obtain improved results.

Figure 1. SkyFinder dataset. Top 3 rows show sample images for different times of day. Each column signifies a different time: 12 am, 4 am, 8 am, 12 pm, 4 pm, and 8 pm, across 3 separate cameras. Rows 4, 5, and 6 show images across different scenes for different weather types: clear, cloudy, and fog respectively, whereas the last row shows different weather conditions (clear, cloudy, hazy, rain, and sleet) over the same scene.

This work highlights the importance
of this problem. We offer an improved model that will aid
in challenging sky segmentation in the real world. In this
work, we adopted a state-of-the-art segmentation model [8] for the task of pixel-level detection of the sky. Our task differs from semantic segmentation in the sense that we are only interested in one object, i.e., the sky. The sky, unlike other objects, can be difficult to segment due to poor lighting (night time) and weather conditions where even humans are likely to fail (e.g., dense fog, thunderstorms, etc.). Thus, we attempt to address this problem in this work.
Figure 2. SUN-Sky dataset. Sample images from the various locations described in the SUN dataset with sky.

Our contributions are as follows. First, we evaluated
an off-the-shelf state-of-the-art model [8] to demonstrate
that existing models fail for different weather conditions,
times of day, and other transient attributes [12]. Second, we fine-tuned the RefineNet-Res101-Cityscapes model on the SkyFinder dataset and obtained a 12.26% improvement over the off-the-shelf model in terms of misclassification rate (MCR). Third, taking advantage of this existing big dataset, we trained [8] solely on SkyFinder, which further improved accuracy and outperformed all baseline methods. Fourth, to study the cross-dataset performance of our trained models, we selected a subset of the SUN [23] dataset with the 'sky' label. We then both fine-tuned and trained the RefineNet-Res101 model on this subset (we refer to it as the SUN-Sky dataset in this paper), performed evaluation across models trained on both the SkyFinder and SUN-Sky datasets, and report the results. Fifth, following [12], we investigate the effect of times of day and weather conditions on the performance of our model, as well as transient attributes. Sixth, we compare our analysis with [12] in terms of MCR and report the impact of weather and times of day w.r.t. mIOU scores. Finally, we determine the impact of noisy images (e.g., motion blur, Gaussian noise) on our model's performance and report results on the robustness of our approach.
The rest of this paper is organized as follows. In section 2, we present a brief overview of existing works in this area. In section 3, we describe the details of our approach, including the datasets used and our models. Section 4 discusses the performance metrics we used to evaluate the trained models. Section 5 presents experimental findings on the sky segmentation task for both datasets, along with an analysis of the effect of transient attributes, times of day, weather conditions, and noise on this task, followed by discussion in section 6.
2. Related Work
The history of scene parsing, i.e., assigning each pixel of an input image to one of the object labels [12], has evolved considerably. Scene labeling methods [6], [9], [16], [18], [19] mainly use local appearance information for objects, learned from training data [12]. Although this task was previously addressed using hand-engineered features with a classifier, recent methods learn features using deep neural networks. Convolutional networks spawned fully convolutional networks, which moved away from pixel-level algorithms to whole-image-at-a-time methods [10]. Afterward, the introduction of skip layers led to deep residual learning [4]. Recently, residual nets became the backbone of scene parsing algorithms such as RefineNet [8], PSPNet [26], and more.
Specifically, the sky segmentation task can be helpful for a diverse variety of applications, such as stylizing images using sky replacement [20], obstacle avoidance [2], and weather classification [13], since the sky tells a lot about weather conditions. Current applications of sky segmentation range from personal to public use and more. Oftentimes they are used in scene parsing [5, 17], horizon estimation, and geolocation. Other applications include weather classification [11], image editing [7, 15], weather estimation [21, 14], and more. [1] and [24] worked on weather recognition and used cameras as weather sensors. Weather detection has also been used for image search [15], where one can search outdoor scene images based on weather-related attributes.
Dev et al. used color-based methods [3] for segmentation of sky/cloud images, whereas [25] proposed a deep learning approach for segmenting sky. Like [12], we used the same baseline methods (Hoiem et al. [5], Tighe et al. [17], and Lu et al. [11]) to compare our results. Hoiem et al. use a single image to produce an estimate of scene geometry for three classes (ground, sky, and vertical regions) by learning the underlying geometry of an image via appearance-based models. Tighe et al. combine region-level features with SVM-based sliding window detectors for parsing an image into different class labels, including 'sky'. Lu et al. classify an input image into two classes: sunny or cloudy. Their work uses sky detection as an important cue for weather classification. For detecting sky, they used random forests to produce seed patches for sky and non-sky, and then used graph cut to segment sky regions. [12] uses their sky detector for reporting results, which we also used for our comparisons on the SkyFinder dataset. Here we investigate the effectiveness of existing state-of-the-art segmentation methods for this specific problem. To select among the best contenders for this task, we evaluated the off-the-shelf RefineNet [8] and PSPNet [26] methods on the SkyFinder dataset, and chose [8] for our further experiments as it outperforms [26] by a large margin on the SkyFinder dataset. We further take a deeper look at the challenges (such as weather conditions, night time images, and noisy images) that are faced even when robust methods are used.
Method                         Split 1          Split 2          Split 3          Avg.
                               mIOU(%) MCR(%)   mIOU(%) MCR(%)   mIOU(%) MCR(%)   mIOU(%) MCR(%)
Hoiem et al.                   -       21.28    -       20.68    -       26.24    -       22.73
Lu et al.                      -       25.38    -       21.67    -       23.32    -       20.38
Tighe et al.                   -       17.48    -       20.33    -       31.58    -       26.21
RefineNet-Res101-Cityscapes    49.31   17.12    46.31   18.00    49.87   21.68    48.50   18.93
RefineNet-Res101-SkyFinder-FT  79.48   5.17     72.18   7.48     85.20   7.37     78.95   6.67
RefineNet-Res101-SkyFinder     87.07   5.08     73.84   7.00     88.05   5.65     82.99   5.90
Table 1. Results from all three testing splits. MCR results for the top 3 baselines are taken from the SkyFinder metadata in order to compare our pixel-level sky detector with their methods. We also report mIOU scores. RefineNet-Res101-Cityscapes is the off-the-shelf model trained on Cityscapes. RefineNet-Res101-SkyFinder-FT shows results when we fine-tuned the RefineNet-Res101-Cityscapes model on the SkyFinder dataset. Finally, training RefineNet-Res101 from scratch on the SkyFinder dataset (last row) gives the best results. For RefineNet, all these numbers are reported with the model evaluated at a test scale of 0.8.
3. Approach
Our approach is based on adopting semantic segmentation methods for the task of pixel-level sky detection. We used two datasets, SkyFinder [12] and SUN-Sky [23], for our work. Our approach outperforms baseline methods in the task of sky segmentation. We evaluated the generalization capability of our models by performing cross-dataset evaluation. We also studied the influence of various factors, such as transient attributes, weather conditions, and noise, on our model's performance.
3.1. Datasets
3.1.1 SkyFinder dataset
The SkyFinder dataset is a subset of the Archive of Many Outdoor Scenes (AMOS) dataset. Because the full SkyFinder dataset was not available, we used 45 of the 53 cameras shared. This entailed 60K-70K images, with an average of 1500 images per camera. These images were of varying dimensions, quality, time of day, season, and weather. However, some cameras contained images indicating that the camera was experiencing technical difficulties or repairs; these few images were removed to focus on the challenges we wished to address. We then resized the images to fit within 320 × 480, introducing a degree of uniformity and making test evaluation faster.

A single segmentation map was associated with each camera, since each camera had to remain stationary for at least a year to be included in the dataset. See figure 1 for sample images from this dataset.
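The paper does not specify the exact resizing procedure. As a minimal sketch, assuming aspect-ratio-preserving downscaling with PIL, fitting each image within 320 × 480 could look like this:

```python
from PIL import Image

def resize_within(img: Image.Image, max_h: int = 320, max_w: int = 480) -> Image.Image:
    """Downscale an image so it fits within max_h x max_w while keeping
    its aspect ratio; images already within the bounds are left as-is."""
    scale = min(max_h / img.height, max_w / img.width, 1.0)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.BILINEAR)
```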
3.1.2 SUN dataset
The Scene UNderstanding (SUN) dataset is comprised of a multitude of different scenes and the objects that make up those scenes [22]. It is constantly being updated through community effort, adding to its already large size. We
Figure 3. Improvement in night time images. Column 1 shows original image examples, column 2 shows ground truths, and the last three columns show results for a) off-the-shelf RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c) RefineNet-Res101-SkyFinder respectively.
Figure 4. Improvement in weather-obscured images. Column 1 shows original image examples, column 2 shows ground truths, and the last three columns show results for a) off-the-shelf RefineNet-Res101-Cityscapes, b) RefineNet-Res101-SkyFinder-FT, and c) RefineNet-Res101-SkyFinder respectively.
primarily looked at this dataset for its "sky" object label, which allowed us to begin comparing against the SkyFinder dataset and results. Unlike SkyFinder, images are not grouped by camera; instead they are grouped by scene, such as airport terminal or church. Also, SkyFinder focuses on images taken from stationary cameras, whereas SUN has a variety of images from a variety of viewpoints.
Method                       mIOU(%)  MCR(%)
RefineNet-Res101-Cityscapes  61.69    8.40
RefineNet-Res101-SUNdb-FT    83.10    3.70
RefineNet-Res101-SUNdb       82.36    4.17
Table 2. Performance of the fine-tuned (RefineNet-Res101-SUNdb-FT) and trained-from-scratch (RefineNet-Res101-SUNdb) models on the SUN-Sky dataset.
For the purposes of this research, we focused on the subsection of the SUN database that has the object "sky" labeled (about 20,000 images). We resized those images to be within the range of 320 × 480 for improved test evaluation speed. In what follows, we refer to this subset as the SUN-Sky dataset. See figure 2 for a few sample images from this dataset.
3.2. Sky Segmentation
3.2.1 RefineNet
The model we used, RefineNet, was created by Lin et al. [8]. It seeks to retain detail throughout the reconstruction of the image and its segmented output, unlike its predecessor and backbone, ResNet.
3.2.2 Off-the-Shelf RefineNet
In order to establish a baseline for RefineNet, we used RefineNet's Res101 model trained on Cityscapes, a dataset of European cities for scene segmentation. Cityscapes is an ideal dataset for the sky class, and as a result the model does well in ideal conditions in the SkyFinder dataset. After setting up the model on a single Titan X GPU, we ran each of the 45 cameras through the model and evaluated solely on sky classification ability, i.e., all other classes were construed as non-sky. We refer to this baseline model as RefineNet-Res101-Cityscapes in what follows.
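Since the Cityscapes-trained model predicts many classes, evaluation reduces its output to a binary mask. A minimal sketch of this step, assuming the standard Cityscapes train ID of 10 for the sky class:

```python
import numpy as np

CITYSCAPES_SKY_TRAIN_ID = 10  # sky in the standard Cityscapes trainId mapping

def to_sky_mask(label_map: np.ndarray) -> np.ndarray:
    """Collapse a multi-class prediction (H x W array of class ids)
    into a binary sky / non-sky mask; every other class is non-sky."""
    return label_map == CITYSCAPES_SKY_TRAIN_ID
```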
3.2.3 Finetuning
To obtain a proof of concept prior to running the entire dataset, we focused on a smaller subset. We took 10 random cameras and broke them into a train-val-test split. From each camera we took 75 images for training, 25 images for validation, and between 175-300 images for testing and evaluation.
Following the success of fine-tuning the model on the
subset, we fine-tuned the RefineNet-Res101-Cityscapes
model on the SkyFinder dataset. Being unable to find the
same train-val-test split as Mihail et al. [12], we split the
dataset into our own train-val-test splits. To keep our experiments as consistent as possible with [12], we used the same number of test cameras in our experiments.
Hence, our split consisted of 13 cameras used for testing,
4 cameras used for validation, and the remaining cameras
used for training. We then shuffled the cameras in each section but kept the same number of cameras for each of the training, validation, and testing sets, and repeated the aforementioned fine-tuning process two more times for a total of three fine-tuning trials. We used a learning rate of 5e-5 for each instance and trained the model for 10 epochs. After 10 epochs, the validation accuracy started leveling out. Thus, for consistency, we report all results on models trained for 10 epochs. We refer to our fine-tuned models as RefineNet-Res101-SkyFinder-FT in our results.
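A hedged sketch of the camera-level split construction described above (13 test cameras, 4 validation, the rest training, re-shuffled for each of the three trials); the function and seeds here are illustrative, not the paper's code:

```python
import random

def camera_split(camera_ids, n_test=13, n_val=4, seed=0):
    """One camera-level split of the SkyFinder cameras: whole cameras,
    not individual images, are assigned to a single partition."""
    ids = list(camera_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Three trials with different shuffles, as in the paper's three splits.
splits = [camera_split(range(45), seed=s) for s in range(3)]
```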
3.2.4 Training with SkyFinder dataset
Finally, to take advantage of the large size of the dataset, we trained RefineNet-Res101 from scratch, where Res101 was initialized with ImageNet pre-trained weights but RefineNet was trained solely on the SkyFinder dataset. We used the same three train-val-test splits as mentioned above (to allow for fair comparison) and trained at a learning rate of 5e-4 for 10 epochs. Due to the flattening of the learning curve, we used the model at epoch 10 for testing in both instances.
We refer to these models as RefineNet-Res101-SkyFinder
in our results.
3.2.5 Training with Sun-Sky dataset
We broke the sky-labeled subset of the SUN dataset into a randomly shuffled 60-20-20 split. After resizing the images to 320 × 240 for fast training and testing, we generated the ground truth segmentation masks by keeping only the sky class and treating all other classes as non-sky. For training and evaluation, we followed a process similar to the one used for the SkyFinder dataset. Training entails fine-tuning from the RefineNet-Res101-Cityscapes model, and training a model initialized from ImageNet-Res101. Evaluation consists of calculating the average MCR and mIOU over the dataset. We fine-tuned the RefineNet-Res101-Cityscapes model on the SUN-Sky dataset for 10 epochs at a learning rate of 5e-4, and subsequently for 10 more epochs at a learning rate of 5e-5. For our second model, we initialized from ImageNet-Res101 and trained using the same split as the previous model for 10 epochs at a learning rate of 5e-4. Much like the previous model, we again trained for another 10 epochs at the lower learning rate of 5e-5.
Table 2 shows quantitative results on the SUN-Sky dataset for both the fine-tuned model (RefineNet-Res101-SUNdb-FT) and the trained model (RefineNet-Res101-SUNdb). See figure 14 for qualitative results on this dataset.
Hour   Split 1         Split 2         Split 3         Avg.
       mIOU   MCR      mIOU   MCR      mIOU   MCR      mIOU   MCR
0      82.36  5.54     80.64  6.27     85.37  7.46     82.79  6.42
1      76.28  8.25     75.40  7.69     84.88  7.34     78.85  7.76
2      76.12  9.75     76.26  8.31     86.27  6.75     79.55  8.27
3      70.41  7.21     69.99  6.28     87.96  5.97     76.12  6.48
4      71.66  6.65     69.83  5.71     88.26  5.06     76.58  5.81
5      77.33  5.16     71.04  4.84     90.82  3.09     79.73  4.36
6      76.48  5.06     74.20  5.85     91.74  3.94     80.80  4.95
7      76.66  3.65     71.05  8.71     89.62  5.75     79.11  6.04
8      72.05  3.31     69.53  4.15     93.23  3.46     78.27  3.64
9      72.47  3.06     70.65  3.34     93.93  2.94     79.02  3.12
10     74.54  2.44     73.21  2.89     94.51  2.69     80.75  2.68
11     70.97  2.57     68.90  2.82     94.17  2.89     78.01  2.76
12     74.08  2.61     72.69  2.92     94.57  2.57     80.44  2.70
13     74.04  2.12     71.78  2.61     94.60  2.49     80.14  2.41
14     72.58  2.41     70.38  2.70     94.17  2.81     79.04  2.64
15     73.25  2.40     71.08  2.85     94.19  2.84     79.50  2.70
16     71.69  2.59     69.78  2.86     92.77  3.44     78.08  2.96
17     71.73  2.85     69.63  3.13     91.76  3.73     77.71  3.24
18     70.12  3.33     67.50  3.51     90.41  4.37     76.01  3.73
19     69.00  3.96     66.53  4.42     89.38  5.00     74.97  4.46
20     69.43  5.74     64.92  11.19    85.27  8.08     73.21  8.34
21     67.11  5.70     64.85  6.79     86.80  5.88     72.92  6.12
22     68.27  6.58     64.70  6.09     86.50  5.49     73.16  6.05
23     70.86  7.17     62.64  6.48     83.62  6.89     72.37  6.85
Avg.   72.89  4.77     70.30  6.45     90.20  4.62     77.80  5.25
Table 3. Sky segmentation results broken down w.r.t. different times of the day (hour) on the SkyFinder dataset. Numbers are in percentage.
4. Evaluation
To evaluate the accuracy of our results we used both the misclassification rate (MCR) defined in Mihail et al. [12] and the mean intersection over union (mIOU). Using both metrics allowed us to compare our results against those in Mihail et al. [12], which focused primarily on MCR, and to determine the overlap accuracy of the segmentation outputs.
MCR = (FP + FN) / (number of pixels)    (1)

mIOU = (1/N) Σ_i TP_i / (TP_i + FP_i + FN_i)    (2)

where TP, FP, and FN denote true positive, false positive, and false negative pixel counts, and the mean in (2) is taken over the N test images.
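As an illustrative sketch (not the authors' released code), both metrics can be computed per image from binary sky masks as follows; the handling of images that contain no sky at all (zero union, as with camera 10917 discussed later) is an assumption of this sketch:

```python
import numpy as np

def mcr(pred: np.ndarray, gt: np.ndarray) -> float:
    """Misclassification rate: fraction of pixels where the binary
    sky prediction disagrees with the ground truth (Eq. 1)."""
    return float(np.mean(pred.astype(bool) != gt.astype(bool)))

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union for the sky class of one image (Eq. 2)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # No sky predicted and none present; returning 1.0 is our
        # assumption, not necessarily the convention used in the paper.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

# Dataset-level scores are the averages over all test images:
# m_iou = np.mean([iou(p, g) for p, g in zip(preds, gts)])
# avg_mcr = np.mean([mcr(p, g) for p, g in zip(preds, gts)])
```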
5. Results
In this section, we discuss our experiments on sky segmentation and study the impact of different conditions, such as times of day, weather, transient attributes, and noise, on the performance of our model. Note that although [12] performed a similar study, we extend their work and also report mIOU in our experiments.
5.1. SkyFinder dataset
We evaluated RefineNet on the SkyFinder dataset in three settings: off-the-shelf, fine-tuned, and trained from scratch. For reference, RefineNet-Res101-Cityscapes denotes the off-the-shelf RefineNet model, initialized from Res101 and trained on Cityscapes. We further fine-tuned this off-the-shelf model on the SkyFinder dataset for all three splits and refer to it as RefineNet-Res101-SkyFinder-FT. RefineNet initialized from ImageNet pre-trained Res101 and trained on SkyFinder from scratch is referred to as RefineNet-Res101-SkyFinder. We find that RefineNet-Res101-SkyFinder outperforms the fine-tuned model, clearly because the former takes advantage of the large size of this dataset. For comparison with the other baseline models (Hoiem et al., Lu et al., and Tighe et al.) mentioned in [12], we used the MCR scores for all three test splits and also report the average performance for all methods in table 1. The MCR scores for these baselines on our test sets were fetched from the metadata provided with the SkyFinder dataset, in which the authors of [12] had evaluated these models.
For the results in table 1, we evaluated RefineNet at a test scale of 0.8 (the default setting), which performs better than evaluation at full scale (scale = 1.0); compare tables 1 and 3. We find that RefineNet-Res101-SkyFinder outperforms all baseline methods both in terms of mIOU and MCR scores. For qualitative evaluation, the images are selected from a few of the cameras in the first test set and do not include visual results from Hoiem et al., Lu et al., or Tighe et al.: Mihail et al. created an ensemble combining the three methods, but their results were not reported for individual images. The analysis on time of day and weather has been performed with full-scale test evaluation.
5.1.1 Performance on Camera 10917
First, some background on SkyFinder's Camera 10917: this camera depicts a location in which no sky is visible anywhere. It shows a quaint village and trees, but no sky. The models for the first two splits did not see this camera during training. As a result, we consistently witnessed IOU values of 0, which degrades performance for test splits 1 and 2 because the model had not previously seen images with no sky. However, the model performs decently in terms of MCR (below 10%). We believe that the IOU metric is uninformative in a case such as this, and therefore pay particular attention to the MCR for this camera. For test split 3, this camera was used for training, and the model accordingly performs better on that test set.
5.1.2 Performance for night time
Fig. 3 shows the visual improvement in results for night time images compared with the off-the-shelf RefineNet-Res101-Cityscapes and fine-tuned RefineNet-Res101-SkyFinder-FT models, which makes it clear why our trained models also win over the baseline methods in terms of MCR.
Figure 5. Performance analysis of RefineNet-Res101-SkyFinder in terms of mean intersection over union w.r.t. (row 1) time of day and (row 2) weather conditions.
Weather         Split 1        Split 2        Split 3        Avg.
                mIOU   MCR     mIOU   MCR     mIOU   MCR     mIOU   MCR
clear           59.88  2.96    66.48  3.54    92.61  4.21    72.99  3.57
cloudy          79.92  3.61    77.02  4.19    90.71  4.07    82.55  3.96
fog             77.82  7.18    71.72  7.75    86.22  5.58    78.59  6.83
hazy            75.74  6.72    72.32  6.71    89.31  4.49    79.12  5.97
mostly cloudy   73.12  3.44    69.10  4.89    90.40  4.06    77.54  4.13
partly cloudy   77.46  4.18    75.79  6.10    89.98  4.61    81.08  4.96
rain            61.88  3.94    58.00  4.32    89.79  4.65    69.89  4.30
sleet           47.66  4.17    50.00  4.75    92.57  4.51    63.41  4.48
snow            29.50  1.42    33.70  4.48    89.03  6.83    50.74  4.24
tstorms         84.50  3.88    80.75  3.32    89.09  4.58    84.78  3.93
unknown         83.36  5.76    68.27  5.29    86.17  3.84    79.27  4.96
blanks          83.09  7.08    83.84  6.38    78.93  8.94    81.95  7.47
Avg.            69.49  4.82    67.25  6.77    88.73  5.03    75.16  5.54
Table 4. Sky segmentation results broken down w.r.t. weather on the SkyFinder dataset. Numbers are in percentage.
5.1.3 Performance for weather obscured images
In section 5.1.4, our results suggest that sky segmentation at night is more challenging than during day hours. Still, our final model trained on SkyFinder improves in terms of mIOU over our established baseline models (RefineNet-Res101-Cityscapes and RefineNet-Res101-SkyFinder-FT). Fig. 4 shows that despite images being obscured by dense fog, RefineNet-Res101-SkyFinder was able to perform reasonably well.
Figure 6. Illustration of performance analysis on time of day. The x-axis shows the hour of the day and the y-axis represents the misclassification rate (MCR) over all three test splits used in our experiment.
Figure 7. Qualitative results for performance analysis w.r.t. times of day on the SkyFinder dataset using RefineNet-Res101-SkyFinder. For each example we show the input image, ground truth, and prediction. The first row shows success cases (mIOU > 0.8) for hours 0 (night), 12 (noon), and 18 (evening), and the last row shows failure cases (mIOU < 0.5) for hours 6 and 0. Our model never failed for images around noon.
5.1.4 Performance analysis for different day times
We also wanted to see how the trained model performs during different times of day, similar to [12]. As we use three test splits for reporting our results, we sorted each of them w.r.t. hour of the day. For each sorted split, we evaluate the respective model on its test set and compute mIOU and MCR, and then average over the splits (see table 3). Please note that while evaluating RefineNet-Res101-SkyFinder on each of its respective test sets, we used scale = 1.0 (i.e., full scale at test time). We find a pattern similar to the one discussed in [12] for RefineNet as well: the model achieves good performance during the day in terms of MCR, and performance decreases at the start (early morning) and end (night) of the day, i.e., MCR increases. In terms of mIOU, we witness a decrease in performance towards the end of the day, but overall the performance seems consistent. Fig. 5 illustrates mIOU during different hours of the day. To compare RefineNet with the results in the baseline paper in this respect, we fetched the MCR scores for each split w.r.t. hour as provided by [12]. Fig. 6 shows that although our trained model follows the same pattern, our pixel-level sky detector performs significantly better than all three methods.
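A sketch of the per-hour breakdown used for table 3; the per-image record format here is an assumption:

```python
from collections import defaultdict

def metrics_by_hour(records):
    """records: iterable of (hour, miou, mcr) tuples, one per test image
    (assumed format). Returns {hour: (mean mIOU, mean MCR)}."""
    buckets = defaultdict(list)
    for hour, miou, mcr in records:
        buckets[hour].append((miou, mcr))
    return {h: tuple(sum(v) / len(v) for v in zip(*vals))
            for h, vals in sorted(buckets.items())}
```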
Figure 8. Illustration of performance analysis on different weather conditions. The x-axis shows the different weather conditions existing in the SkyFinder dataset and the y-axis represents the misclassification rate (MCR) over all three test splits used in our experiment.
Please see figure 7 for qualitative results when testing RefineNet-Res101-SkyFinder on the test splits sorted w.r.t. time. Row 1 shows success cases, whereas row 2 shows failure cases for different times of day.
5.1.5 Performance analysis for weather conditions
Like [12], we aim to evaluate how well a semantic segmentation model trained as a sky classifier performs under different weather conditions. We split the test set with respect to the weather conditions provided in the metadata of the SkyFinder dataset and evaluated RefineNet-Res101-SkyFinder on them. We ran the same experiment for all three test splits and then computed the average scores for both mIOU and MCR (see table 4 for quantitative results). Looking at the average mIOU scores, we find that our model struggles for a few weather conditions, such as snow and sleet. On the other hand, our model performs well even when the weather is cloudy, partly cloudy, or foggy. Interestingly, the model performs really well during thunderstorms, both in terms of mIOU and MCR.
The SkyFinder dataset has some images without weather labels; we evaluated our model on them as well and report the results. We observe that mIOU is lower than expected for clear sky. The reason is that a large share of clear-sky images (1200+) come from camera 10917, where no sky is visible, which lowers the mIOU at test time for splits 1 and 2. The model for split 3 was trained on such images and therefore learns to tell when there is no sky in the image (92.61% mIOU for clear sky). Fig. 5 shows mIOU scores averaged over all three splits for RefineNet-Res101-SkyFinder. Fig. 8 compares the MCR scores for weather across different methods, reporting the average MCR over all test splits; RefineNet significantly outperforms the baselines. To calculate MCR for the baseline methods, we use the MCR scores from the SkyFinder metadata.
Figure 9. Qualitative results for performance analysis w.r.t. weather on the SkyFinder dataset using RefineNet-Res101-SkyFinder. For each weather type, we show the input image, ground truth, and prediction. The first four rows show success cases (mIOU > 0.8) for each weather type, in the order: clear, cloudy, fog, hazy, misc, mostly cloudy, partly cloudy, rain, sleet, snow, tstorms, and unknown. The last row shows failure cases (mIOU < 0.5), where the weather type was tstorms, clear, and fog.
Please see figure 9 for qualitative results.
5.1.6 Performance analysis w.r.t transient attributes
Inspired by [12], we aimed to investigate how much transient attributes affect the performance of our trained model. Hence, we selected images from one of our test splits for four transient attributes: gloomy, clouds, cold, and night. We selected images with a high presence of these attributes, i.e., thresholded above 0.8, and images with a low presence, i.e., with a value less than 0.2. We then tested RefineNet-Res101-SkyFinder on these subsets of images (two for each attribute: low and high) and report the performance.
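A minimal sketch of this selection step; the dictionary format of the per-image transient-attribute scores is an assumption:

```python
def split_by_attribute(scores, lo=0.2, hi=0.8):
    """Partition image ids into low- and high-presence subsets for one
    transient attribute (e.g. 'gloomy'). `scores` maps image id to an
    attribute value in [0, 1]; mid-range images are discarded."""
    low = [img for img, s in scores.items() if s <= lo]
    high = [img for img, s in scores.items() if s >= hi]
    return low, high
```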
As expected, the model performs well on images where the scene is not gloomy, cloudy, cold, or at night. The model is most robust to clouds and performs well (both in terms of mIOU and MCR) even when clouds are largely present in the image. In terms of MCR, the model's performance is worst when the input is a highly gloomy image (MCR = 10.84) or a night-time image (MCR = 10.25). In terms of mIOU, our trained model performs slightly better on gloomy images than on night-time images (see table 5, last row). We also report MCR scores for the baseline methods for comparison and find that Tighe et al. performs worst among the baselines when the above-mentioned transient attributes are highly present. Lu et al. is the most robust among the baselines, but our method significantly outperforms all of them in this experiment (see fig. 10). Figure 11 shows the histogram of MCR scores when each of these four transient attributes is high and low in the input images. See figure 12 for qualitative results of this experiment.
                       gloomy                     clouds                     cold                       night
                       >=0.8        <=0.2         >=0.8        <=0.2         >=0.8        <=0.2         >=0.8        <=0.2
Method                 mIOU  MCR    mIOU  MCR     mIOU  MCR    mIOU  MCR     mIOU  MCR    mIOU  MCR     mIOU  MCR    mIOU  MCR
Hoiem et al.           -     40.70  -     19.70   -     18.47  -     18.39   -     24.93  -     19.23   -     41.57  -     18.46
Lu et al.              -     36.42  -     14.88   -     12.74  -     8.35    -     16.44  -     14.98   -     36.60  -     10.03
Tighe et al.           -     50.53  -     13.36   -     56.21  -     23.44   -     50.65  -     17.83   -     48.14  -     31.05
RNet-Res101-SkyFinder  84.16 10.84  92.97 1.94    93.09 4.35   93.65 2.93    90.84 5.35   92.19 2.90    83.26 10.25  94.36 2.72
Table 5. Performance analysis on transient attributes (high-threshold and low-threshold values).
Figure 10. Comparison of RefineNet-Res101-SkyFinder with the baseline methods (Tighe et al., Hoiem et al., and Lu et al.) in terms of MCR, given high values (>= 0.8) and low values (<= 0.2) of the transient features (gloomy, clouds, cold, and night). Row 1 shows the performance comparison when each of the four attributes has a high value, i.e., thresholded above 0.8, whereas row 2 shows results when a very low value is observed for each attribute.
5.1.7 Performance analysis w.r.t noisy images
We are also interested in investigating the impact of noise on the task of sky segmentation. We conducted this experiment because images can be noisy for various reasons in real settings as well. For this purpose, we selected 50 images per camera from one of the test splits (650 images in total) and added different types of noise to them: motion blur, Gaussian noise, Poisson noise, salt & pepper noise, and speckle noise. We first evaluated performance without adding any noise and then compared these results against those on noisy images (see table 6). We find that our trained model is robust to Poisson noise and highly sensitive to salt & pepper noise. Motion blur also affects mIOU (a drop of approx. 3.5%). Similarly, Gaussian noise and speckle noise hurt performance, both in terms of mIOU and MCR. See figure 13 for sample images after adding different types of noise to input images.
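The exact noise parameters are not given in the paper. A sketch of generating the five corruptions with skimage and scipy, with the kernel size and default noise strengths as assumptions, could look like this:

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.util import random_noise

def motion_blur(img: np.ndarray, size: int = 9) -> np.ndarray:
    """Horizontal motion blur via a normalized 1 x size averaging kernel,
    applied per channel; the kernel size is an assumption."""
    kernel = np.zeros((size, size))
    kernel[size // 2, :] = 1.0 / size
    return np.stack([convolve(img[..., c], kernel)
                     for c in range(img.shape[-1])], axis=-1)

def corrupt(img: np.ndarray) -> dict:
    """img: float RGB image in [0, 1]. Returns one noisy copy per type."""
    return {
        "motion_blur": motion_blur(img),
        "gaussian": random_noise(img, mode="gaussian"),
        "poisson": random_noise(img, mode="poisson"),
        "salt_pepper": random_noise(img, mode="s&p"),
        "speckle": random_noise(img, mode="speckle"),
    }
```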
5.2. Sun-Sky dataset
The SUN-Sky dataset is a subset of images from the SUN dataset [23], consisting of outdoor scenes with sky present in them.

         Original  Motion Blur  Gaussian  Poisson  Salt & Pepper  Speckle
mIOU(%)  89.92     86.31        85.26     88.67    81.77          84.70
MCR(%)   5.08      6.65         7.60      5.73     9.32           7.91
Table 6. Performance analysis on the SkyFinder dataset when the input is a noisy image. Column 1 shows results on images without any noise, whereas the remaining columns show results after adding motion blur, Gaussian noise, Poisson noise, salt & pepper noise, and speckle noise respectively.

Using a similar approach to the one we used for training on the SkyFinder dataset, we first evaluate the RefineNet-Res101-Cityscapes model on the SUN-Sky dataset, then fine-tune the
same model on this dataset (referred to as RefineNet-Res101-SUNdb-FT), and lastly, we train RefineNet-Res101 on the SUN-Sky dataset from scratch (referred to as RefineNet-Res101-SUNdb). We randomly split the images into a 60%-20%-20% ratio for training, validation, and testing respectively. Evaluating these three models on our test set, we find that the off-the-shelf model performs better (mIOU = 61.69, MCR = 8.4) on this dataset than on the SkyFinder dataset (mIOU = 48.5, MCR = 18.93), which supports the view that SkyFinder is more challenging in nature than SUN-Sky. Interestingly, the model fine-tuned on this dataset outperforms the model trained solely on SUN-Sky by a slight margin (both in terms of mIOU and MCR), which is mainly because the SUN-Sky dataset is not big enough in size.
Figure 11. Frequency distribution of MCR with RefineNet-Res101-SkyFinder given the transient features (gloomy, clouds, cold, and night) using low and high threshold values (0.2 and 0.8, respectively).
Figure 12. Qualitative results for the absence (below a threshold, i.e., <= 0.2) and presence (above a threshold, i.e., >= 0.8) of the transient attributes (gloomy, clouds, cold, and night) on the SkyFinder dataset when using the RefineNet-Res101-SkyFinder model. For each example we show the input image, ground truth, and prediction. Row 1 shows selected examples for clouds with threshold <= 0.2, row 2 shows results for clouds using the high threshold >= 0.8, rows 3 and 4 show results for gloomy, rows 5 and 6 show results for cold, and the last two rows show results for night, following the same order as the first two rows, i.e., absence and presence of each transient attribute, respectively.
Overall, we find that both the fine-tuned and trained models on this dataset perform reasonably well compared to the off-the-shelf RefineNet-Res101-Cityscapes model (see table 2).
5.3. Cross-dataset evaluation
To assess the generalization power of our models, we evaluated them across datasets, i.e., models trained on SkyFinder
Figure 13. Example images for the different types of noise (motion blur, Gaussian, Poisson, salt & pepper, and speckle) added to the selected subset of images from the SkyFinder dataset.
Figure 14. Segmentation improvement across the RefineNet-Res101-Cityscapes model (a), the RefineNet-Res101-SUNdb-FT model (b), and the RefineNet-Res101-SUNdb model (c).
                    SkyFinder           SUN-Sky
Method              mIOU(%)  MCR(%)     mIOU(%)  MCR(%)
RNet-SkyFinder-FT   79.00    6.69       71.70    6.67
RNet-SkyFinder      83.00    5.89       71.57    7.24
RNet-SUNdb-FT       71.49    11.74      83.10    3.70
RNet-SUNdb          70.42    12.37      82.36    4.17
Table 7. Sky segmentation results for models trained on one dataset and tested on both. Numbers in bold show the best results on a particular dataset, whereas blue font shows the best results for that dataset during cross-dataset evaluation. Numbers are in percentage.
were evaluated on the SUN-Sky dataset and vice versa. Table 7 shows results for both the fine-tuned and trained models on each dataset and across the other dataset. Looking at the results, we find that SkyFinder, due to its large size, gives better performance when the model is trained on it from scratch. Interestingly, our fine-tuned models perform better than those trained from scratch when evaluated across datasets. In terms of MCR, the model fine-tuned on SkyFinder generalizes better than the model fine-tuned on SUN-Sky (6.67 vs. 11.74). Please note that for the SkyFinder dataset, the results shown for all models are averaged over all test splits.
6. Discussion and Conclusion
We first direct our focus to the results of RefineNet's Res101 model trained on Cityscapes in Table 1. These results reflect the pretrained model's inability to properly handle non-ideal images, particularly night images, as addressed earlier. Upon fine-tuning the model on the SkyFinder dataset, we note a drastic improvement in mIOU and a decrease of just over 10% in MCR, clearly a result of including the non-ideal images. Upon training from scratch on the SkyFinder dataset, we see a less drastic improvement in one split, but improvement nonetheless.
Beyond the averages in Table 1, focusing on a single camera in one of the data splits shows a drastic improvement. Camera 858 consists of 200 images and was not seen by the fine-tuned or ImageNet-initialized model during training. The model trained on Cityscapes achieved an mIOU of 69.21% and an MCR of 18.16%. After fine-tuning, the mIOU and MCR jumped to 91.41% and 4.69% respectively. Finally, after training on only 28 cameras of the data (again, not including camera 858) and initializing from ImageNet, the results improved slightly further: the mIOU increased to 95.76% and the MCR dropped to 2.30%. Other cameras show similar rates of improvement. Some cameras, however, prove difficult to segment for all methods and show a smaller rate of improvement.
The importance of these results lies in the ability to use
these models in real-world applications. Using the original
pre-trained model would result in poor quality segmentation
outside of the ideal circumstances. Off-the-shelf methods
must be modified in order to be used most effectively in the
real world.
To put these results in context, we also incorporate Mihail et al.'s findings for their baseline methods. Their own ensemble reported an MCR of 12.96% on their testing split, but it has not been evaluated on our test splits. In spite of this, our results further support the idea that existing models can still be effective, so long as they have been modified to suit the task's needs.
We also demonstrated that, overall, even state-of-the-art models struggle with challenging conditions like night time, variation in weather, and other transient attributes, although our trained models still perform much better than prior methods in terms of MCR.
Following this work, we intend to look at other scene parsing models, evaluate their off-the-shelf versions, fine-tune them, and possibly train them on the SkyFinder dataset in order to compare the results to the above. We also plan to develop our own end-to-end sky segmentation model for comparison. To further improve general results, the dataset may need cleaning, such as removing timestamps. Other future directions include using sky segmentation for applications such as weather classification and weather forecasting. Code, our trained models, and data will be made available for further exploration of this area.
References
[1] W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Camera as weather
sensor: Estimating weather information from single images.
Journal of Visual Communication and Image Representation,
46:233–249, 2017.
[2] G. De Croon, C. De Wagter, B. Remes, and R. Ruijsink. Sky
segmentation approach to obstacle avoidance. In Aerospace
Conference, 2011 IEEE, pages 1–16. IEEE, 2011.
[3] S. Dev, Y. H. Lee, and S. Winkler. Color-based segmentation
of sky/cloud images from ground-based cameras. IEEE Jour-
nal of Selected Topics in Applied Earth Observations and
Remote Sensing, 10(1):231–242, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition, 2016.
[5] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context
from a single image, 2005.
[6] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context
from a single image. In Computer Vision, 2005. ICCV 2005.
Tenth IEEE International Conference on, volume 1, pages
654–661. IEEE, 2005.
[7] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient
attributes for high-level understanding and editing of outdoor
scenes, 2014.
[8] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path
refinement networks for high-resolution semantic segmenta-
tion, 2016.
[9] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene pars-
ing: Label transfer via dense scene alignment. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE
Conference on, pages 1972–1979. IEEE, 2009.
[10] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
networks for semantic segmentation, 2015.
[11] C. Lu, D. Lin, J. Jia, and C. K. Tang. Two-class weather
classification, 2014.
[12] R. P. Mihail, S. Workman, Z. Bessinger, and N. Jacobs. Sky
segmentation in the wild: An empirical study. In Applica-
tions of Computer Vision (WACV), 2016 IEEE Winter Con-
ference on, pages 1–6. IEEE, 2016.
[13] M. Roser and F. Moosmann. Classification of weather situa-
tions on single color images. In Intelligent Vehicles Sympo-
sium, 2008 IEEE, pages 798–803. IEEE, 2008.
[14] L. Tao, Y. Luo, and X. Zheng. Weather recognition based on
images captured by vision system in vehicle, 2009.
[15] L. Tao, L. Yuan, and J. Sun. Skyfinder: attribute-based sky
image search. In ACM Transactions on Graphics (TOG),
volume 28, page 68. ACM, 2009.
[16] J. Tighe and S. Lazebnik. Superparsing: scalable nonpara-
metric image parsing with superpixels. Computer Vision–
ECCV 2010, pages 352–365, 2010.
[17] J. Tighe and S. Lazebnik. Superparsing: scalable nonpara-
metric image parsing with superpixels, 2010.
[18] J. Tighe and S. Lazebnik. Finding things: Image parsing
with regions and per-exemplar detectors. In Proceedings of
the IEEE conference on computer vision and pattern recog-
nition, pages 3001–3008, 2013.
[19] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing
with object instances and occlusion ordering. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3748–3755, 2014.
[20] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M.-H. Yang.
Sky is not the limit: Semantic-aware sky replacement. ACM
Transactions on Graphics (TOG), 35(4):149, 2016.
[21] W.-T. Chu, X.-Y. Zheng, and D.-S. Ding. Camera as weather
sensor: Estimating weather information from single images.
Journal of Visual Communication and Image Representation,
46(1):233–249, 2017.
[22] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun
database: Large-scale scene recognition from abbey to zoo,
2010.
[23] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba.
Sun database: Large-scale scene recognition from abbey to
zoo. In Computer vision and pattern recognition (CVPR),
2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
[24] X. Yan, Y. Luo, and X. Zheng. Weather recognition based
on images captured by vision system in vehicle. Advances in
Neural Networks–ISNN 2009, pages 390–398, 2009.
[25] A. P. Yazdanpanah, E. E. Regentova, A. K. Mandava, T. Ah-
mad, and G. Bebis. Sky segmentation by fusing clustering
with neural networks. In International Symposium on Visual
Computing, pages 663–672. Springer, 2013.
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network, 2017.