Semantic Foggy Scene Understanding with Synthetic Data
Christos Sakaridis · Dengxin Dai · Luc Van Gool
Received: date / Accepted: date
Abstract This work addresses the problem of seman-
tic foggy scene understanding (SFSU). Although ex-
tensive research has been performed on image dehaz-
ing and on semantic scene understanding with clear-
weather images, little attention has been paid to SFSU.
Due to the difficulty of collecting and annotating foggy
images, we choose to generate synthetic fog on real im-
ages that depict clear-weather outdoor scenes, and then
leverage these partially synthetic data for SFSU by em-
ploying state-of-the-art convolutional neural networks
(CNN). In particular, a complete pipeline to add syn-
thetic fog to real, clear-weather images using incom-
plete depth information is developed. We apply our fog
synthesis on the Cityscapes dataset and generate Foggy
Cityscapes with 20550 images. SFSU is tackled in two
ways: 1) with typical supervised learning, and 2) with a
novel type of semi-supervised learning, which combines
1) with an unsupervised supervision transfer from clear-
weather images to their synthetic foggy counterparts.
In addition, we carefully study the usefulness of image
dehazing for SFSU. For evaluation, we present Foggy
Driving, a dataset with 101 real-world images depict-
ing foggy driving scenes, which come with ground truth
annotations for semantic segmentation and object de-
tection. Extensive experiments show that 1) supervised
learning with our synthetic data significantly improves
the performance of state-of-the-art CNN for SFSU on
Foggy Driving; 2) our semi-supervised learning strategy
further improves performance; and 3) image dehazing
marginally advances SFSU with our learning strategy.
The datasets, models and code are made publicly available.

C. Sakaridis · D. Dai · L. Van Gool
ETH Zürich, Zurich, Switzerland
L. Van Gool
KU Leuven, Leuven, Belgium
Keywords Foggy scene understanding · Semantic segmentation · Object detection · Depth denoising and completion · Dehazing · Transfer learning
1 Introduction
Cameras and the accompanying vision algorithms are
widely used for applications such as surveillance [9],
remote sensing [17], and automated cars [34], and
their deployment keeps expanding. While these sensors
and algorithms are constantly getting better, they are
mainly designed to operate on clear-weather images and
videos [47]. Yet, outdoor applications can hardly escape
from “bad” weather. Thus, such computer vision sys-
tems should also function under adverse weather con-
ditions. Here we focus on the presence of fog.
Fog degrades the visibility of a scene signifi-
cantly [48,62]. This causes problems not only to human
observers, but also to computer vision algorithms. Dur-
ing the past years, a large body of research has been
conducted on image defogging (dehazing) to increase
scene visibility [29,50,70]. Meanwhile, marked progress
has been made in semantic scene understanding with
clear-weather images and videos [15,53,72]. In contrast,
the semantic understanding of foggy scenes has received
little attention, despite its importance in outdoor appli-
cations. For instance, an automated car still requires a
robust detection of road lanes, traffic lights, and other
traffic agents in the presence of fog. This work investi-
gates semantic foggy scene understanding (SFSU).
High-level semantic scene understanding is usually
tackled by learning from many annotations of real im-
ages [15,58].

Fig. 1 The pipeline of semantic foggy scene understanding with partially synthetic data: from a) fog simulation on real outdoor scenes, to b) training with pairs of such partially synthetic foggy images and semantic annotations as well as pairs of foggy images and clear-weather images, and c) scene understanding of real foggy scenes. This figure is better seen on a screen

Yet, the difficulty of collecting and an-
notating images for unusual weather conditions such
as fog renders this standard protocol problematic. To
overcome this problem, we depart from this traditional
paradigm and propose another route, also different from
moving to fully synthetic scenes. Instead, we choose
to generate synthetic fog into real images that contain
clear-weather outdoor scenes, and then leverage these
partially synthetic foggy images for SFSU.
Given the fact that large-scale annotated data are
available for clear-weather images [15,19,24,58], we
present an automatic pipeline to add synthetic yet
highly realistic fog to such datasets. Our fog simu-
lation uses the standard optical model for daytime
fog [39] (which has already been used extensively in im-
age dehazing) to overlay existing clear-weather images
with synthetic fog in a physically sound way, simulat-
ing the underlying mechanism of foggy image forma-
tion. We leverage our fog simulation pipeline to cre-
ate our Foggy Cityscapes dataset, by adding fog to
urban scenes from the Cityscapes dataset [15]. This
has led to 550 carefully refined high-quality synthetic
foggy images with fine semantic annotations inherited
directly from Cityscapes, plus an additional 20000 syn-
thetic foggy images without fine annotations. The re-
sulting “synthetic-fog” images are used to adapt two
semantic segmentation models [44,72] and an object
detector [25] to foggy scenes. The models are trained
in two fashions: 1) by the typical supervised learning
scheme, using the 550 high-quality annotated foggy im-
ages, and 2) by a novel semi-supervised learning ap-
proach, which augments the dataset that is used in 1)
with the additional 20000 foggy images and draws the
missing supervision for these images from the predic-
tions of the source, clear-weather model on their clear-
weather counterparts. For evaluation purposes, we col-
lect and annotate a new dataset, Foggy Driving, with
101 images of driving scenes in the presence of fog. See
Figure 1 for the whole pipeline of our work. In addi-
tion, this work studies the utility of three state-of-the-
art image dehazing methods for SFSU as well as human
understanding of foggy scenes.
The main contributions of the paper are: 1) an au-
tomatic and scalable pipeline to impose high-quality
synthetic fog on real clear-weather images; 2) two new
datasets, one synthetic and one real, to facilitate train-
ing and evaluation of models used in SFSU; 3) a new
semi-supervised learning approach for SFSU; and 4)
a detailed study of the benefit of image dehazing for
SFSU and human perception of foggy scenes.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 is devoted to our fog simulation pipeline, followed by Section 4 that introduces our two foggy datasets. Section 5 describes supervised learning with our synthetic foggy data and studies the usefulness of image dehazing for SFSU in this context. Finally, Section 6 extends the learning to a semi-supervised paradigm, where supervision is transferred from clear-weather images to their synthetic foggy counterparts, and Section 7 concludes the paper.
2 Related Work
Our work is relevant to image defogging (dehazing),
depth denoising and completion, foggy scene under-
standing, synthetic visual data, and transfer learning.
2.1 Image Defogging/Dehazing
Fog fades the color of observed objects and reduces
their contrast. Extensive research has been conducted
on image defogging (dehazing) to increase the visibility
of foggy scenes. This ill-posed problem has been tack-
led from different perspectives. For instance, in contrast
enhancement [48,62] the rationale is that clear-weather
images have higher contrast than images degraded by
fog. Depth and statistics of natural images are exploited
as priors as well [6,20,21,50]. Another line of work is
based on the dark channel prior [29], with the empiri-
cally validated assumption that pixels of clear-weather
images are very likely to have low values in some of
the three color channels. Certain works focus particu-
larly on enhancing foggy road scenes [49,65]. Methods
have also been developed for nighttime [42], given its
importance in outdoor applications. Fast dehazing ap-
proaches have been developed in [64,69] towards real-
time applications. Recent approaches also rely on train-
able architectures [63], which have evolved to end-to-
end models [45,54,73]. For a comprehensive overview
of defogging/dehazing algorithms, we point the reader
to [43,71]. All these approaches can greatly increase vis-
ibility. Our work is complementary and focuses on the
semantic understanding of foggy scenes.
2.2 Depth Denoising and Completion
Synthesizing a foggy image from its real, clear counter-
part generally requires an accurate depth map. In pre-
vious works, the colorization approach of [40] has been
used to inpaint depth maps of the indoor NYU Depth
dataset [60]. Such inpainted depth maps have been used
in state-of-the-art dehazing approaches such as [54] to
generate training data in the form of synthetic indoor
foggy images. In contrast, our work considers real out-
door urban scenes from the Cityscapes dataset [15],
which contains significantly more complex depth con-
figurations than NYU Depth. Furthermore, the avail-
able depth information in Cityscapes is not provided
by a depth sensor, but it is rather an estimate of the
depth resulting from the application of a semiglobal
matching stereo algorithm based on [32]. This depth
estimate usually contains a large number of severe ar-
tifacts and large holes (cf. Figure 1), which render it
inappropriate for direct use in fog simulation. There are
several recent approaches that handle highly noisy and
incomplete depth maps, including stereoscopic inpaint-
ing [68], spatio-temporal hole filling [11] and layer depth
denoising and completion [59]. Our method builds on
the framework of stereoscopic inpainting [68] which per-
forms depth completion at the level of superpixels, and
introduces a novel, theoretically grounded objective for
the superpixel-matching optimization that is involved.
2.3 Foggy Scene Understanding
Semantic understanding of outdoor scenes is a cru-
cial enabler for applications such as assisted or au-
tonomous driving. Typical examples include road and
lane detection [5], traffic light detection [35], car and
pedestrian detection [24], and a dense, pixel-level seg-
mentation of road scenes into most of the relevant se-
mantic classes [8,15]. While deep recognition networks
have been developed [25,44,53,72,75] and large-scale
datasets have been presented [15,24], that research
mainly focused on clear weather. There is also a large
body of work on fog detection [7,22,51,61]. Classifica-
tion of scenes into foggy and fog-free has been tackled as
well [52]. In addition, visibility estimation has been ex-
tensively studied for both daytime [28,46,66] and night-
time [23], in the context of assisted and autonomous
driving. The closest of these works to ours is [66], in
which synthetic fog is generated and foggy images are
segmented to free-space area and vertical objects. Our
work differs in that: 1) our semantic understanding task
is more complex, with 19 semantic classes that are com-
monly involved in driving scenarios, 8 of which occur as
distinct objects; 2) we tackle the problem with modern
deep CNN for semantic segmentation [44,72] and object
detection [25], taking full advantage of the most recent
advances in this field; and 3) we compile and release a
large-scale dataset of synthetic foggy images based on
real scenes plus a dataset of real-world foggy scenes,
featuring both dense pixel-level semantic annotations
and annotations for object detection.
2.4 Synthetic Visual Data
The leap of computer vision in recent years can to
an important extent be attributed to the availability
of large, labeled datasets [15,19,58]. However, acquir-
ing and annotating such a dataset for each new prob-
lem is not (yet) doable. Thus, learning with synthetic
data is gaining attention. We give some notable ex-
amples. Dosovitskiy et al. [18] use the renderings of
a floating chair to train dense optical flow regression
networks. Gupta et al. [26] impose text onto natural
images to learn an end-to-end text detection system.
Vázquez et al. [67] train pedestrian detectors with vir-
tual data. In [55,56] the authors leverage video game en-
gines to render images along with dense semantic anno-
tations that are subsequently used in combination with
real data to improve the semantic segmentation per-
formance of modern CNN architectures on real scenes.
Going one step further, [36] shows that for the task
of vehicle detection, training a CNN model only on
massive amounts of synthetic images can outperform
the same model trained on large-scale real datasets like
Cityscapes. By contrast, our work tackles semantic seg-
mentation and object detection for real foggy urban
scenes, by adding synthetic fog to real images taken un-
der clear weather. Hence, our approach is based on only
partially synthetic data. In the same vein, [2] is based
on real urban scenes, augmented with virtual cars. A
very interesting project is “FOG” [13]. Its team devel-
oped a prototype of a small-scale fog chamber, able to
produce stable visibility levels and homogeneous fog to
test the reaction of drivers.
2.5 Transfer Learning
Our work bears resemblance to works from the broad
field of transfer learning. Model adaptation across
weather conditions to semantically segment simple road
scenes is studied in [41]. More recently, a domain ad-
versarial based approach was proposed to adapt seman-
tic segmentation models both at pixel level and fea-
ture level from simulated to real environments [33]. Our
work generates synthetic fog from clear-weather data to
close the domain gap. Combining our method and the
aforementioned transfer learning methods is a promis-
ing direction for future work. The supervision trans-
fer from clear weather to foggy weather in this paper
is inspired by the stream of work on model distilla-
tion/imitation [16,27,31]. Our approach is similar in
that knowledge is transferred from one domain (model)
to another by using paired data samples as a bridge.
3 Fog Simulation on Real Outdoor Scenes
To simulate fog on input images that depict real scenes
with clear weather, the standard approach is to model
the effect of fog as a function that maps the radiance
of the clear scene to the radiance observed at the cam-
era sensor. Critically, this space-variant function is usu-
ally parameterized by the distance ℓ of the scene from
the camera, which equals the length of the path along
which light has traveled and is closely related to scene
depth. As a result, the pair of the clear image and its
depth map forms the basis of our foggy image synthesis.
In this section, we first detail the optical model which
we use for fog and then present our complete pipeline
for fog simulation, with emphasis on our denoising and
completion of the input depth. Finally, we present some
criteria for selecting suitable images to generate high-
quality synthetic fog.
3.1 Optical Model of Choice for Fog
In the image dehazing literature, various optical mod-
els have been used to model the effect of haze on the
appearance of a scene. For instance, optical models tai-
lored for nighttime haze removal have been proposed
in [42,74], taking into account the space-variant light-
ing that characterizes most nighttime scenes. This vari-
ety of models is directly applicable to the case of fog as
well, since the physical process for image formation in
the presence of either haze or fog is essentially similar.
For our synthesis of foggy images, we consider the stan-
dard optical model of [39], which is used extensively in
the literature [20,29,54,63,64] and is formulated as

I(x) = R(x) t(x) + L (1 − t(x)),    (1)

where I(x) is the observed foggy image at pixel x, R(x) is the clear scene radiance and L is the atmospheric light. This model assumes the atmospheric light to be globally constant, which is generally valid only for daytime images. The transmission t(x) determines the amount of scene radiance that reaches the camera. In case of a homogeneous medium, transmission depends on the distance ℓ(x) of the scene from the camera through

t(x) = exp(−β ℓ(x)).    (2)
The parameter β is called the attenuation coefficient and it effectively controls the thickness of the fog: larger values of β mean thicker fog. The meteorological optical range (MOR), also known as visibility, is defined as the maximum distance from the camera for which t(x) ≥ 0.05, which implies that if (2) is valid, then MOR = 2.996/β. Fog decreases the MOR to less than 1 km by definition [1]. Therefore, the attenuation coefficient in homogeneous fog satisfies

β ≥ 2.996 × 10⁻³ m⁻¹,    (3)

where the lower bound corresponds to the lightest fog configuration. In our fog simulation, the value that is used for β always obeys (3).
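As a concrete illustration of how (1)–(3) are applied, the following minimal sketch synthesizes a foggy image from a clear image, a distance map and an atmospheric light value. It is NumPy-based and the function and variable names are illustrative, not taken from the released code:

```python
import numpy as np

def synthesize_fog(R, distance, L, beta):
    """Apply the optical model I = R*t + L*(1 - t) with t = exp(-beta * distance).

    R        : clear-weather image, float array in [0, 1], shape (H, W, 3)
    distance : scene distance from the camera in meters, shape (H, W)
    L        : atmospheric light, e.g. np.array([0.9, 0.9, 0.9])
    beta     : attenuation coefficient in m^-1; must satisfy (3) to qualify as fog
    """
    assert beta >= 2.996e-3, "beta below the lower bound for fog (MOR would exceed 1 km)"
    t = np.exp(-beta * distance)                # transmission, Equation (2)
    t = t[..., np.newaxis]                      # broadcast over the color channels
    return R * t + np.asarray(L) * (1.0 - t)    # Equation (1)

# Example: beta = 0.01 m^-1 corresponds to MOR = 2.996 / beta, i.e. roughly 300 m.
```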
Model (1) provides a powerful basis for simulating
fog on outdoor scenes with clear weather.

(a) Input from Cityscapes (b) Nearest-neighbor depth completion (c) Our fog simulation
Fig. 2 Comparison of our fog simulation to nearest-neighbor interpolation for depth completion on images from Cityscapes. This figure is better seen on a screen and zoomed in

Even though
its assumption of homogeneous atmosphere is strong, it
generates synthetic foggy images that can act as good
proxies for real world foggy images where this assump-
tion might not hold exactly, as long as it is provided
with an accurate transmission map t. Straightforward
extensions of (1) are used in [65] to simulate heteroge-
neous fog on synthetic scenes.
To sum up, the necessary inputs for fog simulation
using (1) are a color image R of the original clear scene, atmospheric light L and a dense transmission map t defined at each pixel of R. Our task is thus twofold:
1. estimation of t, and
2. estimation of L from R.
Step 2 is simple: we use the method proposed in [29] with the improvement of [63]. In the following, we focus on step 1 for the case of outdoor scenes with a noisy,
incomplete estimate of depth serving as input.
3.2 Depth Denoising and Completion for Outdoor
Scenes
The inputs that our method requires for generating an accurate transmission map t are:
• the original, clear-weather color image R to add synthetic fog on, which constitutes the left image of a stereo pair,
• the right image Q of the stereo pair,
• the intrinsic calibration parameters of the two cameras of the stereo pair as well as the length of the baseline,
• a dense, raw disparity estimate D for R of the same resolution as R, and
• a set M comprising the pixels where the value of D is missing.
These requirements can be easily fulfilled with a stereo
camera and a standard stereo matching algorithm [32].
The main steps of our pipeline are the following:
1. calculation of a raw depth map d in meters,
2. denoising and completion of d to produce a refined depth map d′ in meters,
3. calculation of a scene distance map ℓ in meters from d′,
4. application of (2) to obtain an initial transmission map t̂, and
5. guided filtering [30] of t̂ using R as guidance to compute the final transmission map t.
The central idea is to leverage the accurate structure
that is present in the color images of the stereo pair
in order to improve the quality of depth, before using
the latter as input for computing transmission. We now
proceed to explain each step in detail, except step 4, which is straightforward. In step 1, we use the input disparity D in combination with the values of the focal length and the baseline to obtain d. The missing values for D, indicated by M, are also missing in d.
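A minimal sketch of step 1, assuming the calibration provides the focal length in pixels and the baseline in meters, and that missing disparities are marked by a boolean mask (names are illustrative):

```python
import numpy as np

def depth_from_disparity(D, M, focal_px, baseline_m):
    """Convert a raw disparity map D (in pixels) to a raw depth map d (in meters).

    M is a boolean mask of pixels with missing disparity; those pixels
    remain missing (NaN) in the returned depth map.
    """
    d = np.full(D.shape, np.nan, dtype=np.float64)
    valid = (~M) & (D > 0)
    d[valid] = focal_px * baseline_m / D[valid]   # d = f * B / D
    return d
```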
Step 2 follows a segmentation-based depth filling approach, which builds on the stereoscopic inpainting method presented in [68]. More specifically, we use a superpixel segmentation of the clear image R to guide depth denoising and completion at the level of superpixels, making the assumption that each individual superpixel corresponds roughly to a plane in the 3D scene.
First, we apply a photo-consistency check between R and Q, using the input disparity D to establish pixel correspondences between the two images of the stereo pair, similar to Equation (12) in [68]. All pixels in R for which the color deviation (measured as difference in the RGB color space) from the corresponding pixel in Q has greater magnitude than a threshold ε = 12/255 are deemed invalid regarding depth and hence are added to M.
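The check could be sketched as follows; the correspondence convention (column x in R maps to column x − D in Q) and the array layout are assumptions, so adapt them to the stereo pipeline at hand:

```python
import numpy as np

def photo_consistency_invalid(R, Q, D, M, eps=12/255):
    """Flag pixels of the left image R whose RGB color deviates from the
    corresponding pixel of the right image Q by more than eps, and merge
    them with the missing-depth mask M."""
    H, W, _ = R.shape
    invalid = np.array(M, dtype=bool, copy=True)          # start from missing-depth mask
    ys, xs = np.mgrid[0:H, 0:W]
    xq = np.round(xs - np.nan_to_num(D)).astype(int)       # assumed correspondence in Q
    in_bounds = (~M) & (xq >= 0) & (xq < W)
    diff = R[ys[in_bounds], xs[in_bounds]] - Q[ys[in_bounds], xq[in_bounds]]
    bad = np.linalg.norm(diff, axis=-1) > eps               # RGB difference magnitude
    invalid[ys[in_bounds][bad], xs[in_bounds][bad]] = True
    return invalid                                           # augmented set M
```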
We then segment R into superpixels with SLIC [3], denoting the target number of superpixels as K̂ and the relevant range domain scale parameter as m = 10. For depth denoising and completion on Cityscapes, we use K̂ = 2048. The final number of superpixels that are output by SLIC is denoted by K. These superpixels are classified into reliable and unreliable ones with respect to depth information, based on the number of pixels with missing or invalid depth that they contain. More formally, we use the criterion of Equation (2) in [68], which states that a superpixel T is reliable if and only if

card(T \ M) ≥ max{P, λ card(T)},    (4)

setting P = 20 and λ = 0.6.
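A sketch of this step using the SLIC implementation of scikit-image is given below; the exact SLIC implementation used in our pipeline may differ, so treat this as an approximation of criterion (4):

```python
import numpy as np
from skimage.segmentation import slic

def reliable_superpixels(R, invalid_mask, n_segments=2048, compactness=10,
                         P=20, lam=0.6):
    """Segment R (RGB, float in [0, 1]) with SLIC and classify each superpixel
    as reliable or not according to criterion (4):
    card(T \\ M) >= max{P, lam * card(T)}."""
    labels = slic(R, n_segments=n_segments, compactness=compactness)
    reliable = {}
    for sp in np.unique(labels):
        T = labels == sp
        n_total = int(T.sum())                          # card(T)
        n_valid = int((T & ~invalid_mask).sum())        # card(T \ M)
        reliable[sp] = n_valid >= max(P, lam * n_total)
    return labels, reliable
```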
For each superpixel that fulfills (4), we fit a depth
plane by running RANSAC on its pixels that have a
valid value for depth. We use an adaptive inlier thresh-
old to account for differences in the range of depth val-
ues between distinct superpixels. For a superpixel T,
the inlier threshold is set as

θ = 0.01 · median_{x ∈ T\M} {d(x)}.    (5)

We use adaptive RANSAC and set the maximum number of iterations to 2000 and the bound on the probability of having obtained a pure inlier sample to p = 0.99.
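The plane fitting could be sketched as follows, fitting depth as an affine function of the pixel coordinates; for brevity the sketch uses a fixed iteration budget instead of the full adaptive schedule with p = 0.99, and it is not the exact released implementation:

```python
import numpy as np

def fit_depth_plane_ransac(us, vs, ds, n_iters=2000, rng=None):
    """Fit d ~ a*u + b*v + c to the valid-depth pixels of one superpixel with
    RANSAC.  The inlier threshold follows Equation (5): theta = 0.01 * median(d).
    us, vs, ds are 1-D NumPy arrays (pixel columns, rows and depths) with at
    least three entries."""
    rng = np.random.default_rng() if rng is None else rng
    theta = 0.01 * np.median(ds)                                   # Equation (5)
    A = np.column_stack([us, vs, np.ones_like(us, dtype=float)])
    best_plane, best_inliers = None, None
    for _ in range(n_iters):
        idx = rng.choice(len(ds), size=3, replace=False)            # minimal sample
        plane, *_ = np.linalg.lstsq(A[idx], ds[idx], rcond=None)
        inliers = np.abs(A @ plane - ds) <= theta
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = plane, inliers
    # least-squares refit on the inliers of the best sampled model
    plane, *_ = np.linalg.lstsq(A[best_inliers], ds[best_inliers], rcond=None)
    return plane, best_inliers
```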
The greedy approach of [68] is used subsequently to
match unreliable superpixels to reliable ones pairwise
and assign the fitted depth planes of the latter to the
former. Different from [68], we propose a novel objective function for matching pairs of superpixels. For a superpixel pair (s, t), our proposed objective is formulated as

E(s, t) = ‖C_s − C_t‖² + α ‖x_s − x_t‖².    (6)
The first term measures the proximity of the two
superpixels in the range domain, where we denote the
average CIELAB color of superpixel s with C_s. In other words, we penalize the squared Euclidean distance between the average colors of the superpixels in the CIELAB color space, which has been designed to increase perceptual uniformity [14]. On the contrary, the objective of [68] uses the cosine similarity of average superpixel colors to form the range domain cost:

1 − (C_s / ‖C_s‖) · (C_t / ‖C_t‖).    (7)
The disadvantage of (7) is that it assigns zero matching
cost to dissimilar colors in certain cases. For instance,
in the RGB color space, the pair of colors (δ, δ, δ) and (1 − δ, 1 − δ, 1 − δ), where δ is a small positive constant, is assigned zero penalty, even though the former color is very dark gray and the latter is very light gray.
The second term on the right-hand side of (6) measures the proximity of the two superpixels in the spatial domain as the squared Euclidean distance between their centroids x_s and x_t. By contrast, the spatial proximity term of [68] assigns zero cost to pairs of adjacent superpixels and unit cost to non-adjacent pairs. This implies that close yet non-adjacent superpixels are penalized equally to very distant superpixels by [68]. As a result, a certain superpixel s can be erroneously matched to a very distant superpixel t which is highly unlikely to share the same depth plane as s, as long as the range domain term for this pair is minimal and all superpixels adjacent to s are dissimilar to it with respect to appearance. Our proposed spatial cost handles these cases successfully: t is assigned a very large spatial cost for being matched to s, and other superpixels that have less similar appearance yet smaller distance to s are preferred.
In (6), α > 0 is a parameter that weights the rela-
tive importance of the spatial domain term versus the
range domain term. Similarly to [3], we set α = m²/S², where S = √(N/K), N denotes the total number of pixels in the image, and m = 10 and K are the same as
for SLIC. Our matching objective (6) is similar to the
distance that is defined in SLIC [3] and other super-
pixel segmentation methods for assigning an individual
pixel to a superpixel. In our case though, this distance
is rather used to measure similarity between pairs of
superpixels.
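A sketch of how the pairwise cost (6) could be computed for all superpixels is given below; scikit-image is assumed for the CIELAB conversion, and the names are illustrative:

```python
import numpy as np
from skimage.color import rgb2lab

def matching_cost(R, labels, m=10):
    """Pairwise matching cost of Equation (6) between all superpixels:
    E(s, t) = ||C_s - C_t||^2 + alpha * ||x_s - x_t||^2,
    with alpha = m^2 / S^2 and S = sqrt(N / K).
    R is an RGB image (float in [0, 1]) and labels the SLIC label map."""
    lab = rgb2lab(R)
    N = R.shape[0] * R.shape[1]
    ids = np.unique(labels)
    K = len(ids)
    alpha = m**2 / (N / K)                               # alpha = m^2 / S^2
    C = np.zeros((K, 3))                                 # mean CIELAB colors
    X = np.zeros((K, 2))                                 # centroids
    for k, sp in enumerate(ids):
        mask = labels == sp
        C[k] = lab[mask].mean(axis=0)
        ys, xs = np.nonzero(mask)
        X[k] = [xs.mean(), ys.mean()]
    range_cost = ((C[:, None, :] - C[None, :, :])**2).sum(-1)
    spatial_cost = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    return range_cost + alpha * spatial_cost             # E, shape (K, K)
```

Each unreliable superpixel is then greedily matched to the reliable superpixel with the smallest cost, as described above.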
After all superpixels have been assigned a depth
plane, we use these planes to complete the missing
depth values for pixels belonging to M. In addition,
we replace the depth values of pixels which do not be-
long to M but constitute large-margin outliers with respect to their corresponding plane (deviation larger than θ̂ = 50 m) with the values imputed by the plane. This results in a complete, denoised depth map d′, and concludes step 2.
In step 3, we compute the distance ℓ(x) of the scene from the camera at each pixel x based on d′(x), using the coordinates of the principal point plus the focal length of the camera.
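Assuming the refined depth d′ measures depth along the optical axis and pixels are square, step 3 can be sketched as:

```python
import numpy as np

def distance_from_depth(depth, fx, cx, cy):
    """Convert depth along the optical axis (meters) to Euclidean distance
    from the camera center, using the principal point (cx, cy) and the focal
    length fx (pixels); square pixels (fx == fy) are assumed."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # a pixel at (u, v) with depth z lies at distance
    # z * sqrt((u - cx)^2 + (v - cy)^2 + f^2) / f from the camera center
    return depth * np.sqrt((xs - cx)**2 + (ys - cy)**2 + fx**2) / fx
```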
Finally, in step 5 we post-process the initial transmission map t̂ with guided filtering [30], in order to smooth transmission while respecting the boundaries of the clear image R. We fix the radius of the guided filter window to r = 20 and the regularization parameter to µ = 10⁻³, i.e. we use the same values as in the haze removal experiments of [30].

(a) Input image from Cityscapes
(b) Output of our fog simulation
Fig. 3 Sunny scene from Cityscapes and the result of our fog simulation
Results of the presented pipeline for fog simulation
on example images from Cityscapes are provided in Fig-
ure 2 for β = 0.01, which corresponds to visibility of ca.
300m. We compare our fog simulation to an alternative
implementation, which employs nearest-neighbor inter-
polation to complete the missing values of the depth
map before computing the transmission and does not
involve guided filtering as a postprocessing step.
3.3 Input Selection for High-Quality Fog Simulation
Applying the presented pipeline to simulate fog on large
datasets with real outdoor scenes such as Cityscapes
with the aim of producing synthetic foggy images of
high quality calls for careful refinement of the input.
To be more precise, the sky is clear in the majority
of scenes in Cityscapes, with intense direct or indirect
sunlight, as shown in Figure 3(a). These images usu-
ally contain sharp shadows and have high contrast com-
pared to images that depict foggy scenes. This causes
our fog simulation to generate synthetic images which
do not resemble real fog very well, e.g. Figure 3(b).
Therefore, our first refinement criterion is whether the
sky is overcast, ensuring that the light in the input real
scene is not strongly directional.
Secondly, we observe that atmospheric light estima-
tion in step 2 of our fog simulation sometimes fails to
select a pixel with ground truth semantic label sky as
the representative of the value of atmospheric light. In
rare cases, it even happens that the sky is not visible
at all in an image. This results in an erroneous, phys-
ically invalid value of atmospheric light being used in
(1) to synthesize the foggy image. Consequently, our
second refinement criterion is whether the pixel that
is selected as atmospheric light is labeled as sky, a check which affords an automatic implementation.
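The criterion can be checked automatically as sketched below; the sky label ID and the way the atmospheric-light pixel is obtained are assumptions that depend on the label encoding and on the estimation step:

```python
def passes_sky_criterion(atm_light_pixel, semantic_labels, sky_id=10):
    """Second refinement criterion of Section 3.3: accept an image only if the
    pixel selected as atmospheric light carries the ground-truth label 'sky'.

    atm_light_pixel : (row, col) of the pixel chosen by the atmospheric-light
                      estimation step
    semantic_labels : ground-truth label map of the clear image
    sky_id          : label ID of the 'sky' class (10 in the Cityscapes trainId
                      convention; adapt to the label encoding in use)
    """
    row, col = atm_light_pixel
    return semantic_labels[row, col] == sky_id
```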
4 Foggy Datasets
We present two distinct datasets for semantic un-
derstanding of foggy scenes: Foggy Cityscapes and
Foggy Driving. The former derives from the Cityscapes
dataset [15] and constitutes a collection of synthetic
foggy images generated with our proposed fog simu-
lation that automatically inherit the semantic anno-
tations of their real, clear counterparts. On the other
hand, Foggy Driving is a collection of 101 real-world
foggy road scenes with annotations for semantic seg-
mentation and object detection, used as a benchmark
for the domain of foggy weather.
4.1 Foggy Cityscapes
We apply the fog simulation pipeline that is presented
in Section 3to the complete set of images provided in
the Cityscapes dataset. More specifically, we first obtain
20000 synthetic foggy images from the larger, coarsely
annotated part of the dataset, and keep all of them,
without applying the refinement criteria of Section 3.3.
In this way, we trade the high visual quality of the syn-
thetic images for a very large scale and variability of the
synthetic dataset. We do not make use of the original
coarse annotations of these images for semantic seg-
mentation; rather, we produce labellings with state-of-
the-art semantic segmentation models on the original,
clear images and use them to transfer knowledge from
clear weather to foggy weather, as will be discussed in
Section 6. We name this set Foggy Cityscapes-coarse.
In addition, we use the two criteria of Section 3.3
in conjunction to filter the finely annotated part of
Cityscapes that originally comprises 2975 training and
500 validation images, and obtain a refined set of 550
images, 498 from the training set and 52 from the val-
idation set, which fulfill both criteria. Running our
fog simulation on this refined set provides us with
a moderate-scale collection of high-quality synthetic
foggy images. This collection automatically inherits the
original fine annotations for semantic segmentation, as
well as bounding box annotations for object detection
which we generate by leveraging the instance-level semantic annotations that are provided in Cityscapes for the 8 classes person, rider, car, truck, bus, train, motorcycle and bicycle. We term this collection Foggy Cityscapes-refined.

(a) clear-weather (b) β = 0.005 (c) β = 0.01 (d) β = 0.02
Fig. 4 Different versions of an exemplar scene from Foggy Cityscapes for varying visibility
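As an illustration of how the bounding box annotations described above can be derived from instance-level annotations, the following sketch assumes the common Cityscapes-style encoding instance_id = class_id × 1000 + k; adapt the decoding to the actual annotation format:

```python
import numpy as np

def boxes_from_instance_map(instance_map, thing_ids):
    """Derive axis-aligned bounding boxes from an instance-level label map.

    thing_ids is the set of class IDs for which instances are distinguished.
    Returns a list of (class_id, x_min, y_min, x_max, y_max)."""
    boxes = []
    for inst_id in np.unique(instance_map):
        class_id = int(inst_id) // 1000          # assumed encoding
        if class_id not in thing_ids:
            continue
        ys, xs = np.nonzero(instance_map == inst_id)
        boxes.append((class_id, int(xs.min()), int(ys.min()),
                      int(xs.max()), int(ys.max())))
    return boxes
```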
Since MOR can vary significantly in reality for dif-
ferent instances of fog, we generate five distinct ver-
sions of Foggy Cityscapes, each of which is character-
ized by a constant simulated attenuation coefficient β
in (2), hence a constant MOR. In particular, we use
β ∈ {0.005, 0.01, 0.02, 0.03, 0.06}, which correspond
approximately to MOR of 600m, 300m, 150m, 100m
and 50m respectively. Figure 4 shows three of the five
synthesized foggy versions of a clear scene in Foggy
Cityscapes.
4.2 Foggy Driving
Foggy Driving consists of 101 color images depicting
real-world foggy driving scenes. We captured 51 of these
images with a cell phone camera in foggy conditions at
various areas of Zurich, and the remaining 50 images were carefully collected from the web. We note that all im-
ages have been preprocessed so that they have a maxi-
mum resolution of 960 × 1280 pixels.
We provide dense, pixel-level semantic annotations
for all images of Foggy Driving. In particular, we use
the 19 evaluation classes of Cityscapes: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle. Pixels that do not belong to any
of the above classes or are not labeled are assigned the
void label, and they are ignored for semantic segmenta-
tion evaluation. At annotation time, we label individual
instances of person, rider, car, truck, bus, train, motorcycle and bicycle separately following the Cityscapes
annotation protocol, which directly affords bounding
box annotations for these 8 classes.
In total, 33 images have been finely annotated (cf.
the last three rows of Figure 13) in the aforementioned
procedure, and the remaining 68 images have been coarsely
annotated (cf. the top three rows of Figure 13). We
provide per-class statistics for the pixel-level semantic
annotations of Foggy Driving in Figure 5.

Table 1 Absolute and average number of annotated pixels, humans and vehicles for Foggy Driving (“Ours”), KITTI and Cityscapes. “h/im” stands for humans per image and “v/im” for vehicles per image. Only the training and validation sets of KITTI and Cityscapes are considered

              pixels   humans  vehicles  h/im  v/im
Ours (fine)   38.3M    236     288       7.2   8.7
Ours (coarse) 34.6M    54      221       0.8   3.3
KITTI         0.23G    6.1k    30.3k     0.8   4.1
Cityscapes    9.43G    24.0k   41.0k     7.0   11.8

Furthermore,
statistics for the number of objects in the bounding
box annotations are shown in Figure 6. Because of the
coarse annotation that is created for one part of Foggy
Driving, we do not use this part in evaluation of object
detection approaches, as difficult objects that are not
included in the annotations may be detected by a good
method and missed by a comparatively worse method,
resulting in incorrect comparisons with respect to pre-
cision. On the contrary, the coarsely annotated images
are used without such issues in evaluation of seman-
tic segmentation approaches, since predictions at unla-
beled pixels are simply ignored and thus do not affect
the measured performance.
Foggy Driving is smaller than other recent datasets for semantic scene understanding; however, it features challenging foggy scenes with comparatively high complexity. As Table 1 shows, the subset of
33 images with fine annotations is roughly on par with
Cityscapes regarding the average number of humans
and vehicles per image. In total, Foggy Driving contains
more than 500 vehicles and almost 300 humans. We also
underline the fact that Table 1 compares Foggy Driv-
ing — a dataset used purely for testing — against the
unions of training and validation sets of KITTI [24] and
Cityscapes, which are much larger than their respective
testing sets that would provide a better comparison.
Fig. 5 Number of annotated pixels per class for Foggy Driving (number of pixels on a logarithmic scale; classes grouped into flat, construction, nature, vehicle, sky, object and human; instances are distinguished for the marked classes)

Fig. 6 Number of objects per class in Foggy Driving, split into finely and coarsely annotated images. (a) includes statistics for the complete set of eight classes for which instances are distinguished, whereas (b) presents a zoomed version of (a) for six of these classes

As a final note, we identify the subset of the 19 annotated classes that occur frequently in Foggy Driving. These “frequent” classes either have a larger number of total annotated pixels, e.g. road, or a larger number of total annotated polygons or instances, e.g. pole and person, compared to the rest of the classes. They are: road, sidewalk, building, pole, traffic light, traffic sign, vegetation, sky, person, and car. In the experiments that
follow in Section 5.1, we occasionally use this set of fre-
quent semantic classes as an alternative to the complete
set of semantic classes for averaging per-class scores, in
order to further verify results based only on classes with
plenty of examples.
5 Supervised Learning with Synthetic Fog
We first show that our synthetic Foggy Cityscapes-
refined dataset can be used per se for successfully
adapting modern CNN models to the condition of fog
with the usual supervised learning paradigm. Our ex-
periments focus primarily on the task of semantic seg-
mentation and additionally include comparisons on the
task of object detection, evidencing clearly the useful-
ness of our synthetic foggy data in understanding the
semantics of real foggy scenes such as those in Foggy
Driving.
More specifically, the general outline of our main
experiments can be summarized in two steps:
1. fine-tuning a model that has been trained on the
original Cityscapes dataset for clear weather by
using only synthetic images of Foggy Cityscapes-
refined, and
2. evaluating the fine-tuned model on Foggy Driving
and showing that its performance is improved com-
pared to the original, clear-weather model. Thus,
the reported results pertain to Foggy Driving unless
otherwise mentioned.
In other words, all models are ultimately evaluated on
data from a different domain than that of the data on
which they have been fitted, revealing their true gener-
alization potential on previously unseen foggy scenes.
We also consider dehazing as an optional prepro-
cessing step before feeding the input images to seman-
tic segmentation models for training and testing, and
examine the effect of this dehazing preprocessing on the
performance of such a model using state-of-the-art de-
hazing methods. The effect of dehazing on semantic seg-
mentation performance is additionally correlated with
its utility for human understanding of foggy scenes by
conducting a user study on Amazon Mechanical Turk.
5.1 Semantic Segmentation
Our model of choice for conducting experiments on se-
mantic segmentation with the supervised pipeline is the
modern dilated convolutions network (DCN) [72]. In
particular, we make use of the publicly available Di-
lation10 model, which has been trained on the 2975
images of the training set of Cityscapes. We wish to
note that this model was originally trained and tested
on 1396 × 1396 image crops by the authors of [72], but due to GPU memory limitations we train it on 756 × 756 crops and test it on 700 × 700 crops. Still, Dilation10
enjoys a fair mean intersection over union (IoU) score
of 34.9% on Foggy Driving.
In the following experiments of Section 5.1, we fine-
tune Dilation10 on the training set of Foggy Cityscapes-
refined which consists of 498 images, and reserve the
52 images of the respective validation set for additional
evaluation. In particular, we fine-tune all layers of the
original model for 3k iterations (ca. 6 epochs) using
mini-batches of size 1. Unless otherwise mentioned, the
attenuation coefficient β used in Foggy Cityscapes is
equal to 0.01.
Overall, we consider four different options with re-
spect to dehazing preprocessing: applying no dehazing
at all, dehazing with multi-scale convolutional neural
networks (MSCNN) [54], dehazing using the dark chan-
nel prior (DCP) [29], and non-local image dehazing [6].
Unless otherwise specified, no dehazing is applied. Our
experimental protocol is consistent with respect to de-
hazing preprocessing: the same option for dehazing pre-
processing is used both at training time and test time.
More specifically, at training time we first process the
synthetic foggy images of Foggy Cityscapes-refined ac-
cording to the specified option for dehazing preprocess-
ing and then use the processed images as input for fine-
tuning Dilation10. At evaluation time, we process the
images in Foggy Driving with the same dehazing pre-
processing that was used at training time (if any was),
and use the processed images to test the fine-tuned
model.
Benefit of Fine-tuning on Synthetic Fog. Our first
experiment evidences the benefit of fine-tuning on Foggy
Cityscapes-refined for improving semantic segmentation
performance on Foggy Driving. Table 2 presents com-
parative performance of the original Dilation10 model
against its fine-tuned counterparts in terms of mean
IoU over all annotated classes in Foggy Driving as well
as over frequent classes only. All four options regarding
dehazing preprocessing are considered. Note that we
also evaluate the original Dilation10 model for all de-
hazing preprocessing alternatives (only relevant at test
time in this case) in the first row of each part of Table 2.
Indeed, all fine-tuned models outperform Dilation10 ir-
respective of the type of dehazing preprocessing that is
applied, both for mean IoU over all classes and over
frequent classes only.

Table 2 Performance comparison on Foggy Driving of Dilation10 versus fine-tuned versions of it using Foggy Cityscapes-refined, for four options regarding dehazing preprocessing. “FT” stands for using fine-tuning and “W/o FT” for not using fine-tuning

Mean IoU over all classes (%)
         No dehazing  MSCNN  DCP   Non-local
W/o FT   34.9         34.7   29.9  29.3
FT       37.8         37.1   37.4  36.6

Mean IoU over frequent classes in Foggy Driving (%)
         No dehazing  MSCNN  DCP   Non-local
W/o FT   52.4         52.4   45.5  46.2
FT       57.4         56.2   56.7  55.1

Table 3 Performance comparison on Foggy Driving of various fine-tuned versions of Dilation10 that correspond to different fog simulation methods for generating the training dataset Foggy Cityscapes-refined that is used for fine-tuning, and different learning rate policies during fine-tuning. Mean IoU (%) over all classes is used to report results

                           Constant l.r.  “Poly” l.r.
Nearest neighbor           32.9           36.2
Ours w/o guided filtering  33.0           36.8
Ours                       34.4           37.8

The best-performing fine-tuned model, which we refer to as FT-0.01, involves no dehaz-
ing and outperforms Dilation10 significantly, i.e. by 3%
for mean IoU over all classes and 5% for mean IoU over
frequent classes. Note additionally that FT-0.01 has
been fine-tuned on only 498 training images of Foggy
Cityscapes-refined, compared to the 2975 training im-
ages of Cityscapes for Dilation10.
Comparison of Fog Simulation Approaches. Next,
we compare in Table 3 the utility of our proposed fog
simulation method for generating useful synthetic train-
ing data in terms of semantic segmentation performance
on Foggy Driving, against two alternative approaches:
the baseline that we considered in Figure 2 and a trun-
cated version of our method, where we omit the guided
filtering step. We consider two different policies for the
learning rate when fine-tuning on Foggy Cityscapes-
refined: a constant learning rate of 10⁻⁵ and a polynomially decaying learning rate, commonly referred to as “poly” [12], with a base learning rate of 10⁻⁵ and
a power parameter of 0.9. Our method for fog simula-
tion consistently outperforms the two baselines and the
“poly” learning rate policy allows the model to be fine-
tuned more effectively than the constant policy. In all
other experiments with DCN, we use the “poly” learn-
ing rate policy with the parameters specified above for
fine-tuning.
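The “poly” policy itself is simple; a sketch with the hyperparameters used here (base learning rate 10⁻⁵, power 0.9, 3k iterations) is:

```python
def poly_lr(iteration, max_iter=3000, base_lr=1e-5, power=0.9):
    """'Poly' learning rate policy [12] used for fine-tuning:
    lr = base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# e.g. halfway through fine-tuning (iteration 1500 of 3000),
# the learning rate equals base_lr * 0.5 ** 0.9.
```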
Fig. 7 Performance of semantic segmentation models (Dilation10 and our FT-0.01) on Foggy Cityscapes-refined at distinct ranges of scene distance from the camera

Increasing Returns at Larger Distance. As can easily be deduced from (2), fog has a growing effect
on the appearance of the scene as distance from the
camera increases. Ideally, a model that is dedicated to
foggy scenes must deliver a greater benefit for distant
parts of the scene. In order to examine this aspect of
semantic segmentation of foggy scenes, we use the com-
pleted, dense distance maps of Cityscapes images that
have been computed as an intermediate output of our
fog simulation, given that Foggy Driving does not in-
clude depth information. In more detail, we consider
the validation set of Foggy Cityscapes-refined, the im-
ages of which are unseen both for Dilation10 and our
fine-tuned models, and bin the pixels according to their
value in the corresponding distance map. Each distance
range is considered separately for evaluation by ignoring
all pixels that do not belong to it. In Figure 7, we com-
pare mean IoU of Dilation10 and FT-0.01 individually
for each distance range. FT-0.01 brings a consistent
gain in performance across all distance ranges. What
is more, this gain is larger in both absolute and rela-
tive terms for pixels that are more than 50m away from
the camera, implying that our model is better able to handle the most challenging parts of a foggy scene.
Note that most pixels in the very last distance range
(more than 400m away from the camera) belong to the
sky class and their appearance does not change much
between the clear and the synthetic foggy images.
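This evaluation protocol can be sketched as follows; a simple confusion-matrix-based mean IoU is assumed, and classes absent from the ground truth of a bin are excluded from the average:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore=255):
    """Mean IoU over the classes that occur in the ground truth; pixels
    labeled `ignore` are excluded.  pred and gt are integer label maps."""
    valid = gt != ignore
    conf = np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes**2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    present = conf.sum(1) > 0                         # classes present in gt
    return (inter[present] / np.maximum(union[present], 1)).mean()

def iou_per_distance_range(pred, gt, distance, bins, num_classes, ignore=255):
    """Evaluate each distance range separately by ignoring all pixels that do
    not fall into it; bins is a list of (low, high) tuples in meters."""
    scores = []
    for low, high in bins:
        gt_bin = np.where((distance >= low) & (distance < high), gt, ignore)
        scores.append(mean_iou(pred, gt_bin, num_classes, ignore))
    return scores
```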
Generalization in Synthetic Fog across Densi-
ties. In order to verify the ability of a model that has
been fine-tuned on Foggy Cityscapes-refined for a fixed
value β(t) of the attenuation coefficient, hence fixed fog
density, to generalize well to new, unseen fog densities,
we evaluate the model on multiple versions of the val-
idation set of Foggy Cityscapes-refined, each rendered
using a different value for β which is in general not equal to β(t). In particular, we use the five different versions
of Foggy Cityscapes-refined as described in Section 4.1
and obtain five models by fine-tuning Dilation10 on the
training set of each version. In congruence with nota-
0 0.005 0.01 0.02 0.03 0.06
20
30
40
50
60
65
attenuation coefficient β(m−1)
mean IoU (%)
Dilation10
FT-0.005
FT-0.01
FT-0.02
FT-0.03
FT-0.06
Fig. 8 Performance of semantic segmentation models on var-
ious versions of the validation set of Foggy Cityscapes-refined
corresponding to different values of attenuation coefficient β
tion in previous experiments, we denote such a fine-
tuned model by FT-β(t),e.g. FT-0.02. Afterwards, we
evaluate each of these models plus Dilation10 on the
validation set of each of the five foggy versions plus the
original, clear-weather version where β = 0. The mean
IoU performance of the six models is presented in Fig-
ure 8. Whereas the performance of Dilation10 drops
rapidly as β increases, all five fine-tuned “foggy” models are more robust to changes in β across the examined
range. Analyzing the performance of each fine-tuned
model individually, we observe that performance is high
and fairly stable in the range [0, β(t)] and drops for
β > β(t). This implies that a “foggy” model is able to
generalize well to lighter synthetic fog than what was
used to fine-tune it. Moreover, all “foggy” models com-
pare favorably to Dilation10 across the largest part of
the range of β, with most “foggy” models being beaten
by Dilation10 only for clear weather. Note also that
the performance gain with “foggy” models under foggy
conditions is much larger than the corresponding per-
formance loss for clear weather.
Effect of Synthetic Fog Density on Real-world
Performance. Our final experiment on semantic seg-
mentation serves two purposes: to examine the ef-
fect of varying the fog density of the synthetic train-
ing data as well as that of dehazing preprocessing on
the performance of the fine-tuned model on real foggy
data. To this end, we use three of the versions of
Foggy Cityscapes-refined corresponding to the values
{0.005, 0.01, 0.02} for β and consider all four options
regarding dehazing preprocessing for fine-tuning Dila-
tion10. The performance of the 12 resulting fine-tuned
models on Foggy Driving in terms of mean IoU over all
annotated classes as well as over frequent classes only is reported in Table 4.

Table 4 Performance comparison on Foggy Driving of fine-tuned versions of Dilation10 using Foggy Cityscapes-refined, for three different values of attenuation coefficient β in fog simulation and four options regarding dehazing preprocessing

Mean IoU over all classes (%)
             β = 0.005  β = 0.01  β = 0.02
No dehazing  37.6       37.8      36.1
MSCNN        38.3       37.1      36.9
DCP          36.6       37.4      36.1
Non-local    36.2       36.6      35.3

Mean IoU over frequent classes in Foggy Driving (%)
             β = 0.005  β = 0.01  β = 0.02
No dehazing  57.0       57.4      56.2
MSCNN        57.3       56.2      56.3
DCP          56.0       56.7      55.2
Non-local    55.1       55.1      54.5

We first discuss the effect of vary-
ing fog density for each dehazing option individually
and defer a general comparison of the various dehazing
preprocessing options to the next paragraph.
The two conditions that must be met in order for
the examined models to achieve better performance are:
1. a good matching of the distributions of the synthetic
training data and the real, testing data, and
2. a clear appearance of both sets of data, in the sense
that the segmentation model should have an easy
job in mining discriminative features from the data.
Focusing on the case that does not involve dehazing, we
observe that the models with β = 0.005 and β = 0.01 perform significantly better than that with β = 0.02, implying that according to point 1 Foggy Driving is
dominated by scenes with light or medium fog. On the
other hand, each of the three dehazing methods that
are used for preprocessing has its own particularities in
enhancing the appearance and contrast of foggy scenes
while also introducing artifacts to the output. More
specifically, MSCNN is slightly conservative in remov-
ing fog, as was found for other learning-based dehazing
methods in [43], and operates best under lighter fog,
providing a significant improvement in this setting with
regard to point 2. In conjunction with the light-fog char-
acter of Foggy Driving, this explains why fine-tuning on
light fog (β = 0.005) combined with MSCNN prepro-
cessing delivers one of the two best overall results. By
contrast, the more aggressive DCP is known to operate
better at high levels of fog, as its estimated transmission
is biased towards lower values [63]. The performance of
models with DCP preprocessing thus peaks at medium
rather than low simulated-fog density, which signifies
a trade-off between removing fog to the proper extent
and minimal introduction of artifacts. Non-local dehaz-
ing has also been found to operate best at medium levels
of fog [43], which results in a similar performance trend
to DCP.
Effect of Dehazing Preprocessing on Real-world
Performance and Discussion. Comparing the four
options regarding dehazing preprocessing via Table 4,
we observe that applying no dehazing is the best or sec-
ond best option for both measures and across all three
values of β. Only MSCNN marginally beats the no-
dehazing option in some cases, while overall these two
options are roughly on a par. The absence of a signifi-
cant performance gain on Foggy Driving when perform-
ing dehazing preprocessing can be ascribed to generic
as well as method-specific reasons.
First, in the real-world setting of Foggy Driving, the
homogeneity and uniformity assumptions of the opti-
cal model (1) that is used by all examined dehazing
methods may not hold exactly. Of course, this model is
also used in our fog simulation, however, foggy image
synthesis is a forward problem, whereas image defog-
ging/dehazing is an inverse problem, hence inherently
more difficult. Thus, the artifacts that are introduced
by our fog simulation are likely to be less prominent
than those introduced by dehazing. This fact appears to
outweigh the potential increase in visibility for dehazed
images as far as point 2above is concerned. An interest-
ing insight that follows is the use of forward techniques
to generate training data for hard target domains based
on data from the source domain as an alternative to the
application of inverse techniques to transform such tar-
get domains into the easier source domain.
Second, the optical model (1), on which most of the
popular dehazing approaches rely, assumes a linear re-
lation between the irradiance at a pixel and the actual
value of the pixel in the processed hazy image. There-
fore, these approaches require that an initial gamma
correction step be applied before dehazing, otherwise
their performance may deteriorate significantly. This in
turn implies that the value of gamma must be known
for each image, which is not the case for Cityscapes
and Foggy Driving. Manually searching for “best” per-
image values is also infeasible for these large datasets.
In the absence of any further information, we have used
a constant value of 1 for gamma as the authors of [6]
recommend, which is probably suboptimal for most of
the images. We thus wish to point out that future work
on outdoor datasets, whether considering fog/haze or
not, should ideally record the value of gamma for each
image, so that dehazing methods can show their full
potential on such datasets.
Specifically for DCP, performance decreases com-
pared to MSCNN partly due to the light-fog charac-
ter of Foggy Driving which does not match the opti-
mal operating point of DCP. On the other hand, non-
local dehazing uses a different model for estimating
atmospheric light than the one that is shared by our
fog simulation, MSCNN, and DCP, and thus already
faces greater difficulty in dehazing images from Foggy
Cityscapes.
5.2 Linking the Objective and Subjective Utility of
Dehazing Preprocessing in Foggy Scene Understanding
Our experiments in Section 5.1 indicate that using any
of the three examined state-of-the-art dehazing meth-
ods to preprocess foggy images before feeding them to
a CNN for semantic segmentation does not provide a
clear benefit over feeding the foggy images directly in
the objective terms of mean IoU performance of the
trained model. In this section, we complement this ob-
jective evaluation with a study of the utility of dehazing
preprocessing for human understanding of foggy scenes
and show that the comparative results of the objective
evaluation generally agree with the comparative results
of the human-based evaluation.
Both for the objective semantic-segmentation-based
and the subjective human-based evaluation, we com-
pare the four aforementioned options with regard to
dehazing preprocessing individually on each image of
our datasets. Figure 9 presents examples of the tetrads
of images that we consider: the foggy image, which ei-
ther belongs to the validation set of Foggy Cityscapes-
refined with β = 0.01 or to Foggy Driving and corre-
sponds to no usage of dehazing, and its dehazed ver-
sions using DCP, MSCNN and non-local dehazing. For
comparative objective evaluation of the four alterna-
tives on each image, we use the mean IoU scores of the
respective fine-tuned DCN models that are considered
in the experiment of Table 2, measured on that image.
The classes that do not occur in an image are not con-
sidered for computing mean IoU on this image. The
four alternatives are ranked for each image according
to their mean IoU scores on it. Comparative evaluation
based on human subjects considers the same tetrads of
images but employs a more composite protocol, which
is detailed below.
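The per-image ranking of the four options by mean IoU can be sketched as follows; the names are illustrative, and the per-image scores are assumed to be computed as described above:

```python
import numpy as np

def rank_options(per_image_miou):
    """Rank the four dehazing options on each image by their mean IoU.

    per_image_miou : dict mapping option name -> list of per-image scores,
                     where classes absent from an image are excluded from
                     its mean IoU.
    Returns, for every image, the option names ordered from best to worst."""
    names = list(per_image_miou)
    scores = np.array([per_image_miou[n] for n in names])   # (options, images)
    rankings = []
    for img in range(scores.shape[1]):
        order = np.argsort(-scores[:, img])                  # descending IoU
        rankings.append([names[i] for i in order])
    return rankings
```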
User Study via Amazon Mechanical Turk. Hu-
mans are subjective and are not good at giving scores to individual images on a linear scale [38]. We thus fol-
low the literature [57] and choose the paired compar-
isons technique to let human subjects compare the four
options regarding dehazing preprocessing. The partici-
pants are shown two images at a time that both pertain
to the same scene, side by side, and are simply asked to
choose the one which is more suitable for safe driving
(i.e. easier to interpret). Thus, six comparisons need to
be performed per scene, corresponding to all possible
pairs.
We use Amazon Mechanical Turk (AMT) to per-
form these comparisons. In order to guarantee high
quality, we only employ AMT Masters in our study and
verify the answers via a Known Answer Review Policy.
Masters are an elite group of subjects, who have con-
sistently demonstrated superior performance on AMT.
Each individual task completed by the participants, re-
ferred to as Human Intelligence Task (HIT), comprises
five image pairs to be compared, out of which three
pairs are the true query pairs and the rest two pairs
have a known correct answer and are only used for val-
idation. In particular, each known-answer pair consists
of two versions of a scene from Foggy Cityscapes-refined
with different levels of fog, choosing from three versions
of the dataset corresponding to clear weather, β = 0.005 and β = 0.01. The version with less fog is considered
the correct answer. In order to avoid answers based on
memorized patterns, the five image pairs in each HIT
are randomly shuffled and the left-right order of the
images in each pair is randomly swapped. In addition,
each HIT is completed by three different subjects to
increase reliability. The overall quality of the user sur-
vey is shown in Figure 10, which demonstrates that the
subjects have done a decent job: for 83% of the HITs,
both known-answer questions are answered correctly.
We only use results from these HITs in our following
analysis.
Consistency of Subjects’ Answers. We first study
the consistency of choices among subjects; all subjects
are in high agreement if the advantage of one option
over the other is obvious and consistent. To measure
this, we employ the coefficient of agreement [38]:
µ=2σ
m
2t
2−1,with σ=
t
X
i=1
t
X
j=1 aij
2,(8)
where a_ij is the number of times that option i is chosen
over option j, m = 3 is the number of subjects, and
t = 4 is the number of dehazing options. The maximum
of µ is 1 for complete agreement and its minimum is
−1/3 for complete disagreement. The values of µ for all
pairs of options are shown in Table 5. The small positive
numbers in the table suggest that subjects tend to agree
when comparing options pairwise, but no single option
has a dominant advantage over another.
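A minimal sketch of how the coefficient in (8) can be evaluated from the preference counts (illustrative code, not the authors'); a[i][j] holds the number of times option i is chosen over option j.

```python
from math import comb

def coefficient_of_agreement(a, m):
    """a: t x t matrix of preference counts with a[i][i] = 0; m: judgments per pair."""
    t = len(a)
    sigma = sum(comb(a[i][j], 2) for i in range(t) for j in range(t))
    return 2.0 * sigma / (comb(m, 2) * comb(t, 2)) - 1.0

# Complete agreement among m = 3 subjects on t = 4 options yields mu = 1:
a = [[0, 3, 3, 3],
     [0, 0, 3, 3],
     [0, 0, 0, 3],
     [0, 0, 0, 0]]
print(coefficient_of_agreement(a, m=3))  # 1.0
```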
Ranking and Correlation with Objective Eval-
uation. We finally compute the overall ranking of all
four options for each image based on the number of
(a) Foggy (b) DCP (c) MSCNN (d) Non-local
Fig. 9 Example images from Foggy Driving and their dehazed versions using three state-of-the-art dehazing methods that
are examined in our experiments
Fig. 10 Quality of our user survey on AMT, computed using
known-answer questions
Table 5 Agreement coefficients for all pairwise comparisons of the four dehazing options

Comparison              µ
Foggy vs. DCP           0.155
Foggy vs. MSCNN         0.115
Foggy vs. Non-local     0.010
DCP vs. MSCNN           0.182
DCP vs. Non-local       0.036
MSCNN vs. Non-local     0.182
Mean                    0.113
times each option is chosen in all relevant pairwise com-
parisons. The correlation of these rankings with those
induced by mean IoU performance is measured with
Kendall's τ coefficient [37], with −1 ≤ τ ≤ 1, where a
value of 1 implies perfect agreement, −1 implies perfect
disagreement, and 0 implies zero correlation. Figure 11
provides a complete overview of the comparative results
both for our user study and the semantic-segmentation-
based evaluation on Foggy Cityscapes-refined and Foggy
Driving, including rank correlation results for the two
types of evaluation.
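For reference, a minimal sketch (assumed data layout, not the authors' evaluation code) of the per-image rank correlation between the two types of evaluation, computed with SciPy:

```python
from scipy.stats import kendalltau

OPTIONS = ["Foggy", "DCP", "MSCNN", "NonLocal"]

def rank_correlation(subjective_scores, objective_scores):
    """subjective_scores: option -> number of pairwise wins in the user study;
    objective_scores: option -> mean IoU of the corresponding fine-tuned model.
    Returns Kendall's tau in [-1, 1] for this image."""
    subj = [subjective_scores[o] for o in OPTIONS]
    obj = [objective_scores[o] for o in OPTIONS]
    tau, _p_value = kendalltau(subj, obj)
    return tau

# Identical orderings of the four options give tau = 1:
print(rank_correlation({"Foggy": 3, "DCP": 2, "MSCNN": 1, "NonLocal": 0},
                       {"Foggy": 0.62, "DCP": 0.55, "MSCNN": 0.48, "NonLocal": 0.40}))
```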
The results in the top row of Figure 11 indicate that
none of the three examined methods for dehazing pre-
processing reliably improves the human understanding
of synthetic foggy scenes from Foggy Cityscapes or real
foggy scenes from Foggy Driving. In particular, the no-
dehazing option beats all other three options in pairwise
comparisons on Foggy Cityscapes-refined and loses only
to DCP marginally on Foggy Driving, while it is also
ranked first on more images than any other option for
both datasets.
In addition, the rankings obtained with the two
types of evaluation are generally in congruence for the
real-world case of Foggy Driving. The no-dehazing and
DCP options are ranked higher than MSCNN and non-
local dehazing both in the user study and in the objec-
tive evaluation. The high performance of DCP com-
pared to MSCNN is due to the usage of β = 0.01
for Foggy Cityscapes-refined (cf. the discussion in Sec-
tion 5.1). What is more, the two rankings exhibit a
positive correlation on average for Foggy Driving, based
on the respective distribution of τ in the bottom right
chart of Figure 11, which supports our conclusion in
Section 5.1 about the marginal benefit of dehazing pre-
processing for foggy scene understanding.
5.3 Object Detection
For our experiment on object detection in foggy scenes,
we select the modern Fast R-CNN [25] as the architec-
ture of the evaluated models. We prefer Fast R-CNN
over more recent approaches such as Faster R-CNN [53]
because the former involves a simpler training pipeline,
making fine-tuning to foggy conditions straightforward.
Consequently, we do not learn the front-end of the ob-
ject detection pipeline which involves generation of ob-
ject proposals; rather, we use multiscale combinatorial
grouping [4] for this task.
In order to ensure a fair comparison, we first ob-
tain a baseline Fast R-CNN model for the original
Cityscapes dataset, similarly to the preceding seman-
tic segmentation experiments. Since no such model is
publicly available, we begin with the model released by
the author of [25] which has been trained on PASCAL
VOC 2007 [19] and fine-tune it on the union of the
(a) Foggy Cityscapes-refined (b) Foggy Driving
Fig. 11 Comparison of four options for dehazing preprocessing, i.e. no dehazing (“Foggy”), “DCP” [29], “MSCNN” [54], and
“NonLocal” [6], on (a) the validation set of Foggy Cityscapes-refined for β= 0.01 and (b) Foggy Driving, in terms of subjective
human understanding of the foggy scenes (top) and performance of the corresponding fine-tuned DCN models (middle). For
each combination of dataset and evaluation setting, we show the percentage of scenes for which each option is ranked first
overall on the left, and the respective percentages for pairwise comparisons of the options on the right. Bottom: Histograms of
correlation of the rankings obtained for the two evaluation settings over the datasets, measured with Kendall’s τ
training and validation sets of Cityscapes, which comprises
3475 images. Fine-tuning through all layers is run
with the same configuration as in [25], except that we
use the "poly" learning rate policy with a base learning
rate of 2 × 10^{-4} and a power parameter of 0.9, for 7k
iterations (4 epochs).
This baseline model, trained on the real, clear-weather
Cityscapes, serves as initialization for fine-tuning on our
synthetic images from Foggy Cityscapes-refined. To this
end, we use all 550 training and validation images of
Foggy Cityscapes-refined and fine-tune with the same
settings as before, except that the base learning rate is
set to 10^{-4} and we run 1650 iterations (6 epochs).
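For concreteness, a small sketch of the standard "poly" learning rate policy with the values quoted above (a generic definition, not code from [25]):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - float(iteration) / max_iter) ** power

# Baseline training on Cityscapes: base_lr = 2e-4 over 7000 iterations;
# fine-tuning on Foggy Cityscapes-refined: base_lr = 1e-4 over 1650 iterations.
print(poly_lr(2e-4, iteration=3500, max_iter=7000))  # learning rate halfway through
print(poly_lr(1e-4, iteration=1650, max_iter=1650))  # decays to 0 at the end
```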
We experiment with two values of the attenuation
coefficient βfor Foggy Cityscapes-refined and present
comparative performance on the 33 finely annotated
images of Foggy Driving in Table 6. No dehazing is in-
volved in this experiment. We concentrate on the classes
car and person for evaluation, since they constitute the
intersection of the set of frequent classes in Foggy Driv-
ing and the set of annotated classes with distinct in-
stances. Individual average precision (AP) scores for
car and person are reported, as well as mean scores
over these two classes (“mean frequent”) and over the
complete set of 8 classes occurring in instances (“mean
all”). For completeness, we note that the original VOC
2007 model of [25] exhibits an AP of 2.1% for car and
1.9% for person.
Table 6 Performance comparison on Foggy Driving of the baseline Fast R-CNN model trained on Cityscapes ("W/o FT", i.e. without fine-tuning) versus versions of it fine-tuned ("FT") on Foggy Cityscapes-refined. Results are reported as AP (%)

              mean all   car    person   mean frequent
W/o FT        11.1       30.5   10.3     20.4
FT β = 0.01   11.1       34.6   10.0     22.3
FT β = 0.005  11.7       35.3   10.3     22.8
Both of our fine-tuned models outperform the base-
line model by a significant margin for car. At the same
time, they are on a par with the baseline model for
person. The overall winner is the model that has been
fine-tuned on light fog, which we refer to as FT-0.005 :
it outperforms the baseline model by 2.4% on average
on the two frequent classes and it is also slightly better
when taking all 8 classes into account.
We provide a visual comparison of FT-0.005 and
the baseline model for car detection on example images
from Foggy Driving in Figure 12. Note the ability of our
model to detect distant cars, such as the two cars in the
image of the second row, which are moving on the left
side of the road and of which only the front parts are visible.
These two cars are both missed by the baseline model.
Fig. 12 Qualitative results for detection of cars on Foggy Driving. From left to right: ground truth annotation, baseline Fast
R-CNN model trained on original Cityscapes, and our model FT-0.005 fine-tuned on Foggy Cityscapes-refined with light fog.
This figure is best viewed zoomed in on a screen
6 Semi-supervised Learning with Synthetic Fog
While standard supervised learning can improve the
performance of SFSU using our synthetic fog, the
paradigm still needs manual annotations for corre-
sponding clear-weather images. In this section, we ex-
tend the learning to a new paradigm which is also able
to acquire knowledge from unlabeled pairs of foggy im-
ages and clear-weather images. In particular, we train
a semantic segmentation model on clear-weather im-
ages using the standard supervised learning paradigm,
and apply the model to an even larger set of clear but
“unlabeled” images (e.g. our 20000 unlabeled images
of Foggy Cityscapes-coarse) to generate the class re-
sponses. Since we have created a foggy version for the
unlabeled dataset, these class responses can then be
used to supervise the training of models for SFSU.
This learning approach is inspired by the stream of
work on model distillation [27,31] or imitation [10,16].
[10,16,31] transfer supervision from sophisticated mod-
els to simpler models for efficiency, and [27] transfers
supervision from the domain of images to other do-
mains such as depth maps. In our case, supervision is
transferred from clear weather to foggy weather. The
underpinnings of our proposed approach are the fol-
lowing: 1) in clear weather, objects are easier to rec-
ognize than in foggy weather, thus models trained on
images with clear weather in principle generalize better
to new images of the same condition than those trained
on foggy images; and 2) since the synthetic foggy images
and their clear-weather counterparts depict exactly the
same scene, recognition results should ideally also be
the same for both images.
We formulate our semi-supervised learning (SSL)
for semantic segmentation as follows. Let us denote
(a) foggy image (b) ground truth (c) [44] (d) Ours
Fig. 13 Qualitative results for semantic segmentation on Foggy Driving, both for coarsely annotated images (top three rows)
and finely annotated images (bottom three rows). “Ours” stands for [44] fine-tuned with our SSL on Foggy Cityscapes
a clear-weather image by x, the corresponding foggy
one by x′, and the corresponding human annotation
by y. Then, the training data consist of both labeled
data D_l = {(x_i, x′_i, y_i)}_{i=1}^{l} and unlabeled data
D_u = {(x_j, x′_j)}_{j=l+1}^{l+u}, where y_i^{m,n} ∈ {1, ..., K} is the label
of pixel (m, n), and K is the total number of classes.
l is the number of labeled training images, and u is
the number of unlabeled training images. The aim is to
learn a mapping function φ′ : X′ ↦ Y from D_l and D_u.
In our case, D_l consists of the 498 high-quality foggy
images in the training set of Foggy Cityscapes-refined,
which have human annotations with fine details, and
D_u consists of the additional 20000 foggy images in
Foggy Cityscapes-coarse, which do not have fine human
annotations.
Since D_u does not have class labels, we use the idea
of supervision transfer to generate the supervisory labels
for all the images therein. To this end, we first learn
a mapping function φ : X ↦ Y with D_l and then obtain
the labels ŷ_j = φ(x_j) for x_j and x′_j, ∀j ∈ {l + 1, ..., l + u}.
D_u is then upgraded to D̂_u = {(x_j, x′_j, ŷ_j)}_{j=l+1}^{l+u}. The
proposed scheme for training semantic segmentation
models for foggy images x′ is to learn a mapping function
φ′ so that the human annotations y and the transferred
labels ŷ are both taken into account:

\min_{\phi'} \sum_{i=1}^{l} L(\phi'(x'_i), y_i) + \lambda \sum_{j=l+1}^{l+u} L(\phi'(x'_j), \hat{y}_j), \qquad (9)

where L(·, ·) is the categorical cross-entropy loss function
for classification, and λ = (l/u) × w is a parameter for
balancing the contribution of the two terms, serving as
the relative weight of each unlabeled image compared
to each labeled one. We empirically set w = 5 in our
experiment, but an optimal value can be obtained via
cross-validation if needed. In our implementation, we
approximate the optimization of (9) by mixing images
from D_l and D̂_u in a proportion of 1 : w and feeding the
stream of hybrid data to a CNN for standard supervised
training.
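The following PyTorch-style sketch (hypothetical helper names, not the released code) illustrates the two steps: transferring labels from the clear-weather model φ to the foggy images of D_u, and mixing D_l with D̂_u in a proportion of 1 : w for standard supervised training.

```python
import random
import torch
import torch.nn.functional as F

w = 5  # relative weight of each unlabeled image (see text)

def transfer_labels(clear_model, unlabeled_pairs):
    """Supervision transfer: predict y_hat_j on the clear-weather image x_j and
    attach it to the corresponding foggy image x'_j, turning D_u into hat{D}_u."""
    d_u_hat = []
    with torch.no_grad():
        for x_clear, x_foggy in unlabeled_pairs:
            y_hat = clear_model(x_clear.unsqueeze(0)).argmax(dim=1).squeeze(0)
            d_u_hat.append((x_foggy, y_hat))
    return d_u_hat

def hybrid_stream(d_l, d_u_hat, rng=random):
    """Mix labeled and pseudo-labeled foggy images in a proportion of 1 : w,
    approximating objective (9) with plain supervised training on the mixture."""
    stream = list(d_l) + rng.sample(d_u_hat, min(len(d_u_hat), w * len(d_l)))
    rng.shuffle(stream)
    return stream

def train_epoch(foggy_model, optimizer, stream):
    for x_foggy, target in stream:
        logits = foggy_model(x_foggy.unsqueeze(0))        # (1, K, H, W)
        loss = F.cross_entropy(logits, target.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```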
We select RefineNet [44] as the CNN model for semantic
segmentation; it is a more recent and better-performing
method than the DCN [72] used in Section 5. The reason
for using DCN in Section 5 is that RefineNet had not
yet been published at the time we conducted those
experiments.
would like to note that the state-of-the-art PSPNet [75],
which has been trained on the Cityscapes dataset sim-
ilarly to the original version of RefineNet that we use
as our baseline, achieved a mean IoU of only 24.0% on
Foggy Driving in our initial experiments.
We use mean IoU for evaluation, similarly to Sec-
tion 5, and β = 0.01 for Foggy Cityscapes. We com-
pare the performance of three trained models: 1) origi-
nal RefineNet [44] trained on Cityscapes, 2) RefineNet
fine-tuned on Dl, and 3) RefineNet fine-tuned on Dl
and ˆ
Du. The mean IoU scores of the three models on
Foggy Driving are 44.3%, 46.3%, and 49.7% respec-
tively. The 2% improvement of 2) over 1) confirms the
conclusion we draw in Section 5 that fine-tuning with
our synthetic fog can indeed improve the performance
of semantic foggy scene understanding. The 3.4% im-
provement of 3) over 2) validates the efficacy of the SSL
paradigm. Figure 13 shows visual results of 1) and 3),
along with the foggy images and human annotations.
The re-trained model with our SSL paradigm can bet-
ter segment certain parts of the images which are mis-
classified by the original RefineNet, e.g. the pedestrian
in the first example, the tram in the fourth one, and
the sidewalk in the last one.
Both the quantitative and qualitative results sug-
gest that our approach is able to alleviate the need for
collecting large-scale training data for semantic under-
standing of foggy scenes: the generated foggy images can
be trained on directly with the annotations that are
already available for their clear-weather counterparts,
and supervision can additionally be transferred from
clear-weather images to foggy images of the same scenes.
7 Conclusion
In this paper, we have demonstrated the benefit of syn-
thetic data that are based on real images for seman-
tic understanding of foggy scenes. Two foggy datasets
have been constructed to this end: the partially syn-
thetic Foggy Cityscapes dataset which derives from
Cityscapes, and the real-world Foggy Driving dataset,
both with dense pixel-level semantic annotations for
19 classes and bounding box annotations for objects
belonging to 8 classes. We have shown that Foggy
Cityscapes can be used to boost the performance of state-
of-the-art CNN models for semantic segmentation and
object detection on the challenging real foggy scenes
of Foggy Driving, both in a usual supervised setting
and in a novel, semi-supervised setting. Last but not
least, we have shown through detailed experiments
that image dehazing struggles to work out of the box
on real outdoor foggy data and is thus only marginally
helpful for SFSU. In the future,
we would like to combine dehazing and semantic un-
derstanding of foggy scenes into a unified, end-to-end
learned pipeline, which can also leverage the type of
synthetic foggy data we have introduced. The datasets,
models and code are available at http://www.vision.ee.ethz.ch/~csakarid/SFSU_synthetic.
Acknowledgements The authors would like to thank Kevis
Maninis for useful discussions. This work is funded by Toyota
Motor Europe via the research project TRACE-Zürich.
References
1. Federal Meteorological Handbook No. 1: Surface Weather
Observations and Reports. U.S. Department of Com-
merce / National Oceanic and Atmospheric Administra-
tion (2005)
2. Abu Alhaija, H., Mustikovela, S.K., Mescheder, L.,
Geiger, A., Rother, C.: Augmented reality meets deep
learning for car instance segmentation in urban scenes.
In: Proceedings of the British Machine Vision Conference
(BMVC) (2017)
3. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P.,
Süsstrunk, S.: SLIC superpixels compared to state-of-
the-art superpixel methods. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence 34(11), 2274–
2282 (2012)
4. Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F.,
Malik, J.: Multiscale combinatorial grouping. In: IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR) (2014)
5. Bar Hillel, A., Lerner, R., Levi, D., Raz, G.: Recent
progress in road and lane detection: A survey. Mach.
Vision Appl. 25(3), 727–745 (2014)
6. Berman, D., Treibitz, T., Avidan, S.: Non-local image
dehazing. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016)
7. Bronte, S., Bergasa, L.M., Alcantarilla, P.F.: Fog detec-
tion system based on computer vision techniques. In: In-
ternational IEEE Conference on Intelligent Transporta-
tion Systems (2009)
8. Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Seg-
mentation and recognition using structure from motion
point clouds. In: European Conference on Computer Vi-
sion (2008)
9. Buch, N., Velastin, S.A., Orwell, J.: A review of com-
puter vision techniques for the analysis of urban traffic.
IEEE Transactions on Intelligent Transportation Systems
12(3), 920–939 (2011)
10. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model
compression. In: International Conference on Knowledge
Discovery and Data Mining (SIGKDD) (2006)
11. Camplani, M., Salgado, L.: Efficient spatio-temporal hole
filling strategy for Kinect depth maps. In: SPIE/IS&T
Electronic Imaging (2012)
12. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K.,
Yuille, A.L.: DeepLab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and
fully connected CRFs. IEEE Transactions on Pattern
Analysis and Machine Intelligence 40(4), 834–848 (2018)
13. Colomb, M., Hirech, K., André, P., Boreux, J.J., Lacote,
P., Dufour, J.: An innovative artificial fog production de-
vice improved in the European project “FOG”. Atmo-
spheric Research 87(3), 242–251 (2008)
14. Comaniciu, D., Meer, P.: Mean shift: a robust approach
toward feature space analysis. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence 24(5), 603–619
(2002)
15. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., En-
zweiler, M., Benenson, R., Franke, U., Roth, S., Schiele,
B.: The Cityscapes dataset for semantic urban scene un-
derstanding. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2016)
16. Dai, D., Kroeger, T., Timofte, R., Van Gool, L.: Metric
imitation by manifold transfer for efficient vision appli-
cations. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2015)
17. Dai, D., Yang, W.: Satellite image classification via two-
layer sparse coding with biased image representation.
IEEE Geoscience and Remote Sensing Letters 8(1), 173–
176 (2011)
18. Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazir-
bas, C., Golkov, V., v. d. Smagt, P., Cremers, D., Brox,
T.: FlowNet: Learning optical flow with convolutional
networks. In: IEEE International Conference on Com-
puter Vision (ICCV) (2015)
19. Everingham, M., Van Gool, L., Williams, C.K., Winn, J.,
Zisserman, A.: The PASCAL visual object classes (VOC)
challenge. IJCV 88(2), 303–338 (2010)
20. Fattal, R.: Single image dehazing. ACM transactions on
graphics (TOG) 27(3) (2008)
21. Fattal, R.: Dehazing using color-lines. ACM Transactions
on Graphics (TOG) 34(1) (2014)
22. Gallen, R., Cord, A., Hautière, N., Aubert, D.: Towards
night fog detection through use of in-vehicle multipurpose
cameras. In: IEEE Intelligent Vehicles Symposium (IV)
(2011)
23. Gallen, R., Cord, A., Hautière, N., Dumont, É., Aubert,
D.: Nighttime visibility analysis and estimation method
in the presence of dense fog. IEEE Transactions on In-
telligent Transportation Systems 16(1), 310–320 (2015)
24. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for au-
tonomous driving? The KITTI vision benchmark suite.
In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2012)
25. Girshick, R.: Fast R-CNN. In: International Conference
on Computer Vision (ICCV) (2015)
26. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for
text localisation in natural images. In: IEEE Conference
on Computer Vision and Pattern Recognition (2016)
27. Gupta, S., Hoffman, J., Malik, J.: Cross modal distilla-
tion for supervision transfer. In: The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR)
(2016)
28. Hautière, N., Tarel, J.P., Lavenant, J., Aubert, D.: Auto-
matic fog detection and estimation of visibility distance
through use of an onboard camera. Machine Vision and
Applications 17(1), 8–20 (2006)
29. He, K., Sun, J., Tang, X.: Single image haze removal using
dark channel prior. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence 33(12), 2341–2353 (2011)
30. He, K., Sun, J., Tang, X.: Guided image filtering. IEEE
Transactions on Pattern Analysis and Machine Intelli-
gence 35(6), 1397–1409 (2013)
31. Hinton, G., Vinyals, O., Dean, J.: Distilling the
knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015)
32. Hirschmüller, H.: Stereo processing by semiglobal match-
ing and mutual information. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence 30(2), 328–341
(2008)
33. Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P.,
Saenko, K., Efros, A.A., Darrell, T.: CyCADA: Cycle-
Consistent Adversarial Domain Adaptation. ArXiv e-
prints (2017)
34. Janai, J., Güney, F., Behl, A., Geiger, A.: Computer
vision for autonomous vehicles: Problems, datasets and
state-of-the-art. arXiv preprint arXiv:1704.05519 (2017)
35. Jensen, M.B., Philipsen, M.P., Møgelmose, A., Moes-
lund, T.B., Trivedi, M.M.: Vision for looking at traffic
lights: Issues, survey, and perspectives. IEEE Transac-
tions on Intelligent Transportation Systems 17(7), 1800–
1815 (2016)
36. Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar,
S.N., Rosaen, K., Vasudevan, R.: Driving in the matrix:
Can virtual worlds replace human-generated annotations
for real world tasks? In: IEEE International Conference
on Robotics and Automation (2017)
37. Kendall, M.G.: A new measure of rank correlation.
Biometrika 30(1/2), 81–93 (1938)
38. Kendall, M.G., Smith, B.B.: On the method of paired
comparisons. Biometrika 31(3/4), 324–345 (1940)
39. Koschmieder, H.: Theorie der horizontalen Sichtweite.
Beiträge zur Physik der freien Atmosphäre (1924)
40. Levin, A., Lischinski, D., Weiss, Y.: Colorization using
optimization. In: ACM SIGGRAPH (2004)
41. Levinkov, E., Fritz, M.: Sequential bayesian model up-
date under structured scene prior for semantic road
scenes labeling. In: IEEE International Conference on
Computer Vision (2013)
42. Li, Y., Tan, R.T., Brown, M.S.: Nighttime haze removal
with glow and multiple light colors. In: IEEE Interna-
tional Conference on Computer Vision (ICCV) (2015)
43. Li, Y., You, S., Brown, M.S., Tan, R.T.: Haze visibility
enhancement: A survey and quantitative benchmarking
(2016). CoRR abs/1607.06235
44. Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-
path refinement networks with identity mappings for
high-resolution semantic segmentation. In: IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR) (2017)
45. Ling, Z., Fan, G., Wang, Y., Lu, X.: Learning deep trans-
mission network for single image dehazing. In: IEEE
International Conference on Image Processing (ICIP)
(2016)
46. Miclea, R.C., Silea, I.: Visibility detection in foggy envi-
ronment. In: International Conference on Control Sys-
tems and Computer Science (2015)
47. Narasimhan, S.G., Nayar, S.K.: Vision and the atmo-
sphere. Int. J. Comput. Vision 48(3), 233–254 (2002)
48. Narasimhan, S.G., Nayar, S.K.: Contrast restoration of
weather degraded images. IEEE Transactions on Pattern
Analysis and Machine Intelligence 25(6), 713–724 (2003)
49. Negru, M., Nedevschi, S., Peter, R.I.: Exponential con-
trast restoration in fog conditions for driving assistance.
IEEE Transactions on Intelligent Transportation Systems
16(4), 2257–2268 (2015)
50. Nishino, K., Kratz, L., Lombardi, S.: Bayesian defogging.
International Journal of Computer Vision 98(3), 263–278
(2012)
51. Pavlić, M., Belzner, H., Rigoll, G., Ilić, S.: Image based
fog detection in vehicles. In: IEEE Intelligent Vehicles
Symposium (2012)
52. Pavlić, M., Rigoll, G., Ilić, S.: Classification of images
in fog and fog-free scenes for use in vehicles. In: IEEE
Intelligent Vehicles Symposium (IV) (2013)
53. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN:
Towards real-time object detection with region proposal
networks. In: Advances in Neural Information Processing
Systems, pp. 91–99 (2015)
54. Ren, W., Liu, S., Zhang, H., Pan, J., Cao, X., Yang, M.H.:
Single image dehazing via multi-scale convolutional neu-
ral networks. In: European Conference on Computer Vi-
sion (2016)
55. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing
for data: Ground truth from computer games. In: Euro-
pean Conference on Computer Vision. Springer (2016)
56. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez,
A.M.: The SYNTHIA dataset: A large collection of syn-
thetic images for semantic segmentation of urban scenes.
In: The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR) (2016)
57. Rubinstein, M., Gutierrez, D., Sorkine, O., Shamir, A.:
A comparative study of image retargeting. ACM Trans.
Graph. 29(6), 160:1–160:10 (2010)
58. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,
S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al.: Imagenet large scale visual recogni-
tion challenge. International Journal of Computer Vision
115(3), 211–252 (2015)
59. Shen, J., Cheung, S.C.S.: Layer depth denoising and com-
pletion for structured-light RGB-D cameras. In: IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR) (2013)
60. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor
segmentation and support inference from RGBD images.
In: European Conference on Computer Vision (2012)
61. Spinneker, R., Koch, C., Park, S.B., Yoon, J.J.: Fast fog
detection for camera based advanced driver assistance
systems. In: International IEEE Conference on Intelligent
Transportation Systems (ITSC) (2014)
62. Tan, R.T.: Visibility in bad weather from a single image.
In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2008)
63. Tang, K., Yang, J., Wang, J.: Investigating haze-relevant
features in a learning framework for image dehazing.
In: IEEE Conference on Computer Vision and Pattern
Recognition (2014)
64. Tarel, J.P., Hautière, N.: Fast visibility restoration from
a single color or gray level image. In: IEEE International
Conference on Computer Vision (2009)
65. Tarel, J.P., Hautière, N., Caraffa, L., Cord, A., Halmaoui,
H., Gruyer, D.: Vision enhancement in homogeneous and
heterogeneous fog. IEEE Intelligent Transportation Sys-
tems Magazine 4(2), 6–20 (2012)
66. Tarel, J.P., Hautière, N., Cord, A., Gruyer, D., Halmaoui,
H.: Improved visibility of road scene images under het-
erogeneous fog. In: IEEE Intelligent Vehicles Symposium,
pp. 478–485 (2010)
67. Vázquez, D., Lopez, A.M., Marin, J., Ponsa, D., Geron-
imo, D.: Virtual and real world adaptation for pedestrian
detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence 36(4), 797–809 (2014)
68. Wang, L., Jin, H., Yang, R., Gong, M.: Stereoscopic in-
painting: Joint color and depth completion from stereo
images. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2008)
69. Wang, W., Yuan, X., Wu, X., Liu, Y.: Fast image de-
hazing method based on linear transformation. IEEE
Transactions on Multimedia 19(6), 1142–1155 (2017)
70. Wang, Y.K., Fan, C.T.: Single image defogging by mul-
tiscale depth fusion. IEEE Transactions on Image Pro-
cessing 23(11), 4826–4837 (2014)
71. Xu, Y., Wen, J., Fei, L., Zhang, Z.: Review of video and
image defogging algorithms and related studies on image
restoration and enhancement. IEEE Access 4, 165–188
(2016)
72. Yu, F., Koltun, V.: Multi-scale context aggregation by
dilated convolutions. In: International Conference on
Learning Representations (2016)
73. Zhang, H., Sindagi, V.A., Patel, V.M.: Joint transmis-
sion map estimation and dehazing using deep networks
(2017). CoRR abs/1708.00581
74. Zhang, J., Cao, Y., Wang, Z.: Nighttime haze removal
based on a new imaging model. In: IEEE International
Conference on Image Processing (ICIP), pp. 4557–4561
(2014)
75. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene
parsing network. In: IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR) (2017)