RESIDE: A Benchmark for Single Image Dehazing
Boyi Li*, Wenqi Ren*, Member, IEEE, Dengpan Fu, Dacheng Tao, Fellow, IEEE, Dan Feng, Member, IEEE,
Wenjun Zeng, Fellow, IEEE and Zhangyang Wang, Member, IEEE.
Abstract—In this paper, we present a comprehensive study and
evaluation of existing single image dehazing algorithms, using
a new large-scale benchmark consisting of both synthetic and
real-world hazy images, called REalistic Single Image DEhazing
(RESIDE). RESIDE highlights diverse data sources and image
contents, and is divided into five subsets, each serving different
training or evaluation purposes. We further provide a rich
variety of criteria for dehazing algorithm evaluation, ranging
from full-reference metrics, to no-reference metrics, to subjective
evaluation and the novel task-driven evaluation. Experiments on
RESIDE shed light on the comparisons and limitations of state-
of-the-art dehazing algorithms, and suggest promising future
directions.
Index Terms—Dehazing, Detection, Dataset, Evaluations.
I. INTRODUCTION
A. Problem Description: Single Image Dehazing
Images captured in outdoor scenes often suffer from poor
visibility, reduced contrast, faded surfaces and color shift,
due to the presence of haze. Caused by aerosols such as
dust, mist, and fumes, the existence of haze adds complicated,
nonlinear and data-dependent noise to the images, making
the haze removal (a.k.a. dehazing) a highly challenging im-
age restoration and enhancement problem. Moreover, many
computer vision algorithms can only work well with the
scene radiance that is haze-free. However, a dependable vision
system must reckon with the entire spectrum of degradations
from unconstrained environments. Taking autonomous driv-
ing for example, hazy and foggy weather will obscure the
vision of on-board cameras and create confusing reflections
and glare, leaving state-of-the-art self-driving cars struggling
[1]. Dehazing is thus becoming an increasingly desirable
technique for both computational photography and computer
vision tasks, whose advance will immediately benefit many
blooming application fields, such as video surveillance and
autonomous/assisted driving.
Boyi Li and Dan Feng are with the Wuhan National Laboratory for
Optoelectronics, Huazhong University of Science and Technology, Wuhan,
China. Email: boyilics@gmail.com, dfeng@hust.edu.cn.
Wenqi Ren is with State Key Laboratory of Information Security, In-
stitute of Information Engineering, Chinese Academy of Sciences. Email:
rwq.renwenqi@gmail.com.
Dengpan Fu is with Department of Electronic Engineering and Infor-
mation Science, University of Science and Technology of China. Email:
fdpan@mail.ustc.edu.cn.
Dacheng Tao is with the Centre for Quantum Computation and Intelligent
Systems and the Department of Engineering and Information Technology,
University of Technology Sydney, Sydney, NSW 2007, Australia. Email:
dacheng.tao@sydney.edu.au.
Wenjun Zeng is with Microsoft Research, Beijing, China. Email:
wezeng@microsoft.com.
Zhangyang Wang is with the Department of Computer Science and Engi-
neering, Texas A&M University, USA, and is the corresponding author. Email:
atlaswang@tamu.edu
The first two authors contribute equally.
While some earlier works consider multiple images from
the same scene to be available for dehazing [2], [3], [4],
[5], single image dehazing has proved to be a more realistic
setting in practice, and has thus gained dominant popularity.
The atmospheric scattering model has been the classical
description for the hazy image generation [6], [7], [8]:
I(x) = J(x) t(x) + A (1 - t(x)),   (1)

where I(x) is the observed hazy image and J(x) is the haze-
free scene radiance to be recovered. There are two critical
parameters: A denotes the global atmospheric light, and t(x)
is the transmission matrix defined as:

t(x) = e^(-β d(x)),   (2)

where β is the scattering coefficient of the atmosphere, and
d(x) is the distance between the object and the camera.
We can re-write the model (1) with the clean image as the
output:

J(x) = (1 / t(x)) I(x) - A (1 / t(x)) + A.   (3)
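For concreteness, the following sketch (a hypothetical NumPy implementation, not code released with this paper) shows how Eq. (3) recovers J(x) once t(x) and A have been estimated by any of the methods discussed below; clipping t(x) to a small lower bound is a common safeguard against division by near-zero transmission.

import numpy as np

def recover_scene_radiance(hazy, transmission, atmospheric_light, t_min=0.1):
    # hazy: H x W x 3 float image in [0, 1]; transmission: H x W map t(x);
    # atmospheric_light: length-3 vector A. Implements Eq. (3):
    # J(x) = (1 / t(x)) * I(x) - A * (1 / t(x)) + A
    t = np.clip(transmission, t_min, 1.0)[..., np.newaxis]
    J = (hazy - atmospheric_light) / t + atmospheric_light
    return np.clip(J, 0.0, 1.0)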
Most state-of-the-art single image dehazing methods exploit
the physical model (1), and estimate the key parameters A
and t(x)in either physically grounded or data-driven ways.
The performance of the top methods has continuously im-
proved [9], [10], [11], [12], [13], [14], especially with the
latest models embracing deep learning [15], [16], [17].
B. Existing Methodology: An Overview
Given the atmospheric scattering model, most dehazing
methods follow a similar three-step methodology: (1) estimat-
ing the transmission matrix t(x)from the hazy image I(x);
(2) estimating Ausing some other (often empirical) methods;
(3) estimating the clean image J(x)via computing (3).
Usually, the majority of attention is paid to the first step,
which can rely on either physically grounded priors or fully
data-driven approaches.
A noteworthy portion of dehazing methods exploited natural
image priors and depth statistics. [18] imposed locally constant
constraints of albedo values together with decorrelation of the
transmission in local areas, and then estimated the depth value
using the albedo estimates and the original image. It did not
constrain the scene's depth structure, and thus often led to
inaccurate estimation of color or depth. [19], [20] discovered
the dark channel prior (DCP) to more reliably calculate the
transmission matrix, followed by many successors. However,
the prior is found to be unreliable when the scene objects
are similar to the atmospheric light [16]. [11] enforced the
boundary constraint and contextual regularization for sharper
restorations. [13] developed a color attenuation prior and
arXiv:1712.04143v1 [cs.CV] 12 Dec 2017
JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2
created a linear model of scene depth for the hazy image,
and then learned the model parameters in a supervised way.
[21] jointly estimated scene depth and recover the clear latent
image from a foggy video sequence. [14] proposed a non-local
prior, based on the assumption that each color cluster in the
clear image became a haze-line in RGB space.
In view of the prevailing success of Convolutional Neural
Networks (CNNs) in computer vision tasks, several dehazing
algorithms have relied on various CNNs to directly learn
t(x)fully from data, in order to avoid the often inaccu-
rate estimation of physical parameters from a single image.
DehazeNet [15] proposed a trainable model to estimate the
transmission matrix from a hazy image. [16] came up with
a multi-scale CNN (MSCNN), that first generated a coarse-
scale transmission matrix and gradually refined it. Despite
their promising results, the inherent limitation of training data
is becoming an increasingly severe obstacle for this booming
trend: see Section 2.1 for more discussions.
Besides, a few efforts have been made beyond the sub-
optimal procedure of separately estimating parameters, which
will cause accumulated or even amplified errors, when com-
bining them together to calculate (3). They instead advocate
simultaneous and unified parameter estimation. Earlier works
[22], [23] modeled the hazy image with a factorial Markov
random field, where t(x)and Awere two statistically inde-
pendent latent layers. Another line of research [24], [25] tries
to make use of Retinex theory to approximate the spectral
properties of object surfaces by the ratio of the reflected
light. Very recently, [17] presented a re-formulation of (1) to
integrate t(x) and A into one new variable. As a result, their
CNN dehazing model was fully end-to-end: J(x) was directly
generated from I(x), without any intermediate parameter
estimation step. The idea was later extended to video dehazing
in [26].
C. Our Contribution
Despite the prosperity of single image dehazing algorithms,
there have been two key hurdles to the further development of
this field: (1) there has been no large-scale, realistic benchmark
dataset so far for image dehazing; (2) current metrics for
evaluating and comparing image dehazing algorithms are
neither convincing nor sufficient. Detailed discussions will be
presented in Section 2.
This paper is directly motivated to overcome the above two
hurdles, and makes three-fold technical contributions:
We introduce the first-of-its-kind benchmark, called the
Realistic Single Image Dehazing (RESIDE) dataset. An
overview of RESIDE can be found in Tables I and II, and
image examples are displayed in Figure 1. Compared with
existing synthetic training and testing sets of unrealistic
indoor scenes, the RESIDE dataset includes a large
diversity of both indoor and outdoor scene images for
training, as well as real-world hazy images in addition
to synthetic ones for evaluation. Specifically, we annotate a
set of 4,322 real-world hazy images with object bounding
boxes.
We bring in an innovative set of evaluation strategies
in accordance with the new RESIDE dataset. Besides
TABLE I
OVERVIEW OF RESIDE: DATA SOURCES AND CONTENTS
Type Number
Synthetic Indoor Hazy Images 110,500
Synthetic Outdoor Hazy Images 313,950
Realistic Hazy Images (Unannotated) 4,807
Realistic Annotated Hazy Images 4,322
the widely adopted PSNR and SSIM, we further em-
ploy both no-reference metrics and human subjective
scores to evaluate the dehazing results, especially for
real-world hazy images without clean groundtruth. More
importantly, we recognize that image dehazing in practice
usually serves as the preprocessing step for mid-level
and high-level vision tasks. We thus propose to exploit
the object detection performance on the dehazed images
as a brand-new “task-specific” evaluation criterion for
dehazing algorithms.
We conduct an extensive and systematic range of ex-
periments to quantitatively compare nine state-of-the-art
single image dehazing algorithms, using the new RESIDE
dataset and the proposed variety of evaluation criteria.
Our evaluation and analysis demonstrate the performance
and limitations of state-of-the-art algorithms. The findings
from these experiments not only confirm what is com-
monly believed, but also suggest new research directions
in single image dehazing.
The RESIDE dataset will be made publicly available
soon for research purposes on our project's website1. The
manuscript will be periodically updated to include more
benchmarking results.
II. DATASET AND EVALUATION: STATUS QUO
A. Training Data
Many image restoration and enhancement tasks benefit from
continuous efforts to build standardized benchmarks that allow
for comparison of different proposed methods under the same
conditions, such as [27], [28]. In comparison, a common
large-scale benchmark has been long missing for dehazing,
owing to the significant challenge in collecting or creating
realistic hazy images with clean groundtruth references. It is
generally impossible to capture the same visual scene with
and without haze, while all other environmental conditions
stay identical. Therefore, recent dehazing models [29], [30]
typically generate their training sets by creating synthetic
hazy images from clean ones: they first obtain depth maps
of the clean images, by either utilizing available depth maps
from depth image datasets, or estimating the depth [31]; and
then generate the hazy images by computing (1). Data-driven
dehazing models could then be trained to regress clean images
from hazy ones.
Fattal’s dataset [29] provided 12 synthetic images. FRIDA
[32] produced a set of 420 synthetic images, for evaluating
the performance of automatic driving systems in various hazy
environments. Both of them are too small to train effective
1https://sites.google.com/site/boyilics/website-builder/reside
Fig. 1. Example images from the five sets in RESIDE (see Table II): (a) ITS; (b) OTS; (c) SOTS; (d) RTTS; (e) HSTS (top row: 10 synthetic hazy images; bottom row: 10 realistic hazy images).
TABLE II
STRUCTURE OF RESIDE: FIVE SUBSETS FOR TRAINING AND TESTING
Subset Number
Indoor Training Set (ITS) 110,000
Outdoor Training Set (OTS) 313,950
Synthetic Objective Testing Set (SOTS) 1,000
Real-world Task-driven Testing Set (RTTS) 4,322
Hybrid Subjective Testing Set (HSTS) 20
dehazing models. To form large-scale training sets, [16], [17]
used the ground-truth images with depth meta-data from the
indoor NYU2 Depth Database [33] and the Middlebury stereo
database [34]. Recently, [30] generated the Foggy Cityscapes
dataset with 20,550 images from the Cityscapes dataset, using
incomplete depth information.
Despite their positive driving effects in the development of
dehazing algorithms, those synthetic images for training have
inevitably brought in two limitations. On the one hand, many
depth image datasets used to generate synthetic images, e.g.,
[33], [34], are collected from indoor scenes, while dehazing
is applied to outdoor environments. The content of training
data thus significantly diverges from the target subjects in real
dehazing applications. Such a mismatch will undermine the
practical effectiveness of the trained dehazing models. On the
other hand, while a limited number of outdoor datasets
[35], [30] have been utilized, their depth information is either
incomplete or inaccurate, often leading to unrealistic hazy
images and artifacts during synthesis.
B. Testing Data and Evaluation Criteria
The testing sets in use are mostly synthetic hazy images
with known groundtruth too, although some algorithms were
also visually evaluated on real hazy images [16], [15], [17].
With multiple dehazing algorithms available, it becomes
pivotal to find appropriate evaluation criteria to compare their
dehazing results. Most dehazing algorithms rely on the full-
reference PSNR and SSIM metrics, assuming a synthetic
testing set with known clean groundtruth. As discussed
above, their practical applicability may be in jeopardy even if
a promising testing performance is achieved, due to the large
content divergence between synthetic and real hazy images.
To objectively evaluate dehazing algorithms on real hazy im-
ages without reference, no-reference image quality assessment
(IQA) models [36], [37], [38] are possible candidates. [39]
tested a few no-reference objective IQA models among several
dehazing approaches on their self-collected set of 25 hazy
images, but did not compare any latest CNN-based dehazing
models.
PSNR/SSIM, as well as other objective metrics, often align
poorly with human perceived visual qualities [39]. Many pa-
pers visually display dehazing results, but the result differences
between state-of-the-art dehazing algorithms are often too
subtle for people to reliably judge. That suggests the necessity
of conducting a subjective user study, towards which few
efforts have been made so far [40], [39].
It has been recognized that the performance of high-level
computer vision tasks, such as object detection and recogni-
tion, will deteriorate in the presence of various degradations,
and is thus largely affected by the quality of image restoration
and enhancement. Dehazing could be used as pre-processing
for many computer vision tasks executed in the wild, and
the resulting task performance could in turn be treated as an
indirect indicator of the dehazing quality. Such a “task-driven”
evaluation has received little attention so far, despite
its great implications for outdoor applications. A relevant
preliminary effort was presented in [17], where the authors
compared a few CNN-based dehazing models by placing them
in an object detection pipeline, but their tests were on synthetic
hazy data with bounding boxes. [30] created a dataset of
101 real-world images depicting foggy driving scenes, which
came with ground truth annotations for evaluating semantic
segmentation and object detection. Besides being relatively
small, their dataset cannot be used for evaluating dehazing
perceptual quality, either objectively or subjectively.
III. A NEW DATASET: RESIDE
We propose the REalistic Single Image DEhazing
(RESIDE) dataset, a new large-scale dataset for benchmarking
single image dehazing algorithms. RESIDE is built to be
comprehensive and diverse in data sources (synthetic versus
real world), contents (indoor versus outdoor scenes), and
evaluation options (see Section 4 for details).
A. RESIDE Training Set
The RESIDE training set consists of the Indoor Training Set
(ITS) and the Outdoor Training Set (OTS), both of which contain
synthetic images, but from distinct sources and through different
synthesis procedures.
Fig. 2. Visual comparison between the synthetic hazy images directly
generated from Make3D and from OTS. The first and second rows are
the synthesized hazy images based on the Make3D dataset and our OTS,
respectively.
ITS contains 110,000 synthetic hazy images, generated using
images from existing indoor depth datasets such as NYU2 [33]
and Middlebury stereo [34]. An optional split of 100,000 for
training and 10,000 for validation is provided. OTS contains
313,950 images synthesized from collected real world outdoor
scenes [41], without depth information. We use [31] to esti-
mate the depth map for each image, with which we synthesize
outdoor hazy images.
To generate both synthetic sets, we set different atmo-
spheric lights A, by choosing each channel uniformly in
[0.6, 1.0], and select β ∈ {0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6}. Both
ITS and OTS thus contain paired clean and hazy images,
where a clean groundtruth image leads to multiple pairs
whose hazy images are generated under different parameters
A and β.
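As an illustration of this synthesis procedure, a minimal sketch is given below; the function and variable names are ours, and details such as depth normalization may differ from the scripts actually used to build ITS and OTS.

import numpy as np

def synthesize_hazy(clean, depth, rng=None):
    # clean: H x W x 3 float image in [0, 1]; depth: H x W map d(x).
    rng = np.random.default_rng() if rng is None else rng
    A = rng.uniform(0.6, 1.0, size=3)                       # per-channel atmospheric light
    beta = rng.choice([0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6])  # scattering coefficient
    t = np.exp(-beta * depth)[..., np.newaxis]              # Eq. (2): t(x) = exp(-beta d(x))
    hazy = clean * t + A * (1.0 - t)                        # Eq. (1)
    return np.clip(hazy, 0.0, 1.0), A, beta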
One straightforward option would be to utilize a few ex-
isting depth datasets collected from outdoor scenes, such as
Make3D [35] and KITTI [42]. However, the outdoor depth
maps tend to be very unreliable. Due to the limitations of
RGB-based depth cameras, the Make3D dataset suffers from
at least 4 meters of average root mean squared error in the
predicted depths, and the KITTI dataset has at least 7 meters
of average error [43]. In comparison, the average depth errors
in indoor datasets, e.g., NYU-Depth-v2 [33], are usually as
small as 0.5 meter. Furthermore, the outdoor depth maps can also
contain a large amount of artifacts and large holes, which
renders them inappropriate for direct use in haze simulation. In
comparison, [31] is observed to produce more accurate depth
maps and lead to artifact-free hazy images. Visual comparisons
between the synthetic hazy images directly generated from
Make3D [35] and from OTS are included in Figure 2. Another
possible alternative is to adopt recent approaches of depth
map denoising and in-painting [44], which we leave for future
exploration.
B. RESIDE Testing Set
The RESIDE testing set is composed of three components:
Synthetic Objective Testing Set (SOTS), Real-world Task-
TABLE III
DETAILED CLASS INFORMATION OF RTTS.
Category person bicycle car bus motorbike Total
Normal 7,950 534 18,413 1,838 862 29,597
Difficult 3,416 164 6,904 752 370 11,606
Total 11,366 698 25,317 2,590 1,232 41,203
driven Testing Set (RTTS), and the Hybrid Subjective Testing
Set (HSTS), each corresponding to a different evaluation
viewpoint. SOTS selects 500 indoor images from NYU2 [33]
and 500 outdoor images from the Internet (non-overlapping with
images in ITS and OTS), and follows the same process as ITS
to synthesize hazy images. We specifically create challenging
dehazing cases for testing, e.g., white scenes added with heavy
haze. HSTS picks 10 synthetic outdoor hazy images generated
in the same way as OTS, together with 10 real-world hazy
images, combined for human subjective review.
RTTS contains 4,322 real-world hazy images crawled from
the web, covering mostly traffic and driving scenarios. Each
image is annotated with object categories and bounding boxes,
and RTTS is organized in the same form as VOC2007 [45]. We
currently focus on five traffic-related categories: car, bicycle,
motorbike, person, and bus. We obtain 41,203 annotated bounding
boxes, 11,606 of which are marked as "difficult" and not used
in this paper's experiments. The class details of RTTS are
shown in Table III. Additionally, we also collect 4,807 unannotated
real-world hazy images, which are not exploited in this paper,
but may potentially be used for domain adaptation in the future.
IV. A NEW MINDSET FOR DEHAZING EVALUATION
A. From Full-Reference to No-Reference
Despite the popularity of the full-reference PSNR/SSIM
metrics for evaluating dehazing algorithms, they are inherently
limited due to the unavailability of clean groundtruth images
in practice, as well as their often poor alignment with human
perception quality [39]. We thus refer to two no-reference IQA
models: spatial-spectral entropy-based quality (SSEQ) [38], and
blind image integrity notator using DCT statistics (BLIINDS-
II) [37], to complement the shortcomings of PSNR/SSIM. Note
that the scores of SSEQ and BLIINDS-II used in [38] and [37]
range from 0 (best) to 100 (worst), and we reverse the scores
to make their direction consistent with the full-reference metrics.
We will apply PSNR, SSIM, SSEQ, and BLIINDS-II to the
dehazed results on SOTS, and examine how consistent their
resulting ranking of dehazing algorithms will be. We will also
apply the four metrics on HSTS (PSNR and SSIM are only
computed on the 10 synthetic images), and further compare
those objective measures with subjective ratings.
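As a sketch of how these four metrics can be computed in practice (assuming a recent scikit-image for the full-reference metrics; sseq_score below is a placeholder for whatever SSEQ/BLIINDS-II implementation is available, e.g., the original authors' MATLAB code):

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(dehazed, groundtruth):
    # Both images are H x W x 3 floats in [0, 1].
    psnr = peak_signal_noise_ratio(groundtruth, dehazed, data_range=1.0)
    ssim = structural_similarity(groundtruth, dehazed, channel_axis=-1, data_range=1.0)
    return psnr, ssim

def reversed_no_reference_score(raw_score):
    # SSEQ and BLIINDS-II report 0 (best) to 100 (worst); reverse so that
    # higher is better, consistent in direction with PSNR/SSIM.
    return 100.0 - raw_score

# Example usage (sseq_score is a placeholder implementation):
#   psnr, ssim = full_reference_scores(dehazed, gt)
#   sseq = reversed_no_reference_score(sseq_score(dehazed))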
B. From Objective to Subjective
[39] investigated various choices of full-reference and no-
reference IQA models, and found them to be limited in
predicting the quality of dehazed images. We then conduct
a subjective user study on the quality of dehazing results
produced by different algorithms, from which we gain more
useful observations. Groundtruth images are also included
when they are available as references.
In the previous survey [39], a participant scored each dehaz-
ing result image with an integer from 1 to 10 that best reflects
its perceptual quality. We make two important innovations:
(1) asking participants to give pairwise comparisons rather
than individual ratings, the former often believed to be more
robust and consistent in subjective surveys; (2) decomposing
the perceptual quality into two dimensions: the dehazing
Clearness and Authenticity, the former defined as how thor-
oughly the haze has been removed, and the latter defined
as how realistic the dehazed image looks. These two
disentangled dimensions are motivated by our observations
that some algorithms produce natural-looking results but are
unable to fully remove haze, while some others remove the
haze at the price of unrealistic visual artifacts.
During the survey, each participant is shown a set of
dehazed result pairs obtained using two different algorithms
for the same hazy image. For each pair, the participant needs to
decide which one is better than the other in terms of Clearness,
and then which one is better in terms of Authenticity. The image pairs
are drawn from all the competitive methods randomly, and
the images winning the pairwise comparison will be compared
again in the next round, until the best one is selected. We fit
a Bradley-Terry [46] model to estimate the subjective scores
for each dehazing algorithm so that they can be ranked.
More details on the subjective survey are included in the
supplementary.
C. From Signal-Level to Task-Driven
Since dehazed images are often subsequently fed for au-
tomatic semantic analysis tasks such as recognition and de-
tection, we argue that the optimization target of dehazing in
these tasks is neither pixel-level nor perceptual-level quality,
but the utility of the dehazed images in the given semantic
analysis task [47]. We thus propose the task-driven evaluation
for dehazing algorithms, and study the problem of object
detection in the presence of haze as an example. We notice that
[30] investigated detection and segmentation problems in hazy
images as well, but not for the dehazing evaluation purpose.
Specifically, we use the same pre-trained Faster R-CNN [48]
model to detect the objects of interest from the dehazed
results of RTTS produced by various algorithms, and rank all algorithms
via the mean Average Precision (mAP) achieved.
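The task-driven protocol can be sketched as follows. The paper's experiments use a fixed VGG16-based py-faster-rcnn model trained on VOC2007 (see Section V-C); the snippet below substitutes a torchvision detector purely for illustration, and voc_map is a placeholder for a standard VOC-style mAP routine.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# A fixed, pre-trained detector: the same weights are applied to the dehazed
# outputs of every algorithm, so that only dehazing quality varies.
# (Older torchvision versions use pretrained=True instead of weights="DEFAULT".)
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect(images):
    # images: list of C x H x W float tensors in [0, 1]; returns, per image,
    # a dict with 'boxes', 'labels', and 'scores'.
    return detector(images)

# mAP = voc_map(detect(dehazed_images), rtts_annotations)   # placeholder routine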
V. EXPERIMENTAL COMPARISON
Based on the rich resources provided by RESIDE, we eval-
uate 9 representative state-of-the-art algorithms: Dark-Channel
Prior (DCP) [9], Fast Visibility Restoration (FVR) [10],
Boundary Constrained Context Regularization (BCCR) [11],
Artifact Suppression via Gradient Residual Minimization
(GRM) [12], Color Attenuation Prior (CAP) [13], Non-local
Image Dehazing (NLD) [14], DehazeNet [15], Multi-scale
CNN (MSCNN) [16], and All-in-One Dehazing Network
(AOD-Net) [17]. The last three belong to the latest CNN-based
dehazing algorithms. All data-driven algorithms are
trained on the entire RESIDE training set: ITS + OTS.
TABLE IV
AVERAGE FULL- AND NO-REFERENCE EVALUATION RESULTS OF DEHAZED RESULTS ON SOTS.
DCP [9] FVR [10] BCCR [11] GRM [12] CAP [13] NLD [14] DehazeNet [15] MSCNN [16] AOD-Net [17]
500 synthetic indoor images in SOTS
PSNR 18.87 18.02 17.87 20.44 21.31 18.53 22.66 20.01 21.01
SSIM 0.7935 0.7256 0.7701 0.8233 0.8247 0.7018 0.8325 0.7907 0.8372
SSEQ 64.94 67.75 65.83 63.30 64.69 67.46 65.46 65.22 67.65
BLIINDS-II 74.41 76.13 65.45 66.67 73.41 74.85 71.71 74.49 78.94
500 synthetic outdoor images in SOTS
PSNR 18.54 16.61 17.71 20.77 23.95 19.52 26.84 21.73 24.08
SSIM 0.7100 0.7236 0.7409 0.7617 0.8692 0.7328 0.8264 0.8313 0.8726
SSEQ 83.59 82.87 83.04 76.18 81.74 84.10 81.65 81.46 84.13
BLIINDS-II 89.12 86.72 89.43 82.98 85.93 86.32 83.60 86.72 87.43
TABLE V
AVERAGE SUBJECTIVE SCORES, AS WELL AS FULL- AND NO-REFERENCE EVALUATION RESULTS, OF DEHAZING RESULTS ON HSTS.
DCP [9] FVR [10] BCCR [11] GRM [12] CAP [13] NLD [14] DehazeNet [15] MSCNN [16] AOD-Net [17]
Synthetic images
Clearness 1.26 0.18 0.62 0.75 0.50 1 0.29 1.22 0.86
Authenticity 0.78 0.14 0.50 0.95 0.86 1 1.94 0.54 1.41
PSNR 17.27 15.68 16.61 20.48 22.88 18.92 26.94 20.53 23.41
SSIM 0.7210 0.7157 0.6947 0.7631 0.8223 0.7411 0.8758 0.7893 0.8616
SSEQ 86.15 85.68 85.60 78.43 85.32 86.28 86.01 85.56 86.75
BLIINDS-II 90.70 87.65 91.05 82.30 85.75 85.30 87.15 88.70 87.50
Real-world images
Clearness 0.39 0.46 0.45 0.75 1 0.54 1.16 1.29 1.05
Authenticity 0.17 0.20 0.18 0.62 1 0.15 1.03 1.27 1.07
SSEQ 68.65 67.75 66.63 70.19 67.67 67.96 68.34 68.44 70.05
BLIINDS-II 69.35 72.10 68.55 79.60 63.55 70.80 60.35 62.65 74.75
A. Objective Comparison on SOTS
We first compare the dehazed results on SOTS using two
full-reference (PSNR, SSIM) and two no-reference metrics
(SSEQ, BLIINDS-II). Table IV displays the detailed scores
of each algorithm in terms of each metric.2
In general, since learning-based methods [15], [13], [16],
[17] are optimized by directly minimizing the MSE loss
between output and ground truth pairs or maximizing the
likelihood on large-scale data, they clearly outperform earlier
algorithms based on natural or statistical priors [9], [11], [10],
[12], [14], in terms of PSNR and SSIM. Especially, for both
indoor and outdoor synthetic images, DehazeNet [15] achieves
the highest PSNR value and AOD-Net [17] obtains the best
SSIM score, while CAP [13] obtains the second-best PSNR
and SSIM on indoor and outdoor images, respectively.
However, when it comes to no-reference metrics, the re-
sults become less consistent. AOD-Net [17] still maintains
competitive performance by obtaining the best BLIINDS-II
result on indoor images, and the best SSEQ result on outdoor
images. On the other hand, several prior-based methods, such
as DCP [9], FVR [10], BCCR [11] and NLD [14], also show
competitiveness: FVR [10] ranks first in terms of SSEQ on
indoor images, and BCCR [11] wins on outdoor images in
terms of BLIINDS-II. We visually observe the results, and find
that DCP [9], BCCR [11] and NLD [14] tend to produce sharp
edges and highly contrasting colors, which explains why they
are preferred by BLIINDS-II and SSEQ. Such an inconsistency
between full- and no-reference evaluations aligns with the
2We highlight the top-3 performances using red, cyan and blue, respectively.
previous argument [39] that existing objective IQA
models are very limited in providing proper quality predictions
of dehazed images.
B. Subjective Comparison on HSTS
We recruit 100 participants from different educational back-
grounds for the subjective survey as described in Section 4.2,
using HSTS which contains 10 synthetic outdoor and 10 real-
world hazy images. We fit a Bradley-Terry [46] model to
estimate the subjective score for each method so that they can
be ranked. In the Bradley-Terry model, the probability that an
object X is favored over Y is assumed to be

p(X ≻ Y) = e^(s_X) / (e^(s_X) + e^(s_Y)) = 1 / (1 + e^(s_Y - s_X)),   (4)

where s_X and s_Y are the subjective scores for X and Y.
The scores s for all the objects can be jointly estimated by
maximizing the log likelihood of the pairwise comparison
observations:

max_s Σ_{i,j} w_ij log [ 1 / (1 + e^(s_j - s_i)) ],   (5)

where w_ij is the (i, j)-th element in the winning matrix W,
representing the number of times that method i is favored
over method j. We use the Newton-Raphson method to solve
Eq. (5). Note that for a synthetic image, we have a 10 × 10
winning matrix W, including the ground truth and the nine
dehazing methods' results. For a real-world image, its winning
matrix W is 9 × 9 due to the absence of ground truth. For
synthetic images, we set the score of the ground truth as
1 to normalize the scores.
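The scores can also be estimated numerically from the winning matrix; the sketch below maximizes the log-likelihood in Eq. (5) with plain gradient ascent rather than the Newton-Raphson updates used in the paper, which is an implementation simplification, not the authors' code.

import numpy as np

def fit_bradley_terry(W, iters=5000, lr=0.5):
    # W[i, j] = number of times method i was preferred over method j.
    W = np.asarray(W, dtype=float)
    W = W / max(W.sum(), 1.0)                  # normalize counts; the maximizer of Eq. (5) is unchanged
    n = W.shape[0]
    s = np.zeros(n)                            # subjective scores s_i
    for _ in range(iters):
        # p[i, j] = 1 / (1 + exp(s_j - s_i)), model probability that i beats j (Eq. (4))
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))
        q = W * (1.0 - p)
        grad = q.sum(axis=1) - q.sum(axis=0)   # gradient of the log-likelihood w.r.t. s
        s += lr * grad
        s -= s.mean()                          # scores are identifiable only up to a shift
    return s

For a synthetic HSTS image, W would be the 10 × 10 matrix that includes the ground truth; the estimated scores can then be shifted so that the ground-truth entry equals 1.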
Figures 4 and 5 show qualitative examples of dehazed
results on a synthetic and a real-world image, respectively.
Quantitative results can be found in Table V and the trends
are visualized in Figure 3. We also compute the full- and
no-reference metrics on synthetic images to examine their
consistency with the subjective scores.
A few interesting observations could be drawn:
The subjective qualities of various algorithms’ results
show different trends on synthetic and real hazy im-
ages. On the 10 synthetic images of HSTS, DCP [9]
receives the best clearness score and DehazeNet the
best in authenticity score. On the 10 real images, CNN-
based methods [15], [16], [17] rank top-3 in terms of
both clearness and authenticity, in which MSCNN [16]
achieves the best according to both scores.
The clearness and authenticity scores of the same image
are often not aligned. As can be seen from Figure 3, the
two subjective scores are hardly correlated on synthetic
images; their correlation is stronger on real images.
That reflects the complexity and multi-facet nature of
subjective perceptual evaluation.
From Table V, we observe the divergence between sub-
jective and objective (both full- and no-reference) evalu-
ation results. For the best performer in subjective evalu-
ation, MSCNN [16], its PSNR/SSIM results on synthetic
indoor images are quite low, while its SSEQ/BLIINDS-II scores
on both synthetic and real-world images are moderate.
As another example, GRM [12] receives the highest
SSEQ/BLIINDS-II scores on real HSTS images. However,
both of its subjective scores rank only fifth among the nine
algorithms on the same set.
C. Task-driven Comparison on RTTS
We adopt the commonly used Faster R-CNN [48]3, and
use the same fixed model to detect objects from the dehazed
results over the RTTS images. Figure 6 compares the object
detection results on RTTS hazy images before and after applying
nine different dehazing algorithms4.
Table VI compares all mAP results, from which we can see
that BCCR [11] and MSCNN [16] are the two best performers,
implying that both traditional and CNN-based methods have
good potential to contribute to object detection. On the other
hand, when comparing the ranking of detection mAP with the
no-reference results on the same set, we can again only
observe a weak correlation. For example, BCCR [11] achieves the
highest BLIINDS-II value, but MSCNN has lower SSEQ and
BLIINDS-II scores than most competitors. This demonstrates the
necessity of evaluating in a task-driven way.
Discussion: Optimizing Detection Performance in Haze?
[17] for the first time reported promising performance on
3Here we used py-faster-rcnn and its VGG16 based model trained on
VOC2007 trainval dataset.
4For FVR, only 3,966 images are counted, since for the remaining 356
FVR fails to provide any reasonable result.
TABLE VI
DETECTION RESULTS ON RTTS.
Method mAP Person Bicycle Car Bus Motorbike
RawHaze 0.38 0.61 0.41 0.35 0.21 0.30
DCP [9] 0.41 0.62 0.41 0.42 0.24 0.34
FVR [10] 0.35 0.58 0.39 0.35 0.19 0.25
BCCR [11] 0.42 0.62 0.45 0.43 0.25 0.34
GRM [12] 0.29 0.50 0.31 0.26 0.15 0.22
CAP [13] 0.40 0.61 0.40 0.42 0.25 0.30
NLD [14] 0.40 0.61 0.40 0.42 0.24 0.33
DehazeNet [15] 0.41 0.61 0.41 0.42 0.25 0.34
MSCNN [16] 0.41 0.61 0.42 0.43 0.25 0.36
AOD-Net [17] 0.37 0.61 0.40 0.35 0.21 0.30
detecting objects in the haze, by concatenating and jointly tun-
ing AOD-Net with Faster-RCNN as one unified pipeline, fol-
lowing [49], [50]. The authors trained their detection pipeline
using an annotated dataset of synthetic hazy images, generated
from VOC2007 [45]. Due to the absence of annotated realistic
hazy images, they only reported quantitative performance on
a separate set of synthetic annotated images. While their goal
is different from the scope of RTTS (where a fixed Faster-
RCNN is applied on dehazing results for fair comparison), we
are interested in exploring whether we could further boost the
detection mAP on RTTS realistic hazy images using such a
joint pipeline.
To further enhance the performance of such a
dehazing + detection joint pipeline on realistic hazy photos
or videos, we see at least two other noteworthy potential
options for future efforts:
Developing photo-realistic simulation approaches of gen-
erating hazy images from clean ones [51], [52]. That
would resolve the bottleneck of hand-labeling and sup-
ply large-scale annotated training data with little mis-
match. The technique of haze severity estimation [53]
may also help the synthesis, by first estimating the
haze level from (unannotated) testing images and then
generating training images accordingly.
If we view the synthetic hazy images as the source do-
main (with abundant labels) and the realistic ones as the
target domain (with scarce labels), then the unsupervised
domain adaptation can be performed to reduce the domain
gap in low-level features, by exploiting unannotated real-
istic hazy images. For example, [54] provided an example
of pre-training the robust low-level CNN filters using
unannotated data from both source and target domains,
leading to much improved robustness when applied to
testing on the target domain data. For this purpose, we
have included 4,807 unannotated realistic hazy images in
RESIDE that might help build such models.
Apparently, the above discussions can be straightforwardly
applied to other high-level vision tasks in uncontrolled outdoor
environments (e.g., bad weather and poor illumination), such
as tracking, recognition, semantic segmentation, etc.
D. Running Time
Table VIII reports the per-image running time of each algo-
rithm, averaged over the synthetic indoor images (620 × 460)
in SOTS, using a machine with a 3.6 GHz CPU and 16 GB RAM.
Fig. 3. Averaged clearness and authenticity scores: (a) on 10 synthetic images in HSTS; and (b) on real-world images in HSTS.
Fig. 4. Examples of dehazed results on a synthetic hazy image from HSTS: (a) clean image; (b) hazy image; (c) DCP; (d) FVR; (e) BCCR; (f) GRM; (g) CAP; (h) NLD; (i) DehazeNet; (j) MSCNN; (k) AOD-Net.
TABLE VII
AVERAGE NO-REFERENCE METRICS OF DEHAZED RESULTS ON RTTS.
DCP [9] FVR [10] BCCR [11] GRM [12] CAP [13] NLD [14] DehazeNet [15] MSCNN [16] AOD-Net [17]
SSEQ 62.87 63.59 63.31 58.64 60.66 59.37 60.01 62.31 65.35
BLIINDS-II 68.34 67.68 74.07 54.54 65.15 68.32 52.54 56.59 71.05
TABLE VIII
COMPARISON OF AVERAGE PER-IMAGE RUNNING TIME (SECONDS) ON SYNTHETIC INDOOR IMAGES IN SOTS.
DCP [9] FVR [10] BCCR [11] GRM [12] CAP [13] NLD [14] DehazeNet [15] MSCNN [16] AOD-Net [17]
Time 1.62 6.79 3.85 83.96 0.95 9.89 2.51 2.60 0.65
All methods are implemented in MATLAB, except AOD-Net,
which is implemented in Pycaffe. However, it is fair to compare AOD-Net with
other methods, since the MATLAB implementation has superior
efficiency over Pycaffe, as shown in [17]. AOD-Net shows a
clear advantage over others in efficiency, thanks to its light-
weight feed-forward structure.
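A simple way to reproduce such per-image timing (a generic sketch; dehaze_fn stands for any of the compared algorithms wrapped as a Python callable) is:

import time

def average_runtime(dehaze_fn, images):
    # Returns the mean wall-clock time per image, in seconds.
    start = time.perf_counter()
    for img in images:
        dehaze_fn(img)
    return (time.perf_counter() - start) / len(images)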
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we propose the RESIDE benchmark and
systematically evaluate state-of-the-art single image de-
hazing algorithms. From the results presented, there seems to be no single-
best dehazing model for all criteria: AOD-Net and DehazeNet
are favored by PSNR and SSIM; DCP, FVR and BCCR are
more competitive in terms of no-reference metrics; MSCNN
is shown to have the most appreciated subjective quality; BCCR
Fig. 5. Examples of dehazed results on a real-world hazy image from HSTS: (a) hazy image; (b) DCP; (c) FVR; (d) BCCR; (e) GRM; (f) CAP; (g) NLD; (h) DehazeNet; (i) MSCNN; (j) AOD-Net.
and MSCNN lead to superior detection performance on real
hazy images; and finally AOD-Net is the most efficient among
all. We see the highly complicated nature of the dehazing
problem, in both real-world generalization and evaluation
criteria. For future research, we advocate evaluating and
optimizing dehazing algorithms towards more dedicated cri-
teria (e.g., subjective visual quality, or high-level target task
performance), rather than solely PSNR/SSIM, which are found
to be poorly aligned with other metrics we used. In particular,
correlating dehazing with high-level computer vision problems
will likely lead to innovative robust computer vision pipelines
that will find many immediate applications. Another gap to
fill is developing no-reference metrics that are better correlated
with human perception, for evaluating dehazing results. That
progress will accelerate the needed shift from current full-
reference evaluation on only synthetic images, to the more
realistic evaluation schemes with no ground truth.
ACKNOWLEDGMENT
We appreciate the support from the authors of [14], [18]. We
acknowledge Dr. Changxing Ding, South China University of
Technology, for his indispensable support to our data collection
and cleaning.
REFERENCES
[1] "How autonomous vehicles will navigate bad weather remains foggy," https://www.forbes.com/sites/centurylink/2016/11/29/how-autonomous-vehicles-will-navigate-bad-weather-remains-foggy/#1aff07088662.
[2] S. G. Narasimhan and S. K. Nayar, "Contrast restoration of weather degraded images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 6, pp. 713-724, 2003.
[3] Y. Y. Schechner, S. G. Narasimhan, and S. K. Nayar, "Instant dehazing of images using polarization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I-I.
[4] T. Treibitz and Y. Y. Schechner, "Polarization: Beneficial for visibility enhancement?" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 525-532.
[5] J. Kopf, B. Neubert, B. Chen, M. Cohen, D. Cohen-Or, O. Deussen, M. Uyttendaele, and D. Lischinski, "Deep photo: Model-based photograph enhancement and viewing," ACM Transactions on Graphics (TOG), vol. 27, no. 5, p. 116, 2008.
[6] E. J. McCartney, Optics of the Atmosphere: Scattering by Molecules and Particles. New York: John Wiley and Sons, 1976.
[7] S. G. Narasimhan and S. K. Nayar, "Chromatic framework for vision in bad weather," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2000, pp. 598-605.
[8] ——, "Vision and the atmosphere," International Journal of Computer Vision, vol. 48, no. 3, pp. 233-254, 2002.
[9] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] J.-P. Tarel and N. Hautiere, "Fast visibility restoration from a single color or gray level image," in IEEE International Conference on Computer Vision, 2009.
[11] G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan, "Efficient image dehazing with boundary constraint and contextual regularization," in IEEE International Conference on Computer Vision, 2013.
[12] C. Chen, M. N. Do, and J. Wang, "Robust image and video dehazing with visual artifact suppression via gradient residual minimization," in European Conference on Computer Vision, 2016.
[13] Q. Zhu, J. Mai, and L. Shao, "A fast single image haze removal algorithm using color attenuation prior," IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3522-3533, 2015.
[14] D. Berman, S. Avidan et al., "Non-local image dehazing," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Fig. 6. Visualization of two RTTS images' object detection results after applying different dehazing algorithms: (a) ground truth; (b) raw haze; (c) DCP; (d) FVR; (e) BCCR; (f) GRM; (g) CAP; (h) NLD; (i) DehazeNet; (j) MSCNN; (k) AOD-Net.
[15] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, "DehazeNet: An end-to-end system for single image haze removal," IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5187-5198, 2016.
[16] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, "Single image dehazing via multi-scale convolutional neural networks," in European Conference on Computer Vision, 2016.
[17] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, "AOD-Net: All-in-one dehazing network," in IEEE International Conference on Computer Vision, 2017.
[18] R. Fattal, "Single image dehazing," ACM Transactions on Graphics (TOG), vol. 27, no. 3, p. 72, 2008.
[19] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341-2353, 2011.
[20] K. Tang, J. Yang, and J. Wang, "Investigating haze-relevant features in a learning framework for image dehazing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2995-3000.
[21] Z. Li, P. Tan, R. T. Tan, D. Zou, S. Zhiying Zhou, and L.-F. Cheong, "Simultaneous video defogging and stereo reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4988-4997.
[22] L. Kratz and K. Nishino, "Factorizing scene albedo and depth from a single foggy image," in IEEE 12th International Conference on Computer Vision, 2009, pp. 1701-1708.
[23] K. Nishino, L. Kratz, and S. Lombardi, "Bayesian defogging," International Journal of Computer Vision, vol. 98, no. 3, pp. 263-278, 2012.
[24] D. Nair, P. A. Kumar, and P. Sankaran, "An effective surround filter for image dehazing," in Proceedings of the 2014 International Conference on Interdisciplinary Advances in Applied Computing, 2014, p. 20.
[25] J. Zhou and F. Zhou, "Single image dehazing motivated by retinex theory," in 2nd International Symposium on Instrumentation and Measurement, Sensor Network and Automation (IMSNA), 2013, pp. 243-247.
[26] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, "End-to-end united video dehazing and detection," arXiv preprint arXiv:1709.03919, 2017.
[27] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440-3451, 2006.
[28] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 3, 2017.
[29] R. Fattal, "Dehazing using color-lines," ACM Transactions on Graphics (TOG), vol. 34, no. 1, p. 13, 2014.
[30] C. Sakaridis, D. Dai, and L. Van Gool, "Semantic foggy scene understanding with synthetic data," arXiv preprint arXiv:1708.07819, 2017.
[31] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024-2039, 2016.
[32] J.-P. Tarel, N. Hautiere, L. Caraffa, A. Cord, H. Halmaoui, and D. Gruyer, "Vision enhancement in homogeneous and heterogeneous fog," IEEE Intelligent Transportation Systems Magazine, vol. 4, no. 2, pp. 6-20, 2012.
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in European Conference on Computer Vision, 2012, pp. 746-760.
[34] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2003, pp. I-I.
[35] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, 2009.
[36] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, 2012.
[37] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain," IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339-3352, 2012.
[38] L. Liu, B. Liu, H. Huang, and A. C. Bovik, "No-reference image quality assessment based on spatial and spectral entropies," Signal Processing: Image Communication, vol. 29, no. 8, pp. 856-863, 2014.
[39] K. Ma, W. Liu, and Z. Wang, "Perceptual evaluation of single image dehazing algorithms," in IEEE International Conference on Image Processing (ICIP), 2015, pp. 3600-3604.
[40] Z. Chen, T. Jiang, and Y. Tian, "Quality assessment for comparing image enhancement algorithms," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3003-3010.
[41] "Beijing realtime weather photos," http://goo.gl/svzxLm.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354-3361.
[43] F. Ma and S. Karaman, "Sparse-to-dense: Depth prediction from sparse depth samples and a single image," arXiv preprint arXiv:1709.07492, 2017.
[44] L. Wang, H. Jin, R. Yang, and M. Gong, "Stereoscopic inpainting: Joint color and depth completion from stereo images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.
[45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[46] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, pp. 324-345, 1952.
[47] D. Liu, D. Wang, and H. Li, "Recognizable or not: Towards image semantic quality assessment for compression," Sensing and Imaging, vol. 18, no. 1, p. 1, 2017.
[48] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015.
[49] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang, "Studying very low resolution recognition using deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4792-4800.
[50] B. Cheng, Z. Wang, Z. Zhang, Z. Li, J. Yang, and T. S. Huang, "Robust emotion recognition from low quality and low bit rate video: A deep learning approach," in Proceedings of the 7th Conference on Affective Computing and Intelligent Interaction, 2017.
[51] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," arXiv preprint arXiv:1612.07828, 2016.
[52] K. Li, Y. Li, S. You, and N. Barnes, "Photo-realistic simulation of road scene for data-driven methods in bad weather," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 491-500.
[53] Y. Li, J. Huang, and J. Luo, "Using user generated online photos to estimate and monitor air pollution in major cities," in Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, 2015, p. 79.
[54] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang, "DeepFont: Identify your font from an image," in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 451-459.
... In this experiment, we trained the cycle-defog2refog algorithm using the RESIDE dataset (which includes ITS and SOTS datasets) [49]. The ITS dataset consists of 100,000 synthetic indoor foggy images. ...
Conference Paper
Autonomous Vehicle (AV) technologies are faced with several challenges under adverse weather conditions such as snow, fog, rain, sun glare, etc. Object detection under adverse weather conditions is one of the most critical issues facing autonomous driving. Several state-of-the-art Convolutional Neural Network (CNN) based object detection algorithms have been employed in autonomous vehicles and promising results have been established under favorable weather conditions. However, results from the literature show that the accuracy and performance of these CNN-based object detectors under adverse weather conditions tend to diminish rapidly. This problem continues to raise major concerns in the research and automotive community. In this paper, the foggy weather condition is our case study. The goal of this work is to investigate how defogging and restoring the quality of foggy images can improve the performance of CNN-based real-time object detectors. We employed a Cycle consistent Generative Adversarial Network (CycleGAN)-based image fog removal technique [1] to defog, improve the visibility and the quality of the foggy images. We train our YOLOv3 algorithm using the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset [2]. Using the trained YOLOv3 network, we perform object detection on the original foggy images and restored images. We compare the performances of the object detector under no fog, moderate fog, and heavy fog conditions. Our results show that detection performance improved significantly under moderate fog and there was no significant improvement under heavy fog conditions.
... However, the synthesized hazy image is not real enough due to the lack of real depth map information in the above dataset synthesis process based on atmospheric scattering model. erefore, RESIDE dataset [43] in which the hazy image is generated by clean image with its corresponding depth map on atmospheric scattering model is used as the second dataset for further verification. e image depth is combined in RESIDE dataset to generate the fog distribution based on the atmospheric scattering model, which is more consistent with the actual situation and more authentic. ...
Article
Full-text available
Maritime video surveillance of visual perception system has become an essential method to guarantee unmanned surface vessels (USV) traffic safety and security in maritime applications. However, when visual data are collected in a foggy marine environment, the essential optical information is often hidden in the fog, potentially resulting in decreased accuracy of ship detection. Therefore, a dual-channel and two-stage dehazing network (DTDNet) is proposed to improve the clarity and quality of the image to guarantee reliable ship detection under foggy conditions. Specifically, an upper and lower sampling structure is introduced to expand the original two-stage dehazing network into a two-channel network, to further capture the image features from different scale. Meanwhile, the attention mechanism is combined to provide different weights for different feature maps to maintain more image information. Furthermore, the perceptual function is constructed with the MSE-based loss function, so that it can better reduce the gap between the dehazing image and the unhazy image. Extensive experiments show that DTDNet has a better dehazing performance on both visual effects and quantitative index than other state-of-the-art dehazing networks. Moreover, the dehazing network is combined with the problem of ship detection under a sea-fog environment, and experiment results demonstrate that our network can be effectively applied to improve the visual perception performance of USV.
... Two widely-used test images with large sky areas ('Mountain' and 'Man') are commonly chosen in dehazing papers, as shown in the first and second rows of Fig. 13, respectively. We also choose two real-world outdoor hazy images from the test subset SOTS of the RESIDE dataset [46] for testing as shown in the third and fourth rows of Fig. 13. All the results are compared with those of six state-of-the-art methods: He's [20], Shin's [34], Berman's [31], DehazeNet [38], DCPDN [40], and MSBDN [42]. ...
Article
Full-text available
Single-image dehazing techniques are extensively used in outdoor optical image acquisition equipment. Most existing methods pay attention to use various priors to estimate scene transmission. In this paper, a fast single-image dehazing algorithm is proposed based on a piecewise transformation model between the minimum channels of the hazy image and the haze-free image in optical model. The minimum channel of the haze-free image is obtained by the piecewise transformation, which is a quadratic function model that we establish for the dark region, and a linear transformation model is established for the bright region. Using the minimum channels of the hazy image and the haze-free image, a transmission estimation model is established based on the haze optical model with adjustment variables. To obtain an accurate estimation of atmospheric light, we estimate the atmospheric light twice. Finally, the haze-free image is restored. Experimental results show that the proposed algorithm has minimal halo artifacts and color distortion in various depths of field, flat areas and sky areas. From the subjective evaluation, objective evaluation and running time analysis, it can be seen that the algorithm in this paper is superior to most existing technologies.
Chapter
In recent years, the researchers of image dehazing mainly focused on deep learning algorithms. However, due to the defective network structure, and inadequate feature extraction, the deep learning algorithm still has many problems to be solved. In this paper, we fuse the physical models including haze imaging model with absorption compensation, multiple scattering imaging model and multi-scale retinex imaging model with convolutional neural network to construct the image dehazing network. Multiple scattering haze imaging model is used to describe the haze imaging process in a more consistent way with the physical imaging mechanism. And the multi-scale retinex imaging model ensures the color fidelity. In the network structure, multi-scale feature extraction module can improve network performance in terms of feature reuse. In the attention feature extraction module, the back-propagating of the important front features is used to enhance features. This method can effectively make up for the deficiency that autocorrelation features cannot share the deep-level information, which is also effective for features replenishment. The results of the comparative experiment demonstrate that our network outperforms state-of-the-art dehazing methods.
Article
In recent decades, haze has become an environmental issue due to its effects on human health. It also reduces visibility and degrades the performance of computer vision algorithms in autonomous driving applications, which may jeopardize driving safety. Therefore, it is extremely important to remove the haze effect from an image in real time. The purpose of this study is to leverage useful modules to achieve a lightweight, real-time image-dehazing model. Based on the U-Net architecture, this study integrates four modules: an image pre-processing block, inception-like blocks, spatial pyramid pooling blocks, and attention gates. The original attention gate is revised to fit image dehazing and to consider different color spaces so as to retain the advantages of each. An ablation study and a quantitative evaluation illustrate the advantages of these modules. On existing indoor and outdoor test datasets, the proposed method shows outstanding dehazing quality and efficient execution time compared with other state-of-the-art methods. This study demonstrates that the proposed model improves dehazing quality, remains lightweight, and produces pleasing dehazing results. A comparison with existing methods on the RESIDE SOTS dataset shows that the proposed model improves the SSIM and PSNR metrics by at least 5-10%.
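As a point of reference for the attention-gate module, here is a hedged PyTorch sketch of a standard additive attention gate in the spirit of Attention U-Net; the revised, color-space-aware gate of this paper is not reproduced, and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate that re-weights skip-connection features."""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(in_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: skip-connection features; g: gating signal from the coarser level,
        # assumed already resized to the same spatial size as x.
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn  # attended skip features, passed on to the decoder
```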
Article
Single image dehazing is a challenging, ill-posed task; we address it from the perspectives of the dehazing dataset and the network architecture. Regarding the dataset, existing dehazing datasets suffer from unnatural hazy images, unqualified ground truths, and monotonous, idealized depth-related haze synthesized by the physical model. We therefore propose a novel haze data synthesis method to produce a dehazing dataset with non-homogeneous haze, named FiveK-Haze. Regarding the network architecture, existing methods either produce the enhanced image directly in an end-to-end approach or restore the haze-free image from estimated physical parameters in a multi-branch approach. To obtain better results on real-world non-homogeneous hazy images, we combine the two approaches in a complementary way and design a new dehazing network with enhancement-and-restoration fused CNNs, called ERFNet. Exhaustive experimental results demonstrate the superiority of our method over state-of-the-art methods in terms of both generalization performance and haze removal, especially for detail enhancement and color restoration.
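For context, the depth-related physical-model synthesis that the abstract contrasts with is I = J*t + A*(1 - t) with t = exp(-beta * d). The NumPy sketch below shows only this standard homogeneous synthesis; the beta and A values are illustrative, and FiveK-Haze's non-homogeneous synthesis method is different.

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, A=0.9):
    """clear: HxWx3 image in [0, 1]; depth: HxW scene depth map (normalized)."""
    # Transmission decays exponentially with scene depth under homogeneous haze.
    t = np.exp(-beta * depth)[..., None]
    return clear * t + A * (1.0 - t)
```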
Chapter
To remove haze and make hazy scenes clear, we propose an image dehazing network based on multi-scale feature extraction (MSFNet). MSFNet first extracts features directly from hazy images at three different resolutions to obtain fine feature maps and concatenates them with the rough feature maps extracted during downsampling, fusing them to obtain richer image information. The fused feature maps are then fed into a module composed of ResNeXt building blocks for learning. Next, the feature maps produced by upsampling are sequentially concatenated with the feature maps learned by the ResNeXt module to obtain the residual image. Finally, the learned residual image is added to the input hazy image to produce the dehazing result. Experimental results on the SOTS dataset show that MSFNet improves the effectiveness of image dehazing.
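The following is a hedged, heavily simplified PyTorch sketch of the two ideas carried by this description: feature extraction at three input resolutions with fusion at full resolution, and residual learning where the predicted residual is added back to the hazy input. The tiny convolutions stand in for MSFNet's actual extractors and ResNeXt blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultiScaleResidualDehazer(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        # One small conv per resolution stands in for the real feature extractors.
        self.extract = nn.ModuleList([nn.Conv2d(3, ch, 3, padding=1) for _ in range(3)])
        self.fuse = nn.Conv2d(3 * ch, 3, 3, padding=1)  # fuse scales, predict residual

    def forward(self, hazy):
        feats = []
        for i, conv in enumerate(self.extract):
            x = F.interpolate(hazy, scale_factor=0.5 ** i, mode='bilinear',
                              align_corners=False) if i else hazy
            f = torch.relu(conv(x))
            # Bring every scale back to full resolution before concatenation.
            feats.append(F.interpolate(f, size=hazy.shape[-2:], mode='bilinear',
                                       align_corners=False) if i else f)
        residual = self.fuse(torch.cat(feats, dim=1))
        return hazy + residual  # dehazed = hazy input + learned residual
```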
Article
The purpose of image dehazing is to reduce the image degradation caused by suspended particles in support of high-level visual tasks. Besides the atmospheric scattering model, convolutional neural networks (CNNs) have been used for image dehazing. However, existing image dehazing algorithms are limited in the face of unevenly distributed haze and dense haze in real-world scenes. In this paper, we propose a novel end-to-end convolutional neural network called the attention enhanced serial Unet++ dehazing network (AESUnet) for single image dehazing. We build a serial Unet++ structure that adopts a serial strategy of two pruned Unet++ blocks based on residual connections. Compared with a simple encoder-decoder structure, the serial Unet++ module makes better use of the features extracted by encoders and promotes contextual information fusion at different resolutions. In addition, we improve the Unet++ module through pruning, a convolutional module with a ResNet structure, and a residual learning strategy, so that the serial Unet++ module can generate more realistic images with less color distortion. Furthermore, following the serial Unet++ blocks, an attention mechanism pays different attention to haze regions of different concentrations by learning weights in the spatial and channel domains. Experiments are conducted on two representative datasets: the large-scale synthetic dataset RESIDE and the small-scale real-world datasets I-HAZE and O-HAZE. The experimental results show that the proposed dehazing network is not only comparable to state-of-the-art methods on the RESIDE synthetic datasets, but also surpasses them by a very large margin on the I-HAZE and O-HAZE real-world datasets.
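Below is a hedged PyTorch sketch of attention that learns weights in both the channel and spatial domains, the general mechanism named in the abstract; the exact module used in AESUnet is not reproduced, and the reduction ratio and kernel size are illustrative.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Re-weights features per channel, then per spatial position."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_fc(x)       # per-channel weights
        return x * self.spatial_conv(x)  # per-pixel weights over haze regions
```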
Article
We present a quantitative performance analysis of a wide range of state-of-the-art object detection models, such as Mask R-CNN [8], RetinaNet [17] and EfficientDet [28], in haze-affected environments. This work uses two key performance metrics (Mean Average Precision and Localisation Recall Precision) to provide a nuanced view of the real-world performance of these models in an on-road driving application. Our findings show that the presence of haze further exacerbates the performance differences between single-stage and multi-stage detection models. In addition, not all aspects of model performance are affected equally. The inclusion of Localisation Recall Precision (LRP) [21] suggests that more recent models have much improved localisation performance even with similar false negative and false positive results. We also highlight some of the inherent limitations of neural network based approaches that could be addressed by Bayesian neural networks in the future.
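Both metrics mentioned above are built on matching detections to ground truth by intersection over union. The sketch below shows only that shared IoU computation; the ranking, per-class averaging and localisation terms of mAP and LRP are omitted.

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). IoU = intersection area / union area."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```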
Article
We consider the problem of dense depth prediction from a sparse set of depth measurements and a single RGB image. Since depth estimation from monocular images alone is inherently ambiguous and unreliable, we introduce additional sparse depth samples, which are either collected from a low-resolution depth sensor or computed from SLAM, to attain a higher level of robustness and accuracy. We propose the use of a single regression network to learn directly from the RGB-D raw data, and explore the impact of the number of depth samples on prediction accuracy. Our experiments show that, compared to using only RGB images, the addition of 100 spatially random depth samples reduces the prediction root-mean-square error by half on the NYU-Depth-v2 indoor dataset. It also boosts the percentage of reliable predictions from 59% to 92% on the more challenging KITTI driving dataset. We demonstrate two applications of the proposed algorithm: serving as a plug-in module in SLAM to convert sparse maps to dense maps, and creating much denser point clouds from low-resolution LiDARs. Code and a video demonstration are publicly available.
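A minimal NumPy sketch of the evaluation setting described above: sample a small number of spatially random depth measurements from a dense ground-truth map and score a dense prediction with RMSE over valid pixels. This is an illustration of the protocol, not the authors' code.

```python
import numpy as np

def sample_sparse_depth(depth_gt, num_samples=100, seed=0):
    """Keep num_samples random valid depth pixels; zero out the rest."""
    rng = np.random.default_rng(seed)
    valid = np.argwhere(depth_gt > 0)                       # usable depth pixels
    idx = valid[rng.choice(len(valid), size=num_samples, replace=False)]
    sparse = np.zeros_like(depth_gt)
    sparse[idx[:, 0], idx[:, 1]] = depth_gt[idx[:, 0], idx[:, 1]]
    return sparse

def rmse(pred, gt):
    """Root-mean-square error over pixels with valid ground-truth depth."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))
```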
Article
The recent development of CNN-based image dehazing has revealed the effectiveness of end-to-end modeling. However, extending the idea to end-to-end video dehazing has not yet been explored. In this paper, we propose an End-to-End Video Dehazing Network (EVD-Net) to exploit the temporal consistency between consecutive video frames. A thorough study has been conducted over a number of structural options to identify the best temporal fusion strategy. Furthermore, we build an End-to-End United Video Dehazing and Detection Network (EVDD-Net), which concatenates and jointly trains EVD-Net with a video object detection model. The resulting augmented end-to-end pipeline demonstrates much more stable and accurate detection results on hazy video.
Article
This work addresses the problem of semantic foggy scene understanding (SFSU). Although extensive research has been performed on image dehazing and on semantic scene understanding with weather-clear images, little attention has been paid to SFSU. Due to the difficulty of collecting and annotating foggy images, we choose to generate synthetic fog on real images that depict weather-clear outdoor scenes, and then leverage these synthetic data for SFSU by employing state-of-the-art convolutional neural networks (CNN). In particular, a complete pipeline to generate synthetic fog on real, weather-clear images using incomplete depth information is developed. We apply our fog synthesis on the Cityscapes dataset and generate Foggy Cityscapes with 20550 images. SFSU is tackled in two fashions: 1) with typical supervised learning, and 2) with a novel semi-supervised learning, which combines 1) with an unsupervised supervision transfer from weather-clear images to their synthetic foggy counterparts. In addition, this work carefully studies the usefulness of image dehazing for SFSU. For evaluation, we present Foggy Driving, a dataset with 101 real-world images depicting foggy driving scenes, which come with ground truth annotations for semantic segmentation and object detection. Extensive experiments show that 1) supervised learning with our synthetic data significantly improves the performance of state-of-the-art CNN for SFSU on Foggy Driving; 2) our semi-supervised learning strategy further improves performance; and 3) image dehazing marginally benefits SFSU with our learning strategy. The datasets, models and code will be made publicly available to encourage further research in this direction.
Conference Paper
We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
Article
With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors. We make several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts and stabilize training: (i) a 'self-regularization' term, (ii) a local adversarial loss, and (iii) updating the discriminator using a history of refined images. We show that this enables generation of highly realistic images, which we demonstrate both qualitatively and with a user study. We quantitatively evaluate the generated images by training models for gaze estimation and hand pose estimation. We show a significant improvement over using synthetic images, and achieve state-of-the-art results on the MPIIGaze dataset without any labeled real data.