
Single Image Reflection Removal Exploiting Misaligned Training Data and Network Enhancements


Abstract

Removing undesirable reflections from a single image captured through a glass window is of practical importance to visual computing systems. Although state-of-the-art methods can obtain decent results in certain situations, performance declines significantly when tackling more general real-world cases. These failures stem from the intrinsic difficulty of single image reflection removal – the fundamental ill-posedness of the problem, and the insufficiency of densely-labeled training data needed for resolving this ambiguity within learning-based neural network pipelines. In this paper, we address these issues by exploiting targeted network enhancements and the novel use of misaligned data. For the former, we augment a baseline network architecture by embedding context encoding modules that are capable of leveraging high-level contextual clues to reduce indeterminacy within areas containing strong reflections. For the latter, we introduce an alignment-invariant loss function that facilitates exploiting misaligned real-world training data that is much easier to collect. Experimental results collectively show that our method outperforms the state-of-the-art with aligned data, and that significant improvements are possible when using additional misaligned data.
Kaixuan Wei (1), Jiaolong Yang (2), Ying Fu (1), David Wipf (2), Hua Huang (1)
(1) Beijing Institute of Technology, (2) Microsoft Research
1. Introduction
Reflection is a frequently-encountered source of image
corruption that can arise when shooting through a glass sur-
face. Such corruptions can be addressed via the process of
single image reflection removal (SIRR), a challenging prob-
lem that has attracted considerable attention from the com-
puter vision community [22, 25, 39, 2, 5, 47, 44, 38]. Tra-
ditional optimization-based methods often leverage manual
intervention or strong prior assumptions to render the prob-
lem more tractable [22, 25]. Recently, alternative learning-
based approaches rely on deep Convolutional Neural Networks
(CNNs) in lieu of the costly optimization and hand-crafted
priors [5, 47, 44, 38]. But promising results notwithstanding,
SIRR remains a largely unsolved problem across
disparate imaging conditions and varying scene content.
For CNN-based reflection removal, our focus herein, the
challenge originates from at least two sources: (i) The ex-
traction of a background image layer devoid of reflection
artifacts is fundamentally ill-posed, and (ii) Training data
from real-world scenes are exceedingly scarce because of
the difficulty in obtaining ground-truth labels.
Mathematically speaking, it is typically assumed that a
captured image $I$ is formed as a linear combination of a
background or transmitted layer $T$ and a reflection layer $R$,
i.e., $I = T + R$. Obviously, when given access only to $I$,
there exists an infinite number of feasible decompositions.
Further compounding the problem is the fact that both $T$
and $R$ involve content from real scenes that may have overlapping
appearance distributions. This can make them difficult
to distinguish even for human observers in some cases,
and simple priors that might mitigate this ambiguity are not
available except under specialized conditions.
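The ill-posedness of the linear model can be made concrete with a small numerical sketch. The arrays below are hypothetical 4 x 4 layers (not data from the paper); the point is only that two different layer pairs can explain the same observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 grayscale layers (illustrative values only).
T = rng.uniform(0.0, 0.6, size=(4, 4))   # transmitted (background) layer
R = rng.uniform(0.0, 0.4, size=(4, 4))   # reflection layer
I = T + R                                # observed image under I = T + R

# Reassign half of the reflection content to the transmitted layer.
T_alt = T + 0.5 * R
R_alt = 0.5 * R

# Both non-negative decompositions reproduce I, yet the layers differ.
same_observation = np.allclose(T_alt + R_alt, I)
layers_differ = not np.allclose(T, T_alt)
```

Any split of this kind is equally consistent with $I$, which is why additional priors or learned context are needed to pick one.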
Moreover, although CNNs can perform a wide
variety of visual tasks, at times exceeding human capabilities,
they generally require a large volume of labeled training
data. Unfortunately, real reflection images accompanied
by densely-labeled, ground-truth transmitted layer intensities
are scarce. Consequently, previous learning-based approaches
have resorted to training with synthesized images
[5, 38, 47] and/or small real-world datasets captured with
specialized devices [47]. However, existing image synthesis
procedures are heuristic and the domain gap may jeopardize
accuracy on real images. At the same time, collecting sufficient
additional real data with precise ground-truth labels
is tremendously labor-intensive.
This paper is devoted to addressing both of the afore-
mentioned challenges. First, to better tackle the intrinsic
ill-posedness and diminish ambiguity, we propose to lever-
age a network architecture that is sensitive to contextual in-
formation, which has proven useful for other vision tasks
such as semantic segmentation [11, 48, 46, 13]. Note that
at a high level, our objective is to efficiently convert prior
information mined from labeled training data into network
structures capable of resolving this ambiguity. Within a tra-
ditional CNN model, especially in the early layers where
the effective receptive field is small, the extracted features
across all channels are inherently local. However, broader
non-local context is necessary to differentiate those features
that are descriptive of the desired transmitted image from
those that can be discarded as reflection-based. For example,
in image neighborhoods containing a particularly
strong reflection component, accurate separation by any
possible method (even one trained with arbitrarily rich train-
ing data) will likely require contextual information from re-
gions without reflection. To address this issue, we utilize
two complementary forms of context, namely, channel-wise
context and multi-scale spatial context. Regarding the former,
we apply a channel attention mechanism to the feature
maps from convolutional layers such that different features
are weighted differently according to global statistics
of the activations. For the latter, we aggregate information
across a pyramid of feature map scales within each chan-
nel to reach a global contextual consistency in the spatial
domain. Our experiments demonstrate that significant im-
provement can be obtained by these enhancements, leading
to state-of-the-art performance on two real-image datasets.
Secondly, orthogonal to architectural considerations, we
seek to expand the sources of viable training data by facilitating
the use of misaligned training pairs, which are considerably
easier to collect. Misalignment between an input
image $I$ and a ground-truth reflection-free version $T$ can be
caused by camera and/or object movements during the acquisition
process. In the previous works [37, 46], data pairs
$(I, T)$ were obtained by taking an initial photo through a
glass plane, followed by capturing a second one after the
glass has been removed. This process requires that the
camera, scene, and even lighting conditions remain static.
Adhering to these requirements across a broad acquisition
campaign can significantly reduce both the quantity and di-
versity of the collected data. Additionally, post-processing
may also be necessary to accurately align $I$ and $T$ to compensate
for spatial shifts caused by the refractive effect [37].
In contrast, capturing unaligned data is considerably less
burdensome, as shown in Fig. 1. For example, there is no
need for a tripod, table, or other special hardware; the cam-
era can be hand-held and the pose can be freely adjusted;
dynamic scenes in the presence of vehicles, humans, etc.
can be incorporated; and finally no post-processing of any
type is needed.
To handle such misaligned training data, we require a
loss function that is, to the extent possible, invariant to the
alignment, i.e., the measured image content discrepancy be-
tween the network prediction and its unaligned reference
should be similar to what would have been observed if
the reference were actually aligned. In the context of image
style transfer [17] and others, certain perceptual loss
functions have been shown to be relatively invariant to various
transformations. Our study shows that using only
the highest-level feature from a deep network (VGG-19 in
our case) leads to satisfactory results for our reflection removal
task. In both simulation tests and experiments using
a newly collected dataset, we demonstrate for the first
time that training/fine-tuning a CNN with unaligned data
improves the reflection removal results by a large margin.

Figure 1: Comparison of the reflection image data collection methods
in [46] and this paper. Panels: [46], Ours.
2. Related Work
This paper is concerned with reflection removal from
a single image. Previous methods utilizing multiple input
images of, e.g., flash/non-flash pairs [1], different polariza-
tion [20], multi-view or video sequences [6, 35, 30, 7, 24,
34, 9, 43, 45] will not be considered here.
Traditional methods. Reflection removal from a single im-
age is a massively ill-posed problem. Additional priors are
needed to solve the otherwise prohibitively-difficult problem
in traditional optimization-based methods [22, 25, 39,
2, 36]. In [22], user annotations are used to guide layer
separation jointly with a gradient sparsity prior [23]. [25]
introduces a relative smoothness prior where the reflections
are assumed to be blurry thus their large gradients are penal-
ized. [39] explores a variant of the smoothness prior where
a multi-scale Depth-of-Field (DoF) confidence map is uti-
lized to perform edge classification. [31] exploits the ghost
cues for layer separation. [2] proposes a simple optimization
formulation with an $\ell_0$ gradient penalty on the transmitted
layer inspired by image smoothing algorithms [42]. Although
decent results can be obtained by these methods where
their assumptions hold, the vastly-different imaging conditions
and complex scene content in the real world render
their generalization problematic.
Deep learning based methods. Recently, there is an
emerging interest in applying deep convolutional neural net-
works for single image reflection removal such that the
handcrafted priors can be replaced by data-driven learn-
ing [5, 38, 47, 44]. The first CNN-based method is due
to [5], where a network structure is proposed to first predict
the background layer in the edge domain followed by
reconstructing it in the color domain. Later, [38] proposes to
predict the edge and image intensity concurrently by two
cooperative sub-networks. The recent work of [44] presents
a cascade network structure which predicts the background
layer and reflection layer in an interleaved fashion. The earlier
CNN-based methods typically use the raw image intensity
discrepancy such as mean squared error (MSE) to train the
networks. Several recent works [47, 16, 3] adopt the perceptual
loss [17] which uses the multi-stage features of a
deep network pre-trained on ImageNet [29]. Adversarial
loss is investigated in [47, 21] to improve the realism
of the predicted background layers.
3. Approach
Given an input image $I$ contaminated with reflections,
our goal is to estimate a reflection-free transmitted image $\hat{T}$.
To achieve this, we train a feed-forward CNN $G_{\theta_G}$ parameterized
by $\theta_G$ to minimize a reflection removal loss function
$l$. Given training image pairs $\{(I_n, T_n)\}$, $n = 1, \dots, N$,
this involves solving:

$$\hat{\theta}_G = \arg\min_{\theta_G} \sum_{n=1}^{N} l\big(G_{\theta_G}(I_n), T_n\big). \tag{1}$$

We will first introduce the details of the network architecture
$G_{\theta_G}$, followed by the loss function $l$ applied to both aligned
data (the common case) and the newly proposed unaligned data
extensions. The overall system is illustrated in Fig. 2.
3.1. Basic Image Reconstruction Network
Our starting point can be viewed as the basic image re-
construction neural network component from [5] but mod-
ified in three aspects: (1) We simplify the basic residual
block [12] by removing the batch normalization (BN) layer
[14]; (2) we increase the capacity by widening the network
from 64 to 256 feature maps; and (3) for each input image
I, we extract hypercolumn features [10] from a pretrained
VGG-19 network [32], and concatenate these features with
$I$ as an augmented network input. As explained in [47],
such an augmentation strategy can help enable the network
to learn semantic clues from the input image.
Note that removing the BN layer from our network turns
out to be critical for optimizing performance in the present
context. As shown in [41], if batch sizes become too small,
prediction errors can increase precipitously and stability is-
sues can arise. Moreover, for a dense prediction task such as
SIRR, large batch sizes can become prohibitively expensive
in terms of memory requirements. In our case, we found
that within the tenable batch sizes available for reflection re-
moval, BN led to considerably worse performance, includ-
ing color attenuation/shifting issues as sometimes observed
in image-to-image translation tasks [5, 15, 49]. BN layers
have similarly been removed from other dense prediction
tasks such as image super-resolution [26] or deblurring [28].
At this point, we have constructed a useful base archi-
tecture upon which other more targeted alterations will be
applied shortly. This baseline, which we will henceforth
refer to as BaseNet, performs quite well when trained and
tested on synthetic data. However, when deployed on real-
world reflection images we found that its performance de-
graded by an appreciable amount, especially on the 20 real
images from [47]. Therefore, to better mitigate the tran-
sition from the make-believe world of synthetic images to
real-life photographs, we describe two modifications for in-
troducing broader contextual information into otherwise lo-
cal convolutional filters.
3.2. Context Encoding Modules
As mentioned previously, we consider both context be-
tween channels and multi-scale context within channels.
Channel-wise context. The underlying design princi-
ple here is to introduce global contextual information
across channels, and a richer overall structure within resid-
ual blocks, without dramatically increasing the parameter
count. One way to accomplish this is by incorporating a
channel attention module originally developed in [13] to re-
calibrate feature maps using global summary statistics.
Let $U = [u_1, \dots, u_c, \dots, u_C]$ denote original, uncalibrated
activations produced by a network block, with $C$
feature maps of size $H \times W$. These activations generally
only reflect local information residing within the corresponding
receptive fields of each filter. We then form scalar,
channel-specific descriptors $z_c = f_{gp}(u_c)$ by applying a
global average pooling operator $f_{gp}$ to each feature map
$u_c \in \mathbb{R}^{H \times W}$. The vector $z = [z_1, \dots, z_C] \in \mathbb{R}^C$ represents
a simple statistical summary of global, per-channel activations
and, when passed through a small network structure,
can be used to adaptively predict the relative importance of
each channel [13].
More specifically, the channel attention module first
computes $s = \sigma(W_U \delta(W_D z))$, where $W_D$ is a trainable
weight matrix that downsamples $z$ to dimension $R < C$,
$\delta$ is a ReLU non-linearity, $W_U$ represents a trainable upsampling
weight matrix, and $\sigma$ is a sigmoidal activation.
Elements of the resulting output vector $s \in \mathbb{R}^C$ serve
as channel-specific gates for calibrating feature maps via
$\tilde{u}_c = s_c \cdot u_c$.
Consequently, although each individual convolutional
filter has a local receptive field, the determination of which
channels are actually important in predicting the transmis-
sion layer and suppressing reflections is based on the pro-
cessing of a global statistic (meaning the channel descrip-
tors computed as activations pass through the network dur-
ing inference). Additionally, the parameter overhead intro-
duced by this process is exceedingly modest given that WD
and WUare just small additional weight matrices associated
with each block.
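The channel-wise recalibration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight matrices are random stand-ins for trained parameters, bias terms are omitted, and the sizes (8 channels reduced to 2) are hypothetical.

```python
import numpy as np

def channel_attention(U, W_D, W_U):
    """Recalibrate feature maps U of shape (C, H, W) with channel-wise gates."""
    z = U.mean(axis=(1, 2))                    # global average pooling -> (C,)
    hidden = np.maximum(W_D @ z, 0.0)          # downsample to R dims, then ReLU
    s = 1.0 / (1.0 + np.exp(-(W_U @ hidden)))  # upsample to C dims, then sigmoid
    return s[:, None, None] * U, s             # scale each channel by its gate

rng = np.random.default_rng(0)
C, R, H, W = 8, 2, 5, 5                        # illustrative sizes only
U = rng.standard_normal((C, H, W))
W_D = 0.1 * rng.standard_normal((R, C))        # random stand-in for trained weights
W_U = 0.1 * rng.standard_normal((C, R))        # random stand-in for trained weights
U_tilde, s = channel_attention(U, W_D, W_U)
```

Note that the two matrices add only 2 * R * C parameters per block, which is consistent with the modest overhead mentioned in the text.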
Figure 2: Overview of our approach for single image reflection removal. Diagram labels: Convolution, ReLU, Channel Attention, Pyramid, 13 blocks; loss for aligned data (pixel loss, feature loss, adversarial loss); loss for unaligned data (alignment-invariant loss, adversarial loss).
Multi-scale spatial context. Although we have found that
encoding the contextual information across channels al-
ready leads to significant empirical gains on real-world im-
ages, utilizing complementary multi-scale spatial informa-
tion within each channel provides further benefit. To ac-
complish this, we apply a pyramid pooling module [11],
which has proven to be an effective global-scene-level rep-
resentation in semantic segmentation [48]. As shown in
Fig. 2, we construct such a module using pooling operations
at sizes 4, 8, 16, and 32 situated in the tail of our network
before the final construction of $\hat{T}$. Pooling in this way
fuses features under four different pyramid scales. After
harvesting the resulting sub-region representations, we per-
form a non-linear transformation (i.e. a Conv-ReLU pair) to
reduce the channel dimension. The refined features are then
upsampled via bilinear interpolation. Finally, the different
levels of features are concatenated together as a final repre-
sentation reflecting multi-scale spatial context within each
channel; the increased parameter overhead is negligible.
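A rough NumPy sketch of the pooling, reduction, upsampling, and concatenation steps follows. Several choices here are assumptions made for illustration: the sizes 4, 8, 16, and 32 are treated as non-overlapping pooling windows, the channel reduction is a 1x1 convolution to C/4 channels with random weights, the input features are included in the concatenation, and nearest-neighbour upsampling is swapped in for the bilinear interpolation named in the text to keep the code short.

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling on x of shape (C, H, W)."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def pyramid_pooling(x, weights, scales=(4, 8, 16, 32)):
    """Concatenate x with one reduced, upsampled branch per pooling scale."""
    branches = [x]
    for k, w in zip(scales, weights):
        pooled = avg_pool(x, k)                               # (C, H/k, W/k)
        reduced = np.einsum("oc,chw->ohw", w, pooled)         # 1x1 convolution
        reduced = np.maximum(reduced, 0.0)                    # ReLU
        up = reduced.repeat(k, axis=1).repeat(k, axis=2)      # nearest-neighbour upsampling
        branches.append(up)
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
C, H, W = 8, 64, 64                                           # H, W divisible by 32
x = rng.standard_normal((C, H, W))
weights = [rng.standard_normal((C // 4, C)) for _ in range(4)]  # random stand-ins
out = pyramid_pooling(x, weights)                             # shape (16, 64, 64)
```

The coarsest branch summarises 32 x 32 regions, so every output location carries information from well beyond the receptive field of a single local filter.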
3.3. Training Loss for Aligned Data
In this section, we present our loss function for aligned
training pairs $(I, T)$, which consists of three terms similar
to previous methods [47, 44].
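Pixel loss. Following [5], we penalize the pixel-wise intensity
difference of $T$ and $\hat{T}$ via
$l_{pixel} = \alpha \|\hat{T} - T\|_2^2 + \beta \big( \|\nabla_x \hat{T} - \nabla_x T\|_1 + \|\nabla_y \hat{T} - \nabla_y T\|_1 \big)$,
where $\nabla_x$ and $\nabla_y$
are the gradient operators along the x- and y-directions, respectively.
We set $\alpha = 0.2$ and $\beta = 0.4$ in all our experiments.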
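As a hedged illustration of a pixel loss of this form, the sketch below assumes a squared L2 norm on intensities, L1 norms on the gradient differences, forward finite differences for the gradient operators, and sums rather than means. These are assumptions for the example and may differ from the authors' code.

```python
import numpy as np

def grad_x(img):
    """Forward difference along the x-direction (columns)."""
    return img[:, 1:] - img[:, :-1]

def grad_y(img):
    """Forward difference along the y-direction (rows)."""
    return img[1:, :] - img[:-1, :]

def pixel_loss(T_hat, T, alpha=0.2, beta=0.4):
    """alpha * squared intensity error + beta * L1 error of image gradients."""
    intensity = np.sum((T_hat - T) ** 2)
    gradient = (np.abs(grad_x(T_hat) - grad_x(T)).sum()
                + np.abs(grad_y(T_hat) - grad_y(T)).sum())
    return alpha * intensity + beta * gradient

rng = np.random.default_rng(0)
T = rng.uniform(size=(4, 4))
# A constant brightness offset changes intensities but not gradients,
# so only the alpha term contributes: 0.2 * 16 * 0.1**2 = 0.032.
loss_offset = pixel_loss(T + 0.1, T)
```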
Feature loss. We define the feature loss based on the
activations of the 19-layer VGG network [33] pretrained
on ImageNet [29]. Let $\phi_l$ be the feature from the $l$-th
layer of VGG-19; we define the feature loss as
$l_{feat} = \sum_l \lambda_l \|\phi_l(T) - \phi_l(\hat{T})\|_1$, where $\{\lambda_l\}$ are the balancing
weights. Similar to [47], we use the layers ‘conv2_2’,
‘conv3_2’, ‘conv4_2’, and ‘conv5_2’ of the VGG-19 net.
Adversarial loss. We further add an adversarial loss to
improve the realism of the produced background images.
We define an opponent discriminator network $D_{\theta_D}$ and
minimize the relativistic adversarial loss [18] defined as
$l_{adv} = l_{adv}^G = -\log(D_{\theta_D}(T, \hat{T})) - \log(1 - D_{\theta_D}(\hat{T}, T))$ for
$G_{\theta_G}$ and $l_{adv}^D = -\log(1 - D_{\theta_D}(T, \hat{T})) - \log(D_{\theta_D}(\hat{T}, T))$
for $D_{\theta_D}$, where $D_{\theta_D}(T, \hat{T}) = \sigma(C(T) - C(\hat{T}))$ with $\sigma(\cdot)$
being the sigmoid function and $C(\cdot)$ the non-transformed
discriminator function (refer to [18] for details).
To summarize, our loss for aligned data is defined as:

$$l_{aligned} = \omega_1 l_{pixel} + \omega_2 l_{feat} + \omega_3 l_{adv} \tag{2}$$

where we empirically set the weights as $\omega_1 = 1$, $\omega_2 = 0.1$,
and $\omega_3 = 0.01$, respectively, throughout our experiments.
3.4. Training Loss for Unaligned Data
To use misaligned data pairs $(I, T)$ for training, we need
a loss function that is invariant to the alignment, such that
the true similarity between $T$ and the prediction $\hat{T}$ can be
reasonably measured. In this regard, we note that human
observers can easily assess the similarity of two images
even if they are not aligned. Consequently, designing a
loss measuring image similarity at the perceptual level may
serve our goal. This motivates us to directly use a deep feature
loss for unaligned data.
Intuitively, the deeper the feature, the more likely it is
to be insensitive to misalignment. To experimentally ver-
ify this and find a suitable feature layer for our purposes,
we conducted tests using a pre-trained VGG-19 network as
follows. Given an unaligned image pair $(I, T)$, we use gradient
descent to finetune the weights of our network $G_{\theta_G}$
to minimize the feature difference of $T$ and $\hat{T}$, with features
extracted at different layers of VGG-19. Figure 3 shows that
using low-level or middle-level features from ‘conv2_2’ to
‘conv4_2’ leads to blurry results (similar to directly using a
pixel-wise loss), although the reflection is more thoroughly
removed. In contrast, using the highest-level feature from
‘conv5_2’ gives rise to a striking result: the predicted background
image is sharp and almost reflection-free.
Figure 3: The effect of using different losses to handle misaligned
real data. Panels: (a) Input, (b) Unaligned Ref., (c) Pretrained,
(d) $l_{pixel}$, (e) conv2_2, (f) conv3_2, (g) conv4_2, (h) conv5_2,
(i) Loss of [27]. (a) and (b) are the unaligned image pair $(I, T)$. (c)
shows the reflection removal result of our network trained on synthetic
data and a small number of aligned real data (see Section 4
for details). Reflection can still be observed in the predicted background
image. (d) is the result finetuned on $(I, T)$ with the pixel-wise
intensity loss. (e)-(h) are the results finetuned with features
at different layers of VGG-19. Only the highest-level feature from
‘conv5_2’ yields a satisfactory result. (i) shows the result finetuned
with the loss of [27]. (Best viewed on screen with zoom)
Recently, [27] introduced a “contextual loss” which is
also designed for training deep networks with unaligned
data for image-to-image translation tasks like image style
transfer. In Fig. 3, we also present the finetuned result using
this loss for our reflection removal task. Upon visual
inspection, the results are similar to our highest-level VGG
feature loss (quantitative comparison can be found in the
experiment section). However, our adopted loss (formally
defined below) is much simpler and more computationally
efficient than the loss from [27].
Alignment-invariant loss. Based on the above study, we
now formally define our invariant loss component designed
for unaligned data as $l_{inv} = \|\phi_h(T) - \phi_h(\hat{T})\|_1$, where
$\phi_h$ denotes the ‘conv5_2’ feature of the pretrained VGG-19
network. For unaligned data, we also apply an adversarial
loss which is not affected by misalignment. Therefore, our
overall loss for unaligned data can be written as

$$l_{unaligned} = \omega_4 l_{inv} + \omega_5 l_{adv} \tag{3}$$

where we set the weights as $\omega_4 = 0.1$ and $\omega_5 = 0.01$.
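The intuition that coarser, higher-level features are less sensitive to misalignment can be illustrated with a toy experiment. The feature extractor below is a deliberately crude stand-in (16 x 16 block averages), not VGG-19 ‘conv5_2’, and the image is random noise; the sketch only shows that an L1 distance on heavily pooled features reacts far less to a small shift than an L1 distance on pixels.

```python
import numpy as np

def block_mean_features(img, k=16):
    """Toy coarse feature: k x k block averages (a stand-in, not VGG-19)."""
    H, W = img.shape
    return img.reshape(H // k, k, W // k, k).mean(axis=(1, 3))

rng = np.random.default_rng(0)
T = rng.uniform(size=(64, 64))
T_shifted = np.roll(T, shift=3, axis=1)   # 3-pixel horizontal shift (wraps around)

# Mean absolute difference measured on raw pixels and on coarse features.
pixel_gap = np.abs(T - T_shifted).mean()
feature_gap = np.abs(block_mean_features(T) - block_mean_features(T_shifted)).mean()
```

A real deep feature also encodes semantics rather than plain averages, so this sketch should be read as motivation for the design choice and nothing more.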
4. Experiments
4.1. Implementation Details
Training data. We adopt a fusion of synthetic and real data
as our training dataset. The images from [5] are used as synthetic
data, i.e. 7,643 cropped images with size 224 × 224 from the
PASCAL VOC dataset [4]. 90 real-world training images
from [47] are adopted as real data. For image synthesis,
we use the same data generation model as [5] to create our
synthetic data. In the following, we always use the same
dataset for training, unless specifically stated.

Table 1: Comparison of different settings. Our full model (i.e.
ERRNet) leads to the best performance among all comparisons.

                  Synthetic         Real20
Model             PSNR    SSIM      PSNR    SSIM
CEILNet-F [5]     24.70   0.884     20.32   0.739
BaseNet only      25.71   0.926     21.51   0.780
BaseNet + CWC     27.64   0.940     22.61   0.796
BaseNet + MSC     26.03   0.928     21.75   0.783
ERRNet            27.88   0.941     22.89   0.803
Training details. Our implementation (see footnote 1) is based on PyTorch.
We train the model for 60 epochs using the Adam
optimizer [19]. The base learning rate is set to $10^{-4}$ and
halved at epoch 30, then reduced to $10^{-5}$ at epoch 50. The
weights are initialized as in [26].
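The learning-rate schedule can be written as a small piecewise function. This sketch assumes 1-indexed epochs and that each change takes effect at the start of the named epoch; the text does not specify either detail, and the function name is hypothetical.

```python
def learning_rate(epoch):
    """Learning rate for a 1-indexed epoch of a 60-epoch run (assumed indexing)."""
    if epoch < 30:
        return 1e-4      # base learning rate
    if epoch < 50:
        return 5e-5      # halved at epoch 30
    return 1e-5          # reduced at epoch 50

# Learning rate used in each of the 60 epochs.
schedule = [learning_rate(e) for e in range(1, 61)]
```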
4.2. Ablation Study
In this section, we conduct an ablation study for our
method on 100 synthetic testing images from [5] and 20
real testing images from [47] (denoted by ‘Real20’).
Component analysis. To verify the importance of our
network design, we compare four model architectures as
described in Section 3, including (1) Our basic image re-
construction network BaseNet; (2) BaseNet with channel-
wise context module (BaseNet + CWC); (3) BaseNet with
multi-scale spatial context module (BaseNet + MSC); and
(4) Our enhanced reflection removal network, denoted ERRNet,
i.e., BaseNet + CWC + MSC. The result from the
CEILNet [5] fine-tuned on our training data (denoted by
CEILNet-F) is also provided as an additional reference.
As shown in Table 1, our BaseNet has already achieved
a much better result than CEILNet-F. The performance of
our BaseNet is clearly boosted by using the channel-wise
context and multi-scale spatial context modules, especially
when they are used together, i.e. ERRNet. Figure 4 visually
shows the results from BaseNet and our ERRNet. It can
be observed that BaseNet struggles to discriminate the re-
flection region and yields some obvious residuals, while the
ERRNet removes the reflection and produces much cleaner
transmitted images. These results suggest the effectiveness
of our network design, especially the components tailored
to encode the contextual clues.
Efficacy of the training loss for unaligned data. In this
experiment, we first train our ERRNet with only ‘synthetic
data’, ‘synthetic + 50 aligned real data’, and ‘synthetic +
90 aligned real data’. The loss function in Eq. (2) is used
for aligned data. Table 2 shows that the testing results
improve as the amount of real data increases.

Footnote 1: Code is released at

Figure 4: Comparison of the results with (ERRNet) and without
(BaseNet) the context encoding modules. Panels: Input, BaseNet, ERRNet.

Table 2: Simulation experiment to verify the efficacy of our
alignment-invariant loss.

Training Scheme                                PSNR    SSIM
Synthetic only                                 19.79   0.741
+ 50 aligned                                   22.00   0.785
+ 90 aligned                                   22.89   0.803
+ 50 aligned, + 40 unaligned trained with:
    l_pixel                                    21.85   0.766
    l_inv                                      22.38   0.797
    l_cx                                       22.47   0.796
    l_inv + l_cx                               22.43   0.796
Then, we synthesize misalignment by performing
random translations within [-10, 10] pixels on real data (see footnote 2),
and train ERRNet with ‘synthetic + 50 aligned real data +
40 unaligned data’. The pixel-wise loss $l_{pixel}$ and the
alignment-invariant loss $l_{inv}$ are used for the 40 unaligned images.
Table 2 shows that employing the 40 unaligned images with the $l_{pixel}$ loss
degrades the performance, which becomes even worse than that obtained from 50
aligned images without additional unaligned data.
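One way to synthesize such misalignment is sketched below. The function name and the border handling (edge replication followed by cropping) are assumptions for illustration; the text only states that random translations within [-10, 10] pixels are applied.

```python
import numpy as np

def random_translate(img, max_shift=10, rng=None):
    """Shift img (H, W, C) by integer offsets drawn from [-max_shift, max_shift]."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    H, W = img.shape[:2]
    # Replicate border pixels so the shifted crop never leaves the image (assumed choice).
    padded = np.pad(img, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)), mode="edge")
    top, left = max_shift - dy, max_shift - dx
    return padded[top:top + H, left:left + W], (dy, dx)

rng = np.random.default_rng(0)
T = rng.uniform(size=(32, 32, 3))                     # hypothetical reference image
T_shifted, (dy, dx) = random_translate(T, max_shift=10, rng=rng)
```

Away from the border, `T_shifted[y, x]` equals `T[y - dy, x - dx]`, so the reference no longer lines up pixel-for-pixel with the input image.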
In addition, we also investigate the contextual loss $l_{cx}$
of [27]. Results from both the contextual loss $l_{cx}$ and our
alignment-invariant loss $l_{inv}$ (or their combination $l_{inv} + l_{cx}$)
surpass analogous results obtained with only aligned
images by appreciable margins, indicating that these losses
provide useful supervision to networks given unaligned
data. Note that although $l_{inv}$ and $l_{cx}$ perform equally well, our
$l_{inv}$ is much simpler and more computationally efficient than $l_{cx}$,
suggesting that $l_{inv}$ is a lightweight alternative to $l_{cx}$ for
our reflection removal task.

Footnote 2: Our alignment-invariant loss $l_{inv}$ can handle shifts of up to 20 pixels.
See suppl. material for more details.
4.3. Method Comparison on Benchmarks
In this section, we compare our ERRNet against state-of-
the-art methods including the optimization-based method of
[25] (LB14) and the learning-based approaches (CEILNet
[5], Zhang et al. [47], and BDN [44]). For fair comparison,
we finetune these models on our training dataset and report
results of both the original pretrained model and finetuned
version (denoted with a suffix ‘-F’).
The comparison is conducted on four real-world
datasets, i.e. 20 testing images in [47] and three sub-datasets
from SIR² [37]. These three sub-datasets are captured under
different conditions: (1) 20 controlled indoor scenes composed
of solid objects; (2) 20 different controlled scenes
on postcards; and (3) 55 wild scenes (see footnote 3) with ground truth
provided. In the following, we denote these datasets by
‘Real20’, ‘Objects’, ‘Postcard’, and ‘Wild’, respectively.
Table 3 summarizes the results of all competing meth-
ods on four real-world datasets. The quality metrics include
PSNR, SSIM [40], NCC [43, 37] and LMSE [8]. Larger
values of PSNR, SSIM, and NCC indicate better perfor-
mance, while a smaller value of LMSE implies a better re-
sult. Our ERRNet achieves the state-of-the-art performance
in ‘Real20’ and ‘Objects’ datasets. Meanwhile, our result
is comparable to the best-performing BDN-F on ‘Postcard’
data. The quantitative results on ‘Wild’ dataset reveal a
frustrating fact, namely, that no method could outperform
the naive baseline ‘Input’, suggesting that there is still large
room for improvement.
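Of the four metrics, PSNR is the simplest to state in code. The sketch below uses the standard definition and assumes 8-bit images with a peak value of 255; the evaluation scripts behind the reported numbers may differ in details such as color space or cropping.

```python
import numpy as np

def psnr(pred, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB; larger values indicate a closer match."""
    diff = pred.astype(np.float64) - ref.astype(np.float64)  # avoid uint8 wrap-around
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((8, 8), dtype=np.uint8)
pred = np.ones_like(ref)          # every pixel is off by one intensity level
score = psnr(pred, ref)           # 10 * log10(255**2), roughly 48.13 dB
```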
Figure 5 displays visual results on real-world images. It
can be seen that all compared methods fail to handle some
strong reflections, but our network more accurately removes
many undesirable artifacts, e.g. removal of tree branches reflected
on the building window in the fourth photo of Fig. 5.
4.4. Training with Unaligned Data
To test our alignment-invariant loss on real-world unaligned
data, we first collected a dataset of unaligned image
pairs with cameras and a portable glass, as shown in
Fig. 1. Both a DSLR camera and a smart phone are used to
capture the images. We collected 450 image pairs in total,
and some samples are shown in Fig. 6. These image pairs
are randomly split into a training set of 400 samples and a
testing set with 50 samples.
We conduct experiments on the BDN-F and ERRNet
models, each of which is first trained on the aligned dataset
(w/o unaligned) as in Section 4.3, and then finetuned with
our alignment-invariant loss and unaligned training data.
The resulting pairs before and after finetuning are assem-
bled for human assessment, as no existing numerical metric
is available for evaluating unaligned data.
We asked 30 human observers to provide a preference
3Images indexed by 1, 2, 74 are removed due to misalignment.
Input LB14 [25] CEILNet-F [5] Zhang et al. [47] BDN-F [44] ERRNet Reference
Figure 5: Visual comparison on real-world images. The images are obtained from ‘Real20’ (Rows 1-3) and our collected unaligned dataset
(Rows 4- 5). More results can be found in the suppl. material.
Table 3: Quantitative results of different methods on four real-world benchmark datasets. The best results are indicated by red color and
the second best results are denoted by blue color. The results of ‘Average’ are obtained by averaging the metric scores of all images from
these four real-world datasets.
Dataset Index
[25] [5] F et al. [47] [44] F
PSNR 19.05 18.29 18.45 20.32 21.89 18.41 20.06 22.89
SSIM 0.733 0.683 0.690 0.739 0.787 0.726 0.738 0.803
NCC 0.812 0.789 0.813 0.834 0.903 0.792 0.825 0.877
LMSE 0.027 0.033 0.031 0.028 0.022 0.032 0.027 0.022
PSNR 23.74 19.39 23.62 23.36 22.72 22.73 24.00 24.87
SSIM 0.878 0.786 0.867 0.873 0.879 0.856 0.893 0.896
NCC 0.981 0.971 0.972 0.974 0.964 0.978 0.978 0.982
LMSE 0.004 0.007 0.005 0.005 0.005 0.005 0.004 0.003
PSNR 21.30 14.88 21.24 19.17 16.85 20.71 22.19 22.04
SSIM 0.878 0.795 0.834 0.793 0.799 0.859 0.881 0.876
NCC 0.947 0.929 0.945 0.926 0.886 0.943 0.941 0.946
LMSE 0.005 0.008 0.008 0.013 0.007 0.005 0.004 0.004
PSNR 26.24 19.05 22.36 22.05 21.56 22.36 22.74 24.25
SSIM 0.897 0.755 0.821 0.844 0.836 0.830 0.872 0.853
NCC 0.941 0.894 0.918 0.924 0.919 0.932 0.922 0.917
LMSE 0.005 0.027 0.013 0.009 0.010 0.009 0.008 0.011
PSNR 22.85 17.51 22.30 21.41 20.22 21.70 22.96 23.59
SSIM 0.874 0.781 0.841 0.832 0.838 0.848 0.879 0.879
NCC 0.955 0.937 0.948 0.943 0.925 0.951 0.950 0.956
LMSE 0.006 0.011 0.009 0.010 0.007 0.007 0.006 0.005
score among {-2, -1, 0, 1, 2}, with 2 indicating that the finetuned
result is significantly better and -2 the opposite. To avoid
bias, we randomly switch the image positions of each pair.
In total, 3000 human judgments are collected (2 methods,
30 users, 50 image pairs). More details regarding this
evaluation process can be found in the suppl. material.
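The position randomization and score convention described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation tool: the function names, the file names, and the convention that a positive raw score favors the right-hand image are all hypothetical.

```python
import random

def present_pair(raw_img, finetuned_img, rng):
    """Randomly place the finetuned result on the left or on the right."""
    finetuned_on_right = rng.random() < 0.5
    if finetuned_on_right:
        pair = (raw_img, finetuned_img)
    else:
        pair = (finetuned_img, raw_img)
    return pair, finetuned_on_right

def normalize_score(raw_score, finetuned_on_right):
    """Flip the sign when needed so that positive always favors the finetuned result."""
    if raw_score not in (-2, -1, 0, 1, 2):
        raise ValueError("score must be one of -2, -1, 0, 1, 2")
    return raw_score if finetuned_on_right else -raw_score

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
pair, on_right = present_pair("raw.png", "finetuned.png", rng)

# Assumed convention: an observer who strongly prefers the right-hand image gives +2.
score = normalize_score(2, on_right)

# Judgment count reported in the text: 2 methods x 30 users x 50 image pairs.
total_judgments = 2 * 30 * 50
```

Recording the placement flag alongside each judgment lets the scores be mapped back to a single "finetuned versus raw" axis after the study, regardless of which side each result was shown on.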
Figure 6: Image samples in our unaligned image dataset. Our dataset covers a large variety of indoor and outdoor environments, including
dynamic scenes with vehicles, humans, etc.
Score Range      Ratio (BDN-F)   Ratio (ERRNet)
(0.25, 2]        78%             54%
[-0.25, 0.25]    18%             36%
[-2, -0.25)      4%              10%
Average Score    0.62            0.51
Table 4: Human preference scores of self-comparison experiments. Left: results of BDN-F; Right: results of ERRNet. X axis of each
sub-figure represents the image # of testing images (50 in total).
Figure 7: Results of training with and without unaligned data. Panels, left to right: input, reference, then results without (w/o) and with (w.) unaligned data for each of the two methods. See suppl. material for more examples. (Best viewed on screen with zoom.)
Table 4 shows the average of human preference scores
for the resulting pairs of each method. As can be seen, hu-
man observers clearly tend to prefer the results produced
by the finetuned models over the raw ones, which demon-
strates the benefit of leveraging unaligned data for training
independent of the network architecture. Figure 7 shows
some typical results of the two methods; the results are sig-
nificantly improved by training on unaligned data.
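The summary statistics in Table 4 can be reproduced from raw judgments with a short Python sketch. It assumes that the ratios are computed over per-image mean scores (the reported percentages are all multiples of 1/50, which is consistent with 50 test images); the function name, the dictionary keys, and the toy scores are hypothetical and are not the study's data.

```python
def summarize_preferences(scores_per_image):
    """Summarize observer scores, given one list of scores in {-2,...,2} per image."""
    image_means = [sum(scores) / len(scores) for scores in scores_per_image]
    n = len(image_means)
    return {
        # Mean score in (0.25, 2]: the finetuned result is preferred.
        "finetuned_preferred": sum(m > 0.25 for m in image_means) / n,
        # Mean score in [-0.25, 0.25]: no clear preference.
        "no_clear_preference": sum(-0.25 <= m <= 0.25 for m in image_means) / n,
        # Mean score in [-2, -0.25): the raw result is preferred.
        "raw_preferred": sum(m < -0.25 for m in image_means) / n,
        "average_score": sum(image_means) / n,
    }

# Toy data: 4 images, 3 observers each (hypothetical values).
toy_scores = [[2, 1, 1], [0, 0, 0], [-1, -2, 0], [1, 1, 2]]
summary = summarize_preferences(toy_scores)
```

When every image is rated by the same number of observers, the average of the per-image means equals the average over all individual judgments, so either definition of "Average Score" gives the same number.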
5. Conclusion
We have proposed an enhanced reflection removal network
together with an alignment-invariant loss function to
help resolve the difficulty of single image reflection removal.
We investigated the possibility of directly utilizing
misaligned training data, which can significantly alleviate
the burden of capturing real-world training data. To efficiently
extract the underlying knowledge from real training
data, we introduced context encoding modules, which
can be seamlessly embedded into our network to help discriminate
and suppress the reflection component. Extensive
experiments demonstrate that our approach sets a new state of
the art on real-world benchmarks of single image reflection
removal, both quantitatively and visually.
Acknowledgments
We thank Yunhao Zou for great help in collecting the
reflection image dataset. This work was supported by the
National Natural Science Foundation of China under Grants
No. 61425013 and No. 61672096.
References
[1] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Remov-
ing photography artifacts using gradient projection and flash-
exposure sampling. ACM Transactions on Graphics (TOG),
24(3):828–835, 2005.
[2] N. Arvanitopoulos, R. Achanta, and S. Süsstrunk. Single image
reflection suppression. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[3] Z. Chi, X. Wu, X. Shu, and J. Gu. Single image reflection
removal using deep encoder-decoder network. arXiv preprint
arXiv:1802.00094, 2018.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman. The pascal visual object classes (voc)
challenge. International Journal of Computer Vision (IJCV),
88(2):303–338, 2010.
[5] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic
deep architecture for single image reflection removal and im-
age smoothing. In The IEEE International Conference on
Computer Vision (ICCV), Oct 2017.
[6] H. Farid and E. H. Adelson. Separating reflections and light-
ing using independent components analysis. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
July 1999.
[7] K. Gai, Z. Shi, and C. Zhang. Blind separation of superim-
posed moving images using image statistics. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
34(1):19–32, 2012.
[8] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Free-
man. Ground truth dataset and baseline evaluations for in-
trinsic image algorithms. In IEEE International Conference
on Computer Vision (ICCV). IEEE, Oct 2009.
[9] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection
from multiple images. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2014.
[10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hy-
percolumns for object segmentation and fine-grained local-
ization. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling
in deep convolutional networks for visual recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 37(9):1904–1916, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation net-
works. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image
translation with conditional adversarial networks. In The
IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), July 2017.
[16] M. Jin, S. Süsstrunk, and P. Favaro. Learning to see through
reflections. In IEEE International Conference on Computa-
tional Photography (ICCP), May 2018.
[17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for
real-time style transfer and super-resolution. European Con-
ference on Computer Vision (ECCV), pages 694–711, 2016.
[18] A. Jolicoeur-Martineau. The relativistic discriminator: a key
element missing from standard GAN. In International Con-
ference on Learning Representations (ICLR), 2019.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] N. Kong, Y.-W. Tai, and J. S. Shin. A physically-based
approach to reflection separation: from physical modeling
to constrained optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 36(2):209–221,
[21] D. Lee, M.-H. Yang, and S. Oh. Generative single image re-
flection separation. arXiv preprint arXiv:1801.04102, 2018.
[22] A. Levin and Y. Weiss. User assisted separation of reflections
from a single image using a sparsity prior. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
29(9):1647–1654, 2007.
[23] A. Levin, A. Zomet, and Y. Weiss. Learning to perceive
transparency from the statistics of natural scenes. In Advances
in Neural Information Processing Systems (NIPS),
December 2002.
[24] Y. Li and M. S. Brown. Exploiting reflection change for au-
tomatic reflection removal. In The IEEE International Con-
ference on Computer Vision (ICCV), December 2013.
[25] Y. Li and M. S. Brown. Single image layer separation us-
ing relative smoothness. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 2752–2759,
[26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced
deep residual networks for single image super-resolution.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, July 2017.
[27] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual
loss for image transformation with non-aligned data. In The
European Conference on Computer Vision (ECCV), Septem-
ber 2018.
[28] S. Nah, T. Hyun Kim, and K. Mu Lee. Deep multi-scale
convolutional neural network for dynamic scene deblurring.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge. In-
ternational Journal of Computer Vision (IJCV), 115(3):211–
252, 2015.
[30] B. Sarel and M. Irani. Separating transparent layers through
layer information exchange. In European Conference on
Computer Vision (ECCV), September 2004.
[31] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflec-
tion removal using ghosting cues. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
[32] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. International
Conference on Learning Representations (ICLR), 2015.
[33] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. In International
Conference on Learning Representations (ICLR), 2015.
[34] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and
R. Szeliski. Image-based rendering for scenes with reflec-
tions. ACM Transactions on Graphics (TOG), 31(4):100–1,
[35] R. Szeliski, S. Avidan, and P. Anandan. Layer extrac-
tion from multiple images containing reflections and trans-
parency. In IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), July 2000.
[36] R. Wan, B. Shi, L. Duan, A. Tan, W. Gao, and A. C. Kot.
Region-aware reflection removal with unified content and
gradient priors. IEEE Transactions on Image Processing,
27(6):2927–2941, 2018.
[37] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot.
Benchmarking single-image reflection removal algorithms.
In The IEEE International Conference on Computer Vision
(ICCV), Oct 2017.
[38] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. Crrn:
Multi-scale guided concurrent reflection removal network.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[39] R. Wan, B. Shi, T. A. Hwee, and A. C. Kot. Depth of field
guided reflection removal. In IEEE International Conference
on Image Processing, September 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004.
[41] Y. Wu and K. He. Group normalization. In European Con-
ference on Computer Vision (ECCV), September 2018.
[42] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0
gradient minimization. In ACM Transactions on Graphics
(TOG), volume 30, page 174, 2011.
[43] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A com-
putational approach for obstruction-free photography. ACM
Transactions on Graphics (TOG), 34(4):79, 2015.
[44] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and
bidirectionally: A deep learning approach for single image
reflection removal. In The European Conference on Com-
puter Vision (ECCV), September 2018.
[45] J. Yang, H. Li, Y. Dai, and R. T. Tan. Robust optical flow
estimation of double-layer images under transparency or re-
flection. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[46] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and
A. Agrawal. Context encoding for semantic segmentation.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[47] X. Zhang, R. Ng, and Q. Chen. Single image reflection
separation with perceptual losses. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), July 2017.
[49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-
to-image translation using cycle-consistent adversarial net-
works. In The IEEE International Conference on Computer
Vision (ICCV), Oct 2017.