Single Image Reflection Removal Exploiting Misaligned Training Data and
Network Enhancements
Kaixuan Wei1   Jiaolong Yang2   Ying Fu1∗   David Wipf2   Hua Huang1
1Beijing Institute of Technology    2Microsoft Research
Abstract
Removing undesirable reflections from a single image
captured through a glass window is of practical impor-
tance to visual computing systems. Although state-of-the-
art methods can obtain decent results in certain situations,
performance declines significantly when tackling more gen-
eral real-world cases. These failures stem from the intrin-
sic difficulty of single image reflection removal – the funda-
mental ill-posedness of the problem, and the insufficiency of
densely-labeled training data needed for resolving this am-
biguity within learning-based neural network pipelines. In
this paper, we address these issues by exploiting targeted
network enhancements and the novel use of misaligned
data. For the former, we augment a baseline network archi-
tecture by embedding context encoding modules that are ca-
pable of leveraging high-level contextual clues to reduce in-
determinacy within areas containing strong reflections. For
the latter, we introduce an alignment-invariant loss func-
tion that facilitates exploiting misaligned real-world train-
ing data that is much easier to collect. Experimental results
collectively show that our method outperforms the state-of-
the-art with aligned data, and that significant improvements
are possible when using additional misaligned data.
1. Introduction
Reflection is a frequently-encountered source of image
corruption that can arise when shooting through a glass sur-
face. Such corruptions can be addressed via the process of
single image reflection removal (SIRR), a challenging prob-
lem that has attracted considerable attention from the com-
puter vision community [22, 25, 39, 2, 5, 47, 44, 38]. Tra-
ditional optimization-based methods often leverage manual
intervention or strong prior assumptions to render the prob-
lem more tractable [22, 25]. Recently, alternative learning-
based approaches have relied on deep Convolutional Neural Networks (CNNs) in lieu of costly optimization and hand-crafted priors [5, 47, 44, 38]. But promising results notwithstanding, SIRR remains a largely unsolved problem across disparate imaging conditions and varying scene content.
∗Corresponding author: fuying@bit.edu.cn
For CNN-based reflection removal, our focus herein, the
challenge originates from at least two sources: (i) The ex-
traction of a background image layer devoid of reflection
artifacts is fundamentally ill-posed, and (ii) Training data
from real-world scenes are exceedingly scarce because of
the difficulty in obtaining ground-truth labels.
Mathematically speaking, it is typically assumed that a captured image I is formed as a linear combination of a background or transmitted layer T and a reflection layer R, i.e., I = T + R. Obviously, when given access only to I, there exists an infinite number of feasible decompositions. Further compounding the problem is the fact that both T and R involve content from real scenes that may have overlapping appearance distributions. This can make them difficult to distinguish even for human observers in some cases, and simple priors that might mitigate this ambiguity are not available except under specialized conditions.
On the other hand, although CNNs can perform a wide
variety of visual tasks, at times exceeding human capabilities,
they generally require a large volume of labeled training
data. Unfortunately, real reflection images accompanied
with densely-labeled, ground-truth transmitted layer inten-
sities are scarce. Consequently, previous learning-based ap-
proaches have resorted to training with synthesized images
[5, 38, 47] and/or small real-world data captured from spe-
cialized devices [47]. However, existing image synthesis
procedures are heuristic and the domain gap may jeopardize
accuracy on real images. On the other hand, collecting suf-
ficient additional real data with precise ground-truth labels
is tremendously labor-intensive.
This paper is devoted to addressing both of the afore-
mentioned challenges. First, to better tackle the intrinsic
ill-posedness and diminish ambiguity, we propose to lever-
age a network architecture that is sensitive to contextual in-
formation, which has proven useful for other vision tasks
such as semantic segmentation [11, 48, 46, 13]. Note that
at a high level, our objective is to efficiently convert prior
information mined from labeled training data into network
structures capable of resolving this ambiguity. Within a tra-
ditional CNN model, especially in the early layers where
the effective receptive field is small, the extracted features
across all channels are inherently local. However, broader
non-local context is necessary to differentiate those features
that are descriptive of the desired transmitted image, and
those that can be discarded as reflection-based. For ex-
ample, in image neighborhoods containing a particularly
strong reflection component, accurate separation by any
possible method (even one trained with arbitrarily rich train-
ing data) will likely require contextual information from re-
gions without reflection. To address this issue, we utilize
two complementary forms of context, namely, channel-wise
context and multi-scale spatial context. Regarding the for-
mer, we apply a channel attention mechanism to the fea-
ture maps from convolutional layers such that different fea-
tures are weighed differently according to global statistics
of the activations. For the latter, we aggregate information
across a pyramid of feature map scales within each chan-
nel to reach a global contextual consistency in the spatial
domain. Our experiments demonstrate that significant im-
provement can be obtained by these enhancements, leading
to state-of-the-art performance on two real-image datasets.
Secondly, orthogonal to architectural considerations, we
seek to expand the sources of viable training data by facil-
itating the use of misaligned training pairs, which are con-
siderably easier to collect. Misalignment between an input
image I and a ground-truth reflection-free version T can be
caused by camera and/or object movements during the ac-
quisition process. In previous works [37, 47], data pairs (I, T) were obtained by taking an initial photo through a
glass plane, followed by capturing a second one after the
glass has been removed. This process requires that the
camera, scene, and even lighting conditions remain static.
Adhering to these requirements across a broad acquisition
campaign can significantly reduce both the quantity and di-
versity of the collected data. Additionally, post-processing
may also be necessary to accurately align Iand Tto com-
pensate for spatial shifts caused by the refractive effect [37].
In contrast, capturing unaligned data is considerably less
burdensome, as shown in Fig. 1. For example, there is no
need for a tripod, table, or other special hardware; the cam-
era can be hand-held and the pose can be freely adjusted;
dynamic scenes in the presence of vehicles, humans, etc.
can be incorporated; and finally no post-processing of any
type is needed.
To handle such misaligned training data, we require a
loss function that is, to the extent possible, invariant to the
alignment, i.e., the measured image content discrepancy be-
tween the network prediction and its unaligned reference
should be similar to what would have been observed if
the reference was actually aligned. In the context of im-
age style transfer [17] and others, certain perceptual loss
functions have been shown to be relatively invariant to var-
Figure 1: Comparison of the reflection image data collection methods in [47] (left) and this paper (right).
ious transformations. Our study shows that using only
the highest-level feature from a deep network (VGG-19 in
our case) leads to satisfactory results for our reflection re-
moval task. In both simulation tests and experiments us-
ing a newly collected dataset, we demonstrate for the first
time that training/fine-tuning a CNN with unaligned data
improves the reflection removal results by a large margin.
2. Related Work
This paper is concerned with reflection removal from
a single image. Previous methods utilizing multiple input
images of, e.g., flash/non-flash pairs [1], different polariza-
tion [20], multi-view or video sequences [6, 35, 30, 7, 24,
34, 9, 43, 45] will not be considered here.
Traditional methods. Reflection removal from a single im-
age is a massively ill-posed problem. Additional priors are
needed to solve the otherwise prohibitively-difficult prob-
lem in traditional optimization-based methods [22, 25, 39,
2, 36]. In [22], user annotations are used to guide layer
separation jointly with a gradient sparsity prior [23]. [25]
introduces a relative smoothness prior where the reflections
are assumed to be blurry and thus their large gradients are penal-
ized. [39] explores a variant of the smoothness prior where
a multi-scale Depth-of-Field (DoF) confidence map is uti-
lized to perform edge classification. [31] exploits the ghost
cues for layer separation. [2] proposes a simple optimiza-
tion formulation with an l0gradient penalty on the transmit-
ted layer inspired by image smoothing algorithms [42]. Although decent results can be obtained by these methods where
their assumptions hold, the vastly-different imaging condi-
tions and complex scene content in the real world render
their generalization problematic.
Deep learning based methods. Recently, there has been an
emerging interest in applying deep convolutional neural net-
works for single image reflection removal such that the
handcrafted priors can be replaced by data-driven learn-
ing [5, 38, 47, 44]. The first CNN-based method is due
to [5], where a network structure is proposed to first pre-
dict the background layer in the edge domain followed by
reconstructing it in the color domain. Later, [38] proposes to
predict the edge and image intensity concurrently by two
cooperative sub-networks. The recent work of [44] presents
a cascade network structure which predicts the background
layer and reflection layer in an interleaved fashion. The ear-
lier CNN-based methods typically use a raw image intensity discrepancy such as mean squared error (MSE) to train the networks. Several recent works [47, 16, 3] adopt the perceptual loss [17], which uses the multi-stage features of a deep network pre-trained on ImageNet [29]. Adver-
sarial loss is investigated in [47, 21] to improve the realism
of the predicted background layers.
3. Approach
Given an input image I contaminated with reflections, our goal is to estimate a reflection-free transmitted image T̂. To achieve this, we train a feed-forward CNN $G_{\theta_G}$ parameterized by $\theta_G$ to minimize a reflection removal loss function l. Given training image pairs {(I_n, T_n)}, n = 1, ..., N, this involves solving:

$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l(G_{\theta_G}(I_n), T_n).$   (1)
We will first introduce the details of the network architecture $G_{\theta_G}$, followed by the loss function l applied to both aligned data (the common case) and the newly proposed unaligned-data extensions. The overall system is illustrated in Fig. 2.
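In implementation terms, Eq. (1) is a standard supervised objective; the following PyTorch sketch shows one training pass, with model, loss_fn, and loader acting only as placeholders for $G_{\theta_G}$, the loss l, and a paired data loader.

```python
import torch

def train_one_epoch(model, loss_fn, loader, optimizer, device="cuda"):
    """One pass over the training pairs (I_n, T_n), minimizing l(G(I_n), T_n)."""
    model.train()
    for I, T in loader:                  # input image and reference transmission layer
        I, T = I.to(device), T.to(device)
        T_hat = model(I)                 # predicted transmission layer
        loss = loss_fn(T_hat, T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```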
3.1. Basic Image Reconstruction Network
Our starting point can be viewed as the basic image re-
construction neural network component from [5] but mod-
ified in three aspects: (1) We simplify the basic residual
block [12] by removing the batch normalization (BN) layer
[14]; (2) we increase the capacity by widening the network
from 64 to 256 feature maps; and (3) for each input image
I, we extract hypercolumn features [10] from a pretrained
VGG-19 network [32], and concatenate these features with
I as an augmented network input. As explained in [47],
such an augmentation strategy can help enable the network
to learn semantic clues from the input image.
Note that removing the BN layer from our network turns
out to be critical for optimizing performance in the present
context. As shown in [41], if batch sizes become too small,
prediction errors can increase precipitously and stability is-
sues can arise. Moreover, for a dense prediction task such as
SIRR, large batch sizes can become prohibitively expensive
in terms of memory requirements. In our case, we found
that within the tenable batch sizes available for reflection re-
moval, BN led to considerably worse performance, includ-
ing color attenuation/shifting issues as sometimes observed
in image-to-image translation tasks [5, 15, 49]. BN layers
have similarly been removed from other dense prediction
tasks such as image super-resolution [26] or deblurring [28].
At this point, we have constructed a useful base archi-
tecture upon which other more targeted alterations will be
applied shortly. This baseline, which we will henceforth
refer to as BaseNet, performs quite well when trained and
tested on synthetic data. However, when deployed on real-
world reflection images we found that its performance de-
graded by an appreciable amount, especially on the 20 real
images from [47]. Therefore, to better mitigate the tran-
sition from the make-believe world of synthetic images to
real-life photographs, we describe two modifications for in-
troducing broader contextual information into otherwise lo-
cal convolutional filters.
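To make the modifications above concrete, here is a minimal PyTorch sketch of a BN-free residual block at 256 feature maps and of the hypercolumn input augmentation. The specific VGG-19 layer indices, kernel sizes, and the absence of a final activation are assumptions of this sketch rather than the paper's exact configuration, and the input image is assumed to be VGG-normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class ResBlockNoBN(nn.Module):
    """Simplified residual block: two 3x3 convolutions, no batch normalization."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class HypercolumnInput(nn.Module):
    """Concatenate the input image with upsampled features from a frozen,
    pretrained VGG-19 (hypercolumn features). The layer indices below are an
    illustrative mapping of relu1_2 .. relu5_2 in torchvision's vgg19.features."""
    def __init__(self, layer_ids=(3, 8, 13, 22, 31)):
        super().__init__()
        self.vgg = vgg19(pretrained=True).features[:max(layer_ids) + 1].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.layer_ids = set(layer_ids)

    def forward(self, img):
        feats, x = [img], img
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(F.interpolate(x, size=img.shape[-2:],
                                           mode='bilinear', align_corners=False))
        return torch.cat(feats, dim=1)   # augmented network input
```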
3.2. Context Encoding Modules
As mentioned previously, we consider both context be-
tween channels and multi-scale context within channels.
Channel-wise context. The underlying design princi-
ple here is to introduce global contextual information
across channels, and a richer overall structure within resid-
ual blocks, without dramatically increasing the parameter
count. One way to accomplish this is by incorporating a
channel attention module originally developed in [13] to re-
calibrate feature maps using global summary statistics.
Let $U = [u_1, \ldots, u_c, \ldots, u_C]$ denote the original, uncalibrated activations produced by a network block, with C feature maps of size H×W. These activations generally only reflect local information residing within the corresponding receptive fields of each filter. We then form scalar, channel-specific descriptors $z_c = f_{gp}(u_c)$ by applying a global average pooling operator $f_{gp}$ to each feature map $u_c \in \mathbb{R}^{H \times W}$. The vector $z = [z_1, \ldots, z_C] \in \mathbb{R}^C$ represents a simple statistical summary of the global, per-channel activations and, when passed through a small network structure, can be used to adaptively predict the relative importance of each channel [13].
More specifically, the channel attention module first computes $s = \sigma(W_U \delta(W_D z))$, where $W_D$ is a trainable weight matrix that downsamples z to dimension R < C, $\delta$ is a ReLU non-linearity, $W_U$ represents a trainable upsampling weight matrix, and $\sigma$ is a sigmoidal activation. Elements of the resulting output vector $s \in \mathbb{R}^C$ serve as channel-specific gates for calibrating the feature maps via $\hat{u}_c = s_c \cdot u_c$.
Consequently, although each individual convolutional
filter has a local receptive field, the determination of which
channels are actually important in predicting the transmis-
sion layer and suppressing reflections is based on the pro-
cessing of a global statistic (meaning the channel descrip-
tors computed as activations pass through the network dur-
ing inference). Additionally, the parameter overhead intro-
duced by this process is exceedingly modest given that $W_D$ and $W_U$ are just small additional weight matrices associated with each block.
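For concreteness, a small PyTorch sketch of this channel attention mechanism follows. Implementing $W_D$ and $W_U$ as 1×1 convolutions and choosing a reduction ratio of 16 are choices of this sketch, not values given above.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Recalibrate feature maps with gates computed from global statistics:
    z = global average pool, s = sigmoid(W_U * relu(W_D * z)), u_hat_c = s_c * u_c."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # f_gp: H x W -> 1 x 1
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # W_D (downsample to R < C)
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, channels, 1),  # W_U (upsample back to C)
            nn.Sigmoid(),                                   # sigma
        )

    def forward(self, u):
        s = self.gate(self.pool(u))   # channel-specific gates s
        return u * s                  # calibrated feature maps
```

Placed inside each residual block, the only extra parameters are the two small matrices per block, matching the modest overhead noted above.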
Figure 2: Overview of our approach for single image reflection removal. The generator G stacks 13 residual blocks equipped with channel attention and a pyramid pooling module operating over 1/4, 1/8, 1/16, and 1/32 scales, and takes the input image concatenated with VGG-19 features; aligned data are supervised with the pixel, feature, and adversarial losses, while unaligned data use the alignment-invariant and adversarial losses, together with a discriminator D.
Multi-scale spatial context. Although we have found that
encoding the contextual information across channels al-
ready leads to significant empirical gains on real-world im-
ages, utilizing complementary multi-scale spatial informa-
tion within each channel provides further benefit. To ac-
complish this, we apply a pyramid pooling module [11],
which has proven to be an effective global-scene-level rep-
resentation in semantic segmentation [48]. As shown in
Fig. 2, we construct such a module using pooling opera-
tions at sizes 4, 8, 16, and 32 situated in the tail of our net-
work before the final construction of T̂. Pooling in this way
fuses features under four different pyramid scales. After
harvesting the resulting sub-region representations, we per-
form a non-linear transformation (i.e. a Conv-ReLU pair) to
reduce the channel dimension. The refined features are then
upsampled via bilinear interpolation. Finally, the different
levels of features are concatenated together as a final repre-
sentation reflecting multi-scale spatial context within each
channel; the increased parameter overhead is negligible.
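A sketch of such a pyramid pooling module in PyTorch is given below. Treating the sizes 4, 8, 16, and 32 as downsampling factors and splitting the reduced channel budget evenly across scales are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the feature map at several scales, reduce channels with a Conv-ReLU
    pair, upsample back with bilinear interpolation, and concatenate with the
    original features as the final multi-scale representation."""
    def __init__(self, channels=256, scales=(4, 8, 16, 32)):
        super().__init__()
        self.scales = scales
        out_c = channels // len(scales)
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, out_c, 1), nn.ReLU(inplace=True))
            for _ in scales])

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x]
        for s, conv in zip(self.scales, self.reduce):
            y = F.adaptive_avg_pool2d(x, (max(1, h // s), max(1, w // s)))
            y = conv(y)
            y = F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
            feats.append(y)
        return torch.cat(feats, dim=1)
```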
3.3. Training Loss for Aligned Data
In this section, we present our loss function for aligned
training pairs (I, T), which consists of three terms similar
to previous methods [47, 44].
Pixel loss. Following [5], we penalize the pixel-wise intensity difference of T and T̂ via
$l_{pixel} = \alpha \|\hat{T} - T\|_2^2 + \beta (\|\nabla_x \hat{T} - \nabla_x T\|_1 + \|\nabla_y \hat{T} - \nabla_y T\|_1)$,
where $\nabla_x$ and $\nabla_y$ are the gradient operators along the x- and y-directions, respectively. We set α = 0.2 and β = 0.4 in all our experiments.
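As a sketch, l_pixel can be written directly with finite differences; averaging (rather than summing) over pixels is an assumption of this sketch.

```python
import torch

def pixel_loss(T_hat, T, alpha=0.2, beta=0.4):
    """l_pixel = alpha * ||T_hat - T||_2^2 + beta * (L1 of x- and y-gradient differences)."""
    def grad_x(t):
        return t[..., :, 1:] - t[..., :, :-1]   # horizontal finite differences
    def grad_y(t):
        return t[..., 1:, :] - t[..., :-1, :]   # vertical finite differences
    l2 = torch.mean((T_hat - T) ** 2)
    l1_grad = (torch.mean(torch.abs(grad_x(T_hat) - grad_x(T))) +
               torch.mean(torch.abs(grad_y(T_hat) - grad_y(T))))
    return alpha * l2 + beta * l1_grad
```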
Feature loss. We define the feature loss based on the activations of the 19-layer VGG network [33] pretrained on ImageNet [29]. Let $\phi_l$ be the feature from the l-th layer of VGG-19; we define the feature loss as $l_{feat} = \sum_l \lambda_l \|\phi_l(T) - \phi_l(\hat{T})\|_1$, where $\{\lambda_l\}$ are the balancing weights. Similar to [47], we use the layers 'conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2' of the VGG-19 net.
Adversarial loss. We further add an adversarial loss to
improve the realism of the produced background images.
We define an opponent discriminator network $D_{\theta_D}$ and minimize the relativistic adversarial loss [18], defined as $l_{adv} = l^G_{adv} = -\log(D_{\theta_D}(T, \hat{T})) - \log(1 - D_{\theta_D}(\hat{T}, T))$ for $G_{\theta_G}$, and $l^D_{adv} = -\log(1 - D_{\theta_D}(T, \hat{T})) - \log(D_{\theta_D}(\hat{T}, T))$ for $D_{\theta_D}$, where $D_{\theta_D}(T, \hat{T}) = \sigma(C(T) - C(\hat{T}))$ with $\sigma(\cdot)$ being the sigmoid function and $C(\cdot)$ the non-transformed discriminator function (refer to [18] for details).
To summarize, our loss for aligned data is defined as:

$l_{aligned} = \omega_1 l_{pixel} + \omega_2 l_{feat} + \omega_3 l_{adv}$   (2)

where we empirically set the weights as ω1 = 1, ω2 = 0.1, and ω3 = 0.01 throughout our experiments.
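A sketch of how these terms come together follows. The VGG layer indices are our mapping of 'conv2_2' through 'conv5_2' (taken after the ReLU), the balancing weights λ_l are placeholders, inputs are assumed VGG-normalized, and the relativistic pairing D(a, b) = σ(C(a) − C(b)) is shown only as a helper with critic standing in for C.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatureLoss(nn.Module):
    """l_feat = sum_l lambda_l * || phi_l(T) - phi_l(T_hat) ||_1 over selected layers."""
    def __init__(self, layer_ids=(8, 13, 22, 31), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = vgg19(pretrained=True).features[:max(layer_ids) + 1].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.weights = dict(zip(layer_ids, weights))

    def forward(self, T_hat, T):
        loss, x, y = 0.0, T_hat, T
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.weights:
                loss = loss + self.weights[i] * torch.mean(torch.abs(x - y))
        return loss

def relativistic_d(critic, a, b):
    """D(a, b) = sigmoid(C(a) - C(b)); `critic` plays the role of C."""
    return torch.sigmoid(critic(a) - critic(b))

def aligned_loss(l_pixel_val, l_feat_val, l_adv_val, w1=1.0, w2=0.1, w3=0.01):
    """Eq. (2): l_aligned = w1 * l_pixel + w2 * l_feat + w3 * l_adv."""
    return w1 * l_pixel_val + w2 * l_feat_val + w3 * l_adv_val
```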
3.4. Training Loss for Unaligned Data
To use misaligned data pairs (I, T) for training, we need a loss function that is invariant to the alignment, such that the true similarity between T and the prediction T̂ can be
reasonably measured. In this regard, we note that human
observers can easily assess the similarity of two images
even if they are not aligned. Consequently, designing a
loss measuring image similarity on the perceptual-level may
serve our goal. This motivates us to directly use a deep fea-
ture loss for unaligned data.
Intuitively, the deeper the feature, the more likely it is
to be insensitive to misalignment. To experimentally ver-
ify this and find a suitable feature layer for our purposes,
we conducted tests using a pre-trained VGG-19 network as
follows. Given an unaligned image pair (I, T), we use gradient descent to finetune the weights of our network $G_{\theta_G}$ to minimize the feature difference between T and T̂, with features
extracted at different layers of VGG-19. Figure 3 shows that
using low-level or middle-level features from 'conv2_2' to 'conv4_2' leads to blurry results (similar to directly using a
pixel-wise loss), although the reflection is more thoroughly
removed. In contrast, using the highest-level feature from 'conv5_2' gives rise to a striking result: the predicted back-
ground image is sharp and almost reflection-free.
Figure 3: The effect of using different losses to handle misaligned real data. (a) and (b) are the unaligned image pair (I, T). (c) shows the reflection removal result of our network trained on synthetic data and a small number of aligned real data (see Section 4 for details); reflection can still be observed in the predicted background image. (d) is the result finetuned on (I, T) with the pixel-wise intensity loss. (e)-(h) are the results finetuned with features at different layers of VGG-19 ('conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2', respectively); only the highest-level feature from 'conv5_2' yields a satisfactory result. (i) shows the result finetuned with the loss of [27]. (Best viewed on screen with zoom.)
Recently, [27] introduced a “contextual loss” which is
also designed for training deep networks with unaligned
data for image-to-image translation tasks like image style
transfer. In Fig. 3, we also present the finetuned result us-
ing this loss for our reflection removal task. Upon visual
inspection, the results are similar to our highest-level VGG
feature loss (quantitative comparison can be found in the
experiment section). However, our adopted loss (formally
defined below) is much simpler and more computationally
efficient than the loss from [27].
Alignment-invariant loss. Based on the above study, we now formally define our invariant loss component designed for unaligned data as $l_{inv} = \|\phi_h(T) - \phi_h(\hat{T})\|_1$, where $\phi_h$ denotes the 'conv5_2' feature of the pretrained VGG-19 network. For unaligned data, we also apply an adversarial loss, which is not affected by misalignment. Therefore, our overall loss for unaligned data can be written as

$l_{unaligned} = \omega_4 l_{inv} + \omega_5 l_{adv}$   (3)

where we set the weights as ω4 = 0.1 and ω5 = 0.01.
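A minimal sketch of l_inv follows; index 31 is our mapping of the activation after 'conv5_2' in torchvision's vgg19().features, and taking the mean rather than the sum of absolute differences is an assumption.

```python
import torch
from torchvision.models import vgg19

# phi_h: highest-level VGG-19 feature used by the alignment-invariant loss.
phi_h = vgg19(pretrained=True).features[:32].eval()
for p in phi_h.parameters():
    p.requires_grad = False

def alignment_invariant_loss(T_hat, T):
    """l_inv = || phi_h(T) - phi_h(T_hat) ||_1 (averaged over feature elements)."""
    return torch.mean(torch.abs(phi_h(T_hat) - phi_h(T)))
```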
4. Experiments
4.1. Implementation Details
Training data. We adopt a fusion of synthetic and real data as our training dataset. The images from [5] are used as synthetic
Table 1: Comparison of different settings. Our full model (i.e., ERRNet) leads to the best performance among all comparisons.

                      Synthetic            Real20
Model                 PSNR     SSIM        PSNR     SSIM
CEILNet-F [5]         24.70    0.884       20.32    0.739
BaseNet only          25.71    0.926       21.51    0.780
BaseNet + CWC         27.64    0.940       22.61    0.796
BaseNet + MSC         26.03    0.928       21.75    0.783
ERRNet                27.88    0.941       22.89    0.803
data, i.e., 7,643 cropped images of size 224 × 224 from the PASCAL VOC dataset [4]. 90 real-world training images
from [47] are adopted as real data. For image synthesis,
we use the same data generation model as [5] to create our
synthetic data. In the following, we always use the same
dataset for training, unless specifically stated.
Training details. Our implementation1 is based on PyTorch. We train the model for 60 epochs using the Adam optimizer [19]. The base learning rate is set to $10^{-4}$, halved at epoch 30, and then reduced to $10^{-5}$ at epoch 50. The weights are initialized as in [26].
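This schedule maps onto a simple LambdaLR scheduler, as sketched below; the placeholder module and the omitted per-epoch training pass are details not specified above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3)  # placeholder standing in for the reflection removal network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Multiplicative factors relative to the base rate: 1e-4, then 5e-5 from epoch 30,
# then 1e-5 from epoch 50.
def lr_lambda(epoch):
    if epoch < 30:
        return 1.0
    if epoch < 50:
        return 0.5
    return 0.1

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for epoch in range(60):
    # ... one pass over the training set goes here ...
    scheduler.step()
```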
4.2. Ablation Study
In this section, we conduct an ablation study for our
method on 100 synthetic testing images from [5] and 20
real testing images from [47] (denoted by ‘Real20’).
Component analysis. To verify the importance of our
network design, we compare four model architectures as
described in Section 3, including (1) Our basic image re-
construction network BaseNet; (2) BaseNet with channel-
wise context module (BaseNet + CWC); (3) BaseNet with
multi-scale spatial context module (BaseNet + MSC); and
(4) Our enhanced reflection removal network, denoted ER-
RNet, i.e., BaseNet + CWC + MSC. The result from the
CEILNet [5] fine-tuned on our training data (denoted by
CEILNet-F) is also provided as an additional reference.
As shown in Table 1, our BaseNet has already achieved
a much better result than CEILNet-F. The performance of BaseNet is further boosted by the channel-wise context and multi-scale spatial context modules, especially when they are used together, i.e., as ERRNet. Figure 4 visu-
ally shows the results from BaseNet and our ERRNet. It can
be observed that BaseNet struggles to discriminate the re-
flection region and yields some obvious residuals, while the
ERRNet removes the reflection and produces much cleaner
transmitted images. These results suggest the effectiveness
of our network design, especially the components tailored
to encode the contextual clues.
Efficacy of the training loss for unaligned data. In this
1Code is released at https://github.com/Vandermode/ERRNet
Input BaseNet ERRNet
Figure 4: Comparison of the results with (ERRNet) and without
(BaseNet) the context encoding modules.
Table 2: Simulation experiment to verify the efficacy of our alignment-invariant loss.

Training Scheme                               PSNR     SSIM
Synthetic only                                19.79    0.741
+ 50 aligned                                  22.00    0.785
+ 90 aligned                                  22.89    0.803
+ 50 aligned, + 40 unaligned trained with:
    l_pixel                                   21.85    0.766
    l_inv                                     22.38    0.797
    l_cx                                      22.47    0.796
    l_inv + l_cx                              22.43    0.796
experiment, we first train our ERRNet with only ‘synthetic
data’, ‘synthetic + 50 aligned real data’, and ‘synthetic +
90 aligned real data’. The loss function in Eq. (2) is used
for aligned data. As shown in Table 2, the testing results improve as more aligned real data are added.
Then, we synthesize misalignment by performing random translations within [-10, 10] pixels on real data2, and train ERRNet with 'synthetic + 50 aligned real data + 40 unaligned data'. The pixel-wise loss l_pixel and the alignment-invariant loss l_inv are used for the 40 unaligned images. Table 2 shows that employing the 40 unaligned images with the l_pixel loss degrades performance, even below that obtained from 50 aligned images without any additional unaligned data.
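A sketch of this misalignment simulation is given below; torch.roll wraps pixels around at the borders, which is only a stand-in for whatever border handling was actually used.

```python
import random
import torch

def random_misalign(T, max_shift=10):
    """Create an unaligned reference by shifting the ground-truth transmission T
    by a random offset in [-max_shift, max_shift] pixels along each axis."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    return torch.roll(T, shifts=(dy, dx), dims=(-2, -1))
```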
In addition, we also investigate the contextual loss l_cx of [27]. Results from both the contextual loss l_cx and our alignment-invariant loss l_inv (or their combination l_inv + l_cx) surpass the analogous results obtained with only aligned images by appreciable margins, indicating that these losses provide useful supervision to networks granted unaligned data. Note that although l_inv and l_cx perform equally well, our l_inv is much simpler and more computationally efficient than l_cx, suggesting that l_inv is a lightweight alternative to l_cx for our reflection removal task.
2Our alignment-invariant loss linv can handle shifts of up to 20 pixels.
See suppl. material for more details.
4.3. Method Comparison on Benchmarks
In this section, we compare our ERRNet against state-of-
the-art methods including the optimization-based method of
[25] (LB14) and the learning-based approaches (CEILNet
[5], Zhang et al. [47], and BDN [44]). For fair comparison,
we finetune these models on our training dataset and report
results of both the original pretrained model and finetuned
version (denoted with a suffix ’-F’).
The comparison is conducted on four real-world
datasets, i.e. 20 testing images in [47] and three sub-datasets
from SIR2 [37]. These three sub-datasets are captured under
different conditions: (1) 20 controlled indoor scenes com-
posed by solid objects; (2) 20 different controlled scenes
on postcards; and (3) 55 wild scenes3with ground truth
provided. In the following, we denote these datasets by
‘Real20’, ‘Objects’, ‘Postcard’, and ‘Wild’, respectively.
Table 3 summarizes the results of all competing meth-
ods on four real-world datasets. The quality metrics include
PSNR, SSIM [40], NCC [43, 37] and LMSE [8]. Larger
values of PSNR, SSIM, and NCC indicate better perfor-
mance, while a smaller value of LMSE implies a better re-
sult. Our ERRNet achieves the state-of-the-art performance
in ‘Real20’ and ‘Objects’ datasets. Meanwhile, our result
is comparable to the best-performing BDN-F on ‘Postcard’
data. The quantitative results on the 'Wild' dataset reveal a frustrating fact, namely, that no method outperforms the naive baseline 'Input', suggesting that there is still substantial room for improvement.
Figure 5 displays visual results on real-world images. It
can be seen that all compared methods fail to handle some
strong reflections, but our network more accurately removes
many undesirable artifacts, e.g. removal of tree branches re-
flected on the building window in the fourth photo of Fig. 5.
4.4. Training with Unaligned Data
To test our alignment-invariant loss on real-world un-
aligned data, we first collected a dataset of unaligned im-
age pairs using cameras and a portable glass, as shown in Fig. 1. Both a DSLR camera and a smartphone were used to capture the images. We collected 450 image pairs in total, and some samples are shown in Fig. 6. These image pairs
are randomly split into a training set of 400 samples and a
testing set with 50 samples.
We conduct experiments on the BDN-F and ERRNet
models, each of which is first trained on the aligned dataset (w/o unaligned data) as in Section 4.3, and then finetuned with
our alignment-invariant loss and unaligned training data.
The resulting pairs before and after finetuning are assem-
bled for human assessment, as no existing numerical metric
is available for evaluating unaligned data.
We asked 30 human observers to provide a preference
3Images indexed by 1, 2, 74 are removed due to misalignment.
Figure 5: Visual comparison on real-world images. Columns: Input, LB14 [25], CEILNet-F [5], Zhang et al. [47], BDN-F [44], ERRNet, Reference. The images are obtained from 'Real20' (Rows 1-3) and our collected unaligned dataset (Rows 4-5). More results can be found in the suppl. material.
Table 3: Quantitative results of different methods on four real-world benchmark datasets. The best results are indicated by red color and the second best results are denoted by blue color. The results of 'Average' are obtained by averaging the metric scores of all images from these four real-world datasets.

Dataset    Index   Input   LB14 [25]  CEILNet [5]  CEILNet-F  Zhang et al. [47]  BDN [44]  BDN-F   ERRNet
Real20     PSNR    19.05   18.29      18.45        20.32      21.89              18.41     20.06   22.89
           SSIM    0.733   0.683      0.690        0.739      0.787              0.726     0.738   0.803
           NCC     0.812   0.789      0.813        0.834      0.903              0.792     0.825   0.877
           LMSE    0.027   0.033      0.031        0.028      0.022              0.032     0.027   0.022
Objects    PSNR    23.74   19.39      23.62        23.36      22.72              22.73     24.00   24.87
           SSIM    0.878   0.786      0.867        0.873      0.879              0.856     0.893   0.896
           NCC     0.981   0.971      0.972        0.974      0.964              0.978     0.978   0.982
           LMSE    0.004   0.007      0.005        0.005      0.005              0.005     0.004   0.003
Postcard   PSNR    21.30   14.88      21.24        19.17      16.85              20.71     22.19   22.04
           SSIM    0.878   0.795      0.834        0.793      0.799              0.859     0.881   0.876
           NCC     0.947   0.929      0.945        0.926      0.886              0.943     0.941   0.946
           LMSE    0.005   0.008      0.008        0.013      0.007              0.005     0.004   0.004
Wild       PSNR    26.24   19.05      22.36        22.05      21.56              22.36     22.74   24.25
           SSIM    0.897   0.755      0.821        0.844      0.836              0.830     0.872   0.853
           NCC     0.941   0.894      0.918        0.924      0.919              0.932     0.922   0.917
           LMSE    0.005   0.027      0.013        0.009      0.010              0.009     0.008   0.011
Average    PSNR    22.85   17.51      22.30        21.41      20.22              21.70     22.96   23.59
           SSIM    0.874   0.781      0.841        0.832      0.838              0.848     0.879   0.879
           NCC     0.955   0.937      0.948        0.943      0.925              0.951     0.950   0.956
           LMSE    0.006   0.011      0.009        0.010      0.007              0.007     0.006   0.005
score among {-2, -1, 0, 1, 2}, with 2 indicating that the finetuned result is significantly better and -2 the opposite. To avoid
bias, we randomly switch the image positions of each pair.
In total, 3000 human judgments are collected (2 methods,
30 users, 50 image pairs). More details regarding this eval-
uation process can be found in the suppl. material.
Figure 6: Image samples in our unaligned image dataset. Our dataset covers a large variety of indoor and outdoor environments including
dynamic scenes with vehicles, humans, etc.
Table 4: Human preference scores of the self-comparison experiments. Left sub-figure: per-image scores of BDN-F; right sub-figure: per-image scores of ERRNet. The x-axis of each sub-figure indexes the 50 testing images.

Score Range         Ratio (BDN-F)    Ratio (ERRNet)
(0.25, 2]           78%              54%
[-0.25, 0.25]       18%              36%
[-2, -0.25)         4%               10%
Average Score       0.62             0.51
Figure 7: Results of training with and without unaligned data. Columns: input, reference, BDN-F without and with unaligned data, ERRNet without and with unaligned data. See suppl. material for more examples. (Best viewed on screen with zoom.)
Table 4 shows the average of human preference scores
for the resulting pairs of each method. As can be seen, hu-
man observers clearly tend to prefer the results produced
by the finetuned models over the raw ones, which demon-
strates the benefit of leveraging unaligned data for training
independent of the network architecture. Figure 7 shows
some typical results of the two methods; the results are sig-
nificantly improved by training on unaligned data.
5. Conclusion
We have proposed an enhanced reflection removal net-
work together with an alignment-invariant loss function to
help resolve the difficulty of single image reflection re-
moval. We investigated the possibility of directly utilizing
misaligned training data, which can significantly alleviate
the burden of capturing real-world training data. To effi-
ciently extract the underlying knowledge from real train-
ing data, we introduce context encoding modules, which
can be seamlessly embedded into our network to help dis-
criminate and suppress the reflection component. Extensive
experiments demonstrate that our approach sets a new state-of-the-art on real-world benchmarks of single image reflection
removal, both quantitatively and visually.
Acknowledgments
We thank Yunhao Zou for great help collecting the re-
flection image dataset. This work was supported by the Na-
tional Natural Science Foundation of China under Grants
No. 61425013 and No. 61672096.
References
[1] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Remov-
ing photography artifacts using gradient projection and flash-
exposure sampling. ACM Transactions on Graphics (TOG),
24(3):828–835, 2005.
[2] N. Arvanitopoulos, R. Achanta, and S. Süsstrunk. Single im-
age reflection suppression. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), July 2017.
[3] Z. Chi, X. Wu, X. Shu, and J. Gu. Single image reflection
removal using deep encoder-decoder network. arXiv preprint
arXiv:1802.00094, 2018.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman. The pascal visual object classes (voc)
challenge. International Journal of Computer Vision (IJCV),
88(2):303–338, 2010.
[5] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic
deep architecture for single image reflection removal and im-
age smoothing. In The IEEE International Conference on
Computer Vision (ICCV), Oct 2017.
[6] H. Farid and E. H. Adelson. Separating reflections and light-
ing using independent components analysis. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
July 1999.
[7] K. Gai, Z. Shi, and C. Zhang. Blind separation of superim-
posed moving images using image statistics. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
34(1):19–32, 2012.
[8] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Free-
man. Ground truth dataset and baseline evaluations for in-
trinsic image algorithms. In IEEE International Conference
on Computer Vision (ICCV). IEEE, Oct 2009.
[9] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection
from multiple images. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2014.
[10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hy-
percolumns for object segmentation and fine-grained local-
ization. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling
in deep convolutional networks for visual recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 37(9):1904–1916, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation net-
works. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image
translation with conditional adversarial networks. In The
IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), July 2017.
[16] M. Jin, S. Süsstrunk, and P. Favaro. Learning to see through
reflections. In IEEE International Conference on Computa-
tional Photography (ICCP), May 2018.
[17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for
real-time style transfer and super-resolution. European Con-
ference on Computer Vision (ECCV), pages 694–711, 2016.
[18] A. Jolicoeur-Martineau. The relativistic discriminator: a key
element missing from standard GAN. In International Con-
ference on Learning Representations (ICLR), 2019.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] N. Kong, Y.-W. Tai, and J. S. Shin. A physically-based
approach to reflection separation: from physical modeling
to constrained optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 36(2):209–221,
2014.
[21] D. Lee, M.-H. Yang, and S. Oh. Generative single image re-
flection separation. arXiv preprint arXiv:1801.04102, 2018.
[22] A. Levin and Y. Weiss. User assisted separation of reflections
from a single image using a sparsity prior. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
29(9):1647–1654, 2007.
[23] A. Levin, A. Zomet, and Y. Weiss. Learning to perceive
transparency from the statistics of natural scenes. In Ad-
vances in Neural Information Processing Systems (NIPS).
December 2002.
[24] Y. Li and M. S. Brown. Exploiting reflection change for au-
tomatic reflection removal. In The IEEE International Con-
ference on Computer Vision (ICCV), December 2013.
[25] Y. Li and M. S. Brown. Single image layer separation us-
ing relative smoothness. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 2752–2759,
2014.
[26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced
deep residual networks for single image super-resolution.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, July 2017.
[27] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual
loss for image transformation with non-aligned data. In The
European Conference on Computer Vision (ECCV), Septem-
ber 2018.
[28] S. Nah, T. Hyun Kim, and K. Mu Lee. Deep multi-scale
convolutional neural network for dynamic scene deblurring.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge. In-
ternational Journal of Computer Vision (IJCV), 115(3):211–
252, 2015.
[30] B. Sarel and M. Irani. Separating transparent layers through
layer information exchange. In European Conference on
Computer Vision (ECCV), September 2004.
[31] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflec-
tion removal using ghosting cues. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2015.
[32] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. International
Conference on Learning Representations (ICLR), 2015.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[34] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and
R. Szeliski. Image-based rendering for scenes with reflec-
tions. ACM Transactions on Graphics (TOG), 31(4):100–1,
2012.
[35] R. Szeliski, S. Avidan, and P. Anandan. Layer extrac-
tion from multiple images containing reflections and trans-
parency. In IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), July 2000.
[36] R. Wan, B. Shi, L. Duan, A. Tan, W. Gao, and A. C. Kot.
Region-aware reflection removal with unified content and
gradient priors. IEEE Transactions on Image Processing,
27(6):2927–2941, 2018.
[37] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot.
Benchmarking single-image reflection removal algorithms.
In The IEEE International Conference on Computer Vision
(ICCV), Oct 2017.
[38] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. Crrn:
Multi-scale guided concurrent reflection removal network.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[39] R. Wan, B. Shi, T. A. Hwee, and A. C. Kot. Depth of field
guided reflection removal. In IEEE International Conference
on Image Processing, September 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004.
[41] Y. Wu and K. He. Group normalization. In European Con-
ference on Computer Vision (ECCV), September 2018.
[42] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0
gradient minimization. In ACM Transactions on Graphics
(TOG), volume 30, page 174, 2011.
[43] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A com-
putational approach for obstruction-free photography. ACM
Transactions on Graphics (TOG), 34(4):79, 2015.
[44] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and
bidirectionally: A deep learning approach for single image
reflection removal. In The European Conference on Com-
puter Vision (ECCV), September 2018.
[45] J. Yang, H. Li, Y. Dai, and R. T. Tan. Robust optical flow
estimation of double-layer images under transparency or re-
flection. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[46] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and
A. Agrawal. Context encoding for semantic segmentation.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[47] X. Zhang, R. Ng, and Q. Chen. Single image reflection
separation with perceptual losses. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2018.
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), July 2017.
[49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-
to-image translation using cycle-consistent adversarial net-
works. In The IEEE International Conference on Computer
Vision (ICCV), Oct 2017.