Single Image Reflection Removal Exploiting Misaligned Training Data and
Network Enhancements
Kaixuan Wei^1    Jiaolong Yang^2    Ying Fu^1    David Wipf^2    Hua Huang^1
^1 Beijing Institute of Technology    ^2 Microsoft Research
(Corresponding author: fuying@bit.edu.cn)
Abstract
Removing undesirable reflections from a single image
captured through a glass window is of practical impor-
tance to visual computing systems. Although state-of-the-
art methods can obtain decent results in certain situations,
performance declines significantly when tackling more gen-
eral real-world cases. These failures stem from the intrin-
sic difficulty of single image reflection removal – the funda-
mental ill-posedness of the problem, and the insufficiency of
densely-labeled training data needed for resolving this am-
biguity within learning-based neural network pipelines. In
this paper, we address these issues by exploiting targeted
network enhancements and the novel use of misaligned
data. For the former, we augment a baseline network archi-
tecture by embedding context encoding modules that are ca-
pable of leveraging high-level contextual clues to reduce in-
determinacy within areas containing strong reflections. For
the latter, we introduce an alignment-invariant loss func-
tion that facilitates exploiting misaligned real-world train-
ing data that is much easier to collect. Experimental results
collectively show that our method outperforms the state-of-
the-art with aligned data, and that significant improvements
are possible when using additional misaligned data.
1. Introduction
Reflection is a frequently-encountered source of image
corruption that can arise when shooting through a glass sur-
face. Such corruptions can be addressed via the process of
single image reflection removal (SIRR), a challenging prob-
lem that has attracted considerable attention from the com-
puter vision community [22, 25, 39, 2, 5, 47, 44, 38]. Tra-
ditional optimization-based methods often leverage manual
intervention or strong prior assumptions to render the prob-
lem more tractable [22, 25]. Recently, alternative learning-
based approaches rely on deep Convolutional Neural Net-
works (CNNs) in lieu of the costly optimization and hand-
crafted priors [5, 47, 44, 38]. But promising results notwithstanding, SIRR remains a largely unsolved problem across
disparate imaging conditions and varying scene content.
For CNN-based reflection removal, our focus herein, the
challenge originates from at least two sources: (i) The ex-
traction of a background image layer devoid of reflection
artifacts is fundamentally ill-posed, and (ii) Training data
from real-world scenes are exceedingly scarce because of
the difficulty in obtaining ground-truth labels.
Mathematically speaking, it is typically assumed that a captured image $I$ is formed as a linear combination of a
background or transmitted layer $T$ and a reflection layer $R$, i.e., $I = T + R$. Obviously, when given access only to $I$,
there exists an infinite number of feasible decompositions. Further compounding the problem is the fact that both $T$
and $R$ involve content from real scenes that may have overlapping appearance distributions. This can make them
difficult to distinguish even for human observers in some cases, and simple priors that might mitigate this ambiguity
are not available except under specialized conditions.
On the other hand, although CNNs can perform a wide
variety of visual tasks, at times exceeding human capabilities,
they generally require a large volume of labeled training
data. Unfortunately, real reflection images accompanied
with densely-labeled, ground-truth transmitted layer inten-
sities are scarce. Consequently, previous learning-based ap-
proaches have resorted to training with synthesized images
[5, 38, 47] and/or small amounts of real-world data captured from spe-
cialized devices [47]. However, existing image synthesis
procedures are heuristic and the domain gap may jeopardize
accuracy on real images. On the other hand, collecting suf-
ficient additional real data with precise ground-truth labels
is tremendously labor-intensive.
This paper is devoted to addressing both of the afore-
mentioned challenges. First, to better tackle the intrinsic
ill-posedness and diminish ambiguity, we propose to lever-
age a network architecture that is sensitive to contextual in-
formation, which has proven useful for other vision tasks
such as semantic segmentation [11, 48, 46, 13]. Note that
at a high level, our objective is to efficiently convert prior
information mined from labeled training data into network
structures capable of resolving this ambiguity. Within a tra-
ditional CNN model, especially in the early layers where
the effective receptive field is small, the extracted features
across all channels are inherently local. However, broader
non-local context is necessary to differentiate those features
that are descriptive of the desired transmitted image from
those that can be discarded as reflection-based. For ex-
ample, in image neighborhoods containing a particularly
strong reflection component, accurate separation by any
possible method (even one trained with arbitrarily rich train-
ing data) will likely require contextual information from re-
gions without reflection. To address this issue, we utilize
two complementary forms of context, namely, channel-wise
context and multi-scale spatial context. Regarding the for-
mer, we apply a channel attention mechanism to the fea-
ture maps from convolutional layers such that different fea-
tures are weighed differently according to global statistics
of the activations. For the latter, we aggregate information
across a pyramid of feature map scales within each chan-
nel to reach a global contextual consistency in the spatial
domain. Our experiments demonstrate that significant im-
provement can be obtained by these enhancements, leading
to state-of-the-art performance on two real-image datasets.
Secondly, orthogonal to architectural considerations, we
seek to expand the sources of viable training data by facil-
itating the use of misaligned training pairs, which are con-
siderably easier to collect. Misalignment between an input
image $I$ and a ground-truth reflection-free version $T$ can be
caused by camera and/or object movements during the ac-
quisition process. In the previous works [37, 47], data pairs
$(I, T)$ were obtained by taking an initial photo through a
glass plane, followed by capturing a second one after the
glass has been removed. This process requires that the
camera, scene, and even lighting conditions remain static.
Adhering to these requirements across a broad acquisition
campaign can significantly reduce both the quantity and di-
versity of the collected data. Additionally, post-processing
may also be necessary to accurately align $I$ and $T$ to com-
pensate for spatial shifts caused by the refractive effect [37].
In contrast, capturing unaligned data is considerably less
burdensome, as shown in Fig. 1. For example, there is no
need for a tripod, table, or other special hardware; the cam-
era can be hand-held and the pose can be freely adjusted;
dynamic scenes in the presence of vehicles, humans, etc.
can be incorporated; and finally no post-processing of any
type is needed.
To handle such misaligned training data, we require a
loss function that is, to the extent possible, invariant to the
alignment, i.e., the measured image content discrepancy be-
tween the network prediction and its unaligned reference
should be similar to what would have been observed if
the reference was actually aligned. In the context of im-
age style transfer [17] and others, certain perceptual loss
functions have been shown to be relatively invariant to various transformations. Our study shows that using only
the highest-level feature from a deep network (VGG-19 in our case) leads to satisfactory results for our reflection
removal task. In both simulation tests and experiments using a newly collected dataset, we demonstrate for the first
time that training/fine-tuning a CNN with unaligned data improves the reflection removal results by a large margin.

Figure 1: Comparison of the reflection image data collection methods in [47] and this paper.
2. Related Work
This paper is concerned with reflection removal from
a single image. Previous methods utilizing multiple input
images of, e.g., flash/non-flash pairs [1], different polariza-
tion [20], multi-view or video sequences [6, 35, 30, 7, 24,
34, 9, 43, 45] will not be considered here.
Traditional methods. Reflection removal from a single im-
age is a massively ill-posed problem. Additional priors are
needed to solve the otherwise prohibitively-difficult prob-
lem in traditional optimization-based methods [22, 25, 39,
2, 36]. In [22], user annotations are used to guide layer
separation jointly with a gradient sparsity prior [23]. [25]
introduces a relative smoothness prior where the reflections
are assumed to be blurry and thus their large gradients are penal-
ized. [39] explores a variant of the smoothness prior where
a multi-scale Depth-of-Field (DoF) confidence map is uti-
lized to perform edge classification. [31] exploits the ghost
cues for layer separation. [2] proposes a simple optimization formulation with an $\ell_0$ gradient penalty on the
transmitted layer, inspired by image smoothing algorithms [42]. Although decent results can be obtained by these
methods when their assumptions hold, the vastly different imaging conditions and complex scene content in the real
world render their generalization problematic.
Deep learning based methods. Recently, there has been an
emerging interest in applying deep convolutional neural net-
works for single image reflection removal such that the
handcrafted priors can be replaced by data-driven learn-
ing [5, 38, 47, 44]. The first CNN-based method is due
to [5], where a network structure is proposed to first pre-
dict the background layer in the edge domain followed by
reconstructing it in the color domain. Later, [38] proposes to
predict the edge and image intensity concurrently by two
cooperative sub-networks. The recent work of [44] presents
a cascade network structure which predicts the background
layer and reflection layer in an interleaved fashion. The ear-
lier CNN-based methods typically use the raw image intensity
discrepancy such as mean squared error (MSE) to train the
networks. Several recent works [47, 16, 3] adopt the per-
ceptual loss [17] which uses the multi-stage features of a
deep network pre-trained on ImageNet [29]. Adversarial loss is investigated in [47, 21] to improve the realism
of the predicted background layers.
3. Approach
Given an input image $I$ contaminated with reflections, our goal is to estimate a reflection-free transmitted image
$\hat{T}$. To achieve this, we train a feed-forward CNN $G_{\theta_G}$, parameterized by $\theta_G$, to minimize a
reflection removal loss function $l$. Given training image pairs $\{(I_n, T_n)\}$, $n = 1, \cdots, N$, this involves solving:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l(G_{\theta_G}(I_n), T_n). \quad (1)$$

We will first introduce the details of the network architecture $G_{\theta_G}$, followed by the loss function $l$ applied to
both aligned data (the common case) and the newly proposed unaligned data extensions. The overall system is
illustrated in Fig. 2.
3.1. Basic Image Reconstruction Network
Our starting point can be viewed as the basic image re-
construction neural network component from [5] but mod-
ified in three aspects: (1) We simplify the basic residual
block [12] by removing the batch normalization (BN) layer
[14]; (2) we increase the capacity by widening the network
from 64 to 256 feature maps; and (3) for each input image
$I$, we extract hypercolumn features [10] from a pretrained VGG-19 network [32], and concatenate these features
with $I$ as an augmented network input. As explained in [47],
such an augmentation strategy can help enable the network
to learn semantic clues from the input image.
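To make these three modifications concrete, the sketch below shows a BN-free residual block and the hypercolumn-style input augmentation in PyTorch; the particular VGG-19 slices, the omission of input normalization, and the module wiring are illustrative assumptions of this sketch, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ResidualBlock(nn.Module):
    """Basic residual block from [12] with the batch-normalization layers removed."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))  # no BN anywhere

class HypercolumnInput(nn.Module):
    """Concatenate the input image with upsampled VGG-19 features (hypercolumn [10])."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):  # relu1_2/relu2_2/relu3_4/relu4_4: assumed layer choice
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # fixed, pretrained feature extractor
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, img):
        # ImageNet normalization of `img` is omitted here for brevity.
        feats, x = [img], img
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(F.interpolate(x, size=img.shape[-2:],
                                           mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)  # augmented network input
```

In practice the concatenated hypercolumn would be projected back to the 256-channel working width (e.g., with a 1×1 convolution) before entering the residual blocks; that projection is omitted from the sketch.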
Note that removing the BN layer from our network turns
out to be critical for optimizing performance in the present
context. As shown in [41], if batch sizes become too small,
prediction errors can increase precipitously and stability is-
sues can arise. Moreover, for a dense prediction task such as
SIRR, large batch sizes can become prohibitively expensive
in terms of memory requirements. In our case, we found
that within the tenable batch sizes available for reflection re-
moval, BN led to considerably worse performance, includ-
ing color attenuation/shifting issues as sometimes observed
in image-to-image translation tasks [5, 15, 49]. BN layers
have similarly been removed from other dense prediction
tasks such as image super-resolution [26] or deblurring [28].
At this point, we have constructed a useful base archi-
tecture upon which other more targeted alterations will be
applied shortly. This baseline, which we will henceforth
refer to as BaseNet, performs quite well when trained and
tested on synthetic data. However, when deployed on real-
world reflection images we found that its performance de-
graded by an appreciable amount, especially on the 20 real
images from [47]. Therefore, to better mitigate the tran-
sition from the make-believe world of synthetic images to
real-life photographs, we describe two modifications for in-
troducing broader contextual information into otherwise lo-
cal convolutional filters.
3.2. Context Encoding Modules
As mentioned previously, we consider both context be-
tween channels and multi-scale context within channels.
Channel-wise context. The underlying design princi-
ple here is to introduce global contextual information
across channels, and a richer overall structure within resid-
ual blocks, without dramatically increasing the parameter
count. One way to accomplish this is by incorporating a
channel attention module originally developed in [13] to re-
calibrate feature maps using global summary statistics.
Let $U = [u_1, \ldots, u_c, \ldots, u_C]$ denote the original, uncalibrated activations produced by a network block, with
$C$ feature maps of size $H \times W$. These activations generally only reflect local information residing within the
corresponding receptive fields of each filter. We then form scalar, channel-specific descriptors $z_c = f_{gp}(u_c)$ by
applying a global average pooling operator $f_{gp}$ to each feature map $u_c \in \mathbb{R}^{H \times W}$. The vector
$z = [z_1, \ldots, z_C] \in \mathbb{R}^C$ represents a simple statistical summary of global, per-channel activations and,
when passed through a small network structure, can be used to adaptively predict the relative importance of each
channel [13].

More specifically, the channel attention module first computes $s = \sigma(W_U \delta(W_D z))$, where $W_D$ is a
trainable weight matrix that downsamples $z$ to dimension $R < C$, $\delta$ is a ReLU non-linearity, $W_U$ represents
a trainable upsampling weight matrix, and $\sigma$ is a sigmoidal activation. Elements of the resulting output vector
$s \in \mathbb{R}^C$ serve as channel-specific gates for calibrating feature maps via $\hat{u}_c = s_c \cdot u_c$.

Consequently, although each individual convolutional filter has a local receptive field, the determination of which
channels are actually important in predicting the transmission layer and suppressing reflections is based on the
processing of a global statistic (meaning the channel descriptors computed as activations pass through the network
during inference). Additionally, the parameter overhead introduced by this process is exceedingly modest given that
$W_D$ and $W_U$ are just small additional weight matrices associated with each block.
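For reference, the channel attention described above can be realized with a few lines of PyTorch, in the spirit of the squeeze-and-excitation module of [13]; the reduction ratio and the exact placement inside each residual block are assumptions of this sketch rather than the authors' verified settings.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Recalibrate feature maps with gates computed from global channel statistics."""
    def __init__(self, channels=256, reduction=16):      # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # f_gp: global average pooling -> z
        self.down = nn.Conv2d(channels, channels // reduction, 1)   # W_D (R < C)
        self.up = nn.Conv2d(channels // reduction, channels, 1)     # W_U
        self.relu = nn.ReLU(inplace=True)                 # delta
        self.sigmoid = nn.Sigmoid()                       # sigma

    def forward(self, u):
        z = self.pool(u)                                  # B x C x 1 x 1 channel descriptors
        s = self.sigmoid(self.up(self.relu(self.down(z))))  # s = sigma(W_U delta(W_D z))
        return u * s                                      # u_hat_c = s_c * u_c

class AttentiveResidualBlock(nn.Module):
    """BN-free residual block followed by channel-wise recalibration."""
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.ca = ChannelAttention(channels)

    def forward(self, x):
        return x + self.ca(self.body(x))
```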
Figure 2: Overview of our approach for single image reflection removal. (The diagram shows the generator $G$ built
from 13 residual blocks with channel attention and a pyramid pooling module at scales 1/4 to 1/32, the VGG-19
features concatenated with the input, the discriminator $D$, and the losses used for aligned data (pixel loss
$l_{pixel}$, feature loss $l_{feat}$, adversarial loss $l_{adv}$) and for unaligned data (alignment-invariant loss
$l_{inv}$, adversarial loss).)
Multi-scale spatial context. Although we have found that
encoding the contextual information across channels al-
ready leads to significant empirical gains on real-world im-
ages, utilizing complementary multi-scale spatial informa-
tion within each channel provides further benefit. To ac-
complish this, we apply a pyramid pooling module [11],
which has proven to be an effective global-scene-level rep-
resentation in semantic segmentation [48]. As shown in
Fig. 2, we construct such a module using pooling operations at sizes 4, 8, 16, and 32, situated in the tail of our
network before the final construction of $\hat{T}$. Pooling in this way fuses features under four different pyramid scales. After
harvesting the resulting sub-region representations, we per-
form a non-linear transformation (i.e. a Conv-ReLU pair) to
reduce the channel dimension. The refined features are then
upsampled via bilinear interpolation. Finally, the different
levels of features are concatenated together as a final repre-
sentation reflecting multi-scale spatial context within each
channel; the increased parameter overhead is negligible.
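One way to realize this module is sketched below; the pooling at the 4/8/16/32 scales and the Conv-ReLU refinement follow the description above, while the per-scale output width (a quarter of the input channels) is an illustrative assumption of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Aggregate multi-scale spatial context: pool at several scales, refine each
    pooled map with a Conv-ReLU pair, upsample back, and concatenate (cf. [11, 48])."""
    def __init__(self, channels=256, scales=(4, 8, 16, 32)):
        super().__init__()
        self.scales = scales
        # One lightweight Conv-ReLU per scale to reduce the channel dimension.
        self.refine = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels // len(scales), 1),
                          nn.ReLU(inplace=True))
            for _ in scales
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x]
        for scale, refine in zip(self.scales, self.refine):
            pooled = F.avg_pool2d(x, kernel_size=scale, stride=scale)  # 1/scale resolution
            pooled = refine(pooled)                                    # non-linear transform
            feats.append(F.interpolate(pooled, size=(h, w),
                                       mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)  # multi-scale spatial context as the final representation
```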
3.3. Training Loss for Aligned Data
In this section, we present our loss function for aligned
training pairs $(I, T)$, which consists of three terms similar to previous methods [47, 44].

Pixel loss. Following [5], we penalize the pixel-wise intensity difference of $T$ and $\hat{T}$ via
$$l_{pixel} = \alpha \|\hat{T} - T\|_2^2 + \beta \left( \|\nabla_x \hat{T} - \nabla_x T\|_1 + \|\nabla_y \hat{T} - \nabla_y T\|_1 \right),$$
where $\nabla_x$ and $\nabla_y$ are the gradient operators along the x- and y-directions, respectively. We set
$\alpha = 0.2$ and $\beta = 0.4$ in all our experiments.

Feature loss. We define the feature loss based on the activations of the 19-layer VGG network [33] pretrained on
ImageNet [29]. Letting $\phi_l$ be the feature from the $l$-th layer of VGG-19, we define the feature loss as
$$l_{feat} = \sum_l \lambda_l \|\phi_l(T) - \phi_l(\hat{T})\|_1,$$
where $\{\lambda_l\}$ are the balancing weights. Similar to [47], we use the layers 'conv2_2', 'conv3_2', 'conv4_2',
and 'conv5_2' of the VGG-19 net.

Adversarial loss. We further add an adversarial loss to improve the realism of the produced background images. We
define an opponent discriminator network $D_{\theta_D}$ and minimize the relativistic adversarial loss [18], defined as
$$l_{adv} = l_{adv}^G = -\log(D_{\theta_D}(T, \hat{T})) - \log(1 - D_{\theta_D}(\hat{T}, T))$$
for $G_{\theta_G}$ and
$$l_{adv}^D = -\log(1 - D_{\theta_D}(T, \hat{T})) - \log(D_{\theta_D}(\hat{T}, T))$$
for $D_{\theta_D}$, where $D_{\theta_D}(T, \hat{T}) = \sigma(C(T) - C(\hat{T}))$ with $\sigma(\cdot)$ being the sigmoid
function and $C(\cdot)$ the non-transformed discriminator function (refer to [18] for details).

To summarize, our loss for aligned data is defined as:
$$l_{aligned} = \omega_1 l_{pixel} + \omega_2 l_{feat} + \omega_3 l_{adv} \quad (2)$$
where we empirically set the weights as $\omega_1 = 1$, $\omega_2 = 0.1$, and $\omega_3 = 0.01$ throughout our experiments.
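Putting the three terms together, a simplified PyTorch rendering of Eq. (2) might look as follows; the VGG feature extractor `vgg_feats`, the raw discriminator head `C`, and the per-layer balancing weights are placeholders/assumptions, and mean-reduced losses stand in for the norms written above.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Forward-difference image gradients along x and y (img: B x C x H x W)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def pixel_loss(T_hat, T, alpha=0.2, beta=0.4):
    # Mean-reduced surrogates for the squared L2 / L1 norms in the text.
    dx_hat, dy_hat = gradient(T_hat)
    dx, dy = gradient(T)
    return (alpha * F.mse_loss(T_hat, T)
            + beta * (F.l1_loss(dx_hat, dx) + F.l1_loss(dy_hat, dy)))

def feature_loss(vgg_feats, T_hat, T, weights):
    """`vgg_feats(img)` is assumed to return the conv2_2 ... conv5_2 activations."""
    return sum(w * F.l1_loss(fh, ft)
               for w, fh, ft in zip(weights, vgg_feats(T_hat), vgg_feats(T)))

def adversarial_loss_G(C, T_hat, T):
    """Generator side of the relativistic loss, with C the non-transformed discriminator."""
    d_real_fake = torch.sigmoid(C(T) - C(T_hat))   # D(T, T_hat)
    d_fake_real = torch.sigmoid(C(T_hat) - C(T))   # D(T_hat, T)
    eps = 1e-8
    return -(torch.log(d_real_fake + eps) + torch.log(1 - d_fake_real + eps)).mean()

def aligned_loss(T_hat, T, vgg_feats, C,
                 w1=1.0, w2=0.1, w3=0.01, feat_weights=(1.0, 1.0, 1.0, 1.0)):
    # feat_weights stands in for the unspecified balancing weights {lambda_l}.
    return (w1 * pixel_loss(T_hat, T)
            + w2 * feature_loss(vgg_feats, T_hat, T, feat_weights)
            + w3 * adversarial_loss_G(C, T_hat, T))
```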
3.4. Training Loss for Unaligned Data
To use misaligned data pairs $(I, T)$ for training, we need a loss function that is invariant to the alignment, such
that the true similarity between $T$ and the prediction $\hat{T}$ can be reasonably measured. In this regard, we note
that human observers can easily assess the similarity of two images even if they are not aligned. Consequently,
designing a loss measuring image similarity at the perceptual level may serve our goal. This motivates us to directly
use a deep feature loss for unaligned data.

Intuitively, the deeper the feature, the more likely it is to be insensitive to misalignment. To experimentally verify
this and find a suitable feature layer for our purposes, we conducted tests using a pre-trained VGG-19 network as
follows. Given an unaligned image pair $(I, T)$, we use gradient descent to finetune the weights of our network
$G_{\theta_G}$ to minimize the feature difference of $T$ and $\hat{T}$, with features extracted at different layers of
VGG-19. Figure 3 shows that using low-level or middle-level features from 'conv2_2' to 'conv4_2' leads to blurry
results (similar to directly using a pixel-wise loss), although the reflection is more thoroughly removed. In contrast,
using the highest-level feature from 'conv5_2' gives rise to a striking result: the predicted background image is sharp
and almost reflection-free.
Figure 3: The effect of using different losses to handle misaligned real data. (a) and (b) are the unaligned image
pair $(I, T)$. (c) shows the reflection removal result of our network trained on synthetic data and a small number of
aligned real data (see Section 4 for details); reflection can still be observed in the predicted background image. (d)
is the result finetuned on $(I, T)$ with the pixel-wise intensity loss. (e)-(h) are the results finetuned with features
at different layers of VGG-19 ('conv2_2', 'conv3_2', 'conv4_2', 'conv5_2'); only the highest-level feature from
'conv5_2' yields a satisfactory result. (i) shows the result finetuned with the loss of [27]. (Best viewed on screen
with zoom)
Recently, [27] introduced a “contextual loss” which is
also designed for training deep networks with unaligned
data for image-to-image translation tasks like image style
transfer. In Fig 3, we also present the finetuned result us-
ing this loss for our reflection removal task. Upon visual
inspection, the results are similar to our highest-level VGG
feature loss (quantitative comparison can be found in the
experiment section). However, our adopted loss (formally
defined below) is much simpler and more computationally
efficient than the loss from [27].
Alignment-invariant loss. Based on the above study, we now formally define our invariant loss component designed for
unaligned data as $l_{inv} = \|\phi_h(T) - \phi_h(\hat{T})\|_1$, where $\phi_h$ denotes the 'conv5_2' feature of the
pretrained VGG-19 network. For unaligned data, we also apply an adversarial loss, which is not affected by
misalignment. Therefore, our overall loss for unaligned data can be written as
$$l_{unaligned} = \omega_4 l_{inv} + \omega_5 l_{adv} \quad (3)$$
where we set the weights as $\omega_4 = 0.1$ and $\omega_5 = 0.01$.
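In code, $l_{inv}$ differs from the aligned feature loss only in comparing a single, highest-level VGG-19 activation; a brief sketch is shown below, with the 'conv5_2' extractor and the generator-side adversarial term passed in as assumed callables (e.g., reusing `adversarial_loss_G` from the previous sketch).

```python
import torch
import torch.nn.functional as F

def alignment_invariant_loss(conv5_2, T_hat, T):
    """l_inv: L1 distance between the highest-level ('conv5_2') VGG-19 features.

    `conv5_2(img)` is an assumed callable returning that single feature map.
    """
    return F.l1_loss(conv5_2(T_hat), conv5_2(T))

def unaligned_loss(T_hat, T, conv5_2, adversarial_loss_G, C, w4=0.1, w5=0.01):
    # Eq. (3): the adversarial term is unaffected by misalignment, so the same
    # generator-side relativistic loss can be reused for unaligned pairs.
    return (w4 * alignment_invariant_loss(conv5_2, T_hat, T)
            + w5 * adversarial_loss_G(C, T_hat, T))
```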
4. Experiments
4.1. Implementation Details
Table 1: Comparison of different settings. Our full model (i.e., ERRNet) leads to the best performance among all
comparisons.

                      Synthetic          Real20
Model                 PSNR     SSIM      PSNR     SSIM
CEILNet-F [5]         24.70    0.884     20.32    0.739
BaseNet only          25.71    0.926     21.51    0.780
BaseNet + CWC         27.64    0.940     22.61    0.796
BaseNet + MSC         26.03    0.928     21.75    0.783
ERRNet                27.88    0.941     22.89    0.803

Training data. We adopt a fusion of synthetic and real data as our training dataset. The images from [5] are used as
synthetic data, i.e., 7,643 cropped images of size 224×224 from the PASCAL VOC dataset [4]. The 90 real-world
training images from [47] are adopted as real data. For image synthesis, we use the same data generation model as [5]
to create our synthetic data. In the following, we always use the same dataset for training, unless specifically stated.
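Since the exact generation model of [5] is not restated here, the sketch below shows only a common simplified synthesis in the same spirit, a blurred and attenuated reflection layer added to a clean transmission image; it should not be read as the precise procedure used for our training data, and the blur width and attenuation factor are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def synthesize_reflection(T, R, sigma=3.0, kernel_size=11, alpha=0.8):
    """Create a training pair (I, T) from two clean images T and R in [0, 1], shape (B, 3, H, W).

    A simplified illustration only: blur the reflection layer with a Gaussian kernel,
    attenuate it, and add it to the transmission -- not necessarily the model of [5].
    """
    # Build a separable Gaussian kernel on the same device/dtype as the inputs.
    coords = torch.arange(kernel_size, dtype=R.dtype, device=R.device) - (kernel_size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] @ g[None, :]).view(1, 1, kernel_size, kernel_size)
    kernel = kernel.repeat(R.shape[1], 1, 1, 1)          # depthwise kernel, one per channel
    R_blur = F.conv2d(R, kernel, padding=kernel_size // 2, groups=R.shape[1])
    I = torch.clamp(T + alpha * R_blur, 0.0, 1.0)        # I = T + R (blurred, attenuated, clipped)
    return I, T
```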
Training details. Our implementation is based on PyTorch; code is released at
https://github.com/Vandermode/ERRNet. We train the model for 60 epochs using the Adam optimizer [19]. The base
learning rate is set to $10^{-4}$, halved at epoch 30, and then reduced to $10^{-5}$ at epoch 50. The weights are
initialized as in [26].
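The optimizer and learning-rate schedule above translate directly into PyTorch as sketched below; the weight initialization of [26] is approximated here with standard Kaiming initialization, which is an assumption of the sketch.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Rough stand-in for the initialization of [26] (an assumption, not their exact scheme).
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def configure_training(model, base_lr=1e-4):
    """Adam for 60 epochs: lr = 1e-4, halved at epoch 30, reduced to 1e-5 at epoch 50."""
    model.apply(init_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(epoch):
        if epoch < 30:
            return 1.0   # 1e-4
        if epoch < 50:
            return 0.5   # 5e-5
        return 0.1       # 1e-5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler   # call scheduler.step() once per epoch
```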
4.2. Ablation Study
In this section, we conduct an ablation study for our
method on 100 synthetic testing images from [5] and 20
real testing images from [47] (denoted by ‘Real20’).
Component analysis. To verify the importance of our
network design, we compare four model architectures as
described in Section 3, including (1) Our basic image re-
construction network BaseNet; (2) BaseNet with channel-
wise context module (BaseNet + CWC); (3) BaseNet with
multi-scale spatial context module (BaseNet + MSC); and
(4) Our enhanced reflection removal network, denoted ER-
RNet, i.e., BaseNet + CWC + MSC. The result from the
CEILNet [5] fine-tuned on our training data (denoted by
CEILNet-F) is also provided as an additional reference.
As shown in Table 1, our BaseNet already achieves a much better result than CEILNet-F. The performance of BaseNet
is further boosted by the channel-wise context and multi-scale spatial context modules, especially when they are used
together, i.e., ERRNet. Figure 4 visu-
ally shows the results from BaseNet and our ERRNet. It can
be observed that BaseNet struggles to discriminate the re-
flection region and yields some obvious residuals, while the
ERRNet removes the reflection and produces much cleaner
transmitted images. These results suggest the effectiveness
of our network design, especially the components tailored
to encode the contextual clues.
Figure 4: Comparison of the results with (ERRNet) and without (BaseNet) the context encoding modules. (Columns:
Input, BaseNet, ERRNet.)

Table 2: Simulation experiment to verify the efficacy of our alignment-invariant loss.

Training Scheme                                 PSNR     SSIM
Synthetic only                                  19.79    0.741
+ 50 aligned                                    22.00    0.785
+ 90 aligned                                    22.89    0.803
+ 50 aligned, + 40 unaligned trained with:
    $l_{pixel}$                                 21.85    0.766
    $l_{inv}$                                   22.38    0.797
    $l_{cx}$                                    22.47    0.796
    $l_{inv} + l_{cx}$                          22.43    0.796

Efficacy of the training loss for unaligned data. In this
experiment, we first train our ERRNet with only 'synthetic data', 'synthetic + 50 aligned real data', and 'synthetic +
90 aligned real data'. The loss function in Eq. (2) is used for aligned data. As shown in Table 2, the testing results
become better as more aligned real data is added.

Then, we synthesize misalignment by performing random translations within $[-10, 10]$ pixels on the real data², and
train ERRNet with 'synthetic + 50 aligned real data + 40 unaligned data'. The pixel-wise loss $l_{pixel}$ and the
alignment-invariant loss $l_{inv}$ are used for the 40 unaligned images. Table 2 shows that employing 40 unaligned
images with the $l_{pixel}$ loss degrades the performance, to a level even worse than that from 50 aligned images
without additional unaligned data.
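The misalignment simulation is straightforward to reproduce; the sketch below shifts the reference image by a random offset in $[-10, 10]$ pixels along each axis (the reflect-padding policy at the borders is an illustrative choice, not necessarily the one used here).

```python
import random
import torch
import torch.nn.functional as F

def random_translate(T, max_shift=10):
    """Shift the reference image T (B x C x H x W) by a random offset in
    [-max_shift, max_shift] pixels per axis, simulating a misaligned pair (I, T)."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    # Pad reflectively, roll by the offset, then crop back to the original size.
    pad = max_shift
    padded = F.pad(T, (pad, pad, pad, pad), mode="reflect")
    shifted = torch.roll(padded, shifts=(dy, dx), dims=(-2, -1))
    return shifted[..., pad:pad + T.shape[-2], pad:pad + T.shape[-1]]
```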
In addition, we also investigate the contextual loss $l_{cx}$ of [27]. Results from both the contextual loss $l_{cx}$
and our alignment-invariant loss $l_{inv}$ (or their combination $l_{inv} + l_{cx}$) surpass the analogous results
obtained with only aligned images by appreciable margins, indicating that these losses provide useful supervision to
networks granted unaligned data. Note that although $l_{inv}$ and $l_{cx}$ perform equally well, our $l_{inv}$ is much
simpler and more computationally efficient than $l_{cx}$, suggesting that $l_{inv}$ is a lightweight alternative to
$l_{cx}$ for our reflection removal task.
² Our alignment-invariant loss $l_{inv}$ can handle shifts of up to 20 pixels. See suppl. material for more details.
4.3. Method Comparison on Benchmarks
In this section, we compare our ERRNet against state-of-
the-art methods including the optimization-based method of
[25] (LB14) and the learning-based approaches (CEILNet
[5], Zhang et al. [47], and BDN [44]). For fair comparison,
we finetune these models on our training dataset and report
results of both the original pretrained model and finetuned
version (denoted with a suffix ’-F’).
The comparison is conducted on four real-world datasets, i.e., the 20 testing images in [47] and three sub-datasets
from SIR² [37]. These three sub-datasets are captured under different conditions: (1) 20 controlled indoor scenes
composed of solid objects; (2) 20 different controlled scenes on postcards; and (3) 55 wild scenes³ with ground truth
provided. In the following, we denote these datasets by 'Real20', 'Objects', 'Postcard', and 'Wild', respectively.
Table 3 summarizes the results of all competing meth-
ods on four real-world datasets. The quality metrics include
PSNR, SSIM [40], NCC [43, 37] and LMSE [8]. Larger
values of PSNR, SSIM, and NCC indicate better perfor-
mance, while a smaller value of LMSE implies a better re-
sult. Our ERRNet achieves the state-of-the-art performance
on the ‘Real20’ and ‘Objects’ datasets. Meanwhile, our result
is comparable to the best-performing BDN-F on ‘Postcard’
data. The quantitative results on the ‘Wild’ dataset reveal a
frustrating fact, namely, that no method could outperform
the naive baseline ’Input’, suggesting that there is still large
room for improvement.
Figure 5 displays visual results on real-world images. It
can be seen that all compared methods fail to handle some
strong reflections, but our network more accurately removes
many undesirable artifacts, e.g. removal of tree branches re-
flected on the building window in the fourth photo of Fig 5.
4.4. Training with Unaligned Data
To test our alignment-invariant loss on real-world un-
aligned data, we first collected a dataset of unaligned im-
age pairs with cameras and a portable glass, as shown in Fig. 1. Both a DSLR camera and a smartphone are used to
capture the images. We collected 450 image pairs in total,
and some samples are shown in Fig 6. These image pairs
are randomly split into a training set of 400 samples and a
testing set with 50 samples.
We conduct experiments on the BDN-F and ERRNet
models, each of which is first trained on the aligned dataset
(w/o unaligned) as in Section 4.3, and then finetuned with
our alignment-invariant loss and unaligned training data.
The resulting pairs before and after finetuning are assem-
bled for human assessment, as no existing numerical metric
is available for evaluating unaligned data.
We asked 30 human observers to provide a preference
³ Images indexed by 1, 2, 74 are removed due to misalignment.
Figure 5: Visual comparison on real-world images (columns: Input, LB14 [25], CEILNet-F [5], Zhang et al. [47],
BDN-F [44], ERRNet, Reference). The images are obtained from 'Real20' (Rows 1-3) and our collected unaligned dataset
(Rows 4-5). More results can be found in the suppl. material.
Table 3: Quantitative results of different methods on four real-world benchmark datasets. The best results are
indicated by red color and the second best results are denoted by blue color. The results of 'Average' are obtained by
averaging the metric scores of all images from these four real-world datasets.

Dataset    Index   Input    LB14 [25]   CEILNet [5]   CEILNet-F   Zhang et al. [47]   BDN [44]   BDN-F    ERRNet
Real20     PSNR    19.05    18.29       18.45         20.32       21.89               18.41      20.06    22.89
           SSIM    0.733    0.683       0.690         0.739       0.787               0.726      0.738    0.803
           NCC     0.812    0.789       0.813         0.834       0.903               0.792      0.825    0.877
           LMSE    0.027    0.033       0.031         0.028       0.022               0.032      0.027    0.022
Objects    PSNR    23.74    19.39       23.62         23.36       22.72               22.73      24.00    24.87
           SSIM    0.878    0.786       0.867         0.873       0.879               0.856      0.893    0.896
           NCC     0.981    0.971       0.972         0.974       0.964               0.978      0.978    0.982
           LMSE    0.004    0.007       0.005         0.005       0.005               0.005      0.004    0.003
Postcard   PSNR    21.30    14.88       21.24         19.17       16.85               20.71      22.19    22.04
           SSIM    0.878    0.795       0.834         0.793       0.799               0.859      0.881    0.876
           NCC     0.947    0.929       0.945         0.926       0.886               0.943      0.941    0.946
           LMSE    0.005    0.008       0.008         0.013       0.007               0.005      0.004    0.004
Wild       PSNR    26.24    19.05       22.36         22.05       21.56               22.36      22.74    24.25
           SSIM    0.897    0.755       0.821         0.844       0.836               0.830      0.872    0.853
           NCC     0.941    0.894       0.918         0.924       0.919               0.932      0.922    0.917
           LMSE    0.005    0.027       0.013         0.009       0.010               0.009      0.008    0.011
Average    PSNR    22.85    17.51       22.30         21.41       20.22               21.70      22.96    23.59
           SSIM    0.874    0.781       0.841         0.832       0.838               0.848      0.879    0.879
           NCC     0.955    0.937       0.948         0.943       0.925               0.951      0.950    0.956
           LMSE    0.006    0.011       0.009         0.010       0.007               0.007      0.006    0.005
score among $\{-2, -1, 0, 1, 2\}$, with 2 indicating the finetuned result is significantly better and -2 the opposite. To avoid
bias, we randomly switch the image positions of each pair.
In total, 3,000 human judgments are collected (2 methods, 30 users, 50 image pairs). More details regarding this eval-
uation process can be found in the suppl. material.
Figure 6: Image samples in our unaligned image dataset. Our dataset covers a large variety of indoor and outdoor
environments, including dynamic scenes with vehicles, humans, etc.
Score Range        BDN-F   ERRNet
(0.25, 2]          78%     54%
[-0.25, 0.25]      18%     36%
[-2, -0.25)        4%      10%
Average Score      0.62    0.51

Table 4: Human preference scores of the self-comparison experiments. Left: results of BDN-F; right: results of ERRNet.
The X axis of each sub-figure represents the image # of the testing images (50 in total).
Figure 7: Results of training with and without unaligned data (columns: input, reference, BDN-F w/o and with
unaligned data, ERRNet w/o and with unaligned data). See suppl. material for more examples. (Best viewed on screen
with zoom)
Table 4 shows the average of human preference scores
for the resulting pairs of each method. As can be seen, hu-
man observers clearly tend to prefer the results produced
by the finetuned models over the raw ones, which demon-
strates the benefit of leveraging unaligned data for training
independent of the network architecture. Figure 7 shows
some typical results of the two methods; the results are sig-
nificantly improved by training on unaligned data.
5. Conclusion
We have proposed an enhanced reflection removal net-
work together with an alignment-invariant loss function to
help resolve the difficulty of single image reflection re-
moval. We investigated the possibility of directly utilizing misaligned training data, which can significantly
alleviate the burden of capturing real-world training data. To efficiently extract the underlying knowledge from real
training data, we introduced context encoding modules, which can be seamlessly embedded into our network to help
discriminate and suppress the reflection component. Extensive experiments demonstrate that our approach sets a new
state-of-the-art on real-world benchmarks of single image reflection removal, both quantitatively and visually.
Acknowledgments
We thank Yunhao Zou for the great help in collecting the reflection image dataset. This work was supported by the Na-
tional Natural Science Foundation of China under Grants
No. 61425013 and No. 61672096.
References
[1] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Remov-
ing photography artifacts using gradient projection and flash-
exposure sampling. ACM Transactions on Graphics (TOG),
24(3):828–835, 2005.
[2] N. Arvanitopoulos, R. Achanta, and S. Susstrunk. Single im-
age reflection suppression. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), July 2017.
[3] Z. Chi, X. Wu, X. Shu, and J. Gu. Single image reflection
removal using deep encoder-decoder network. arXiv preprint
arXiv:1802.00094, 2018.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman. The pascal visual object classes (voc)
challenge. International Journal of Computer Vision (IJCV),
88(2):303–338, 2010.
[5] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic
deep architecture for single image reflection removal and im-
age smoothing. In The IEEE International Conference on
Computer Vision (ICCV), Oct 2017.
[6] H. Farid and E. H. Adelson. Separating reflections and light-
ing using independent components analysis. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
July 1999.
[7] K. Gai, Z. Shi, and C. Zhang. Blind separation of superim-
posed moving images using image statistics. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
34(1):19–32, 2012.
[8] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Free-
man. Ground truth dataset and baseline evaluations for in-
trinsic image algorithms. In IEEE International Conference
on Computer Vision (ICCV). IEEE, Oct 2009.
[9] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection
from multiple images. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2014.
[10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hy-
percolumns for object segmentation and fine-grained local-
ization. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling
in deep convolutional networks for visual recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 37(9):1904–1916, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation net-
works. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image
translation with conditional adversarial networks. In The
IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), July 2017.
[16] M. Jin, S. Süsstrunk, and P. Favaro. Learning to see through
reflections. In IEEE International Conference on Computa-
tional Photography (ICCP), May 2018.
[17] J. Johnson, A. Alahi, and L. Feifei. Perceptual losses for
real-time style transfer and super-resolution. European Con-
ference on Computer Vision (ECCV), pages 694–711, 2016.
[18] A. Jolicoeur-Martineau. The relativistic discriminator: a key
element missing from standard GAN. In International Con-
ference on Learning Representations (ICLR), 2019.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] N. Kong, Y.-W. Tai, and J. S. Shin. A physically-based
approach to reflection separation: from physical modeling
to constrained optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 36(2):209–221,
2014.
[21] D. Lee, M.-H. Yang, and S. Oh. Generative single image re-
flection separation. arXiv preprint arXiv:1801.04102, 2018.
[22] A. Levin and Y. Weiss. User assisted separation of reflections
from a single image using a sparsity prior. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (TPAMI),
29(9):1647–1654, 2007.
[23] A. Levin, A. Zomet, and Y. Weiss. Learning to perceive
transparency from the statistics of natural scenes. In Ad-
vances in Neural Information Processing Systems (NIPS).
December 2002.
[24] Y. Li and M. S. Brown. Exploiting reflection change for au-
tomatic reflection removal. In The IEEE International Con-
ference on Computer Vision (ICCV), December 2013.
[25] Y. Li and M. S. Brown. Single image layer separation us-
ing relative smoothness. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 2752–2759,
2014.
[26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced
deep residual networks for single image super-resolution.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, July 2017.
[27] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual
loss for image transformation with non-aligned data. In The
European Conference on Computer Vision (ECCV), Septem-
ber 2018.
[28] S. Nah, T. Hyun Kim, and K. Mu Lee. Deep multi-scale
convolutional neural network for dynamic scene deblurring.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge. In-
ternational Journal of Computer Vision (IJCV), 115(3):211–
252, 2015.
[30] B. Sarel and M. Irani. Separating transparent layers through
layer information exchange. In European Conference on
Computer Vision (ECCV), September 2004.
[31] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflec-
tion removal using ghosting cues. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2015.
[32] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. International
Conference on Learning Representations (ICLR), 2015.
[33] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[34] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and
R. Szeliski. Image-based rendering for scenes with reflec-
tions. ACM Transactions on Graphics (TOG), 31(4):100–1,
2012.
[35] R. Szeliski, S. Avidan, and P. Anandan. Layer extrac-
tion from multiple images containing reflections and trans-
parency. In IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), July 2000.
[36] R. Wan, B. Shi, L. Duan, A. Tan, W. Gao, and A. C. Kot.
Region-aware reflection removal with unified content and
gradient priors. IEEE Transactions on Image Processing,
27(6):2927–2941, 2018.
[37] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot.
Benchmarking single-image reflection removal algorithms.
In The IEEE International Conference on Computer Vision
(ICCV), Oct 2017.
[38] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. Crrn:
Multi-scale guided concurrent reflection removal network.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[39] R. Wan, B. Shi, T. A. Hwee, and A. C. Kot. Depth of field
guided reflection removal. In IEEE International Conference
on Image Processing, September 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004.
[41] Y. Wu and K. He. Group normalization. In European Con-
ference on Computer Vision (ECCV), September 2018.
[42] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0
gradient minimization. In ACM Transactions on Graphics
(TOG), volume 30, page 174, 2011.
[43] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A com-
putational approach for obstruction-free photography. ACM
Transactions on Graphics (TOG), 34(4):79, 2015.
[44] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and
bidirectionally: A deep learning approach for single image
reflection removal. In The European Conference on Com-
puter Vision (ECCV), September 2018.
[45] J. Yang, H. Li, Y. Dai, and R. T. Tan. Robust optical flow
estimation of double-layer images under transparency or re-
flection. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[46] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and
A. Agrawal. Context encoding for semantic segmentation.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[47] X. Zhang, R. Ng, and Q. Chen. Single image reflection
separation with perceptual losses. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2018.
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), July 2017.
[49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-
to-image translation using cycle-consistent adversarial net-
works. In The IEEE International Conference on Computer
Vision (ICCV), Oct 2017.