Reflection Detection in Image Sequences
Mohamed Abdelaziz Ahmed Francois Pitie Anil Kokaram
Sigmedia, Electronic and Electrical Engineering Department, Trinity College Dublin {www.sigmedia.tv/People}
Abstract
Reflections in image sequences consist of several layers superimposed over each other. This phenomenon causes many image processing techniques, e.g. motion estimation and object recognition, to fail, as they assume the presence of only one layer at each examined site. This work presents an automated technique for detecting reflections in image sequences by analyzing the motion trajectories of feature points. It models reflections as regions containing two different layers moving over each other. We present a strong detector built by combining a set of weak detectors. We use novel priors and generate sparse and dense detection maps. Our results show a high detection rate while rejecting pathological motion and occlusion.
1. Introduction
Reflections are often the result of superimposing different layers over each other (see Fig. 1, 2, 4, 5). They mainly occur when photographing objects situated behind a semi-reflective medium (e.g. a glass window). As a result, the captured image is a mixture of the reflecting surface (background layer) and the reflected image (foreground). When viewed from a moving camera, two different layers moving over each other in different directions are observed. This phenomenon violates many of the existing models for video sequences and hence causes many consumer video applications to fail, e.g. slow-motion effects and motion-based sports summarization. This calls for an automated technique that detects reflections and assigns a different treatment to them.
Detecting reflections requires analyzing data for specific reflection characteristics. However, as reflections can arise by mixing any two images, they come in many shapes and colors (Fig. 1, 2, 4, 5). This makes extracting characteristics specific to reflections a difficult task. Furthermore, motion information of reflections must be used with care, as there is a high probability of motion estimation failure. For these reasons, the problem of reflection detection is hard and has not been examined before.
Reflections can be detected by examining the possibility of decomposing an image into two different layers. A large body of work exists on separating mixtures of semi-transparent layers [17,11,12,7,4,1,13,3,2]. Nevertheless, most of the still image techniques [11,4,1,3,2] require two mixtures of the same layers under two different mixing conditions, while video techniques [17,12,13] assume either a simple rigid motion for the background [17,13] or a repetitive one [12]. These assumptions rarely hold for reflections in moving image sequences.
This paper presents an automated technique for detecting reflections in image sequences. It is based on analyzing spatio-temporal profiles of feature point trajectories. This work focuses on analyzing three main features of reflections: 1) the ability to decompose an image into two independent layers, 2) image sharpness, and 3) the temporal behavior of image patches. Several weak detectors based on analyzing these features through different measures are proposed. A final strong detector is generated by combining the weak detectors. The problem is formulated within a Bayesian framework, and priors are defined so as to reject false alarms. Several sequences are processed, and the results show a high detection rate while rejecting complicated motion patterns, e.g. blur, occlusion and fast motion.
Aspects of novelty in this paper include: 1) A technique for decomposing a color still image containing reflection into two images containing the structures of the source layers. We do not claim that this technique could be used to fully remove reflections from videos. What we claim is that the extracted layers can be useful for reflection detection since, on a block basis, reflection is reduced. This technique cannot compete with state-of-the-art separation techniques. However, we use it because it works on single frames and thus does not require motion, which is not the case with any existing separation technique. 2) Diagnostic tools for reflection detection based on analyzing feature point trajectories. 3) A scheme for combining weak detectors into one strong reflection detector using Adaboost. 4) Priors which reject spatially and temporally impulsive detections. 5) The generation of dense detection maps from sparse detections, using thresholding by hysteresis to avoid selecting particular thresholds for the system parameters. 6) Using the generated maps to perform better frame rate conversion in regions of reflection. Frame rate conversion is a computer vision application that is widely used in the post-production industry.
In the next section we present a review of the relevant layer separation techniques. In Section 3 we propose our layer separation technique. We then present our Bayesian framework, followed by the results section.
Figure 1. Examples of different reflections (shown in green). Reflection is the result of superimposing different layers over each other. As a result, reflections have a wide range of colors and shapes.
2. Review on Layer Separation Techniques
A mixed image $M$ is modeled as a linear combination of the source layers $L_1$ and $L_2$ according to the mixing parameters $(a, b)$ as follows:
$$M = aL_1 + bL_2 \tag{1}$$
Layer separation techniques attempt to decompose the reflection $M$ into two independent layers. They do so by exchanging information between the source layers ($L_1$ and $L_2$) until their mutual independence is maximized. This, however, requires the presence of two mixtures of the same layers under two different mixing proportions [11,4,1,3,2]. Different separation techniques use different measures of mutual layer independence. Measures currently in use include minimizing the number of corners in the separated layers [7] and minimizing the grayscale correlation between the layers [11].
Other techniques [17,12,13] avoid the requirement of having two mixtures of the same layers by using temporal information. However, they often require a static background throughout the whole image sequence [17], constrain both layers to be of non-varying content through time [13], or require the presence of repetitive dynamic motion in one of the layers [12]. Weiss [17] developed a technique which estimates the intrinsic image (static background) of an image sequence. Gradients of the intrinsic layer are calculated by temporally filtering the gradient field of the sequence. Filtering is performed in the horizontal and vertical directions, and the resulting gradients are used to reconstruct the rest of the background image.
3. Layer Separation Using Color Independence
The source layers of a reflection $M$ are usually color independent. We noticed that the red and blue channels of $M$ are the two most uncorrelated RGB channels. Each of these channels is usually dominated by one layer. Hence the source layers $(L_1, L_2)$ can be estimated by exchanging information between the red and blue channels until the mutual independence between both channels is maximized. Information exchange for layer separation was first introduced by Sarel et al. [12] and is reformulated for our problem as follows:
$$L_1 = M_R - \alpha M_B, \qquad L_2 = M_B - \beta M_R \tag{2}$$
Here $(M_R, M_B)$ are the red and blue channels of the mixture $M$, while $(\alpha, \beta)$ are separation parameters to be calculated. An exhaustive search for $(\alpha, \beta)$ is performed. Motivated by the work of Levin et al. on layer separation [7], the best separated layer is selected as the one with the lowest cornerness value. The Harris cornerness operator is used here. A minimum texture is imposed on the separated layers by discarding layers with a variance less than $T_x$. For an 8-bit image, $T_x$ is set to 2. Without this constraint, the search can generate empty, meaningless layers. The novelty of this layer separation technique is that, unlike previous techniques [11,4,1,3,2], it requires only one image.
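The search described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the search range for the separation parameter (0 to 1 in 21 steps), the 3×3 Harris window, the Harris constant `k = 0.04`, and the use of the sum of positive Harris responses as the "cornerness value" are all assumptions, and the function names are hypothetical. Images are plain lists of rows.

```python
def harris_cornerness(img, k=0.04):
    """Sum of positive Harris responses over the interior of a 2-D list image."""
    h, w = len(img), len(img[0])
    # Central-difference gradients (left at zero on the border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Structure tensor accumulated over a 3x3 window.
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            response = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
            total += max(response, 0.0)
    return total


def variance(img):
    """Variance of all pixel values in a 2-D list image."""
    vals = [v for row in img for v in row]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def separate_layer(m_a, m_b, t_x=2.0, steps=21):
    """Search alpha in [0, 1] for the layer m_a - alpha * m_b with the lowest
    cornerness, skipping layers whose variance is below t_x.
    Returns (cornerness, alpha, layer), or None if every candidate is rejected."""
    best = None
    for s in range(steps):
        alpha = s / (steps - 1)
        layer = [[a - alpha * b for a, b in zip(ra, rb)]
                 for ra, rb in zip(m_a, m_b)]
        if variance(layer) < t_x:
            continue  # minimum-texture constraint
        c = harris_cornerness(layer)
        if best is None or c < best[0]:
            best = (c, alpha, layer)
    return best
```

Under these assumptions, `separate_layer(m_r, m_b)` plays the role of $L_1$ in Eq. 2 and `separate_layer(m_b, m_r)` the role of $L_2$, where `m_r` and `m_b` are the red and blue channels of one block.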
Fig. 2 shows separation results generated by the proposed technique for different images. The results show that our technique reduces reflections and shadows. They are displayed only to illustrate a preprocessing step that is used for one of our reflection measures, not to illustrate full reflection removal. Blocky artifacts are due to processing images in 50×50 blocks. These artifacts are irrelevant to reflection detection.
4. Bayesian Inference for Reflection Detection (BIRD)
The goal of the algorithm is to find regions in image sequences containing reflections. This is achieved by analyzing trajectories of feature points. Trajectories are generated using the KLT feature point tracker [9,14]. Denote $P^i_n$ as the feature point of the $i$th track in frame $n$ and $F^i_n$ as the 50×50 image patch centered on $P^i_n$. Trajectories are analyzed by examining all feature points along tracks longer than 4 frames. For each point, the analysis is carried out over the three image patches $(F^i_{n-1}, F^i_n, F^i_{n+1})$. Based on the outcome of the analysis, a binary label $l^i_n$ is assigned to each $F^i_n$: $l^i_n$ is set to 1 for reflection and 0 otherwise.
Figure 2. Reducing reflections/shadows using the proposed layer separation technique, panels (a) to (f). Color images are the original images with reflections/shadows (shown in green). The uncolored images represent one source layer (calculated by our technique) with reflections/shadows reduced. In (e) the reflection remains apparent; however, the person in the car is fully removed.
4.1. Bayesian Framework
The system derives an estimate for $l^i_n$ from the posterior $P(l|F)$ (where $(i, n)$ are dropped for clarity). The posterior is factorized in a Bayesian fashion as follows:
$$P(l|F) = P(F|l)\,P(l|l_N) \tag{3}$$
The likelihood term $P(F|l)$ consists of 9 detectors $D_1$ to $D_9$, each performing a different analysis on $F$ and operating at thresholds $T_1$ to $T_9$ (see Sec. 4.5.1). The prior $P(l|l_N)$ enforces various smoothness constraints in space and time to reject spatially and temporally impulsive detections and to generate dense detection masks. Here $N$ denotes the spatio-temporal neighborhood of the examined site.
4.2. Layer Separation Likelihood
This likelihood measures the ability to decompose an image patch $F^i_n$ into two independent layers. Three detectors are proposed. Two of them attempt to perform layer separation before analyzing the data, while the third measures the possibility of layer separation by measuring the independence of the color channels.
Layer Separation via Color Independence $D_1$: Our technique (presented in Sec. 3) is used to decompose the image patch $F^i_n$ into two layers $L1^i_n$ and $L2^i_n$. This is applied to every point along every track. Reflection is detected by comparing the temporal behavior of the observed image patches $F$ with the temporal behavior of the extracted layers. Patches containing reflection are defined as ones with higher temporal discontinuity before separation than after separation. Temporal discontinuity is measured using the structural similarity index SSIM [16] as follows:
$$D1^i_n = \max\big(SS(G^i_n, G^i_{n-1}),\ SS(G^i_n, G^i_{n+1})\big) - \max\big(SS(L^i_n, L^i_{n-1}),\ SS(L^i_n, L^i_{n+1})\big)$$
$$SS(L^i_n, L^i_{n-1}) = \max\big(SS(L1^i_n, L1^i_{n-1}),\ SS(L2^i_n, L2^i_{n-1})\big)$$
$$SS(L^i_n, L^i_{n+1}) = \max\big(SS(L1^i_n, L1^i_{n+1}),\ SS(L2^i_n, L2^i_{n+1})\big)$$
Here $G = 0.1F_R + 0.7F_G + 0.2F_B$, where $(F_R, F_G, F_B)$ are the red, green and blue components of $F$ respectively. $SS(G^i_n, G^i_{n-1})$ denotes the structural similarity between the two images $F^i_n$ and $F^i_{n-1}$. We compare only the structures of $(G^i_n, G^i_{n-1})$ by turning off the luminance component of SSIM [16]. $SS(\cdot,\cdot)$ returns a value between 0 and 1, where 1 denotes identical images. Reflection is detected if $D1^i_n$ is less than $T_1$.
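As a rough illustration of the $D_1$ measure, the sketch below computes a structure-only similarity and the difference of the two maxima. It is a simplification under stated assumptions: the similarity is the contrast-structure term of SSIM evaluated once over the whole patch (SSIM proper uses local windows), the constant `c2 = (0.03 * 255) ** 2` is the usual SSIM default for 8-bit data, and the function names are hypothetical. Unlike the 0 to 1 range quoted above, this global version can become negative for anti-correlated patches.

```python
def structure_similarity(x, y, c2=(0.03 * 255) ** 2):
    """Contrast-structure term of SSIM on whole patches (luminance term dropped)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    return (2 * cov + c2) / (vx + vy + c2)


def d1(g_prev, g_cur, g_next, l1, l2):
    """D1 for one feature point.
    g_*: grayscale patches at frames n-1, n, n+1.
    l1, l2: (prev, cur, next) triples of the two separated layers."""
    ss = structure_similarity
    before = max(ss(g_cur, g_prev), ss(g_cur, g_next))
    after_prev = max(ss(l1[1], l1[0]), ss(l2[1], l2[0]))
    after_next = max(ss(l1[1], l1[2]), ss(l2[1], l2[2]))
    return before - max(after_prev, after_next)
```

A patch whose separated layers are temporally stable while the observed mixture is not yields a negative value, which is the case flagged by $D1^i_n < T_1$.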
Intrinsic Layer Extraction $D_2$: Let $INTR^i$ denote the intrinsic (reflectance) image extracted by processing the 50×50 patches of the $i$th track using the technique of Weiss [17]. In the case of reflection, the structural similarity between the observed mixture $F^i_n$ and $INTR^i$ should be low. Therefore, $F^i_n$ is flagged as containing reflection if $SS(F^i_n, INTR^i)$ is less than $T_2$.
Color Channels Independence $D_3$: This approach measures the Generalized Normalized Cross Correlation (GNGC) [11] between the red and blue channels of the examined patch $F^i_n$ to infer whether or not the patch is a mixture of two different layers. GNGC takes values between 0 and 1, where 1 denotes a perfect match between the red and blue channels ($M_R$ and $M_B$ respectively). This analysis is applied to every image patch $F^i_n$, and reflection is detected if $GNGC(M_R, M_B) < T_3$.
4.3. Image Sharpness Likelihood: $D_4$, $D_5$
Two approaches for analyzing image sharpness are used. The first, $D_4$, estimates the first-order derivatives of the examined patch $F^i_n$ and flags it as containing reflection if the mean gradient magnitude within the patch is smaller than a threshold $T_4$. The second approach, $D_5$, uses the sharpness metric of Ferzli and Karam [5] and flags a patch as reflection if its sharpness value is less than $T_5$.
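The $D_4$ test can be sketched in a few lines. The forward-difference derivative is an assumption (the text only says "first order derivatives"), the function names are hypothetical, and the default threshold of 7 is the operating threshold listed for $D_4$ in Table 1.

```python
import math


def mean_gradient_magnitude(patch):
    """Mean gradient magnitude of a 2-D list patch using forward differences."""
    h, w = len(patch), len(patch[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = patch[y][x + 1] - patch[y][x]
            gy = patch[y + 1][x] - patch[y][x]
            total += math.hypot(gx, gy)
    return total / ((h - 1) * (w - 1))


def d4_flags_reflection(patch, t4=7.0):
    """True when the patch is soft enough to be flagged as reflection."""
    return mean_gradient_magnitude(patch) < t4
```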
4.4. Temporal Discontinuity Likelihood
SIFT Temporal Profile $D_6$: This detector flags the examined patch $F^i_n$ as reflection if its SIFT features [8] undergo high temporal mismatch. A vector $\mathbf{p} = [\mathbf{x}\ \mathbf{s}\ \mathbf{g}]$ is assigned to every interest point in $F^i_n$. The vector contains the position of the point $\mathbf{x} = (x, y)$, the scale and dominant orientation from the SIFT descriptor, $\mathbf{s} = (\delta, o)$, and the 128-element SIFT descriptor $\mathbf{g}$. Interest points are matched with neighboring frames using [8]. $F^i_n$ is flagged as reflection if the average distance between the matched vectors $\mathbf{p}$ is larger than $T_6$.
Color Temporal Profile $D_7$: This detector flags the image patch $F^i_n$ as reflection if its grayscale profile does not change smoothly through time. The temporal change in color is defined as follows:
$$D7^i_n = \min\big(\|C^i_n - C^i_{n-1}\|,\ \|C^i_n - C^i_{n+1}\|\big) \tag{4}$$
Here $C^i_n$ is the mean value of $G^i_n$, the grayscale representation of $F^i_n$. $F^i_n$ is flagged as reflection if $D7^i_n > T_7$.
AutoCorrelation Temporal Profile $D_8$: This detector flags the image patch $F^i_n$ as reflection if its autocorrelation undergoes a large temporal change. The temporal change in the autocorrelation is defined as follows:
$$D8^i_n = \sqrt{\min\Big(\tfrac{1}{N}\|A^i_n - A^i_{n-1}\|^2,\ \tfrac{1}{N}\|A^i_n - A^i_{n+1}\|^2\Big)} \tag{5}$$
$A^i_n$ is a vector containing the autocorrelation of $G^i_n$, while $N$ is the number of pels in $A^i_n$. $F^i_n$ is flagged as reflection if $D8^i_n$ is larger than $T_8$.
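Eq. 4 and Eq. 5 can be illustrated with the sketch below. The text does not specify how the autocorrelation is computed, so the mean removal, the normalization by the patch area and the lag range (`max_lag = 2`) are assumptions, and the function names are hypothetical.

```python
import math


def mean_gray(patch):
    """Mean value of a 2-D list grayscale patch (C in Eq. 4)."""
    vals = [v for row in patch for v in row]
    return sum(vals) / len(vals)


def d7(g_prev, g_cur, g_next):
    """Eq. 4: smallest change of the mean gray level towards either neighbor frame."""
    c_prev, c, c_next = mean_gray(g_prev), mean_gray(g_cur), mean_gray(g_next)
    return min(abs(c - c_prev), abs(c - c_next))


def autocorrelation(patch, max_lag=2):
    """Autocorrelation of the mean-removed patch for all lags in
    [-max_lag, max_lag] x [-max_lag, max_lag], flattened into one vector."""
    h, w = len(patch), len(patch[0])
    m = mean_gray(patch)
    z = [[v - m for v in row] for row in patch]
    out = []
    for dy in range(-max_lag, max_lag + 1):
        for dx in range(-max_lag, max_lag + 1):
            acc = 0.0
            for y in range(max(0, -dy), min(h, h - dy)):
                for x in range(max(0, -dx), min(w, w - dx)):
                    acc += z[y][x] * z[y + dy][x + dx]
            out.append(acc / (h * w))
    return out


def d8(g_prev, g_cur, g_next, max_lag=2):
    """Eq. 5: root of the smaller mean squared autocorrelation difference."""
    a_prev, a, a_next = (autocorrelation(g, max_lag)
                         for g in (g_prev, g_cur, g_next))
    n = len(a)

    def msd(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) / n

    return math.sqrt(min(msd(a, a_prev), msd(a, a_next)))
```

Because both measures take the minimum over the two neighboring frames, a patch is flagged only when it differs from the previous and the next frame alike.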
Motion Field Divergence $D_9$: $D_9$ for the examined patch $F^i_n$ is defined as follows:
$$D9^i_n = DFD \cdot \big(\|\mathrm{div}(d(n))\| + \|\mathrm{div}(d(n+1))\|\big)/2 \tag{6}$$
$DFD$ and $\mathrm{div}(d(n))$ are the Displaced Frame Difference and the Motion Field Divergence for $F^i_n$. $d(n)$ is the 2D motion vector calculated using block matching. $DFD$ is set to the minimum of the forward and backward DFDs. $\mathrm{div}(d(n))$ is set to the minimum of the forward and backward divergences. The divergence is averaged over blocks of two frames to reduce the effect of possible motion blur generated by unsteady camera motion. $F^i_n$ is flagged as reflection if $D_9 > T_9$.
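A minimal sketch of Eq. 6 follows, assuming the motion fields and the DFD value have already been computed elsewhere (block matching is not shown). Reading $\|\mathrm{div}(\cdot)\|$ as the mean absolute divergence over the patch, and estimating the divergence with forward differences, are assumptions; the function names are hypothetical.

```python
def mean_abs_divergence(field):
    """field[y][x] = (u, v) motion vector of block (x, y).
    Returns the mean of |du/dx + dv/dy| using forward differences.
    The field must be at least 2x2."""
    h, w = len(field), len(field[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            du_dx = field[y][x + 1][0] - field[y][x][0]
            dv_dy = field[y + 1][x][1] - field[y][x][1]
            total += abs(du_dx + dv_dy)
            count += 1
    return total / count


def d9(dfd, field_n, field_n1):
    """Eq. 6: DFD weighted by the divergence averaged over two frames."""
    return dfd * (mean_abs_divergence(field_n) + mean_abs_divergence(field_n1)) / 2.0
```

A uniform translation has zero divergence and therefore gives $D_9 = 0$ regardless of the DFD, so only regions with both a high frame difference and an incoherent motion field score highly.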
4.5. Solving for $l^i_n$
4.5.1 Maximum Likelihood (ML) Solution
The likelihood is factorized as follows:
$$P(F|l) = P(l|D_1)\,P(l|D_{2-8})\,P(l|D_9) \tag{7}$$
The first and last terms are solved using $D_1 < T_1$ and $D_9 > T_9$ respectively. $D_{2-8}$ are used to form one strong detector $D_s$, and $P(l|D_{2-8})$ is solved by $D_s > T_s$. We found that excluding $(D_1, D_9)$ from $D_s$ generates better detection results than including them. The feature analysis of each detector is averaged over a block of three frames to generate temporally consistent detections. $T_9$ is fixed to 10 in all experiments. In Sec. 4.5.2 we avoid selecting particular thresholds for $(T_1, T_s)$ by imposing spatial and temporal priors on the generated maps.
Calculating $D_s$: The strong detector $D_s$ is expressed as a linear combination of weak detectors operating at different thresholds $T$ as follows:
$$P(l|D_{2-8}) = \sum_{k=1}^{M} W(V(k), T)\, P(D_{V(k)}|T) \tag{8}$$
Figure 3. ROC (Correct Detection Rate versus False Alarm Rate, the latter on a logarithmic axis from 10^-2 to 10^0) for $D_1$ to $D_9$ and the Adaboost detector $D_s$. $D_s$ outperforms all other techniques, and $D_1$ is the second best in the range of false alarms < 0.1.
Here $M$ is the number of weak detectors (fixed to 20) used in forming $D_s$, and $V(k)$ is a function which returns a value between 2 and 8 to indicate which of the detectors $D_{2-8}$ is used. $k$ indexes the weak detectors in order of their importance as defined by the weights $W$. $W$ and $T$ are learned through Adaboost [15] (see Tab. 1). Our training set consists of 89393 images of size 50×50 pels. Reflection is modeled in 35966 images, each being a synthetic mixture of two different images.
Fig. 3 shows the Receiver Operating Characteristic (ROC) of applying $D_1$ to $D_9$ and $D_s$ to the training samples. $D_s$ outperforms all the other detectors due to its higher correct detection rate and lower false alarm rate.
     D6     D8       D5     D3     D2     D4     D7
W    1.31   0.96     0.48   0.52   0.33   0.32   0.26
T    0.29   6.76e-6  0.04   0.95   0.61   7      2.17
Table 1. Weights $W$ and operating thresholds $T$ for the best seven detectors selected by Adaboost.
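The combination in Eq. 7 and Eq. 8 can be sketched with the Table 1 values. This is an illustration only: the paper uses $M = 20$ weak detectors while Table 1 lists the best seven, so the score scale here differs from the one behind the reported $T_s$ values. Treating each $P(D_{V(k)}|T)$ as the binary output of a thresholded detector, and the comparison direction of each detector (taken from Sec. 4.2 to 4.4), are the assumptions; the names `WEAK`, `strong_score` and `ml_label` are hypothetical.

```python
# (weight W, threshold T, direction) per weak detector, from Table 1.
# "<" means the detector fires when its measure is below T, ">" when above.
WEAK = {
    "D6": (1.31, 0.29, ">"),
    "D8": (0.96, 6.76e-6, ">"),
    "D5": (0.48, 0.04, "<"),
    "D3": (0.52, 0.95, "<"),
    "D2": (0.33, 0.61, "<"),
    "D4": (0.32, 7.0, "<"),
    "D7": (0.26, 2.17, ">"),
}


def strong_score(measures):
    """Weighted vote of the weak detectors; measures maps names to raw values."""
    score = 0.0
    for name, (w, t, direction) in WEAK.items():
        fired = measures[name] < t if direction == "<" else measures[name] > t
        score += w * fired
    return score


def ml_label(measures, d1, d9, t1, ts, t9=10.0):
    """Eq. 7 as a conjunction: D1 < T1, Ds > Ts and D9 > T9 must all hold."""
    return int(d1 < t1 and strong_score(measures) > ts and d9 > t9)
```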
4.5.2 Successive Refinement for Maximum A-Posteriori (MAP)
The prior $P(l|l_N)$ of Eq. 3 imposes spatial and temporal smoothness on the detection masks. We create a MAP estimate by refining the sparse maps from the previous ML step. We first refine the labeling of all the existing feature points $P$ in each image and then use the overlapping 50×50 patches around the refined labeled points as a dense pixel map.
ML Refinement: First we reject false detections from ML which are spatially inconsistent. Every feature point with $l = 1$ is considered, and the sum of the geodesic distances from that site to the two closest neighbors labeled $l = 1$ is measured. When that distance is more than 0.005, the decision is rejected, i.e. we set $l = 0$. Geodesic distances allow the nature of the image material between points to be taken into account more effectively and have been in use for some time now [10]. To reduce the computational load of this step, we downsample the image by a factor of 50 in both directions. This retains only the gross image topology.
Spatio-Temporal Dilation: Labels are extended in space and time to other feature points along their trajectories. If $l^i_n = 1$, all feature points lying along track $i$ are set to $l = 1$. In addition, $l$ is extended to all image patches ($F_n$) overlapping spatially with the examined patch. This generates a denser representation of the detection masks. We call this step ML-Denser.
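The dilation step can be sketched as below. The data layout (dictionaries keyed by track index and frame number) and the function name are hypothetical, and the overlap test assumes axis-aligned 50×50 patches centered on the feature points; the spatial extension is done in a single pass.

```python
def dilate(labels, positions, patch=50):
    """labels[(i, n)] in {0, 1}; positions[(i, n)] = (x, y) of point i in frame n.
    Returns a new label dictionary after temporal and spatial dilation."""
    out = dict(labels)
    # Temporal: a detection anywhere on track i labels the whole track.
    hit_tracks = {i for (i, n), l in labels.items() if l == 1}
    for (i, n) in labels:
        if i in hit_tracks:
            out[(i, n)] = 1
    # Spatial: label patches that overlap a detected patch in the same frame.
    detected = [key for key, l in out.items() if l == 1]
    for (i, n) in labels:
        if out[(i, n)] == 1:
            continue
        x, y = positions[(i, n)]
        for (j, m) in detected:
            if m != n:
                continue
            xj, yj = positions[(j, m)]
            if abs(x - xj) < patch and abs(y - yj) < patch:
                out[(i, n)] = 1
                break
    return out
```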
Hysteresis: We can avoid selecting particular thresholds $[T_1, T_s]$ for BIRD by applying hysteresis over a set of different thresholds. Let $T_H = [0.4, 5]$ and $T_L = [0, 3]$ denote a high and a low configuration for $[T_1, T_s]$. Detection starts by examining ML-Denser at the high thresholds. High thresholds generate detected points $P_h$ with high confidence. Points within a small geodesic distance ($< D_{geo}$) and a small Euclidean distance ($< D_{euc}$) of each other are grouped together. Here we use $(D_{geo}, D_{euc}) = (0.0025, 4)$ and resize the examined frames as mentioned previously. The centroid of each group is then calculated. The thresholds are lowered, and a new detection point is added to an existing group if it is within $D_{geo}$ and $D_{euc}$ of the centroid of this group. This is the hysteresis idea. If, however, the examined point has a large Euclidean distance ($> D_{euc}$) but a small geodesic distance ($< D_{geo}$) to the centroids of all existing groups, a new group is formed. Points for which both distances exceed the limits ($> D_{geo}$ and $> D_{euc}$) are regarded as outliers and discarded. Group centroids are updated, and the whole process is repeated iteratively until the examined threshold reaches $T_L$. The detection map generated at $T_L$ is made denser by performing the Spatio-Temporal Dilation described above.
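The rule applied to each new point once the thresholds have been lowered can be sketched as follows. This covers only the assignment step, not the seeding at $T_H$ or the threshold schedule. The geodesic distance is passed in as a callable because its computation [10] is outside the scope of this sketch, and reading "all existing groups" literally (the point must be geodesically close to every centroid to start a new group) is an assumption; the function names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def assign_point(p, groups, geo, d_geo=0.0025, d_euc=4.0):
    """Add p to a group, start a new group, or discard it.
    groups: list of lists of (x, y) points, modified in place.
    geo: callable returning the geodesic distance between two positions.
    Returns the action taken as a string."""
    centroids = [centroid(g) for g in groups]
    # Join the first group whose centroid is close in both distances.
    for g, c in zip(groups, centroids):
        if geo(p, c) < d_geo and math.dist(p, c) < d_euc:
            g.append(p)
            return "joined"
    # Far in the Euclidean sense but geodesically close to all centroids.
    if all(geo(p, c) < d_geo for c in centroids):
        groups.append([p])
        return "new group"
    return "discarded"
```

Centroids are recomputed on every call, which mirrors the "group centroids are updated" step in the text.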
Spatio-Temporal ‘Opening’: False alarms from the previous step are removed by propagating the patches detected in the first frame to the rest of the sequence along the feature point trajectories. A detection sample at frame $n$ is kept if it agrees with the detections propagated from the previous frame. Correct detections missed by this step are recovered by running Spatio-Temporal Dilation on the ‘temporally eroded’ solution. This means that trajectories which do not start in the first frame are unlikely to be considered; however, this does not affect the performance on the real examples shown here. The selection of an optimal frame from which to perform this opening operation is the subject of future work.
Figure 4. From Top: ML (calculated at (T1,Ts) = (0.13,3.15)), Hysteresis and Spatio-Temporal ‘Opening’ for three consecutive frames
from the SelimH sequence. Reflection is shown in red and detected reflection using our technique is shown in green. Spatio-Temporal
‘Opening’ rejects false alarms generated by ML and by Hysteresis (shown in yellow and blue respectively).
5. Results
5.1. Reflection Detection
Fifteen sequences containing 932 frames of size 576×720 are processed with BIRD. Full sequences with reflection detection can be found at www.sigmedia.tv/Misc/CVPR2011. Fig. 4 compares ML, Hysteresis and Spatio-Temporal ‘Opening’ for three consecutive frames from the SelimH sequence. This sequence contains occlusion, motion blur and strong edges in the reflection (shown in red). The ML solution (first line) generates good sparse reflection detections (shown in green); however, it also generates some errors (shown in yellow). Hysteresis rejects these errors and generates dense masks with some false alarms (shown in blue). These false alarms are rejected by Spatio-Temporal ‘Opening’.
Fig. 5 shows the result of processing four sequences using BIRD. In the first two sequences, BIRD detected regions of reflection correctly and discarded regions of occlusion (shown in purple) and motion blur (shown in blue). In GirlRef most of the sequence is correctly classified as reflection. In SelimK1 the portrait on the right is correctly classified as containing reflection even in the presence of motion blur (shown in blue). Nevertheless, BIRD failed to detect the reflection on the left portrait as it does not contain strong distinctive feature points.
Fig. 6 shows the ROC plot for 50 frames from SelimH. Here we compare our technique BIRD against DFD and Image Sharpness [5]. DFD flags a region as reflection if it has a high displaced frame difference. Image Sharpness flags a region as reflection if it has low sharpness. Frames are processed in 50×50 blocks. Ground truth reflection masks are generated manually, and detection rates are calculated on a pel basis. The ROC shows that BIRD outperforms the other techniques, achieving a correct detection rate of 0.9 at a false detection rate of 0.1. This is a major improvement over the correct detection rates of 0.2 and 0.1 for DFD and Sharpness respectively.
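One point of such an ROC curve can be computed from a detection mask and a ground truth mask as sketched below. The function name and the list-of-rows mask layout are hypothetical; sweeping the detector threshold and collecting the resulting pairs would trace the full curve.

```python
def detection_rates(detected, truth):
    """Pel-wise (correct detection rate, false alarm rate) for two binary masks
    given as lists of rows with truthy entries marking reflection."""
    tp = fp = pos = neg = 0
    for drow, trow in zip(detected, truth):
        for d, t in zip(drow, trow):
            if t:
                pos += 1
                tp += bool(d)
            else:
                neg += 1
                fp += bool(d)
    correct = tp / pos if pos else 0.0
    false_alarm = fp / neg if neg else 0.0
    return correct, false_alarm
```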
5.2. Frame Rate Conversion: An application
One application of reflection detection is improving frame rate conversion in regions of reflection. Frame rate conversion is the process of creating new frames from existing ones. This is done by using motion vectors to interpolate objects in the new frames. This process usually fails in regions of reflection due to motion estimation failure.
Fig. 7 illustrates the generation of a slow motion effect for the person’s leg in GirlRef (see Fig. 5, third line). This is done by doubling the frame rate using the Foundry’s Kronos plugin [6]. Kronos has an input which defines the density of the motion vector field. The larger the density, the more detailed the vector field and hence the better the interpolation. However, using highly detailed vectors generates artifacts in regions of reflection, as shown in Fig. 7 (second line). We reduce these artifacts by lowering the motion vector density in regions of reflection indicated by BIRD (see Fig. 7, third line). Image sequence results and more examples are available at www.sigmedia.tv/Misc/CVPR2011.
Figure 5. Detection results of BIRD (shown in green) on, from top: BuilOnWind [10,35,49], PHouse 9-11, GirlRef [45,55,65], SelimK1 32-35. Reflections are shown in red. Good detections are generated despite occlusion (shown in purple) and motion blur (shown in blue). For GirlRef we replace Hysteresis and Spatio-Temporal ‘Opening’ with a manual parameter configuration of $(T_1, T_s) = (0.01, 3.15)$ followed by a Spatio-Temporal Dilation step. This setting generates good detections for all examined sequences with static backgrounds.
6. Conclusion
This paper has presented a technique for detecting reflections in image sequences. This problem has not been addressed before. Our technique performs several analyses on feature point trajectories and generates a strong detector by combining them. The results show a major improvement over techniques which measure image sharpness and temporal discontinuity. Our technique achieves a high correct detection rate while rejecting regions containing complicated motion, e.g. motion blur and occlusion. The technique was fully automated in generating most results. As an application, we showed how the generated detections can be used to improve frame rate conversion. A limiting factor of our technique is that it requires source layers with strong distinctive feature points. This could lead to incomplete detections.
Acknowledgment: This work is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET) and Science Foundation Ireland (SFI).
Figure 6. ROC plots for our technique BIRD, DFD and Sharpness for SelimH. Our technique BIRD outperforms DFD and Sharpness with a massive increase in the Correct Detection Rate.
Figure 7. Slow motion effect for the person’s leg of GirlRef (see Fig. 5, third line). Top: original frames 59-61; Middle: frames generated using the Foundry’s plugin Kronos [6] with one motion vector calculated for every 4 pels; Bottom: with one motion vector calculated for every 64 pels in regions of reflection.
References
[1] A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi. Sparse ICA for blind separation of transmitted and reflected images. International Journal of Imaging Systems and Technology, 15(1):84–91, 2005.
[2] N. Chen and P. De Leon. Blind image separation through kurtosis maximization. In Asilomar Conference on Signals, Systems and Computers, volume 1, pages 318–322, 2001.
[3] K. Diamantaras and T. Papadimitriou. Blind separation of reflections using the image mixtures ratio. In ICIP, pages 1034–1037, 2005.
[4] H. Farid and E. Adelson. Separating reflections from images by use of independent components analysis. Journal of the Optical Society of America, 16(9):2136–2145, 1999.
[5] R. Ferzli and L. J. Karam. A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB). IEEE Trans. on Img. Proc. (TIPS), 18(4):717–728, 2009.
[6] The Foundry. Nuke, Furnace suite. www.thefoundry.co.uk.
[7] A. Levin, A. Zomet, and Y. Weiss. Separating reflections from a single image using local features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 306–313, 2004.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
[9] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop, pages 121–130, 1981.
[10] D. Ring and F. Pitie. Feature-assisted sparse to dense motion estimation using geodesic distances. In International Machine Vision and Image Processing Conference, pages 7–12, 2009.
[11] B. Sarel and M. Irani. Separating transparent layers through layer information exchange. In European Conference on Computer Vision (ECCV), pages 328–341, 2004.
[12] B. Sarel and M. Irani. Separating transparent layers of repetitive dynamic behaviors. In ICCV, pages 26–32, 2005.
[13] R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In CVPR, volume 1, pages 246–253, 2000.
[14] C. Tomasi and T. Kanade. Detection and tracking of point features. Carnegie Mellon University Technical Report CMU-CS-91-132, 1991.
[15] P. Viola and M. Jones. Robust real-time object detection. In International Journal of Computer Vision, 2001.
[16] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIPS, 13(4):600–612, April 2004.
[17] Y. Weiss. Deriving intrinsic images from image sequences. In ICCV, pages 68–75, 2001.
Article
Computer vision techniques such as Structure-from-Motion (SfM) and object recognition tend to fail on scenes with highly reflective objects because the reflections behave differently from the true geometry of the scene. Such image sequences may be treated as two layers superimposed over each other: the non-reflection scene source layer and the reflection layer. However, decomposing the two layers is a very challenging task as it is ill-posed, and common methods rely on prior information. This work presents an automated technique for detecting reflective features with a comprehensive analysis of the intrinsic, spatial, and temporal properties of feature points. A support vector machine (SVM) is proposed to learn reflection feature points. Predicted reflection feature points are used as priors to guide the reflection layer separation. This gives more robust and reliable results than what is achieved by performing layer separation alone.
Article
The transmitted scene superposed with the reflected scene from a transparent surface leads to mixed images. Few methods have been devoted to tracking on mixed images, although such images are ubiquitous in the real world. Thus, this paper proposes a robust single object tracking scheme for mixed images acquired by mobile cameras. Layer separation, which decomposes mixed images, extracts intrinsic dynamic layers before tracking. In order to make the tracker robust against camera motion, motion compensation is applied to both the layer separation and the prediction stage of the particle filter. To maximize the observation likelihood and thus optimize particle weights in the face of reflections, the proposed scheme combines sequential importance resampling (SIR) based co-inference and maximum likelihood for multi-cue integration. Experimental results show that the proposed scheme effectively improves tracking accuracy on mixed images with camera motion.
Conference Paper
We propose to detect edges of reflections, which we call the REF-edges, from a single image via convex optimization. Our method is designed based on two observations on reflections: (i) reflections have almost monotone color and (ii) color around REF-edges varies smoothly. The first one can be translated into the property that gradients around REF-edges distribute linearly in the RGB color space, which we call the REF-linearity. The second one can be interpreted as follows: color differences around REF-edges are small; for an entry of REF-edges, gradients among its surrounding entries have small variance. Using the above properties, we characterize REF-edges as a solution of a constrained convex optimization problem. The optimization problem is solved by the Alternating Direction Method of Multipliers (ADMM). Experiments using real-world images with reflections show the utility of our proposed method.
Conference Paper
In this work we describe a method of pupil detection for subsequent gaze tracking, when specular reflection is present in the image. Gaze tracking commonly uses the spatial relationship between the pupil and corneal reflection, but is not robust when the user is wearing eyeglasses, since light reflected from the surroundings changes the appearance of the pupil. In this research we propose and evaluate a pupil detection method that can perform robustly even in the presence of such reflection.
Conference Paper
Large motion displacements in image sequences are still a problem for most motion estimation techniques. Progress in feature matching makes it possible to establish robust correspondences between images for a sparse set of points. Recent works have attempted to use this sparse information to guide the dense motion field estimation. We propose to achieve this in an extended motion estimation framework, which integrates information about the geodesic distance to the sparse features. Results show that by considering a handful of these feature matches, the geodesic distance is able to propagate the information efficiently.
Article
This paper describes a visual object detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features and yields extremely efficient classifiers [6]. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. A set of experiments in the domain of face detection are presented. The system yields face detection performance comparable to the best previous systems [18, 13, 16, 12, 1]. Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
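The role of the integral image mentioned in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names, the zero-padded table layout, and the particular two-rectangle feature are assumptions chosen for illustration. It shows how, once the table is built, the sum over any axis-aligned rectangle needs only four lookups, so rectangle-difference features cost the same at every scale.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table with a zero top row and left column.

    ii[y][x] holds the sum of all img pixels above and to the left of (y, x).
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img rows top..bottom-1 and columns left..right-1 (four lookups)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Hypothetical horizontal two-rectangle feature: left half minus right half."""
    half = width // 2
    left_sum = box_sum(ii, top, left, top + height, left + half)
    right_sum = box_sum(ii, top, left + half, top + height, left + 2 * half)
    return left_sum - right_sum
```

The table is built in one pass over the image, after which each feature evaluation is independent of the rectangle size.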
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Article
The factorization method described in this series of reports requires an algorithm to track the motion of features in an image stream. Given the small inter-frame displacement made possible by the factorization approach, the best tracking method turns out to be the one proposed by Lucas and Kanade in 1981. The method defines the measure of match between fixed-size feature windows in the past and current frame as the sum of squared intensity differences over the windows. The displacement is then defined as the one that minimizes this sum. For small motions, a linearization of the image intensities leads to a Newton-Raphson style minimization. In this report, after rederiving the method in a physically intuitive way, we answer the crucial question of how to choose the feature windows that are best suited for tracking. Our selection criterion is based directly on the definition of the tracking algorithm, and expresses how well a feature can be tracked. As a result, the criterion is optima...
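The tracking step summarised in this abstract can be sketched as a single linearized update. This is a minimal sketch under stated assumptions, not the report's code: it assumes the model J(x) = I(x - d), central-difference gradients, an unweighted square window, and images stored as lists of rows; the function name `lk_step` is hypothetical. Minimizing the sum of squared differences after linearization gives the 2x2 system G d = e, where G accumulates gradient products and e accumulates gradient-weighted intensity differences. The smaller eigenvalue of G is returned as well, since it is the quantity commonly used to judge whether a window is well conditioned for tracking.

```python
import math


def lk_step(I, J, cy, cx, r):
    """One linearized SSD step for the window of radius r centred at (cy, cx).

    Assumes J(x) = I(x - d). Returns (dx, dy, lam_min), or None if the
    2x2 system is singular (for example, a window with no texture).
    """
    gxx = gxy = gyy = ex = ey = 0.0
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            # Central-difference gradient of the first image.
            gx = (I[y][x + 1] - I[y][x - 1]) / 2.0
            gy = (I[y + 1][x] - I[y - 1][x]) / 2.0
            diff = I[y][x] - J[y][x]
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
            ex += diff * gx
            ey += diff * gy
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return None
    # Solve the symmetric 2x2 system G d = e in closed form.
    dx = (gyy * ex - gxy * ey) / det
    dy = (gxx * ey - gxy * ex) / det
    # Smaller eigenvalue of G, a measure of how well the window constrains d.
    lam_min = (gxx + gyy) / 2.0 - math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return dx, dy, lam_min
```

In practice the step would be iterated, resampling J at the updated position each time, which is the Newton-Raphson style minimization the abstract refers to.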
Article
Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a Structural Similarity Index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000.
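For readers unfamiliar with the index, the following sketch computes the structural similarity value for a single window. The abstract itself gives no formula, so the expression used here is the commonly published definition, SSIM = ((2 mu_x mu_y + C1)(2 sigma_xy + C2)) / ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2)), with C1 = (K1 L)^2 and C2 = (K2 L)^2. The defaults K1 = 0.01, K2 = 0.03, and L = 255 are the conventional choices, and the function name and flat-list input are assumptions for illustration. A full implementation would slide a weighted window over the image and average the local values.

```python
def ssim_window(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two equally sized pixel windows given as flat lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("windows must have the same length, at least 2")
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased (n - 1) estimates of variance and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical windows score 1, a pure brightness shift lowers the score through the mean term only, and a contrast inversion drives the covariance term negative.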
Article
We address the problem of recovering a scene recorded through a semireflecting medium (i.e. planar lens), with a virtual reflected image being superimposed on the image of the scene transmitted through the semireflecting lens. Recent studies propose imaging through a linear polarizer at several orientations to estimate the reflected and the transmitted components in the scene. In this study we extend the sparse ICA (SPICA) technique and apply it to the problem of separating the image of the scene without having any a priori knowledge about its structure or statistics. Recent novel advances in the SPICA approach are discussed. Simulation and experimental results demonstrate the efficacy of the proposed methods. © 2005 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 15, 84–91, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ima.20042
Conference Paper
A new method for the blind separation of linear image mixtures is presented in this paper. Such mixtures often occur when, for example, we photograph a scene through a semireflecting medium (windshield or glass). The proposed method requires two mixtures of two scenes captured under different illumination conditions. We show that the boundary values of the ratio of the two mixtures can lead to an accurate estimation of the separation matrix. The technique is very simple, fast, and reliable, as it does not depend on iterative procedures. The method's effectiveness is tested on both artificially mixed images and real images.
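The idea of reading the separation matrix off the extreme values of the mixture ratio can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm: it assumes non-negative sources, a positive 2x2 mixing matrix, noise-free data, and that each source has at least one pixel where the other source is zero, so that the extreme ratios are actually attained. The function name `separate_by_ratio` is hypothetical. With x1 = a11 s1 + a12 s2 and x2 = a21 s1 + a22 s2, the ratio x1/x2 lies between a12/a22 and a11/a21, and subtracting the mixtures scaled by each extreme ratio cancels one source.

```python
def separate_by_ratio(x1, x2, eps=1e-12):
    """Recover two sources, each up to a positive scale, from two mixtures.

    x1 and x2 are flat lists of non-negative pixel values of equal length.
    Returns (est_a, est_b, (r_min, r_max)).
    """
    ratios = [a / b for a, b in zip(x1, x2) if b > eps]
    r_min, r_max = min(ratios), max(ratios)
    # x1 - r_min * x2 cancels the source whose mixing ratio is r_min.
    est_a = [a - r_min * b for a, b in zip(x1, x2)]
    # r_max * x2 - x1 cancels the source whose mixing ratio is r_max.
    est_b = [r_max * b - a for a, b in zip(x1, x2)]
    return est_a, est_b, (r_min, r_max)
```

The procedure is non-iterative, in keeping with the abstract's claim, but in this naive form a single noisy pixel can corrupt the extreme ratios, so real data would call for a more robust estimate of the boundary values.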
Article
This work presents a perceptual-based no-reference objective image sharpness/blurriness metric by integrating the concept of just noticeable blur into a probability summation model. Unlike existing objective no-reference image sharpness/blurriness metrics, the proposed metric is able to predict the relative amount of blurriness in images with different content. Results are provided to illustrate the performance of the proposed perceptual-based sharpness metric. These results show that the proposed sharpness metric correlates well with the perceived sharpness and can predict with high accuracy the relative amount of blurriness in images with different content.
Conference Paper
In this paper we present an approach for separating two transparent layers in images and video sequences. Given two initial unknown physical mixtures, I1 and I2, of real scene layers, L1 and L2, we seek a layer separation which minimizes the structural correlations across the two layers, at every image point. Such a separation is achieved by transferring local grayscale structure from one image to the other wherever it is highly correlated with the underlying local grayscale structure in the other image, and vice versa. This bi-directional transfer operation, which we call the "layer information exchange", is performed on diminishing window sizes, from global image windows (i.e., the entire image), down to local image windows, thus detecting similar grayscale structures at varying scales across pixels. We show the applicability of this approach to various real-world scenarios, including image and video transparency separation. In particular, we show that this approach can be used for separating transparent layers in images obtained under different polarizations, as well as for separating complex non-rigid transparent motions in video sequences. These can be done without prior knowledge of the layer mixing model (simple additive, alpha-matted composition with an unknown alpha-map, or other), and under unknown complex temporal changes (e.g., unknown varying lighting conditions).