Low Compute and Fully Parallel Computer Vision with HashMatch

Sean Ryan Fanello¹  Julien Valentin¹  Adarsh Kowdle¹  Christoph Rhemann¹
Vladimir Tankovich¹  Carlo Ciliberto²  Philip Davidson¹  Shahram Izadi¹
¹perceptiveIO    ²University College London
Authors contributed equally to this work.
Abstract
Numerous computer vision problems such as stereo depth estimation, object-class segmentation and foreground/background segmentation can be formulated as per-pixel image labeling tasks. Given one or many images as input, the desired output of these methods is usually a spatially smooth assignment of labels. The large number of such computer vision problems has led to significant research efforts, with the state of the art moving from CRF-based approaches to deep CNNs and, more recently, hybrids of the two. Although these approaches have significantly advanced the state of the art, the vast majority focuses solely on improving quantitative results and is not designed for low-compute scenarios. In this paper, we present a new general framework for a variety of computer vision labeling tasks, called HashMatch. Our approach is designed to be both fully parallel, i.e. each pixel is independently processed, and low-compute, with a model complexity an order of magnitude smaller than existing CNN- and CRF-based approaches. We evaluate HashMatch extensively on several problems such as disparity estimation, image retrieval, feature approximation and background subtraction, for which HashMatch achieves high computational efficiency while producing high quality results.
1. Introduction
Since the groundbreaking work of Krizhevsky et al. [29], deep learning is now the method of choice for a variety of computer vision problems. Although significant efforts have been undertaken to improve the performance of CNNs on various labeling tasks, these models are still far from being computationally efficient. Most of the work on efficient deep learning focuses on compressing deep models without losing precision and accuracy [38, 20, 11]. For example, in [38] the authors train a deep architecture that uses binary weights for both the input and the filters. In [20], the authors remove redundant connections and force multiple neurons to share the same quantized weights. Others, like [22], design more compact layers that use a reduced number of parameters. Similar to [38], the methods proposed in [11, 12] binarize the full network. However, these solutions still require many computational layers that involve multiple convolutions to infer per-pixel labels. Although they improve efficiency from the computational perspective, they access image patches stored in memory multiple times, and hence these algorithms are both memory- and compute-bound.
Prior to the deep learning era, Conditional Random Fields (CRFs) were one of the major tools for image labeling problems. In their 'simplest' form, CRFs are composed of a pairwise term, encouraging structural coherence in the solution, and a unary term, which is responsible for modeling the compatibility between each data point/pixel and a pre-defined set of labels. Machine learning is commonly used to predict this compatibility function. It is accepted that the more sophisticated the unary potential, the better the results obtained after solving the CRF. As a consequence, practitioners tend to use expensive feature representations, e.g. HOG [13], SIFT [30] or even intermediate deep-learning representations, followed by advanced classifiers such as kernel-SVMs [42]. Once the unary potential has been estimated for each pixel, the actual CRF inference can be performed. Unfortunately, solving multi-label CRFs is an NP-hard problem [7]. The high demand for fast and accurate solvers has resulted in significant research efforts in this space, each offering different trade-offs in terms of compute and closeness to the posterior captured by the CRF. It is worth noting that it is possible to compute unary potentials and solve the CRF at interactive rates (e.g. [10, 48]), but these approaches still have some sequential steps and large model complexity.
In this paper, we propose to bridge this gap in the literature by introducing HashMatch, a generic and extremely low-compute framework that has been designed from the ground up for parallel, i.e. pixel-independent, processing. As demonstrated in this paper, the efficiency and parallel nature of our approach allow us to achieve compelling results on a variety of computer vision problems at speeds never before demonstrated: for example, estimating disparity or segmentation masks on 1.3 megapixel images at 1000 fps on high-end GPUs (e.g. NVIDIA Titan X) and at 200 fps on VGA images on mobile architectures (e.g. Tegra TX1). Our main technical contributions are two-fold. First, we propose a binary embedding for classification, regression and nearest neighbor tasks that is trained with sparsity and anti-sparsity constraints. This allows evaluating a robust unary potential with only a few operations per pixel. Second, we present a new inference scheme that operates fully in parallel, with a complexity that is not a function of the size of the solution space. These technical contributions are formulated in a mathematical framework whose objective function is directly applicable to many different computer vision problems.
2. Related Work
Binary Representations. The task of finding binary and
compact representations has been exhaustively studied in
the literature. This is usually known as hashing and the
problem is generally formulated as:
b = h(x)    (1)

with x ∈ R^n, b a binary code in {0,1}^k, and h the hashing function. h can be a linear projection, a spherical function, a kernel, a neural network, a non-parametric function, etc. Here we focus on the family of linear hash functions of the form h(x) = sign(xW), with W ∈ R^{n×k}, where sign(x) returns 1 if x ≥ 0 and 0 otherwise. The most popular data-independent approach to generate these hash functions is Locality-Sensitive Hashing (LSH) [23, 9], which typically uses random Gaussian projections to generate the hyperplanes W. Despite their simplicity, these hashing schemes perform reasonably well in practice. For an extensive review of LSH, we refer the reader to [51].
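As a concrete illustration of the data-independent scheme above, the following is a minimal NumPy sketch of LSH with random Gaussian hyperplanes and Hamming-distance retrieval; all dimensions and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def lsh_codes(X, W):
    """Binary codes h(x) = sign(xW): 1 where the projection is >= 0, else 0."""
    return (X @ W >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
n, k = 128, 32                      # input and code dimensionality (illustrative values)
W = rng.standard_normal((n, k))     # random Gaussian hyperplanes (data-independent LSH)

X = rng.standard_normal((1000, n))  # database of descriptors
q = rng.standard_normal((1, n))     # query descriptor

B, b = lsh_codes(X, W), lsh_codes(q, W)
hamming = np.count_nonzero(B != b, axis=1)   # Hamming distances to the query
nearest = int(np.argmin(hamming))            # index of the retrieved neighbour
```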
Data-dependent hashing schemes have also been proposed [19, 37, 8, 54, 21, 43, 56]. These methods are usually unsupervised and design an objective function that preserves the similarities of the input signal in the new binary space. Iterative Quantization (ITQ) [19] obtains low dimensional codes by applying PCA to the data and then finding the optimal rotation that makes the codes as close as possible in the binary space. [8] casts the hashing problem as an auto-encoder formulation; however, part of the optimization resorts to enumerating all possible solutions, leading to very slow training procedures. In [37], the authors exploit sparsity to obtain a runtime cost that is independent of the original signal dimensionality. However, an ℓ2-normalization of the signal is performed first, so the overall running cost still depends on the input dimension. More recently, [52] uses an ensemble of fast decision trees to model hash functions; however, the method requires a non-trivial aggregation step, adding compute.
Our work differs substantially from prior approaches. Whereas most of them focus on nearest neighbor tasks, our framework is flexible and can be used for tasks ranging from classification to signal reconstruction. Unlike [8, 19], we heavily exploit sparsity at runtime, and we remove normalization steps such as the one in [37].
Inference in Graphical Models. Estimating the Maxi-
mum A Posteriori (MAP) of a multi-label Conditional Random Field (CRF) is a well-known NP-complete problem [7]. Given the successful use of CRFs for advancing the
state of the art, performing (approximate) inference over
those models has received much attention. The major dif-
ference among all the solvers is whether they are determin-
istic or stochastic. Stochastic methods include the family of
Markov Chain Monte-Carlo methods, which have exponen-
tial convergence rate in the worst case but can provide exact
results. This family of methods is regarded as too compute
intensive for real-time applications that make use of ‘large’
CRFs. Alongside stochastic techniques sits a wide array of deterministic methods. Successful techniques include
Move-Making algorithms [7], Belief Propagation [17], It-
erated Conditional Modes [3], Tree Reweighted Message
Passing [27], Quadratic Pseudo-Boolean Optimization [41]
and Mean-Field [28]. Each of these techniques comes with
various trade-offs in terms of quality of the approximation
and speed.
It is worth noting that when the data term is strong and
high-speed inference is required (e.g. depth estimation at
VGA resolution), global optimization of the posterior is
usually dropped in favor of local optimization [40,4,32].
Thanks to our learned binary representation we can pro-
vide a very strong data term. This low entropy data term
allows us to design and use a new parallel global infer-
ence method which reaches high quality solutions for 1.3
Megapixel images in less than one millisecond on GPUs.
3. The HashMatch Framework
Our framework is based on a pairwise-CRF that can be
expressed using the following probabilistic factorization:
P(Y|D) = (1/Z(D)) · e^{−E(Y|D)}    (2)

E(Y|D) = Σ_i ψ_u(l_i) + Σ_i Σ_{j∈N_i} ψ_p(l_i, l_j)    (3)
The data term ψ_u(l_i) models how likely a node in the graph (usually a pixel) belongs to a particular class l_i (e.g. 'foreground'). The exact implementation of ψ_u(l_i) depends on the task at hand. For instance, for finding the nearest neighbor between image patches, the labels l_i correspond to vectors (u, v) which define the displacements in the image directions. Then

ψ_u(l_i) = |h(x_i) − h(x_{i+l_i})|    (4)

measures the compatibility of two image patches x centered at the 2D pixel locations i and i+l_i. The function h(x) = sign(xW) is a binary feature, which allows us to compute ψ_u(l_i) highly efficiently via the Hamming distance in (4).
For any other classification or regression problem we define ψ_u as

ψ_u(l_i) = −log(g(l_i, h(x_i)))    (5)

where g is a learned classifier or regressor that evaluates the likelihood of label l_i given the binary code h(x_i) of an image patch x_i. The smoothness cost is defined as ψ_p(l_i, l_j) = min(|l_i − l_j|, τ), where τ is a truncation threshold; it encourages neighboring pixels i and j to be assigned similar labels.
Our key contribution is a novel method to compute h(x) and g(·) that captures the essential information in the data; it is detailed in the remainder of this section.
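The two forms of the data term can be sketched in a few lines; the snippet below is a toy Python illustration (function names and the small numerical constant are ours, not the paper's), assuming the binary codes of Eq. (4) and a classifier score table for Eq. (5).

```python
import numpy as np

def hamming(b1, b2):
    # Eq. (4): compatibility of two binary codes via the Hamming distance
    return int(np.count_nonzero(b1 != b2))

def unary_nn(code_left, code_right_at_offset):
    # psi_u(l_i) for correspondence search: distance between h(x_i) and h(x_{i+l_i})
    return hamming(code_left, code_right_at_offset)

def unary_classifier(g_scores, label):
    # Eq. (5)-style term: negative log-likelihood of a label under a learned g,
    # where g_scores[l] is assumed to hold g(l, h(x_i)) for every label l
    return -np.log(g_scores[label] + 1e-12)
```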
3.1. HashCodes Learning
We now detail our proposed approach to train a function h that maps a signal x ∈ R^n to a binary space b = h(x) = sign(xW) ∈ {0,1}^k. This binary representation b is then used to learn a function g(l, b) that performs any given task y ∈ R^d. It is important to note that y can correspond to tasks as diverse as multi-label classification and structured regression. In particular, one can define y = x for nearest-neighbor search. In order to keep the computational cost as low as possible, we consider a linear model for each entry y_l, i.e. y_l = g(l, b) = b^⊤ z_l.
More formally, we learn a set of hyperplanes W ∈ R^{n×k} and a task function Z ∈ R^{k×d} that minimize a loss L:

min_{W,Z}  L(sign(XW)Z, Y) + Γ(W) + Ω(Z)    (6)

where X ∈ R^{m×n} and Y ∈ R^{m×d} are matrices whose i-th rows correspond to x_i and y_i, respectively. The terms Γ(W) and Ω(Z) are suitable regularizers encouraging desired structures on the two predictors.
The model y = Z^⊤ sign(W^⊤ x) can be interpreted as a neural network with one hidden layer and with the operator sign(·) as non-linearity (in contrast to a sigmoid or ReLU [34]). In particular, when Y = X, the model becomes similar to an autoencoder with an internal binary representation (e.g. [8]). However, the optimization of Eq. (6) cannot be performed using first-order methods (e.g. backpropagation) because sign(XW) is a piece-wise constant function (and therefore the subgradient with respect to W is zero almost everywhere). We circumvent this issue by decoupling the task y from the binary mapping h(x). To do so we introduce an additional variable B = sign(XW) and then relax the equality constraint by means of a dissimilarity measure D(XW, B) that will be minimized. This leads to the problem

min_{W,Z,B}  L(BZ, Y) + Γ(W) + Ω(Z) + γ D(XW, B)    s.t.  ‖B‖_∞ ≤ µ    (7)
where ‖B‖_∞ = max_{ij} |B_{ij}| denotes the ℓ_∞ (or max) norm of B and µ > 0 is a scalar hyperparameter. The constraint ‖B‖_∞ ≤ µ is introduced to encourage so-called anti-sparse solutions, such that the minimizer B of Eq. (7) would have all entries B_{ij} = ±µ. The concept of anti-sparsity was originally introduced in the signal processing literature, where it was observed that imposing constraints based on the max-norm induces 'binary' solutions. We refer the reader to [33, 18, 25, 53, 44] for an in-depth discussion on the anti-sparse properties of max-norm regularization (and constraints). The idea of introducing a variable B with binary entries is akin to the one proposed in [8]. However, in that work the authors imposed the constraint B ∈ {−1, 1} in the optimization, leading to an NP-hard problem. On the other hand, the max-norm constraint ball is a convex set, and in the following we discuss an efficient optimization algorithm to find B in practice.
Interestingly, when B is a binary matrix with entries equal to ±µ, the problem of learning a linear predictor W such that (1/µ) B ≈ sign(XW) corresponds to a standard multi-label (or multi-task) binary classification problem, with each column of B representing a different binary task. Therefore a natural choice for the dissimilarity measure D(XW, B) is a loss function used for classification problems, such as the logistic, hinge, or least-squares losses.
3.2. Optimization
The optimization problem described by Eq. (7) is not jointly convex in W, Z, B but, for convex loss functions L, D and regularizers Γ, Ω, the objective functional is convex separately in each variable. A natural strategy to address this problem is therefore to perform either alternated minimization [47, 39] or block coordinate descent [5]. Below we detail how we propose to perform the minimization of Eq. (7) by independently optimizing W, B and Z.
Optimizing W. We propose to use the regularizer Γ(W) = λ|W|_1 in order to induce sparse solutions. In particular, in our experiments λ is chosen so that the corresponding solution W has at most s ≪ n non-zero entries in each column. Indeed, this allows computing a code h(x) = sign(W^⊤ x) in O(sk) operations rather than O(nk). When the dissimilarity measure D is smooth, one can employ a standard proximal forward-backward splitting method to find the best W for fixed B and Z. The algorithm produces a sequence of updates defined by

W_{t+1} = Prox_{σ_1 λ|·|_1}( W_t − σ_1 γ X^⊤ ∇D(X W_t, B) )    (8)

where Prox_{σ_1 λ|·|_1} denotes the proximal operator of σ_1 λ|·|_1 (see [2]). For the case of the ℓ_1 norm, the proximal operator is well known to correspond to entry-wise soft-thresholding [2]: for each scalar w, Prox_{σ_1 λ|·|_1}(w) = 0 if |w| ≤ σ_1 λ and Prox_{σ_1 λ|·|_1}(w) = w − sign(w) σ_1 λ otherwise. For a suitable choice of step size σ_1 (by either line search or depending on the Lipschitz constant of the gradient of D), iterating Eq. (8) is guaranteed to converge to the solution W, with the value of the objective functional decreasing at a rate of O(1/t) [2]. Following [57] we also use an early stopping criterion to fix the desired number of variables with the highest absolute values for each column of W.
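The following Python sketch shows one forward-backward step of Eq. (8), assuming the least-squares dissimilarity D(XW, B) = ||XW − B||² adopted later in the paper; the function names and the absence of line search are our simplifications.

```python
import numpy as np

def soft_threshold(A, t):
    """Entry-wise proximal operator of t*|.|_1 (soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def update_W(W, X, B, lam, gamma, sigma1):
    """One forward-backward step of Eq. (8) for D(XW, B) = ||XW - B||^2,
    whose gradient with respect to W is 2 X^T (XW - B)."""
    grad = 2.0 * gamma * X.T @ (X @ W - B)
    return soft_threshold(W - sigma1 * grad, sigma1 * lam)
```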
Optimizing B. If L is smooth, one can again adopt the proximal forward-backward splitting approach to minimize Eq. (7) w.r.t. B (for fixed W and Z), obtaining the updates

B̃_{t+1} = B_t − σ_2 ( ∇L(B_t Z, Y) Z^⊤ + γ ∇D(XW, B_t) )
B_{t+1} = Prox_{{‖·‖_∞ ≤ µ}}( B̃_{t+1} )    (9)

To compute the proximal operator, we make use of the Moreau decomposition, stating that for any function φ, Prox_φ(B) = B − Prox_{φ*}(B), with φ* denoting the Fenchel conjugate of φ (defined as φ*(B) = sup_{C ∈ R^{m×k}} tr(B^⊤ C) − φ(C)). In our case, φ can be interpreted as the indicator function of the max-norm ball of radius µ (namely the function that is zero when ‖B‖_∞ ≤ µ and +∞ otherwise). It is straightforward to show that the corresponding Fenchel conjugate is φ*(B) = µ|B|_1. As a consequence we have

B_{t+1} = B̃_{t+1} − Prox_{µ|·|_1}( B̃_{t+1} )    (10)

with Prox_{µ|·|_1} the entry-wise soft-thresholding operator introduced for the optimization of W. We again obtain a convergence rate in the order of O(1/t) [2].
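A minimal sketch of the B step, again assuming least-squares L and D; note that the Moreau decomposition of Eq. (10) reduces to clipping the entries of B̃ to [−µ, µ]. The helper names are ours.

```python
import numpy as np

def soft_threshold(A, t):
    """Entry-wise soft-thresholding, the proximal operator of t*|.|_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def prox_linf_ball(A, mu):
    """Eq. (10): Moreau decomposition A - Prox_{mu|.|_1}(A), i.e. clipping to [-mu, mu]."""
    return A - soft_threshold(A, mu)

def update_B(B, W, Z, X, Y, gamma, mu, sigma2):
    """One forward-backward step of Eq. (9) with least-squares L and D:
    grad_B ||BZ - Y||^2 = 2(BZ - Y)Z^T and grad_B gamma||XW - B||^2 = 2 gamma (B - XW)."""
    grad = 2.0 * (B @ Z - Y) @ Z.T + 2.0 * gamma * (B - X @ W)
    return prox_linf_ball(B - sigma2 * grad, mu)
```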
Optimizing Z. We consider the Frobenius norm regularizer Ω(Z) = η‖Z‖^2 to avoid overfitting. This problem can be solved by standard gradient descent updates

Z_{t+1} = Z_t − σ_3 ( B^⊤ ∇L(B Z_t, Y) + η Z_t )    (11)

which are known to converge for a suitable choice of the step size. Convergence rates of O(1/t) are guaranteed also in this case [2]. Faster rates can be achieved by adding further hypotheses on the conditioning of the loss L [6]. Moreover, if we consider L(BZ, Y) = ‖BZ − Y‖^2, we can compute the solution to the problem in closed form as Z = (B^⊤ B + η I)^{−1} B^⊤ Y, with I the k×k identity matrix.
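For the least-squares case, the closed-form ridge solution can be implemented directly; a small sketch (the solver call replaces an explicit matrix inverse for numerical stability):

```python
import numpy as np

def update_Z(B, Y, eta):
    """Closed-form solution Z = (B^T B + eta I)^{-1} B^T Y for the
    least-squares loss, with I the k x k identity matrix."""
    k = B.shape[1]
    return np.linalg.solve(B.T @ B + eta * np.eye(k), B.T @ Y)
```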
Convergence Rates of Block Coordinate Descent. Block coordinate descent methods consist of iterating across the steps in Eq. (8, 9, 11), optimizing over one variable at a time while keeping the other two fixed. In general, it is challenging to prove convergence of the iterations to a stationary point (e.g. local minima or saddles), let alone prove rates on how fast such convergence can be guaranteed. However, for the choice of loss functions and regularizers considered in this work, the proposed approach belongs to the family of Proximal Alternating Linearized Minimization (PALM) optimization methods [5], whose convergence properties have been recently studied. As a corollary to Theorem 1 and Remark 6 in [5] we have the following

Theorem 1 (Convergence of PALM). With the notation of Eq. (7), let L and D satisfy the hypotheses in [5] (Thm. 1); in particular, they are differentiable and have Lipschitz continuous gradients with associated Lipschitz constants L_L and L_D, respectively. Consider the iterative sequence (W_t, B_t, Z_t)_{t≥0} obtained by updating each variable iteratively according to Eq. (8, 9, 11), with step sizes σ_1 ≤ (γ L_D ‖X‖_op)^{−1}, σ_2 ≤ (η L_L + γ L_D)^{−1}, σ_3 ≤ (µ m L_L + η)^{−1} (with m defined in Thm. 3.1 of [5], and ‖X‖_op denoting the operator norm of X, namely its maximum singular value). Then there exists a stationary point (W*, B*, Z*) of the functional in Eq. (7) such that

‖(W_t, B_t, Z_t) − (W*, B*, Z*)‖ = O(1/t)    (12)

The above theorem states that for specific choices of descent steps σ_1, σ_2, σ_3, one can expect convergence to a stationary point at a sublinear rate of the order of O(1/t).
Choice of L and D. The relaxed problem described in Eq. (7) and the optimization approach described above apply to any choice of convex smooth loss function L and dissimilarity measure D. In the experiments reported in this work we adopted the least-squares loss function L(BZ, Y) = ‖BZ − Y‖^2 and D(XW, B) = ‖XW − B‖^2, for which it is easy to recover the Lipschitz constant of the gradient to derive the descent step sizes σ_1, σ_2, σ_3 in Thm. 1. This choice was also motivated by the fact that least squares is the standard loss function for regression and reconstruction problems (hence a natural choice for L) but is also often used in classification settings [55] (hence a viable choice for the dissimilarity D). In the supplementary material we report the pseudocode of the optimization strategy described above for this choice of loss and dissimilarity.
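As a hedged stand-in for that pseudocode, the sketch below strings together the update_W, update_B and update_Z functions from the earlier sketches into a PALM-style alternation; the initialization, fixed step sizes and hyperparameters are purely illustrative and do not reproduce the paper's exact settings.

```python
import numpy as np

def hashmatch_train(X, Y, k=32, lam=1e-2, gamma=1.0, eta=1e-3, mu=1.0,
                    iters=50, sigma1=1e-4, sigma2=1e-4):
    """Block coordinate (PALM-style) alternation of the W, B and Z updates
    sketched above, for the least-squares choices of L and D."""
    n = X.shape[1]
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((n, k))
    B = np.where(X @ W >= 0, mu, -mu)   # anti-sparse initialization in {-mu, +mu}
    Z = update_Z(B, Y, eta)             # closed-form ridge solution
    for _ in range(iters):
        W = update_W(W, X, B, lam, gamma, sigma1)
        B = update_B(B, W, Z, X, Y, gamma, mu, sigma2)
        Z = update_Z(B, Y, eta)
    return W, Z

# At test time the binary code of a new sample x is sign(x @ W) and the
# prediction is code @ Z, mirroring y = Z^T sign(W^T x) in the text.
```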
3.3. Parallel Inference
Inferring the posterior probability, and a fortiori the MAP (Maximum a Posteriori) of P(Y|D), is in general very hard, as it requires solving a very complex series of integrals over all the variables x ∈ X. To approximate this complex distribution, we resort to a variational approximation. More precisely, we aim at finding a distribution Q that is a 'close' approximation of P within the class of distributions that can be factorized as a product of independent marginals, i.e.

Q(Y) = Π_i Q(Y_i)    (13)
This approximation is computationally attractive but is likely to lose a lot of information about the original distribution P(Y|D). Nevertheless, the results of MAP and MPM (Maximum Posterior Marginal) inference will be quite similar when the entropy of P(Y|D) is low. Broadly speaking, in the case of the pairwise CRF described previously, good approximations of P(Y|D) are obtained when the unary potentials are 'peaky' (i.e. low entropy). The quality of the approximation between Q(Y) and P(Y|D) is usually measured using KL(Q(Y)||P(Y|D)), where KL is the Kullback-Leibler divergence. Taking the fixed-point solution of the Kullback-Leibler divergence [26], we obtain the following update for the label l_i in the marginal of random variable x_i:

Q_i^t(l_i) = (1/B_i) · e^{−M_i(l_i)}    (14)

M_i(l_i) = ψ_u(l_i) + Σ_{j∈N_i} Σ_{l_j∈L} Q_j^{t−1}(l_j) ψ_p(l_i, l_j)    (15)

B_i = Σ_{l_i∈L} e^{−M_i(l_i)}    (16)
The underlying coordinate ascent procedure not only results in a better approximation of P by Q at each iteration, but also guarantees convergence. Note that popular techniques like Belief Propagation are not guaranteed to converge when performing inference over graphs with loops. At this stage, it is crucial to note that the complexity of evaluating the updated Q^t(Y) is O(|Y| |L| (|N| |L| + 1)). The quadratic complexity in L is not of practical concern for high-speed inference when L is small. Nevertheless, it makes the (computationally attractive) mean-field framework too slow when this number grows large. As mentioned before, the MPM and MAP solutions are similar in the pairwise CRF (Eq. (3)) when the entropy of the unary is low. We go one step further in the approximation and explicitly assume that Q also has low entropy, approximating it with a Dirac δ function. This corresponds to setting Q_i = δ(l_i − argmax_{l_j} Q_j). We can now rewrite (15) as follows

M_i(l_i) = ψ_u(l_i) + Σ_{j∈N_i} ψ_p(l_i, argmax_{l_j} Q_j)    (17)
Since Q_i^t (14) now follows a Dirac δ function, computing the normalization constant B_i (16) is no longer required. The compute complexity of updating Q^t(Y) is now O(|Y| |N| (1 + |N|)). Roughly speaking, this corresponds to a reduction in complexity on the order of O(|L|^2 / |N|). Note that |N| is small for many problems (e.g. stereo depth estimation), and that in most of these problems |L| > |N| (e.g. |L| is in the hundreds and |N| = 4 when estimating disparities).
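The simplified update of Eq. (17) can be vectorized so that every pixel is updated simultaneously; the sketch below assumes a precomputed per-pixel cost volume (e.g. Hamming distances of hash codes), a truncated-linear pairwise term, a 4-connected neighbourhood, and np.roll's wrap-around borders, all of which are our simplifications for illustration.

```python
import numpy as np

def parallel_inference(unary, tau=2.0, iters=4):
    """Dirac-approximated mean-field (Eq. 17): every pixel updates its label in
    parallel from its neighbours' current argmax labels.
    unary: (H, W, L) per-pixel data costs."""
    H, Wd, L = unary.shape
    labels = unary.argmin(axis=2)                 # initial labels from the data term alone
    label_ids = np.arange(L)[None, None, :]       # shape (1, 1, L)
    for _ in range(iters):
        M = unary.astype(np.float32, copy=True)
        # each neighbour contributes a truncated label-distance penalty
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nb = np.roll(labels, shift=(dy, dx), axis=(0, 1))[..., None]  # (H, W, 1)
            M += np.minimum(np.abs(label_ids - nb), tau)
        labels = M.argmin(axis=2)                 # all pixels update simultaneously
    return labels
```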
3.4. Computational Analysis
The following is a computational analysis of the proposed method for (discrete) disparity estimation. We assume an input image with |Y| pixels and L possible labels. Typical values for disparity estimation are |Y| = 1280 × 1024 and L = 512. The hyperplanes W are trained to have at most 4 non-zero elements per code. Each pixel i is associated with a patch P_i of size |P_i| = 11 × 11. Taking the sign of the dot product between P_i and W remaps P_i into k = 32 binary codes. The computation of the binary codes b is independent of the window size |P| and involves 4 multiplications and additions per hyperplane. This corresponds to a per-pixel complexity of O(4k) to compute a hash code. We initialize the data term ψ_i by evaluating only a very limited set of 32 random label hypotheses. The distances are computed using the Hamming distance in the new space, which can be efficiently implemented in O(1) using the popc() instruction available on most recent GPU architectures. The initialization step then has a per-pixel complexity of O(4k). Regarding the inference, we only use the immediate N = 3 × 3 neighbors of each pixel in the pairwise potential. The marginal of each pixel can be updated in parallel, without waiting for sequential propagation steps as in [4, 16]. In practice, we use 4 iterations of the proposed inference, resulting in a per-pixel complexity of O(4 |N| (1 + |N|)). Note that, in contrast to other approaches, the proposed algorithm is independent of the number of labels L and the patch size |P_i|. The relatively low computational complexity and fully parallel nature of all the components of the proposed method make it particularly suitable for high-speed applications on low-compute devices. We tested the HashMatch framework on an NVIDIA Titan X GPU, with an overall running time of 890 µs per frame. We also implemented the algorithm on an NVIDIA Tegra TX1, with an overall running time of 5 ms, which opens up the possibility of high-speed applications on mobile platforms.

Figure 1. Qualitative comparisons of depth maps generated with state of the art methods. HashMatch shows the most complete results in complex scenes.
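To make the O(4k) cost concrete, the sketch below computes one pixel's code from s = 4 non-zero weights per hyperplane and packs the k = 32 bits into a single word, so that the Hamming distance reduces to one XOR plus a population count (the GPU popc() analog); the function names and data layout are ours.

```python
import numpy as np

def sparse_hash(patch, idx, w):
    """Binary code from s=4 non-zero weights per hyperplane: for each of the k
    hyperplanes only 4 multiply-adds are needed, i.e. O(s*k) per pixel.
    patch: flattened 11x11 patch (121,); idx, w: (k, 4) indices and values of
    the non-zero entries of each column of W."""
    proj = (patch[idx] * w).sum(axis=1)            # (k,) sparse dot products
    bits = (proj >= 0).astype(np.uint64)
    return np.uint64((bits << np.arange(len(bits), dtype=np.uint64)).sum())  # pack k<=64 bits

def hamming_popcount(code_a, code_b):
    """Hamming distance of two packed codes; maps to a single popc() on the GPU."""
    return bin(int(code_a) ^ int(code_b)).count("1")
```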
4. Results
We evaluate the HashMatch framework on a diverse set
of computer vision tasks. For each problem, we explicitly
describe the form of the unary potential that is used. We
first show how our method can handle continuous labeling
problems such as disparity estimation. Further, we evaluate
the proposed hashing scheme on retrieval and feature ap-
proximation tasks. Finally, we assess the quality of the pro-
posed inference for background subtraction. Note that for
the tasks where the proposed inference is used, the number
of iterations is constant and set to 4.
4.1. Depth Estimation
In this section, we focus on depth estimation from stereo images under active illumination [15, 16]. For our experiments, we use a hardware setup similar to [16], i.e. two IR cameras in a stereo configuration together with a Kinect V1 DOE. When the IR images I_L and I_R are calibrated and rectified, each pixel p = (u, v) in I_L has a corresponding pixel q = (u + l, v) in I_R that lies on the same scanline v. We apply HashMatch to retrieve the continuous disparity l ∈ R. The disparity is then remapped to the depth domain via Z = bf / l, where b is the baseline of the system, f the focal length of the camera and l the inferred disparity.
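A tiny worked example of the disparity-to-depth conversion Z = bf / l, with purely illustrative baseline, focal length and disparity values:

```python
def disparity_to_depth(l, baseline_m, focal_px):
    """Z = b * f / l (baseline in meters, focal length and disparity in pixels)."""
    return baseline_m * focal_px / l

# e.g. a 9 cm baseline, 700 px focal length, and a disparity of 42 px:
Z = disparity_to_depth(42.0, 0.09, 700.0)   # = 1.5 m
```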
For this task, the data term ψ_u(l_i) is computed according to Eq. (4). We train the hyperplanes W by acquiring 10000 images and extracting 11 × 11 image patches x. We set the task y = x. We use k = 32 hyperplanes W with at most 4 non-zero elements per column.
We compare HashMatch with state of the art methods,
including both qualitative and quantitative evaluations. We
set the exposure time of the cameras very low (2ms) in or-
der to perform evaluations on real data captured at 500Hz.
Figure 2. Example of high quality depthmap and point clouds gen-
erated with HashMatch. Notice the level of details and the absence
of quantization and outliers.
This data has a significantly lower signal-to-noise ratio (SNR) compared to data captured at 30 Hz. In Fig. 1 we show qualitative results generated with HashMatch, PatchMatch Stereo [4] and UltraStereo (US) [16]. Notice how the baseline methods suffer from the relatively low SNR, whereas HashMatch is able to predict complete and smooth depthmaps. We also generate results using our unary term and the PatchMatch inference [1] (HashMatch+PM in Fig. 1). The proposed parallel inference produces results very close to those generated with the PatchMatch inference, but is 2× faster than the highly optimized implementation described in [16]. In Fig. 2 we show a high quality depthmap and point cloud generated with HashMatch; notice the level of detail captured by our method.
To quantitatively assess the proposed framework, we follow the procedure presented in [15] and acquired images of a flat wall at multiple known distances, varying between 500 and 3500 mm. For each set distance, 100 frames are recorded, from which we estimate the average depth bias (defined as the average error) and the depth jitter (defined as the standard deviation). We compare with Kinect V1, RealSense R200, PatchMatch Stereo [4], HyperDepth [15] and UltraStereo [16] as baseline methods; results are reported in Fig. 3. HashMatch outperforms most of the competitors and is on par with more involved methods such as [15].
Finally, we compare the quality of our hashing scheme with other state of the art methods: ITQ [19], SBE [37], CBE [56], Binary Autoencoders (BA) [8] and UltraStereo [16]. We use the synthetic dataset from [16], which has perfect groundtruth disparities, and perform a (discrete) nearest neighbor search over the pixel disparities. We define the accuracy as the percentage of pixels for which the disparity error is below 1 pixel; results are reported in Tab. 1. Although our HashMatch descriptor uses only 4 non-zero elements per hyperplane, the results are on par with dense hashing schemes such as BA. We also outperform other sparse hashing schemes such as CBE and SBE, proving the quality of the proposed representation. It is interesting to note that, besides providing smooth results, using the proposed inference raises the precision reported in Tab. 1 from 77% to 96%.

Figure 3. Quantitative comparisons with state of the art methods. HashMatch achieves the lowest error using the lowest compute.
HashMatch   ITQ   SBE   CBE   LSH   BA    US
   77%      76%   68%   70%   58%   77%   73%

Table 1. Hashing scheme comparisons. We compare our method with other popular hashing schemes and binary representations: ITQ [19], SBE [37], CBE [56], Binary Autoencoders [8] and UltraStereo (US) [16]. HashMatch uses only 4 non-zero elements and is on par with dense methods like BA.
4.2. Nearest Neighbor Retrieval
In this section, we evaluate HashMatch on a nearest neighbor retrieval task. We assume two sets of feature vectors, query and base. For each sample in the query set, the goal is to find its nearest neighbor in the base set. This is a typical setting for correspondence search problems between two or more images. The goal of this section is to evaluate only our hashing scheme and compare it with other state of the art methods, therefore no parallel inference is required. The data term ψ_u(l_i) is computed according to Eq. (4). Notice that for this experiment the smoothness term is not needed, so we drop it. We use the GIST1M dataset [24], which is composed of three disjoint sets: train, query and base. Each descriptor x has 960 dimensions, and we minimize Eq. (6) setting the task Y = X. We train data-dependent hashing schemes using the training set and test on the other sets.

Figure 4. Recall@R curves on the GIST1M dataset.

We compute the Recall@R, defined as the recall for R retrievals.
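A small Python sketch of the Recall@R metric under the Hamming distance; the exact tie-breaking and evaluation protocol are assumptions, and the ground-truth neighbor indices are taken from an exact search in the original (non-binary) feature space.

```python
import numpy as np

def recall_at_r(base_codes, query_codes, true_nn, R=10):
    """Fraction of queries whose true nearest neighbour appears among the R
    closest base codes under the Hamming distance.
    base_codes: (N, k) 0/1 codes; query_codes: (Q, k); true_nn: (Q,) indices."""
    hits = 0
    for q, t in zip(query_codes, true_nn):
        d = np.count_nonzero(base_codes != q, axis=1)   # Hamming distances
        hits += t in np.argsort(d, kind="stable")[:R]   # is the true neighbour retrieved?
    return hits / len(query_codes)
```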
We compare our method with the following state of the art hashing techniques: ITQ [19], SBE [37], CBE [56], LSH [23], and SKLSH [36]. For each method we train codes of 16, 32, 64 and 128 bits, perform the binary embedding according to Eq. (1) and compute the nearest neighbor search using the Hamming distance. We do not perform any data preprocessing such as normalization or augmentation. For HashMatch and SBE we set the sparsity parameter such that we have 10% non-zero elements.

We report the Recall@R curves in Fig. 4. For small code sizes, HashMatch greatly outperforms all the competitors, including dense ones such as ITQ. When larger codes are used (i.e. 128 bits), this gap reduces, but HashMatch still provides the largest area under the curve (AUC).
4.3. Feature Approximation
In this experiment we consider the problem of approximating complex feature descriptors given a set of 11 × 11 image patches x. For this particular application we consider SIFT descriptors [31] s ∈ R^128. The goal is to minimize Eq. (6) where the task Y = S is the set of SIFT descriptors. In other words, we apply HashMatch to a regression problem, where the target continuous function is computed from handcrafted features. In general, the same framework could be applied to learn more sophisticated descriptors.
We consider the EPFL wide-baseline stereo data [46], and train HashMatch on the sequences for which no groundtruth is available. Training data is generated by extracting SIFT descriptors in an unsupervised way. At test time we detect corners and compute SIFT and HashMatch descriptors, respectively. We match the descriptors based on the closest ℓ2 distance and filter outliers by imposing the epipolar constraint between the two images. In Fig. 5 we report qualitative results: in green we depict correct matches, in red those matches with distances greater than 1 cm in the provided groundtruth. SIFT achieves an average end-point error of 0.8 cm, whereas HashMatch comes very close with 1.2 cm. If we consider as inliers the percentage of retrieved matches with error < 1 cm, HashMatch reports an overall accuracy of 90%, whereas SIFT retrieves 81% correct matches. On average, SIFT retrieves 350 good matches per image, HashMatch 150. While HashMatch produces fewer matches, they are more reliable. In practice, the number of feature matches obtained from HashMatch would cater to most applications that use SIFT, at a fraction of the compute, which makes HashMatch very attractive in such scenarios.

Figure 5. Qualitative experiments for feature approximation (see Sec. 4.3).
4.4. Background Subtraction
In this last experiment, we evaluate the proposed parallel inference on a background segmentation task with static cameras. Typical applications are surveillance and people-tracking scenarios. We consider natural (indoor) rooms. Assuming a clean background shot is available (RGB and depth), we want to detect any new object entering the scene.

The goal of this application is to evaluate the proposed inference and compare it with well established approaches such as belief propagation (BP) [45], tree-reweighted message passing (TRW) [50, 27], mean field [49] and PatchMatch [1]. Assuming that RGB and depth information is available, we use a unary potential of the form ψ_u(l_i) = ψ_rgb(l_i) + ψ_depth(l_i). The two terms are simple differences in HSV space and a logistic function in the depth domain, as defined in [35] and [14]. Note that the unary potential is shared amongst all the baseline methods and that the pairwise term is the standard Potts model.
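The exact unary follows [35] and [14]; the snippet below is only a plausible stand-in showing the general shape of ψ_rgb + ψ_depth, where the weights and the logistic parameters k and t are our assumptions rather than the paper's values.

```python
import numpy as np

def unary_background(hsv, hsv_bg, depth, depth_bg,
                     w_rgb=1.0, w_depth=1.0, k=50.0, t=0.03):
    """Illustrative foreground unary: an HSV difference to the clean background
    plus a logistic term on the depth difference (parameters are assumptions).
    hsv, hsv_bg: (H, W, 3); depth, depth_bg: (H, W) in meters."""
    psi_rgb = np.linalg.norm(hsv.astype(np.float32) - hsv_bg.astype(np.float32), axis=2)
    d = np.abs(depth.astype(np.float32) - depth_bg.astype(np.float32))
    psi_depth = 1.0 / (1.0 + np.exp(-k * (d - t)))
    return w_rgb * psi_rgb + w_depth * psi_depth   # higher = more likely foreground
```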
We collected 100 frames of different subjects and objects, and manually labeled the foreground in the images to obtain the ground truth. We report the average energy obtained by the proposed inference and by the baselines in Table 2. It is interesting to note that, although much less computationally demanding, the proposed inference achieves final energies that are very similar to those obtained by the baselines. Some qualitative results are shown in Fig. 6.

Figure 6. Qualitative results of background subtraction. The foreground segmentation corresponds to the cyan region, which is overlaid on the gray-scale version of the color image. Note that we achieve results that are comparable with more computationally demanding approaches, but orders of magnitude faster.
Inference algorithm     Average energy
Only unaries            5.8 × 10^5
BP [45]                 2.9 × 10^5
TRW [50, 27]            2.85 × 10^5
Mean Field [49]         2.9 × 10^5
PatchMatch [1]          3.0 × 10^5
Proposed inference      2.9 × 10^5

Table 2. Quantitative evaluation of the proposed inference on a background subtraction task. We compare our method against established inference techniques. We report the average energy obtained by each approach. Note how all the propagation approaches significantly reduce the energy obtained by the initial solution (first row). Also note that although the proposed inference is less computationally demanding than the baselines, it reaches very comparable energy levels.
5. Conclusion
In this paper we presented HashMatch, an efficient
framework tailored for parallel compute architectures.
Through extensive experiments on a diverse set of computer
vision tasks, we demonstrated that although HashMatch op-
erates at extreme speeds, it makes little compromise in pre-
cision compared to more compute intensive approaches. All
these characteristics make HashMatch an appealing frame-
work for low compute mobile platforms and for products
required to operate at very high speeds.
Acknowledgements
We thank the entire perceptiveIO team for continuous
feedback and support regarding this work.
References
[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman.
PatchMatch: A randomized correspondence algorithm for
structural image editing. ACM SIGGRAPH and Transaction
On Graphics, 2009. 6,8
[2] H. H. Bauschke and P. L. Combettes. Convex analysis and
monotone operator theory in Hilbert spaces. Springer Sci-
ence & Business Media, 2011. 4
[3] J. Besag. On the statistical analysis of dirty pictures. Journal
of the Royal Statistical Society. Series B (Methodological),
pages 259–302, 1986. 2
[4] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch Stereo -
Stereo Matching with Slanted Support Windows. In BMVC,
2011. 2,5,6
[5] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating
linearized minimization for nonconvex and nonsmooth prob-
lems. Mathematical Programming, 2014. 3,4
[6] S. Boyd and L. Vandenberghe. Convex optimization. Cam-
bridge University Press, 2004. 4
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
ergy minimization via graph cuts. IEEE Transactions on pat-
tern analysis and machine intelligence, 23(11):1222–1239,
2001. 1,2
[8] M. Á. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with
binary autoencoders. In CVPR, 2015. 2,3,6,7
[9] M. S. Charikar. Similarity estimation techniques from round-
ing algorithms. In ACM symposium on Theory of computing,
pages 380–388. ACM, 2002. 2
[10] M.-M. Cheng, V. A. Prisacariu, S. Zheng, P. H. Torr, and
C. Rother. Densecut: Densely connected crfs for realtime
grabcut. In Computer Graphics Forum, volume 34, pages
193–201. Wiley Online Library, 2015. 1
[11] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect:
Training deep neural networks with binary weights during
propagations. In NIPS, pages 3123–3131, 2015. 1
[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and
Y. Bengio. Binarized neural networks: Training deep neu-
ral networks with weights and activations constrained to +1
or -1. arXiv preprint arXiv:1602.02830, 2016. 1
[13] N. Dalal and B. Triggs. Histograms of oriented gradients for
human detection. In Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society Conference
on, volume 1, pages 886–893. IEEE, 2005. 1
[14] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello,
A. Kowdle, S. Orts Escolano, C. Rhemann, D. Kim, J. Tay-
lor, P. Kohli, V. Tankovich, and S. Izadi. Fusion4d: Real-time
performance capture of challenging scenes. 2016. 8
[15] S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle, S. O.
Escolano, D. Kim, and S. Izadi. Hyperdepth: Learning depth
from structured light without matching. In CVPR, volume 2,
page 7, 2016. 6
[16] S. R. Fanello, J. Valentin, C. Rhemann, A. Kowdle,
V. Tankovich, and S. Izadi. Ultrastereo: Efficient learning-
based matching for active stereo systems. In CVPR, 2017. 5,
6,7
[17] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief
propagation for early vision. IJCV, 70(1):41–54, 2006. 2
[18] J. J. Fuchs. Spread representations. In ASILOMAR, 2011. 3
[19] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Itera-
tive quantization: A procrustean approach to learning binary
codes for large-scale image retrieval. PAMI, 35(12):2916–
2929, 2013. 2,6,7
[20] S. Han, H. Mao, and W. J. Dally. Deep compression: Com-
pressing deep neural networks with pruning, trained quanti-
zation and huffman coding. In ICLR, 2016. 1
[21] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer. Com-
pact hashing with joint optimization of search accuracy and
time. In CVPR, 2011. 2
[22] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J.
Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and 0.5 mb model size. arXiv
preprint arXiv:1602.07360, 2016. 1
[23] P. Indyk and R. Motwani. Approximate nearest neighbors:
towards removing the curse of dimensionality. In ACM sym-
posium on Theory of computing, 1998. 2,7
[24] H. Jegou, M. Douze, and C. Schmid. Searching with quan-
tization: approximate nearest neighbor search using short
codes and distance estimators. In INRIA Technical report,
2009. 7
[25] H. Jégou, T. Furon, and J. J. Fuchs. Anti-sparse coding for
approximate nearest neighbor search. In ICASSP, 2012. 3
[26] D. Koller and N. Friedman. Probabilistic graphical models:
principles and techniques. MIT press, 2009. 5
[27] V. Kolmogorov. Convergent tree-reweighted message pass-
ing for energy minimization. IEEE transactions on pattern
analysis and machine intelligence, 28(10):1568–1583, 2006.
2,8
[28] P. Krähenbühl and V. Koltun. Efficient inference in fully
connected CRFs with Gaussian edge potentials. NIPS, 2011. 2
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
NIPS, pages 1097–1105, 2012. 1
[30] D. G. Lowe. Object recognition from local scale-invariant
features. In Computer vision, 1999. The proceedings of the
seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999. 1
[31] D. G. Lowe. Distinctive image features from scale-invariant
keypoints. IJCV, 2004. 7
[32] J. Lu, H. Yang, D. Min, and M. Do. Patch match filter: Ef-
ficient edge-aware filtering meets randomized search for fast
correspondence field estimation. In CVPR, 2013. 2
[33] Y. Lyubarskii and R. Vershynin. Uncertainty principles and
vector quantization. IEEE Transactions on Information The-
ory, 2010. 3
[34] V. Nair and G. E. Hinton. Rectified Linear Units Improve
Restricted Boltzmann Machines. 2012. 3
[35] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang,
A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson,
S. Khamis, M. Dou, et al. Holoportation: Virtual 3d telepor-
tation in real-time. In Proceedings of the 29th Annual Sym-
posium on User Interface Software and Technology, pages
741–754. ACM, 2016. 8
[36] M. Raginsky and S. Lazebnik. Locality-sensitive binary
codes from shift-invariant kernels. In NIPS, 2009. 7
[37] M. Rastegari, C. Keskin, P. Kohli, and S. Izadi. Computa-
tionally bounded retrieval. In CVPR, 2015. 2,6,7
[38] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-
net: Imagenet classification using binary convolutional neu-
ral networks. In ECCV, 2016. 1
[39] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified con-
vergence analysis of block successive minimization methods
for nonsmooth optimization. SIAM Journal on Optimization,
23(2):1126–1153, 2013. 3
[40] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and
M. Gelautz. Fast cost-volume filtering for visual correspon-
dence and beyond. In CVPR, 2011. 2
[41] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer.
Optimizing binary mrfs via extended roof duality. In 2007
IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2007. 2
[42] B. Scholkopf and A. J. Smola. Learning with Kernels: Sup-
port Vector Machines, Regularization, Optimization, and Be-
yond. MIT Press, 2001. 1
[43] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. Lda-
hash: Improved matching with smaller descriptors. PAMI,
34(1):66–78, 2012. 2
[44] C. Studer, T. Goldstein, W. Yin, and R. G. Baraniuk. Demo-
cratic representations. CoRR, 2014. 3
[45] M. F. Tappen and W. T. Freeman. Comparison of graph cuts
with belief propagation for stereo, using identical MRF pa-
rameters. In ICCV, 2003. 8
[46] E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense
descriptor applied to wide-baseline stereo. PAMI, 2010. 7
[47] P. Tseng. Convergence of a block coordinate descent method
for nondifferentiable minimization. Journal of optimization
theory and applications, 109(3):475–494, 2001. 3
[48] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton,
P. Kohli, M. Nießner, A. Criminisi, S. Izadi, and P. Torr. Se-
manticpaint: Interactive 3d labeling and learning at your fin-
gertips. ACM Transactions on Graphics (TOG), 2015. 1
[49] V. Vineet, J. Warrell, and P. H. S. Torr. Filter-based mean-
field inference for random fields with higher-order terms and
product label-spaces. 2012. 8
[50] M. Wainwright, T. Jaakkola, and A. Willsky. Map estimation
via agreement on (hyper)trees: Message-passing and linear
programming approaches. IEEE Transactions on Informa-
tion Theory, 2002. 8
[51] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity
search: A survey. arXiv preprint arXiv:1408.2927, 2014. 2
[52] S. Wang, S. R. Fanello, C. Rhemann, S. Izadi, and P. Kohli.
The global patch collider. CVPR, 2016. 2
[53] S. Wang and N. Rahnavard. Binary compressive sensing via
sum of l1-norm and l(infinity)-norm regularization. In MIL-
COM, 2013. 3
[54] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In
NIPS, 2009. 2
[55] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping
in gradient descent learning. Constructive Approximation,
2007. 4
[56] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant
binary embedding. In ICML, 2014. 2,6,7
[57] H. Zou and T. Hastie. Regularization and variable selection
via the elastic net. In Journal of the Royal Statistical Society,
2005. 4
... II. RELATED WORK Learning-based active stereo has had limited research in recent years. Prior to the deep learning era, frameworks for learning embeddings where matching can be performed more efficiently were explored [16], [17], [52] together with direct mapping from pixel intensities to depth [14], [15]. These methods have failed in general textureless scenes due to shallow architectures and local optimization schemes. ...
Preprint
Full-text available
Active stereo systems are widely used in the robotics industry due to their low cost and high quality depth maps. These depth sensors, however, suffer from stereo artefacts and do not provide dense depth estimates. In this work, we present the first self-supervised depth completion method for active stereo systems that predicts accurate dense depth maps. Our system leverages a feature-based visual inertial SLAM system to produce motion estimates and accurate (but sparse) 3D landmarks. The 3D landmarks are used both as model input and as supervision during training. The motion estimates are used in our novel reconstruction loss that relies on a combination of passive and active stereo frames, resulting in significant improvements in textureless areas that are common in indoor environments. Due to the non-existence of publicly available active stereo datasets, we release a real dataset together with additional information for a publicly available synthetic dataset needed for active depth completion and prediction. Through rigorous evaluations we show that our method outperforms state of the art on both datasets. Additionally we show how our method obtains more complete, and therefore safer, 3D maps when used in a robotic platform
... Instead of random initialization, SOS explores the prior information of cost space to obtain high-quality initial labels. Moreover, HashMatch [20] based propagation and inference are used to efficiently fix the matching errors in initialization. The computational complexity of SOS is proved to be O(1) and according to [19], the theoretical throughput can be as high as 4000 frames per second (FPS) on modern high-end GPU. ...
... In the state-of-the-art work by Dou et al. [188] with depth maps generated up to 500Hz [189,190], a detail layer is computed to capture the high-frequency details and atlas mapping is applied to improve the color fidelity. Our rendering system is compatible with the new fusion pipeline, by integrating the computation of seams, geodesic fields, and view-dependent rendering modules. ...
Thesis
In spite of the dramatic growth of virtual and augmented reality (VR and AR) technology, content creation for immersive and dynamic virtual environments remains a significant challenge. In this dissertation, we present our research in fusing multimedia data, including text, photos, panoramas, and multi-view videos, to create rich and compelling virtual environments. First, we present Social Street View, which renders geo-tagged social media in its natural geo-spatial context provided by 360° panoramas. Our system takes into account visual saliency and uses maximal Poisson-disc placement with spatiotemporal filters to render social multimedia in an immersive setting. We also present a novel GPU-driven pipeline for saliency computation in 360° panoramas using spherical harmonics (SH). Our spherical residual model can be applied to virtual cinematography in 360° videos. We further present Geollery, a mixed-reality platform to render an interactive mirrored world in real time with three-dimensional (3D) buildings, user-generated content, and geo-tagged social media. Our user study has identified several use cases for these systems, including immersive social storytelling, experiencing the culture, and crowd-sourced tourism. We next present Video Fields, a web-based interactive system to create, calibrate, and render dynamic videos overlaid on 3D scenes. Our system renders dynamic entities from multiple videos, using early and deferred texture sampling. Video Fields can be used for immersive surveillance in virtual environments. Furthermore, we present VRSurus and ARCrypt projects to explore the applications of gestures recognition, haptic feedback, and visual cryptography for virtual and augmented reality. Finally, we present our work on Montage4D, a real-time system for seamlessly fusing multi-view video textures with dynamic meshes. We use geodesics on meshes with view-dependent rendering to mitigate spatial occlusion seams while maintaining temporal consistency. Our experiments show significant enhancement in rendering quality, especially for salient regions such as faces. We believe that Social Street View, Geollery, Video Fields, and Montage4D will greatly facilitate several applications such as virtual tourism, immersive telepresence, and remote education.
Article
Full-text available
Real-world three-dimensional reconstruction is a project of long-standing interest in global computer vision. Many tools have emerged these past years to accurately perceive the surrounding world either through active sensors or through passive algorithmic methods. With the advent and popularization of augmented reality on smartphones, new visualization issues have emerged concerning the virtual experience. Especially a 3D model seems to be essential to provide more realistic AR effects including consistency of occlusion, shadow mapping or even collision between virtual objects and real environment. However, due to the huge computation of most of current approaches, most of these algorithms are working on a computer desktop or high-end smartphones. Indeed, the reconstruction scale is rapidly limited by the complexity of both computation and memory. Therefore, our study aims to find a relevant method to process real time reconstruction of close-range outdoor scenes such as cultural heritage or underground infrastructures in real time locally on a smartphone.
Article
Active stereo systems are used in many robotic applications that require 3D information. These depth sensors, however, suffer from stereo artefacts and do not provide dense depth estimates. In this work, we present the first self-supervised depth completion method for active stereo systems that predicts accurate dense depth maps. Our system leverages a feature-based visual inertial SLAM system to produce motion estimates and accurate (but sparse) 3D landmarks. The 3D landmarks are used both as model input and as supervision during training. The motion estimates are used in our novel reconstruction loss that relies on a combination of passive and active stereo frames, resulting in significant improvements in textureless areas that are common in indoor environments. Due to the non-existence of publicly available active stereo datasets, we release a real dataset together with additional information for a publicly available synthetic dataset (TartanAir [30] needed for active depth completion and prediction. Through rigorous evaluations we show that our method outperforms state of the art on both datasets. Additionally we show how our method obtains more complete, and therefore safer, 3D maps when used in a robotic platform.
Article
The increasing demand for 3D content in augmented and virtual reality has motivated the development of volumetric performance capture systemsnsuch as the Light Stage. Recent advances are pushing free viewpoint relightable videos of dynamic human performances closer to photorealistic quality. However, despite significant efforts, these sophisticated systems are limited by reconstruction and rendering algorithms which do not fully model complex 3D structures and higher order light transport effects such as global illumination and sub-surface scattering. In this paper, we propose a system that combines traditional geometric pipelines with a neural rendering scheme to generate photorealistic renderings of dynamic performances under desired viewpoint and lighting. Our system leverages deep neural networks that model the classical rendering process to learn implicit features that represent the view-dependent appearance of the subject independent of the geometry layout, allowing for generalization to unseen subject poses and even novel subject identity. Detailed experiments and comparisons demonstrate the efficacy and versatility of our method to generate high-quality results, significantly outperforming the existing state-of-the-art solutions.
Chapter
Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow-depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches to show that our method offers a substantial improvement over previous works.
Article
Full-text available
The nearest neighbor problem is the following: Given a set of n points P in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to the query point q in X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R-d under some l-p norm.
Book
A comprehensive introduction to Support Vector Machines and related kernel methods. In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs—-kernels—for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm. They are replacing neural networks in a variety of fields, including engineering, information retrieval, and bioinformatics. Learning with Kernels provides an introduction to SVMs and related kernel methods. Although the book begins with the basics, it also includes the latest research. It provides all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms and to understand and apply the powerful algorithms that have been developed over the last few years.
Article
A continuous two‐dimensional region is partitioned into a fine rectangular array of sites or “pixels”, each pixel having a particular “colour” belonging to a prescribed finite set. The true colouring of the region is unknown but, associated with each pixel, there is a possibly multivariate record which conveys imperfect information about its colour according to a known statistical model. The aim is to reconstruct the true scene, with the additional knowledge that pixels close together tend to have the same or similar colours. In this paper, it is assumed that the local characteristics of the true scene can be represented by a non‐degenerate Markov random field. Such information can be combined with the records by Bayes' theorem and the true scene can be estimated according to standard criteria. However, the computational burden is enormous and the reconstruction may reflect undesirable large‐scale properties of the random field. Thus, a simple, iterative method of reconstruction is proposed, which does not depend on these large‐scale characteristics. The method is illustrated by computer simulations in which the original scene is not directly related to the assumed random field. Some complications, including parameter estimation, are discussed. Potential applications are mentioned briefly.
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
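A hedged sketch of computing such a descriptor with scikit-image's hog follows; the parameter values simply mirror the configuration the abstract highlights (fine orientation binning, relatively coarse spatial binning, overlapping normalised blocks) and are not tuned for any particular detector.

```python
from skimage import color, data
from skimage.feature import hog

# Any grey-level H x W image works; the astronaut test image is just convenient.
img = color.rgb2gray(data.astronaut())

descriptor = hog(
    img,
    orientations=9,            # fine orientation binning
    pixels_per_cell=(8, 8),    # relatively coarse spatial binning
    cells_per_block=(2, 2),    # overlapping descriptor blocks
    block_norm='L2-Hys',       # local contrast normalisation
)
print(descriptor.shape)        # one long feature vector per detection window
```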
Article
Semantic hashing [1] seeks compact binary codes of data points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding the best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.
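An in-sample version of this recipe, binary codes obtained by thresholding the smallest non-trivial eigenvectors of a k-NN graph Laplacian, can be sketched as below; the full method additionally handles out-of-sample points analytically, and the dataset, neighbourhood size, and code length here are arbitrary.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                 # toy dataset: 500 points in R^16

# Build a k-NN affinity graph and its (unnormalised) graph Laplacian.
W = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
W = 0.5 * (W + W.T)                            # symmetrise the graph
L = laplacian(W)

# Thresholding the smallest non-trivial eigenvectors at zero gives one bit
# per eigenvector; the trivial constant eigenvector is dropped.
n_bits = 8
_, vecs = np.linalg.eigh(L.toarray())
codes = (vecs[:, 1:n_bits + 1] > 0).astype(np.uint8)

# Items whose codes are close in Hamming distance are treated as similar.
hamming_01 = np.count_nonzero(codes[0] != codes[1])
```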
Conference Paper
Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline of pruning, quantization and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduces the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy, and reduces the size of VGG16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows the model to fit in on-chip SRAM cache rather than off-chip DRAM, whose accesses cost roughly 180x more energy.
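A toy NumPy rendition of the first two stages, magnitude pruning followed by k-means weight sharing, is sketched below; the sparsity level, the number of clusters, and the omission of Huffman coding and retraining are simplifications for illustration only.

```python
import numpy as np

def prune_and_share(weights, sparsity=0.9, n_clusters=32, n_iters=10):
    """Magnitude pruning followed by k-means weight sharing (toy version)."""
    w = weights.ravel().astype(float).copy()

    # 1) Pruning: zero out the smallest-magnitude connections.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    w[~mask] = 0.0

    # 2) Quantisation: cluster the surviving weights so that many
    #    connections share one of n_clusters centroid values
    #    (5 bits per connection when n_clusters = 32).
    survivors = w[mask]
    centroids = np.linspace(survivors.min(), survivors.max(), n_clusters)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(survivors[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            members = survivors[assign == k]
            if members.size:
                centroids[k] = members.mean()
    w[mask] = centroids[assign]

    # 3) In the full pipeline the cluster indices would additionally be
    #    Huffman coded and the centroids fine-tuned; omitted here.
    return w.reshape(weights.shape), mask.reshape(weights.shape)
```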
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state of the art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
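A simplified single-branch PyTorch sketch of the described architecture (five convolutional layers, three fully-connected layers, dropout, 1000-way output) is given below; it follows the commonly cited layer sizes rather than the exact two-GPU split of the original, so treat the widths as illustrative.

```python
import torch
import torch.nn as nn

class AlexNetLike(nn.Module):
    """Simplified single-branch variant of the architecture described above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),     # softmax is applied inside the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetLike()(torch.randn(1, 3, 224, 224))   # -> shape (1, 1000)
```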
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally verification through a least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
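A minimal OpenCV sketch of the matching stage, nearest-neighbour descriptor matching followed by the ratio test against the second-nearest neighbour, is shown below; the image file names are placeholders, SIFT support assumes OpenCV 4.4 or later, and the subsequent Hough clustering and pose verification are omitted.

```python
import cv2

# Detect SIFT keypoints and descriptors in two views of the same object/scene.
img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # placeholder path
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)   # placeholder path

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbour matching, keeping only matches that pass the ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(len(good), 'putative correspondences')
```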
Book
This reference text, now in its second edition, offers a modern unifying presentation of three basic areas of nonlinear analysis: convex analysis, monotone operator theory, and the fixed point theory of nonexpansive operators. Taking a unique comprehensive approach, the theory is developed from the ground up, with the rich connections and interactions between the areas as the central focus, and it is illustrated by a large number of examples. The Hilbert space setting of the material offers a wide range of applications while avoiding the technical difficulties of general Banach spaces. The authors have also drawn upon recent advances and modern tools to simplify the proofs of key results, making the book more accessible to a broader range of scholars and users. Combining a strong emphasis on applications with exceptionally lucid writing and an abundance of exercises, this text is of great value to a large audience including pure and applied mathematicians as well as researchers in engineering, data science, machine learning, physics, decision sciences, economics, and inverse problems. The second edition of Convex Analysis and Monotone Operator Theory in Hilbert Spaces greatly expands on the first edition, containing over 140 pages of new material, over 270 new results, and more than 100 new exercises. It features a new chapter on proximity operators including two sections on proximity operators of matrix functions, in addition to several new sections distributed throughout the original chapters. Many existing results have been improved, and the list of references has been updated. Heinz H. Bauschke is a Full Professor of Mathematics at the Kelowna campus of the University of British Columbia, Canada. Patrick L. Combettes, IEEE Fellow, was on the faculty of the City University of New York and of Université Pierre et Marie Curie – Paris 6 before joining North Carolina State University as a Distinguished Professor of Mathematics in 2016.
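Since the blurb singles out the new material on proximity operators, a one-function example helps fix the concept: the proximity operator of a scaled l1 norm is element-wise soft-thresholding. This is a standard fact rather than anything specific to the book, and the numbers below are only a check.

```python
import numpy as np

def prox_l1(v, lam):
    """Proximity operator of lam * ||.||_1, i.e. element-wise soft-thresholding:
    prox_{lam f}(v) = argmin_x f(x) + (1 / (2 * lam)) * ||x - v||^2 with f = ||.||_1.
    """
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(v, 0.5))   # -> [-1.5  0.   0.   1. ]
```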