Low Compute and Fully Parallel Computer Vision with HashMatch
Sean Ryan Fanello1∗  Julien Valentin1∗  Adarsh Kowdle1  Christoph Rhemann1
Vladimir Tankovich1  Carlo Ciliberto2  Philip Davidson1  Shahram Izadi1
1perceptiveIO   2University College London
∗ Authors contributed equally to this work.
Abstract
Numerous computer vision problems, such as stereo depth estimation, object-class segmentation and foreground/background segmentation, can be formulated as per-pixel image labeling tasks. Given one or many images as input, the desired output of these methods is usually a spatially smooth assignment of labels. The large number of such computer vision problems has led to significant research efforts, with the state of the art moving from CRF-based approaches to deep CNNs and, more recently, hybrids of the two. Although these approaches have significantly advanced the state of the art, the vast majority have solely focused on improving quantitative results and are not designed for low-compute scenarios. In this paper, we present a new general framework for a variety of computer vision labeling tasks, called HashMatch. Our approach is designed to be both fully parallel, i.e. each pixel is processed independently, and low-compute, with a model complexity an order of magnitude lower than existing CNN- and CRF-based approaches. We evaluate HashMatch extensively on several problems such as disparity estimation, image retrieval, feature approximation and background subtraction, for which HashMatch achieves high computational efficiency while producing high quality results.
1. Introduction
Since the groundbreaking work of Krizhevsky et al. [29], deep learning is now the method of choice for a variety of computer vision problems. Although significant efforts have been undertaken to improve the performance of CNNs on various labeling tasks, these models are still far from being computationally efficient. Most of the work on efficient deep learning focuses on compressing deep models without losing precision and accuracy [38, 20, 11]. For example, in [38] the authors train a deep architecture that uses binary weights for both the input and the filters. In [20], the authors remove redundant connections and force multiple neurons to share the same quantized weights. Others, like [22], design more compact layers that use a reduced number of parameters. Similar to [38], the methods proposed in [11, 12] binarize the full network. However, these solutions still require many computational layers that involve multiple convolutions to infer per-pixel labels. Although they improve efficiency from the computational perspective, they suffer from accessing image patches stored in memory multiple times, and hence these algorithms are both memory- and compute-bound.
Prior to the deep learning era, Conditional Random
Fields (CRFs) were used as one of the major tools for image
labeling problems. In their ‘simplest’ form, CRFs are com-
posed of a pairwise term, encouraging structural coherence
in the solution and a unary term, which is responsible for
modeling the compatibility between each data point/pixel
and a pre-defined set of labels. Machine learning is com-
monly used to predict this compatibility function. It is ac-
cepted that the more sophisticated the unary potential, the
better the results obtained after solving the CRF. As a con-
sequence, practitioners tend to use expensive feature repre-
sentations, e.g. HOG [13], SIFT [30] or even deep-learning
intermediate representations, followed by advanced classi-
fiers such as kernel-SVM [42]. Once the unary potential has
been estimated for each pixel, the actual CRF inference can
be performed. Unfortunately, solving multi-label CRFs is an NP-hard problem [7]. The high demand for fast and accurate solvers has resulted in significant research efforts in this space, with each solver offering different trade-offs in terms of compute and closeness to the posterior captured by the CRF. It is worth
noting that it is possible to compute unary potentials and
solve the CRF at interactive rates (e.g. [10,48]) but these
approaches still have some sequential steps and have large
model complexity.
In this paper, we propose to bridge this gap in the liter-
ature by introducing HashMatch, a generic and extremely
low-compute framework that has been designed from the
ground-up for parallel, i.e. pixel independent, processing.
As demonstrated in this paper, the efficiency and parallel nature of our approach allow us to achieve compelling results on a variety of computer vision problems, at speeds never before demonstrated: for example, we estimate disparity or segmentation masks on 1.3 megapixel images at 1000 fps on high-end GPUs (e.g. NVIDIA Titan X) and on VGA images at 200 fps on mobile architectures (e.g. Tegra TX1), while providing compelling results on multiple computer vision tasks. Our main technical contributions are
two-fold. First we propose a binary embedding for classifi-
cation, regression and nearest neighbor tasks that is trained
with sparsity and anti-sparsity constraints. This allows eval-
uating a robust unary potential with only a few operations
per pixel. Second, we present a new inference scheme that
fully operates in parallel with complexity that is not a func-
tion of the size of the solution space. These technical contri-
butions are formulated in a mathematical framework whose
objective function is directly applicable to many different
computer vision problems.
2. Related Work
Binary Representations. The task of finding binary and
compact representations has been exhaustively studied in
the literature. This is usually known as hashing and the
problem is generally formulated as:
b = h(x)    (1)

with x ∈ R^n, b a binary code in {0,1}^k, and h the hashing function. h can be a linear projection, a spherical function, a kernel, a neural network, a non-parametric function, etc. Here we focus on the family of linear hash functions of the form h(x) = sign(xW), with W ∈ R^{n×k}, where sign(x) returns 1 if x ≥ 0 and 0 otherwise. The most popular data-independent approach to generate such hash functions is Locality-Sensitive Hashing (LSH) [23, 9]. These methods usually use random Gaussian projections to generate the hyperplanes W. Despite their simplicity, these
hashing schemes perform reasonably well in practice. For
an extensive review of LSH, we refer the reader to [51].
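For concreteness, the following sketch (NumPy; the dimensions and the toy data are illustrative assumptions, not values from the paper) shows the data-independent LSH instance of Eq. (1): random Gaussian hyperplanes map descriptors to binary codes via sign(xW).

```python
import numpy as np

def lsh_hash(X, W):
    """Map row-wise descriptors X (m x n) to binary codes (m x k) via sign(XW)."""
    return (X @ W >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
n, k = 960, 32                       # descriptor and code size (illustrative)
W = rng.standard_normal((n, k))      # random Gaussian hyperplanes (data independent)
X = rng.standard_normal((1000, n))   # toy descriptors
B = lsh_hash(X, W)                   # binary codes in {0, 1}^k
```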
Data dependent hashing schemes have also been pro-
posed [19,37,8,54,21,43,56]. These methods are usu-
ally unsupervised and they try to design an objective func-
tion that preserves the similarities of the input signal in the
new binary space. Iterative Quantization (ITQ) [19] obtains low-dimensional codes by applying PCA to the data and then finding the optimal rotation that makes the codes as close as possible in the binary space. [8] casts the hashing problem as an auto-encoder formulation; however, part of the optimization resorts to enumerating all possible solutions, leading to very slow training procedures. In [37], the authors exploit sparsity so that the runtime cost is independent of the original signal dimensionality. However, an ℓ2-normalization of the signal is performed first, therefore the overall running cost still depends on the input dimension. More recently, [52] uses an ensemble of fast decision trees to model hash functions; however, the method requires a non-trivial aggregation step, adding compute.
Our work differs substantially from prior art. Whereas most approaches focus on nearest neighbor tasks, our framework is flexible and can be used on tasks ranging from classification to signal reconstruction. Differently from [8, 19], we heavily exploit sparsity at runtime and remove normalization steps such as those in [37].
Inference in Graphical Models. Estimating the Maxi-
mum A Posteriori (MAP) of a multi-label Conditional Ran-
dom Field (CRF) is a well-known NP-hard problem
[7]. Given the successful use of CRFs for advancing the
state of the art, performing (approximate) inference over
those models has received much attention. The major dif-
ference among all the solvers is whether they are determin-
istic or stochastic. Stochastic methods include the family of
Markov Chain Monte-Carlo methods, which have exponen-
tial convergence rate in the worst case but can provide exact
results. This family of methods is regarded as too compute
intensive for real-time applications that make use of ‘large’
CRFs. Besides stochastic techniques resides a wide array
of deterministic methods. Successful techniques include
Move-Making algorithms [7], Belief Propagation [17], It-
erated Conditional Modes [3], Tree Reweighted Message
Passing [27], Quadratic Pseudo-Boolean Optimization [41]
and Mean-Field [28]. Each of these techniques comes with
various trade-offs in terms of quality of the approximation
and speed.
It is worth noting that when the data term is strong and
high-speed inference is required (e.g. depth estimation at
VGA resolution), global optimization of the posterior is
usually dropped in favor of local optimization [40,4,32].
Thanks to our learned binary representation, we can provide a very strong data term. This low-entropy data term allows us to design a new parallel global inference method that reaches high quality solutions for 1.3 megapixel images in less than one millisecond on GPUs.
3. The HashMatch Framework
Our framework is based on a pairwise CRF that can be expressed using the following probabilistic factorization:

P(Y|D) = (1 / Z(D)) e^{−E(Y|D)}    (2)

E(Y|D) = Σ_i ψ_u(l_i) + Σ_i Σ_{j∈N_i} ψ_p(l_i, l_j)    (3)

The data term ψ_u(l_i) models how likely it is that a node in the graph (usually a pixel) belongs to a particular class l_i (e.g. ‘foreground’). The exact implementation of ψ_u(l_i) depends on the task at hand. For instance, for finding the nearest neighbor between image patches, labels l_i correspond to vectors (u, v) which define the displacements along the image directions. Then

ψ_u(l_i) = |h(x_i) − h(x_{i+l_i})|    (4)

measures the compatibility of two image patches x centered at 2D pixel locations i and i + l_i. The function h(x) = sign(xW) is a binary feature, which allows us to compute ψ_u(l_i) highly efficiently via the Hamming distance in (4). For any other classification or regression problem we define ψ_u as

ψ_u(l_i) = −log(g(l_i, h(x_i)))    (5)

where g is a learned classifier or regressor that evaluates the likelihood of label l_i given the binary code h(x_i) of an image patch x_i. The smoothness cost is defined as ψ_p(x_i = l_i, x_j = l_j) = max(τ, |l_i − l_j|) and it encourages neighboring pixels i and j to be assigned similar labels, where τ is a truncation threshold.
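To make the data term of Eq. (4) concrete, the sketch below (NumPy; the dense per-pixel code layout and the candidate displacement set are assumptions for illustration, not our implementation) evaluates the Hamming-distance unary for a handful of displacement hypotheses.

```python
import numpy as np

def hamming(b1, b2):
    """Hamming distance between two binary codes stored as {0,1} vectors."""
    return int(np.count_nonzero(b1 != b2))

def unary_costs(codes_a, codes_b, row, col, candidate_shifts):
    """psi_u(l) = |h(x_i) - h(x_{i+l})| of Eq. (4) for candidate displacements l,
    given precomputed per-pixel codes of two images with shape (H, W, k)."""
    height, width = codes_b.shape[:2]
    costs = {}
    for (du, dv) in candidate_shifts:          # l = (du, dv) displacement hypotheses
        r, c = row + dv, col + du
        if 0 <= r < height and 0 <= c < width:
            costs[(du, dv)] = hamming(codes_a[row, col], codes_b[r, c])
    return costs
```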
Our key contribution is a novel method to compute h(x) and g(·) that captures the essential information in the data; it is detailed in the remainder of this section.
3.1. HashCodes Learning
We now detail our proposed approach to train a function h that maps a signal x ∈ R^n to a binary space b = h(x) = sign(xW) ∈ {0,1}^k. This binary representation b is then used to learn a function g(l, b) that performs any given task y ∈ R^d. It is important to note that y can correspond to tasks as diverse as multi-label classification and structured regression. In particular, one can define y = x for nearest-neighbor search. In order to keep the computational cost as low as possible, we consider a linear model for each entry y_l, i.e. y_l = g(l, b) = b^T z_l.
More formally, we learn a set of hyperplanes W ∈ R^{n×k} and a task function Z ∈ R^{k×d} that minimize a loss L:

min_{W,Z}  L(sign(XW) Z, Y) + Γ(W) + Ω(Z)    (6)

where X ∈ R^{m×n} and Y ∈ R^{m×d} are matrices whose i-th rows correspond to x_i and y_i respectively. The terms Γ(W) and Ω(Z) are suitable regularizers encouraging desired structures on the two predictors.
The model y = Z^T sign(W^T x) can be interpreted as a neural network with one hidden layer and with the operator sign(·) as non-linearity (in contrast to a sigmoid or ReLU [34]). In particular, when Y = X, the model becomes similar to an autoencoder with an internal binary representation (e.g. [8]). However, the optimization of Eq. (6) cannot be performed using first-order methods (e.g. backpropagation) because sign(XW) is a piece-wise constant function (and therefore the subgradient with respect to W is zero almost everywhere). We circumvent this issue by decoupling the task y from the binary mapping h(x). To do so, we introduce an additional variable B = sign(XW) and then relax the equality constraint by means of a dissimilarity measure D(XW, B) that will be minimized. This leads to the problem

min_{W,Z,B}  L(BZ, Y) + Γ(W) + Ω(Z) + γ D(XW, B)   s.t.  ‖B‖_∞ ≤ µ    (7)

where ‖B‖_∞ = max_{ij} |B_{ij}| denotes the ℓ_∞ (or max) norm of B and µ > 0 is a scalar hyperparameter. The constraint ‖B‖_∞ ≤ µ is introduced to encourage so-called anti-sparse solutions, such that the minimizer B* of Eq. (7) would have all entries B*_{ij} = ±µ. The concept of anti-sparsity was originally introduced in the signal processing literature, where it was observed that imposing constraints based on the max-norm induces ‘binary’ solutions. We refer the reader to [33, 18, 25, 53, 44] for an in-depth discussion of the anti-sparse properties of max-norm regularization (and constraints). The idea of introducing a variable B with binary entries is akin to the one proposed in [8]. However, in that work the authors imposed the constraint B ∈ {−1, 1} in the optimization, leading to an NP-hard problem. The max-norm constraint ball, on the other hand, is a convex set, and in the following we discuss an efficient optimization algorithm to find B in practice.
Interestingly, when B is a binary matrix with entries equal to ±µ, the problem of learning a linear predictor W such that (1/µ) B ∼ sign(XW) corresponds to a standard multi-label (or multi-task) binary classification problem, with each column of B representing a different binary task. Therefore, a natural choice for the dissimilarity measure D(XW, B) is a loss function used for classification problems, such as the logistic, hinge, or least-squares loss.
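For reference, a minimal sketch of the relaxed objective of Eq. (7), assuming the least-squares choices for L and D and the regularizers Γ(W) = λ|W|_1 and Ω(Z) = η‖Z‖_F^2 adopted in Sec. 3.2; it simply evaluates the functional, e.g. to check that the objective decreases during optimization.

```python
import numpy as np

def relaxed_objective(W, Z, B, X, Y, lam, eta, gamma, mu):
    """Value of the relaxed problem of Eq. (7) with least-squares L and D."""
    assert np.max(np.abs(B)) <= mu + 1e-9, "B must lie inside the max-norm ball"
    loss = np.sum((B @ Z - Y) ** 2)            # L(BZ, Y)
    reg_w = lam * np.sum(np.abs(W))            # Gamma(W) = lambda |W|_1
    reg_z = eta * np.sum(Z ** 2)               # Omega(Z) = eta ||Z||_F^2
    dissim = gamma * np.sum((X @ W - B) ** 2)  # gamma D(XW, B)
    return loss + reg_w + reg_z + dissim
```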
3.2. Optimization
The optimization problem described by Eq. (7) is not jointly convex in W, Z, B but, for convex loss functions L, D and regularizers Γ, Ω, the objective functional is convex separately in each variable. A natural strategy to address this problem is therefore to perform either alternated minimization [47, 39] or block coordinate descent [5]. Below we detail how we propose to perform the minimization of Eq. (7) by optimizing W, B and Z independently.
Optimizing W. We propose to use the regularizer Γ(W) = λ|W|_1 in order to induce sparse solutions. In particular, in our experiments λ is chosen so that the corresponding solution W* has at most s ≪ n non-zero entries in each column. Indeed, this allows computing a code h(x) = sign(W_*^T x) in O(sk) operations rather than O(nk). When the dissimilarity measure D is smooth, one can employ a standard proximal forward-backward splitting method to find the best W for fixed B and Z. The algorithm consists of producing a sequence of updates defined by

W_{t+1} = Prox_{σ_1 λ |·|_1}(W_t − σ_1 γ X^T ∇D(XW_t, B))    (8)

where Prox_{σ_1 λ |·|_1} denotes the proximal operator of σ_1 λ |·|_1 (see [2]). For the ℓ_1 norm, the proximal operator is well known to correspond to entry-wise soft-thresholding [2]: for each scalar w, Prox_{σ_1 λ |·|_1}(w) = 0 if |w| ≤ σ_1 λ and Prox_{σ_1 λ |·|_1}(w) = w − sign(w) σ_1 λ otherwise. For a suitable choice of step size σ_1 (obtained either by line search or from the Lipschitz constant of the gradient of D), iterating Eq. (8) is guaranteed to converge to the solution W*, with the value of the objective functional decreasing at a rate of O(1/t) [2]. Following [57], we also use an early stopping criterion to fix the desired number of variables with the highest absolute values for each column of W.
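A hedged sketch of the update of Eq. (8), assuming the least-squares dissimilarity D(XW, B) = ‖XW − B‖^2 adopted in Sec. 3.2, so that the gradient of γD with respect to W is 2γ X^T (XW − B).

```python
import numpy as np

def soft_threshold(A, thr):
    """Entry-wise soft-thresholding: the proximal operator of thr * |.|_1."""
    return np.sign(A) * np.maximum(np.abs(A) - thr, 0.0)

def update_W(W, X, B, lam, gamma, sigma1):
    """One forward-backward step of Eq. (8) with D(XW, B) = ||XW - B||^2."""
    grad = 2.0 * gamma * X.T @ (X @ W - B)               # gradient of the smooth part
    return soft_threshold(W - sigma1 * grad, sigma1 * lam)
```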
Optimizing B. If L is smooth, one can again adopt the proximal forward-backward splitting approach to minimize Eq. (7) w.r.t. B (for fixed W and Z), obtaining the updates

B̃_{t+1} = B_t − σ_2 (∇L(B_t Z, Y) Z^T + γ ∇D(XW, B_t))
B_{t+1} = Prox_{{‖·‖_∞ ≤ µ}}(B̃_{t+1})    (9)

To compute the proximal operator, we make use of the Moreau decomposition, stating that for any function φ, Prox_φ(B) = B − Prox_{φ*}(B), with φ* denoting the Fenchel conjugate of φ (defined as φ*(B) = sup_{C∈R^{m×k}} tr(B^T C) − φ(C)). In our case, φ can be interpreted as the indicator function of the max-norm ball of radius µ (namely the function that is zero when ‖B‖_∞ ≤ µ and +∞ otherwise). It is straightforward to show that the corresponding Fenchel conjugate is φ*(B) = µ|B|_1. As a consequence we have

B_{t+1} = B̃_{t+1} − Prox_{µ|·|_1}(B̃_{t+1})    (10)

with Prox_{µ|·|_1} the entry-wise soft-thresholding operator introduced for the optimization of W. We again obtain a convergence rate of the order of O(1/t) [2].
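A small sketch of the B update of Eqs. (9)–(10), again assuming the least-squares L and D of Sec. 3.2. Note that the Moreau identity B − Prox_{µ|·|_1}(B) coincides with entry-wise clipping of B to [−µ, µ], i.e. projection onto the max-norm ball.

```python
import numpy as np

def project_maxnorm_ball(B, mu):
    """Prox of the indicator of {||B||_inf <= mu} via Eq. (10):
    B - soft_threshold(B, mu), identical to entry-wise clipping to [-mu, mu]."""
    return B - np.sign(B) * np.maximum(np.abs(B) - mu, 0.0)

def update_B(B, W, Z, X, Y, gamma, mu, sigma2):
    """One forward-backward step of Eq. (9) with least-squares L and D."""
    grad = 2.0 * (B @ Z - Y) @ Z.T + 2.0 * gamma * (B - X @ W)  # grad of L + gamma D w.r.t. B
    return project_maxnorm_ball(B - sigma2 * grad, mu)
```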
Optimizing Z. We consider the Frobenius norm regularizer Ω(Z) = η‖Z‖^2 to avoid overfitting. This problem can be solved by standard gradient descent updates

Z_{t+1} = Z_t − σ_3 (B^T ∇L(B Z_t, Y) + η Z_t)    (11)

which are known to converge for a suitable choice of the step size σ_3. Convergence rates of O(1/t) are guaranteed also in this case [2]. Faster rates can be achieved by adding further hypotheses on the conditioning of the loss L [6]. Moreover, if we consider L(BZ, Y) = ‖BZ − Y‖^2, we can compute the solution to the problem in closed form as Z* = (B^T B + ηI)^{−1} B^T Y, with I the k×k identity matrix.
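The closed-form solution for the least-squares case is a standard ridge regression solve; a minimal sketch:

```python
import numpy as np

def solve_Z(B, Y, eta):
    """Closed-form minimizer Z* = (B^T B + eta I)^(-1) B^T Y for the least-squares loss."""
    k = B.shape[1]
    return np.linalg.solve(B.T @ B + eta * np.eye(k), B.T @ Y)
```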
Convergence Rates of Block Coordinate Descent. Block coordinate descent methods consist of iterating across the steps in Eqs. (8), (9) and (11), optimizing over one variable at a time while keeping the other two fixed. In general, it is challenging to prove convergence of the iterations to a stationary point (e.g. local minima or saddles), let alone prove rates on how fast such convergence can be guaranteed. However, for the choice of loss functions and regularizers considered in this work, the proposed approach belongs to the family of Proximal Alternating Linearized Minimization (PALM) optimization methods [5], whose convergence properties have recently been studied. As a corollary to Theorem 1 and Remark 6 in [5] we have the following.

Theorem 1 (Convergence of PALM). With the notation of Eq. (7), let L and D satisfy the hypotheses in [5] (Thm. 1); in particular, they are differentiable and have Lipschitz continuous gradients with associated Lipschitz constants L_L and L_D respectively. Consider the sequence (W_t, B_t, Z_t)_{t=0}^∞ obtained by updating each variable iteratively according to Eqs. (8), (9) and (11) with step sizes σ_1 ≤ (γ L_D ‖X‖_op)^{−1}, σ_2 ≤ (η L_L + γ L_D)^{−1}, σ_3 ≤ (µ m L_L + η)^{−1} respectively (with m defined in Thm. 3.1 of [5], and ‖X‖_op denoting the operator norm of X, namely the maximum singular value of X). Then there exists a stationary point (W*, B*, Z*) of the functional in Eq. (7) such that

‖(W_t, B_t, Z_t) − (W*, B*, Z*)‖ = O(1/t)    (12)

The above theorem states that for specific choices of the descent steps σ_1, σ_2, σ_3, one can expect convergence to a stationary point at a sublinear rate of the order of O(1/t).
Choice of L and D. The relaxed problem described in Eq. (7) and the optimization approach described above apply to any choice of convex smooth loss function L and dissimilarity measure D. In the experiments reported in this work we adopted the least-squares loss function L(BZ, Y) = ‖BZ − Y‖^2 and D(XW, B) = ‖XW − B‖^2, for which it is easy to recover the Lipschitz constant of the gradient and derive the descent step sizes σ_1, σ_2, σ_3 in Thm. 1. This choice was also motivated by the fact that least-squares is the standard loss function for regression and reconstruction problems (hence a natural choice for L) but is also often used in classification settings [55] (hence a viable choice for the dissimilarity D). In the supplementary material we report the pseudocode of the optimization strategy described above for this choice of loss and dissimilarity.
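Putting the three updates together, a hedged sketch of the block coordinate scheme (reusing the update_W, update_B, project_maxnorm_ball and solve_Z helpers sketched above); the fixed step sizes and iteration count are illustrative placeholders, whereas Theorem 1 prescribes the step sizes from the Lipschitz constants.

```python
import numpy as np

def train_hashmatch(X, Y, k, lam=1e-3, eta=1e-2, gamma=1.0, mu=1.0,
                    sigma1=1e-4, sigma2=1e-4, n_iters=50):
    """Block coordinate descent on Eq. (7) with the least-squares L and D."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((X.shape[1], k))
    B = project_maxnorm_ball(X @ W, mu)                 # initialize B near XW
    Z = solve_Z(B, Y, eta)
    for _ in range(n_iters):
        W = update_W(W, X, B, lam, gamma, sigma1)       # sparse hyperplanes, Eq. (8)
        B = update_B(B, W, Z, X, Y, gamma, mu, sigma2)  # anti-sparse codes, Eq. (9)
        Z = solve_Z(B, Y, eta)                          # closed-form task predictor
    return W, Z
```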
3.3. Parallel Inference
Inferring the posterior probability, and a fortiori the MAP (Maximum a Posteriori) of P(Y|D), is in general very hard as it requires solving a very complex series of integrals over all the variables x ∈ X. To approximate this complex distribution, we resort to variational approximation. More precisely, we aim at finding a distribution Q that is a ‘close’ approximation of P within the class of distributions that can be factorized as a product of independent marginals, i.e.

Q(Y) = Π_i Q(Y_i)    (13)

This approximation is computationally attractive but is very likely to lose a lot of information about the original distribution P(Y|D). Nevertheless, the results of MAP and MPM (Maximum Posterior Marginal) inference will be quite similar when the entropy of P(Y|D) is low. Broadly speaking, in the case of the pairwise CRF described previously, good approximations of P(Y|D) are obtained when the unary potentials are ‘peaky’ (i.e. have low entropy). The quality of the approximation between Q(Y) and P(Y|D) is usually measured using KL(Q(Y)||P(Y|D)), where KL is the Kullback-Leibler divergence. Taking the fixed-point solution of the Kullback-Leibler divergence [26], we obtain the following update for the label l_i in the marginal of random variable x_i:

Q_i^t(l_i) = (1 / B_i) e^{−M_i(l_i)}    (14)

M_i(l_i) = ψ_u(l_i) + Σ_{j∈N_i} Σ_{l_j∈L} Q_j^{t−1}(l_j) ψ_p(l_i, l_j)    (15)

B_i = Σ_{l_i∈L} e^{−M_i(l_i)}    (16)
The underlying coordinate ascent procedure results in a better approximation of P by Q at each iteration and also guarantees convergence. Note that popular techniques like Belief Propagation are not guaranteed to converge when performing inference over loopy graphs. At this stage, it is crucial to note that the complexity of evaluating the updated Q^t(Y) is O(|Y| |L| (|N| |L| + 1)). The quadratic complexity in L is not of practical concern for high-speed inference when L is small. Nevertheless, it makes the (computationally attractive) mean-field framework too slow when this number grows large. As mentioned before, the MPM and MAP solutions are similar in the pairwise CRF (Eq. (3)) when the entropy of the unary is low. We go one step further in the approximation and explicitly assume that Q also has low entropy, approximating it with a Dirac δ function. This corresponds to setting Q_i = δ(l_i − arg max_{l_j} Q_j). We can now rewrite (15) as follows

M_i(l_i) = ψ_u(l_i) + Σ_{j∈N_i} ψ_p(l_i, arg max_{l_j} Q_j)    (17)

Since Q_i^t (14) now follows a Dirac δ function, computing the normalization function B_i (16) is not required anymore. The compute complexity of updating Q^t(Y) is now O(|Y| |N| (1 + |N|)). Roughly speaking, this corresponds to a reduction of complexity of the order of O(|L|^2/|N|). Note that |N| is small for many problems (e.g. stereo depth estimation), and that in most of these problems |L| > |N| (e.g. |L| is in the hundreds and |N| = 4 when estimating disparities).
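A minimal sketch of one sweep of the simplified update of Eq. (17), using the pairwise cost as defined in Sec. 3. The dictionary-based data layout and the serial loop are illustrative only; in the actual framework every pixel is updated in parallel with a double-buffered label map.

```python
def parallel_inference_step(unary, labels, neighbors, tau):
    """One sweep of Eq. (17) under the Dirac approximation of Q.
    unary:     {pixel: {label: psi_u(label)}}
    labels:    {pixel: current best label}
    neighbors: {pixel: list of neighboring pixels}
    tau:       truncation threshold of the pairwise term"""
    new_labels = {}
    for i, costs in unary.items():                       # each pixel is independent
        def m(l):                                        # M_i(l) of Eq. (17)
            return costs[l] + sum(max(tau, abs(l - labels[j])) for j in neighbors[i])
        new_labels[i] = min(costs, key=m)                # arg-min over candidate labels
    return new_labels
```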
3.4. Computational Analysis
The following is a computational analysis of the proposed method for (discrete) disparity estimation. We assume an input image with |Y| pixels and L possible labels. Typical values for disparity estimation are |Y| = 1280 × 1024 and L = 512. The hyperplanes W are trained to have a maximum of 4 non-zero elements for each code. Each pixel i is associated with a patch P_i of size |P_i| = 11 × 11. Taking the sign of the dot product between P_i and W remaps P_i into k = 32 binary codes. The computation of the binary codes b is independent of the window size |P| and involves 4 multiplications and additions for each hyperplane. This corresponds to a per-pixel complexity of O(4k) to compute a hash code. We initialize the data term ψ_i by evaluating only a very limited set of 32 random label hypotheses. The distances are computed using the Hamming distance in the new space, which can be implemented efficiently in O(1) using the popc() instruction available on most recent GPU architectures. The initialization step therefore has a per-pixel complexity of O(4k). Regarding the inference, we only use the immediate N = 3 × 3 neighbors of each pixel in the pairwise potential. The marginal of each pixel can be updated in parallel without waiting for sequential propagation steps as in [4, 16]. In practice, we use 4 steps of the proposed inference, resulting in a per-pixel complexity of O(4 |N| (1 + |N|)). Note that in contrast to other approaches, the proposed algorithm is independent of the number of labels L and the patch size |P_i|. The relatively low computational complexity and fully parallel nature of all the components of the proposed method make it particularly suitable for high-speed applications on low-compute devices.
We tested the HashMatch framework on an NVIDIA Titan X GPU, with an overall running time of 890 µs per frame. We also implemented the algorithm on an NVIDIA Tegra TX1, where the overall running time is 5 ms, which opens up the possibility of high-speed applications on mobile platforms.
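To make the per-pixel cost concrete, a small sketch of the sparse code computation and packed Hamming distance is given below. The compressed column layout for the sparse hyperplanes is a hypothetical choice for illustration; on the GPU the XOR-plus-popcount is a single popc() instruction.

```python
import numpy as np

def sparse_hash_code(patch, nz_index, nz_weight):
    """k-bit code from a flattened patch using hyperplanes with few non-zeros each.
    nz_index[j] / nz_weight[j] hold the active entries of column j of W
    (hypothetical compressed layout); cost is O(s k) multiply-adds per pixel."""
    bits = 0
    for j, (idx, w) in enumerate(zip(nz_index, nz_weight)):
        if float(patch[idx] @ w) >= 0.0:     # sparse dot product, O(s)
            bits |= 1 << j
    return bits

def hamming32(code_a, code_b):
    """Hamming distance between two packed codes; a single popc() on the GPU."""
    return bin(code_a ^ code_b).count("1")
```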
Figure 1. Qualitative comparisons of depth maps generated with state-of-the-art methods. HashMatch shows the most complete results in complex scenes.
4. Results
We evaluate the HashMatch framework on a diverse set
of computer vision tasks. For each problem, we explicitly
describe the form of the unary potential that is used. We
first show how our method can handle continuous labeling
problems such as disparity estimation. Further, we evaluate
the proposed hashing scheme on retrieval and feature ap-
proximation tasks. Finally, we assess the quality of the pro-
posed inference for background subtraction. Note that for
the tasks where the proposed inference is used, the number
of iterations is constant and set to 4.
4.1. Depth Estimation
In this section, we focus on depth estimation from stereo
images under active illumination [15,16]. For our exper-
iments, we use a hardware setup similar to [16], i.e. two IR cameras in a stereo configuration together with a Kinect V1 DOE. When the IR images I_L and I_R are calibrated and rectified, each pixel p = (u, v) in I_L has a corresponding pixel q = (u + l, v) in I_R that lies on the same scanline v. We apply HashMatch to retrieve the continuous disparity l ∈ R. The disparity is then remapped into the depth domain via Z = bf/l, where b is the baseline of the system, f the focal length of the camera and l the inferred disparity.
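A tiny sketch of the disparity-to-depth remapping Z = bf/l; the baseline and focal length values below are placeholders, not our calibration.

```python
def disparity_to_depth(disparity_px, baseline_m, focal_px):
    """Remap an inferred disparity l (pixels) to metric depth via Z = b*f / l."""
    return (baseline_m * focal_px) / disparity_px if disparity_px > 0 else float("inf")

# e.g. with a hypothetical 5 cm baseline and 700 px focal length:
# disparity_to_depth(35.0, 0.05, 700.0) -> 1.0 metre
```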
In this section, the data term ψ_u(l_i) is computed according to Eq. (4). We train the hyperplanes W by acquiring 10000 images and extracting 11 × 11 image patches x. We set the task y = x. We use k = 32 hyperplanes W with a maximum of 4 non-zero elements per column.
We compare HashMatch with state-of-the-art methods, including both qualitative and quantitative evaluations. We set the exposure time of the cameras very low (2 ms) in order to perform evaluations on real data captured at 500 Hz.
Figure 2. Example of a high quality depth map and point cloud generated with HashMatch. Notice the level of detail and the absence of quantization artifacts and outliers.
This data has a significantly lower signal-to-noise ratio (SNR) compared to data captured at 30 Hz. In Fig. 1 we show qualitative results generated with HashMatch, PatchMatch Stereo [4] and UltraStereo (US) [16]. Notice how the baseline methods suffer from the relatively low SNR, whereas HashMatch is able to predict complete and smooth depth maps. We also generate results using our unary term and the PatchMatch inference [1] (HashMatch+PM in Fig. 1). The proposed parallel inference produces results very close to those generated with the PatchMatch inference, but is 2× faster than the highly optimized implementation described in [16]. In Fig. 2 we show a high quality depth map and point cloud generated with HashMatch; notice the level of detail captured by our method.
To quantitatively assess the proposed framework, we follow the procedure presented in [15] and acquire images of a flat wall at multiple known distances, varying between 500 and 3500 cm. For each set distance, 100 frames are recorded
from which we estimate the average depth bias (defined as
the average error), and depth jitter (defined as the standard
deviation). We compare with Kinect V1, RealSense R200,
PatchMatch Stereo [4], HyperDepth [15] and UltraStereo
[16] as baseline methods; results are reported in Fig. 3.
HashMatch outperforms most of the competitors and is on
par with more involved methods such as [15].
Finally, we compare the quality of our hashing scheme with other state-of-the-art methods: ITQ [19], SBE [37], CBE [56], Binary Autoencoders (BA) [8] and UltraStereo [16]. We use the synthetic dataset from [16] with perfect groundtruth disparities and perform the (discrete) nearest neighbor search over the pixel disparities. We define the accuracy as the percentage of pixels for which the estimated disparity error is below 1 pixel; results are reported in Tab. 1.

Figure 3. Quantitative comparisons with state-of-the-art methods. HashMatch achieves the lowest error using the lowest compute.

Although our HashMatch descriptor uses only 4 non-zero elements per hyperplane, the results are on par with dense hashing schemes such as BA. We also outperform other sparse hashing schemes such as CBE and SBE, proving the quality of the proposed representation. It is interesting to note that, besides providing smooth results, using the proposed inference raises the precision reported in Tab. 1 from 77% to 96%.
HashMatch  ITQ  SBE  CBE  LSH  BA   US
77%        76%  68%  70%  58%  77%  73%
Table 1. Hashing scheme comparisons. We compare our method with other popular hashing schemes and binary representations: ITQ [19], SBE [37], CBE [56], LSH [23], Binary Autoencoders (BA) [8] and UltraStereo (US) [16]. HashMatch uses only 4 non-zero elements per hyperplane and is on par with dense methods like BA.
4.2. Nearest Neighbor Retrieval
In this section, we evaluate HashMatch on a nearest neighbor retrieval task. We assume we have two sets of feature vectors, query and base. For each sample in the query set, the goal is to find the nearest neighbor in the base set. This is a typical scenario for correspondence search problems between two or more images. The goal of this section is to evaluate only our hashing scheme and compare it with other state-of-the-art methods, therefore no parallel inference is required. The data term ψ_u(l_i) is computed according to Eq. (4). Notice that for this experiment the smoothness term is not needed, thus we drop it. We use the GIST1M dataset [24], which is composed of three disjoint sets: train, query and base. Each descriptor x has 960 dimensions, and we minimize Eq. (6) setting the task Y = X. We train the data-dependent hashing schemes using the training set and test on the other sets. We compute the Recall@R, defined as the recall for R retrievals.

Figure 4. Recall@R curves on the GIST1M dataset.
We compare our method with the following state-of-the-art hashing techniques: ITQ [19], SBE [37], CBE [56], LSH [23], and SKLSH [36]. For each method we train codes of 16, 32, 64 and 128 bits, perform the binary embedding according to Eq. (1) and carry out the nearest neighbor search using the Hamming distance. We do not perform any data preprocessing such as normalization or augmentation. For HashMatch and SBE we set the sparsity parameter such that we have 10% non-zero elements.
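For reference, a brute-force sketch of the Recall@R metric used in this evaluation (NumPy; the ground-truth neighbor indices are assumed to be precomputed in the original descriptor space).

```python
import numpy as np

def recall_at_R(query_codes, base_codes, true_nn, R):
    """Recall@R: fraction of queries whose true nearest neighbor appears
    among the R closest base codes under the Hamming distance."""
    hits = 0
    for q, nn in zip(query_codes, true_nn):
        dists = np.count_nonzero(base_codes != q, axis=1)   # Hamming distance to every base code
        hits += int(nn in np.argsort(dists, kind="stable")[:R])
    return hits / len(true_nn)
```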
We report the Recall@R curves in Fig. 4. For small code sizes, HashMatch greatly outperforms all the competitors, including dense ones such as ITQ. When larger codes are used (i.e. 128 bits), this gap shrinks, but HashMatch still provides the largest area under the curve (AUC).
4.3. Feature Approximation
In this experiment we consider the problem of approximating complex feature descriptors given a set of 11 × 11 image patches x. For this particular application we consider SIFT descriptors [31], s ∈ R^128. The goal is to minimize Eq. (6) where the task Y = S is the set of SIFT descriptors. In other words, we apply HashMatch to a regression problem, where the target continuous function is computed from handcrafted features. In general, we could apply the same framework to learn more sophisticated descriptors.
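At test time the approximation reduces to applying the linear task predictor to the binary code, i.e. ŝ = Z^T sign(W^T x); a minimal sketch (shapes are illustrative assumptions):

```python
import numpy as np

def approximate_sift(patch, W, Z):
    """Regress a 128-D descriptor from an 11x11 patch as s_hat = Z^T sign(W^T x).
    W is n x k (n = 121) and Z is k x 128; both are the learned HashMatch parameters."""
    b = (patch.ravel() @ W >= 0).astype(np.float32)   # k-bit binary code in {0, 1}
    return b @ Z                                      # approximated SIFT descriptor
```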
We consider the EPFL wide-baseline stereo data [46], where we train HashMatch on the sequences with no groundtruth available. Training data is generated by extracting SIFT descriptors in an unsupervised way. At test time we detect corners and compute SIFT and HashMatch descriptors, respectively. We match the descriptors based on the closest ℓ_2 distance and filter outliers by imposing the epipolar constraint between the two images. In Fig. 5 we report qualitative results: in green we depict correct matches, in red those matches with distances greater than 1 cm in the provided groundtruth. SIFT achieves an average end-point error of 0.8 cm, whereas HashMatch gets very close with 1.2 cm. If we consider as inliers the percentage of retrieved matches with error < 1 cm, HashMatch reports an overall accuracy of 90%, whereas SIFT retrieves 81% correct matches. On average, SIFT retrieves 350 good matches per image, HashMatch 150. While the number of matches is lower for HashMatch, the matches we obtain are more reliable. In practice, the number of feature matches we obtain from HashMatch would cater to most applications that use SIFT, at a much lower compute cost, making HashMatch very powerful in such scenarios.

Figure 5. Qualitative experiments for feature approximation (see Sec. 4.3).
4.4. Background Subtraction
In this last experiment, we evaluate the proposed parallel inference on a background segmentation task with static cameras. Typical applications are surveillance and people-tracking scenarios. We consider natural (indoor) rooms. Assuming a clean background shot is available (RGB and depth), we want to detect any new object entering the scene.
The goal of this application is to evaluate the proposed inference and compare it with well established approaches such as belief propagation (BP) [45], tree-reweighted message passing (TRW) [50, 27], mean field [49] and PatchMatch [1]. Assuming that RGB and depth information is available, we use a unary potential of the form ψ_u(l_i) = ψ_rgb(l_i) + ψ_depth(l_i). The two terms are a simple difference in HSV space and a logistic function in the depth domain, as defined in [35] and [14]. Note that the unary potential is shared amongst all the baseline methods and that the pairwise term is the standard Potts model.
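A hedged sketch of such a per-pixel unary is given below; the squared HSV difference, the logistic shape, and the constants are illustrative placeholders rather than the exact models of [35] and [14], and the sign convention (cost of the background label growing with the deviation from the clean background shot) is an assumption for illustration.

```python
import numpy as np

def bg_unary(hsv, hsv_bg, depth, depth_bg, alpha=8.0, thresh=0.05):
    """Illustrative per-pixel unary psi_u = psi_rgb + psi_depth against a clean
    background shot: the returned cost of the 'background' label grows when the
    observed pixel deviates from the stored background (placeholder forms)."""
    psi_rgb = float(np.sum((hsv - hsv_bg) ** 2))                                  # HSV difference
    psi_depth = 1.0 / (1.0 + np.exp(-alpha * (abs(depth - depth_bg) - thresh)))   # logistic in depth
    return psi_rgb + psi_depth
```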
We collected 100 frames of different subjects and objects, and manually labeled the foreground in the images to obtain the ground truth. We report the average energy obtained by the proposed inference and the baselines in Table 2. It is interesting to note that, although much less computationally demanding, the proposed inference achieves final energies that are very similar to those obtained by the baselines. Some qualitative results are shown in Fig. 6.

Figure 6. Qualitative results of background subtraction. The foreground segmentation corresponds to the cyan region, which is overlaid on the gray-scale version of the color image. Note that we achieve results that are comparable with more computationally demanding approaches, but orders of magnitude faster.
Inference algorithm      Average energy
Only unaries             5.8 × 10^5
BP [45]                  2.9 × 10^5
TRW [50, 27]             2.85 × 10^5
Mean Field [49]          2.9 × 10^5
PatchMatch [1]           3.0 × 10^5
Proposed inference       2.9 × 10^5
Table 2. Quantitative evaluation of the proposed inference on a
background subtraction task. We compare our method against
established inference techniques. We report the average energy
obtained by each approach. Note how all the propagation ap-
proaches significantly reduce the energy obtained by the initial so-
lution (first row). Also note that although the proposed inference
is less computationally demanding than the baselines, it reaches
very comparable energy levels.
5. Conclusion
In this paper we presented HashMatch, an efficient
framework tailored for parallel compute architectures.
Through extensive experiments on a diverse set of computer
vision tasks, we demonstrated that although HashMatch op-
erates at extreme speeds, it makes little compromise in pre-
cision compared to more compute intensive approaches. All
these characteristics make HashMatch an appealing frame-
work for low compute mobile platforms and for products
required to operate at very high speeds.
Acknowledgements
We thank the entire perceptiveIO team for continuous
feedback and support regarding this work.
References
[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman.
PatchMatch: A randomized correspondence algorithm for
structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 2009. 6, 8
[2] H. H. Bauschke and P. L. Combettes. Convex analysis and
monotone operator theory in Hilbert spaces. Springer Sci-
ence & Business Media, 2011. 4
[3] J. Besag. On the statistical analysis of dirty pictures. Journal
of the Royal Statistical Society. Series B (Methodological),
pages 259–302, 1986. 2
[4] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch Stereo -
Stereo Matching with Slanted Support Windows. In BMVC,
2011. 2,5,6
[5] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating
linearized minimization for nonconvex and nonsmooth prob-
lems. Mathematical Programming, 2014. 3,4
[6] S. Boyd and L. Vandenberghe. Convex optimization. Cam-
bridge University Press, 2004. 4
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
ergy minimization via graph cuts. IEEE Transactions on pat-
tern analysis and machine intelligence, 23(11):1222–1239,
2001. 1,2
[8] M. Á. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with binary autoencoders. In CVPR, 2015. 2, 3, 6, 7
[9] M. S. Charikar. Similarity estimation techniques from round-
ing algorithms. In ACM symposium on Theory of computing,
pages 380–388. ACM, 2002. 2
[10] M.-M. Cheng, V. A. Prisacariu, S. Zheng, P. H. Torr, and
C. Rother. Densecut: Densely connected crfs for realtime
grabcut. In Computer Graphics Forum, volume 34, pages
193–201. Wiley Online Library, 2015. 1
[11] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect:
Training deep neural networks with binary weights during
propagations. In NIPS, pages 3123–3131, 2015. 1
[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and
Y. Bengio. Binarized neural networks: Training deep neu-
ral networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016. 1
[13] N. Dalal and B. Triggs. Histograms of oriented gradients for
human detection. In Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society Conference
on, volume 1, pages 886–893. IEEE, 2005. 1
[14] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello,
A. Kowdle, S. Orts Escolano, C. Rhemann, D. Kim, J. Tay-
lor, P. Kohli, V. Tankovich, and S. Izadi. Fusion4d: Real-time
performance capture of challenging scenes. 2016. 8
[15] S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle, S. O.
Escolano, D. Kim, and S. Izadi. Hyperdepth: Learning depth
from structured light without matching. In CVPR, volume 2,
page 7, 2016. 6
[16] S. R. Fanello, J. Valentin, C. Rhemann, A. Kowdle,
V. Tankovich, and S. Izadi. Ultrastereo: Efficient learning-
based matching for active stereo systems. In CVPR, 2017. 5,
6,7
[17] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief
propagation for early vision. IJCV, 70(1):41–54, 2006. 2
[18] J. J. Fuchs. Spread representations. In ASILOMAR, 2011. 3
[19] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Itera-
tive quantization: A procrustean approach to learning binary
codes for large-scale image retrieval. PAMI, 35(12):2916–
2929, 2013. 2,6,7
[20] S. Han, H. Mao, and W. J. Dally. Deep compression: Com-
pressing deep neural networks with pruning, trained quanti-
zation and huffman coding. In ICLR, 2016. 1
[21] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer. Com-
pact hashing with joint optimization of search accuracy and
time. In CVPR, 2011. 2
[22] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J.
Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and 0.5 mb model size. arXiv
preprint arXiv:1602.07360, 2016. 1
[23] P. Indyk and R. Motwani. Approximate nearest neighbors:
towards removing the curse of dimensionality. In ACM sym-
posium on Theory of computing, 1998. 2,7
[24] H. Jegou, M. Douze, and C. Schmid. Searching with quan-
tization: approximate nearest neighbor search using short
codes and distance estimators. In INRIA Technical report,
2009. 7
[25] H. Jégou, T. Furon, and J. J. Fuchs. Anti-sparse coding for approximate nearest neighbor search. In ICASSP, 2012. 3
[26] D. Koller and N. Friedman. Probabilistic graphical models:
principles and techniques. MIT press, 2009. 5
[27] V. Kolmogorov. Convergent tree-reweighted message pass-
ing for energy minimization. IEEE transactions on pattern
analysis and machine intelligence, 28(10):1568–1583, 2006.
2,8
[28] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS, 2011. 2
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
NIPS, pages 1097–1105, 2012. 1
[30] D. G. Lowe. Object recognition from local scale-invariant
features. In Computer vision, 1999. The proceedings of the
seventh IEEE international conference on, volume 2, pages
1150–1157. IEEE, 1999. 1
[31] D. G. Lowe. Distinctive image features from scale-invariant
keypoints. IJCV, 2004. 7
[32] J. Lu, H. Yang, D. Min, and M. Do. Patch match filter: Ef-
ficient edge-aware filtering meets randomized search for fast
correspondence field estimation. In CVPR, 2013. 2
[33] Y. Lyubarskii and R. Vershynin. Uncertainty principles and
vector quantization. IEEE Transactions on Information The-
ory, 2010. 3
[34] V. Nair and G. E. Hinton. Rectified Linear Units Improve
Restricted Boltzmann Machines. 2012. 3
[35] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang,
A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson,
S. Khamis, M. Dou, et al. Holoportation: Virtual 3d telepor-
tation in real-time. In Proceedings of the 29th Annual Sym-
posium on User Interface Software and Technology, pages
741–754. ACM, 2016. 8
[36] M. Raginsky and S. Lazebnik. Locality-sensitive binary
codes from shift-invariant kernels. In NIPS, 2009. 7
[37] M. Rastegari, C. Keskin, P. Kohli, and S. Izadi. Computa-
tionally bounded retrieval. In CVPR, 2015. 2,6,7
[38] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-
net: Imagenet classification using binary convolutional neu-
ral networks. In ECCV, 2016. 1
[39] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified con-
vergence analysis of block successive minimization methods
for nonsmooth optimization. SIAM Journal on Optimization,
23(2):1126–1153, 2013. 3
[40] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and
M. Gelautz. Fast cost-volume filtering for visual correspon-
dence and beyond. In CVPR, 2011. 2
[41] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer.
Optimizing binary mrfs via extended roof duality. In 2007
IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2007. 2
[42] B. Scholkopf and A. J. Smola. Learning with Kernels: Sup-
port Vector Machines, Regularization, Optimization, and Be-
yond. MIT Press, 2001. 1
[43] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. Lda-
hash: Improved matching with smaller descriptors. PAMI,
34(1):66–78, 2012. 2
[44] C. Studer, T. Goldstein, W. Yin, and R. G. Baraniuk. Demo-
cratic representations. CoRR, 2014. 3
[45] M. F. Tappen and W. T. Freeman. Comparison of graph cuts
with belief propagation for stereo, using identical MRF pa-
rameters. In ICCV, 2003. 8
[46] E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense
descriptor applied to wide-baseline stereo. PAMI, 2010. 7
[47] P. Tseng. Convergence of a block coordinate descent method
for nondifferentiable minimization. Journal of optimization
theory and applications, 109(3):475–494, 2001. 3
[48] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton,
P. Kohli, M. Nießner, A. Criminisi, S. Izadi, and P. Torr. Se-
manticpaint: Interactive 3d labeling and learning at your fin-
gertips. ACM Transactions on Graphics (TOG), 2015. 1
[49] V. Vineet, J. Warrell, and P. H. S. Torr. Filter-based mean-
field inference for random fields with higher-order terms and
product label-spaces. 2012. 8
[50] M. Wainwright, T. Jaakkola, and A. Willsky. Map estimation
via agreement on (hyper)trees: Message-passing and linear
programming approaches. IEEE Transactions on Informa-
tion Theory, 2002. 8
[51] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity
search: A survey. arXiv preprint arXiv:1408.2927, 2014. 2
[52] S. Wang, S. R. Fanello, C. Rhemann, S. Izadi, and P. Kohli.
The global patch collider. CVPR, 2016. 2
[53] S. Wang and N. Rahnavard. Binary compressive sensing via
sum of l1-norm and l(infinity)-norm regularization. In MIL-
COM, 2013. 3
[54] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In
NIPS, 2009. 2
[55] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping
in gradient descent learning. Constructive Approximation,
2007. 4
[56] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant
binary embedding. In ICML, 2014. 2,6,7
[57] H. Zou and T. Hastie. Regularization and variable selection
via the elastic net. In Journal of the Royal Statistical Society,
2005. 4