Conference Paper

The Global Patch Collider

  • perceptiveIO, Inc

Abstract

This paper proposes a novel, extremely efficient, fully-parallelizable, task-specific algorithm for the computation of global point-wise correspondences in images and videos. Our algorithm, the Global Patch Collider, is based on detecting unique collisions between image points using a collection of learned tree structures that act as conditional hash functions. In contrast to conventional approaches that rely on pairwise distance computation, our algorithm isolates distinctive pixel pairs that hit the same leaf during traversal through multiple learned tree structures. The split functions stored at the intermediate nodes of the trees are trained to ensure that only visually similar patches, or their geometrically or photometrically transformed versions, fall into the same leaf node. The matching process involves passing all pixel positions in the images under analysis through the tree structures. We then compute matches by isolating points that uniquely collide with each other, i.e., that fall into the same otherwise empty leaf in multiple trees. Our algorithm is linear in the number of pixels but can be made constant time on a parallel computation architecture, as the tree traversal for individual image points is decoupled. We demonstrate the efficacy of our method by using it to perform optical flow matching and stereo matching on some challenging benchmarks. Experimental results show that not only is our method extremely computationally efficient, but it is also able to match or outperform state-of-the-art methods that are much more complex.
Shenlong Wang¹,², Sean Ryan Fanello¹, Christoph Rhemann¹, Shahram Izadi¹, Pushmeet Kohli¹
¹Microsoft Research, ²University of Toronto
1. Introduction
Correspondence estimation, i.e., the task of estimating how parts of visual signals (images or volumes) correspond to each other, is an important and challenging problem in Computer Vision. Point-wise correspondences between images or 3D volumes can be used for tasks such as camera pose estimation, multi-view stereo, structure-from-motion, co-segmentation, retrieval, and compression. Due to its
wide applicability, many variants of the general correspon-
dence estimation problem like stereo and optical flow have
been extensively studied in the literature.
There are two key challenges in matching visual content across images or volumes. The first is robust modelling of the photometric and geometric transformations present in real-world data, such as occlusions, large displacements, viewpoint changes, shading, and illumination changes. The second, and perhaps more important, is the hardness of performing inference in the above-mentioned model. The latter stems from the computational complexity of searching the large space of potential correspondences and is a major impediment to the development of real-time algorithms. A popular approach to this problem involves detecting 'interest' or 'salient' points in the image, which are then matched by measuring the Euclidean distance between hand-specified [24,34,20,7] or learned [38,35,23,32,30,6,39,29,31] descriptors that are designed to be invariant to certain classes of transformations and in some cases can also work across different modalities. While these methods generate accurate matches, the computational complexity of matching potential interest points (quadratic in the number of interest points) restricts their applicability to a small number of key-points.
An effective strategy to generate dense correspondences is to limit the search space of possible correspondences, for instance, in the case of optical flow, by only searching for matches in the immediate vicinity of the pixel location. However, this approach fails to detect large motions/displacements. Methods like [3,22] overcome this problem by adaptively sampling the search space and have been shown to be very effective for optical flow and disparity estimation [1,4]. However, they rely on the implicit assumption that the correspondence field between images is smooth and fail when this assumption is violated. Techniques based on algorithms for finding approximate nearest neighbors, such as KD-Tree [18,2] and hashing [22,9], can be used to search for large-displacement correspondences and have been used for initializing optical flow algorithms [37,1,36,25]. However, these approaches search for candidate matches based on appearance similarity, and they are not robust when geometric and photometric transformations occur (see Fig. 2).
In this paper, we address the problem of efficiently generating correspondences that (1) can have an arbitrary distribution of magnitudes, and (2) are between image elements affected by task-dependent geometric and photometric transformations. We propose a novel fully-parallelizable, learned matching algorithm called Global Patch Collider (GPC) to enable extremely efficient computation of global point-wise correspondences. GPC is based on detecting unique collisions between image points using a collection of learned tree structures that act as conditional hash functions. In contrast to conventional approaches that isolate matches by computing distances between pairs of image elements, GPC detects matches by finding which pixel pairs hit the same leaf during traversal through multiple learned tree structures.
The split functions stored at the intermediate nodes of the trees are trained to ensure that visually similar patches fall into the same terminal node. The matching process involves passing all pixel positions in the images under analysis through the tree structures. We then compute matches by isolating points that uniquely collide with each other, i.e., that fall into the same otherwise empty leaf in multiple trees. We also incorporate a multi-scale top-bottom architecture, which significantly reduces the number of outliers. Content-aware motion patterns are learned for each leaf node in order to increase the recall of the retrieved matches.
Unlike existing feature matching algorithms, the pro-
posed global patch collider does not require any pairwise
comparisons or key-point detection, thus it tackles the
matching problem with linear complexity with respect to
the number of pixels. Furthermore, its computational com-
plexity can be made independent of the number of pixels by
using a parallel computation architecture as the tree traver-
sal for individual image points is decoupled.
We demonstrate the efficacy of our method by applying it to a number of challenging vision tasks, including optical flow and stereo. Not only is GPC extremely computationally efficient, but it is also able to match or outperform more complex state-of-the-art algorithms. To summarize, our contributions are two-fold: first, we propose a novel learning-based matching algorithm that computes global correspondences with linear complexity; second, we develop a novel hashing scheme by training decision trees designed for seeking collisions.
2. Related Work
Our work is similar to correspondence estimation algorithms based on approximate nearest neighbor (ANN) methods, such as KD-Tree [18,2] or hashing [22,9]. However, there are two notable differences: (1) GPC is trained to be robust to the various geometric and photometric transformations in the training data, and (2) it isolates potential matches by looking for unique collisions in leaves of decision trees. These leaves act like conditional hash functions.
[Figure 1 diagram: source patches and target patches (input: local patches/features) traverse Tree 1, Tree 2, ..., Tree T; output: matched pairs with common leaf ids (1, 4, 2) and (6, 7, 1).]
Figure 1. Global Patch Collider (GPC). Local patches traverse each tree in the decision forest, reaching different leaves. If two patches from the source and target images hit the same leaf across all trees without collisions with other patches, they are considered a distinctive correspondence. For instance, source patch 4 and target patch 1 hit the same leaves in all the trees, and no other patch hits exactly the same leaves across all trees, so they form a distinctive correspondence.
The growing availability of real and synthetic datasets for correspondence problems has led to the proposal of a number of learning-based approaches. In one of the earliest works in this direction, Roth and Black [27] showed how optical flow estimates can be improved by incorporating a statistical prior on the distribution of flow in a Field of Experts model. As the size of the available datasets has grown, researchers have started to use high-capacity models such as deep convolutional neural networks to either learn the pair-wise similarity [29,38] or learn the end-to-end pipeline directly [16].
The computational architecture of GPC is similar to decision forests [10]. Decision trees have been widely used in various fields of computer vision, such as pose estimation [28], image denoising [14], image classification [5], object detection [21], and depth estimation [13,15]. However, unlike all these applications, our method does not require classification or regression labels. Our objective function has been specially designed to ensure that visually similar patches (or their perspective-transformed versions) follow the same path in the trees and fall into the same leaf.
3. Global Patch Collider
GPC is a matching algorithm based on finding unique
collisions using decision trees as hash function evaluators.
Each tree learns to map patches that are in correspondence
into the same leaf while separating them from other patches
(see Fig. 1). We provide the formal description of the
Global Patch Collider (GPC) below.
Figure 2. Examples of matched local patches. From left to right: Sintel, KITTI, active stereo, MVS, synthetic. Correspondences are task dependent, with different types of variation, e.g. non-rigid transformation, scaling, intensity change, rotation, and background change. It is difficult to propose a generic descriptor that is robust to all kinds of variations, whereas our approach is able to learn those variations directly from the training data.
3.1. Formulation
Single Tree Prediction. Given two images $I$ and $I'$, our goal is to find distinctive local correspondences between pixel positions. Given a local patch $x$ with center coordinate $p$ from an image $I$, we pass it through a decision tree $T$ until it reaches a terminal node (leaf). The id of the leaf serves as a hash key for the image patch and is denoted as $T(x)$.
After processing all the patches, for each leaf $j$, GPC stores the set of patches from the source image, denoted as $S_j$, as well as the set of patches from the target image, denoted as $S'_j$. We consider two patches to be a correspondence pair if and only if they fall into the same leaf and this leaf contains only one target patch and one source patch. More formally, the set of correspondences can be written as
$$C_T(I, I') = \{(x, x') \mid T(x) = T(x') \text{ and } |S_{T(x)}| = |S'_{T(x)}| = 1\}.$$
This decision tree approach can be considered as a hash-
ing function, where correspondent patches are picked by
finding distinctive collisions between source and target im-
age in the hash table. Simple binary hash functions can be
used instead of decision trees but they would not have the
conditional execution structure that decision trees have as
only one split function needs to be evaluated at every layer.
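The unique-collision rule for a single tree can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `unique_collisions` and the toy leaf ids are hypothetical, and the leaf ids are assumed to have been computed already by passing each patch through the tree.

```python
from collections import defaultdict

def unique_collisions(source_keys, target_keys):
    """Return (source_index, target_index) pairs whose hash key (leaf id)
    is hit by exactly one source patch and exactly one target patch."""
    source_buckets = defaultdict(list)
    target_buckets = defaultdict(list)
    for i, key in enumerate(source_keys):
        source_buckets[key].append(i)
    for j, key in enumerate(target_keys):
        target_buckets[key].append(j)
    matches = []
    for key, src in source_buckets.items():
        tgt = target_buckets.get(key, [])
        # |S_key| = |S'_key| = 1: the collision must be unique on both sides.
        if len(src) == 1 and len(tgt) == 1:
            matches.append((src[0], tgt[0]))
    return matches

# Hypothetical leaf ids T(x) for four source and four target patches.
source_leaves = [3, 5, 5, 9]
target_leaves = [9, 3, 3, 7]
print(unique_collisions(source_leaves, target_leaves))  # [(3, 0)]
```

Leaf 3 is rejected because two target patches land in it, and leaf 5 because two source patches do; only leaf 9 holds exactly one patch from each image. Both passes are linear in the number of patches, and no patch is ever compared against another.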
Forest Prediction. It is worth noting that a single tree is not discriminative enough to generate a large number of distinctive pairs. For example, given a 16-layer binary tree, the maximum number of states is 32768, but if we consider megapixel images, there are millions of patches in one image. Moreover, due to content similarity, most patches within one image fall into a small fraction of the leaves (between 6000 and 10000 on the Sintel dataset). If we merely increase the depth of the tree, we bring additional computational and storage burdens for training the decision trees. This motivates us to extend the single-tree approach to a hashing forest scheme.
Specifically, instead of searching for distinctive pairs that fall into the same leaf of one tree, our method seeks distinctive pairs that fall into the same leaf across all the trees in the forest. In particular, two patches are considered a distinctive match if they reach the same leaves in all the trees and no other patch from either the source or the target image reaches exactly the same leaves. Given two images $I$ and $I'$ and a random forest $F$, the set of correspondences is formulated as
$$C_F(I, I') = \{(x, x') \mid F(x) = F(x') \text{ and } |S_{F(x)}| = |S'_{F(x)}| = 1\},$$
where $F(x)$ is the sequence of leaf nodes $\{j_1, \ldots, j_T\}$ that $x$ falls into in this forest, and $S_{\mathcal{L}}$ represents the set of patches that fall into the ordered leaf node sequence $\mathcal{L}$. For a forest with $T$ trees and $L$ layers for each tree, the number of states in total is $2^{L(T-1)}$. In practice, the number of states is between 50k and 200k for a 16-layer, 8-tree forest on a 0.4-megapixel image from the Sintel dataset [8]. Note that our method only seeks unique matched pairs, thus no re-ranking or pairwise comparison is needed.
Figure 3. Sparse matching with and without multi-scale learning. From top-left to bottom-right: 7×7, 15×15, 31×31, multi-scale.
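To illustrate why intersecting several trees yields more distinctive pairs, the following sketch treats the ordered tuple of per-tree leaf ids as the hash key. It is an assumption-laden toy, not the paper's code: `forest_keys`, `forest_matches` and the leaf ids are hypothetical, and each inner list holds the leaf reached by every patch in one tree.

```python
from collections import Counter

def forest_keys(per_tree_leaves):
    """Combine per-tree leaf ids into one ordered tuple F(x) per patch."""
    return list(zip(*per_tree_leaves))

def forest_matches(source_per_tree, target_per_tree):
    """Pairs (i, j) whose leaf sequence is hit by exactly one source patch
    and exactly one target patch."""
    src_keys = forest_keys(source_per_tree)
    tgt_keys = forest_keys(target_per_tree)
    src_count = Counter(src_keys)
    tgt_count = Counter(tgt_keys)
    # For keys that occur once, this maps the key to its only target index.
    tgt_index = {key: j for j, key in enumerate(tgt_keys)}
    return [(i, tgt_index[key]) for i, key in enumerate(src_keys)
            if src_count[key] == 1 and tgt_count[key] == 1]

# Two trees, three patches per image (hypothetical leaf ids).
source_per_tree = [[1, 1, 2], [4, 5, 5]]   # F(x): (1,4), (1,5), (2,5)
target_per_tree = [[1, 2, 1], [5, 5, 4]]   # F(x'): (1,5), (2,5), (1,4)
print(forest_matches(source_per_tree, target_per_tree))
# [(0, 2), (1, 0), (2, 1)]
```

With the first tree alone, leaf 1 contains two source patches and two target patches, so neither would be matched; adding the second tree separates them and all three pairs become unique.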
Split Function. Each split node contains a set of learned parameters $\theta = (w, \tau)$, where $w$ is a hyper-plane and $\tau$ represents a threshold value. The split function $f$ is evaluated at a patch $x$ as
$$f(x; \theta) = \mathrm{sign}(w^T \phi(x) - \tau) \quad (1)$$
where $\phi(x)$ is the feature vector for $x$; we introduce our patch-based features for each task individually. This hyper-plane classifier is commonly adopted in decision forests [10]. Note that sparse hyper-planes can be used to increase efficiency, since only a small fraction of the features is tested. Furthermore, the nature of the random forest allows us to easily process patches and trees in parallel and independently for each pixel.
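A sketch of how Eq. (1) drives the traversal of one tree is given below. Several details are assumptions made only for illustration: the tree is stored as a heap-ordered list of `(w, tau)` pairs for a complete binary tree, a response of exactly zero is sent to the positive side, and the names `split` and `leaf_id` are hypothetical.

```python
def split(features, w, tau):
    """Eq. (1): sign(w^T phi(x) - tau), returned as +1 or -1.
    Assumption: a response of exactly 0 counts as +1."""
    response = sum(wi * fi for wi, fi in zip(w, features)) - tau
    return 1 if response >= 0 else -1

def leaf_id(features, nodes, depth):
    """Traverse a complete binary tree with `depth` levels of split nodes.

    `nodes` is heap-ordered: the children of node k are 2k+1 (left, -1)
    and 2k+2 (right, +1). Only one split is evaluated per level.
    Returns a leaf id in [0, 2**depth).
    """
    index = 0
    for _ in range(depth):
        w, tau = nodes[index]
        go_right = split(features, w, tau) > 0
        index = 2 * index + (2 if go_right else 1)
    return index - (2 ** depth - 1)

# Hypothetical 2-level tree on 2-D features.
nodes = [([1, 0], 0.5), ([0, 1], 0.5), ([0, 1], 0.5)]
print([leaf_id(x, nodes, depth=2) for x in ([0, 0], [0, 1], [1, 0], [1, 1])])
# [0, 1, 2, 3]
```

Because each call to `leaf_id` touches only its own patch, the loop over patches (and over trees) can be distributed freely, which is the decoupling the text refers to.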
Figure 4. Qualitative comparisons among top-5 algorithms on Sin-
tel optical flow benchmark. Top to bottom: input image (average
of two), ground-truth, EpicFlow [26], CPM, FlowFields [1], Glob-
alPatchCollider (ours).
3.2. Training.
Training Data. Each tree in the forest is trained independently on a random subset $S$ of the training data. For our correspondence problem, the set $S$ contains triplet samples $(x, x_{pos}, x_{neg})$, where $x$ is a patch in a training source image, $x_{pos}$ is the ground-truth correspondent patch in the target image, and $x_{neg}$ is a negative patch sampled around the ground-truth location in the target image with a random offset.
Training Objective. Intuitively, for each node we want to choose the optimal parameters that keep positive and reference patches in the same child node and that split them from the negative patches. Therefore, for each internal node, we propose to choose the parameters that maximize the weighted harmonic mean of precision and recall:
$$\frac{\mathrm{precision}(S, \theta) \cdot \mathrm{recall}(S, \theta)}{w_1\, \mathrm{precision}(S, \theta) + w_2\, \mathrm{recall}(S, \theta)} \quad (2)$$
where $w_1 + w_2 = 1$. The optimization task is equivalent to maximizing precision if $w_1 = 0, w_2 = 1$ and to maximizing recall if $w_1 = 1, w_2 = 0$. In practice we choose a small $w_1 \in [0, 0.3]$, since we prefer high-precision matches due to the nature of the correspondence problem.
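The behaviour of Eq. (2) at its two extremes can be checked numerically. The text does not spell out how precision and recall are counted at a node, so the sketch below assumes one plausible reading: a triplet is a true positive when the reference and positive patches go to the same child, and a false positive when the reference and negative patches go to the same child. The function name `split_objective` is hypothetical.

```python
def split_objective(sides, w1=0.2):
    """Weighted harmonic mean of precision and recall, as in Eq. (2).

    `sides` is a list of (ref_side, pos_side, neg_side), each +1 or -1,
    giving the child chosen by the split for every patch of a triplet.
    Assumed counting (not stated in the text):
      true positive  = reference and positive fall in the same child,
      false positive = reference and negative fall in the same child.
    """
    w2 = 1.0 - w1
    tp = sum(1 for r, p, _ in sides if r == p)
    fp = sum(1 for r, _, n in sides if r == n)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(sides)
    return precision * recall / (w1 * precision + w2 * recall)

# Four hypothetical triplets: tp = 3, fp = 2, so precision = 0.6, recall = 0.75.
sides = [(1, 1, -1), (1, 1, 1), (-1, 1, -1), (1, 1, -1)]
print(split_objective(sides, w1=0.0))  # reduces to precision
print(split_objective(sides, w1=1.0))  # reduces to recall
```

Setting `w1 = 0` cancels the recall term and leaves the precision, while `w1 = 1` leaves the recall, matching the two limiting cases described above.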
The optimization is conducted in a greedy manner. We randomly sample hyper-planes, and for each hyper-plane we choose the best threshold through line search. Each node selects the hyper-plane and the threshold that maximize our objective function. To further improve the efficiency of training, we share features across nodes within the same layer and update only the threshold for each node. This technique is known in the literature as Random Ferns [5].
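The greedy search for one node can be sketched as follows. This is a simplified illustration under stated assumptions, not the training code used in the paper: hyper-planes are drawn from a standard Gaussian, the line search simply tries every projected value as a threshold, precision and recall are counted as "reference and positive (or negative) share a child", and feature sharing across a layer is not shown. All names and the toy triplets are hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(sides, w1):
    """Eq. (2) on (ref_side, pos_side, neg_side) booleans (assumed counting)."""
    tp = sum(1 for r, p, _ in sides if r == p)
    fp = sum(1 for r, _, n in sides if r == n)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(sides)
    return precision * recall / (w1 * precision + (1.0 - w1) * recall)

def train_node(triplets, dim, num_proposals=64, w1=0.2, seed=0):
    """Return (score, w, tau) for the best randomly proposed split."""
    rng = random.Random(seed)
    best_score, best_w, best_tau = -1.0, None, None
    for _ in range(num_proposals):
        # Random hyper-plane proposal.
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        projected = [tuple(dot(w, patch) for patch in tri) for tri in triplets]
        # Line search: every projected value is a candidate threshold.
        thresholds = sorted({v for tri in projected for v in tri})
        for tau in thresholds:
            sides = [tuple(v >= tau for v in tri) for tri in projected]
            score = objective(sides, w1)
            if score > best_score:
                best_score, best_w, best_tau = score, w, tau
    return best_score, best_w, best_tau

# Toy (reference, positive, negative) feature triplets in 2-D.
toy_triplets = [
    ((1.0, 0.2), (1.1, 0.1), (-1.0, 0.3)),
    ((-1.0, 0.5), (-0.9, 0.4), (1.0, 0.6)),
    ((0.8, -0.3), (0.9, -0.2), (-0.7, -0.1)),
]
score, w, tau = train_node(toy_triplets, dim=2)
print(round(score, 3), tau)
```

In a full tree the same routine would be applied recursively to the triplets that reach each child; the Random Ferns variant mentioned above would reuse one `w` for all nodes of a layer and search only over `tau`.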
3.3. Extensions.
The described method is very efficient and retrieves very accurate, sparse matches. However, some applications require denser coverage in order to incorporate smoothness constraints between neighboring pixels. To this end, we propose three possible extensions that do not introduce any additional cost in the overall running time. First, we design a multi-scale version of the algorithm to increase the coverage across the image and improve the recall of the matched pairs. Second, hard pairs are sampled with higher probability during the training stage. Finally, we learn a motion prior over the patches: this gives a low-compute way to disambiguate non-unique matches without performing any expensive re-ranking steps.
Multi-scale Learning. Many feature matching methods have difficulties in finding all reliable matches at a fixed scale. For instance, for small local patches, matches are ambiguous in regions with repetitive patterns or smooth regions due to the lack of context. This motivates us to utilize information from multiple scales. However, simply stacking multiple features dramatically increases the dimension of the hyper-plane, which makes optimization difficult. Therefore, we propose a multi-layer learning scheme where the decision trees are organized in a coarse-to-fine manner. The first several layers are required to focus on features at a coarse resolution, and the tree looks into finer resolutions as it goes deeper. Tab. 1 shows the precision-recall of the single-scale and multi-scale approaches, and Fig. 3 depicts the matching results. Compared with a single-scale approach with the same tree architecture, this multi-scale approach achieves better recall at the same level of precision.
Mining Hard Pairs. One of the drawbacks of the greedy training approach is that difficult positive pairs are discarded early once they are split into different internal nodes. In the context of optical flow and stereo, we found that these samples are mostly due to large motion. Therefore, when sampling training patches we give higher probability to large-displacement patches.
Figure 5. Qualitative comparisons for sparse matching on the Sintel flow dataset (zoom in for better quality). Left to right: Sift, LibViso, CSH, DeepMatching, Ours. The matches from Sift and LibViso are not dense enough. CSH and Sift generate too many outliers. DeepMatching has the best coverage but also generates some outliers (background and the arm on top, the man's head on bottom). Our method is almost outlier-free and has the most matches.
Table 1. Precision at k% recall under different configurations (optical flow, Sintel Final). Our baseline is a random balanced tree with hyper-plane split function.
Method                                  1%       5%       10%      25%      50%
Locality Sensitive Hashing              -        -        -        85.2%    76.6%
Global Patch Collider (single-tree)     -        -        -        94.5%    89.5%
Global Patch Collider (multi-tree)      99.8%    99.3%    97.5%    93.6%    89.5%
Global Patch Collider (+multi-scale)    99.8%    99.5%    99.3%    98.1%    94.7%
Motion Pattern Learning. Our method can be further extended to learn priors of motion patterns. Specifically, at each terminal node, we train a six-layer decision tree to predict whether two patches are a true correspondence based simply on their relative motion. This is motivated by the observation that motions are highly correlated with the local content of images. For instance, boundary patches are more likely to move along the direction perpendicular to the edge than along the edge direction. Fig. 6 depicts the motion priors over different leaves. As we can see, the patterns of motion differ significantly across leaves, which justifies our approach of using motion features to further boost performance. In the testing stage, we can further utilize non-distinctive patches by predicting whether two patches are likely to be a good match.
3.4. Complexity Analysis.
The run-time complexity of the algorithm depends linearly on the size of the image $I$. For instance, in the optical flow task, the total complexity of our matching algorithm is
$$O(dTLN) + O(N) \quad (3)$$
where $N$ is the number of patches, $d$ is the number of features examined in each split function, $T$ is the number of trees, and $L$ is the number of layers of each tree. Specifically, the forest prediction stage requires $O(dTLN)$ operations and the matching stage requires a linear pass over all non-zero states, of which there are at most $N$. Therefore, our method considers all the possible matches globally in linear time and does not require any pairwise comparison. In practice, the parameters for our algorithm are $T = 8, L = 12, d = 27$ for optical flow and $T = 7, L = 12, d = 2$ for stereo. In comparison, KD-tree based matching takes $O(dN \log N) + O(dN \log N) + O(dmN)$ with an additional tree building step, and deep matching takes approximately $O(NN)$ operations.
Figure 6. Motion histogram for ten randomly picked leaves. Histogram bins are divided according to motion radius (0, 1, 3, 10) and angle ($-\pi$ to $\pi$).
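As a rough illustration of Eq. (3), assume $N \approx 4 \times 10^5$ patches (one per pixel of the 0.4-megapixel Sintel image mentioned in Sec. 3.1; this value of $N$ and the use of dense hyper-planes are assumptions made only for this estimate) together with the optical flow parameters $T = 8, L = 12, d = 27$:
$$dTLN = 27 \cdot 8 \cdot 12 \cdot (4 \times 10^5) \approx 1.04 \times 10^9, \qquad N^2 = (4 \times 10^5)^2 = 1.6 \times 10^{11}.$$
Under these assumptions the forest traversal needs on the order of $10^9$ multiply-adds, whereas exhaustively pairing every source patch with every target patch would already require about $1.6 \times 10^{11}$ pairs before any per-pair descriptor cost is counted, roughly two orders of magnitude more.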
4. Experimental Results
This section presents the results of applying the proposed Global Patch Collider to the following tasks: (i) optical flow, (ii) structured light stereo matching, and (iii) feature matching for wide-baseline stereo. For the first problem, we perform evaluations using the popular MPI-Sintel benchmark [8] and the KITTI 2015 optical flow dataset [17]. We compare our method with current state-of-the-art algorithms. The structured light stereo matching task is conducted over a sequence of infrared stereo images and compared with the patch-matching stereo algorithm [4]. Finally, we adopt the fountain dataset [34] to validate the domain transfer ability of the proposed method.
Algorithm 1 Global Patch Collider
Input: Images $I$ and $I'$ and the trained decision forest $F$.
- Get all local patch features $\{x\}$ and $\{x'\}$ from the source and target images respectively.
- Initialize $C(I, I')$ with the empty set.
- For each patch $x$, encode and store the forest status $F(x)$ according to Sec. 3.1.
- Enumerate all the forest statuses with a non-zero number of hits. If a status is hit by only one source patch and one target patch, add this distinctive pair $(x_i, x'_j)$ to the correspondence set $C(I, I')$.
Output: $C(I, I')$
Figure 7. Qualitative comparisons for structured light stereo. Top to bottom: left input, full-iter (random initialization), 1-iter (random initialization), 1-iter (ours). Our initialization helps achieve better results within only one iteration, e.g. in the regions of the computer and the table on the right side of the two images.
4.1. Optical Flow
For optical flow experiments, we evaluate our method on the challenging MPI-Sintel dataset [8]. We first split the training dataset into training (sequences 1-12) and validation (sequences 13-22) sets, and use the validation set to evaluate the performance of the sparse matching and pick the best hyper-parameters. Reference patches are randomly sampled with higher probability over large-motion patches (motion larger than 10 pixels). A positive patch is chosen according to the ground-truth flow of the center pixel, and a negative patch is randomly sampled around the ground-truth location with an offset between 3 and 20. This configuration generates very difficult negative samples due to the local appearance similarity of images. In total we have 4 million patches for training and 1.25 million patches for validation. Three scales are selected for the multi-scale patch collider (7×7, 15×15, 31×31). We use the Walsh-Hadamard transform (WHT) as the feature due to its efficiency and representation power¹. For each RGB channel we pick the first 9 components, thus our feature dimension in total is 81 for the multi-scale collider and 27 for the single-scale collider.
Table 2. Optical flow leader-board on the Sintel (final) benchmark.
Method                 EPE All   S0-10    S10-40   S40+
FlowFields [1]         5.810     1.157    3.739    33.890
GlobalPatchCollider    6.040     1.102    3.589    36.455
CPM                    6.078     1.201    3.814    35.748
DiscreteFlow [25]      6.077     1.074    3.832    36.339
EpicFlow [26]          6.285     1.135    3.727    38.021
TF+OFM [19]            6.727     1.512    3.765    39.761
Deep+R [12]            6.769     1.157    3.837    41.687
DeepFlow2 [36]         6.928     1.182    3.859    42.854
MDP-Flow2 [37]         8.445     1.420    5.449    50.507
LDOF [7]               9.116     1.485    4.839    57.296
Classic+NL [33]        9.153     1.113    4.496    60.291
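The Walsh-Hadamard features described above can be sketched for the 7×7 scale as follows. Several choices here are assumptions, since the text does not specify them: the extra row and column are filled by replicating the last row and column, the transform is left unnormalized in natural (Hadamard) ordering, and the "first 9 components" are taken to be the 3×3 block of lowest indices. All function names are hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering) of a
    list whose length is a power of two."""
    out = list(values)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    return out

def pad_to_8x8(patch7):
    """Assumed padding: replicate the last column and last row (7x7 -> 8x8)."""
    rows = [list(row) + [row[-1]] for row in patch7]
    rows.append(list(rows[-1]))
    return rows

def wht2d(patch):
    """Separable 2-D transform: rows first, then columns."""
    rows = [fwht(r) for r in patch]
    cols = [fwht(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def wht_features(patch7, keep=3):
    """Assumed selection: the keep x keep block of lowest-index coefficients."""
    coeffs = wht2d(pad_to_8x8(patch7))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

def patch_descriptor(channels):
    """Concatenate 9 coefficients per colour channel: 27 values per scale."""
    return [c for channel in channels for c in wht_features(channel)]

flat = [[2] * 7 for _ in range(7)]           # constant patch, one channel
print(wht_features(flat))                     # [128, 0, 0, 0, 0, 0, 0, 0, 0]
print(len(patch_descriptor([flat, flat, flat])))  # 27
```

The transform uses only additions and subtractions, which is consistent with the efficiency argument in the text; a constant patch puts all its energy in the first coefficient (8·8·2 = 128), as the printed output shows.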
Precision-Recall. We first report precision-recall on validation triplets under multiple configurations in Tab. 1. The balance between precision and recall can be adjusted via the number of layers and the number of intersected trees. As the model becomes more complex, we achieve higher precision and lower recall. We compare our method with a random balanced tree baseline with exactly the same features and tree architecture but randomly generated hyper-planes. This is essentially equivalent to the locality sensitive hashing method [11]. Tab. 1 shows that our learning-based approach clearly outperforms the random baseline in both the single-tree and forest settings. Furthermore, at the same level of recall, multi-scale learning achieves higher precision than the single-scale approach.
Sparse Matching. We conduct sparse matching experiments on a subset of our validation data (every fifth frame). Tab. 3 reports the results in terms of endpoint error, inlier percentage, and number of matches per image. We consider pixel-wise motion estimates with endpoint error larger than 3 pixels as outliers. We also report our algorithm under multiple configurations, namely single-scale, multi-scale, and multi-scale plus motion learning. Several matching methods are picked as competing algorithms. Coherency sensitive hashing [22] is a hashing-based PatchMatch algorithm designed for dense nearest-neighbor fields². SiftMatching [24] is a baseline for sparse matching³. LibViso2 [20] is a fast feature matching algorithm designed for sparse correspondence, with applications in SLAM, optical flow and stereo⁴. DeepMatching [36] is the state-of-the-art matching algorithm specially designed for the optical flow task⁵. From this table we can see that our algorithm achieves the lowest endpoint error and outlier percentage with a larger number of matched points on average. In terms of coverage in the most difficult case, our method outperforms other feature matching algorithms. Compared with the single-scale approach, multi-scale GPC significantly reduces the endpoint error and outlier percentage, but also decreases the number of matched points in the worst case. If we also consider non-unique hits with motion learning, the proposed method reaches the same accuracy level as the multi-scale method while keeping a reasonable number of matches. Qualitative comparisons of sparse matching are shown in Fig. 5. Two failure cases of our global patch collider are shown in Fig. 10. Although motion and multi-scale learning are introduced to increase coverage, in some cases (e.g. in the presence of motion blur and rotation) our method may fail to capture some regions undergoing large transformations.
Figure 8. Domain transfer ability for wide-baseline stereo matching. Left to right: matching results across 1, 2, 3 frames respectively. Green lines are inliers and blue lines are outliers.
¹ Since a $2^n \times 2^n$ patch size is required for WHT, we extrapolate the additional row and column with padding.
² We use the author's implementation. To ensure a fair comparison for dense approaches, we only compare pixels available in the 'Ours 15×15' approach when calculating endpoint errors and inlier percentage.
³ We use the implementation in VLFeat and set PeakThres to 0 for DoG-based keypoint detection.
Dense Flow. Once we compute sparse matches, we use the state-of-the-art interpolation method EpicFlow [26] to generate dense flow results on the Sintel testing benchmark. Tab. 2 shows the quantitative results of the top-8 optical flow methods on the Sintel testing benchmark (final) as well as three other popular algorithms⁶. Please note that all the top-5 methods use EpicFlow as post-processing and the original EpicFlow uses DeepMatching [36] as sparse initialization. The proposed GPC is ranked second among all the optical flow methods. In particular, our proposed approach achieves the best results over pixels with motion between 10 and 40 pixels. It is worth noting that our matching algorithm is the only method that does not need pairwise similarity comparison or re-ranking from multiple proposals. Fig. 4 shows a qualitative comparison of all the competing algorithms over the testing dataset.
4.2. KITTI Optical Flow.
We evaluate our algorithm on the KITTI 2015 optical flow dataset [17], following the same configuration used for the Sintel dataset. In the training stage, we trained a GPC with 8 trees of 12 layers each. Each layer learns a specific scale from (7×7, 15×15 and 31×31), with a 27-dimensional feature for each scale. In the testing stage, sparse matching is conducted with GPC and we use EpicFlow [26] to obtain the final dense optical flow, with the standard hyper-parameters for the KITTI dataset. The average number of matches per image is 14563, whereas the minimum number of matches is 2542. Tab. 4 shows the results. In general our method is comparable with sparse matching + EpicFlow, but orders of magnitude faster in the matching stage. We also show some qualitative results in Fig. 9.
⁴ We use the author's implementation.
⁵ We use the author's implementation.
⁶ This is a snapshot of the Sintel benchmark on Nov. 10 2015. For latest results, please refer to
Table 3. Sparse matching performance on the Sintel dataset.
Method                  EPE      Inlier %   Mean #   Min #   Max #
CSH                     5.7427   86.39%     dense    dense   dense
Sift Matching           3.3814   92.60%     1120     61      2393
LibViso                 1.5577   92.42%     848      45      1805
Deep Matching           3.0844   87.36%     5945     2008    6818
Ours 15×15              1.8796   94.69%     21048    973     93934
Ours M-Scale            1.2809   97.29%     16813    27      88309
Ours M-Scale+Motion     1.3626   96.17%     26131    890     169736
Figure 9. Results on KITTI optical flow. From top to bottom: input image, flow estimation, flow error.
Table 4. Performance on the KITTI flow 2015 benchmark.
Error        Fl-bg     Fl-fg     Fl-all
All / All    30.60%    33.09%    31.01%
Noc / All    20.09%    28.92%    21.69%
4.3. Structured Light Stereo Matching
Figure 10. Failure cases of Global Patch Collider. Left: compared with deep matching, which explicitly models rotation, our method fails to capture the rotation of the basket; Right: both deep matching and our patch collider fail to capture the non-rigid deformation of the bat.
Table 5. Sparse matching performance on IR stereo data with and without the motion prior.
Baseline              Sparse EPE   Outlier %   Mean #
Ours                  1.30         1.28%       3754
Ours (high-recall)    1.41         1.79%       8369
Ours (motion)         1.00         0.88%       9883
Table 6. PatchMatch-based dense stereo results with and without initialization by the global patch collider.
Baseline    1-iter               2-iter
Random      1.5940   36.81%      1.5698   36.59%
Ours        1.5863   35.26%      1.5642   34.90%
For the stereo matching task, we collected 2200 infrared stereo images of indoor scenes with a Kinect depth sensor based on structured illumination. The reference pattern is recovered using the calibration procedure in [15]. The pattern and Kinect images are rectified so that disparity is along horizontal lines [15]. We used 1000 frames as the training set and the rest as the test set. The GPC patch size is set to 7×7. For this scenario, we use the following pixel-wise difference test as the split function: $f(x; \theta) = \mathrm{sign}(x(i) - x(j) - \tau)$. Each internal node calculates the pixel-wise intensity difference at two pixel offsets $(i, j)$, and the binary decision is made according to whether the difference is smaller than the threshold $\tau$. This relative feature is illumination invariant and requires little computation. In the training stage, we trained a 10-tree forest with 16 layers for each tree over 1 million triplets randomly collected from the training set. In practice we found that using only the first 7 layers already ensures a good balance between precision and recall for this task, since the number of possible matches is greatly reduced by the epipolar constraint. For each internal node, we generate 1024 random proposals and pick the one that maximizes our objective defined in Eq. (2). Tab. 5 shows the average endpoint error and outlier percentage under different configurations. Pixels with disparity error larger than 1 are considered outliers. In this table, 'Ours' represents our standard unique-collision based matching, and 'Ours (high-recall)' represents reducing the complexity of the tree architecture (6-layer, 8-tree) in order to generate a comparable number of matches before introducing the 'motion' prior. Please note that in this 1-dimensional matching case, 'motion' learning is conducted simply by training a 1D-Gaussian binary classifier
with disparity as input for each node. With this prior and
non-unique collisions our method further increases both ac-
curacy and recall. We also conducted dense stereo recon-
struction experiment by using our sparse matching as ini-
tialization for PatchMatch based stereo [4]. In Fig. 7, we
show PatchMatch results after 1-iteration with our method
initialization and random initialization respectively. As we
can see our method could generate more completed results
and even comparable with the quality of full-iteration of
PatchMatch. Quantitative comparisons are shown in Tab. 6.
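The pixel-difference split test and the unique-collision matching used in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the heap-ordered tree layout, the function names, the choice to route a zero difference to the positive branch, and the reading of a "unique collision" as exactly one patch from each image sharing the same leaf in every tree are all assumptions made for illustration. Training (the 1024 random proposals per node and the objective of Eq. (2)) is not shown.

```python
from collections import defaultdict


def split(patch, i, j, tau):
    """Pixel-difference test sign(x(i) - x(j) - tau) on a flattened patch.

    Assumption: an argument of exactly zero is routed to +1.
    """
    return 1 if float(patch[i]) - float(patch[j]) - tau >= 0 else -1


def leaf_index(patch, nodes, depth):
    """Traverse one tree and return the leaf id in [0, 2**depth).

    `nodes` is a hypothetical heap-ordered list of (i, j, tau) tuples:
    the children of node n are 2n+1 (negative test) and 2n+2 (positive).
    Only the first `depth` layers are used, e.g. 7 of 16 as in the text.
    """
    node = 0
    for _ in range(depth):
        i, j, tau = nodes[node]
        node = 2 * node + (2 if split(patch, i, j, tau) > 0 else 1)
    return node - (2 ** depth - 1)  # subtract the number of internal nodes


def unique_collisions(patches_a, patches_b, forest, depth):
    """Return index pairs (a, b) whose patches share the same leaf in every
    tree, with no other patch from either image landing in that bucket."""
    buckets = defaultdict(lambda: ([], []))
    for side, patches in enumerate((patches_a, patches_b)):
        for idx, patch in enumerate(patches):
            key = tuple(leaf_index(patch, nodes, depth) for nodes in forest)
            buckets[key][side].append(idx)
    return sorted((a[0], b[0]) for a, b in buckets.values()
                  if len(a) == 1 and len(b) == 1)


if __name__ == "__main__":
    # Toy forest: a single one-layer tree comparing pixel 0 with pixel 1.
    toy_forest = [[(0, 1, 0.0)]]
    print(unique_collisions([[5, 1], [1, 5]], [[0, 9], [9, 0]], toy_forest, 1))
```

Because each patch is hashed independently and matches are read off the buckets, no pairwise distance is computed, which is the property the section relies on.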
4.4. Feature Matching for Wide-baseline Stereo
We also conduct an experiment to show the domain transfer ability of GPC. To be more specific, we trained a Global Patch Collider on the Sintel dataset for the optical flow task and used it for matching correspondences on the EPFL wide-baseline stereo data [34]. Given the camera poses, we use our patch collider to find matches across two images and then discard those violating the epipolar constraints. Errors are measured in 3D space by projecting the two matched points back into world coordinates using the GT depth. A matched pair is considered an outlier if the ℓ2-error is larger than 0.15m in 3D space. Fig. 8 depicts an example of matches across 1 frame, 2 frames and 3 frames respectively. Green lines are inliers and blue lines are outliers. We also consider Sift as a baseline approach and report the quantitative results averaged over all frames of the ‘Fountain’ subset in Tab. 7. From the table and figure we can see that our method achieves better results for small-baseline cases, but the performance drops in the wide-baseline case. This is expected since the GPC model is trained on the Sintel dataset, where large viewpoint changes and significant patch deformations barely happen between adjacent frames.
Method Sift Ours
Frame Diff 1 2 3 1 2 3
EPE 0.12 0.21 1.07 0.04 0.17 0.94
Inlier% 96% 92% 71% 98% 87% 53%
Mean # 669 239 71 6933 1143 200
Table 7. Quantitative analysis of our method’s domain transfer
ability on wide-baseline stereo matching (trained on Sintel).
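The evaluation protocol of this experiment (epipolar filtering from known camera poses, then a 3D error test at 0.15 m) can be sketched as follows. This is only an illustration under stated assumptions: the camera convention x_cam = R x_world + t, depth measured along the camera z-axis, and all function names are hypothetical, and the paper does not state the pixel tolerance used for the epipolar check.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental(K1, R1, t1, K2, R2, t2):
    """Fundamental matrix F with x2^T F x1 = 0, assuming x_cam = R x_world + t."""
    R = R2 @ R1.T          # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1        # relative translation
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)


def epipolar_distance(uv1, uv2, F):
    """Distance in pixels from uv2 to the epipolar line of uv1 in image 2."""
    line = F @ np.array([uv1[0], uv1[1], 1.0])
    return abs(line @ np.array([uv2[0], uv2[1], 1.0])) / np.hypot(line[0], line[1])


def backproject(uv, depth, K, R, t):
    """Lift a pixel with ground-truth depth (along camera z) to world coordinates."""
    x_cam = depth * (np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0]))
    return R.T @ (x_cam - t)


def is_outlier_3d(uv1, d1, cam1, uv2, d2, cam2, thresh_m=0.15):
    """A match is an outlier if the two back-projected points are further
    apart than thresh_m (0.15 m in the text) in world space."""
    p1 = backproject(uv1, d1, *cam1)
    p2 = backproject(uv2, d2, *cam2)
    return bool(np.linalg.norm(p1 - p2) > thresh_m)
```

In use, a match would first be kept only if `epipolar_distance` falls below some chosen pixel tolerance, and the surviving matches would then be scored with `is_outlier_3d` to obtain the inlier percentage reported in Tab. 7.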
5. Conclusion
This paper proposes a novel algorithm, the Global Patch
Collider, for the computation of global point-wise corre-
spondences in images. The proposed method is based
on detecting unique collisions between image points us-
ing a collection of learned tree structures that act as con-
ditional hash functions. Our algorithm is extremely effi-
cient, fully-parallelizable, task-specific and does not require
any pairwise comparison. Experiments on optical flow and
stereo matching validate the performance of the proposed
method. Future work includes high-level applications such
as hand tracking, and nonrigid reconstruction of deformable objects.
[1] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense
correspondence fields for highly accurate large displacement
optical flow estimation. In ICCV. 1,4,6
[2] L. Bao, Q. Yang, and H. Jin. Fast edge-preserving patch-
match for large displacement optical flow. In TIP, 2014. 1,
[3] C. Barnes, E. Shechtman, D. B. Goldman, and A. Finkelstein.
The generalized patchmatch correspondence algorithm.
In ECCV, 2010. 1
[4] M. Bleyer, C. Rhemann, and C. Rother. Patchmatch stereo -
stereo matching with slanted support windows. In BMVC,
2011. 1,6,8
[5] A. Bosch, A. Zisserman, and X. Munoz. Image classification
using random forests and ferns. In ICCV, 2007. 2,4
[6] H. Bristow, J. Valmadre, and S. Lucey. Dense semantic cor-
respondence where every pixel is a classifier. In ICCV, 2015.
[7] T. Brox and J. Malik. Large displacement optical flow: de-
scriptor matching in variational motion estimation. In PAMI,
2011. 1,6
[8] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A
naturalistic open source movie for optical flow evaluation.
In ECCV, 2012. 3,6
[9] Z. Chen, H. Jin, Z. Lin, S. Cohen, and Y. Wu. Large displace-
ment optical flow from nearest neighbor fields. In CVPR,
2013. 1,2
[10] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests:
A unified framework for classification, regression, density
estimation, manifold learning and semi-supervised learning.
In Now, 2012. 2,3
[11] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni.
Locality-sensitive hashing scheme based on p-stable distri-
butions. In SOCG, 2004. 6
[12] B. Drayer and T. Brox. Combinatorial regularization of de-
scriptor matching for optical flow estimation. In BMVC,
2015. 6
[13] S. R. Fanello, C. Keskin, S. Izadi, P. Kohli, D. Kim,
D. Sweeney, A. Criminisi, J. Shotton, S. Kang, and T. Paek.
Learning to be a depth camera for close-range human cap-
ture and interaction. In ACM SIGGRAPH and Transaction
On Graphics, 2014. 2
[14] S. R. Fanello, C. Keskin, P. Kohli, S. Izadi, J. Shotton, A. Cri-
minisi, U. Pattacini, and T. Paek. Filter forests for learning
data-dependent convolutional kernels. In CVPR, 2014. 2
[15] S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle,
S. Orts Escolano, D. Kim, and S. Izadi. Hyperdepth: Learn-
ing depth from structured light without matching. In CVPR,
2016. 2,7
[16] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş,
V. Golkov, P. van der Smagt, D. Cremers, and T. Brox.
Flownet: Learning optical flow with convolutional networks.
In ICCV, 2015. 2
[17] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au-
tonomous driving? the kitti vision benchmark suite. In
CVPR, 2012. 6,7
[18] K. He and J. Sun. Computing nearest-neighbor fields via
propagation-assisted kd-trees. In CVPR, 2012. 1,2
[19] R. Kennedy and C. J. Taylor. Optical flow with geometric oc-
clusion estimation and fusion of multiple frames. In EMM-
CVPR, 2015. 6
[20] B. Kitt, A. Geiger, and H. Lategahn. Visual odometry based
on stereo image sequences with ransac-based outlier rejec-
tion scheme. In IV, 2010. 1,6
[21] P. Kontschieder, S. R. Bulò, A. Criminisi, P. Kohli,
M. Pelillo, and H. Bischof. Context-sensitive decision forests
for object detection. In NIPS, 2012. 2
[22] S. Korman and S. Avidan. Coherency sensitive hashing. In
ICCV, 2011. 1,2,6
[23] L. Ladický, C. Häne, and M. Pollefeys. Learning the matching
function. In arXiv preprint arXiv:1502.00652, 2015. 1
[24] D. G. Lowe. Object recognition from local scale-invariant
features. In ICCV, 1999. 1,6
[25] M. Menze, C. Heipke, and A. Geiger. Discrete optimization
for optical flow. In GCPR, 2015. 1,6
[26] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid.
Epicflow: Edge-preserving interpolation of correspondences
for optical flow. In CVPR, 2015. 4,6,7
[27] S. Roth and M. J. Black. On the spatial statistics of optical
flow. In IJCV, 2007. 2
[28] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio,
R. Moore, A. Kipman, and A. Blake. Real-time human pose
recognition in parts from single depth images. In CVPR,
2011. 2
[29] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and
F. M. Noguer. Discriminative learning of deep convolutional
feature point descriptors. In ICCV, 2015. 1,2
[30] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning lo-
cal feature descriptors using convex optimisation. In PAMI,
2014. 1
[31] S. Singh, A. Gupta, and A. Efros. Unsupervised discovery
of mid-level discriminative patches. In ECCV, 2012. 1
[32] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua.
Ldahash: Improved matching with smaller descriptors. In
PAMI, 2012. 1
[33] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow
estimation and their principles. In CVPR, 2010. 6
[34] E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense
descriptor applied to wide-baseline stereo. In PAMI, 2010.
[35] T. Trzcinski, M. Christoudias, P. Fua, and V. Lepetit. Boost-
ing binary keypoint descriptors. In CVPR, 2013. 1
[36] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid.
Deepflow: Large displacement optical flow with deep match-
ing. In ICCV, 2013. 1,6,7
[37] L. Xu, J. Jia, and Y. Matsushita. Motion detail preserving
optical flow estimation. In PAMI, 2012. 1,6
[38] S. Zagoruyko and N. Komodakis. Learning to compare im-
age patches via convolutional neural networks. In CVPR,
2015. 1,2
[39] J. Žbontar and Y. LeCun. Computing the stereo matching
cost with a convolutional neural network. In CVPR, 2015. 1