Filter Forests for Learning Data-Dependent Convolutional Kernels


Sean Ryan Fanello¹,², Cem Keskin¹, Pushmeet Kohli¹, Shahram Izadi¹, Jamie Shotton¹, Antonio Criminisi¹, Ugo Pattacini², Tim Paek¹
¹ Microsoft Research   ² iCub Facility - Istituto Italiano di Tecnologia
Abstract
We propose ‘filter forests’ (FF), an efficient new discrimi-
native approach for predicting continuous variables given
a signal and its context. FF can be used for general signal
restoration tasks that can be tackled via convolutional filter-
ing, where it attempts to learn the optimal filtering kernels
to be applied to each data point. The model can learn both
the size of the kernel and its values, conditioned on the ob-
servation and its spatial or temporal context. We show that
FF compares favorably to both Markov random field based
and recently proposed regression forest based approaches
for labeling problems in terms of efficiency and accuracy.
In particular, we demonstrate how FF can be used to learn
optimal denoising filters for natural images as well as for
other tasks such as depth image refinement, and 1D signal
magnitude estimation. Numerous experiments and quanti-
tative comparisons show that FFs achieve accuracy at par
or superior to recent state of the art techniques, while being
several orders of magnitude faster.
1. Introduction
Probabilistic models such as pairwise Markov random fields (MRFs) and conditional random fields (CRFs) have been used for solving many pixel-wise labeling problems encountered in computer vision, including semantic segmentation, optical flow, image denoising, and stereo [3, 17].
These models allow the relationships between interacting
variables (such as those corresponding to neighboring pixels)
to be modeled easily, and typically lead to results which are
smooth and respect edges in the image.
Despite impressive results, field-based models have two
drawbacks: estimating the structure and parameters of mod-
els is hard, and inference of the maximum a posteriori so-
lution under the model can be expensive. This has led re-
searchers to investigate faster forest-based alternatives.
Decision forests [1, 4, 8, 23] have been used for problems such as body part classification [24] and organ detection [6].
While they enable efficient prediction, they also have a ma-
jor drawback. In general, forest-based models make predic-
tions under the assumption that output variables (such as
pixel labels) are independent, and thus fail to enforce spatial
smoothness. A recent exception is the work in [16] where
the authors investigate the use of an iterative, stacking tech-
nique for the prediction of categorical (discrete) labels while
respecting spatial context.
In this paper we propose filter forests (FF), a non-iterative
forest-based predictor for the estimation of continuous vari-
ables, which can incorporate a learned model of spatial or
temporal context. We apply FF here to the task of restoring
corrupted n-dimensional signals.
Our main contribution is in the use of a highly-efficient
multi-scale decision forest that is trained to discriminatively
predict which filter should be applied at any given set of
variables (e.g. pixels). FF has the following properties: (i)
the forest recursively partitions the input signal such that a
simple convolution kernel is appropriate at each leaf; (ii) we
train the forest to minimize a new, regularized least squared
error of the kernels at each leaf; (iii) we are able to achieve
state of the art accuracy at a speed that is orders of mag-
nitude faster than state of the art; and (iv) no field-based
post-processing is required.
We evaluate FFs on a number of tasks including natural
image denoising, depth image refinement, and 1D signal
magnitude estimation. Experiments on a popular denoising
image dataset show that our method achieves accuracy which
is at par or superior to state of the art, but four orders of
magnitude faster. Improved results are also demonstrated for
depth image refinement and 1D signal magnitude estimation,
hinting at the generality of our method. We believe FFs can
be extended to many other related tasks such as learned edge
detection, image sharpening, or even morphological filtering
and discrete classification tasks.
1.1. Related Work
Convolution in Image Processing. Bilateral [27] and Wiener [30] filters are popular for noise removal in images. The former replaces the intensity at a pixel $i$ with a weighted combination of nearby intensities, with weights depending on both intensity difference and spatial distance from $i$. The latter defines local filters conditioned on local variance estimates. Both filters try to preserve edges, but despite their popularity they do not always yield good denoising results.
Figure 1. Filter forests for learned, spatially-varying filtering. Filter forests learn an optimal convolution kernel as a function of the local image appearance. (left) Input noisy image, with superimposed select locations. (middle) Learned convolution filters for corresponding locations (color-coded). (right) Denoised image. In textureless regions (red square) the filter forest learns to use large, isotropic smoothing filters. At object boundaries (orange or green squares), by contrast, the forest learns to use directional, edge-preserving filters.
Further adaptive techniques include non-local means [5], K-SVD [11], LSSC [18] and CSR [10]. One of the most accurate methods, BM3D [9], uses local similarity of noisy patches to obtain a single prediction.
Filtering for Encouraging Spatial Coherence. A number of papers have considered the use of filtering for pixel labeling tasks. Convolving a unary likelihood with a filter can be used for inferring solutions to problems such as optical flow, stereo and figure-ground segmentation [7, 20, 26]. Kontschieder et al. [16] integrated these ideas in a sequential classification approach where filtered outputs from one layer of predictors were given as input to the next layer of predictors. These methods are efficient, but their modeling power is limited by the fact that only a single fixed form of filter function is used. In contrast, our method uses a tree to select the filter to be applied from a potentially unboundedly large dictionary. Our method can be seen as a generalization of the convolution model, where the kernel is no longer fixed for the entire image, but varies spatially (or temporally), conditioned on the local appearance of the input signal.
Decision Trees for Structured-Output Prediction. The use of decision trees in combination with random fields for the purposes of image denoising has been investigated in [14, 19]. There, Nowozin et al. have shown how arbitrary, data-dependent pairwise potentials in CRF models can be encoded using a decision tree, and learned efficiently. Decision trees have also been used for reducing the computational complexity of inference methods. Shapovalov et al. [22] recently showed how the inference machines framework proposed in [21] can be used in conjunction with decision forests to efficiently assign class labels to 3D points. Our work differs from these methods in that it does not require explicit inference, it is non-iterative and involves a simple (yet spatially-varying) convolution operation.
2. Method
Many problems in image processing and computer vision can be formulated as a regression problem where we want to learn a mapping $f_w : X \to Y$ from the space $X$ of inputs to the space $Y$ of outputs that is parameterized by some parameters $w$. Learning these predictors is generally formulated using the principle of empirical risk minimization:

$$w^\star = \arg\min_w \sum_{i \in T} L\big(y_i^{GT}, f_w(x_i)\big) \tag{1}$$

where $T$ is the training data composed of pairs of input and expected (ground truth) output $(x_i, y_i^{GT})$, and the function $L$ computes the loss of predicting $f_w(x_i)$ when the true solution is $y_i^{GT}$.
As an example, consider signal processing tasks that require recovering the original signal $y$, or some of its features (like edges), from possibly noisy observations $x$. The noise can be additive, such that $x = y + \epsilon$, or multiplicative, Poisson, etc. Convolution of $x$ with a kernel $w$ is considered to be an efficient method to obtain an approximation to the original signal $y$. If we assume that the form of the prediction function $f_w(x_i)$ is a convolution and the loss is the squared error, then we can find the optimal kernel as:

$$w^\star = \arg\min_w \sum_{i \in T} \| y_i^{GT} - x_i^\top w \|^2 \tag{2}$$

where $i$ is a data point index, and $x_i$ is the $k$-dimensional context of data point $i$. However, applying a fixed kernel homogeneously to the entire signal rarely gives satisfactory results, because signal and noise characteristics may vary throughout the signal. Our aim is to train a model that efficiently matches a data point and its context $x_i$ with the optimal kernel $w$ learned from a training set.
We propose solving this problem with random forests that store optimal filters $w$ at the leaves and learn to efficiently assign these to the input $x_i$. Notably, these filters can be local, specific and simple, yet the forest can still handle complicated types of noise (much as a piecewise linear function can approximate a continuous function). For instance, FF can handle multiplicative noise by approximating its response with additive noise at each leaf. Therefore, we choose to store linear filters that are learned by assuming additive noise, for which there is an effective closed form solution via least squares optimization [12]:

$$w^\star = \arg\min_w \| y - Xw \|^2, \tag{3}$$

where $y \in \mathbb{R}^N$ is the vector of the $N$ ground truth (noise free) values, and $X \in \mathbb{R}^{N \times k}$ is the matrix of $N$ observations where each row is an $x_i^\top$.
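As a rough illustration of Eq. (3), the sketch below fits a single linear kernel by least squares on synthetic 1D data. The helper names (`extract_windows`, `fit_kernel`), the sinusoidal test signal, the window length and the noise level are assumptions made for this example and do not come from the experiments reported here.

```python
import numpy as np

def extract_windows(signal, k):
    """Stack every length-k window of a 1D signal as one row (k is odd)."""
    n = len(signal) - k + 1
    return np.stack([signal[i:i + k] for i in range(n)])

def fit_kernel(X, y):
    """Least-squares kernel w minimizing ||y - X w||^2, as in Eq. (3)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
k = 7                                                # assumed context size
clean = np.sin(np.linspace(0, 8 * np.pi, 2000))      # assumed ground-truth signal
noisy = clean + rng.normal(scale=0.3, size=clean.shape)  # additive noise

X = extract_windows(noisy, k)                        # N x k matrix of observations
y = clean[k // 2 : k // 2 + len(X)]                  # ground truth at window centres
w = fit_kernel(X, y)

mse_noisy = np.mean((X[:, k // 2] - y) ** 2)         # error of the raw centre samples
mse_filtered = np.mean((X @ w - y) ** 2)             # error after applying the kernel
print(f"noisy MSE {mse_noisy:.4f}, filtered MSE {mse_filtered:.4f}")
```

Because the kernel that simply copies the centre sample is one of the candidates in the least-squares problem, the fitted kernel cannot do worse than the raw observations on the data it was fitted to.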
Note that the linear filtering $Xw$ could be replaced with a non-linear function $f_w(X)$ suitable for other types of problems. Examples include a median filter for different types of noise, or a sigmoid function that could cast a discrete labeling problem into a continuous one, so that it can be solved via regression.
We will now describe this model in more detail. From
this point on we will focus on a specific application, namely
image denoising, to make the explanations more intuitive.
2.1. Filter Forests
We use regression forests to learn a partitioning of the space of image patches $x$ such that within each partition a simple linear convolution produces (measurably) good results. We will show: i) how to train a FF to minimize a well-defined denoising loss from labeled training data, ii) how our regularizer encourages edge-preserving filters, and iii) how to learn an appropriate scale for each kernel. Our experiments in Sec. 3 show that at test time the FF can produce accurate denoising at a very high speed. Below, we give a brief review of standard decision forests, before presenting details on FFs.
2.1.1 Conventional decision forests
A forest is an ensemble of $T$ decision trees [1, 4, 8]. Each tree consists of non-terminal (split) and terminal (leaf) nodes. At test time a pixel $i$ is passed into the root node. At each split node $j$, a split function $f(i; \theta_j)$ is evaluated. This computes a binary decision based on some function of the image that surrounds the pixel $i$, based on learned parameters $\theta_j$.¹ Depending on this binary decision, the pixel passes either to the left or right child. When a leaf is reached, a stored histogram over class labels (for classification) or a density over continuous variables (for regression) is output.

Each tree in the forest is trained independently. A set of example pixel indices $i$ and their corresponding ground truth labels are provided. Starting at the root, a set of candidate binary split function parameters $\theta$ are proposed at random, and for each candidate, the set of training pixels is partitioned into left and right child sets. An objective function (typically information gain for classification problems) is evaluated given each of these candidate partitions, and the best $\theta$ is chosen. Training then continues greedily down the tree, recursively partitioning the original set into successively smaller subsets. Training stops when a node reaches a maximum depth $D$ or contains too few examples.

¹ Function $f$ can be any arbitrary function of the image region surrounding pixel $i$ and is not limited to using just the patch $x_i$.
Figure 2. PCA-based image features. The first 10 principal components (shown in false color), computed from noisy training patches at scales 11x11, 7x7, and 3x3. They represent smooth regions, edges, corners as well as more complex patterns. Note that these PCA components are only used to compute the features used by the decision forest to partition the appearance space; the final filtered output image is computed directly from the noisy input.
2.1.2 Details on filter forests
Training data. For FFs, we use pairs of noisy image patches $x_i$ and their corresponding noise-free ground truth values $y_i$ (for the central pixel only).² As is standard with denoising techniques, for our experiments we generate training data by adding synthetic noise of particular characteristics (e.g. Gaussian with $\sigma = 20$ grey levels) to the noise-free ground truth images.
Feature types. We employ two types of image features in FFs. The notation $\theta$ is used to encapsulate which type of feature is used, the parameters of that feature, and a threshold applied to a scalar value that results in the required binary decision. All aspects of $\theta$ are learned at each split node through randomized optimization [8].
The first type of feature is a multi-scale filter bank (as illustrated in Fig. 2). At each of the resolutions 3x3, 5x5, 7x7, 9x9, and 11x11, a PCA is performed on all the noisy training patches. For both training and testing, the filter bank is applied at each resolution as a convolution across the image. We allow two sub-types of this feature: unary and pairwise. The unary feature simply takes the response at pixel $i$ of one of the top 10 PCA components in one of the resolutions, whereas the pairwise feature takes the difference of responses (both at pixel $i$) between two different PCA components in one of the resolutions. The split function $f$ then simply applies the learned threshold to the resulting scalar value. The choice of unary/pairwise, the PCA components, the resolutions, and the best threshold are all learned through randomized node optimization. We use PCA eigenvectors instead of predefined filter banks [14] because of PCA's generality across different types of signal. Moreover, PCA ensures that the principal directions represent suitable features for axis-aligned splits at internal nodes.
The second type of feature is based on a local estimate of uniformity at each pixel. At each scale of 3x3, 5x5, 7x7, 9x9, and 11x11, we compute the variance of each image patch. We only employ a unary variant of this type of feature. The function $f$ applies the learned threshold to the variance response at pixel $i$ at one of the scales.

² While we use $x_i$ to conveniently denote a particular patch of the noisy image, for neighboring pixels these patches will share values, and the individual $x_i$s need not be explicitly computed.
Training objective. One of the main contributions of our work is the form of the training objective. Let $S_j$ denote the subset of training data that arrives at any node $j$ of the tree. The split function at node $j$ is parameterized with parameters $\theta$ and splits $S_j$ into the left and right children subsets $S_j^L(\theta)$ and $S_j^R(\theta)$ respectively. The training algorithm selects the parameters of the split function by minimizing the energy $E(\theta, S_j)$ as

$$\theta_j = \arg\min_\theta E(\theta, S_j), \tag{4}$$

where $E$ is defined as the weighted sum of energies of the two child nodes

$$E(\theta, S_j) = \sum_{c \in \{L, R\}} |S_j^c(\theta)| \, E^c(S_j^c(\theta)). \tag{5}$$

The energy of each child node is computed as

$$E^c(S_j^c) = \| y_j^c - X_j^c w^\star \|^2, \tag{6}$$

where

$$w^\star = \arg\min_w \| y_j^c - X_j^c w \|^2 + \| \Gamma(X_j^c, y_j^c)\, w \|^2. \tag{7}$$

Here, $X_j^c$ and $y_j^c$ are computed from $S_j^L$ and $S_j^R$, and represent the set of noisy training data patches and the set of noise-free ground truth values that have reached the left ($c = L$) or right ($c = R$) child respectively of node $j$. Further, $\Gamma(X, y)$ represents a data-dependent regularization weighting matrix.
Data-dependent regularized training. The above mentioned optimization task would be a standard regularized least squares problem if not for the use of $\Gamma(X, y)$. As will become clearer later, $\Gamma$ encourages edge-preserving regularization. The closed form solution of the minimization problem (7) can be computed as

$$w^\star = (X^\top X + \Gamma^\top \Gamma)^{-1} X^\top y. \tag{8}$$

If $\Gamma$ were set to identity, then the regularization term in (7) would encourage kernels to have smaller norms. In this work, we instead investigate the use of a data-dependent regularization, where smaller entries in $w$ are encouraged only at positions that differ from the pixel $i$ at the center of the patch.

To achieve this, we use a diagonal $\Gamma$ matrix of size $p^2 \times p^2$ computed as a function of the data matrix $X$ and ground truth values as follows:

$$\gamma_d(X, y) = \frac{1}{N} \sum_{i=1}^{N} (x_{i,d} - y_i)^2, \tag{9}$$

where $d \in \{1, \ldots, p^2\}$ indexes pixels within the patch, $N$ is the total number of samples (i.e. the number of rows) in $X$,
Figure 3. The importance of the $\Gamma$ weighting matrix. Column 1: different subsets of image patches that comprise $X$. Column 2: the resulting $\Gamma(X, y)$ weighting matrices. Column 3: the learned filters $w^\star$. Rows 1 and 2 use the $\Gamma$ computed in (9), while row 3 uses $\Gamma = I$. Row 1 uses a randomly chosen set of patches, while rows 2 and 3 use the same set of patches with a vertical edge just to the right of the central pixel. Our weighted regularization encourages edge-aware filters. See text for more details.
and $x_{i,d}$ and $y_i$ are entries in $X$ and $y$ respectively. Intuitively, the $\Gamma$ matrix indicates which regions of the noisy image patch are most likely to describe the ground-truth value $y_i$ at the center of the patch. A pixel $d$ with a larger value of $\gamma_d$ will discourage the filter $w$ from using pixel $d$, while smaller values of $\gamma_d$ will encourage the use of pixel $d$.
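The regularized solve of Eqs. (8) and (9) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names are hypothetical, and the toy data use 5-pixel row patches (centre at index 2) with an intensity edge between positions 2 and 3, standing in for the 2D patches with a vertical edge discussed in the text.

```python
import numpy as np

def gamma_diagonal(X, y):
    """Diagonal of Gamma per Eq. (9): for each patch position d, the mean
    squared difference between x_{i,d} and the ground-truth centre value y_i."""
    return np.mean((X - y[:, None]) ** 2, axis=0)

def fit_regularized_kernel(X, y):
    """Closed-form solution of Eq. (8) for a diagonal Gamma."""
    gamma = gamma_diagonal(X, y)
    # For a diagonal Gamma, Gamma^T Gamma is diag(gamma ** 2).
    A = X.T @ X + np.diag(gamma ** 2)
    return np.linalg.solve(A, X.T @ y)

# Toy data (assumption): the centre pixel lies on the left side of an edge.
rng = np.random.default_rng(0)
n = 2000
a = rng.uniform(0, 255, size=n)            # intensity left of the edge
b = rng.uniform(0, 255, size=n)            # intensity right of the edge
clean = np.column_stack([a, a, a, b, b])   # positions 0..2 left, 3..4 right
X = clean + rng.normal(scale=20.0, size=clean.shape)
y = a                                      # noise-free value of the centre pixel

gamma = gamma_diagonal(X, y)
w = fit_regularized_kernel(X, y)
print("gamma:", np.round(gamma, 1))
print("w:    ", np.round(w, 4))
```

On this toy data the entries of `gamma` for the two positions across the edge are much larger than those on the centre's side, so the fitted kernel places nearly all of its weight on the three left positions, which is the edge-aware behaviour described above.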
To illustrate the importance of this regularization term, we consider two sets of patches that are shown in Figure 3: randomly generated (first row) and with vertical edges (second and third rows). From these sets we compute $\Gamma$ (second column) and hence, using (8), the filter $w$ (third column). Using random patches we obtain a filter that is, as we might expect, somewhat isotropic. In the second row we select patches containing vertical edges where the central pixel is to the left of the edge. As expected, the regularizer encourages the filter to consider only the pixels to the left of the edge that are likely to be good predictors of the central pixel. This is an example of how the regularization encourages edge-preserving filters.

The third row in Figure 3 shows the solution obtained for patches containing vertical edges by setting $\Gamma = I$. In this final case all dimensions $d$ of the filter are equally regularized (no data dependence), and the result is a poor filter that is unlikely to generalize well.
Multi-scale filter learning. The optimization of (7) is carried out by testing different filter sizes, from $3 \times 3$ to $11 \times 11$, and selecting the one corresponding to the lowest error. This simple scheme allows us to compute the appropriate kernel size for each leaf, depending on the surrounding context of the pixel. Examples of such data-dependent kernels are shown in Figure 1. Smooth areas are likely to be restored by large kernels (in red), whereas fine details are recovered using smaller and/or directional filters (orange).

The training phase ends when the tree reaches a certain depth or the number of examples is too small. Each leaf stores a different filter kernel $w_j^\star$, learned using the regularized least squares error minimization (8) computed over the subset of labeled training examples that reached leaf $j$.
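The scale selection described above can be sketched as a loop over candidate kernel sizes. The sketch below is illustrative only: `patches_and_targets` and `select_scale` are hypothetical names, the synthetic ramp image stands in for the samples reaching one leaf, and the error is measured on the same samples used for fitting, since the text does not say whether held-out data is used.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patches_and_targets(noisy, clean, p, margin):
    """Flattened p x p patches centred on every pixel at least `margin`
    pixels from the border, with the clean centre values as targets."""
    h, w = noisy.shape
    off = margin - p // 2
    win = sliding_window_view(noisy, (p, p))
    win = win[off:h - margin - p // 2, off:w - margin - p // 2]
    X = win.reshape(-1, p * p)
    y = clean[margin:h - margin, margin:w - margin].reshape(-1)
    return X, y

def fit_regularized_kernel(X, y):
    """Eq. (8) with the diagonal Gamma of Eq. (9)."""
    gamma = np.mean((X - y[:, None]) ** 2, axis=0)
    return np.linalg.solve(X.T @ X + np.diag(gamma ** 2), X.T @ y)

def select_scale(noisy, clean, sizes=(3, 5, 7, 9, 11)):
    """Fit one kernel per size on the same pixels and keep the lowest error."""
    margin = max(sizes) // 2
    errors, kernels = {}, {}
    for p in sizes:
        X, y = patches_and_targets(noisy, clean, p, margin)
        w = fit_regularized_kernel(X, y)
        errors[p] = float(np.mean((y - X @ w) ** 2))
        kernels[p] = w.reshape(p, p)
    best = min(errors, key=errors.get)
    return best, kernels[best], errors

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 255.0, 48), (48, 1))   # smooth ramp (assumption)
noisy = clean + rng.normal(scale=20.0, size=clean.shape)
best, kernel, errors = select_scale(noisy, clean)
print("selected size:", best)
```

All sizes are evaluated on the same set of centre pixels so that their errors are directly comparable.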
Test time signal filtering. At test time, for each tree, the test pixel $i$ is sent into the root node. Then each split node sends it to one of its child nodes until it reaches a leaf $j(i)$. Thanks to the hierarchical structure of the nodes, this descent is very efficient. The kernel $w_{j(i)}$ stored at leaf $j(i)$ is looked up, and applied as a dot product to the noisy image patch at $i$, to produce the denoised output $\hat{y}_i = x_i^\top w_{j(i)}$. Thus, a different kernel is applied to each input patch, conditioned on the patch appearance. In a forest, each tree is applied in turn and the results averaged. Different trees can be handled independently and in parallel for further efficiency.
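The test-time procedure above can be sketched as follows. The `Node` layout, the function names and the hand-built two-tree forest are assumptions for illustration; in an actual FF the split parameters and leaf kernels would be learned as described earlier, and split features would include the PCA responses as well as the variance.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # Split nodes set feature/threshold/left/right; leaves set kernel only.
    feature: Optional[Callable[[np.ndarray], float]] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    kernel: Optional[np.ndarray] = None

def find_leaf(node, patch):
    """Descend from the root until a leaf (a node holding a kernel) is reached."""
    while node.kernel is None:
        node = node.left if node.feature(patch) < node.threshold else node.right
    return node

def predict_pixel(forest, patch):
    """Apply each tree's leaf kernel as a dot product and average the results."""
    x = patch.reshape(-1)
    outputs = [x @ find_leaf(tree, patch).kernel for tree in forest]
    return float(np.mean(outputs))

# Hand-built example (assumption): low-variance patches get a box filter,
# high-variance patches keep the centre pixel unchanged.
box = np.full(9, 1.0 / 9.0)
centre_only = np.zeros(9)
centre_only[4] = 1.0

def make_tree(threshold):
    return Node(feature=lambda p: float(np.var(p)), threshold=threshold,
                left=Node(kernel=box), right=Node(kernel=centre_only))

forest = [make_tree(50.0), make_tree(100.0)]

flat_patch = np.full((3, 3), 10.0)
edge_patch = np.array([[0.0, 0.0, 100.0]] * 3)
flat_out = predict_pixel(forest, flat_patch)   # both trees pick the box filter
edge_out = predict_pixel(forest, edge_patch)   # both trees pick the centre-only filter
```

The per-pixel work is one root-to-leaf descent and one dot product per tree, which matches the cost terms discussed under algorithm complexity.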
Algorithm Complexity. The runtime complexity of the algorithm depends linearly on the size of the signal, i.e. $O(|X|)$. In particular, for image denoising, the number of operations per pixel is:

$$10\,O(p^2) + T\,(O(p^2) + O(d)) \tag{10}$$

where $p^2$ is the biggest patch size (in our case $11 \times 11$). The first term $10\,O(p^2)$ is due to the PCA projector computations, and $T\,O(p^2)$ is the dot product between the predicted filters $W$ and the patch. Finally, $T\,O(d)$ is the cost for descending the forest, with $T$ trees of depth $d$. This cost being negligible, we can approximate the whole running time as $(T + 10)\,O(p^2)$ per pixel. These operations can be parallelized since the result at one pixel does not depend on the results at other pixels. This further decreases the overall running cost. Notably, the cost per pixel is much lower than that of one of the fastest state of the art methods [9] (see time comparisons in Table 1).
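To give a feel for the magnitudes in Eq. (10), the short calculation below plugs in example values. It treats each $O(\cdot)$ term as having a unit constant, which is an assumption made purely for illustration, and it borrows $T = 4$ trees of depth 12 from the depth refinement experiment in Sec. 3.2, since the forest size used for image denoising is not stated in this section.

```python
def ops_per_pixel(num_trees, patch_side, depth, num_pca=10):
    """Operation count per pixel following Eq. (10), with unit constants."""
    pca_cost = num_pca * patch_side ** 2        # 10 O(p^2): PCA projections
    filter_cost = num_trees * patch_side ** 2   # T O(p^2): one dot product per tree
    descent_cost = num_trees * depth            # T O(d): one test per tree level
    return pca_cost + filter_cost + descent_cost

exact = ops_per_pixel(num_trees=4, patch_side=11, depth=12)
approx = (4 + 10) * 11 ** 2                     # (T + 10) O(p^2), descent ignored
print(exact, approx)
```

With these values the descent term contributes 48 of 1742 operations, under 3% of the total, which is why it can be dropped in the approximation.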
Implementation details. During training, for each tree node, we use reservoir sampling [28] to select a subset of example pixels $i$. This speeds up the training process while ensuring a certain degree of randomness. Only the variance feature at multiple scales is allowed to be selected in the very first levels of the tree. This helps separate smooth regions from textured ones early on.
3. Results and Comparisons
This section presents results for FFs applied to each of the following tasks: (i) image denoising, (ii) depth image refinement, and (iii) 1D dynamical system filtering. For the first problem, we performed exhaustive evaluations using the popular BSDS500 benchmark [2], comparing our method with current state of the art algorithms. The depth denoising task has been evaluated using the 7 Scenes dataset from [25]. For the third experiment we designed a real, noisy dynamical system, and compared the predictions of filter forests with those from the standard Kalman and Least Mean Squares (LMS) filters. These experiments demonstrate the flexibility and effectiveness of the proposed framework in different application domains.
3.1. Image Denoising
For image denoising experiments, we use the same protocol as in [14]: images are re-scaled by a factor of 0.5, the validation set is used for parameter tuning, and the final model is evaluated on the test set. To form training images, Gaussian noise with zero mean and standard deviation $\sigma \in \{20, 30, 40, 50\}$ is added to every pixel in the images independently. We trained our method on 300 images from the training and validation sets, and the test set is composed of 200 images. The results are shown in Table 1. The proposed FF-based method achieves competitive PSNR scores with an average running cost of 0.025 seconds per image. To improve the accuracy further, we apply the collaborative Wiener filter proposed in [9] as a post-processing step (FF_Wiener in the table). This markedly improves the accuracy of the method (by roughly 0.7 to 0.9 dB in Table 1) for a small cost in speed, making it better than all the other methods except for RTF_ALL [14], which uses denoised images computed with four state-of-the-art methods as its input. For input noise $\sigma = 20$ we match the accuracy of this algorithm, with a running time of 0.125 seconds compared to the 1275 seconds required by RTF_ALL. Hence, we conclude that, for more realistic levels of noise, FF performs as well as the state-of-the-art, while being 4 orders of magnitude faster. Applying the same post-processing step to other methods did not yield significant increases in accuracy in our experiments. Qualitative comparisons are given in Fig. 4.
3.2. Depth Image Refinement
We conduct experiments on the first sequence of each scene in the 7 Scenes Dataset [25], which has a total of 6500 frames. Synthetic 'ground truth' depth was computed using a KinectFusion [13] reconstruction of the scene. Training images are generated by adding depth-dependent Poisson noise to each pixel. We train an FF (four trees of depth 12 with multi-scale patches) on 1000 frames selected from the Chess scene, and test the model on the other scenes. The quantitative results are reported in Table 2, showing that the proposed approach outperforms the compared methods. This
Figure 4. Qualitative results of natural image denoising. From left to right: noisy input, ground truth, FF output, BM3D first phase, FF with Wiener filtering, BM3D with Wiener filtering. Best viewed digitally at high zoom.
Method          σ = 20   σ = 30   σ = 40   σ = 50   Time (s)
Wiener          25.73    24.79    23.95    23.19    0.001
FF              28.75    26.55    25.32    24.51    0.025
FF_Wiener       29.65    27.48    26.15    25.25    0.125
RTF_PLAIN [14]  28.95    26.97    25.71    24.76    0.7
BM3D [9]        29.25    27.32    25.98    25.09    0.9
EPLL [31]       29.38    27.44    26.17    25.22    38
CSR [10]        29.17    27.24    25.91    24.99    124
LSSC [18]       29.40    27.39    26.08    25.09    172
RTF_ALL [14]    29.67    27.72    26.43    25.51    1275

Table 1. Denoising results (PSNR) and comparisons on the BSDS500 benchmark. Comparison of FF with state-of-the-art denoising algorithms. The last column reports average running times (in seconds) for a single image ($241 \times 161$). All experiments run on a 4-core Intel Xeon machine (2.4GHz).
also demonstrates that FFs can handle noise types other than additive noise. In this particular application, collaborative Wiener filtering (FF_W) did not significantly improve the accuracy. Qualitative examples are shown in Fig. 5.
Method      FF      FF_W    Wiener   Bilateral   LMS     BM3D [9]
PSNR (dB)   35.61   35.63   32.29    30.95       24.37   35.46

Table 2. Test set PSNR (dB) for the depth refinement experiment. We compare our method with other popular filters: Wiener, Bilateral, LMS, and BM3D [9].
3.3. 1D Dynamical Signal Filtering
Here, we demonstrate the application of FFs to a non-image modality. Consider a discrete dynamical system of the form:

$$\begin{aligned} x_k &= A_k x_{k-1} + B_k u_k + \omega_{k-1} \\ z_k &= C_k x_k + \nu_k \end{aligned} \tag{11}$$

where: $x_k \in \mathbb{R}^n$ is the internal state at time $k$ evolving from its previous state at time $k-1$; $A_k \in \mathbb{R}^{n \times n}$ is the transition matrix; $B_k \in \mathbb{R}^{n \times n}$ is the input matrix accounting for the contribution of the control vector $u_k$; $z_k \in \mathbb{R}^m$ is an observation of the current (noisy) state $x_k$; $C_k \in \mathbb{R}^{m \times n}$ is a transformation matrix that maps the state into the observations; and $\omega_{k-1} \in \mathbb{R}^n$ and $\nu_k \in \mathbb{R}^m$ are the zero-mean process and measurement noise with covariance matrices $Q_k \in \mathbb{R}^{n \times n}$ and $R_k \in \mathbb{R}^{m \times m}$ respectively. In this setting we consider a time series of $t$ observations $\mathbf{z}_k = (z_{k-t}, \ldots, z_k)$ with the aim of learning a filter $w$ such that $x_k = \mathbf{z}_k^\top w$. The model used for image denoising can thus be applied to this scenario without any modification.
As a practical example of Eq. 11, we refer to the model of an electrical DC motor, which is described by the following state-space representation

$$\begin{aligned} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_k &= \begin{bmatrix} 0 & 1 \\ 0 & -a \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_{k-1} + \begin{bmatrix} 0 \\ b \end{bmatrix} V_k + w_{k-1} \\ z_k &= \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_k + \nu_k \end{aligned} \tag{12}$$

where the parameters $a = \frac{1}{\tau}$ and $b = \frac{K}{\tau}$ are known from the motor's datasheet (see [15] for details). Here $x_1$ and $x_2$ are the position [rad] and the velocity [rad/s] of the motor respectively, and $V_k$ is the motor armature voltage [Volt]. Given a series of noisy observations $z_1, \ldots, z_k$ from the load encoder, we want to predict the current state of the system $x_k$, that is, the denoised position of the motor encoder [rad]. For the first experiment, the signal is corrupted with Gaussian noise (zero mean, variance $\sigma = 0.01$), and the proposed model is compared with a Kalman filter. In this setting, the Kalman filter is the optimal estimator if the noise is white and the covariances of the noise are known [29] (i.e. it minimizes
Figure 5. Qualitative results of depth refinement. From left to right: noisy input, BM3D, Wiener filter, bilateral filter, LMS filter, FF. FF produces visually better results.

Figure 6. Dynamical system filtering in presence of structured noise. We show error as the noise level $\sigma$ is varied. FFs systematically outperform LMS and the Kalman filter.
the mean square error of the estimated output).
We generate 100 sequences of 2000 samples for the training phase, and test on 100 new noisy sequences. For a noisy signal with 0.0097 rad MSE, the Kalman filter achieves an error of 0.0018 rad, and FFs come very close to the optimal filter with 0.00184 rad of error, without any assumption on the system or noise model. In Fig. 7 (left plot) we show a denoised sequence. Notably, a least mean square filter $w$ achieves an error of 0.0033 rad.
In a second experiment, we generate input dependent noise of the form $w_k = \nu x_k$, where $\nu$ is a random variable with zero mean and variances $\sigma \in \{0.01, 0.02, 0.05, 0.1, 0.2\}$. When the noise is structured, the Kalman filter is not guaranteed to be optimal. Here, FFs systematically outperform the Kalman and the LMS filters at every noise level. Quantitative results are reported in Fig. 6, and qualitative results in Fig. 7 (right).
4. Conclusion
We have proposed FF as an efficient, novel discriminative approach for signal filtering and restoration. FFs are a non-linear, data-adaptive extension of convolution-based filtering, but can be further extended to other filter types or non-linear functions. A new data-driven, regularized training objective allows us to adapt decision forests to the task of regressing continuous variables, while exploiting their spatio-temporal context.

Numerous experiments and quantitative comparisons show that FFs achieve accuracy on par with or better than recent state of the art algorithms, while being orders of magnitude faster. We believe FFs can readily be extended to other applications such as edge detection, sharpening, inpainting, superresolution and many more.
References
[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1997.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. PAMI, 2011.
[3] A. Blake, P. Kohli, and C. Rother. Markov Random Fields for Vision and Image Processing. The MIT Press, 2011.
Figure 7. Dynamical system filtering example. Kalman filter (black line) and an FF (cyan line) are used to estimate the encoder position of a DC electrical motor (green line) from noisy observations (red line). Left: white Gaussian noise is added. In this setting the Kalman filter represents the optimal estimator, and FF learns a very close approximation. Right: input dependent multiplicative noise is generated. In this scenario our more flexible approach outperforms the Kalman filter.
[4]
L. Breiman. Random forests. Machine Learning, 45(1), 2001.
1,3
[5]
A. Buades, B. Coll, and J. M. Morel. A non-local algorithm
for image denoising. In Proc. CVPR, 2005. 2
[6]
A. Criminisi, D. Robertson, E. Konukoglu, J. Shotton,
S. Pathak, S. White, and K. Siddiqui. Regression forests
for efficient anatomy detection and localization in computed
tomography scans. Medical Image Analysis, 2013. 1
[7]
A. Criminisi, T. Sharp, C. Rother, and P. Perez. Geodesic
image and video editing. Proc. ACM SIGGRAPH, 2011. 2
[8]
A. Criminisi and J. Shotton. Decision Forests for Computer
Vision and Medical Image Analysis. Springer, 2013. 1,3
[9]
K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image
denoising by sparse 3-D transform-domain collaborative
filtering. IEEE Trans. Image Processing, 2007. 2,5,6
[10]
W. Dong, X. Li, L. Zhang, and G. Shi. Sparsity-based image
denoising via dictionary learning and structural clustering. In
Proc. CVPR, 2011. 2,6
[11]
M. Elad and M. Aharon. Image denoising via sparse and
redundant representations over learned dictionaries. IEEE
Trans. Image Processing, 2006. 2
[12]
S. Haykin and B. Widrow. Least-Mean-Square Adaptive
Filters. Wiley, 2003. 2
[13]
S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe,
P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and
A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction
and interaction using a moving depth camera. In ACM UIST,
2011. 5
[14]
J. Jancsary, S. Nowozin, and C. Rother. Loss-specific training
of non-parametric image restoration models: a new state of
the art. In Proc. ECCV, 2012. 2,3,5,6
[15]
T. Kara and I. Eker. Nonlinear modeling and identification
of a DC motor for bidirectional operation with real time
experiments. Energy Conversion and Management, 2004. 6
[16]
P. Kontschieder, P. Kohli, J. Shotton, and A. Criminisi. GeoF:
Geodesic forests for learning coupled predictors. In Proc.
CVPR, 2013. 1,2
[17]
J. Lafferty, A. McCallum, and F. Pereira. Conditional ran-
dom fields: Probabilistic models for segmenting and labeling
sequence data. In Proc. ICML, 2001. 1
[18]
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman.
Non-local sparse models for image restoration. In Proc. ICCV,
2009. 2,6
[19]
S. Nowozin, C. Rother, S. Bagon, T. Sharp, B. Yao, and
P. Kohli. Decision tree fields. In ICCV, pages 1668–1675,
2011. 2
[20]
C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz.
Fast cost-volume filtering for visual correspondence and be-
yond. In CVPR, pages 3017–3024, 2011. 2
[21]
S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell. Learning
message-passing inference machines for structured prediction.
In CVPR, pages 2737–2744, 2011. 2
[22]
R. Shapovalov, D. Vetrov, and P. Kohli. Spatial inference
machines. In CVPR, pages 2985–2992, 2013. 2
[23]
T. Sharp. Implementing decision trees and forests on a GPU.
In Proc. ECCV. Springer, 2008. 1
[24]
J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook,
M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman,
and A. Blake. Efficient human pose estimation from single
depth images. IEEE Trans. PAMI, 2013. 1
[25]
J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and
A. Fitzgibbon. Scene coordinate regression forests for camera
relocalization in RGB-D images. In Proc. CVPR, 2013. 5
[26]
M. W. Tao, J. Bai, P. Kohli, and S. Paris. SimpleFlow: A
non-iterative, sublinear optical flow algorithm. Comput. Graph.
Forum, 31(2):345–353, 2012. 2
[27]
C. Tomasi. Bilateral filtering for gray and color images. In
Proc. ICCV, 1998. 1
[28]
J. Vitter. Random sampling with a reservoir. ACM Trans.
Math. Softw., 1985. 5
[29]
G. Welch and G. Bishop. An introduction to the Kalman filter.
Technical report, University of North Carolina at Chapel Hill,
1995. 6
[30]
N. Wiener. Extrapolation, Interpolation, and Smoothing of
Stationary Time Series. The MIT Press, 1964. 1
[31]
D. Zoran and Y. Weiss. From learning models of natural
image patches to whole image restoration. In Proc. ICCV,
2011. 6