LiteFlowNet3: Resolving Correspondence
Ambiguity for More Accurate Optical Flow
Estimation
Tak-Wai Hui¹ [0000-0002-1441-9289] and Chen Change Loy² [0000-0001-5345-1591]
¹ The Chinese University of Hong Kong
² Nanyang Technological University
https://github.com/twhui/LiteFlowNet3
twhui@ie.cuhk.edu.hk, ccloy@ntu.edu.sg
Abstract. Deep learning approaches have achieved great success in ad-
dressing the problem of optical flow estimation. The keys to success lie
in the use of cost volume and coarse-to-fine flow inference. However, the
matching problem becomes ill-posed when partially occluded or homo-
geneous regions exist in images. This causes a cost volume to contain
outliers and affects the flow decoding from it. Besides, the coarse-to-
fine flow inference demands an accurate flow initialization. Ambiguous
correspondence yields erroneous flow fields and affects the flow infer-
ences in subsequent levels. In this paper, we introduce LiteFlowNet3, a
deep network consisting of two specialized modules, to address the above
challenges. (1) We ameliorate the issue of outliers in the cost volume by
amending each cost vector through an adaptive modulation prior to the
flow decoding. (2) We further improve the flow accuracy by exploring lo-
cal flow consistency. To this end, each inaccurate optical flow is replaced
with an accurate one from a nearby position through a novel warping
of the flow field. LiteFlowNet3 not only achieves promising results on
public benchmarks but also has a small model size and a fast runtime.
1 Introduction
Optical flow estimation is a classical problem in computer vision. It is widely used
in many applications such as motion tracking, action recognition, video segmen-
tation, 3D reconstruction, and more. With the advancement of deep learning,
many research works have attempted to address the problem by using convolu-
tional neural networks (CNNs) [10,11,12,13,18,27,28,31,32]. Most of these
CNNs are 2-frame methods that infer a flow field from an image pair.
In particular, LiteFlowNet [10] and PWC-Net [27] are the first CNNs to propose
using feature warping and a cost volume at multiple pyramid levels in a coarse-
to-fine estimation. This greatly reduces the number of model parameters, from
160M in FlowNet2 [13] to 5.37M in LiteFlowNet and 8.75M in PWC-Net, while
accurate flow estimation is still maintained.
One of the keys to success for the lightweight optical flow CNNs is the use of
cost volume for establishing correspondence at each pyramid level. However, a
cost volume is easily corrupted by ambiguous feature matching [17,26,30]. This
causes flow fields that are decoded from the cost volume to become unreliable.
The underlying reasons for the existence of ambiguous matching are twofold.
First, given a pair of images, a feature point in the first image cannot find its
corresponding point in the second image when the latter is occluded. Second,
ambiguous correspondence is inevitable in homogeneous regions (e.g., shadows,
sky, and walls) of images. Another key to success for optical flow CNNs is to
infer flow fields in a coarse-to-fine framework. However, this approach strongly
depends on an accurate flow initialization from the preceding
pyramid level. Once ambiguous correspondence exists, erroneous optical flow is
generated and propagates to subsequent levels.
To address the aforementioned challenges, we attempt to make correspon-
dence across images less ambiguous and in turn improve the accuracy of optical
flow CNNs by introducing the following specialized CNN modules:
Cost Volume Modulation. Ambiguous feature matching causes outliers to
exist in a cost volume. Inaccurate cost vectors need to be amended to allow
correct flow decoding. To deal with occlusions, earlier work improves the match-
ing process by using offset-centered matching windows [17]. In other work, a cost
volume is filtered to remove outliers prior to the correspondence decoding [26,30]. However,
existing optical flow CNNs [11,12,13,18,28,32,31] infer optical flow from a cost
volume using convolutions without explicitly addressing the issue of outliers. We
propose to amend each cost vector in the cost volume by using an adaptive affine
transformation. A confidence map that pinpoints the locations of unreliable flow
is used to facilitate the generation of transformation parameters.
Flow Field Deformation. When the correspondence problem becomes ill-
posed, it is very difficult to find correct matching pairs. Local flow consistency
and co-occurrence between flow boundaries and intensity edges are commonly
used as the clues to regularize flow fields in conventional methods [29,33]. The
two principles are also adopted in recent optical flow CNNs [10,11,12]. We pro-
pose a novel technique to further improve the flow accuracy by using the clue
from local flow consistency. Intuitively, we replace each inaccurate optical flow
with an accurate one from a nearby position having similar feature vectors. The
replacement is achieved by a meta-warping of the flow field in accordance with
a computed displacement field (similar to optical flow but the displacement field
no longer represents correspondence). We compute the displacement field by
using a confidence-guided decoding from an auto-correlation cost volume.
In this work, we make the first attempt to use cost volume modulation and
flow field deformation in optical flow CNNs. We extend our previous work (Lite-
FlowNet2 [11]) by incorporating the proposed modules for addressing the afore-
mentioned challenges. LiteFlowNet3 achieves promising performance among 2-
frame methods. It outperforms VCN-small [31], IRR-PWC [12], PWC-Net+ [28],
and LiteFlowNet2 on Sintel and KITTI. Even though SelFlow [18] (a multi-
frame method) and HD³ [32] use extra training data, LiteFlowNet3 outperforms
SelFlow on Sintel clean and KITTI, and it performs better than HD³ on Sintel,
KITTI 2012, and KITTI 2015 (in the foreground region). LiteFlowNet3 does not
suffer from the artifact problem on real-world images that HD³ exhibits, while being
7.7 times smaller in model size and 2.2 times faster in runtime.
2 Related Work
Variational Approach. Since the pioneering work of Horn and Schunck [8], the
variational approach has been widely studied for optical flow estimation. Brox
et al. address the problem of illumination change across images by introducing
the gradient constancy assumption [3]. Brox et al. [3] and Papenberg et al. [23]
propose the use of image warping in minimizing an energy functional. Bailer et
al. propose Flow Fields [1], which is a searching-based method. Optical flow is
computed by a numerical optimization with multiple propagations and random
searches. In EpicFlow [25], Revaud et al. use sparse flows as an initialization
and then interpolate them to a dense flow field by fitting a local affine model
at each pixel based on nearby matches. The affine parameters are computed as
the least-squares solution of an over-determined system. Unlike EpicFlow, we use
an adaptive affine transformation to amend a cost volume. The transformation
parameters are implicitly generated in the CNN instead.
Cost Volume Approach. Kang et al. address the problem of ambiguous
matching by using offset-centered windows and dynamically selecting a subset of
neighboring image frames to perform matching [17]. Rhemann et al. propose
to filter a cost volume using an edge-preserving filter [26]. In DCFlow [30], Xu
et al. exploit regularity in a cost volume and improve the optical flow accu-
racy by adapting semi-global matching. Inspired by the way these conventional
methods improve the cost volume, we propose to modulate each
cost vector in the cost volume by using an affine transformation prior to the
flow decoding. The transformation parameters are adaptively constructed to
suit different cost vectors. In particular, DCFlow combines the interpolation in
EpicFlow [25] with a complementary scheme to convert a sparse correspondence
to a dense one. In contrast, LiteFlowNet3 applies an affine transformation
to all elements in the cost volume rather than to a sparse correspondence.
Unsupervised and Self-supervised Optical Flow Estimation. To avoid
annotating labels, Meister et al. propose a framework that uses the difference
between synthesized and real images for unsupervised training [21]. Liu et al.
propose SelFlow that distills reliable flow estimations from non-occluded pixels
in a large dataset using self-supervised training [18]. It also uses multiple frames
and fine-tunes the self-supervised model in supervised training for improving the
flow accuracy further. Unlike the above works, we focus on supervised learning.
Even though LiteFlowNet3 is a 2-frame method and trained on a much smaller
dataset, it still outperforms SelFlow on Sintel clean and KITTI.
Supervised Learning of Optical Flow. Dosovitskiy et al. develop FlowNet [6],
the first optical flow CNN. Mayer et al. extend FlowNet to estimate disparity
and scene flow [20]. In FlowNet2 [13], Ilg et al. improve the flow accuracy of
FlowNet by cascading several variants of it. However, the model size is increased
to over 160M parameters and it also demands a high computation time. Ranjan
et al. develop a compact network SPyNet [24], but the accuracy is not compa-
rable to FlowNet2. Our LiteFlowNet [10], which consists of the cascaded flow
inference and flow regularization, has a small model size (5.37M) and compara-
ble performance to FlowNet2. We then develop LiteFlowNet2 for higher flow
accuracy and a faster runtime [11]. LiteFlowNet3 is built upon LiteFlowNet2
with the incorporation of cost volume modulation and flow field deformation
for improving the flow accuracy further. A concurrent work to LiteFlowNet is
PWC-Net [27], which also proposes using feature warping and a cost volume, as in
LiteFlowNet. Sun et al. then develop PWC-Net+ by improving the training
protocol [28]. Ilg et al. extend FlowNet2 to FlowNet3 with the joint learning of
occlusion and optical flow [14]. In Devon [19], Lu et al. perform feature match-
ing that is governed by an external flow field. In contrast, our displacement
field is used to deform optical flow rather than to facilitate feature matching. Hur et
al. propose IRR-PWC [12], which improves PWC-Net by adopting the flow reg-
ularization from LiteFlowNet as well as introducing the occlusion decoder and
weight sharing. Yin et al. introduce HD³ for learning a probabilistic pixel cor-
respondence [32], but it requires pre-training on ImageNet. In contrast, LiteFlowNet3
learns a flow confidence implicitly rather than computing it from a probabilistic
estimation. Although HD³ uses extra training data and 7.7 times more parameters,
LiteFlowNet3 outperforms HD³ on Sintel, KITTI 2012, and KITTI 2015 (in the
foreground region). LiteFlowNet3 also outperforms VCN-small [31] even though
their model sizes are similar. Compared with deformable convolution [5], we perform
deformation on flow fields but not on feature maps. Our deformation aims to
replace each inaccurate optical flow with an accurate one from a nearby position
in the flow field, while deformable convolution aims to augment spatial sampling.
3 LiteFlowNet3
Feature matching becomes ill-posed in homogeneous and partially occluded re-
gions as one-to-multiple correspondence occurs for the first case while one-to-
none correspondence occurs for the second case. Duplication of image structures
(the so-called "ghosting effect") is inevitable whenever warping is applied to im-
ages [15]. The same also applies to feature maps. In coarse-to-fine estimation,
erroneous optical flow resulting from the preceding level affects the subsequent
flow inferences. To address the above challenges, we develop two specialized CNN
modules: Cost volume Modulation (CM) and Flow field Deformation (FD). We
demonstrate the applicability of the modules on LiteFlowNet2 [11]. The resulting
network is named LiteFlowNet3. Figure 1 illustrates a simplified overview of
the network architecture. FD is used to refine the previous flow estimate before
it is used as a flow initialization in the current pyramid level. In flow inference,
the cost volume is amended by CM prior to the flow decoding.
3.1 Preliminaries
We first provide a concise description on the construction of cost volume in
optical flow CNNs. Suppose a pair of images I1(at time t= 1) and I2(at time
Fig. 1: (a) A simplified overview of LiteFlowNet3. Flow field deformation (FD)
and cost volume modulation (CM) together with confidence maps are incorpo-
rated into LiteFlowNet3. For ease of presentation, only a 2-level encoder-
decoder structure is shown. The proposed modules are applicable to other levels
but not limited to level 1. (b) The optical flow inference in LiteFlowNet2 [11].
t = 2) is given. We convert I1 and I2 respectively into pyramidal feature maps
F1 and F2 through a feature encoder. We denote x as a point in the rectangular
domain of R². Correspondence between I1 and I2 is established by computing
the dot product between two high-level feature vectors in the individual feature
maps F1 and F2 as follows [6]:

c(x; D) = F1(x) · F2(x') / N,    (1)

where D is the maximum matching radius, c(x; D) (a 3D column vector with
length 2D + 1) is the collection of matching costs between feature vectors F1(x)
and F2(x') for all possible x' such that ||x − x'||∞ ≤ D, and N is the length of
the feature vector. Cost volume C is constructed by aggregating all c(x; D) into
a 3D grid. Flow decoding is then performed on C using convolutions (or the naive
winner-takes-all approach [17]). The resulting flow field u, with u(x) ∈ R², provides
the dense correspondence from I1 to I2. In the following, we omit the variable D
that indicates the maximum matching radius for brevity and use c(x) to represent
the cost vector at x. When we discuss operations at one pyramid level, the same
operations are applicable to the other levels.
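To make Eq. (1) concrete, the sketch below builds a cost volume by shifting F2 over a (2D + 1) × (2D + 1) search window and taking feature dot products normalized by the feature length N. It is a minimal PyTorch illustration only, not the authors' implementation; the tensor layout (batch, channel, height, width), the zero padding, and the channel ordering of candidate displacements are assumptions.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat1, feat2, max_disp):
    """Sketch of Eq. (1): matching costs between feat1 and feat2 within radius D.

    feat1, feat2: feature maps of shape (B, N, H, W), N = feature length.
    max_disp:     maximum matching radius D.
    Returns a cost volume of shape (B, (2*D+1)**2, H, W), one channel per
    candidate displacement x' - x in the search window (a sketch choice).
    """
    b, n, h, w = feat1.shape
    d = max_disp
    # Zero-pad feat2 so that every shift inside the search window is defined.
    feat2_pad = F.pad(feat2, (d, d, d, d))
    costs = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = feat2_pad[:, :, dy:dy + h, dx:dx + w]
            # Dot product between feature vectors, normalized by N.
            costs.append((feat1 * shifted).sum(dim=1, keepdim=True) / n)
    return torch.cat(costs, dim=1)
```

The same routine, called with the first feature map twice, gives the auto-correlation cost volume used later in Section 3.3.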
3.2 Cost Volume Modulation
Given a pair of images, the existence of partial occlusion and homogeneous re-
gions makes the establishment of correspondence very challenging. This situation
also occurs in feature space because simply transforming images into feature
maps does not resolve the correspondence ambiguity. As a result, a cost volume
is corrupted and the subsequent flow decoding is seriously affected. Conventional
methods [26,30] address the above problem by filtering a cost volume prior to
the decoding. However, no existing work addresses this problem for optical flow
CNNs. Some studies [2,10,12] have revealed that applying
Fig. 2: (a) Modulation tensors (α, β) are adaptively constructed for each cost
volume. (b) Cost volume modulation is integrated into the flow inference. Instead
of leaving cost volume C unaltered (via the dashed arrow), it is amended to
Cm by using the adaptive modulation prior to the flow decoding. Note: "conv"
denotes several convolution layers.
feature-driven convolutions on feature space is an effective approach to influence
the feed-forward behavior of a network since the filter weights are adaptively
constructed. Therefore, we propose to filter outliers in a cost volume by using an
adaptive modulation. We will show that our modulation approach is not only
effective in improving the flow accuracy but also parameter-efficient.
An overview of cost volume modulation is illustrated in Fig. 2b. At a pyramid
level, each cost vector c(x) in cost volume C is adaptively modulated by an affine
transformation (α(x), β(x)) as follows:

cm(x) = α(x) ⊙ c(x) ⊕ β(x),    (2)

where cm(x) is the modulated cost vector, and "⊙" and "⊕" denote element-wise
multiplication and addition, respectively. The dimension of the modulated cost
volume is the same as that of the original one. This property allows cost volume
modulation to be jointly used and trained with an existing network without major
changes made to the original network architecture.
To have an efficient computation, the affine parameters {α(x), β(x)}, ∀x,
are generated altogether in the form of modulation tensors (α, β) having the same
dimension as C. As shown in Fig. 2a, we use cost volume C, feature F1 from
the encoder, and confidence map M at the same pyramid level as the inputs
to the modulation parameter generator. The confidence map is introduced to
facilitate the generation of modulation parameters. Specifically, M(x) pinpoints
the probability of having an accurate optical flow at x in the associated flow field.
The confidence map is constructed by introducing an additional output in the
preceding optical flow decoder. A sigmoid function is used to constrain its values
to [0, 1]. We train the confidence map using an L2 loss with the ground-truth label
Mgt(x) as follows:

Mgt(x) = exp(−||ugt(x) − u(x)||₂),    (3)

where ugt(x) is the ground truth of u(x). An example of predicted confidence
maps will be provided in Section 4.2.
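As a rough illustration of how Eqs. (2) and (3) could be realized, the sketch below uses a small stack of 3×3 convolutions to generate the modulation tensors (α, β) from the cost volume, the encoder feature F1, and the confidence map M, and applies them element-wise to the cost volume. The layer count, channel widths, and input concatenation are assumptions for illustration and do not reproduce the exact LiteFlowNet3 layers.

```python
import torch
import torch.nn as nn

class CostVolumeModulation(nn.Module):
    """Sketch of Eq. (2): cm(x) = alpha(x) * c(x) + beta(x), element-wise."""

    def __init__(self, cost_ch, feat_ch, hidden_ch=128):
        super().__init__()
        # Modulation parameter generator (hypothetical widths and depth).
        self.generator = nn.Sequential(
            nn.Conv2d(cost_ch + feat_ch + 1, hidden_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, 2 * cost_ch, 3, padding=1),
        )

    def forward(self, cost, feat1, confidence):
        # confidence: M with values in [0, 1], one channel, from the previous decoder.
        params = self.generator(torch.cat([cost, feat1, confidence], dim=1))
        alpha, beta = params.chunk(2, dim=1)
        return alpha * cost + beta  # modulated cost volume, same shape as the input


def confidence_label(flow_gt, flow):
    """Ground-truth confidence of Eq. (3): Mgt(x) = exp(-||ugt(x) - u(x)||)."""
    return torch.exp(-torch.norm(flow_gt - flow, dim=1, keepdim=True))
```

Because the modulated cost volume keeps the shape of the original one, such a block can be placed in front of an existing flow decoder without changing the rest of the network.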
Discussion. In the literature, there are two major approaches to infer a flow
field from a cost volume as shown in Fig. 3. The first approach (Fig. 3a) is
(a) NO: LiteFlowNet2 [11] (b) FF: PWC-Net+ [28] (c) Ours: LiteFlowNet3
Fig. 3: Augmenting a cost volume under different configurations. (a) NO, (b) FF,
and (c) Our solution: Cost volume C is modulated to Cm by using an adaptive
affine transformation prior to the flow decoding. Note: "corr" and "mod" denote
correlation and modulation, respectively. Correlation is performed on F1 and
the warped F2 (i.e., F̃2).
Table 1: Average end-point error (AEE) and model size of different models
trained on FlyingChairs under different augmentations of cost volume.
Augmentations | NO | FF | Ours
Features of I1 | ✗ | ✓ | ✗
Flow field | ✗ | ✓ | ✗
Modulation | ✗ | ✗ | ✓
Number of model parameters (M) | 6.42 | 7.16 | 7.18
Sintel Clean (training set) | 2.71 | 2.70 | 2.65
Sintel Final (training set) | 4.14 | 4.20 | 4.02
KITTI 2012 (training set) | 4.20 | 4.28 | 3.95
KITTI 2015 (training set) | 11.12 | 11.30 | 10.65
to perform flow decoding directly on the cost volume without any augmenta-
tion [10,11]. This is similar to the conventional winner-takes-all approach [17],
except that convolutions are used to yield flow fields rather than the argument of the
minimum. The second approach (Fig. 3b) feed-forwards the pyramidal features
F1 from the feature encoder [27,28]. It also feed-forwards the 2× upsampled flow
field and the decoder features from the previous flow decoder (at level k − 1).
Flow decoding is then performed on the concatenation. Our approach (Fig. 3c)
is to perform modulation on the cost volume prior to the flow decoding. The ef-
fectiveness of the above approaches has not been studied in the literature. Here,
we use LiteFlowNet2 [11] as the backbone architecture and train all the models
from scratch on the FlyingChairs dataset [6]. Table 1 summarizes the results of our
evaluation. Even though FF needs 11.5% more model parameters than NO, it
attains lower flow accuracy. In contrast, our modulation approach, which has
just 0.28% more parameters than FF, outperforms the compared methods on all
Fig. 4: Replacing an inaccurate optical flow u(x1) with an accurate optical flow
u(x2) through a meta-warping governed by displacement d(x1).
the benchmarks, especially KITTI 2012 and KITTI 2015. This indicates that a
large CNN model does not always perform better than a smaller one.
3.3 Flow Field Deformation
In coarse-to-fine flow estimation, a flow estimate from the preceding decoder is
used as a flow initialization for the subsequent decoder. This highly demands the
previous estimate to be accurate. Otherwise, erroneous optical flow is propagated
to subsequent levels and affects the flow inference. Using cost volume modulation
alone is not able to address this problem. We explore local flow consistency [29,33]
and propose to use a meta-warping for improving the flow accuracy.
Intuitively, we refine a given flow field by replacing each inaccurate optical
flow with an accurate one from a nearby position using the principle of local flow
consistency. As shown in Fig. 4, suppose an optical flow u(x1) is inaccurate. With
some prior knowledge, 1) a nearby optical flow u(x2) such that x2 = x1 + d(x1)
is known to be accurate, as indicated by a confidence map; 2) the pyramidal
features of I1 at x1 and x2 are similar, i.e., F1(x1) ∼ F1(x2), as indicated by
an auto-correlation cost volume. Since image points that have similar feature
vectors have similar optical flow within a neighborhood, we replace u(x1) with a
clone of u(x2).
The previous analysis is just for a single flow vector. To cover the whole flow
field, we need to find a displacement vector for every position in the flow field. In
other words, we need to have a displacement field for guiding the meta-warping
of the flow field. We use a warping mechanism that is similar to image warping [13]
and feature warping [10,27]. The differences are that our meta-warping is limited to
two channels and the physical meaning of the introduced displacement field no
longer represents correspondence across images.
An overview of flow field deformation is illustrated in Fig. 5b. At a pyramid
level, we replace u(x) with a neighboring optical flow by warping u(x) in
accordance with the computed displacement d(x) as follows:

ud(x) = u(x + d(x)).    (4)

In particular, not every optical flow needs an amendment. Suppose u(x0) is very
accurate; then no flow warping is required, i.e., d(x0) ≈ 0.
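A minimal sketch of the meta-warping in Eq. (4) is shown below, using bilinear sampling of the flow field at the displaced positions. Treating channel 0 of the displacement as the horizontal component, the [-1, 1] grid normalization required by grid_sample, and the border padding mode are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def deform_flow(flow, disp):
    """Sketch of Eq. (4): u_d(x) = u(x + d(x)).

    flow: flow field u of shape (B, 2, H, W).
    disp: displacement field d of shape (B, 2, H, W); d ~ 0 where u is already accurate.
    """
    b, _, h, w = flow.shape
    # Base pixel coordinates x (assuming channel 0 = horizontal, channel 1 = vertical).
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + disp[:, 0]  # x + d_x(x)
    grid_y = ys.unsqueeze(0) + disp[:, 1]  # y + d_y(x)
    # Normalize sampling locations to [-1, 1] as required by grid_sample.
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(flow, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

Where d(x) ≈ 0, the sampling reduces to the identity and the original flow value is kept.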
To generate the displacement field, the location of image point having similar
feature as the targeted image point needs to be found. This is accomplished by
Fig. 5: (a) Displacement field d is constructed according to auto-correlation cost
volume Ca and confidence map M. (b) Flow field u is warped to ud in accordance
with d. Flow deformation is performed before u is used as an initialization for the
flow inference. Note: "conv" denotes several convolution layers.
decoding from an auto-correlation cost volume. The procedure is similar to flow
decoding from a normal cost volume [6]. As shown in Fig. 5a, we first measure
the feature similarity of the targeted point at x and its surrounding points at x'
by computing the auto-correlation cost vector ca(x; D) between features F1(x) and
F1(x') as follows:

ca(x; D) = F1(x) · F1(x') / N,    (5)

where D is the maximum matching radius, x and x' are constrained by
||x − x'||∞ ≤ D, and N is the length of the feature vector. The above equation is
identical to Eq. (1) except that it uses features from I1 only. Auto-correlation cost
volume Ca is then built by aggregating all cost vectors into a 3D grid.
To avoid a trivial solution, confidence map M associated with flow field u, which
is constructed by the preceding flow decoder (the same as the one presented in Sec-
tion 3.2), is used to guide the decoding of displacement from Ca. As shown in
Fig. 5a, we use cost volume Ca for the auto-correlation of F1 and confidence
map M at the same pyramid level as the inputs to the displacement field genera-
tor. Rather than decoding flow from the cost volume as in normal descriptor
matching [6], our displacement decoding is performed on the auto-correlation
cost volume and is guided by the confidence map.
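The displacement generator itself can be sketched in the same spirit: the auto-correlation cost volume Ca (Eq. (5), e.g., build_cost_volume(F1, F1, D) from the earlier sketch) is concatenated with the confidence map M and decoded into a two-channel displacement field by a few convolutions. The decoder depth and widths below are assumptions, not the published layer configuration.

```python
import torch
import torch.nn as nn

class DisplacementDecoder(nn.Module):
    """Sketch of confidence-guided displacement decoding from an auto-correlation cost volume."""

    def __init__(self, cost_ch, hidden_ch=128):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(cost_ch + 1, hidden_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden_ch, 2, 3, padding=1),  # two-channel displacement field d
        )

    def forward(self, auto_cost, confidence):
        # Where M marks the flow as reliable, the decoder can learn to output d ~ 0,
        # so no warping is applied there.
        return self.decoder(torch.cat([auto_cost, confidence], dim=1))
```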
4 Experiments
Network Details. LiteFlowNet3 is built upon LiteFlowNet2 [11]. Flow infer-
ence is performed from levels 6 to 3 (and 2) with the given image resolution as
level 1. Flow field deformation is applied prior to the cascaded flow inference
while cost volume modulation is applied in the descriptor matching unit. We do
not apply the modules to level 6 as no significant improvement on flow accuracy
can be observed (and to level 2 due to the large computational load). Each module
uses four 3×3 convolution layers, each followed by a leaky rectified linear unit,
except that a 5×5 filter is used in the last layer at levels 4 and 3. Confidence of flow prediction
is implicitly generated by introducing an additional convolution layer in a flow
decoder. Weight sharing can be applied across the flow decoders and the proposed
modules; this variant is denoted by the suffix "S".
Training Details. For a fair comparison, we use the same training sets as other
optical flow CNNs in the literature [6,10,11,12,13,19,24,27,28,31]. We use the
same training protocol (including data augmentation and batch size) as Lite-
FlowNet2 [11]. We first train LiteFlowNet2 on FlyingChairs dataset [6] using
the stage-wise training procedure [11]. We then integrate the brand new modules,
cost volume modulation and flow field deformation, into LiteFlowNet2 to form
LiteFlowNet3. The newly introduced CNN modules are trained with a learning
rate of 1e-4 while the other components are trained with a reduced learning rate
of 2e-5 for 300K iterations. We then fine-tune the whole network on FlyingTh-
ings3D [20] with a learning rate of 5e-6 for 500K iterations. Finally, we fine-tune
LiteFlowNet3 respectively on a mixture of Sintel [4] and KITTI [22], and on the
KITTI training sets, with a learning rate of 5e-5 for 600K iterations. The two models
are also re-trained with reduced learning rates and iterations, the same as for LiteFlowNet2.
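For reference, the stage-wise schedule described above can be summarized as a small configuration sketch; the values are transcribed from the text, while batch sizes, data augmentation, and the reduced re-training schedule follow LiteFlowNet2 [11] and are not restated here.

```python
# Stage-wise training schedule as described in the text (informal sketch, not an official config).
TRAINING_STAGES = [
    {"data": "FlyingChairs",   "lr_new_modules": 1e-4, "lr_rest": 2e-5, "iterations": 300_000},
    {"data": "FlyingThings3D", "lr": 5e-6, "iterations": 500_000},
    {"data": "Sintel + KITTI (or KITTI only)", "lr": 5e-5, "iterations": 600_000},
]
```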
4.1 Results
We evaluate LiteFlowNet3 on the popular optical flow benchmarks including
Sintel clean and final passes [4], KITTI 2012 [7], and KITTI 2015 [22]. We
report average end-point error (AEE) for all the benchmarks unless otherwise
explicitly specified. More results are available in the supplementary material [9].
Preliminary Discussion. The majority of optical flow CNNs including Lite-
FlowNet3 are 2-frame methods and use the same datasets for training. However,
HD³ [32] is pre-trained on ImageNet (>10M images). SelFlow [18] uses the Sintel
movie (10K images) and the multi-view extensions of KITTI (>20K images) for
self-supervised training. SENSE [16] uses the SceneFlow dataset [20] (>39K im-
ages) for pre-training. SelFlow also uses more than two frames to boost
the flow accuracy. Therefore, their evaluations are not directly comparable to
the majority of the optical flow CNNs in the literature.
Quantitative Results. Table 2 summarizes the AEE results of LiteFlowNet3
and the state-of-the-art methods on the public benchmarks. With the exception
of HD³ [32], SelFlow [18], and SENSE [16], all the compared CNN models are
trained on the same datasets and are 2-frame methods. Thanks to the cost
volume modulation and flow field deformation, LiteFlowNet3 outperforms these
CNN models, including the recent state-of-the-art methods IRR-PWC [12] and
VCN-small [31], on both Sintel and KITTI benchmarks. Although the recent state-
of-the-art methods HD³ and SelFlow (a multi-frame method) use extra training
data, LiteFlowNet3 outperforms HD³ on Sintel, KITTI 2012, and KITTI 2015
(Fl-fg). Our model also performs better than SelFlow on Sintel clean and KITTI.
It should be noted that LiteFlowNet3 has a smaller model size and a faster run-
time than HD³ and VCN [31] (a larger variant of VCN-small). We also perform
evaluation by dividing AEE into matched and unmatched regions (error over
regions that are visible in adjacent frames or only in one of two adjacent frames,
respectively). As revealed in Table 3, LiteFlowNet3 achieves the best results on
both matched and unmatched regions. In particular, there is a large improve-
ment on unmatched regions compared with LiteFlowNet2. This indicates that the
proposed modules are effective in addressing correspondence ambiguity.
Table 2: AEE results on the public benchmarks. (Notes: The values in parenthe-
ses are the results of the networks on the data they were trained on, and hence
are not directly comparable to the others. The best in each category is in bold
and the second best is underlined. For KITTI 2012, "All" (or "Noc") represents
the average end-point error in total (or in non-occluded areas). For KITTI 2015,
"Fl-all" (or "Fl-fg") represents the percentage of outliers averaged over all (or fore-
ground) pixels. Inliers are defined as end-point error <3 pixels or <5%. †: Using
additional training sets. ‡: A multi-frame method.)
Method | Sintel Clean (train / test) | Sintel Final (train / test) | KITTI 2012 (train / test All / test Noc) | KITTI 2015 (train / train Fl-all / test Fl-fg / test Fl-all)
FlowNetS [6] | (3.66) / 6.96 | (4.44) / 7.76 | 7.52 / 9.1 / - | - / - / - / -
FlowNetC [6] | (3.78) / 6.85 | (5.28) / 8.51 | 8.79 / - / - | - / - / - / -
FlowNet2 [13] | (1.45) / 4.16 | (2.19) / 5.74 | (1.43) / 1.8 / 1.0 | (2.36) / (8.88%) / 8.75% / 11.48%
FlowNet3 [14] | (1.47) / 4.35 | (2.12) / 5.67 | (1.19) / - / - | (1.79) / - / - / 8.60%
SPyNet [24] | (3.17) / 6.64 | (4.32) / 8.36 | 3.36 / 4.1 / 2.0 | - / - / 43.62% / 35.07%
Devon [19] | - / 4.34 | - / 6.35 | - / 2.6 / 1.3 | - / - / 19.49% / 14.31%
PWC-Net [27] | (2.02) / 4.39 | (2.08) / 5.04 | (1.45) / 1.7 / 0.9 | (2.16) / (9.80%) / 9.31% / 9.60%
PWC-Net+ [28] | (1.71) / 3.45 | (2.34) / 4.60 | (0.99) / 1.4 / 0.8 | (1.47) / (7.59%) / 7.88% / 7.72%
IRR-PWC [12] | (1.92) / 3.84 | (2.51) / 4.58 | - / 1.6 / 0.9 | (1.63) / (5.32%) / 7.52% / 7.65%
SENSE [16]† | (1.54) / 3.60 | (2.05) / 4.86 | (1.18) / 1.5 / - | (2.05) / (9.69%) / 9.33% / 8.16%
HD³ [32]† | (1.70) / 4.79 | (1.17) / 4.67 | (0.81) / 1.4 / 0.7 | (1.31) / (4.10%) / 9.02% / 6.55%
SelFlow [18]†,‡ | (1.68) / 3.75 | (1.77) / 4.26 | (0.76) / 1.5 / 0.9 | (1.18) / - / 12.48% / 8.42%
VCN-small [31] | (1.84) / 3.26 | (2.44) / 4.73 | - / - / - | (1.41) / (5.5%) / - / 7.74%
LiteFlowNet [10] | (1.35) / 4.54 | (1.78) / 5.38 | (1.05) / 1.6 / 0.8 | (1.62) / (5.58%) / 7.99% / 9.38%
LiteFlowNet2 [11] | (1.30) / 3.48 | (1.62) / 4.69 | (0.95) / 1.4 / 0.7 | (1.33) / (4.32%) / 7.64% / 7.62%
LiteFlowNet3 | (1.32) / 2.99 | (1.76) / 4.45 | (0.91) / 1.3 / 0.7 | (1.26) / (3.82%) / 7.75% / 7.34%
LiteFlowNet3-S | (1.43) / 3.03 | (1.90) / 4.53 | (0.94) / 1.3 / 0.7 | (1.39) / (4.35%) / 6.96% / 7.22%
Qualitative Results. Examples of optical flow predictions on Sintel and KITTI
are shown in Figs. 6 and 7, respectively. AEE evaluated on the respective training
sets is also provided. For Sintel, the flow fields resulting from LiteFlowNet3
contain fewer artifacts compared with the other state-of-the-art methods.
As shown in the second row of Fig. 7, a portion of the optical flow over the road
fence cannot be recovered by LiteFlowNet2 [11]. In contrast, it is fully recovered
by HD³ [32] and LiteFlowNet3. Flow bleeding is observed over the road signs
for LiteFlowNet2, as illustrated in the third and fourth rows of Fig. 7, while HD³
and LiteFlowNet3 do not have such a problem. Although HD³ is pre-trained on
ImageNet and uses 7.7 times more model parameters than LiteFlowNet3, there
are serious artifacts in its generated flow fields, as shown in the second column of
Fig. 7. The above observations suggest that LiteFlowNet3, incorporating the cost
volume modulation and flow field deformation, is effective in generating optical
flow with high accuracy and fewer artifacts.
Runtime and Model Size. We measure runtime using a Sintel image pair
(1024 × 436) on a machine equipped with an Intel Xeon E5 2.2 GHz CPU and an
NVIDIA GTX 1080 GPU. Timing is averaged over 100 runs. LiteFlowNet3 needs
59 ms for computation and has 5.2M parameters. When weight sharing is not used,
the model size is 7.5M. The runtimes of the state-of-the-art 2-frame methods
HD³ [32] and IRR-PWC [12] are 128 ms and 180 ms, respectively, while HD³ and
IRR-PWC have 39.9M and 6.4M parameters, respectively.
Table 3: AEE results on the testing sets of Sintel. (Note: †: Using additional
training sets.)

Models | All (Clean / Final) | Matched (Clean / Final) | Unmatched (Clean / Final)
FlowNet2 [13] | 4.16 / 5.74 | 1.56 / 2.75 | 25.40 / 30.11
Devon [19] | 4.34 / 6.35 | 1.74 / 3.23 | 25.58 / 31.78
PWC-Net+ [28] | 3.45 / 4.60 | 1.41 / 2.25 | 20.12 / 23.70
IRR-PWC [12] | 3.84 / 4.58 | 1.47 / 2.15 | 23.22 / 24.36
SENSE [16]† | 3.60 / 4.86 | 1.38 / 2.30 | 21.75 / 25.73
HD³ [32]† | 4.79 / 4.67 | 1.62 / 2.17 | 30.63 / 24.99
LiteFlowNet2 [11] | 3.48 / 4.69 | 1.33 / 2.25 | 20.64 / 24.57
LiteFlowNet3 | 2.99 / 4.45 | 1.15 / 2.09 | 18.08 / 23.68
Image overlay | PWC-Net+ [28] | HD³ [32] | LiteFlowNet2 [11] | LiteFlowNet3
Fig. 6: Examples of flow fields on the Sintel training set (Clean pass: first row, Final
pass: second row) and testing set (Clean pass: third row, Final pass: fourth row).
Image overlay | HD³ [32] | LiteFlowNet2 [11] | LiteFlowNet3
Fig. 7: Examples of flow fields on KITTI training set (2012: first row, 2015:
second row) and testing set (2012: third row, 2015: fourth row).
Table 4: AEE results of variants of LiteFlowNet3 having some of the components
disabled. (Note: The symbol "-" indicates that the confidence map is not being used.)
Settings | NO | CM- | CMFD- | CM | CMFD
Cost Volume Modulation | ✗ | ✓ | ✓ | ✓ | ✓
Flow Field Deformation | ✗ | ✗ | ✓ | ✗ | ✓
Confidence map | ✗ | ✗ | ✗ | ✓ | ✓
Sintel clean (training set) | 2.78 | 2.66 | 2.63 | 2.65 | 2.59
Sintel final (training set) | 4.14 | 4.09 | 4.06 | 4.02 | 3.91
KITTI 2012 (training set) | 4.11 | 4.02 | 4.06 | 3.95 | 3.88
KITTI 2015 (training set) | 11.31 | 11.01 | 10.97 | 10.65 | 10.40
Fig. 8: Examples of flow fields on Sintel Final (top two rows) and KITTI 2015
(bottom two rows) generated by different variants of LiteFlowNet3. Note: NO =
No proposed modules are used, CM = Cost Volume Modulation, CMFD = Cost
Volume Modulation and Flow Field Deformation, and the suffix “-” indicates
that confidence map is not being used.
4.2 Ablation Study
To study the role of each proposed component in LiteFlowNet3, we disable some
of the components and train the resulting variants on FlyingChairs. The evalua-
tion results on the public benchmarks are summarized in Table 4and examples
of flow fields are illustrated in Fig. 8.
Cost Volume Modulation and Flow Deformation. As revealed in Table 4,
when only cost volume modulation (CM) is incorporated into LiteFlowNet3, it
performs better than its counterpart (NO), which uses neither modulation nor de-
formation, on all the benchmarks, especially KITTI 2015. When both cost
volume modulation and flow field deformation (CMFD) are utilized, it outper-
forms the others and achieves a large improvement on KITTI 2015. Examples
of visual performance are demonstrated in Fig. 8. For Sintel, we can observe
a large discrepancy in flow color of the human arm between NO and ground
(a) Confidence map (b) Original flow field (c) Displacement (d) Deformed flow
Fig. 9: An example of flow field deformation. The darker a pixel in the confidence
map, the higher the chance that the associated optical flow is incorrect.
truth. In contrast, the flow color is close to the ground truth when CM and CMFD
are enabled. In particular, the green artifact is successfully removed in CMFD.
In the example of KITTI, the car's windshield and the triangle road sign in NO
are not completely filled with correct optical flow. The missing flow can be recovered
by CMFD but not by CM alone. This indicates that flow field deformation is more
effective in "hole filling" than cost volume modulation.
Confidence Map. Variants CM and CMFD, as revealed in Table 4, perform
better than their counterparts CM- and CMFD-, which have the confidence map
disabled. For the example of Sintel in Fig. 8, the green artifact is greatly reduced
when comparing CM- with CM. Optical flow of the human arm partially disappears
in CMFD-, while it is recovered in CMFD. The corresponding confidence map is
illustrated in Fig. 9a. It indicates that optical flow near the human arm is highly
unreliable. A similar phenomenon can also be observed in the example of KITTI.
By pinpointing the flow correctness, the use of the confidence map facilitates
both cost volume modulation and flow field deformation.
Displacement Field. As shown in Fig. 9c, the active region of the displacement
field (having strong color intensity) coincides well with the active region of the
confidence map (having strong darkness, indicating a high probability of incorrect
flow) in Fig. 9a. The deformed flow field in Fig. 9d has not only fewer artifacts
but also sharper motion boundaries and a lower AEE when compared with the
flow field without meta-warping in Fig. 9b.
5 Conclusion
Correspondence ambiguity is a common problem in optical flow estimation. Am-
biguous feature matching causes outliers to exist in a cost volume and in turn
affects the decoding of flow from it. Besides, erroneous optical flow can be prop-
agated to subsequent pyramid levels. We propose to amend the cost volume
prior to the flow decoding. This is accomplished by modulating each cost vector
through an adaptive affine transformation. We further improve the flow accuracy
by replacing each inaccurate optical flow with an accurate one from a nearby po-
sition through a meta-warping governed by a displacement field. We also propose
to use a confidence map to facilitate the generation of modulation parameters
and displacement field. LiteFlowNet3, which incorporates the cost volume modu-
lation and flow field deformation, not only demonstrates promising performance
on public benchmarks but also has a small model size and a fast runtime.
References
1. Bailer, C., Taetz, B., Stricker, D.: Flow Fields: Dense correspondence fields for
highly accurate large displacement optical flow estimation. ICCV pp. 4015–4023
(2015)
2. Brabandere, B.D., Jia, X., Tuytelaars, T., Gool, L.V.: Dynamic filter networks.
NIPS (2016)
3. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow esti-
mation based on a theory for warping. ECCV pp. 25–36 (2004)
4. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie
for optical flow evaluation. ECCV pp. 611–625 (2012)
5. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. ICCV pp. 764–773 (2017)
6. Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der
Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolu-
tional networks. ICCV pp. 2758–2766 (2015)
7. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? CVPR
pp. 3354–3361 (2012)
8. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17,
185–203 (1981)
9. Hui, T.W., Loy, C.C.: Supplementary material for LiteFlowNet3: Resolving Cor-
respondence Ambiguity for More Accurate Optical Flow Estimation (2020)
10. Hui, T.W., Tang, X., Loy, C.C.: LiteFlowNet: A lightweight convolutional neural
network for optical flow estimation. CVPR pp. 8981–8989 (2018)
11. Hui, T.W., Tang, X., Loy, C.C.: A lightweight optical flow CNN
– Revisiting data fidelity and regularization. TPAMI (2020).
https://doi.org/10.1109/TPAMI.2020.2976928
12. Hur, J., Roth, S.: Iterative residual refinement for joint optical flow and occlusion
estimation. CVPR pp. 5754–5763 (2019)
13. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0:
Evolution of optical flow estimation with deep networks. CVPR pp. 2462–2470
(2017)
14. Ilg, E., Saikia, T., Keuper, M., Brox, T.: Occlusions, motion and depth boundaries
with a generic network for disparity, optical flow or scene flow estimation. ECCV
pp. 626–643 (2018)
15. Janai, J., Güney, F., Ranjan, A., Black, M., Geiger, A.: Unsupervised learning of
multi-frame optical flow with occlusions. ECCV pp. 713–731 (2018)
16. Jiang, H., Sun, D., Jampani, V., Lv, Z., Learned-Miller, E., Kautz, J.: SENSE: a
shared encoder network for scene-flow estimation. ICCV pp. 3195–3204 (2019)
17. Kang, S.B., Szeliski, R., Chai, J.: Handling occlusions in dense multi-view stereo.
CVPR pp. 103–110 (2001)
18. Liu, P., Lyu, M., King, I., Xu, J.: SelFlow: Self-supervised learning of optical flow.
CVPR pp. 4566–4575 (2019)
19. Lu, Y., Valmadre, J., Wang, H., Kannala, J., Harandi, M., Torr, P.H.S.: Devon:
Deformable volume network for learning optical flow. WACV pp. 2705–2713 (2020)
20. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox,
T.: A large dataset to train convolutional networks for disparity, optical flow, and
scene flow estimation. CVPR pp. 4040–4048 (2016)
21. Meister, S., Hur, J., Roth, S.: UnFlow: Unsupervised learning of optical flow with
a bidirectional census loss. AAAI pp. 7251–7259 (2018)
22. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. CVPR pp.
3061–3070 (2015)
23. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic
flow computation with theoretically justified warping. IJCV 67(2), 141–158 (2006)
24. Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network.
CVPR pp. 4161–4170 (2017)
25. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: Edge-preserving
interpolation of correspondences for optical flow. CVPR pp. 1164–1172 (2015)
26. Rhemann, C., Hosni, A., Bleyer, M., Rother, C., Gelautz, M.: Fast cost-volume
filtering for visual correspondence and beyond. CVPR pp. 3017–3024 (2011)
27. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using
pyramid, warping, and cost volume. CVPR pp. 8934–8943 (2018)
28. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does train-
ing: An empirical study of CNNs for optical flow estimation. TPAMI (2019).
https://doi.org/10.1109/TPAMI.2019.2894353
29. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.:
Anisotropic Huber-L1 optical flow. BMVC (2009)
30. Xu, J., Ranftl, R., Koltun, V.: Accurate optical flow via direct cost volume pro-
cessing. CVPR pp. 1289–1297 (2017)
31. Yang, G., Ramanan, D.: Volumetric correspondence networks for optical flow.
NeurIPS (2019)
32. Yin, Z., Darrell, T., Yu, F.: Hierarchical discrete distribution decomposition for
match density estimation. CVPR pp. 6044–6053 (2019)
33. Zimmer, H., Bruhn, A., Weickert, J.: Optic flow in harmony. IJCV 93(3), 368–388
(2011)