Augmented Coarse-to-fine Video Frame Synthesis with
Semantic Loss
Xin Jin¹, Zhibo Chen¹ (ORCID 0000-0002-8525-5066), Sen Liu¹, and Wei Zhou¹
¹ CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China
chenzhibo@ustc.edu.cn
Abstract. Existing video frame synthesis methods struggle to improve perceptual quality while preserving semantic representation ability. In this paper, we propose a Progressive Motion-texture Synthesis Network (PMSN) to address this problem. Instead of learning synthesis from scratch, we introduce augmented inputs to compensate for missing texture details and motion information. Specifically, a coarse-to-fine guidance scheme with a well-designed semantic loss is presented to improve the capability of video frame synthesis. As shown in the experiments, our proposed PMSN achieves better quantitative results, visual effects, and generalization ability than traditional solutions.
Keywords: Video frame synthesis, augmented input, coarse-to-fine guidance scheme,
semantic loss
1 Introduction
Video frame synthesis plays an important role in numerous applications across different fields, including video compression [2], video frame rate up-sampling [12], and pilotless vehicles [9]. Given a video sequence, video frame synthesis aims to interpolate frames between the existing video frames or to extrapolate future video frames, as shown in Fig. 1. However, constructing a generalized model to synthesize video frames is still challenging, especially for videos with large motion and complex texture.
Fig. 1. Interpolation and extrapolation tasks in the video frame synthesis problem.
Considerable effort has been devoted to video frame synthesis. Traditional approaches focus on synthesizing video frames from estimated motion information, such as optical flow [22, 10, 16]. Recent approaches propose deep generative models that directly hallucinate the pixel values of video frames [21, 17, 27, 13, 19, 15, 4, 26]. However, motion-based approaches often generate significant artifacts because the accuracy of motion estimation cannot be guaranteed. Meanwhile, because they rely on straightforward non-linear convolution operations, deep generative models tend to produce blurry results.
To tackle the above problems, we propose a deep model called Progressive Motion-texture Synthesis Network (PMSN), which is a global encoder-decoder architecture with coarse-to-fine guidance under a brain-inspired semantic objective. An overview of the whole PMSN process is illustrated in Fig. 2. Specifically, we first introduce an augmented frames generation process to produce Motion-texture Augmented Frames (MAFs) containing coarse-grained motion prediction and high texture details. Second, to reduce the loss of detailed information in the feed-forward process and to help the network learn motion tendency, MAFs are fed into the decoder stage at different scales in a coarse-to-fine manner, rather than being fused directly into a single layer as described in [12]. Finally, we adopt a brain-inspired semantic loss in the learning stage to further enhance the subjective quality and preserve the semantic representation ability of synthesized frames. The contributions of this paper are summarized as follows:
1. Instead of learning synthesis from scratch, we introduce a novel Progressive Motion-texture Synthesis Network (PMSN) to learn frame synthesis with triple-frame input with the assistance of augmented frames. These augmented frames provide effective prior information, including motion tendency and texture details, to support the video synthesis.
2. A coarse-to-fine guidance scheme is adopted in the decoder stage of the network to increase its sensitivity to informative features. Through this scheme, we can maximally exploit the informative features and suppress less useful ones at the same time. The scheme acts as a bridge that combines conventional motion estimation methods and deep learning-based methods.
3. We develop a brain-inspired semantic loss for sharpening the synthetic results and strengthening object texture as well as motion information. The final results demonstrate better perceptual quality and a better ability to preserve semantic representation.
2 Related Work
Traditional Methods Early attempts at video frame synthesis focused on motion estimation based approaches. For example, Revaud et al. [22] proposed EpicFlow to estimate optical flow by edge-aware distance. Li et al. [10] adopted a Laplacian Cotangent Mesh constraint to enhance the local smoothness of results generated by optical flow. Meyer et al. [16] leveraged phase shift information for image interpolation. The results of these methods rely heavily on precise estimation of motion information. Significant artifacts can appear when the estimation is unsatisfactory, as happens for videos with large or complex motion.
Fig. 2. Overview of our Progressive Motion-texture Synthesis Network (PMSN).
Learning-based Methods The renaissance of deep neural networks (DNNs) has remarkably accelerated the progress of video frame synthesis. A number of methods have been proposed to interpolate or extrapolate video frames [14, 17, 27, 13, 19, 15, 26]. Michalski et al. [17] focused on representing a series of transformations to predict small patches based on a recurrent neural network (RNN). Xue et al. [27] proposed a model that generates videos under the assumption that the background is uniform. Lotter et al. [13] proposed a network called PredNet, which contains a series of stacked modules that forward the deviations in video sequences. Mathieu et al. [15] proposed a multi-scale architecture with adversarial training, which is referred to as BeyondMSE. Niklaus et al. [19] estimated a convolution kernel from the input frames and then used the kernel to convolve patches from the input frames to synthesize the interpolated ones. However, it is still hard to hallucinate realistic details for videos with complex spatiotemporal information using non-linear convolution operations alone.
Recently, Liu et al. [12] utilized pixel-wise 3D voxel flow to synthesize video frames. Lu et al. [14] presented a Flexible Spatio-Temporal Network (FSTN) to capture complex spatio-temporal dependencies and motion patterns with diverse training strategies. Niklaus et al. [19] focused only on the video frame interpolation task, via adaptive convolution. Liang et al. [11] developed a dual motion Generative Adversarial Network (GAN) for video prediction. Villegas et al. [25] proposed a deep generative model named MCNet, which extracts the features of the last frame as content information and encodes the temporal differences between previous consecutive frames as motion information. Unfortunately, these methods usually can only handle videos with tiny object motion and simple backgrounds, and they often produce blur artifacts in video scenes with large and complex motion. In contrast, our proposed PMSN achieves much better results, especially in complex scenes. In the experiment section (Sec. 4), we present evaluations comparing our method with the above methods.
Fig. 3. The whole architecture of our Progressive Motion-texture Synthesis Network (PMSN).
3 Progressive Motion-texture Synthesis Network
The whole architecture of our Progressive Motion-texture Synthesis Network (PMSN) is shown in Fig. 3. It takes advantage of the spatial invariance and temporal correlation of image representations. Instead of learning from scratch, the model receives the original video frames combined with the produced augmented frames as its input. These triple-frame inputs provide more reference information for motion trajectory and texture residue, which leads to more reasonable high-level image representations. In the following sub-sections, we first describe the augmented frames generation process. Then, the coarse-to-fine guidance scheme and the semantic loss are presented.
Encoder Stage: Each convolutional block is shown in Fig. 4 (a). The receptive field of every convolution filter is (4,4), with stride (2,2). A group of Residual Blocks [5] (the number of blocks in each group is shown in Fig. 3) is used to strengthen the non-linear representation and preserve more spatial-temporal details. To mitigate overfitting and the internal covariate shift problem, we add a batch normalization layer before each Rectified Linear Unit (ReLU) layer [18].
Decoder Stage: The deconvolutional block, shown in Fig. 4 (b), is used to upsample the feature maps. It has a receptive field of (5,5) with stride (2,2). The block also contains BatchNorm, a ReLU layer, and a Residual Block; their parameters are shown in Fig. 3. To maintain image details from low level to high level, we build skip connections, which are illustrated as the thin blue arrows in Fig. 3.
Fig. 4. Two sub-components in PMSN. (a) Convolutional Block in PMSN. (b) Deconvolutional Block in PMSN.
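As a rough check on how the stride-2 encoder blocks change resolution, the sketch below applies the standard convolution output-size formula to a 320x240 frame. The padding of 1 and the count of three downsampling steps are assumptions made for illustration (the text gives only the (4,4) filter and stride (2,2)); under these assumptions the sizes coincide with the four guidance resolutions listed in Sec. 3.2.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after one strided convolution (standard floor formula).

    padding=1 is an assumption; the paper does not state the padding.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(width, height, levels=3):
    """Resolutions seen after each of `levels` stride-2 convolutions."""
    sizes = [(width, height)]
    for _ in range(levels):
        width, height = conv_out(width), conv_out(height)
        sizes.append((width, height))
    return sizes


print(encoder_sizes(320, 240))
# [(320, 240), (160, 120), (80, 60), (40, 30)]
```

Each step halves both dimensions exactly, which is what lets an MAF resized to 40x30, 80x60, 160x120, or 320x240 line up with a decoder level.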
3.1 Augmented Frames Generation
Intrinsically, our PMSN utilizes augmented frames rather than learning from scratch. It is therefore important for the augmented frames to preserve a coarse motion trajectory and less-blurred texture, so that PMSN can further improve the quality with the assistance of coarse-to-fine guidance and the semantic loss. Any frame augmentation scheme satisfying these two requirements can be adopted in the PMSN framework. In this paper, we introduce a simple augmented frames generation process to produce Motion-texture Augmented Frames (MAFs) containing coarse-grained motion prediction and high texture details. Similar to motion-estimation based frame synthesis methods, the original input frames are first decomposed into block-level matrices. Then, we directly copy the matching blocks to the MAFs according to the estimated motion vectors of these blocks. As shown in Fig. 5 (a), to calculate the motion vectors for generating the MAF $f'_i$, we first partition the frame $\hat{f}_{i-1}$ into regular 4x4 blocks, then search backward in the frame $\hat{f}_{i-2}$. When building each 4x4 block of the MAF, the motion vector of the corresponding 4x4 block in frame $\hat{f}_{i-1}$ is used to locate and copy the data from frame $\hat{f}_{i-1}$. A block size of 4x4 is sufficient for our purpose of generating the MAFs.
Fig. 5. (a) The generation process of MAFs, where the direction of motion vectors is backward. (b) Attention Object Bounding Box extracted by Faster R-CNN.
Note that this frame augmentation scheme can be replaced by any other frame synthesis solution. We verify this in the experiment section (Sec. 4.3) by replacing MAFs with augmented frames generated by [19], which further demonstrates the effectiveness of our proposed PMSN.
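The block-copy procedure of this section can be sketched as follows for the extrapolation case. This is a minimal illustration, not the authors' implementation: the sum-of-absolute-differences cost, the search radius, the function names, and the convention that each MAF block is fetched from frame i-1 at the position its backward motion vector points to are all assumptions.

```python
def sad(a, ay, ax, b, by, bx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(a[ay + r][ax + c] - b[by + r][bx + c])
               for r in range(n) for c in range(n))


def motion_vector(prev1, prev2, y, x, n=4, radius=2):
    """Backward search: where does the block of prev1 at (y, x) sit in prev2?"""
    h, w = len(prev1), len(prev1[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= h - n and 0 <= xx <= w - n:
                cost = sad(prev1, y, x, prev2, yy, xx, n)
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv


def make_maf(prev1, prev2, n=4, radius=2):
    """Build a coarse next frame by copying blocks of prev1 along their motion vectors."""
    h, w = len(prev1), len(prev1[0])
    maf = [row[:] for row in prev1]  # pixels outside full blocks keep prev1 values
    for y in range(0, h - n + 1, n):
        for x in range(0, w - n + 1, n):
            dy, dx = motion_vector(prev1, prev2, y, x, n, radius)
            for r in range(n):
                maf[y + r][x:x + n] = prev1[y + dy + r][x + dx:x + dx + n]
    return maf


# Toy frames: a linear ramp that shifts one pixel to the right per frame.
prev2 = [[16 * y + x for x in range(8)] for y in range(8)]      # frame i-2
prev1 = [[16 * y + x - 1 for x in range(8)] for y in range(8)]  # frame i-1
print(motion_vector(prev1, prev2, 0, 4))  # (0, -1)
maf = make_maf(prev1, prev2)
```

In this toy case the interior blocks of `maf` continue the shift by one more pixel, while blocks at the left border cannot find their true match inside the frame and fall back to a poorer one. This is the kind of blocking artifact the network is expected to clean up.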
3.2 Coarse-to-fine Guidance
To make use of the information aggregated in the MAFs as well as the triple-frame input groups for selectively emphasizing informative features and suppressing less useful ones, we propose a coarse-to-fine guidance scheme to guide our network in an end-to-end manner, which is illustrated as orange arrows in Fig. 2, Fig. 3, and Fig. 4 (b). Specifically, given the double-frame input $X$ and a single augmented frame $\tilde{Y}$, our goal is to obtain the synthesized interpolated/extrapolated frame $Y'$, which can be formulated as:

$$Y' = f(G(X + \tilde{Y}), \tilde{Y}), \tag{1}$$

where $G$ denotes a generator that learns motion trajectory and texture residue from the triple-frame input group $X + \tilde{Y}$. The function $f$ represents the fusion process, which fully captures channel-wise dependencies through a concatenation operation.
To progressively improve the quality of synthesized frames in a coarse-to-fine way, we perform a series of syntheses from MAFs with gradually increasing resolutions, as follows:

$$
\begin{aligned}
Y'_1 &= f(G_1(X + \tilde{Y}_1),\, e_1(\tilde{Y}_1)), \\
Y'_2 &= f(G_2(Y'_1 + \tilde{Y}_2),\, e_2(\tilde{Y}_2)), \\
&\;\;\vdots \\
Y'_k &= f(G_k(Y'_{k-1} + \tilde{Y}_k),\, e_k(\tilde{Y}_k)),
\end{aligned}
\tag{2}
$$
where $k$ indexes the levels of the coarse-to-fine synthesis process. In our PMSN, we set the size of each level to 40x30 ($k = 1$), 80x60 ($k = 2$), 160x120 ($k = 3$), and 320x240 ($k = 4$). $G_k$ is the $k$-th middle layer of $G$, and $G_1, G_2, \ldots, G_k$ together compose an integrated network. $e_k$ is the feature extractor of $\tilde{Y}_k$; we employ two dilated convolutional layers [28] instead of simple downsampling operations to preserve the texture details of the original images. Since the output $Y'_k$ is produced by a summation over all channels ($X$ and $Y$), the channel dependencies are implicitly embedded in it. To ensure that the network can increase its sensitivity to informative features and suppress less useful ones, the final output of each level is obtained by assigning each channel a corresponding weighting factor $W$. We then design a Guidance Loss $\ell_{guid}$ containing four sub-loss functions, one for each level output $\hat{Y}'_k$. Let $Y$ denote the ground truth and $\delta$ the ReLU activation function [18]:

$$\hat{Y}'_k = F(Y'_k, W) = \delta(W Y'_k), \qquad \ell_{guid} = \sum_{k=1}^{4} \|\hat{Y}'_k - Y_k\|_2. \tag{3}$$
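A small sketch may make Eq. (3) concrete: each level's output is re-weighted channel by channel, passed through ReLU, and compared with a target of the same resolution, and the per-level distances are summed. The data layout (a list of channels, each a flat list of pixels), the function names, and the reading of $Y_k$ as the ground truth resized to level $k$ are assumptions for illustration only.

```python
import math


def relu(v):
    """Rectified linear unit on a scalar."""
    return v if v > 0 else 0.0


def weighted_output(channels, weights):
    """Apply one weight per channel, then ReLU: delta(W Y'_k)."""
    return [[relu(w * v) for v in ch] for ch, w in zip(channels, weights)]


def l2_distance(a, b):
    """Euclidean distance between two multi-channel outputs of equal shape."""
    return math.sqrt(sum((p - q) ** 2
                         for ca, cb in zip(a, b) for p, q in zip(ca, cb)))


def guidance_loss(level_outputs, level_weights, level_targets):
    """Sum of per-level L2 distances, following the form of Eq. (3)."""
    return sum(l2_distance(weighted_output(y, w), t)
               for y, w, t in zip(level_outputs, level_weights, level_targets))


# Two toy levels with one channel each (the paper uses four levels).
outputs = [[[1.0, 2.0]], [[3.0, -4.0]]]
weights = [[1.0], [1.0]]
targets = [[[1.0, 2.0]], [[0.0, 0.0]]]
print(guidance_loss(outputs, weights, targets))  # 3.0
```

The first level matches its target exactly, so the whole loss comes from the second level, where ReLU zeroes the negative pixel and leaves a distance of 3.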
3.3 Semantic Loss
In the visual cortex, neurons mapped to the visible or salient parts of an image are activated first, followed by a later spread of activation to neurons mapped to the missing parts [3, 6]. Inspired by this visual cortex representation process, we design a hybrid semantic loss $\ell_{sem}$ to further sharpen the texture details of synthesized results and strengthen informative motion information. It consists of four sub-parts: the Guidance Loss $\ell_{guid}$ mentioned above, the Lateral Dependency Loss $\ell_{ld}$, the Attention Emphasis Loss $\ell_{emph}$, and the Gradient Loss $\ell_{grad}$. First, to imitate the cortical neuron filling-in process and capture the lateral dependency between neighbors in the visual cortex, $\ell_{ld}$ is proposed:

$$\ell_{ld} = \frac{1}{N} \sum_{i,j=1}^{N} \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i-1,j}\|_2 - \|Y_{i,j} - Y_{i-1,j}\|_2 \Big| + \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i,j-1}\|_2 - \|Y_{i,j} - Y_{i,j-1}\|_2 \Big|. \tag{4}$$
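Eq. (4) compares how strongly each pixel differs from its upper and left neighbors in the prediction versus in the ground truth. The sketch below implements this for a single-channel image, where the L2 norm of a pixel difference reduces to an absolute value. Skipping the first row and column and normalizing by the total pixel count are assumptions, since the text does not say how borders or $N$ are handled.

```python
def lateral_dependency_loss(pred, target):
    """Mean mismatch of neighbor differences, in the spirit of Eq. (4).

    pred and target are equally sized 2-D lists of gray values.
    """
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(1, h):          # start at 1 so the upper neighbor exists
        for j in range(1, w):      # start at 1 so the left neighbor exists
            up_pred = abs(pred[i][j] - pred[i - 1][j])
            up_true = abs(target[i][j] - target[i - 1][j])
            left_pred = abs(pred[i][j] - pred[i][j - 1])
            left_true = abs(target[i][j] - target[i][j - 1])
            total += abs(up_pred - up_true) + abs(left_pred - left_true)
    return total / (h * w)


flat = [[0.0, 0.0], [0.0, 0.0]]
spike = [[0.0, 0.0], [0.0, 1.0]]
print(lateral_dependency_loss(spike, flat))  # 0.5
```

The loss is zero whenever the prediction reproduces the local contrast of the ground truth, even under a constant brightness offset, so it targets structure rather than absolute intensity.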
Second, $\ell_{emph}$ is employed to strengthen the texture and motion information of attention objects in the scene, namely, to emphasize the gradients for attention objects through feedback values during back-propagation. As shown in Fig. 5 (b), we use Faster R-CNN [20] to extract the foreground attention objects through a priori bounding box, where $(W_{box}, H_{box})$ is its width and height. We then define the Attention Emphasis Loss $\ell_{emph}$ as follows:

$$\ell_{emph} = \frac{1}{W_{box} \times H_{box}} \sum_{i,j}^{(i,j) \in box} \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i-1,j}\|_2 - \|Y_{i,j} - Y_{i-1,j}\|_2 \Big| + \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i,j-1}\|_2 - \|Y_{i,j} - Y_{i,j-1}\|_2 \Big|. \tag{5}$$
Finally, $\ell_{grad}$ is used to sharpen the texture details by incorporating image gradients, as shown in Eq. (6); a similar operation is described in [15]. In summary, the semantic loss $\ell_{sem}$ is a weighted sum of all the losses, as given in Eq. (7). In our experiments, $\alpha = 1$, $\beta = 0.3$, $\gamma = 0.7$, and $\lambda = 1$ are the weights for the Guidance Loss, Lateral Dependency Loss, Attention Emphasis Loss, and Gradient Loss, respectively:

$$\ell_{grad} = \|\nabla \hat{Y} - \nabla Y\|_2. \tag{6}$$

$$\ell_{sem} = \alpha \ell_{guid} + \beta \ell_{ld} + \gamma \ell_{emph} + \lambda \ell_{grad}. \tag{7}$$
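The following sketch shows one way Eqs. (6) and (7) could be evaluated. The weights are those stated in the text; the forward-difference approximation of the image gradient, the function names, and the dictionary layout are assumptions for illustration.

```python
import math

# Weights stated in the paper: alpha, beta, gamma, lambda.
WEIGHTS = {"guid": 1.0, "ld": 0.3, "emph": 0.7, "grad": 1.0}


def gradients(img):
    """Horizontal and vertical forward differences of a 2-D list, flattened."""
    h, w = len(img), len(img[0])
    gx = [img[i][j + 1] - img[i][j] for i in range(h) for j in range(w - 1)]
    gy = [img[i + 1][j] - img[i][j] for i in range(h - 1) for j in range(w)]
    return gx + gy


def gradient_loss(pred, target):
    """L2 norm of the difference between image gradients, as in Eq. (6)."""
    return math.sqrt(sum((p - t) ** 2
                         for p, t in zip(gradients(pred), gradients(target))))


def semantic_loss(terms, weights=WEIGHTS):
    """Weighted sum of the four loss terms, as in Eq. (7)."""
    return sum(weights[name] * terms[name] for name in weights)


edge = [[0.0, 1.0], [0.0, 1.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
terms = {"guid": 2.0, "ld": 1.0, "emph": 1.0, "grad": gradient_loss(edge, flat)}
print(semantic_loss(terms))
```

With these weights the guidance and gradient terms enter at full strength, while the lateral dependency term is down-weighted relative to the attention emphasis term.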
4 Experiments
In this section, we present comprehensive experiments to analyze and understand the behavior of our model. We first evaluate our model in terms of qualitative and quantitative performance for video interpolation and extrapolation. Then we show more capabilities of our PMSN on various datasets. Finally, we analyze the effectiveness of the different components of PMSN separately.
Datasets: We train our network on 153,971 triplet video sequences sampled from the UCF-101 [24] dataset, and test the performance on the UCF-101 (validation), HMDB51 [8], and YouTube-8M [1] datasets.
Training Details: We adopt an Adam [7] solver to learn the model parameters by optimizing the semantic loss. The batch size is set to 32, and the initial learning rate is 0.005, decaying every 50K steps. We train the model for 100K iterations. The source code will be released in the future.
Baselines: We divide existing video synthesis methods into three categories for comparison. (1) Interpolation-only: Phase-based frame interpolation [16] is a traditional, well-performing method designed only for video interpolation. Ada-Conv [19] also focuses only on video frame interpolation, via adaptive convolution operations. (2) Extrapolation-only: PredNet [13] is a CNN architecture inspired by predictive coding. MCNet [25] predicts frames by decomposing motion and content. FSTN [14] and Dual-Motion GAN [11] both focus only on video extrapolation, and their authors have not released pre-trained weights or training details; hence, we only compare against the PSNR and SSIM reported in their papers. (3) Interpolation-plus-extrapolation: EpicFlow [22] is a state-of-the-art approach for optical flow estimation; the synthesized frames are constructed by pixel compensation. Among CNN-based methods, BeyondMSE [15] is a multi-scale architecture. The official model is trained with 4 and 8 input frames; since our method uses 2 input frames, we implement BeyondMSE with 2 input frames for comparison. U-Net [23], a well-received structure for pixel-level generation, is also implemented for comparison. Deep Voxel Flow (DVF) [12] trains a deep network that learns to synthesize video frames by flowing pixel values from existing frames.
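Table 1. Performance of frame synthesis on the UCF-101 validation dataset (PSNR in dB; "-" means the method does not support the task).

| Methods | Interpolation PSNR | Interpolation SSIM | Extrapolation PSNR | Extrapolation SSIM |
|---|---|---|---|---|
| PredNet [13] | - | - | 22.6 | 0.74 |
| Phase-based [16] | 28.4 | 0.84 | - | - |
| BeyondMSE [15] | 28.8 | 0.90 | 28.2 | 0.89 |
| EpicFlow [22] | 30.2 | 0.93 | 29.1 | 0.91 |
| U-Net [23] | 30.2 | 0.92 | 29.2 | 0.92 |
| FSTN [14] | - | - | 27.6 | 0.91 |
| MCNet [25] | - | - | 28.8 | 0.92 |
| Deep Voxel Flow [12] | 30.9 | 0.94 | 29.6 | 0.92 |
| Dual-Motion GAN [11] | - | - | 30.5 | 0.94 |
| Ada-Conv [19] | 32.6 | 0.95 | - | - |
| Ours | 33.1 | 0.96 | 31.3 | 0.94 |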
4.1 Quantitative and Qualitative Comparison
For quantitative comparison, we use both the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index to evaluate the image quality of interpolated/extrapolated frames; higher values of PSNR and SSIM indicate better results. For qualitative comparison, our approach is compared with several recent state-of-the-art methods in Fig. 6 and Fig. 7.
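For reference, PSNR can be computed as in the sketch below. This is the standard definition, not code from the paper; the peak value of 255 (8-bit images) and the single-channel list layout are assumptions.

```python
import math


def psnr(pred, target, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2-D lists."""
    count = len(pred) * len(pred[0])
    mse = sum((p - t) ** 2
              for row_p, row_t in zip(pred, target)
              for p, t in zip(row_p, row_t)) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [[10, 20], [30, 40]]
off_by_one = [[11, 21], [31, 41]]
print(round(psnr(off_by_one, reference), 2))  # 48.13
```

Because the scale is logarithmic, a gain of 0.5 dB corresponds to roughly an 11% reduction in mean squared error.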
Single-frame Synthesis As shown in Table 1, our solution outperforms all compared solutions. Compared with the best existing interpolation-only solution, Ada-Conv, and the best extrapolation-only solution, Dual-Motion GAN, PSNR improvements of about 0.5 dB and 0.8 dB are achieved, respectively. Compared with the best existing interpolation-plus-extrapolation scheme, Deep Voxel Flow, PSNR improvements of about 2.2 dB and 1.7 dB are achieved for interpolation and extrapolation, respectively. We also show some subjective results for perceptual comparison. As illustrated in Fig. 6 and Fig. 7, our PMSN demonstrates better perceptual quality, with clearer and more complete objects, non-blurred background scenes, and more accurate motion prediction, compared with existing solutions. For example, Ada-Conv generates strong distortion and loses part of the object in the bottom-right "leg" area due to failed motion prediction, whereas our PMSN shows no obvious artifacts.
Multi-frame Synthesis We further explore the multi-frame synthesis ability of our PMSN on various datasets, which can be used for up-sampling the video frame rate and generating videos with a slow-motion effect. The qualitative results in Fig. 8(a) show reasonable motion and realistic texture. As demonstrated in Fig. 8(b), PMSN outperforms the other state-of-the-art methods compared.
Fig. 6. Qualitative comparisons of video interpolation.
Fig. 7. Qualitative comparisons of video extrapolation.
Fig. 8. (a) Three-frame interpolation. (b) Performance comparisons on three-frame interpolation.
4.2 Generalization Ability
Furthermore, we show the generalization ability of our PMSN by evaluating the model on the YouTube-8M and HMDB-51 validation datasets without re-training. Table 2 demonstrates that our model outperforms all previous state-of-the-art models by an even larger margin (over 1.2 dB PSNR improvement on both datasets for interpolation and extrapolation) than in Table 1, which indicates that our PMSN has much better generalization ability.
Table 2. Performance of frame synthesis on YouTube-8M and HMDB-51 validation datasets (PSNR in dB; "-" means the method does not support the task).

| Methods | Interpolation PSNR | Interpolation SSIM | Extrapolation PSNR | Extrapolation SSIM |
|---|---|---|---|---|
| PredNet | - | - | 19.7 / 18.4 | 0.65 / 0.59 |
| Phase-based | 21.0 / 21.7 | 0.66 / 0.68 | - | - |
| U-Net | 24.2 / 23.8 | 0.73 / 0.72 | 22.7 / 22.4 | 0.70 / 0.70 |
| BeyondMSE | 26.6 / 26.8 | 0.78 / 0.80 | 25.7 / 26.1 | 0.74 / 0.76 |
| MCNet | - | - | 26.9 / 27.9 | 0.79 / 0.81 |
| EpicFlow | 29.5 / 29.5 | 0.92 / 0.92 | 29.2 / 29.3 | 0.90 / 0.92 |
| Ada-Conv | 29.5 / 29.6 | 0.93 / 0.92 | - | - |
| Ours | 31.1 / 31.4 | 0.94 / 0.92 | 30.4 / 30.7 | 0.94 / 0.93 |
4.3 Ablation Study
Effectiveness of Coarse-to-fine Guidance Scheme: We first visualize the output of each deconvolutional block in the decoder stage, which shows the results gradually improving as MAFs of different resolutions are used through the coarse-to-fine guidance scheme. As shown in the gray images of Fig. 9 (a), the texture details of the image are enhanced progressively, and the texture of the object becomes increasingly realistic.
In addition, as mentioned in Sec. 3.1, the frame augmentation scheme can be replaced by any other frame synthesis solution. We adopt the more complex adaptive convolution [19] in place of our basic generation of augmented frames (MAFs) in the video frame interpolation experiment, and find that our PMSN obtains an extra 0.5 dB gain in PSNR. Overall, these ablation studies demonstrate that the proposed coarse-to-fine guidance scheme is effective in further improving synthesis quality.
Effectiveness of MAFs: As shown in Fig. 9 (b), the MAFs alone are unsatisfactory, with a certain degree of blocking artifacts and uneven motion, while the results without MAFs have significant blur artifacts. This demonstrates that MAFs provide informative motion tendency and texture details for synthesis.
Fig. 9. (a) Output of each layer in the decoder stage. (b) Interpolation example.
Effectiveness of Semantic Loss: The semantic loss $\ell_{sem}$ comprises the Guidance Loss $\ell_{guid}$, Lateral Dependency Loss $\ell_{ld}$, Attention Emphasis Loss $\ell_{emph}$, and Gradient Loss $\ell_{grad}$. To evaluate the contribution of each loss, we implement four related baselines for comparison. As shown in Table 3, $\ell_{ld} + \ell_{guid}$, $\ell_{ld} + \ell_{emph}$, and $\ell_{ld} + \ell_{grad}$ all score higher than the basic $\ell_{ld}$, which means that the Guidance, Attention Emphasis, and Gradient Losses each lead to better performance. Combining them further improves the overall performance.

Table 3. Performance of hybrid losses (PSNR in dB).

| Methods | Interpolation PSNR | Interpolation SSIM | Extrapolation PSNR | Extrapolation SSIM |
|---|---|---|---|---|
| $\ell_{ld}$ | 31.9 | 0.92 | 30.5 | 0.90 |
| $\ell_{ld} + \ell_{guid}$ | 32.4 | 0.93 | 30.6 | 0.90 |
| $\ell_{ld} + \ell_{grad}$ | 32.8 | 0.95 | 30.9 | 0.91 |
| $\ell_{ld} + \ell_{emph}$ | 32.9 | 0.95 | 31.0 | 0.93 |
| $\ell_{sem}$ | 33.1 | 0.96 | 31.3 | 0.94 |
5 Conclusions
To address the problems of traditional synthesis frameworks based on pixel motion estimation and of learning-based solutions, we combine the advantages of both in the proposed Progressive Motion-texture Synthesis Network (PMSN) framework. Based on the augmented input, the network can obtain informative motion tendency and enhance the texture details of synthesized video frames through the well-designed coarse-to-fine guidance scheme. In the learning stage, a brain-inspired semantic loss is introduced to further refine the motion and texture of objects. We perform comprehensive experiments to verify the effectiveness of PMSN. In the future, we expect to extend PMSN to other types of tasks, such as video tracking and video question answering.
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0801001, the National Program on Key Basic Research Projects (973 Program) under Grant 2015CB351803, and NSFC under Grants 61571413, 61632001, and 61390514.
References
1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijaya-
narasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint
arXiv:1609.08675 (2016)
2. Choudhary, S., Varshney, P.: A study of digital video compression techniques. PARIPEX-
Indian Journal of Research 5(4) (2016)
3. De Weerd, P., Gattass, R., Desimone, R., Ungerleider, L.G.: Responses of cells in monkey
visual cortex during perceptual filling-in of an artificial scotoma. Nature p. 731 (1995)
4. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through
video prediction. In: NIPS. pp. 64–72 (2016)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR.
pp. 770–778 (2016)
6. Huang, X., Paradiso, M.A.: V1 response timing and surface filling-in. Journal of neurophys-
iology 100(1), 539–547 (2008)
7. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kuehne, H., Jhuang, H., Stiefelhagen, R., Serre, T.: Hmdb51: A large video database for
human motion recognition. In: High Performance Computing in Science and Engineering
12, pp. 571–582. Springer (2013)
9. Li, S., Yeung, D.Y.: Visual object tracking for unmanned aerial vehicles: A benchmark and
new motion models. In: AAAI. pp. 4140–4146 (2017)
10. Li, W., Cosker, D.: Video interpolation using optical flow and laplacian smoothness. Neuro-
computing 220, 236–243 (2017)
11. Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion gan for future-flow embedded video
prediction. ICCV (2017)
12. Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel
flow. In: ICCV. vol. 2 (2017)
13. Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and
unsupervised learning. ICLR (2017)
14. Lu, C., Hirsch, M., Schölkopf, B.: Flexible spatio-temporal networks for video prediction.
In: CVPR. pp. 6523–6531 (2017)
15. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square
error. ICLR (2016)
16. Meyer, S., Wang, O., Zimmer, H., Grosse, M., Sorkine-Hornung, A.: Phase-based frame
interpolation for video. In: CVPR. pp. 1410–1418 (2015)
17. Michalski, V., Memisevic, R., Konda, K.: Modeling deep temporal dependencies with recurrent "grammar cells". In: NIPS. pp. 1925–1933 (2014)
18. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In:
Proceedings of the 27th international conference on machine learning (ICML-10) (2010)
19. Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: CVPR.
vol. 2, p. 6 (2017)
20. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
region proposal networks. In: NIPS. pp. 91–99 (2015)
21. Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical
flow estimation. In: AAAI. pp. 1495–1501 (2017)
22. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpo-
lation of correspondences for optical flow. In: CVPR. pp. 1164–1172 (2015)
23. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image
segmentation. In: International Conference on Medical Image Computing and Computer-
Assisted Intervention. pp. 234–241. Springer (2015)
24. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from
videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
25. Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for
natural video sequence prediction. ICLR 1(2), 7 (2017)
26. Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In: NIPS. pp. 879–888 (2017)
27. Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame syn-
thesis via cross convolutional networks. In: NIPS. pp. 91–99 (2016)
28. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint
arXiv:1511.07122 (2015)
... Moreover, coarseto-fine networks are widely adopted and studied as solutions to enhance the degraded image quality resulting from various video processing outcomes. Jin et al. [26] proposed the Progressive Motion-Texture Synthesis Network (PMSN) to address challenges in existing video frame synthesis tasks related to improving perceptual quality and maintaining semantic representation capability. Additionally, Luo et al. [27] introduced a new Coarseto-Fine Spatio-Temporal Information Fusion (CF-STIF) for enhancing the quality of losscompressed videos. ...
Article
Full-text available
After the development of the Versatile Video Coding (VVC) standard, research on neural network-based video coding technologies continues as a potential approach for future video coding standards. Particularly, neural network-based intra prediction is receiving attention as a solution to mitigate the limitations of traditional intra prediction performance in intricate images with limited spatial redundancy. This study presents an intra prediction method based on coarse-to-fine networks that employ both convolutional neural networks and fully connected layers to enhance VVC intra prediction performance. The coarse networks are designed to adjust the influence on prediction performance depending on the positions and conditions of reference samples. Moreover, the fine networks generate refined prediction samples by considering continuity with adjacent reference samples and facilitate prediction through upscaling at a block size unsupported by the coarse networks. The proposed networks are integrated into the VVC test model (VTM) as an additional intra prediction mode to evaluate the coding performance. The experimental results show that our coarse-to-fine network architecture provides an average gain of 1.31% Bjøntegaard delta-rate (BD-rate) saving for the luma component compared with VTM 11.0 and an average of 0.47% BD-rate saving compared with the previous related work.
... Unlike images, videos contain highly temporal redundancy between consecutive frames. This redundancy is exploited using frame interpolation [26] and frame extrapolation [27], [28]. Moreover, some works estimate the optical flow between frames [29]- [31]. ...
Conference Paper
Full-text available
Driven by the growing demand for video applications , deep learning techniques have become alternatives for implementing end-to-end encoders to achieve applicable compression rates. Conventional video codecs exploit both spatial and temporal correlation. However, due to some restrictions (e.g. computational complexity), they are commonly limited to linear transformations and translational motion estimation. Autoencoder models open up the way for exploiting predictive end-to-end video codecs without such limitations. This paper presents an entire learning-based video codec that exploits spatial and temporal correlations. The presented codec extends the idea of P-frame prediction presented in our previous work. The architecture adopted for I-frame coding is defined by a variational autoencoder with non-parametric entropy modeling. Besides an entropy model parameterized by a hyperprior, the inter-frame encoder architecture has two other independent networks , responsible for motion estimation and residue prediction. Experimental results indicate that some improvements still have to be incorporated into our codec to overcome the all-intra coding set up regarding the traditional algorithms High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC).
... Compared to images, video contains high temporal correlation between frames. Several works focus on frame interpolation [24] or frame extrapolation [25]-[27] to leverage this correlation and increase frame rate. On the other hand, some efforts have been made to estimate optical flow between frames with [28] or without [29] supervision as a cornerstone for early-stage video analysis. ...
Article
One key challenge to learning-based video compression is that motion predictive coding, a very effective tool for video compression, can hardly be trained into a neural network. In this paper we propose the concept of PixelMotionCNN (PMCNN) which includes motion extension and hybrid prediction networks. PMCNN can model spatiotemporal coherence to effectively perform predictive coding inside the learning network. On the basis of PMCNN, we further explore a learning-based framework for video compression with additional components of iterative analysis/synthesis, binarization, etc. Experimental results demonstrate the effectiveness of the proposed scheme. Although entropy coding and complex configurations are not employed in this paper, we still demonstrate superior performance compared with MPEG-2 and achieve comparable results with H.264 codec. The proposed learning-based scheme provides a possible new direction to further improve compression efficiency and functionalities of future video coding.
Article
Modern codecs remove temporal redundancy of a video via inter prediction, i.e. searching previously coded frames for similar blocks and storing motion vectors to save bit-rates. However, existing codecs adopt block-level motion estimation, where a block is regressed by reference blocks linearly and is doomed to fail to deal with non-linear motions. In this paper, we generate virtual reference frames with previously reconstructed frames via deep networks to offer an additional candidate, which is not constrained to linear motion structure and further significantly improves coding efficiency. More specifically, we propose a novel deep Auto-Regressive Moving-Average (ARMA) model, Error-Corrected Auto-Regressive Network (ECAR-Net), equipped with the powers of the conventional statistical ARMA models and deep networks jointly for reference frame prediction. Similar to conventional ARMA models, the ECAR-Net consists of two stages: Auto-Regression (AR) stage and Error-Correction (EC) stage, where the first part predicts the signal at the current time-step based on previously reconstructed frames while the second one compensates for the output of the AR stage to obtain finer details. Different from the statistical AR models only focusing on short-term temporal dependency, the AR model of our ECAR-Net is further injected with the long-term dynamics mechanism, where long temporal information is utilized to help predict motions more accurately. Furthermore, ECAR-Net works in a configuration-adaptive way, i.e. using different dynamics and error definitions for the Low Delay B and Random Access configurations, which helps improve the adaptivity and generality in diverse coding scenarios. With the well-designed network, our method surpasses HEVC with average BD-rate savings of 5.0% and 6.6% for the luma component under the Low Delay B and Random Access configurations, respectively, and also obtains an average 1.54% BD-rate saving over VVC.
Article
Frame interpolation and synthesis are growing topics in the field of computer vision. Hence, these topics have gained more attention recently, and several deep-learning architectures have been proposed to enhance the quality of the synthesized frames. In this paper, an efficient handcrafted deep approach is proposed for better frame synthesis. The proposed approach takes advantage of the singular value decomposition (SVD) framework and Generative Adversarial Networks (GAN). The proposed approach does not require any computationally expensive feature extraction steps such as optical flow techniques and block-based motion compensation techniques. Nonetheless, the SVD components still carry the relevant motion information needed to deal with challenges such as large motion and occlusion. Thus, the frames are temporally upscaled via an SVD-based construction procedure where new middle frames are interpolated for further enhancement using a GAN-based approach that eliminates most of the visual artifacts. The proposed frame synthesis approach is comprehensively evaluated in different scenarios where its performance is assessed and compared with the state-of-the-art. Our framework outperforms the majority of the deep learning approaches in terms of quantitative results in addition to qualitative results, where our framework can generate smoother frames with fewer visual artifacts.
Conference Paper
Recent work has shown that optical flow estimation can be formulated as a supervised learning problem. Moreover, convolutional networks have been successfully applied to this task. However, supervised flow learning is obfuscated by the shortage of labeled training data. As a consequence, existing methods have to turn to large synthetic datasets for easily computer generated ground truth. In this work, we explore if a deep network for flow estimation can be trained without supervision. Using image warping by the estimated flow, we devise a simple yet effective unsupervised method for learning optical flow, by directly minimizing photometric consistency. We demonstrate that a flow network can be trained from end-to-end using our unsupervised scheme. In some cases, our results come tantalizingly close to the performance of methods trained with full supervision.
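The photometric consistency idea in this abstract can be illustrated with a minimal sketch: warp the second frame back toward the first using a candidate flow, then measure how well the two agree. The function names, the mean absolute difference penalty, and the border clamping below are assumptions made for illustration, not the loss used in the cited paper.

```python
def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at a real-valued (x, y).

    Coordinates are clamped to the image border (an illustrative choice).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def photometric_loss(frame1, frame2, flow):
    """Mean absolute difference between frame1 and frame2 warped by flow.

    flow[y][x] = (u, v) means pixel (x, y) of frame1 is assumed to appear
    at (x + u, y + v) in frame2.
    """
    h, w = len(frame1), len(frame1[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            u, v = flow[y][x]
            warped = bilinear_sample(frame2, x + u, y + v)
            total += abs(frame1[y][x] - warped)
    return total / (h * w)


# Toy example: frame2 is frame1 shifted one pixel to the right.
frame1 = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
frame2 = [[0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 1.0, 2.0]]
true_flow = [[(1.0, 0.0)] * 4 for _ in range(2)]
zero_flow = [[(0.0, 0.0)] * 4 for _ in range(2)]

loss_true = photometric_loss(frame1, frame2, true_flow)
loss_zero = photometric_loss(frame1, frame2, zero_flow)
```

In this toy case the correct flow gives a lower loss than the zero flow, which is the signal an unsupervised flow network would follow during training. No ground-truth flow is needed to compute it.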
Article
Video frame interpolation typically involves two steps: motion estimation and pixel synthesis. Such a two-step approach heavily depends on the quality of motion estimation. This paper presents a robust video frame interpolation method that combines these two steps into a single process. Specifically, our method considers pixel synthesis for the interpolated frame as local convolution over two input frames. The convolution kernel captures both the local motion between the input frames and the coefficients for pixel synthesis. Our method employs a deep fully convolutional neural network to estimate a spatially-adaptive convolution kernel for each pixel. This deep neural network can be directly trained end to end using widely available video data without any difficult-to-obtain ground-truth data like optical flow. Our experiments show that the formulation of video interpolation as a single convolution process allows our method to gracefully handle challenges like occlusion, blur, and abrupt brightness change and enables high-quality video frame interpolation.
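The formulation of pixel synthesis as a local convolution over two input frames can be sketched as follows. In the abstract the kernels are estimated per pixel by a deep network; here a single hand-written kernel pair stands in for that prediction, and the names `patch`, `interpolate_pixel`, `kernel1`, and `kernel2` are hypothetical.

```python
def patch(img, cx, cy, radius):
    """Return the (2r+1) x (2r+1) patch centred at (cx, cy).

    Out-of-range coordinates replicate the border pixel.
    """
    h, w = len(img), len(img[0])
    return [
        [img[min(max(cy + dy, 0), h - 1)][min(max(cx + dx, 0), w - 1)]
         for dx in range(-radius, radius + 1)]
        for dy in range(-radius, radius + 1)
    ]


def interpolate_pixel(frame1, frame2, x, y, kernel1, kernel2):
    """Synthesize one output pixel as a local convolution over both frames."""
    radius = len(kernel1) // 2
    p1 = patch(frame1, x, y, radius)
    p2 = patch(frame2, x, y, radius)
    size = 2 * radius + 1
    return sum(
        kernel1[i][j] * p1[i][j] + kernel2[i][j] * p2[i][j]
        for i in range(size)
        for j in range(size)
    )


# A bright pixel moves from x=1 (frame1) to x=3 (frame2).
frame1 = [[0.0, 1.0, 0.0, 0.0, 0.0]]
frame2 = [[0.0, 0.0, 0.0, 1.0, 0.0]]

# Stand-in kernels: look one pixel left in frame1 and one pixel right in
# frame2, each weighted 0.5. The offsets encode the motion and the weights
# encode the blending, which is the dual role the abstract describes.
kernel1 = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
kernel2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]

middle = [interpolate_pixel(frame1, frame2, x, 0, kernel1, kernel2)
          for x in range(5)]
```

With these kernels the bright pixel lands at x=2 in the synthesized row, halfway between its positions in the two inputs, without any explicit optical flow step.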
Article
Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. As another important component in the tracking system, the motion model is much less well-explored especially for some extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV) or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines to model the camera motion by geometric transformation based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim for this work is to lay the foundation for further research in the UAV tracking area.
Article
The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical context, where the visual dynamics are believed to have modular structures that can be learned with compositional subsystems. This paper models these structures by presenting PredRNN, a new recurrent network, in which a pair of memory cells are explicitly decoupled, operate in nearly independent transition manners, and finally form unified representations of the complex environment. Concretely, besides the original memory cell of LSTM, this network is featured by a zigzag memory flow that propagates in both bottom-up and top-down directions across all layers, enabling the learned visual dynamics at different levels of RNNs to communicate. It also leverages a memory decoupling loss to keep the memory cells from learning redundant features. We further propose a new curriculum learning strategy to force PredRNN to learn long-term dynamics from context frames, which can be generalized to most sequence-to-sequence models. We provide detailed ablation studies to verify the effectiveness of each component. Our approach is shown to obtain highly competitive results on five datasets for both action-free and action-conditioned predictive learning scenarios.
Conference Paper
Learning to predict future images from a video sequence involves the construction of an internal representation that models the image evolution accurately, and therefore, to some degree, its content and dynamics. This is why pixel-space video prediction is viewed as a promising avenue for unsupervised feature learning. In this work, we train a convolutional network to generate future frames given an input sequence. To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, we propose three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function. We compare our predictions to different published results based on recurrent neural networks on the UCF101 dataset.
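An image gradient difference loss of the kind named in this abstract can be sketched in a few lines. The version below compares absolute differences between neighbouring pixels in the target and the prediction; the function name, the list-of-rows image format, and the exponent parameter `alpha` are assumptions for illustration rather than the exact formulation of the cited paper.

```python
def gradient_difference_loss(target, pred, alpha=1):
    """Sum of |(|grad target| - |grad pred|)|**alpha over both image axes."""
    h, w = len(target), len(target[0])
    loss = 0.0
    for y in range(h):
        for x in range(w):
            if x > 0:  # horizontal neighbour difference
                gt = abs(target[y][x] - target[y][x - 1])
                gp = abs(pred[y][x] - pred[y][x - 1])
                loss += abs(gt - gp) ** alpha
            if y > 0:  # vertical neighbour difference
                gt = abs(target[y][x] - target[y - 1][x])
                gp = abs(pred[y][x] - pred[y - 1][x])
                loss += abs(gt - gp) ** alpha
    return loss


# A sharp step edge, a blurred version of it, and a brightness-shifted copy.
target = [[0.0, 0.0, 1.0, 1.0]]
blurred = [[0.0, 0.25, 0.75, 1.0]]
shifted = [[0.5, 0.5, 1.5, 1.5]]

gdl_exact = gradient_difference_loss(target, target)
gdl_blur = gradient_difference_loss(target, blurred)
gdl_shift = gradient_difference_loss(target, shifted)
```

The blurred prediction is penalized because its edge is spread over several pixels, while the brightness-shifted copy is not, since its gradients match the target. This is why such a term complements a pixel-wise loss like MSE, which tends to favour blurry averages.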
Conference Paper
While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and generalizing across video datasets. These results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Future frame prediction in videos is a promising avenue for unsupervised video representation learning. Video frames are naturally generated by the inherent pixel flows from preceding frames based on the appearance and motion dynamics in the video. However, existing methods focus on directly hallucinating pixel values, resulting in blurry predictions. In this paper, we develop a dual motion Generative Adversarial Net (GAN) architecture, which learns to explicitly enforce future-frame predictions to be consistent with the pixel-wise flows in the video through a dual-learning mechanism. The primal future-frame prediction and dual future-flow prediction form a closed loop, generating informative feedback signals to each other for better video prediction. To make both synthesized future frames and flows indistinguishable from reality, a dual adversarial training method is proposed to ensure that the future-flow prediction is able to help infer realistic future-frames, while the future-frame prediction in turn leads to realistic optical flows. Our dual motion GAN also handles natural motion uncertainty in different pixel locations with a new probabilistic motion encoder, which is based on variational autoencoders. Extensive experiments demonstrate that the proposed dual motion GAN significantly outperforms state-of-the-art approaches on synthesizing new video frames and predicting future flows. Our model generalizes well across diverse visual scenes and shows superiority in unsupervised video representation learning.
Article
We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.