Augmented Coarse-to-fine Video Frame Synthesis with
Semantic Loss
Xin Jin1, Zhibo Chen1,[0000−0002−8525−5066], Sen Liu1, and Wei Zhou1
CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application
System, University of Science and Technology of China, Hefei 230027, China
chenzhibo@ustc.edu.cn
Abstract. Existing video frame synthesis works struggle to improve perceptual quality while preserving the semantic representation ability of the synthesized frames. In this paper, we propose a Progressive Motion-texture Synthesis Network (PMSN) to address this problem. Instead of learning synthesis from scratch, we introduce augmented inputs that compensate for missing texture details and motion information. Specifically, a coarse-to-fine guidance scheme with a well-designed semantic loss is presented to improve the capability of video frame synthesis. As shown in the experiments, our proposed PMSN delivers excellent quantitative results, visual quality, and generalization ability compared with existing solutions.
Keywords: Video frame synthesis, augmented input, coarse-to-fine guidance scheme,
semantic loss
1 Introduction
Video frame synthesis plays an important role in numerous applications across different fields, including video compression [2], video frame rate up-sampling [12], and autonomous driving [9]. Given a video sequence, video frame synthesis aims to interpolate frames between existing video frames or to extrapolate future frames, as shown in Fig. 1. However, constructing a generalized model that synthesizes video frames remains challenging, especially for videos with large motion and complex texture.
Fig. 1. Interpolation and extrapolation tasks in the video frame synthesis problem.
Considerable effort has been devoted to video frame synthesis. Traditional approaches synthesize video frames from estimated motion information such as optical flow [22, 10, 16]. Recent approaches employ deep generative models to directly hallucinate the pixel values of video frames [21, 17, 27, 13, 19, 15, 4, 26]. However, motion-based approaches generate significant artifacts when the accuracy of motion estimation cannot be guaranteed, while deep generative models, which rely on straightforward non-linear convolution operations, tend to produce blurry results.
To tackle the above problems, we propose a deep model called the Progressive Motion-texture Synthesis Network (PMSN), a global encoder-decoder architecture with coarse-to-fine guidance trained under a brain-inspired semantic objective. An overview of PMSN is illustrated in Fig. 2. Specifically, we first introduce an augmented frame generation process that produces Motion-texture Augmented Frames (MAFs) containing coarse-grained motion prediction and rich texture details. Second, to reduce the loss of detailed information in the feed-forward process and to help the network learn motion tendency, MAFs are fed into the decoder stage at different scales in a coarse-to-fine manner, rather than being fused directly into a single layer as in [12]. Finally, we adopt a brain-inspired semantic loss to further enhance the subjective quality and preserve the semantic representation ability of synthesized frames during training. The contributions of this paper are summarized as follows:
1. Instead of learning synthesis from scratch, we introduce a novel Progressive Motion-texture Synthesis Network (PMSN) that learns frame synthesis from a triple-frame input with the assistance of augmented frames. These augmented frames provide effective prior information, including motion tendency and texture details, to compensate for losses in video synthesis.
2. A coarse-to-fine guidance scheme is adopted in the decoder stage of the network to increase its sensitivity to informative features. Through this scheme, we maximally exploit informative features while suppressing less useful ones, acting as a bridge between conventional motion estimation methods and deep learning-based methods.
3. We develop a brain-inspired semantic loss that sharpens the synthesized results and strengthens object texture as well as motion information. The final results demonstrate better perceptual quality and better preservation of semantic representation.
2 Related Work
Traditional Methods Early attempts at video frame synthesis focused on motion-estimation based approaches. For example, Revaud et al. [22] proposed EpicFlow, which estimates optical flow with an edge-aware distance. Li et al. [10] adopted a Laplacian Cotangent Mesh constraint to enhance the local smoothness of results generated by optical flow. Meyer et al. [16] leveraged phase shift information for image interpolation. The results of these methods rely heavily on precise estimation of motion information; significant artifacts appear when estimation fails for videos with large or complex motion.
Fig. 2. Overview of our Progressive Motion-texture Synthesis Network (PMSN).
Learning-based Methods The renaissance of deep neural networks (DNNs) has remarkably accelerated progress on video frame synthesis. Numerous methods have been proposed to interpolate or extrapolate video frames [13, 14, 15, 17, 19, 26, 27]. Michalski et al. [17] modeled sequences of transformations to predict small patches with a recurrent neural network (RNN). Xue et al. [27] proposed a model that generates videos under the assumption of a uniform background. Lotter et al. [13] proposed PredNet, a network containing a series of stacked modules that forward the deviations in video sequences. Mathieu et al. [15] proposed a multi-scale architecture with adversarial training, referred to as BeyondMSE. Niklaus et al. [19] estimated a convolution kernel from the input frames and used it to convolve patches from those frames to synthesize the interpolated ones. However, it is still hard to hallucinate realistic details for videos with complex spatiotemporal information using non-linear convolution operations alone.
Recently, Liu et al. [12] utilized pixel-wise 3D voxel flow to synthesize video frames. Lu et al. [14] presented a Flexible Spatio-Temporal Network (FSTN) to capture complex spatio-temporal dependencies and motion patterns with diverse training strategies. The method of [19] addresses only the video frame interpolation task via adaptive convolution. Liang et al. [11] developed a dual-motion Generative Adversarial Network (GAN) for video prediction. Villegas et al. [25] proposed a deep generative model named MCNet that extracts features of the last frame as content information and encodes the temporal differences between previous consecutive frames as motion information. Unfortunately, these methods usually handle only videos with small object motion and simple backgrounds, and often produce blur artifacts in scenes with large and complex motion. In contrast, our proposed PMSN achieves much better results, especially in complex scenes. In Sec. 4, we present extensive comparisons between our method and the methods above.
Fig. 3. The whole architecture of our Progressive Motion-texture Synthesis Network (PMSN).
3 Progressive Motion-texture Synthesis Network
The whole architecture of our Progressive Motion-texture Synthesis Network (PMSN)
is shown in Fig. 3, which takes advantage of the spatial invariance and temporal cor-
relation for image representations. Instead of learning from scratch, the model receives
original video frames combined with the produced augmented frames as the whole in-
puts. These triple-frame inputs provide more reference information for motion trajecto-
ry and texture residue, which leads to more reasonable high-level image representations.
In the following sub-sections, we will first describe the augmented frames generation
process. Then, the coarse-to-fine guidance scheme and semantic loss are presented.
Encoder Stage: Each convolutional block is shown in Fig. 4 (a). All convolution filters have a receptive field of (4,4) with stride (2,2). A group of residual blocks [5] (the number of blocks per group is shown in Fig. 3) is used to strengthen the non-linear representation and preserve more spatial-temporal details. To mitigate overfitting and internal covariate shift, we add a batch normalization layer before each Rectified Linear Unit (ReLU) layer [18].
Decoder Stage: The deconvolutional block, shown in Fig. 4 (b), upsamples the feature maps with a receptive field of (5,5) and stride (2,2). Each block also contains batch normalization, ReLU, and residual blocks, whose parameters are shown in Fig. 3. To preserve image details from low level to high level, we add skip connections, illustrated as the thin blue arrows in Fig. 3.
Fig. 4. Two sub-components in PMSN. (a) Convolutional Block in PMSN. (b) Deconvolutional
Block in PMSN.
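To make the block structure concrete, the following is a minimal PyTorch-style sketch of the convolutional and deconvolutional blocks described above (4x4 convolutions with stride 2 in the encoder, 5x5 transposed convolutions with stride 2 in the decoder, each followed by batch normalization, ReLU, and a group of residual blocks). The channel counts and the number of residual blocks per group are placeholders standing in for the values listed in Fig. 3, and the framework choice is our assumption; the paper does not specify its implementation.

```python
# Minimal sketch of the encoder/decoder building blocks (assumed PyTorch).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ConvBlock(nn.Module):
    """Encoder block: 4x4 conv, stride 2, then BatchNorm, ReLU, residual blocks."""
    def __init__(self, in_ch, out_ch, n_res=1):
        super().__init__()
        layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        layers += [ResidualBlock(out_ch) for _ in range(n_res)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class DeconvBlock(nn.Module):
    """Decoder block: 5x5 transposed conv, stride 2, upsampling the features."""
    def __init__(self, in_ch, out_ch, n_res=1):
        super().__init__()
        layers = [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                     padding=2, output_padding=1),
                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        layers += [ResidualBlock(out_ch) for _ in range(n_res)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```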
3.1 Augmented Frames Generation
Intrinsically, our PMSN utilizes augmented frames rather than learning from scratch. It is therefore important for the augmented frames to preserve a coarse motion trajectory and minimally blurred texture, so that PMSN can further improve quality with the assistance of the coarse-to-fine guidance and the semantic loss. Any frame augmentation scheme satisfying these two requirements can be adopted in the PMSN framework; in this paper we introduce a simple augmented frame generation process that produces Motion-texture Augmented Frames (MAFs) containing coarse-grained motion prediction and rich texture details. Similar to motion-estimation based frame synthesis methods, the original input frames are first decomposed into block-level matrices. Then, we directly copy the matching blocks into the MAFs according to the estimated motion vectors of these blocks.
As shown in Fig. 5 (a), to calculate the motion vectors for generating the MAF $f'_i$, we first partition the frame $\hat{f}_{i-1}$ into regular 4x4 blocks and then search backward in the frame $\hat{f}_{i-2}$. When building each 4x4 block of the MAF, the motion vector of the corresponding 4x4 block in frame $\hat{f}_{i-1}$ is used to locate and copy the data from frame $\hat{f}_{i-1}$. This 4x4 block size is sufficient for our purpose of generating the MAFs.
Fig. 5. (a) The generation process of MAFs, where the direction of the motion vectors is backward. (b) Attention object bounding box extracted by Faster R-CNN.
Note that this frame augmentation scheme can be replaced by any other frame synthesis solution. We verify this in Sec. 4.3 by replacing the MAFs with augmented frames generated by [19], which further demonstrates the effectiveness of our proposed PMSN.
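For illustration, the sketch below outlines the MAF generation described above for the extrapolation case: each 4x4 block of $\hat{f}_{i-1}$ is matched backward in $\hat{f}_{i-2}$, and the resulting motion vector is used to fetch a motion-compensated block from $\hat{f}_{i-1}$. The SAD matching cost, the ±8 pixel search range, and the constant-motion assumption are our own assumptions; the paper does not specify how the motion vectors are estimated.

```python
# Illustrative MAF generation via backward 4x4 block matching (NumPy sketch).
import numpy as np

def generate_maf(f_prev2, f_prev1, block=4, search=8):
    """Build a motion-texture augmented frame (MAF) for extrapolating f_i."""
    H, W = f_prev1.shape[:2]
    maf = np.zeros_like(f_prev1)
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            cur = f_prev1[y:y + block, x:x + block].astype(np.float32)
            best_cost, best_dy, best_dx = np.inf, 0, 0
            # Backward search: find where this block came from in f_{i-2}.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue
                    cand = f_prev2[yy:yy + block, xx:xx + block].astype(np.float32)
                    cost = np.abs(cur - cand).sum()  # SAD matching cost (assumed)
                    if cost < best_cost:
                        best_cost, best_dy, best_dx = cost, dy, dx
            # Constant-motion assumption: the content reaching (y, x) at time i
            # sat at (y + dy, x + dx) in f_{i-1}; copy that block into the MAF.
            sy = int(np.clip(y + best_dy, 0, H - block))
            sx = int(np.clip(x + best_dx, 0, W - block))
            maf[y:y + block, x:x + block] = f_prev1[sy:sy + block, sx:sx + block]
    return maf
```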
3.2 Coarse-to-fine Guidance
In order to make use of the information aggregated in the MAFs and the triple-frame input group, selectively emphasizing informative features while suppressing less useful ones, we propose a coarse-to-fine guidance scheme that guides our network in an end-to-end manner, illustrated as the orange arrows in Fig. 2, Fig. 3, and Fig. 4 (b). Specifically, given the double-frame input $X$ and the single augmented frame $\tilde{Y}$, our goal is to obtain the synthesized interpolated/extrapolated frame $Y'$, which can be formulated as:
\[
Y' = f\big(G(X + \tilde{Y}),\, \tilde{Y}\big), \tag{1}
\]
where $G$ denotes a generator that learns motion trajectory and texture residue from the triple-frame input group $X + \tilde{Y}$, and $f$ represents the fusion process that fully captures channel-wise dependencies through a concatenation operation.
In order to progressively improve the quality of the synthesized frames in a coarse-to-fine way, we perform a series of syntheses from MAFs with gradually increasing resolutions:
\[
\begin{aligned}
Y'_1 &= f\big(G_1(X + \tilde{Y}_1),\, e_1(\tilde{Y}_1)\big),\\
Y'_2 &= f\big(G_2(Y'_1 + \tilde{Y}_2),\, e_2(\tilde{Y}_2)\big),\\
&\;\;\vdots\\
Y'_k &= f\big(G_k(Y'_{k-1} + \tilde{Y}_k),\, e_k(\tilde{Y}_k)\big),
\end{aligned} \tag{2}
\]
where $k$ indexes the levels of the coarse-to-fine synthesis process. In our PMSN, we set the resolution of each level to 40x30 ($k=1$), 80x60 ($k=2$), 160x120 ($k=3$), and 320x240 ($k=4$). $G_k$ is the corresponding intermediate stage of $G$, and $G_1, G_2, \ldots, G_k$ compose one integrated network. $e_k$ is the feature extractor for $\tilde{Y}_k$; we employ two dilated convolutional layers [28] instead of simple downsampling operations to preserve the texture details of the original images. Since the output $Y'_k$ is produced by a summation over all channels (of $X$ and $\tilde{Y}$), the channel dependencies are implicitly embedded in it. To ensure that the network increases its sensitivity to informative features and suppresses less useful ones, the final output of each level is obtained by assigning each channel a corresponding weighting factor $W$. We then design a Guidance Loss $\ell_{guid}$ containing four sub-losses, one for each level output $\hat{Y}'_k$. Let $Y$ denote the ground truth, $Y_k$ its version at the resolution of level $k$, and $\delta$ the ReLU activation function [18]:
\[
\hat{Y}'_k = F(Y'_k, W) = \delta(W \ast Y'_k), \qquad
\ell_{guid} = \sum_{k=1}^{4} \big\| \hat{Y}'_k - Y_k \big\|_2. \tag{3}
\]
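The fusion, channel re-weighting, and guidance loss of Eqs. (1)-(3) can be sketched roughly as below (PyTorch-style), with one module per level $k$. The channel counts, the concatenation-based fusion head, and the bilinear resizing of the ground truth to each level are illustrative assumptions rather than the exact implementation.

```python
# Rough sketch of one coarse-to-fine guidance level and the guidance loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidanceLevel(nn.Module):
    def __init__(self, feat_ch, maf_ch=3, out_ch=3):
        super().__init__()
        # e_k: two dilated conv layers extracting MAF features (Sec. 3.2).
        self.e_k = nn.Sequential(
            nn.Conv2d(maf_ch, feat_ch, 3, padding=2, dilation=2), nn.ReLU(True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=2, dilation=2), nn.ReLU(True))
        # f: fusion of decoder features (assumed feat_ch channels) and MAF
        # features by channel-wise concatenation.
        self.fuse = nn.Conv2d(2 * feat_ch, out_ch, 3, padding=1)
        # W: per-channel weighting factor applied before the ReLU (Eq. 3).
        self.w = nn.Parameter(torch.ones(out_ch, 1, 1))

    def forward(self, dec_feat, maf_k):
        y_k = self.fuse(torch.cat([dec_feat, self.e_k(maf_k)], dim=1))
        return F.relu(self.w * y_k)              # \hat{Y}'_k

def guidance_loss(level_outputs, ground_truth):
    """Sum of L2 distances between each level output and the resized GT."""
    loss = 0.0
    for y_hat in level_outputs:
        gt_k = F.interpolate(ground_truth, size=y_hat.shape[-2:],
                             mode='bilinear', align_corners=False)
        loss = loss + torch.norm(y_hat - gt_k, p=2)
    return loss
```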
3.3 Semantic Loss
In the visual cortex, neurons mapped to the visible or salient parts of an image are activated first, followed by a later spread of activation to neurons mapped to the missing parts [3, 6]. Inspired by this visual cortex representation process, we design a hybrid semantic loss $\ell_{sem}$ to further sharpen the texture details of the synthesized results and strengthen informative motion information. It consists of four sub-parts: the Guidance Loss $\ell_{guid}$ mentioned above, a Lateral Dependency Loss $\ell_{ld}$, an Attention Emphasis Loss $\ell_{emph}$, and a Gradient Loss $\ell_{grad}$. First, to imitate the cortical filling-in process and capture the lateral dependency between neighbors in the visual cortex, $\ell_{ld}$ is proposed:
\[
\ell_{ld} = \frac{1}{N} \sum_{i,j=1}^{N}
\Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i-1,j}\|_2 - \|Y_{i,j} - Y_{i-1,j}\|_2 \Big|
+ \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i,j-1}\|_2 - \|Y_{i,j} - Y_{i,j-1}\|_2 \Big|. \tag{4}
\]
Second, $\ell_{emph}$ is employed to strengthen the texture and motion information of attention objects in the scene, i.e., to emphasize the gradients of attention objects through the feedback values during back-propagation. As shown in Fig. 5 (b), we use Faster R-CNN [20] to extract the foreground attention objects as a priori bounding boxes, where $(W_{box}, H_{box})$ is the pair of box width and height. We then define the Attention Emphasis Loss $\ell_{emph}$ as follows:
\[
\ell_{emph} = \frac{1}{W_{box} \times H_{box}} \sum_{(i,j) \in box}
\Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i-1,j}\|_2 - \|Y_{i,j} - Y_{i-1,j}\|_2 \Big|
+ \Big|\, \|\hat{Y}_{i,j} - \hat{Y}_{i,j-1}\|_2 - \|Y_{i,j} - Y_{i,j-1}\|_2 \Big|. \tag{5}
\]
Finally, $\ell_{grad}$ is used to further sharpen texture details by penalizing differences in image gradients, as shown in Eq. (6); a similar term is used in [15]. The semantic loss $\ell_{sem}$ is a weighted sum of all of the above losses, where $\alpha = 1$, $\beta = 0.3$, $\gamma = 0.7$, and $\lambda = 1$ are the weights of the Guidance, Lateral Dependency, Attention Emphasis, and Gradient Losses, respectively:
\[
\ell_{grad} = \big\| \nabla\hat{Y} - \nabla Y \big\|_2, \tag{6}
\]
\[
\ell_{sem} = \alpha\,\ell_{guid} + \beta\,\ell_{ld} + \gamma\,\ell_{emph} + \lambda\,\ell_{grad}. \tag{7}
\]
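A rough PyTorch-style sketch of how these loss terms could be computed is given below. Tensors are assumed to be of shape (B, C, H, W), the bounding box is assumed to be a single (x1, y1, x2, y2) tuple from a detector such as Faster R-CNN, and per-pixel averaging stands in for the $1/N$ and $1/(W_{box} \times H_{box})$ normalizations; all of these are illustrative assumptions rather than the authors' exact code.

```python
# Sketch of the semantic loss terms of Eqs. (4)-(7).
import torch

def _lateral_term(pred, gt):
    # | ||Y^_{i,j}-Y^_{i-1,j}|| - ||Y_{i,j}-Y_{i-1,j}|| | plus the horizontal analogue.
    dv_pred = (pred[..., 1:, :] - pred[..., :-1, :]).norm(dim=1)
    dv_gt   = (gt[..., 1:, :]  - gt[..., :-1, :]).norm(dim=1)
    dh_pred = (pred[..., :, 1:] - pred[..., :, :-1]).norm(dim=1)
    dh_gt   = (gt[..., :, 1:]  - gt[..., :, :-1]).norm(dim=1)
    return (dv_pred - dv_gt).abs().mean() + (dh_pred - dh_gt).abs().mean()

def lateral_dependency_loss(pred, gt):                      # Eq. (4)
    return _lateral_term(pred, gt)

def attention_emphasis_loss(pred, gt, box):                 # Eq. (5)
    x1, y1, x2, y2 = box                                    # box assumed from Faster R-CNN
    return _lateral_term(pred[..., y1:y2, x1:x2], gt[..., y1:y2, x1:x2])

def gradient_loss(pred, gt):                                # Eq. (6)
    gx = (pred[..., :, 1:] - pred[..., :, :-1]) - (gt[..., :, 1:] - gt[..., :, :-1])
    gy = (pred[..., 1:, :] - pred[..., :-1, :]) - (gt[..., 1:, :] - gt[..., :-1, :])
    return gx.norm() + gy.norm()

def semantic_loss(pred, gt, box, guid,                      # Eq. (7); guid from Eq. (3)
                  alpha=1.0, beta=0.3, gamma=0.7, lam=1.0):
    return (alpha * guid
            + beta * lateral_dependency_loss(pred, gt)
            + gamma * attention_emphasis_loss(pred, gt, box)
            + lam * gradient_loss(pred, gt))
```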
4 Experiments
In this section, we present comprehensive experiments to analyze and understand the behavior of our model. We first evaluate the model qualitatively and quantitatively for video interpolation and extrapolation. Then we show further capabilities of PMSN on various datasets. Finally, we analyze the effectiveness of the different components of PMSN separately.

Datasets: We train our network on 153,971 triplet video sequences sampled from the UCF-101 [24] dataset, and test performance on the UCF-101 (validation), HMDB51 [8], and YouTube-8M [1] datasets.

Training Details: We adopt the Adam [7] solver to learn the model parameters by optimizing the semantic loss. The batch size is set to 32, and the initial learning rate is 0.005, decayed every 50K steps. We train the model for 100K iterations. The source code will be released in the future. A sketch of this setup is given after the baselines below.

Baselines: We divide existing video synthesis methods into three categories for comparison. (1) Interpolation-only: Phase-based frame interpolation [16] is a traditional, well-performing method for video interpolation only; Ada-Conv [19] also addresses only the video frame interpolation task, via adaptive convolution operations. (2) Extrapolation-only: PredNet [13] is a predictive-coding inspired CNN architecture; MCNet [25] predicts frames by decomposing motion and content; FSTN [14] and Dual-Motion GAN [11] both address only video extrapolation, and since their authors do not release pre-trained weights or training details, we compare only against the PSNR and SSIM reported in their papers. (3) Interpolation-plus-extrapolation: EpicFlow [22] is a state-of-the-art approach for optical flow estimation, whose synthesized frames are constructed by pixel compensation. Among CNN-based methods, BeyondMSE [15] is a multi-scale architecture; the official model is trained with 4 and 8 input frames, and since our method uses 2 input frames, we implement BeyondMSE with 2 input frames for comparison. U-Net [23], a well-established structure for pixel-level generation, is also implemented for comparison. Deep Voxel Flow (DVF) [12] trains a deep network that learns to synthesize video frames by flowing pixel values from existing frames.
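The training configuration reported above can be summarized by the following hedged sketch (Adam, batch size 32, initial learning rate 0.005 decayed every 50K steps, 100K iterations). The decay factor of 0.1, the data-loader interface, and the assumption that the model returns per-level outputs for the guidance loss are illustrative choices, not details taken from the paper.

```python
# Hedged sketch of the reported training setup.
import torch

def train_pmsn(model, loader, semantic_loss_fn, guidance_loss_fn, steps=100_000):
    opt = torch.optim.Adam(model.parameters(), lr=0.005)
    # Decay the learning rate every 50K steps; the factor of 0.1 is assumed.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50_000, gamma=0.1)
    data = iter(loader)
    for step in range(steps):
        try:
            inputs, gt, box = next(data)    # triple-frame input, ground truth, attention box
        except StopIteration:
            data = iter(loader)
            inputs, gt, box = next(data)
        pred, level_outputs = model(inputs)           # final frame plus per-level outputs
        guid = guidance_loss_fn(level_outputs, gt)    # Eq. (3)
        loss = semantic_loss_fn(pred, gt, box, guid)  # Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    return model
```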
Table 1. Performance of frame synthesis on the UCF-101 validation dataset.

Methods              | Interpolation (PSNR / SSIM) | Extrapolation (PSNR / SSIM)
PredNet [13]         | — / —                       | 22.6 / 0.74
Phase-based [16]     | 28.4 / 0.84                 | — / —
BeyondMSE [15]       | 28.8 / 0.90                 | 28.2 / 0.89
EpicFlow [22]        | 30.2 / 0.93                 | 29.1 / 0.91
U-Net [23]           | 30.2 / 0.92                 | 29.2 / 0.92
FSTN [14]            | — / —                       | 27.6 / 0.91
MCNet [25]           | — / —                       | 28.8 / 0.92
Deep Voxel Flow [12] | 30.9 / 0.94                 | 29.6 / 0.92
Dual-Motion GAN [11] | — / —                       | 30.5 / 0.94
Ada-Conv [19]        | 32.6 / 0.95                 | — / —
Ours                 | 33.1 / 0.96                 | 31.3 / 0.94
4.1 Quantitative and Qualitative Comparison
For quantitative comparison, we use both the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index to evaluate the quality of the interpolated/extrapolated frames; higher values of PSNR and SSIM indicate better results. For qualitative comparison, our approach is compared with several recent state-of-the-art methods in Fig. 6 and Fig. 7.
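For reference, PSNR can be computed as below; SSIM is typically taken from an off-the-shelf implementation such as skimage.metrics.structural_similarity. The 8-bit dynamic range is an assumption.

```python
# Simple PSNR computation used for quantitative evaluation.
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float('inf') if mse == 0.0 else 10.0 * np.log10(max_val ** 2 / mse)
```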
Single-frame Synthesis As shown in Table 1, our solution outperforms all existing solutions. Compared with the best existing interpolation-only solution, Ada-Conv, and the best extrapolation-only solution, Dual-Motion GAN, we achieve over 0.5 dB and 0.8 dB PSNR improvement, respectively. Compared with the best existing interpolation-plus-extrapolation scheme, Deep Voxel Flow, we achieve over 2.2 dB and 1.7 dB PSNR improvement for interpolation and extrapolation, respectively. We also show subjective results for perceptual comparison. As illustrated in Fig. 6 and Fig. 7, our PMSN produces better perceptual quality than existing solutions, with clearer and more complete objects, a non-blurred background, and more accurate motion prediction. For example, Ada-Conv generates strong distortion and loses part of the object in the bottom-right "leg" area due to failed motion prediction, while our PMSN produces much better perceptual quality without obvious artifacts.
Fig. 6. Qualitative comparisons of video interpolation.

Fig. 7. Qualitative comparisons of video extrapolation.

Multi-frame Synthesis We further explore the multi-frame synthesis ability of our PMSN on various datasets, which can be used to up-sample the video frame rate and to generate videos with a slow-motion effect. The qualitative results in Fig. 8(a) show reasonable motion and realistic texture, and as demonstrated in Fig. 8(b), PMSN provides outstanding performance compared with other state-of-the-art methods.
Fig. 8. (a) Three-frame interpolation. (b) Performance comparison on three-frame interpolation.
4.2 Generalization Ability
Furthermore, we show the generalization ability of our PMSN by evaluating the model on the YouTube-8M and HMDB-51 validation datasets without re-training. Table 2 shows that our model outperforms all previous state-of-the-art models by an even larger margin than in Table 1 (over 1.2 dB PSNR improvement on both datasets for both interpolation and extrapolation), which indicates that our PMSN has much better generalization ability.
Table 2. Performance of frame synthesis on the YouTube-8M and HMDB-51 validation datasets (each cell reports YouTube-8M / HMDB-51).

Methods     | Interp. PSNR | Interp. SSIM | Extrap. PSNR | Extrap. SSIM
PredNet     | —            | —            | 19.7 / 18.4  | 0.65 / 0.59
Phase-based | 21.0 / 21.7  | 0.66 / 0.68  | —            | —
U-Net       | 24.2 / 23.8  | 0.73 / 0.72  | 22.7 / 22.4  | 0.70 / 0.70
BeyondMSE   | 26.6 / 26.8  | 0.78 / 0.80  | 25.7 / 26.1  | 0.74 / 0.76
MCNet       | —            | —            | 26.9 / 27.9  | 0.79 / 0.81
EpicFlow    | 29.5 / 29.5  | 0.92 / 0.92  | 29.2 / 29.3  | 0.90 / 0.92
Ada-Conv    | 29.5 / 29.6  | 0.93 / 0.92  | —            | —
Ours        | 31.1 / 31.4  | 0.94 / 0.92  | 30.4 / 30.7  | 0.94 / 0.93
4.3 Ablation Study
Effectiveness of the Coarse-to-fine Guidance Scheme: We first visualize the output of each deconvolutional block in the decoder stage, which shows how the results gradually improve as MAFs of different resolutions are injected through the coarse-to-fine guidance scheme. As shown in the gray-scale images of Fig. 9 (a), the texture details of the image are enhanced progressively, and the texture of the object becomes increasingly realistic.

In addition, as mentioned in Sec. 3.1, the frame augmentation scheme can be replaced by any other frame synthesis solution. When we replace our basic MAF generation with the more complex adaptive convolution of [19] in the video frame interpolation experiment, PMSN obtains an extra 0.5 dB PSNR gain. Overall, these ablation studies demonstrate that the proposed coarse-to-fine guidance scheme is effective at further improving synthesis quality.
Effectiveness of MAFs: As shown in Fig. 9 (b), the raw MAFs alone are unsatisfactory, exhibiting a certain degree of blocking artifacts and uneven motion, while the results produced without MAFs show significant blur artifacts. This demonstrates that MAFs provide informative motion tendency and texture details for synthesis.
Fig. 9. (a) Output of each layer in the decoder stage. (b) Interpolation example.
Effectiveness of the Semantic Loss: The semantic loss $\ell_{sem}$ comprises the Guidance Loss $\ell_{guid}$, Lateral Dependency Loss $\ell_{ld}$, Attention Emphasis Loss $\ell_{emph}$, and Gradient Loss $\ell_{grad}$. To evaluate the contribution of each loss, we implement four related baselines for comparison. As shown in Table 3, $\ell_{ld}+\ell_{guid}$, $\ell_{ld}+\ell_{emph}$, and $\ell_{ld}+\ell_{grad}$ all perform better than the basic $\ell_{ld}$, which means that the Guidance, Attention Emphasis, and Gradient Losses each lead to better performance; combining them further improves the overall results.

Table 3. Performance of hybrid losses.

Methods                 | Interpolation (PSNR / SSIM) | Extrapolation (PSNR / SSIM)
$\ell_{ld}$             | 31.9 / 0.92                 | 30.5 / 0.90
$\ell_{ld}+\ell_{guid}$ | 32.4 / 0.93                 | 30.6 / 0.90
$\ell_{ld}+\ell_{grad}$ | 32.8 / 0.95                 | 30.9 / 0.91
$\ell_{ld}+\ell_{emph}$ | 32.9 / 0.95                 | 31.0 / 0.93
$\ell_{sem}$            | 33.1 / 0.96                 | 31.3 / 0.94
5 Conclusions
In order to address the problems of both traditional synthesis frameworks based on pixel motion estimation and learning-based solutions, we combine the advantages of the two approaches in the proposed Progressive Motion-texture Synthesis Network (PMSN) framework. Based on the augmented input, the network obtains informative motion tendency and enhances the texture details of synthesized video frames through the well-designed coarse-to-fine guidance scheme. During training, a brain-inspired semantic loss is introduced to further refine the motion and texture of objects. We perform comprehensive experiments to verify the effectiveness of PMSN. In the future, we expect to extend PMSN to other types of tasks, such as video tracking and video question answering.
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0801001, the National Program on Key Basic Research Projects (973 Program) under Grant 2015CB351803, and NSFC under Grants 61571413, 61632001, and 61390514.
References
1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijaya-
narasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint
arXiv:1609.08675 (2016)
2. Choudhary, S., Varshney, P.: A study of digital video compression techniques. PARIPEX-
Indian Journal of Research 5(4) (2016)
3. De Weerd, P., Gattass, R., Desimone, R., Ungerleider, L.G.: Responses of cells in monkey
visual cortex during perceptual filling-in of an artificial scotoma. Nature p. 731 (1995)
4. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through
video prediction. In: NIPS. pp. 64–72 (2016)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR.
pp. 770–778 (2016)
6. Huang, X., Paradiso, M.A.: V1 response timing and surface filling-in. Journal of neurophys-
iology 100(1), 539–547 (2008)
7. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kuehne, H., Jhuang, H., Stiefelhagen, R., Serre, T.: Hmdb51: A large video database for
human motion recognition. In: High Performance Computing in Science and Engineering
12, pp. 571–582. Springer (2013)
9. Li, S., Yeung, D.Y.: Visual object tracking for unmanned aerial vehicles: A benchmark and
new motion models. In: AAAI. pp. 4140–4146 (2017)
10. Li, W., Cosker, D.: Video interpolation using optical flow and laplacian smoothness. Neuro-
computing 220, 236–243 (2017)
11. Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion gan for future-flow embedded video
prediction. ICCV (2017)
12. Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel
flow. In: ICCV. vol. 2 (2017)
13. Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and
unsupervised learning. ICLR (2017)
14. Lu, C., Hirsch, M., Schölkopf, B.: Flexible spatio-temporal networks for video prediction. In: CVPR. pp. 6523–6531 (2017)
15. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square
error. ICLR (2016)
16. Meyer, S., Wang, O., Zimmer, H., Grosse, M., Sorkine-Hornung, A.: Phase-based frame
interpolation for video. In: CVPR. pp. 1410–1418 (2015)
17. Michalski, V., Memisevic, R., Konda, K.: Modeling deep temporal dependencies with recurrent "grammar cells". In: NIPS. pp. 1925–1933 (2014)
18. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In:
Proceedings of the 27th international conference on machine learning (ICML-10) (2010)
19. Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: CVPR.
vol. 2, p. 6 (2017)
20. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
region proposal networks. In: NIPS. pp. 91–99 (2015)
21. Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical
flow estimation. In: AAAI. pp. 1495–1501 (2017)
22. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpo-
lation of correspondences for optical flow. In: CVPR. pp. 1164–1172 (2015)
23. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image
segmentation. In: International Conference on Medical Image Computing and Computer-
Assisted Intervention. pp. 234–241. Springer (2015)
24. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from
videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
25. Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for
natural video sequence prediction. ICLR 1(2), 7 (2017)
26. Wang, Y., Long, M., Wang, J., Gao, Z., Philip, S.Y.: Predrnn: Recurrent neural networks for
predictive learning using spatiotemporal lstms. In: NIPS. pp. 879–888 (2017)
27. Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame syn-
thesis via cross convolutional networks. In: NIPS. pp. 91–99 (2016)
28. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint
arXiv:1511.07122 (2015)