Temporal Kernel Consistency for Blind Video Super-Resolution
Lichuan Xiang1, Royson Lee2*, Mohamed S. Abdelfattah3, Nicholas D. Lane2,3, Hongkai Wen1,3
1University of Warwick  2University of Cambridge  3Samsung AI Center, Cambridge
l.xiang.2@warwick.ac.uk
*Equal contributions.
Abstract
Deep learning-based blind super-resolution (SR) methods have recently achieved unprecedented performance in upscaling frames with unknown degradation. These models are able to accurately estimate the unknown downscaling kernel from a given low-resolution (LR) image in order to leverage the kernel during restoration. Although these approaches have largely been successful, they are predominantly image-based and therefore do not exploit the temporal properties of the kernels across multiple video frames. In this paper, we investigated the temporal properties of the kernels and highlighted their importance in the task of blind video super-resolution. Specifically, we measured the kernel temporal consistency of real-world videos and illustrated how the estimated kernels can change per frame in videos of varying scene and object dynamicity. With this new insight, we revisited previous popular video SR approaches and showed that the previous assumption of using a fixed kernel throughout the restoration process can lead to visual artifacts when upscaling real-world videos. To counteract this, we tailored existing single-image and video SR techniques to leverage kernel consistency during both kernel estimation and video upscaling. Extensive experiments on synthetic and real-world videos show substantial restoration gains, both quantitatively and qualitatively, achieving a new state-of-the-art in blind video SR and underlining the potential of exploiting kernel temporal consistency.
1. Introduction
Super-resolution (SR) is an ill-posed problem that assumes the low-resolution (LR) image is derived from a high-resolution (HR) image; the field has recently been dominated by deep learning due to its unprecedented performance [6]. In order to better restore the high-frequency details, state-of-the-art video SR methods [36, 37, 39] exploit temporal frame information by employing a multi-frame SR (MFSR) approach. Specifically, each supporting frame is aligned with its reference frame through motion compensation before the information in these frames is merged for upscaling.
Most of these methods, however, assume that the degradation process, i.e. applying the blur kernel and the downscaling operation, is pre-defined. Therefore, the performance of these methods significantly deteriorates on real-world videos, as the downscaling kernel that is used for upscaling differs from the ground-truth kernel, a phenomenon known as the kernel mismatch problem [8]. Although there has been significant progress towards enabling the use of SR models in real-world applications, these solutions are predominantly image-based [3, 12, 23, 44]. The primary paradigm of these blind image-based solutions consists of either a two-step or an end-to-end process, starting with a kernel estimation module followed by an SR model that aims to maximize image quality given the estimated kernel and/or noise. Hence, when upscaling videos, these works do not utilize the temporal similarity between kernels and have to estimate kernels individually per frame. This is not only computationally expensive but also less effective, since estimating kernels independently per frame may result in inaccurate kernels, as shown in Sec. 4 and Sec. 5, and thus kernel mismatch.
Recent blind MFSR approaches, on the other hand, utilized a fixed kernel to upscale every frame [21, 30] in the same video; we hypothesize that this fixed-kernel assumption can also lead to kernel mismatch. Therefore, in this work, we attempt to investigate and answer the following questions: how does the kernel change temporally in real-world videos, and how can we leverage this change in the video restoration process?
Towards our goal, we first investigated the temporal differences in kernels in Sec. 4. In particular, we applied a recent image-based kernel estimation approach, KernelGAN [3], to frames of real-world videos and observed that videos of varying dynamicity, such as scene changes and object motion blur, can exhibit corresponding variations in their
downsampling kernels. We then show how videos of different dynamicity can affect the temporal consistency of their downscaling kernels. From this perspective, we re-evaluated previous MFSR approaches on real-world videos in Sec. 5. Through our experiments, we show that the common assumption of using a fixed downsampling kernel for multi-frame approaches can lead to the kernel mismatch problem, resulting in inaccurate motion compensation and hence inferior restoration results. To counteract these drawbacks, we tailored these existing techniques to exploit our new insight on kernel temporal consistency in Sec. 6, leading to substantial gains compared to the state-of-the-art. In summary, the main contributions of this work are:
• To the best of our knowledge, we are the first to investigate the temporal consistency of kernels in real-world videos for deep blind video SR.
• We present the limitations and drawbacks of using a fixed kernel, a scenario that is commonly assumed, for multi-frame SR approaches.
• Through tailored alterations to existing SR approaches, we underline the potential of exploiting kernel temporal consistency for accurate kernel estimation and motion compensation, resulting in considerable performance gains in video restoration.
2. Related Work
Single-Image Blind Super-Resolution. Previous deep learning-based image SR approaches [6, 7, 31, 19, 33, 1] assumed a fixed and ideal downsampling degradation process, often bicubic interpolation, leading to poor performance when applied to real-world images. As a result, most blind SR approaches focused on estimating the downsampling kernel and/or utilizing it for upsampling. Efrat et al. [8] first highlighted the kernel mismatch problem: using an incorrect kernel during restoration had a significant impact on performance regardless of the choice of image prior. Towards accurate downsampling kernel estimation, Michaeli et al. [26] exploited the inherent recurrence property of image patches and proposed an iterative algorithm to derive the kernel that maximizes the similarity of recurring patches across scales of the LR image. Bell-Kligler et al. [3] adopted a GAN approach [10], in which the generator learnt an estimated kernel with which to downscale the input image, and the discriminator learnt to differentiate between the patch distribution of the input image and that of its downscaled variant. The downsampling kernel can also be learned using CNNs by enforcing that the super-resolved image maps back to the LR image [13] or by using a paired real-world image dataset [4]. Exploiting the kernel mismatch phenomenon, Gu et al. [12] and Luo et al. [23] alternately estimated the kernel from the approximated super-resolved image and restored the image using the estimated kernel, reaching the current state-of-the-art.
Multi-Frame Super-Resolution. MFSR approaches focus on utilizing temporal information from the LR frames by aligning and fusing them, through CNNs or RNNs, in order to further boost restoration performance. Earlier works [20, 17] performed motion compensation by estimating optical flow using traditional off-the-shelf motion estimation algorithms [2]. As the accuracy of motion estimation directly affects the reconstruction quality of the super-resolved images, these traditional motion estimation methods were superseded by more accurate CNN-based networks such as spatial transformer networks [15] or task-specific motion estimation networks [14, 28, 34], leading to approaches [22, 25, 35, 40, 29] that focused on integrating motion estimation and SR networks for end-to-end learning. Recent works [16, 36, 37, 39] decoupled this dependency on motion estimation networks and performed motion compensation by adaptively aligning the reference and supporting frames through dynamically-generated filters or deformable convolutions [5, 45]. Although the majority of these works helped to elucidate the relationship between motion estimation and video restoration, they neglected the degradation process by assuming a fixed, known kernel. Therefore, unlike previous MFSR works that focused on incorporating temporal information in the frames, we also utilize the temporal information in the downscaling degradation operation in order to further boost restoration performance.
Towards blind MFSR, Pan et al. [30] used a kernel estimation network, consisting of two fully-connected layers, to learn a fixed blur kernel for inference. However, similar to Liu et al. [21], they assumed that the kernel is fixed at every timestamp, resulting in poor SR performance as shown in Sec. 5.
3. Problem Formulation
Multi-frame super-resolution (MFSR) uses a set of 2N supporting LR frames $\{y_{t-N}, \cdots, y_{t-1}, y_{t+1}, \cdots, y_{t+N}\}$ to upscale the reference LR frame $y_t$ at time $t$, utilizing temporal information across frames. The degradation process is usually expressed as follows:

$$y_{t+i} = \left( (F_{t \rightarrow t+i}\, x_t) \ast k_{t+i} \right) \downarrow_s +\, n_{t+i} \qquad (1)$$

where $y$ and $x$ are the LR and HR images respectively, $k$ is the blur kernel, $\downarrow_s$ is the downscaling operation (e.g. sub-sampling) with scaling factor $s$, $n$ is the additive noise, $i = -N, \cdots, N$, and $F$ is the warping matrix w.r.t. the optical flow applied to $x_t$. The image warping process can either be done explicitly via an optical flow or implicitly via dynamically-generated filters [16] or deformable convolutions [5]. The process of applying $k$ together with $\downarrow_s$ is also referred to as applying a downscaling kernel or SR kernel [3, 12]. Traditionally, a prior term is individually hand-crafted for $x_t$, $k_{t+i}$, and $F_{t \rightarrow t+i}$, but most deep learning-based approaches capture the prior [6] through CNNs by training on a large number of examples.
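For concreteness, the following is a minimal sketch of the per-frame degradation model in Eq. (1), assuming a single-channel HR frame and omitting the warping term F for brevity; the helper name `degrade_frame` and the tensor shapes are illustrative, not part of the paper's code.

```python
import torch
import torch.nn.functional as F

def degrade_frame(hr_frame: torch.Tensor, kernel: torch.Tensor,
                  scale: int = 4, noise_std: float = 0.0) -> torch.Tensor:
    """hr_frame: (1, 1, H, W); kernel: (kh, kw). Returns the LR frame y_{t+i}."""
    kh, kw = kernel.shape
    # Blur with the per-frame kernel k_{t+i} ...
    blurred = F.conv2d(hr_frame, kernel.view(1, 1, kh, kw),
                       padding=(kh // 2, kw // 2))
    # ... then subsample by the scale factor s ...
    lr = blurred[..., ::scale, ::scale]
    # ... and add (optional) additive noise n_{t+i}.
    return lr + noise_std * torch.randn_like(lr)
```

Sampling a different kernel per frame reproduces the per-frame degradation advocated in this work, while reusing one kernel for every frame gives the fixed-kernel setting analyzed in Sec. 5.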
In order to solve for $k$ and $x$, state-of-the-art blind image-based algorithms split the problem into two sub-problems, estimating $k$ and restoring $x$, and address each problem sequentially [3, 12] or alternately [41, 23]. MFSR solutions, on the other hand, include an additional sub-problem of estimating the motion between each supporting frame and its reference frame in order to perform motion compensation and hence leverage the temporal frame information during restoration. Although previous traditional video SR approaches [9, 24] assume that the kernel varies across frames, recent works [21, 30] assume a fixed kernel. In our work, we study and highlight the implications of both assumptions and advocate for the per-frame kernel approach, resulting in the following optimization problem:

$$\hat{x}_t = \arg\min_{x_t} \sum_{i=-N}^{N} \left\| y_{t+i} - \left( (F_{t \rightarrow t+i}\, x_t) \ast k_{t+i} \right) \downarrow_s \right\|$$
$$\hat{k}_t = \arg\min_{k_t} \left\| y_t - (x_t \ast k_t) \downarrow_s \right\|$$
$$\hat{F}_{t \rightarrow t+i} = \arg\min_{F_{t \rightarrow t+i}} \left\| y_{t+i} - \left( (F_{t \rightarrow t+i}\, x_t) \ast k_{t+i} \right) \downarrow_s \right\| \qquad (2)$$

where $\hat{x}$, $\hat{k}$, and $\hat{F}$ are the estimated HR image $x$, kernel $k$, and warping matrix $F$ respectively.
4. Kernels In Real-World Videos
In order to investigate the temporal kernel changes in real-world videos, we extracted a pool of kernel sequences from the Something-Something dataset [11], a real-world video prediction dataset. As ground-truth kernels do not exist for real-world videos, we applied the state-of-the-art image-based kernel extraction method, KernelGAN [3], to extract the sequences of kernels. Through these kernel sequences, we observed that the extracted SR kernels can often differ for each frame, while also exhibiting certain levels of temporal consistency, depending on the video's dynamicity.
Fig. 1 illustrates this phenomenon, in which we show the distributions of the magnitude of kernel changes in different video sequences. Specifically, we reshaped the extracted kernels for each frame and reduced them through principal component analysis (PCA). We then computed the sum of absolute differences between the kernel PCA components of adjacent frames and plotted this difference for videos of varying dynamicity (left and middle plot groups in Fig. 1).
[Figure 1 plot: boxplots of the difference of kernel PCAs between adjacent frames for videos with high kernel temporal consistency (left), videos with low kernel temporal consistency (middle), and random frames sampled from different videos (right).]
Figure 1: We quantify temporal kernel consistency by measuring the kernel PCA change between adjacent frames in real-world videos with high/low kernel temporal consistency. Random frames sampled from different videos at each timestamp serve as a baseline to highlight the temporal kernel consistency within the same video. Kernel changes are represented by solid dots, while boxplots show their distributions.
As a baseline, in comparison with an unrealistic real-world video without any temporal consistency, we sampled random frames from different random videos at each timestamp; their kernel PCA changes are represented by the right plot in Fig. 1. We observed that some videos' kernel differences, namely the left group of plots showing video sequences 13, 16, and 22 from the Something-Something dataset in Fig. 1, exhibit high temporal kernel consistency, with kernels that remain largely unchanged throughout. In contrast, the middle group of plots represents the kernel differences of videos with low temporal kernel consistency, namely video sequences 2, 23, and 25, in which kernel changes can be much more significant. Visually, Fig. 2 shows example frames from corresponding videos of high and low temporal kernel consistency. In particular, videos with high temporal kernel consistency depict slow and steady movements with no motion blur or scene changes, e.g. a video of a hand slowly reaching towards a cup or videos with almost identical frames at each time step. On the other hand, videos with low temporal kernel consistency have motion blur caused by rapid movements of the camera or object, e.g. large object motions caused by a man weaving a straw hat or placing a container upright, and shaky camera motions, as illustrated on the right of Fig. 2. Therefore, our experiments highlight that SR kernels in real-world videos are often non-uniform and can exhibit different levels of temporal consistency.
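As a rough illustration of the consistency measure used in Fig. 1, the sketch below reshapes per-frame kernels, reduces them with PCA, and sums the absolute differences between adjacent frames' PCA codes; the PCA dimensionality of 10 is an assumption rather than a value stated in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def kernel_pca_changes(kernels: np.ndarray, n_components: int = 10) -> np.ndarray:
    """kernels: (T, k, k) array of per-frame estimated kernels. Returns a (T-1,)
    array with the sum of absolute differences between adjacent kernel PCA codes."""
    flat = kernels.reshape(len(kernels), -1)            # flatten each kernel to a vector
    codes = PCA(n_components=n_components).fit_transform(flat)
    return np.abs(np.diff(codes, axis=0)).sum(axis=1)   # change per adjacent-frame pair
```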
5. Kernel Mismatch in Previous MFSR
In order to highlight the importance of incorporating temporal kernel consistency in blind video restoration, we looked into the limitations and drawbacks of both previous and recent multi-frame super-resolution (MFSR) approaches [20, 17, 22, 25, 35, 40, 30].
Figure 2: Example frames from videos of high and low kernel temporal consistency, as shown in Fig. 1: sequences 13, 16, and 22 (high, left) and sequences 2, 23, and 25 (low, right). Videos of low kernel temporal consistency (right) contain a higher proportion of video dynamicity compared to videos of high kernel temporal consistency (left).
Figure 3: Examples of consecutive frames in real-world videos upscaled using a fixed kernel (top row) and different per-frame kernels (bottom row). More examples can be found in the supplementary material (s.m. Fig. 3).
Specifically, these works assumed that either a fixed degradation operation is used for all videos or a fixed SR kernel is used to degrade all frames in each video; these assumptions do not hold for real-world videos, as shown in Sec. 4. Consequently, these works suffer from the kernel mismatch phenomenon [8] when they are used to upscale real-world videos.
Impact on Frame Upscaling. Previous MFSR works exploit temporal frame information using a fixed SR kernel. We first show with a naive approach that using a single kernel to restore every frame, even without utilizing temporal frame information, is detrimental to the performance of frame upscaling in the video restoration process. Towards this goal, we independently computed a kernel per frame for videos taken from the Something-Something dataset using KernelGAN and restored these frames using ZSSR [32]. We compared this per-frame kernel approach with a single-kernel approach in which we only estimated the SR kernel on the first frame of the video and used that same kernel to restore all subsequent frames. Fig. 3 shows the qualitative difference between the two experiments: using a fixed kernel indeed resulted in more severe visual artifacts and unnatural textures. All experiment details and more examples can be found in the supplementary material (s.m. Sec. 1 & Fig. 3).
Impact on Motion Compensation. We then show that the fixed-kernel assumption further aggravates MFSR approaches. The premise of these approaches is to utilize temporal frame information in order to boost restoration performance. To this end, previous MFSR works used motion compensation to warp each supporting frame to its reference frame before fusing these frames together for upscaling. As mentioned in Sec. 3, the optical flow used for warping is either estimated explicitly using traditional or deep motion-estimation techniques or implicitly using adaptive filters or deformable convolutions.
In order to visualize the impact of kernel mismatch on motion compensation for real-world videos, we consider two sets of videos: one from LR sequences of the original REDS dataset [27], which are degraded using a fixed kernel, and the other from our REDS10 testing sequence (details discussed in Sec. 6.1), which is generated using different per-frame kernels and thus better resembles the degradation characteristics of real-world videos.
We then used an explicit deep motion estimation model, as is common in previous MFSR approaches [22, 25, 35, 40], to compute the optical flow. Specifically, we adopt PWCNet [34] to estimate the optical flow in our experiment. The optical flow is then used to warp each supporting frame, and the results are shown in Fig. 4 for both the fixed and per-frame degradation video sets. We observe that motion compensation performs better on the fixed-degradation video set, benefiting the previous approaches that were specifically designed under the fixed-kernel assumption. On the other hand, due to the kernel dynamicity in real-world videos, the warped supporting frames of those approaches often suffer from kernel mismatch when dealing with videos of varying kernels, as shown in Fig. 4 (bottom row). We further show that this phenomenon is also observed with implicit motion compensation, and that the errors incurred from inaccurate motion compensation can propagate throughout the restoration process, as discussed in Sec. 6.2.
Figure 4: Example frames aligned with their reference frame at time t. The motion compensation module in current MFSR approaches performs better when a fixed SR kernel is assumed at every timestamp (top row), which however does not hold for real-world videos. For videos with varying kernels per frame, the aligned frames are oversmoothed and blurred (e.g. see the frames at t-1 in both examples) due to kernel mismatch (bottom row). Zoom in for best results. More examples are provided in the supplementary material (s.m. Fig. 4).
6. Exploiting Temporal Kernel Consistency
We hypothesize that by using temporal kernel consistency, we can mitigate the limitations highlighted in Sec. 5. Towards understanding the impact of doing so, we first adopted the state-of-the-art blind image-based SR algorithm, DAN [23], and incorporated MFSR modules from EDVR [37] for temporal alignment through implicit motion compensation, fusion, and video restoration. We then tailored these approaches to exploit temporal kernel consistency and analyzed the benefits and performance impact through an ablation study.
6.1. Experiment Setup
Models. DAN [23] is an end-to-end learning approach that estimates the kernel k and restores the image x alternately. The key idea, as shown in black in Fig. 5, is to have two convolutional modules: 1) a restorer that reconstructs x given the LR image y and the PCA of k; and 2) an estimator that learns the PCA of k based on y and the resulting super-resolved image $\hat{x}$. The basic block for both components is the conditional residual block (CRB), which concatenates the basic and conditional inputs channel-wise and then exploits the inter-dependencies among feature maps through a channel attention layer [43]. The alternating algorithm executes both components iteratively, starting from an initial Dirac kernel, and results in the following expression:

$$x^{(j+1)} = \arg\min_{x} \left\| y - (x \ast k^{(j)}) \downarrow_s \right\|_1$$
$$k^{(j+1)} = \arg\min_{k} \left\| y - (x^{(j+1)} \ast k) \downarrow_s \right\|_1 \qquad (3)$$

where $j$ denotes the iteration round, $j \in [1, J]$. Both components are trained using the sum of absolute differences (L1 loss) between $k$ and $\hat{k}$, and between $x$ and the $\hat{x}$ estimated at the last iteration.
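A highly simplified sketch of this alternating scheme is given below, assuming `restorer` and `estimator` are the two convolutional modules described above; their CRB internals are omitted, and the Dirac initialisation in PCA space is a placeholder rather than the paper's exact encoding.

```python
import torch

def alternating_sr(lr, restorer, estimator, pca_dim=10, iters=4):
    """lr: (B, C, H, W) LR frame. Returns the final SR frame and kernel PCA code."""
    b = lr.size(0)
    # Start from an initial kernel code; a Dirac kernel is assumed here to map to
    # a simple one-hot-like code (placeholder, not DAN's actual initialisation).
    kernel_code = torch.zeros(b, pca_dim, device=lr.device)
    kernel_code[:, 0] = 1.0
    for _ in range(iters):                       # J alternating iterations
        sr = restorer(lr, kernel_code)           # x^(j+1): restore given the current kernel
        kernel_code = estimator(lr, sr)          # k^(j+1): re-estimate the kernel given SR
    return sr, kernel_code
```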
Figure 5: Our experiment setup of utilizing multiple frames for temporal kernel estimation (shown in black) and using temporal kernels for multi-frame restoration (shown in blue). See text for details and the supplementary material (s.m. Fig. 1) for a more detailed architecture diagram.
For multi-frame experiments, as shown in blue in Fig. 5, we used the LR feature maps from the last restorer iteration before upsampling and adopted EDVR's PCD module, TSA module, and restoration module for temporal alignment, fusion, and video restoration respectively. In other words, we merged kernel estimation and blind image restoration techniques with MFSR motion compensation methods, and made alterations so that these modules could utilize temporal kernel consistency. Further details of these modules and the architecture can be found in the supplementary material (s.m. Sec. 2 & s.m. Fig. 1).
Training Data. We combined the REDS [27] training and validation sets and randomly sampled 250 sequences for training and 10 for testing. Following [3], we generated anisotropic Gaussian kernels with a size of 13×13. The lengths of both axes were uniformly sampled in (0.6, 5), and the kernel was then rotated by a random angle uniformly distributed in [-π, π]. For real-world videos, we further added uniform multiplicative noise, up to 25% of each pixel value of the kernel, to the generated noise-free kernel, and normalized it to sum to one. Each frame of each HR video was degraded with a randomly generated kernel and then downsampled using bicubic interpolation to form the synthetic LR videos. Following previous works [42, 12, 23], we reshaped the kernels and reduced them through principal component analysis (PCA) before feeding them into the network. We adopted this frame-wise synthesis approach for two reasons: 1) to the best of our knowledge, there is no video dataset with real-world kernels available, and extracting a large number of kernel sequences from video benchmarks for training is costly; 2) the synthetic training kernels generated as described above can create various degradations in individual frames, and are thus able to model real-world videos with varying levels of kernel temporal consistency.
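A minimal sketch of the kernel synthesis described above (13×13 anisotropic Gaussian, axis lengths sampled in (0.6, 5), random rotation in [-π, π], up to 25% multiplicative noise, normalised to sum to one) could look as follows; the exact parameterisation may differ from the authors' code.

```python
import numpy as np

def random_anisotropic_kernel(size=13, noise_level=0.25, rng=np.random):
    l1, l2 = rng.uniform(0.6, 5.0, size=2)           # axis lengths (treated as variances here)
    theta = rng.uniform(-np.pi, np.pi)               # random rotation angle
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    sigma = R @ np.diag([l1, l2]) @ R.T              # rotated covariance matrix
    inv_sigma = np.linalg.inv(sigma)
    r = size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    coords = np.stack([xs, ys], axis=-1)             # (size, size, 2) grid of offsets
    k = np.exp(-0.5 * np.einsum('hwi,ij,hwj->hw', coords, inv_sigma, coords))
    k *= 1.0 + noise_level * rng.uniform(-1, 1, k.shape)   # uniform multiplicative noise
    return k / k.sum()                               # normalise to sum to one
```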
Testing Data. We created our testing set with 10 sequences from the REDS testing set (000 and 010-018), denoted as REDS10, aiming to mimic the actual degradation of real-world videos of varying dynamicity. Concretely, following our experiments in Sec. 4, we first sampled videos from the Something-Something dataset [11]¹. The sequences from the Something-Something dataset were randomly sampled such that their estimated kernels had differing temporal kernel consistency. These kernels were then used to degrade our test set to mimic the degradation characteristics of real-world videos. We then randomly sampled a sequence from these estimated real-world kernel sequences and used it to downsample each selected video in REDS10². As a result, our testing set has similar degradation characteristics to real-world videos, while allowing us to perform quantitative evaluations. The kernel temporal consistency of this test set can be found in the supplementary material (s.m. Fig. 2). For real-world video evaluations, we used videos from the Something-Something dataset. All implementation details can be found in the supplementary material (s.m. Sec. 3).
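To make the test-set construction concrete, the sketch below blurs frame i with kernel (i mod len(kernels)), looping over the kernel sequence as described in footnote 2, and then bicubically downscales each frame; it illustrates the procedure under these assumptions and is not the released code.

```python
import numpy as np
import cv2
from scipy.ndimage import convolve

def degrade_video(hr_frames, kernels, scale=4):
    """hr_frames: list of (H, W) float32 arrays; kernels: list of (k, k) arrays."""
    lr_frames = []
    for i, frame in enumerate(hr_frames):
        k = kernels[i % len(kernels)]                        # loop over the kernel sequence
        blurred = convolve(frame, k, mode="reflect")         # per-frame blur
        h, w = frame.shape
        lr = cv2.resize(blurred, (w // scale, h // scale),
                        interpolation=cv2.INTER_CUBIC)       # bicubic downscaling
        lr_frames.append(lr)
    return lr_frames
```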
6.2. Effectiveness of Temporal Kernel Consistency
Temporal Kernel Estimation. We first studied the effectiveness of taking multiple frames into account for kernel estimation. In other words, instead of estimating kernels individually for each frame, we leveraged our key insight that the downsampling kernels of frames within a video are temporally consistent to achieve faster and more accurate kernel estimation for videos. To this end, we modified the estimator to take in multiple LR frames, $\{y_{t+i}\}_{i=-N}^{N}$, and generate their corresponding estimated kernels, $\{\hat{k}^{(j)}_{t+i}\}_{i=-N}^{N}$.
¹In particular, sequences 13, 16, 21, 35, 37, 49, 52, 55, 63, and 71.
²For cases in which the video is longer than the selected kernel sequence, we loop over the same kernel sequence for the remaining frames.
[Figure 6 plot: L1 distance of the estimated kernel PCA w.r.t. the ground truth for Est-1 + Res-1 (DAN), Est-3 + Res-1, and Est-5 + Res-1, per video sequence (000, 010-018) in our REDS10 test set.]
Figure 6: Distribution of kernel estimation errors of different estimators for each video sequence in our test set. The single-frame estimator (Est-1 + Res-1) tends to perform worse than the multi-frame estimators (Est-3/5 + Res-1), exhibiting larger error variance and many outliers.
We then utilized the existing channel attention block in DAN by adopting an early fusion approach, which merges information at the beginning of the block, to exploit the inter-channel relationships not only between the basic and conditional inputs, but also among the temporal inputs. Specifically, the features of the HR frames are concatenated with the LR features in every CRB in order to leverage the existing structure of DAN's estimator without adding additional channels or layers, as shown in the supplementary material (s.m. Fig. 1).
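A minimal sketch of this early-fusion idea is given below: conditional (temporal) features are concatenated channel-wise with the basic features before a channel-attention block inside a residual block. The channel counts and the attention reduction ratio are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CALayer(nn.Module):
    """Channel attention (squeeze-and-excitation style), as in RCAN [43]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.attn(x)

class EarlyFusionCRB(nn.Module):
    """Conditional residual block fusing basic and conditional (temporal) features."""
    def __init__(self, basic_ch, cond_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(basic_ch + cond_ch, basic_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(basic_ch, basic_ch, 3, padding=1),
            CALayer(basic_ch))

    def forward(self, f_basic, f_cond):
        fused = torch.cat([f_basic, f_cond], dim=1)   # early fusion, channel-wise
        return f_basic + self.body(fused)             # residual connection
```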
We experimented with different numbers of input frames to the estimator, labelled as Est-α, where α is the number of frames used for kernel estimation. Likewise, we labelled β as the number of frames used for restoration, Res-β. For a fair comparison, here we used DAN's restorer, β = 1, which is single-frame and therefore does not include our adopted EDVR components. Fig. 6 shows the distribution of kernel estimation errors of the aforementioned models in terms of the absolute sum of the PCA difference between the estimated kernels and their respective ground-truth kernels for all frames in each sequence of REDS10. We observed that independent per-frame kernel estimation can lead to a larger variance and numerous outliers compared to temporal kernel estimation. Notably, temporal kernel estimation results in, on average, more accurate kernels for videos with high dynamicity, i.e. low kernel temporal consistency, while performing similarly for videos with high kernel temporal consistency. The performance increase in kernel estimation, however, did not significantly improve video restoration, as shown in Table 1.
Models                PSNR/SSIM
Est-1 + Res-1 (DAN)   26.28/0.7118
Est-3 + Res-1         26.30/0.7124
Est-5 + Res-1         26.31/0.7213
Est-1 + Res-3         26.37/0.7170
Est-1 + Res-5         26.54/0.7287
Est-3 + Res-3         26.62/0.7364
Est-5 + Res-5         26.76/0.7400

Table 1: Ablation study on the impact of utilizing temporal kernel estimation on video restoration, for both kernel estimation and motion compensation. Red denotes the best-performing model and blue the second best. Although estimating more accurate kernels did not significantly improve the performance of a single-image restorer, it is critical for motion compensation and hence benefits multi-frame restorers.
This phenomenon is also observed in recent blind iterative image SR works [12, 23], which reported that it was due to the restorer's robustness to the kernel estimation errors of the estimator, since the two are jointly trained. Although having a more accurate kernel estimate did not drastically impact single-frame video restoration performance, we show that it is essential for improving the performance of a multi-frame restoration approach.
Incorporating Temporal Kernels for MFSR. The performance gain from utilizing the temporal information of multiple frames depends on the accuracy of motion estimation; an inaccurate flow can result in misaligned frames after motion compensation and thus artifacts in the restored video [35, 40, 36, 16]. As shown in Sec. 5, performing motion compensation directly on real-world videos under the assumption of a fixed SR kernel can result in regular artifacts in the warped frames. To mitigate this, instead of following the convention of employing motion compensation on the LR frames or features directly, we performed motion compensation on the LR frames after considering their corresponding kernels. Specifically, we utilized the feature maps at the last restorer iteration, as shown in Fig. 5, which embed both the LR frame and the corresponding kernel features from the estimator, and then adopted EDVR for temporal alignment, fusion, and restoration as mentioned in Sec. 6.1. This approach mitigates the problem of inaccurate motion compensation caused by kernel variation in real-world videos, but the restoration performance may still depend on the accuracy of the estimated kernels; errors in kernel estimation would propagate and result in inaccurate motion compensation.
To verify this, we first ran our multi-frame restorer, β ∈ {3, 5}, with a single-frame estimator, α = 1, and compared it with running the multi-frame restorer together with the multi-frame estimator. The results are shown in Table 1. As expected, having a multi-frame restorer resulted in an improvement in video restoration similar to that of previous works [20, 17, 22, 25, 35]. However, these per-frame estimator MFSR models did not perform as well as their temporal estimator counterparts.
Proposed for   Models                 PSNR/SSIM
MFSR           TDAN [36]              25.93/0.6867
               EDVR [37]              26.21/0.7060
Blind SISR     IKC [12]               26.22/0.7021
               DAN [23]               26.28/0.7118
Blind MFSR     DBVSR [30]             26.11/0.6986
               Est-3 + Res-3 (Ours)   26.62/0.7364
               Est-5 + Res-5 (Ours)   26.76/0.7400

Table 2: We compare our model with state-of-the-art models from MFSR, which assume a fixed bicubic degradation, and blind single-image SR methods, which restore each frame independently.
In particular, although our per-frame estimator MFSR model utilized information from 5 frames (Est-1 + Res-5) to restore each frame, it did not outperform our temporal estimator MFSR model that only exploited information from 3 frames (Est-3 + Res-3). Hence, we can conclude that the kernel mismatch errors incurred during kernel estimation propagated through the implicit motion compensation module of EDVR, affecting temporal alignment, fusion, and thus restoration. In other words, more accurate kernels estimated through the temporal kernel estimator enable the multi-frame restorer to better leverage temporal frame information. Therefore, the interplay between accurate kernel estimation and motion compensation is the key to utilizing temporal kernel consistency for video restoration.
Comparisons with Previous Works. We compared our approach with existing works on both our REDS10 test set and real-world videos taken from the Something-Something dataset. Specifically, we considered state-of-the-art MFSR methods, TDAN [36] and EDVR [37], blind image-based SR methods, IKC [12] and DAN [23], and a recently proposed blind MFSR approach, DBVSR [30].
From the quantitative and qualitative comparisons on REDS10 (Table 2 & Fig. 7) and the real-world qualitative examples (Fig. 8), we observe that existing MFSR approaches fall short due to kernel mismatch, which affects both motion compensation and video restoration as shown in Sec. 5. Both TDAN and EDVR, in particular, were trained under the fixed bicubic degradation assumption, and DBVSR assumed a fixed, temporally uniform kernel. Blind SISR approaches, on the other hand, restore each frame independently and hence perform slightly better than existing MFSR approaches. Our approach, which exploits kernel temporal consistency for accurate kernel estimation and mitigates the effects of kernel mismatch on motion compensation, leads to a dominant solution for real-world video restoration. We provide additional examples in the supplementary material (s.m. Fig. 5 & Fig. 6).
Figure 7: Qualitative comparison among existing models, along with bicubic upscaling, on our benchmark test sequences.
Zoom in for best results.
Figure 8: Real-world qualitative comparisons among existing models, along with bicubic upscaling. Zoom in for best results.
Note that there is no ground-truth available.
7. Conclusion
In this paper, we presented the temporal kernel changes in videos and showed that they vary in their consistency depending on the video's dynamicity. Through our experiments, we highlighted the importance of estimating kernels per frame to tackle the effects of temporal kernel mismatch in previous works. We then showed how temporal kernel consistency can be generally incorporated into existing works through the interaction between kernel estimation and motion compensation, in order to leverage both temporal kernel and frame information for blind video SR. We hope to influence future blind video SR model design by emphasizing the potential of leveraging kernel temporal consistency in restoring videos.
References
[1] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast,
Accurate, and Lightweight Super-Resolution with Cascading
Residual Network. In ECCV, 2018. 2
[2] S. Baker, D. Scharstein, J. Lewis, S. Roth, Michael J. Black,
and R. Szeliski. A database and evaluation methodology
for optical flow. International Journal of Computer Vision,
2007. 2
[3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind
super-resolution kernel estimation using an internal-gan. In
Advances in Neural Information Processing Systems. 2019.
1,2,3,5,11,13
[4] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei
Zhang. Toward real-world single image super-resolution: A
new benchmark and a new model. IEEE/CVF International
Conference on Computer Vision (ICCV), 2019. 2
[5] Jifeng Dai, Haozhi Qi, Y. Xiong, Y. Li, Guodong Zhang, H.
Hu, and Y. Wei. Deformable convolutional networks. IEEE
International Conference on Computer Vision (ICCV), 2017.
2
[6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou
Tang. Image super-resolution using deep convolutional net-
works. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2016. 1,2,3
[7] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerat-
ing the Super-Resolution Convolutional Neural Network. In
ECCV, 2016. 2
[8] N. Efrat, Daniel Glasner, Alexander Apartsin, B. Nadler, and
A. Levin. Accurate blur models vs. image priors in single
image super-resolution. IEEE International Conference on
Computer Vision (ICCV), 2013. 1,2,4
[9] Sina Farsiu, M. D. Robinson, Michael Elad, and P. Milanfar.
Fast and robust multiframe super resolution. IEEE Transac-
tions on Image Processing, 2004. 3
[10] Ian J. Goodfellow, Jean Pouget-Abadie, M. Mirza, Bing Xu,
David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, 2014. 2
[11] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal-
ski, Joanna Materzynska, Susanne Westphal, Heuna Kim,
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz
Mueller-Freitag, et al. The Something Something Video
Database for Learning and Evaluating Visual Common
Sense. In ICCV, 2017. 3,6,13
[12] Jinjin Gu, Hannan Lu, W. Zuo, and C. Dong. Blind
super-resolution with iterative kernel correction. IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 1,2,3,6,7
[13] Yong Guo, Jian Chen, J. Wang, Q. Chen, Jiezhang Cao,
Zeshuai Deng, Yanwu Xu, and Mingkui Tan. Closed-loop
matters: Dual regression networks for single image super-
resolution. IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2020. 2
[14] Eddy Ilg, N. Mayer, Tonmoy Saikia, Margret Keuper, A.
Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of opti-
cal flow estimation with deep networks. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2017.
2
[15] Max Jaderberg, K. Simonyan, Andrew Zisserman, and K.
Kavukcuoglu. Spatial transformer networks. In Advances in
Neural Information Processing Systems, 2015. 2
[16] Younghyun Jo, S. Oh, Jaeyeon Kang, and S. Kim. Deep
video super-resolution network using dynamic upsampling
filters without explicit motion compensation. IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2018. 2,7
[17] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and A. Kat-
saggelos. Video super-resolution with convolutional neural
networks. IEEE Transactions on Computational Imaging,
2016. 2,4,7
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980,
2014. 11
[19] Royson Lee, L. Dudziak, M. Abdelfattah, Stylianos I. Ve-
nieris, H. Kim, Hongkai Wen, and N. Lane. Journey towards
tiny perceptual super-resolution. In European Conference on
Computer Vision (ECCV), 2020. 2
[20] Renjie Liao, X. Tao, R. Li, Z. Ma, and J. Jia. Video super-
resolution via deep draft-ensemble learning. IEEE Interna-
tional Conference on Computer Vision (ICCV), 2015. 2,4,
7
[21] Ce Liu and Deqing Sun. On bayesian adaptive video super
resolution. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 2014. 1,2,3
[22] Ding Liu, Zhaowen Wang, Yuchen Fan, X. Liu, Zhangyang
Wang, S. Chang, and T. Huang. Robust video super-
resolution with learned temporal dynamics. IEEE Interna-
tional Conference on Computer Vision (ICCV), 2017. 2,4,
7
[23] Zhengxiong Luo, Y. Huang, Shang Li, Liang Wang, and Tie-
niu Tan. Unfolding the alternating optimization for blind
super resolution. In Advances in Neural Information Pro-
cessing Systems. 2020. 1,2,3,5,6,7,11
[24] Z. Ma, Renjie Liao, X. Tao, L. Xu, J. Jia, and Enhua Wu.
Handling motion blur in multi-frame super-resolution. IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), 2015. 3
[25] Osama Makansi, Eddy Ilg, and T. Brox. End-to-end learn-
ing of video super-resolution with motion compensation. In
German Conference on Pattern Recognition (GCPR), 2017.
2,4,7
[26] T. Michaeli and M. Irani. Nonparametric blind super-
resolution. IEEE International Conference on Computer Vi-
sion (ICCV), 2013. 2
[27] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik
Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu
Lee. Ntire 2019 challenge on video deblurring and super-
resolution: Dataset and study. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) Work-
shops, 2019. 4,5
[28] A. Ranjan and Michael J. Black. Optical flow estimation
using a spatial pyramid network. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017. 2
[29] Mehdi S. M. Sajjadi, Raviteja Vemulapalli, and M. Brown.
Frame-recurrent video super-resolution. 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2018. 2
[30] Jinshan Pan, Songsheng Cheng, Jiawei Zhang, and J. Tang. Deep blind video super-resolution. ArXiv, 2020. 1, 2, 3, 4, 7
[31] W. Shi, J. Caballero, Ferenc Huszár, J. Totz, A. Aitken, R. Bishop, D. Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2
[32] Assaf Shocher, N. Cohen, and M. Irani. ”zero-shot” super-
resolution using deep internal learning. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), 2018. 4,11,13
[33] Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu,
and Yunhe Wang. Efficient residual dense block search for
image super-resolution. In Association for the Advancement
of Artificial Intelligence (AAAI), 2020. 2
[34] Deqing Sun, X. Yang, Ming-Yu Liu, and J. Kautz. Pwc-net:
Cnns for optical flow using pyramid, warping, and cost vol-
ume. IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), 2018. 2,4
[35] X. Tao, H. Gao, Renjie Liao, J. Wang, and J. Jia. Detail-
revealing deep video super-resolution. IEEE International
Conference on Computer Vision (ICCV), 2017. 2,4,7
[36] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu.
Tdan: Temporally-deformable alignment network for video
super-resolution. IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2020. 1,2,7
[37] Xintao Wang, Kelvin C. K. Chan, K. Yu, C. Dong, and
Chen Change Loy. Edvr: Video restoration with enhanced
deformable convolutional networks. IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops
(CVPRW), 2019. 1,2,5,7,11
[38] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P.
Simoncelli. Image quality assessment: from error visibility
to structural similarity. IEEE Transactions on Image Pro-
cessing, 2004. 11
[39] X. Xiang, Yapeng Tian, Yulun Zhang, Y. Fu, J. Allebach, and
Chenliang Xu. Zooming slow-mo: Fast and accurate one-
stage space-time video super-resolution. IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
2020. 1,2
[40] Tianfan Xue, B. Chen, Jiajun Wu, D. Wei, and W. Freeman.
Video enhancement with task-oriented flow. International
Journal of Computer Vision, 2018. 2,4,7
[41] K. Zhang, L. Gool, and R. Timofte. Deep unfolding network
for image super-resolution. IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2020. 3
[42] Kai Zhang, W. Zuo, and Lei Zhang. Learning a single convo-
lutional super-resolution network for multiple degradations.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2018. 6
[43] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng
Zhong, and Yun Fu. Image Super-Resolution Using Very
Deep Residual Channel Attention Networks. In European
Conference on Computer Vision (ECCV), 2018. 5,11
[44] Ruofan Zhou and S. Süsstrunk. Kernel modeling super-resolution on real low-resolution images. IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 1
[45] X. Zhu, H. Hu, Stephen Lin, and Jifeng Dai. Deformable
convnets v2: More deformable, better results. IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 2
A. Implementation Details of KernelGAN & ZSSR
We used the default settings and hyperparameters provided by KernelGAN [3] and ZSSR [32]. For KernelGAN, the estimated downscaling kernel size is set to 13×13 and the input image is cropped to 64×64 before kernel extraction. The kernel is extracted after 3000 iterations using the Adam [18] optimizer with the learning rate set to 0.0002, β1 set to 0.5, and β2 set to 0.999. For ZSSR, the input LR image and the provided estimated kernel are used to generate a downsampled variant of the LR image. The resulting image pair is then used to train the model with the Adam optimizer, starting with the learning rate set to 0.001, β1 = 0.9, and β2 = 0.999. For more details, please refer to the provided repository³.
³https://github.com/sefibk/KernelGAN
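The ZSSR-style training-pair construction described above can be sketched as follows: the LR input is degraded once more with the estimated kernel to form an (LR-child, LR) pair, on which an image-specific network is trained with the stated Adam settings. The placeholder network below is not the actual ZSSR architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_zssr_pair(lr: torch.Tensor, kernel: torch.Tensor, scale: int = 4):
    """lr: (1, C, H, W); kernel: (kh, kw). Returns (child, parent) training pair."""
    kh, kw = kernel.shape
    c = lr.size(1)
    weight = kernel.view(1, 1, kh, kw).repeat(c, 1, 1, 1)   # depthwise blur weights
    blurred = F.conv2d(lr, weight, padding=(kh // 2, kw // 2), groups=c)
    child = blurred[..., ::scale, ::scale]                  # further-degraded "LR of LR"
    return child, lr                                        # train to map child -> parent

# Placeholder image-specific network and the Adam settings stated above.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```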
B. Architecture of DAN & EDVR
A more detailed architecture of the model experimented with in Sec. 5 of the main paper is shown in Fig. A1. Notably, the features of the HR frames are concatenated with the LR features in each CRB block of DAN [23], and we utilized the existing channel attention layer (CALayer) for temporal kernel estimation. During the last iteration, the LR features, which were conditioned on the input frames and their estimated kernel, were fed into the temporal blocks of EDVR [37] for temporal alignment, fusion, and restoration. In particular, the PCD module follows a pyramid cascading structure, which concatenates features of differing spatial sizes and uses deformable convolutions at each pyramid level to align the features. The TSA module then fuses these aligned features through both temporal and spatial attention. Specifically, temporal attention maps are computed based on the aligned features and applied to these features through a dot product before concatenating and fusing them using a convolution layer. The fused features are then used to compute spatial attention maps, which are in turn applied to the features. For more details, please refer to EDVR [37].
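A simplified sketch of the temporal-attention step of such a TSA-style fusion module is given below: each aligned frame's features are correlated with the reference frame's embedding to form a per-pixel attention map, re-weighted, and then concatenated and fused with a 1×1 convolution. This is an illustration of the idea, not EDVR's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, channels, n_frames):
        super().__init__()
        self.embed_ref = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_nbr = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(n_frames * channels, channels, 1)

    def forward(self, aligned):                      # aligned: (B, N, C, H, W)
        b, n, c, h, w = aligned.shape
        ref = self.embed_ref(aligned[:, n // 2])     # centre frame as reference
        weighted = []
        for i in range(n):
            emb = self.embed_nbr(aligned[:, i])
            attn = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))  # dot-product map
            weighted.append(aligned[:, i] * attn)    # re-weight each frame's features
        return self.fuse(torch.cat(weighted, dim=1)) # concatenate and fuse
```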
C. Implementation Details of DAN & EDVR.
For training, we used a scaling factor of ×4, an input patch size of 100×100, and set N = 1, i.e. considering sequences of 3 frames. We set N = 2 to highlight the kernel mismatch in motion compensation, as shown in Fig. A4. The batch size was set to 4, and all models were trained for 300 epochs using the Adam optimizer [18] (β1 = 0.9 and β2 = 0.999). The initial learning rate was set to 1 × 10⁻⁴ and decayed by a factor of 0.5 every 200 epochs. Following DAN [23], we ran 4 iterations (J = 4) and used the L1 loss for both kernel estimation and video restoration across all our models in every iteration. When multiple frames were utilized for temporal alignment, we applied a scaling factor of 1/2N to weight the loss from the supporting frames. Following previous works [37, 43], PSNR and SSIM [38] were computed after converting each frame from RGB to the Y channel and trimming the edges by the scale factor. All experiments were run on NVIDIA 1080Ti and 2080Ti GPUs. The temporal kernel consistency of our test benchmark, REDS10, is shown in Fig. A2. Similarly to Fig. 1 in the main paper, we quantified temporal kernel change by measuring the sum of absolute differences between consecutive kernel PCAs. In particular, video sequences such as 016 and 018 have high temporal kernel consistency, while sequences such as 000 and 014 have low temporal kernel consistency.
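The evaluation protocol above can be sketched as follows: convert RGB to the Y channel (ITU-R BT.601 coefficients are assumed), crop a border equal to the scale factor, and compute PSNR; SSIM would be computed on the same cropped Y channels.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) float array in [0, 255]. Returns the luma (Y) channel."""
    return img @ np.array([65.481, 128.553, 24.966]) / 255.0 + 16.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, scale: int = 4) -> float:
    sr_y = rgb_to_y(sr)[scale:-scale, scale:-scale]   # trim edges by the scale factor
    hr_y = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```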
D. Additional Results
We provide additional results here due to space limitations in the main paper. Fig. A3 provides additional examples for Fig. 3 of the main paper, highlighting that using a fixed kernel to upscale all the frames in a video can result in inferior restoration outcomes compared to using a per-frame kernel, even without incorporating temporal frame information. Likewise, Fig. A4 shows additional examples for Fig. 4 of the main paper, underlining that the explicit motion compensation used in previous works for temporal alignment results in more errors when applied to real-world videos. Fig. A5 and Fig. A6 provide additional qualitative examples comparing our multi-frame SR model with previous multi-frame SR and blind image-based SR models on REDS10 and real-world videos respectively. Lastly, we provide a sample video along with this document.
[Figure A1 diagram: DAN's estimator and restorer built from CRBs (convolutional residual blocks) with channel attention (CALayer) and global average pooling (GAP); at the last iteration, the LR features are fed to EDVR's PCD (pyramid cascading deformable convolution) and TSA (temporal and spatial attention) modules, followed by the restoration module and pixel-shuffle upsampling to produce the final SR output.]
Figure A1: Detailed architecture of our model experimented with in Sec. 5 of the main paper.
[Figure A2 plot: difference of kernel PCAs between adjacent frames for each video (000, 010-018) in our REDS10 test set.]
Figure A2: Temporal kernel consistency of videos in our
REDS10 benchmark, measured by kernel PCA changes for
adjacent frames in the videos. Kernel changes are repre-
sented by solid dots while boxplots show distributions.
Figure A3: Additional Examples of consecutive frames in real-world videos taken from Something-Something [11] dataset
upscaled using a fixed kernel (top in each example), and a different per-frame kernel (bottom in each example). Kernels are
estimated using KernelGAN [3] and the frames are restored using ZSSR [32].
Figure A4: Additional examples of aligned frames at time steps t-2, t-1, t+1, and t+2 with their reference frame at time step t. The aligned frames are oversmoothed and blurred due to kernel mismatch for the per-frame kernels found in real-world videos. In comparison, using a fixed downsampling kernel at every time step, which does not hold for real-world videos, leads to better motion compensation. Zoom in for best results.
Figure A5: Qualitative comparison among existing models, along with bicubic upscaling, on our benchmark test sequences.
Zoom in for best results.
Figure A6: Real-world qualitative comparisons among existing models, along with bicubic upscaling. Zoom in for best
results. Note that there is no ground-truth available.