# NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

## Abstract

This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with a focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that improved efficiency, measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption, while at least maintaining a PSNR of 29.00dB on the DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks: the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The ranking of the teams was determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered, and the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all five metrics mentioned in the description of the challenge (runtime, parameter count, FLOPs, activations, and memory consumption) were considered. Similar to sub-track one, the rankings of the five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.
Yawei Li Kai Zhang Radu Timofte Luc Van Gool Fangyuan Kong
Mingxi Li Songwei Liu Zongcai Du Ding Liu Chenhui Zhou Jingyi Chen
Qingrui Han Zheyuan Li Yingqi Liu Xiangyu Chen Haoming Cai Yu Qiao
Chao Dong Long Sun Jinshan Pan Yi Zhu Zhikai Zong Xiaoxiao Liu
Zheng Hui Tao Yang Peiran Ren Xuansong Xie Xian-Sheng Hua Yanbo Wang
Xiaozhong Ji Chuming Lin Donghao Luo Ying Tai Chengjie Wang
Zhizhong Zhang Yuan Xie Shen Cheng Ziwei Luo Lei Yu Zhihong Wen
Qi Wu Youwei Li Haoqiang Fan Jian Sun Shuaicheng Liu Yuanfei Huang
Meiguang Jin Hua Huang Jing Liu Xinjian Zhang Yan Wang Lingshun Long
Gen Li Yuanfan Zhang Zuowei Cao Lei Sun Panaetov Alexander
Yucong Wang Minjie Cai Li Wang Lu Tian Zheyuan Wang Hongbing Ma
Jie Liu Chao Chen Yidong Cai Jie Tang Gangshan Wu Weiran Wang
Shirui Huang Honglei Lu Huan Liu Keyan Wang Jun Chen Shi Chen
Yuchun Miao Zimo Huang Lefei Zhang Mustafa Ayazoğlu Wei Xiong
Chengyi Xiong Fei Wang Hao Li Ruimian Wen Zhijing Yang Wenbin Zou
Weixin Zheng Tian Ye Yuncheng Zhang Xiangzhen Kong Aditya Arora
Syed Waqas Zamir Salman Khan Munawar Hayat Fahad Shahbaz Khan
Dandan Gao Dengwen Zhou Qian Ning Jingzhu Tang Han Huang Yufei Wang
Zhangheng Peng Haobo Li Wenxue Guan Shenghua Gong Xin Li Jun Liu
Wanjun Wang Dengwen Zhou Kun Zeng Hanjiang Lin Xinyu Chen
Jinsheng Fang
Y. Li (yawei.li@vision.ee.ethz.ch, Computer Vision Lab, ETH Zurich), K. Zhang, R. Timofte, and L. Van Gool were the challenge organizers, while the other authors participated in the challenge. Appendix A contains the authors' teams and affiliations. NTIRE 2022 webpage: https://data.vision.ee.ethz.ch/cvl/ntire22/. Code: https://github.com/ofsoundof/NTIRE2022_ESR.
1. Introduction
Single image super-resolution (SR) aims at recovering a high-resolution (HR) image from a single low-resolution (LR) image that has undergone a certain degradation process. Before the deep learning era, the problem of image SR was tackled by reconstruction-based [8,19,61] and example-based methods [20,62,68,69]. With the thriving of deep learning, SR is frequently tackled by solutions based on deep neural networks [15,35,37,47,48,84].
For image SR, it is assumed that the LR image $y$ is derived from the HR image $x$ after two major degradation processes, blurring and down-sampling, namely,

$$y = (x \otimes k)\downarrow_s, \qquad (1)$$

where $\otimes$ denotes the convolution operation between the HR image $x$ and the blur kernel $k$, and $\downarrow_s$ is the down-sampling operation with a down-scaling factor of ×$s$. Depending on the blur kernel and the down-sampling operation, image SR can be classified into several standard problems. Among them, bicubic down-sampling with different down-scaling factors (×2, ×3, ×4, ×8, or even ×16) is the most frequently used degradation model. This classical standard degradation model allows direct comparison between different image SR methods and also serves as a test bed to validate the advantage of a newly proposed SR method.
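The degradation model in Eq. (1) can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the function names `blur`, `downsample`, and `degrade` are hypothetical, the image is a single-channel nested list, borders are handled by replication, and down-sampling is plain decimation. It is not the bicubic degradation used to generate the challenge data.

```python
def blur(image, kernel):
    """Convolve a 2-D image with a 2-D kernel (odd size), replicating border pixels."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Kernel indices are flipped relative to the image,
                    # so this is a convolution rather than a correlation.
                    r = min(max(i + kh // 2 - u, 0), h - 1)
                    c = min(max(j + kw // 2 - v, 0), w - 1)
                    acc += kernel[u][v] * image[r][c]
            out[i][j] = acc
    return out


def downsample(image, s):
    """Keep every s-th pixel in both directions (assumed decimation, not bicubic)."""
    return [row[::s] for row in image[::s]]


def degrade(hr, kernel, s):
    """y = (x convolved with k), then down-sampled by s, as in Eq. (1)."""
    return downsample(blur(hr, kernel), s)


# Example: an 8x8 ramp image, a 3x3 box blur, and a x4 down-scaling factor.
hr = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
lr = degrade(hr, box, 4)  # 2x2 LR image
```

Swapping the box kernel and the decimation step for other choices gives the different "standard problems" mentioned above.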
With the fast development of hardware technologies, it has become possible to train much larger and deeper neural networks for image SR, which contributes significantly to the performance boost of the proposed solutions. Almost every breakthrough in the field of image SR comes with a more complex deep neural network [15,35,37,47,48,88]. Apart from the development of large models with high performance, a parallel direction is to design efficient deep neural networks for single image SR [16,23,31,32,40,66,85]. In [16,66], the proposed networks extracted features directly from the LR images instead of the bicubic interpolation of the LR image, which reduced the computation by almost a factor of $s^2$. This laid the foundation for the later design of neural networks for image SR. Later works focused on the design of basic building blocks from different perspectives [31,32,40]. In [23], a new activation function, namely the multi-bin trainable linear unit, was proposed to increase the nonlinear modeling capacity of shallow networks. In [85], an edge-oriented convolution block was proposed for real-time SR on mobile devices.
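The factor of $s^2$ mentioned above follows from simple counting: a convolution applied to the interpolated image visits $s^2$ times as many pixels as the same convolution applied in LR space. The sketch below is a hedged illustration; `conv_flops` is a hypothetical helper that counts multiply-accumulate operations for a stride-1 convolution, and the layer sizes are arbitrary assumptions.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulates of one k x k, stride-1, same-padding convolution on an h x w map."""
    return h * w * c_in * c_out * k * k


s = 4                  # down-scaling factor
lr_h, lr_w = 64, 64    # assumed LR feature map size

flops_lr = conv_flops(lr_h, lr_w, 64, 64, 3)              # features extracted in LR space
flops_interp = conv_flops(lr_h * s, lr_w * s, 64, 64, 3)  # same layer on the interpolated image

ratio = flops_interp // flops_lr  # equals s * s
```

The saving is "almost" $s^2$ in practice because the final upsampling layers still operate at, or produce, the HR resolution.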
Besides the manual design of deep neural networks, there is a plethora of works that try to improve the efficiency of deep neural networks via network pruning [27,39,43,44,55], low-rank filter decomposition [33,41,42,87], network quantization [11,25], neural architecture search [50,67,76], and knowledge distillation [28,70]. Among those network compression works, a couple have been successfully applied to image SR [42–44,67,76].
The efficiency of a deep neural network can be measured with different metrics including runtime, number of parameters, computational complexity (FLOPs), activations, and memory consumption, which affect the deployment of deep neural networks in different aspects. Among them, runtime is the most direct indicator of the efficiency of a network and is thus used as the main efficiency evaluation metric. The number of activations and parameters is related to memory consumption. A higher memory consumption means that additional memory devices are needed to store the activations and parameters during inference. Increased computational complexity is related to higher energy consumption, which could shorten the battery life of mobile devices. Finally, the number of parameters is also related to AI chip design. More parameters mean a larger chip area and an increased cost of the designed AI devices.
Jointly with the 2022 New Trends in Image Restoration and Enhancement (NTIRE 2022) workshop, we organize the challenge on efficient super-resolution. The task of the challenge is to super-resolve an LR image with a magnification factor of ×4 by a network that reduces one or several aspects such as runtime, parameters, FLOPs, activations and memory consumption, while at least maintaining a PSNR of 29.00dB on the DIV2K validation set. The challenge aims to seek advanced and novel solutions for efficient SR, to benchmark their efficiency, and to identify the general trends for the design of efficient SR networks.
2. NTIRE 2022 Efficient Super-Resolution Challenge
This challenge is one of the NTIRE 2022 associated challenges on: spectral recovery [4], spectral demosaicing [3], perceptual image quality assessment [22], inpainting [63], night photography rendering [18], efficient super-resolution [46], learning the super-resolution space [56], super-resolution and quality enhancement of compressed video [78], high dynamic range [60], stereo super-resolution [71], and burst super-resolution [7].
The objectives of this challenge are: (i) to advance research on efficient SR; (ii) to compare the efficiency of different methods; and (iii) to offer an opportunity for academic and industrial attendees to interact and explore collaborations. This section details the challenge itself.
2.1. DIV2K Dataset [2]
Following [2,82,83], the DIV2K dataset is adopted for the challenge. The dataset contains 1,000 DIVerse 2K resolution RGB images, which are divided into a training set with 800 images, a validation set with 100 images, and a testing set with 100 images. The corresponding LR DIV2K in this challenge is the bicubically downsampled counterpart with a down-scaling factor of ×4. The validation set is already released to the participants. The testing HR images are hidden from the participants during the whole challenge.
2.2. IMDN Baseline Model
The IMDN [31] serves as the baseline model in this challenge. The aim is to improve its efficiency in terms of runtime, number of parameters, FLOPs, number of activations, and GPU memory consumption while maintaining a PSNR performance of 29.00dB on the validation set. The IMDN model uses a 3×3 convolution to extract features from the LR RGB images, which is followed by 8 information multi-distillation blocks. The information multi-distillation block contains 4 stages that progressively refine the feature representation in the block. In each stage, the input feature from the previous stage is split along the channel dimension, leading to two separate features. Of the two features, one is bypassed to the end of the block and the other is fed to the next stage for the calculation of higher-level features. The bypassed features from the 4 stages are concatenated along the channel dimension and combined by a 1×1 convolution. The final upsampler consists of only one trainable convolutional layer to expand the feature dimension. Pixel-shuffle is used to recover the high-resolution grid of the image. This design is intended to save as many parameters as possible.
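The pixel-shuffle step at the end of the upsampler is a pure rearrangement of channels into space and has no trainable parameters. The following is a minimal sketch on nested lists, assuming the common channel ordering in which the $r^2$ consecutive channels of each output channel are laid out row by row in an $r \times r$ cell; the function name and the data layout (channels, height, width) are illustrative assumptions.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Channel c*r*r + i*r + j fills offset (i, j) of every r x r cell.
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become one 2x2 channel for r = 2.
x = [[[0]], [[1]], [[2]], [[3]]]
y = pixel_shuffle(x, 2)  # [[[0, 1], [2, 3]]]
```

Under this layout, a single-step ×4 upsampler producing an RGB image needs a convolution with 3 × 4² = 48 output channels, which is why one convolutional layer suffices before the shuffle.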
The baseline IMDN is provided by the winner of the AIM 2019 Challenge on Constrained Super-Resolution [83]. The quantitative performance and efficiency metrics of IMDN are given in Tab. 1 and summarized as follows. (1) The number of parameters is 0.894M. (2) The average PSNRs on the validation and testing sets of DIV2K are 29.13dB and 28.78dB, respectively. (3) The runtime averaged on the validation and test sets with PyTorch 1.11.0, CUDA Toolkit 10.2, cuDNN 7.6.2 and a single Titan Xp GPU is 50.86 ms. (4) The number of FLOPs for an input of size 256×256 is 58.53G. (5) The number of activations (i.e. the number of elements in all convolutional layer outputs) for an input of size 256×256 is 154.14M. (6) The maximum GPU memory consumption during inference on the DIV2K validation set is 471.76M. (7) The number of convolutional layers is 43.
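To make the definitions of these metrics concrete, the sketch below counts parameters, multiply-accumulates, and activations for a small stack of convolutions evaluated on a 256×256 input. It is a hedged illustration only: `conv_stats` and the three-layer `toy` stack are hypothetical and unrelated to the real IMDN, every layer is assumed to be stride 1 with same padding and a bias term, and some tools report FLOPs as twice the multiply-accumulate count.

```python
def conv_stats(layers, h, w):
    """Return (#params, #MACs, #activations) for stride-1, same-padding convolutions.

    layers: list of (c_in, c_out, kernel_size) tuples.
    """
    params = macs = acts = 0
    for c_in, c_out, k in layers:
        params += c_in * c_out * k * k + c_out  # weights plus bias
        macs += h * w * c_in * c_out * k * k    # one MAC per weight per output pixel
        acts += h * w * c_out                   # elements of the layer output
    return params, macs, acts


toy = [(3, 64, 3), (64, 64, 3), (64, 48, 3)]  # assumed toy network
params, macs, acts = conv_stats(toy, 256, 256)
```

Note that parameters are independent of the input size, while MACs and activations scale with the number of pixels, which is why the challenge fixes the input to 256×256 for those two metrics.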
2.3. Tracks and Competition
The aim of this challenge is to devise a network that reduces one or several aspects such as runtime, parameters, FLOPs, activations and memory consumption while at least maintaining the PSNR of 29.00dB on the validation set. The challenge is divided into three tracks according to the 5 evaluation metrics.
Main Track: Runtime Track. In this track, the practical runtime performance of the submissions is evaluated. The rankings of the teams are determined directly by the absolute value of the average runtime on the validation set and test set.
Sub-Track 1: Model Complexity Track. In this track, the number of parameters and FLOPs are considered. The rankings of the two metrics are summed up to determine a final ranking in this track.
Sub-Track 2: Overall Performance Track. In this track, all of the five metrics mentioned in the description of the challenge including runtime, parameters, FLOPs, activations, and GPU memory are considered. Similar to Sub-Track 1, the rankings of the five metrics are summed up to determine a final ranking in this track.
Ranking statistic. When determining the ranking in the case of multiple metrics, the individual rankings of the different metrics are summed up, which constitutes a ranking statistic of the metrics. This idea is similar to that behind Spearman's correlation. That is, instead of using the absolute values, a ranking statistic removes the influence of units and at the same time is good enough to distinguish different entries.
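The ranking statistic can be sketched in a few lines. The example below is illustrative only: the helper names are hypothetical, lower values are assumed to be better for every metric, tied values are assumed to share the lowest rank (the text does not specify tie handling), and the ranks are computed within a three-team subset of Tab. 1, so they differ from the full-table ranks.

```python
def ranks(values):
    """Rank entries in ascending order (1 = smallest); ties share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def rank_sum(metrics):
    """metrics: dict mapping metric name -> list of per-team values (lower is better)."""
    per_metric = [ranks(values) for values in metrics.values()]
    return [sum(team_ranks) for team_ranks in zip(*per_metric)]


teams = ["ByteESR", "XPixel", "NJUST ESR"]
metrics = {
    "params_M": [0.317, 0.156, 0.176],  # values taken from Tab. 1
    "flops_G": [19.70, 9.50, 8.73],
}
scores = rank_sum(metrics)  # [6, 3, 3]
```

Because only ranks are summed, a metric measured in millions of parameters and one measured in GFLOPs contribute on the same scale, and two entries can tie, as XPixel and NJUST ESR do here.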
Challenge phases. (1) Development and validation phase: The participants had access to the 800 LR/HR training image pairs and 100 LR/HR validation image pairs of the DIV2K dataset. The IMDN model, pretrained parameters, and validation demo script are given on GitHub (https://github.com/ofsoundof/IMDN), allowing the participants to benchmark the runtime of their models on their system. The participants could upload the HR validation results to the evaluation server to calculate the PSNR of the super-resolved images produced by their models and get immediate feedback. The number of parameters and runtime were computed by the participants. (2) Testing phase: In the final test phase, the participants were granted access to the 100 LR testing images. The HR ground-truth images are hidden from the participants. The participants then submitted their super-resolved results to the Codalab evaluation server and e-mailed the code and factsheet to the organizers. The organizers verified and ran the provided code to obtain the final results. Finally, the participants received the final results at the end of the challenge.
Evaluation protocol. The quantitative evaluation metrics include validation and testing PSNRs, runtime, number of parameters, number of FLOPs, number of activations, and maximum GPU memory consumed during inference. The PSNR was measured by first discarding the 4-pixel boundary around the images. The average runtime during inference on the 100 LR validation images and the 100 LR testing images is computed. The best runtime among three consecutive trials is selected as the final result. The average runtime on the validation set and testing set is used as the final runtime indicator. The maximum GPU memory consumption is recorded during inference. The FLOPs and activations are evaluated on an input image of size 256×256. Among the above metrics, the runtime is regarded as the most important one. During the challenge, the participants are required to maintain the PSNR of 29.00dB on the validation set. For the final ranking, a tiny accuracy drop is tolerated. To be specific, submissions with PSNR higher than 28.95dB on the validation set and 28.65dB on the testing set could enter the final ranking. The constraint on the testing set avoids overfitting on the validation set. A code example for calculating these metrics is available at https://github.com/ofsoundof/NTIRE2022_ESR. The code of the submitted solutions and the pretrained weights are also available in this repository.

Table 1. Results of NTIRE 2022 Efficient SR Challenge. The subscript numbers in parentheses following each metric value denote the ranking of the solution in terms of that metric. "Ave. Time" is averaged on the DIV2K validation and test datasets. "#Params" denotes the total number of parameters. "FLOPs" is the abbreviation for floating point operations. "#Acts" measures the number of elements of all outputs of convolutional layers. "GPU Mem." represents maximum GPU memory consumption according to the PyTorch function torch.cuda.max_memory_allocated() during inference on the DIV2K validation set. "#Conv" represents the number of convolutional layers. "FLOPs" and "#Acts" are tested on an LR image of size 256×256. This is not a challenge for PSNR improvement. The "validation/testing PSNR" and "#Conv" are not ranked.

| Team | Main Track | Sub-Track 1 | Sub-Track 2 | PSNR [Val.] | PSNR [Test] | Ave. Time [ms] | #Params [M] | FLOPs [G] | #Acts [M] | GPU Mem. [M] | #Conv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ByteESR | 1 | 22 (11) | 33 (2) | 29.00 | 28.72 | 27.11 (1) | 0.317 (11) | 19.70 (11) | 80.05 (6) | 377.91 (4) | 39 |
| NJU Jet | 2 | 37 (18) | 44 (6) | 29.00 | 28.69 | 28.07 (2) | 0.341 (18) | 22.28 (19) | 72.09 (4) | 204.60 (1) | 34 |
| NEESR | 3 | 10 (4) | 27 (1) | 29.01 | 28.71 | 29.97 (3) | 0.272 (4) | 16.86 (6) | 79.59 (5) | 575.99 (9) | 59 |
| Super | 4 | 26 (12) | 55 (10) | 29.00 | 28.71 | 32.09 (4) | 0.326 (14) | 20.06 (12) | 93.82 (10) | 663.07 (15) | 59 |
| MegSR | 5 | 18 (9) | 43 (5) | 29.00 | 28.68 | 32.59 (5) | 0.290 (9) | 17.70 (9) | 91.72 (8) | 640.63 (12) | 64 |
| rainbow | 6 | 16 (8) | 34 (3) | 29.01 | 28.74 | 34.10 (6) | 0.276 (6) | 17.98 (10) | 92.80 (9) | 309.23 (3) | 59 |
| VMCL Taobao | 7 | 29 (14) | 57 (11) | 29.01 | 28.68 | 34.24 (7) | 0.323 (13) | 20.97 (16) | 98.67 (11) | 633.00 (10) | 40 |
| Bilibili AI | 8 | 15 (7) | 41 (4) | 29.00 | 28.70 | 34.67 (8) | 0.283 (8) | 17.61 (7) | 90.50 (7) | 633.74 (11) | 64 |
| NKU-ESR | 9 | 12 (5) | 48 (7) | 29.00 | 28.66 | 34.81 (9) | 0.276 (7) | 16.73 (5) | 111.12 (13) | 662.51 (14) | 65 |
| NJUST RESTORARION | 10 | 54 (27) | 89 (15) | 28.99 | 28.68 | 35.76 (10) | 0.421 (28) | 27.67 (26) | 108.66 (12) | 643.95 (13) | 52 |
| TOVBU | 11 | 43 (21) | 96 (19) | 29.00 | 28.71 | 38.32 (11) | 0.376 (23) | 22.38 (20) | 113.55 (15) | 867.17 (27) | 64 |
| Alpan Team | 12 | 18 (10) | 51 (9) | 29.01 | 28.75 | 39.63 (12) | 0.326 (15) | 12.31 (3) | 115.52 (16) | 439.37 (5) | 132 |
| Dragon | 13 | 38 (19) | 70 (13) | 29.01 | 28.69 | 41.80 (13) | 0.358 (20) | 21.11 (18) | 120.15 (17) | 260.00 (2) | 131 |
| TieGuoDun Team | 14 | 54 (27) | 104 (21) | 28.95 | 28.65 | 42.35 (14) | 0.433 (29) | 27.10 (25) | 112.03 (14) | 788.13 (22) | 64 |
| HiImageTeam | 15 | 7 (3) | 70 (13) | 29.00 | 28.72 | 47.75 (15) | 0.242 (3) | 14.51 (4) | 151.36 (23) | 861.84 (25) | 100 |
| xilinxSR | 16 | 66 (34) | 107 (22) | 29.05 | 28.75 | 48.20 (16) | 0.790 (34) | 51.76 (32) | 136.31 (18) | 471.37 (7) | 38 |
| cipher | 17 | 50 (24) | 111 (23) | 29.00 | 28.72 | 51.42 (17) | 0.407 (26) | 25.25 (24) | 155.35 (24) | 770.82 (20) | 67 |
| NJU MCG | 18 | 13 (6) | 66 (12) | 28.99 | 28.71 | 52.02 (18) | 0.275 (5) | 17.65 (8) | 212.35 (27) | 511.08 (8) | 84 |
| IMGWLH | 19 | 34 (17) | 91 (17) | 29.01 | 28.72 | 56.34 (19) | 0.362 (21) | 20.10 (13) | 136.35 (19) | 753.02 (19) | 113 |
| imglhl | 20 | 45 (22) | 92 (18) | 29.03 | 28.75 | 56.88 (20) | 0.381 (24) | 23.26 (21) | 144.05 (21) | 451.21 (6) | 127 |
| whu sigma | 21 | 63 (32) | 132 (30) | 29.02 | 28.73 | 61.04 (21) | 0.705 (33) | 43.88 (30) | 142.91 (20) | 1011.54 (28) | 64 |
| Aselsan Research | 22 | 27 (13) | 98 (20) | 29.02 | 28.73 | 63.18 (22) | 0.317 (12) | 20.71 (15) | 206.05 (26) | 799.52 (23) | 134 |
| Drinktea | 23 | 59 (31) | 121 (27) | 29.00 | 28.70 | 75.52 (23) | 0.589 (31) | 36.92 (28) | 148.05 (22) | 734.54 (17) | 67 |
| GDUT SR | 24 | 50 (24) | 136 (31) | 29.05 | 28.75 | 75.70 (24) | 0.414 (27) | 24.80 (23) | 260.05 (28) | 1457.98 (34) | 195 |
| Giantpandacv | 25 | 63 (32) | 150 (34) | 29.07 | 28.76 | 87.87 (25) | 0.683 (32) | 45.07 (31) | 361.23 (31) | 1272.95 (31) | 122 |
| neptune | 26 | 39 (20) | 123 (29) | 28.99 | 28.69 | 101.69 (26) | 0.316 (10) | 38.03 (29) | 269.48 (29) | 1179.05 (29) | 45 |
| XPixel | 27 | 3 (1) | 49 (8) | 29.01 | 28.69 | 140.47 (27) | 0.156 (1) | 9.50 (2) | 65.76 (3) | 729.94 (16) | 43 |
| NJUST ESR | 28 | 3 (1) | 89 (15) | 28.96 | 28.68 | 164.80 (28) | 0.176 (2) | 8.73 (1) | 160.43 (25) | 1346.74 (33) | 25 |
| TeamInception | 29 | 57 (30) | 146 (33) | 29.12 | 28.82 | 171.56 (29) | 0.505 (30) | 32.42 (27) | 502.27 (34) | 866.16 (26) | 74 |
| cceNBgdd | 30 | 33 (16) | 114 (24) | 28.97 | 28.67 | 180.60 (30) | 0.339 (16) | 21.11 (17) | 404.16 (33) | 739.65 (18) | 197 |
| ZLZ | 31 | 55 (29) | 118 (26) | 29.00 | 28.72 | 183.43 (31) | 0.372 (22) | 64.45 (33) | 57.51 (2) | 1244.23 (30) | 16 |
| Express | 32 | 31 (15) | 117 (25) | 29.04 | 28.77 | 203.16 (32) | 0.339 (17) | 20.41 (14) | 325.53 (30) | 853.27 (24) | 148 |
| Just Try | 33 | 70 (35) | 170 (35) | 29.12 | 28.81 | 247.90 (33) | 0.832 (35) | 135.30 (35) | 392.43 (32) | 2387.93 (35) | 207 |
| ncepu explorers | 34 | 47 (23) | 137 (32) | 29.09 | 28.79 | 317.66 (34) | 0.390 (25) | 23.73 (22) | 994.25 (35) | 771.54 (21) | 374 |
| mju mnu | 35 | 53 (26) | 121 (27) | 29.06 | 28.79 | 332.28 (35) | 0.345 (19) | 78.81 (34) | 46.76 (1) | 1310.72 (32) | 40 |

The following methods are not ranked since their validation/testing PSNR are not on par with the baseline.

| Team | PSNR [Val.] | PSNR [Test] | Ave. Time [ms] | #Params [M] | FLOPs [G] | #Acts [M] | GPU Mem. [M] | #Conv |
|---|---|---|---|---|---|---|---|---|
| Virtual Reality Team | 27.35 | 27.26 | 2231.32 | 0.423 | 423.16 | 2731.08 | 3336.88* | 82 |
| NTU607QCO-ESR | 27.79 | 27.61 | 38.85 | 0.433 | 27.06 | 108.89 | 776.38 | 60 |
| Strong Tiger | 29.00 | 28.61 | 34.92 | 0.560 | 36.64 | 78.91 | 641.13 | 23 |
| VAP | 29.01 | 28.47 | 23.96 | 0.175 | 10.83 | 70.93 | 507.64 | 63 |
| Multicog | 28.38 | 28.16 | 207.98 | 0.312 | 37.67 | 430.23 | 1461.57 | 130 |
| Set5baby Team | 28.92 | 28.62 | 83.44 | 0.223 | 13.98 | 229.07 | 797.25 | 88 |
| NWPU SweetDreamLab | 28.47 | 28.23 | 31.19 | 0.193 | 11.73 | 90.50 | 633.10 | 76 |
| SSL | 28.72 | 28.44 | 64.71 | 0.290 | 18.95 | 150.60 | 675.41 | 48 |
| RFDN AIM2020 Winner | 29.04 | 28.75 | 41.97 | 0.433 | 27.10 | 112.03 | 788.13 | 64 |
| IMDN baseline | 29.13 | 28.78 | 50.86 | 0.894 | 58.53 | 154.14 | 471.76 | 43 |

*This solution uses too much GPU memory. Images are cropped to 256×256 with 32 overlapping pixels during inference.
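The PSNR measurement with the 4-pixel boundary discarded, as described in the evaluation protocol, can be sketched as follows. This is an assumption-laden illustration rather than the organizers' script: the function name is hypothetical, images are single-channel nested lists, the peak value is assumed to be 255, and whether the official evaluation uses RGB or a luminance channel is not stated in the text.

```python
import math


def psnr(sr, hr, border=4, peak=255.0):
    """PSNR between two equally sized 2-D images after discarding `border` pixels per side."""
    h, w = len(hr), len(hr[0])
    squared_error, count = 0.0, 0
    for i in range(border, h - border):
        for j in range(border, w - border):
            squared_error += (sr[i][j] - hr[i][j]) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")  # identical interiors
    return 10.0 * math.log10(peak ** 2 / mse)


# A 12x12 example: a uniform error of one gray level inside the cropped region.
hr = [[0.0] * 12 for _ in range(12)]
sr = [[1.0] * 12 for _ in range(12)]
value = psnr(sr, hr)  # 10 * log10(255^2 / 1)
```

Discarding a boundary equal to the scale factor keeps border artifacts of the upsampling from dominating the score.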
3. Challenge Results
Tab. 1 reports the final test results and rankings of the teams. The solutions with validation PSNR lower than 28.95dB or test PSNR lower than 28.65dB are not ranked. The results of the baseline method IMDN [31] and the overall first place winner team in the AIM 2020 Efficient SR challenge [51] are also reported for comparison. The methods evaluated in Tab. 1 are briefly described in Sec. 4 and the team members are listed in Appendix A. According to Tab. 1, we can make the following observations.
Main Track: Runtime Track. First, the ByteESR team is the overall first place winner in the main track of this efficient SR challenge. The NJU Jet team and NEESR team win the overall second place and overall third place, respectively. The average runtimes of the first three solutions are below 30 ms and very close to each other. The first 12 teams proposed solutions with an average runtime lower than 40 ms. In addition, the solution proposed by the 13th team is also faster than the AIM 2020 winner RFDN [51].
Sub-Track 1: Model Complexity Track. For this track, there are two first place winners, XPixel and NJUST ESR. The solution proposed by the XPixel team has slightly fewer parameters while the solution proposed by the NJUST ESR team requires less computation. The HiImageTeam team achieves the third place in this track. The number of parameters of 9 solutions is lower than 0.3 M, which is a significant improvement compared with the AIM 2019 constrained SR challenge winner IMDN and the AIM 2020 efficient SR challenge winner RFDN. As for the computational complexity, the FLOPs of 11 solutions are lower than 20G and, according to Tab. 1, 24 solutions have fewer FLOPs than RFDN.
Sub-Track 2: Overall Performance Track. The NEESR team is the first place winner in this track. ByteESR and rainbow are the second and third place winners in this track, respectively. The solutions proposed by the mju mnu, ZLZ, and XPixel teams have the fewest activations. Meanwhile, the solutions proposed by the NJU Jet, Dragon, and rainbow teams are the most memory-efficient.
The xilinxSR team achieved the best PSNR fidelity metric (29.05dB on the validation set and 28.75dB on the test set) among the solutions that outperform the baseline network IMDN in terms of runtime and number of parameters. When comparing this solution and the baseline solution IMDN, it is observed that IMDN has a larger PSNR improvement on the validation set than on the test set. On the other hand, it is also observed that, compared with the solutions proposed by TeamInception and Just Try, IMDN has similar PSNR on the validation set but achieves lower PSNR on the test set. Such phenomena indicate that IMDN favors the PSNR on the validation set. This is also the reason why the baseline PSNR on the validation set is set to 29.00dB rather than 29.13dB.
3.1. Architectures and main ideas
During this challenge, a couple of techniques are proposed to improve the efficiency of deep neural networks for image SR while maintaining the performance as much as possible. Depending on the metrics that a team wants to optimize, different solutions are proposed. In the following, some typical ideas are listed.
1. Modifying the architecture of the information multi-distillation block (IMDB) and residual feature distillation block (RFDB) is the mainstream technique. The IMDB and RFDB modules come from the first place winners of the AIM 2019 constrained image SR challenge and the AIM 2020 efficient image SR challenge. Thus, some teams of this challenge start from modifying those two basic architectures. The first place winner ByteESR in the runtime main track proposed a residual local feature block (RLFB) to replace the RFDB and IMDB modules. The main difference is the removal of the concatenation operation and the associated 1×1 convolutional feature distillation layers. This is optimized especially out of runtime considerations. Besides, the team also reduced the number of convolutions in the ESA module.
2. Multi-stage information distillation might influence the inference speed of the super efficient models. It is observed that the first two place solutions in the runtime main track do not contain multi-stage information distillation blocks in the network. It is also reported in other work [85] that using too many skip connections and associated 1×1 information distillation layers could harm the runtime performance.
3. Reparameterization could bring slight performance improvement. A couple of teams tried to reparameterize a normal convolutional layer with multiple basic operations (3×3 convolution, 1×1 operation, first and second order derivative operators, skip connections) during network training. During inference, the multiple operations that reparameterize a convolution can be merged back into a single convolution. It is demonstrated that this technique could bring a slight PSNR gain. For example, NJU Jet replaced a normal residual block with a reparameterized residual block. The NEESR team used an edge-oriented convolution block to reparameterize a normal convolution.
4. Filter decomposition methods could effectively reduce the model complexity. Filter decomposition generally refers to the replacement of a normal convolution by a couple of lightweight convolutions (depthwise, 1×1, 1×3 and 3×1 convolutions). For example, the XPixel team used the combination of depthwise convolution and 1×1 convolution. The NJUST ESR team also used the inverted residual block. The solutions proposed by the two teams won the first two places in the model complexity track.
5. Network pruning began to play a role. It is observed that a couple of teams used network pruning techniques to slightly compress a baseline network. For example, the ByteESR team slightly pruned the number of channels in their network from 48 to 46 at the final training stage. The MegSR team normalized the weight parameters and applied learnable parameters to the normalized channels. They pruned the network according to the magnitude of these parameters. The xilinxSR team also tried to prune the IMDB modules.
6. Activation function is an important factor. It is observed that some teams used advanced activation functions in their networks. For example, the rainbow team used the SiLU activation function for each convolution except the last 1×1 convolution. A lot of teams also used the GeLU activation function.
7. Design of loss functions is also among the considerations. The loss function is also an important element for the success of an efficient SR network. While most of the teams used L1 or L2 loss, some teams also demonstrated that using a more advanced loss function could bring a marginal PSNR gain. For example, the ByteESR team used contrastive loss to improve the PSNR by 0.01dB to 0.02dB on different validation sets. The NKU-ESR team proposed an edge-enhanced gradient-variance loss.
8. Advanced training strategy guarantees the performance of the network. The advanced training strategy covers many aspects of the training setting. For example, most of the teams prolonged the training. Since the models are mostly small, training with both a large patch size and a large batch size becomes possible. Periodic learning rate schedulers and cosine learning rate schedulers are used by some teams, which could help the training to step out of local minima. The training of winning solutions typically contains several stages. Advanced tuning of the network architecture such as pruning and merging of reparameterized operations is used at the final fine-tuning stages.
9. Various other techniques are also attempted. Some teams also proposed solutions based on neural architecture search, vision transformers, and even the fast Fourier transform.
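The merging step behind reparameterization (item 3 above) relies on the linearity of convolution: parallel linear branches can be folded into one kernel after training. The sketch below shows the idea in its simplest form and is not any team's actual block. It assumes a single channel, no bias or normalization, zero padding, and only three branches (a 3×3 convolution, a 1×1 convolution reduced to a scalar, and an identity skip); all function names are hypothetical.

```python
def conv3x3(x, k):
    """Single-channel 3x3 cross-correlation with zero padding (same output size)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for u in range(3):
                for v in range(3):
                    r, c = i + u - 1, j + v - 1
                    if 0 <= r < h and 0 <= c < w:
                        out[i][j] += k[u][v] * x[r][c]
    return out


def multi_branch(x, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (scalar k1) + identity skip."""
    y3 = conv3x3(x, k3)
    return [[y3[i][j] + k1 * x[i][j] + x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]


def merge_branches(k3, k1):
    """Inference-time form: fold the 1x1 branch and the skip into the 3x3 kernel center."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged


x = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 2.0], [2.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 2.0]]
k3 = [[0.5, 0.0, -0.5], [0.25, 1.0, 0.25], [0.0, -0.25, 0.5]]
k1 = 0.75

train_out = multi_branch(x, k3, k1)
infer_out = conv3x3(x, merge_branches(k3, k1))  # same values, one convolution
```

The merged network has the runtime of a plain 3×3 convolution, which is why the technique can improve PSNR without affecting the efficiency metrics measured at inference.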
3.2. Participants
This year we see a continuous growth of the efficient SR challenge with more participants and valid submissions. As shown in Fig. 1, the number of registered participants grew from 64 in AIM 2019 to 150 in AIM 2020 and finally to 303 this year. Meanwhile, the number of valid submissions also grew from 12 in AIM 2019 to 25 in AIM 2020 and 43 this year.
Figure 1. Number of participants and valid submissions during the past three challenges.
3.3. Fairness
To maintain the fairness of the efficient SR challenge, a couple of rules about fair and unfair tricks are set. Most of the rules are about the dataset used to train the network. First, training with an additional external dataset such as Flickr2K is allowed. Secondly, training with the DIV2K validation set, including either the HR or the LR images, is not allowed. This is because the validation set is used to examine the overall performance and generalizability of the proposed network. Thirdly, training with the DIV2K test LR images is not allowed. Fourthly, using an advanced data augmentation strategy during training is regarded as a fair trick.
3.4. Conclusions
Based on the aforementioned analysis of the efficient SR challenge results, the following conclusions can be drawn.
1. The efficient image SR community is growing. This year the challenge had 303 registered participants and received 43 valid submissions, which is a significant boost compared with the previous years.
2. The family of the proposed solutions during this challenge keeps pushing the frontier of the research and implementation of efficient image SR.
3. In conjunction with the previous series of the efficient SR challenge including the AIM 2019 Constrained SR Challenge [83] and the AIM 2020 Efficient SR Challenge [82], the proposed solutions set new records of network efficiency in terms of metrics such as runtime and model complexity while maintaining the accuracy of the network.
4. There is a divergence between the actual runtime and the theoretical model complexity of the proposed networks. This shows that the theoretical model complexity including FLOPs and the number of parameters does not correlate well with the actual runtime, at least on GPU infrastructures.
5. In the meanwhile, new developments in the efficient SR field are also observed, which include but are not limited to the following aspects.
   - The effectiveness of the multi-stage information distillation mechanism is challenged by the first two place solutions in the runtime main track.
   - Other techniques such as contrastive loss, network pruning, and convolution reparameterization began to play a role for efficient SR.

Figure 2. ByteESR Team: (a) Residual feature distillation block (RFDB). (b) Residual local feature block (RLFB). (c) Enhanced Spatial Attention (ESA).
4. Challenge Methods and Teams
4.1. ByteESR
Network Architecture. The ByteESR Team proposed the
Residual Local Feature Network (RLFN) for Efficient
Super-Resolution. As shown in Fig. 3, the proposed RLFN
uses a basic SR architecture similar to
IMDN [31] and other methods [48,74]. The difference is that
RLFN uses four residual local feature blocks (RLFBs) as the
building blocks.
Figure 3. ByteESR Team: The architecture of residual local feature
network (RLFN).
The proposed RLFB is modified from the residual feature
distillation block (RFDB) [51]. As shown in Fig. 2a, RFDB
uses three Conv-1 layers for feature distillation, and all the
distilled features are concatenated together. Although
aggregating multiple layers of distilled features can yield more
powerful features, concatenation accounts for most of the
inference time. To reduce inference time and memory,
RLFB (see Fig. 2b) removes the concatenation layer and the
related feature distillation layers and replaces them with an
addition for local feature learning. In addition, the Conv Groups
in ESA [52] (see Fig. 2c) are simplified in RLFB to one Conv-3
to decrease the model depth and complexity.
Figure 4. NJU Jet Team: The overall architecture of the fast and memory efficient network (FMEN).
Contrastive Loss. Some recent works [53,72] find that
a randomly initialized feature extractor, without any training,
can be used to improve the performance of models on
several dense prediction tasks. Inspired by these works,
RLFN builds a two-layer network as the feature extractor.
The weights of the convolution kernels are randomly initialized.
The contrastive loss is defined as:
$$CL = \frac{\|\phi(y_{sr}) - \phi(y_{hr})\|_1}{\|\phi(y_{sr}) - \phi(y_{lr})\|_1}, \tag{2}$$
where $\phi$ denotes the feature map generated by the feature
extractor, $\|\phi(y_{sr}) - \phi(y_{hr})\|_1$ is the L1 distance between
the feature maps of $y_{sr}$ and $y_{hr}$, and $\|\phi(y_{sr}) - \phi(y_{lr})\|_1$ is the
L1 distance between the feature maps of $y_{sr}$ and $y_{lr}$.
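The ratio in Eq. (2) can be sketched in a few lines of Python. This is an illustrative sketch, not the ByteESR implementation: the two-layer convolutional extractor is replaced by a hypothetical two-layer dense network with fixed random weights (`make_random_extractor`), images are treated as flat lists, the LR image is assumed to be already upsampled to the SR size (`y_lr_up`), and the `eps` term is an added assumption to avoid division by zero.

```python
import random


def make_random_extractor(in_dim, hidden_dim=8, out_dim=4, seed=0):
    """Hypothetical stand-in for phi: two random, untrained linear layers with a ReLU."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def phi(x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, hidden)) for row in w2]

    return phi


def l1_distance(a, b):
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def contrastive_loss(phi, y_sr, y_hr, y_lr_up, eps=1e-12):
    """Eq. (2): distance to the HR features divided by distance to the LR features."""
    positive = l1_distance(phi(y_sr), phi(y_hr))
    negative = l1_distance(phi(y_sr), phi(y_lr_up))
    return positive / (negative + eps)


phi = make_random_extractor(in_dim=4)
y_hr = [0.9, 0.1, 0.8, 0.2]
y_lr_up = [0.5, 0.5, 0.5, 0.5]  # assumed: LR image upsampled to the SR size
loss_perfect = contrastive_loss(phi, list(y_hr), y_hr, y_lr_up)
loss_other = contrastive_loss(phi, [0.7, 0.3, 0.7, 0.3], y_hr, y_lr_up)
```

Minimizing this ratio pulls the SR features towards the HR features while pushing them away from the LR features; an SR output identical to the HR image gives a loss of zero.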
Implementation details. The proposed RLFN has four
RLFBs, in which the number of feature channels is set to
48 while the channel number of ESA is set to 16. During
training, the DIV2K [2] and Flickr2K datasets are used for the
whole process. The details of the training steps are as follows:
I. At the first stage, the model is trained from scratch. HR
patches of size $256\times256$ are randomly cropped from
HR images, and the mini-batch size is set to 64. The
RLFN model is trained by minimizing the L1 loss function
with the Adam optimizer. The initial learning rate is set
to $5\times10^{-4}$ and halved every 200 epochs. The total
number of epochs is 1000.
II. At the second stage, the model is initialized with the
pretrained weights and trained with the same settings
as in the previous step. This process repeats twice.
III. At the third stage, the model is initialized with the
pretrained weights from Stage 2. The same training
settings as in Stage 1 are kept, except that
the loss function is replaced by the combination of L1
loss and contrastive loss with a regularization factor
of 255.
IV. At the fourth stage, the number of Conv-1 filters of the RLFBs
in the pretrained model is reduced from 48 to 46 using Soft Filter
Pruning [26]. Training settings are the same as in Stage
1, except that the size of HR patches changes to $512\times512$.
After 1000 epochs, L2 loss is used for fine-tuning
with $640\times640$ HR patches and a learning rate of $10^{-5}$.
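The step schedule used in Stage I (initial rate $5\times10^{-4}$, halved every 200 epochs over 1000 epochs) can be written as a one-line function. The sketch below is only an illustration; the function name `step_lr` and its parameter names are assumptions and not part of the team's code.

```python
def step_lr(epoch, base_lr=5e-4, step=200, gamma=0.5):
    """Learning rate after `epoch` epochs when the rate is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Rates at the start, just before the first decay, just after it, and at the last epoch.
schedule = [step_lr(e) for e in (0, 199, 200, 999)]
```

With these values the rate passes through five plateaus, ending at $5\times10^{-4}/16$ during the final 200 epochs.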
4.2. NJU Jet
Runtime and memory consumption are two important
aspects for efﬁcient image super-resolution (EISR) mod-
els to be deployed on resource-constrained devices. Recent
advances in EISR [31,51] exploit distillation and aggre-
gation strategies with plenty of channel split and concate-
nation operations to make full use of limited hierarchical
features. By contrast, sequential network operations avoid
frequently accessing preceding states and extra nodes, and
thus are beneﬁcial to reducing the memory consumption
and runtime overhead. Following this idea, the team de-
signed a lightweight network backbone by mainly stacking
multiple highly optimized convolution and activation lay-
ers and decreasing the usage of feature fusion. The overall
network architecture is shown in Fig. 4. The feature ex-
traction part and reconstruction part are the same as recent
works [31,51], and the high-frequency learning part is com-
posed of the proposed enhanced residual block (ERB) and
high-frequency attention block (HFAB) pairs.
Figure 5. NJU Jet Team: Comparison between a normal residual
block (RB) [48] and an enhanced residual block (ERB).
Enhanced residual block. The team ﬁrst proposed en-
hanced residual block (ERB) to replace normal residual
block (RB) in EDSR [48], for reducing the memory access
cost (MAC) introduced by skip connection. As shown in
Fig. 5, ERB is composed of two re-parameterization blocks
(RepBlock) and one ReLU. During training, each RepBlock
utilizes $1\times1$ convolution to expand or reduce the number of
feature maps and adopts $3\times3$ convolution to extract features
in a higher dimensional space. Besides, two skip connections
are used to ease training difficulty. During inference, all the
linear transformations can be merged [14]. Thus each
RepBlock can be converted into a single $3\times3$ convolution. In
general, ERB takes advantage of residual learning without
additional MAC.
High-frequency attention block. Recently, attention
mechanisms have been extensively studied in the SR community.
Based on their granularity, they can be divided
into channel attention [31,34], spatial attention [75], pixel
attention [90], and layer attention [59]. Previous
attention-based methods neglect two important aspects.
First, some attention schemes, such as CCA [31]
and ESA [51], have a multi-branch topology, which
introduces too much extra memory consumption. Second, some
nodes in the attention branch are not computationally optimal,
such as the $7\times7$ convolution used in ESA [52], which is
much less efficient than a $3\times3$ convolution.
Considering both aspects, a sequential attention branch
is designed to rescale each position based on its nearby
pixels. The attention branch is inspired by edge detection,
where a linear combination of nearby pixels can
be used to detect edges. Furthermore, the team found
that the attention branch focuses on high-frequency areas
and therefore named the proposed block the high-frequency attention
block (HFAB), shown in Fig. 4. HFAB rescales every position
according to a non-linear combination of its window.
In HFAB, $3\times3$ convolution rather than $1\times1$ convolution
is used to reduce and expand the feature dimension for a larger
receptive field and efficiency. Batch normalization (BN) is
adopted in the attention branch to introduce global statistics
and to keep features within the unsaturated area of the sigmoid
function. During inference, BN can be merged into the
convolution without additional computational cost.
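The merging of BN into the preceding convolution follows from both operations being affine at inference time. The sketch below illustrates the standard folding rule for a single output channel evaluated at one spatial position; the function names and the numerical values are hypothetical, and the running statistics (`mean`, `var`) are assumed to be frozen.

```python
import math


def conv_channel(x, weights, bias):
    """One output channel of a convolution at one position: a dot product plus bias."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with frozen statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN scale and shift into the convolution weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta


x = [0.2, -1.0, 0.5, 0.3, 0.0, 1.5, -0.7, 0.9, 0.1]  # a flattened 3x3 window
w = [0.1, 0.2, -0.3, 0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
b, gamma, beta, mean, var = 0.05, 1.3, -0.2, 0.4, 2.0

reference = batch_norm(conv_channel(x, w, b), gamma, beta, mean, var)
folded_w, folded_b = fold_bn(w, b, gamma, beta, mean, var)
folded = conv_channel(x, folded_w, folded_b)
```

Both paths give the same value up to floating-point rounding, so the BN layer disappears from the deployed network.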
Implementation details. DIV2K and Flickr2K are used
as the training dataset. Five ERB-HFAB pairs are stacked
sequentially. The number of feature maps of ERB and
HFAB is set to 50 and 16, respectively. In each training
batch, 64 HR RGB patches of size $256\times256$ are cropped
and augmented by random flipping and rotation. The learning
rate is initialized as $5\times10^{-4}$ and halved every
$2\times10^{5}$ iterations. The network is trained for $10^{6}$ iterations
in total by minimizing the L1 loss function with the Adam
optimizer. The team loaded the trained weights and repeated
the above training setting 4 times. After that, L2 loss
is used for fine-tuning. The initial learning rate is set to
$1\times10^{-5}$ for $2\times10^{5}$ iterations. Finally, the reconstruction
part is fine-tuned with batch size 256 and HR patch size 640
for $10^{5}$ iterations, with L2 loss.
Figure 6. NEESR Team: Detailed architecture of RFDECB.
Figure 7. NEESR Team: Details of the edge-oriented convolution
block (ECB). In the inference stage, the ECB module is
converted into a single standard $3\times3$ convolution layer.
4.3. NEESR
The NEESR team proposed the edge-oriented feature
distillation network (EFDN) for lightweight image
super-resolution. The proposed EFDN is modified from RFDN [51]
with several efficiency improvements, such as
fewer channels and the replacement of the shallow residual
block (SRB) with the edge-oriented convolution block (ECB).
Different from RFDN, EFDN uses only 42 channels to
further accelerate the network. Inspired by ECBSR [85],
EFDN employs the re-parameterization technique to boost
the SR performance while maintaining high efficiency.
Figure 8. XPixel Team: The architecture of blueprint separable residual network (BSRN).
EFDN has four RFDECBs, shown in Fig. 6. In the training
stage, RFDECB utilizes the ECB module, which consists
of four types of carefully designed operators: a normal
$3\times3$ convolution, a channel expanding-and-squeezing
convolution, and first- and second-order spatial derivatives from
intermediate features. Such a design can extract edge and
texture information more effectively. In the inference stage,
the ECB module is converted into a single standard $3\times3$
convolution layer. Fig. 7 illustrates the fusion process. The
training process contains two stages with three steps.
I. At the first stage, the ECB module is equipped with
multiple branches.
- Pre-training on DIV2K+Flickr2K (DF2K). HR
patches of size $256\times256$ are randomly cropped from
HR images, and the mini-batch size is set to 64. The
original EFDN model is trained by minimizing the L1 loss
function with the Adam optimizer. The initial learning rate
is set to $6\times10^{-4}$ and halved every 200 epochs. The
total number of epochs is 4000.
- Fine-tuning on DF2K. The HR patch size and the
mini-batch size are set to $1024\times1024$ and 256, respectively.
The EFDN model is fine-tuned by minimizing the L2 loss
function. The initial learning rate is set to $2.5\times10^{-4}$
and halved every 200 epochs. The total number of
epochs is 4000.
II. At the second stage, the final plain EFDN model is
obtained by converting the ECB module into a single $3\times3$
convolution layer.
- Fine-tuning on DF2K. The HR patch size is set to
$1024\times1024$ and the mini-batch size is set to 256. The
final EFDN model is fine-tuned by minimizing the L2 loss
function. The initial learning rate is set to $5\times10^{-6}$
and halved every 200 epochs. The total number of
epochs is 4000.
4.4. XPixel
Figure 9. XPixel Team: The architecture of the proposed efficient
separable residual blocks (ESDB).
General method description. The XPixel team proposed
the Blueprint Separable Residual Network (BSRN)
shown in Fig. 8. Following the overall architecture of
RFDN, BSRN consists of four stages: shallow
feature extraction, deep feature extraction, multi-layer
feature fusion, and reconstruction. In the first stage,
the shallow feature extraction contains input replication
followed by a linear mapping and a depthwise convolution to
map from the input image space to a higher dimensional
feature space. Then stacked efficient separable residual
blocks (ESDB) build up the deep feature extraction to
gradually refine the extracted features. Features generated by
each ESDB are fused along the channel dimension at the
end of the trunk. Finally, the SR image is produced by the
reconstruction module, which consists only of a $3\times3$
convolution and a non-parametric sub-pixel operation.
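The non-parametric sub-pixel operation only rearranges values: $r^2$ channels of size $H \times W$ become one channel of size $rH \times rW$. The sketch below is a plain-Python illustration on nested lists. It assumes the common channel ordering in which input channel $c \cdot r^2 + i \cdot r + j$ supplies the pixel at offset $(i, j)$ of each $r \times r$ output cell; the function name `pixel_shuffle` is chosen for illustration.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x2 feature maps become one 2x4 image for r = 2.
features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
upscaled = pixel_shuffle(features, 2)
```

For the $\times$4 task considered here, a single-stage upsampler under this scheme would need $3 \cdot 4^2 = 48$ channels before the shuffle to produce an RGB output.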
Building block description. For further optimization,
the blueprint separable convolution, which is an
extremely efficient decomposition of the convolution, is adopted to
replace the regular convolution in the proposed blueprint
separable residual block (BSRB), as shown in Fig. 9. RFDN
replaces the contrast-aware channel attention (CCA) layer
with the enhanced spatial attention (ESA) block for better
performance. Yet, it has been found that channel-wise
feature rescaling is effective for shallow SR models to boost
reconstruction accuracy. Therefore, a channel weighting
layer is included in each ESDB to model channel-wise
relationships and exploit inter-dependencies among channels
at a slight additional cost.
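The saving from the decomposition can be estimated by counting weights. The sketch below assumes that a blueprint separable convolution is a $1\times1$ pointwise convolution followed by a $k \times k$ depthwise convolution, and it ignores biases; the function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a regular k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def blueprint_separable_params(c_in, c_out, k):
    """Weights of a 1x1 pointwise convolution followed by a k x k depthwise one (bias ignored)."""
    pointwise = c_in * c_out
    depthwise = c_out * k * k
    return pointwise + depthwise


standard = standard_conv_params(48, 48, 3)
separable = blueprint_separable_params(48, 48, 3)
ratio = standard / separable
```

For 48 input and output channels and a $3\times3$ kernel, the count drops from 20736 to 2736 weights under these assumptions, a reduction of roughly 7.6 times.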
Figure 10. NJUST ESR Team: The architecture of the proposed MobileSR model.
Model Details. The proposed BSRN model contains
5 ESDBs. The overall framework follows the pipeline of
RFDN. Global feature aggregation is employed at the end
of the network body to aggregate the final features, and the
number of feature channels is set to 48. Correspondingly, the
channel weighting matrix is set to $1\times1\times48$ to match the
dimension, and is initialized from a normal distribution with
$\sigma = 0.9$, $\mu = 1$.
Training strategy. Data augmentation including random
rotation by 90°, 180°, and 270° and horizontal flipping
is performed on the DIV2K and Flickr2K training images.
In each training batch, 72 LR color patches of size
$64\times64$ are extracted as inputs per GPU. The model is
trained by the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$.
The initial learning rate is set to $5\times10^{-4}$ with cosine
learning rate decay. Different from recent SR models,
L2 loss is used for training from scratch for $1\times10^{6}$
iterations. The model is implemented in PyTorch 1.9.1 and
trained with 4 GeForce RTX 2080 Ti GPUs.
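The augmentation described above must apply the same rotation and flip to an LR patch and its HR counterpart so that the pair stays aligned. A minimal sketch on single-channel nested lists follows; the helper names and the use of a seeded `random.Random` instance are assumptions for illustration.

```python
import random


def rot90(img):
    """Rotate a 2D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D nested list horizontally."""
    return [row[::-1] for row in img]


def augment_pair(lr, hr, rng):
    """Apply one random rotation (0, 90, 180, or 270 degrees) and an optional flip to both patches."""
    for _ in range(rng.randrange(4)):
        lr, hr = rot90(lr), rot90(hr)
    if rng.random() < 0.5:
        lr, hr = hflip(lr), hflip(hr)
    return lr, hr


patch = [[1, 2], [3, 4]]
rotated = rot90(patch)
lr_aug, hr_aug = augment_pair(patch, patch, random.Random(0))
```

Passing the same patch as both arguments, as in the last line, is a quick way to check that the two outputs receive identical transforms.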
4.5. NJUST ESR
The NJUST ESR team introduced a vision transformer (ViT)
based method for efficient SR, which combines the merits
of convolution and multi-head self-attention. Specifically, a
hybrid module containing a ViT block [38] and an inverted
residual block [65] is employed to simultaneously extract
local and global information. This module is stacked
multiple times to learn discriminative feature representation.
The network consists of three stages as detailed in
Fig. 10. First, a convolution layer maps the input image to
feature space. Then five hybrid blocks are stacked to learn
discriminative feature representation. Finally, several
convolution layers and pixel shuffle layers generate
the HR image.
The proposed MobileSR model is trained on the DIV2K
dataset. Input patches of size $64\times64$ are randomly
cropped from LR images, and the mini-batch size is set to
64. MobileSR is trained by minimizing the L1 loss and the
frequency loss [10] with the Adam optimizer. The initial learning
rate is set to $5\times10^{-4}$ and halved every 100 epochs for a
total of 450 epochs.
4.6. HiImageTeam
The HiImageTeam team proposed the Asymmetric Residual
Feature Distillation Network (ARFDN), inspired by
IMDN [31] and RFDN [51], for efficient SR. IMDN [31] is
an efficient network architecture for image SR. Yet, it
still contains many redundant calculations and inefficient
operators, as shown in Fig. 11a. Compared with IMDB, RFDB
is significantly more efficient in its calculations,
as shown in Fig. 11b. A shallow residual block (SRB) in
RFDN is equivalent to a normal convolution with sufficient
training. The equivalent architecture of RFDB is shown in
Fig. 11c. That is to say, there is also redundant calculation
in RFDN. An efficient attention module, namely the ESA
module from RFANet [52], is used in the enhancement stage of the
distilled information. Therefore, this module is also used
in the proposed network. As shown in Fig. 11d, an
asymmetric residual feature distillation block is designed, which
consists of both the asymmetric information distillation and
the information recombination and utilization operation.
There are 4 ARFDBs in the proposed ARFDN. The overall
framework follows the pipeline of RFDN [51], where
global feature aggregation is used to augment the final
feature and the number of feature channels is set to 50. HR
patches of size $256\times256$ are randomly cropped from
HR images during the training of ARFDN. The mini-batch
size is set to 36. The overall training process is divided into
two stages. In the first stage, the ARFDN model is trained
for 1000 epochs by minimizing the L1 loss function with the Adam
optimizer; the initial learning rate is set to $2\times10^{-4}$ and
halved every 30 epochs. In the second stage, the L2 loss function
is used to fine-tune the network with a learning rate of
$1\times10^{-4}$. The DIV2K, OST and Flickr2K datasets are used to
train the ARFDN model.
Figure 11. HiImageTeam Team: (a) The original information multi-distillation block (IMDB). (b) Residual feature distillation block
(RFDB). (c) The equivalent architecture of the RFDB (RFDB-E). (d) Asymmetric residual feature distillation block (A-RFDB).
Figure 12. rainbow Team: The architecture of improved information
multi-distillation network (IMDN+). The number of IMDB+
is 8.
Figure 13. rainbow Team: A schematic diagram of structural
re-parameterization strategy (training phase and inference phase).
4.7. rainbow
The rainbow team proposed the Improved Information
Distillation Network (IMDN+) for efficient SR, shown in Fig. 12. This
solution mainly concentrates on improving the effectiveness
of the information multi-distillation block (IMDB) [31].
Different from the original IMDB, as illustrated in Fig. 14,
the improved IMDB (IMDB+) uses 5 channel split
operations. The number of input channels is set to 36.
To improve the performance of IMDB+, structural
re-parameterization methods [13,85] are used to
replace “Conv-3” during the training phase, as shown in
Fig. 13. Although re-parameterization can improve performance
(by about 0.03 dB on the DIV2K validation set), it
increases the training time. Different from the ECB proposed
in ECBSR [85], Conv1×1-Sobel and Conv1×1-Laplacian
are removed for efficient training. This is because Sobel
and Laplacian filters are implemented by depth-wise
convolution.
Figure 14. rainbow Team: The architecture of the proposed improved
information multi-distillation block (IMDB+). Here, 36,
30, 24, 18, 12, and 6 all represent the output channels of the
convolution layer. “Conv-3” denotes the $3\times3$ convolutional layer.
Each convolution is followed by a SiLU activation function except
for the last $1\times1$ convolution.
Figure 15. Super Team: The overall network framework.
Figure 16. Super Team: (a) Feature Distillation Block (FDB). (b)
Feature Distillation Block-Small (FDB-S). (c) Re-parameterized
Block (ReB).
4.8. Super
The Super team proposed a solution mainly based
on RFDN [51], where the channel splitting operation in
IMDB [31] is replaced by $1\times1$ convolution for feature
distillation. The method differs from RFDN in three aspects:
1) Pixel Attention [90] is introduced in the network to
effectively improve the feature representation capacity. 2)
A model re-parameterization technique is adopted to expand
the capacity of the network during training while keeping
the inference computation unchanged. 3) Further compression
of the model is accomplished by reducing the size of the
first three blocks.
Framework. The team used a framework similar to
RFDN [51], as shown in Fig. 15. The Pixel Attention
Feature Distillation Network (RePAFDN) consists of four
parts: the feature extraction convolution, the stacked feature
distillation blocks with different sizes (FDB-S and FDB),
the feature fusion part, and the reconstruction block.
Given the input $x$, coarse features are first extracted as:
$$F_0 = h(x), \tag{3}$$
where $h$ denotes the feature extraction function,
implemented by a $3\times3$ convolution, and $F_0$ is the extracted
features. Next, three FDB-S and one FDB are stacked to
gradually refine the extracted features, formulated as:
$$F_k = H_k(F_{k-1}), \quad k = 1, \cdots, 4, \tag{4}$$
where $H_k$ denotes the $k$-th feature distillation block, and $F_{k-1}$
and $F_k$ represent the input feature and output feature of the
$k$-th feature distillation block, respectively. All the
intermediate features are fused by a $1\times1$ convolution and a $3\times3$
convolution. The fused features are then fed to the pixel
attention layer as:
$$F_{fused} = H_{PA}(H_f(\mathrm{Concat}(F_1, \cdots, F_4))), \tag{5}$$
where $\mathrm{Concat}$ is the concatenation operation along the
channel dimension, $H_f$ denotes the $1\times1$ convolution
followed by a $3\times3$ convolution, $H_{PA}$ is the pixel attention
layer, and $F_{fused}$ is the fused features. Finally, the output
is generated as:
$$y = \mathrm{Up}(F_{fused} + F_0), \tag{6}$$
where $\mathrm{Up}$ is the reconstruction function (i.e. a $3\times3$
convolution and a sub-pixel operation) and $y$ is the output SR
image.
Feature Distillation Block. Two variants of the Feature
Distillation Block (FDB) are designed. The primitive FDB
(Fig. 16a) is similar to RFDB except that the residual
connection is removed and the Re-parameterized Block (ReB) is
used to replace the Shallow Residual Block (SRB). As
shown in Fig. 16c, ReB contains a re-parameterized $3\times3$
convolution (ReConv3) and an LReLU function. The details of
the ReConv3 are explained in the Model Re-parameterization
paragraph. The whole structure of FDB can be described as:
$$\begin{aligned}
F_{distilled_1}, F_{remain_1} &= DL_1(F_{in}), RL_1(F_{in}), \\
F_{distilled_2}, F_{remain_2} &= DL_2(F_{remain_1}), RL_2(F_{remain_1}), \\
F_{distilled_3}, F_{remain_3} &= DL_3(F_{remain_2}), RL_3(F_{remain_2}), \\
F_{distilled_4} &= \mathrm{ReConv3}(F_{remain_3}),
\end{aligned} \tag{7}$$
where $DL_i$ is the $i$-th $1\times1$ convolution and $RL_i$ is the
$i$-th ReB. The distilled features $F_{distilled_1}, \cdots, F_{distilled_4}$
are concatenated and fed to a $1\times1$ convolution and the
ESA block [52] for further enhancement. As for the more
lightweight FDB (FDB-S), one feature distillation layer (i.e.
a $1\times1$ convolution and a ReB) is removed, as shown in
Fig. 16b.
Figure 17. Super Team: Re-parameterized $3\times3$ Convolution
(ReConv3) in the training and inference phases.
Figure 18. Super Team: Pixel Attention (PA).
Model Re-parameterization. Inspired by recent model
re-parameterization works [6,85], similar techniques are
adopted to improve model performance. Specifically, the
SRB block inside RFDB is redesigned by introducing
multi-branch convolution during training. As shown in Fig. 17,
there are two extra branches alongside the original $3\times3$
convolution: an identity shortcut
and two cascaded convolution layers of size $1\times1$ and
$3\times3$, respectively. The outputs of the three branches are
added before being fed into the activation layer, which can
be formulated as:
$$\begin{aligned}
F^{training}_{ReConv3} &= F_{in} + \mathrm{Conv}^{1}_{3\times3}(F_{in}) + \mathrm{Conv}^{2}_{3\times3}(\mathrm{Conv}_{1\times1}(F_{in})), \\
F^{inference}_{ReConv3} &= \mathrm{Conv}^{rep}_{3\times3}(F_{in}),
\end{aligned} \tag{8}$$
where $F_{in}$ represents the input feature, and $\mathrm{Conv}^{rep}_{3\times3}$
represents the re-parameterized $3\times3$ convolution. Since the
operations of the three branches are completely linear, the
re-parameterized architecture can be equivalently converted to a
single $3\times3$ convolution for inference. In the
experiments, the re-parameterization technique helps improve
the PSNR of small models by 0.02 dB.
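The equivalence in Eq. (8) can be checked numerically. The sketch below is a simplified illustration rather than the team's code: it assumes a single channel, no biases, and zero padding, so the $1\times1$ convolution reduces to a scalar weight `a` and the identity shortcut becomes a $3\times3$ kernel with a one at its centre. All names and kernel values are hypothetical.

```python
def conv3x3(x, kernel):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding and no bias."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(3):
                for dc in range(3):
                    rr, cc = r + dr - 1, c + dc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr][dc] * x[rr][cc]
            out[r][c] = acc
    return out


def merge_reconv3(k1, k2, a):
    """Merge identity + conv3x3(k1) + conv3x3(k2) applied after a scalar 1x1 conv a."""
    merged = [[k1[i][j] + a * k2[i][j] for j in range(3)] for i in range(3)]
    merged[1][1] += 1.0  # identity shortcut as a centred 3x3 kernel
    return merged


x = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.0, 1.0, -0.5]]
k1 = [[0.1, 0.0, -0.2], [0.3, 0.5, 0.1], [0.0, -0.4, 0.2]]
k2 = [[-0.3, 0.2, 0.1], [0.0, 0.6, -0.1], [0.4, 0.0, 0.2]]
a = 0.7

# Training-time output: the three branches are summed.
scaled = [[a * v for v in row] for row in x]
b1, b2 = conv3x3(x, k1), conv3x3(scaled, k2)
training = [[x[r][c] + b1[r][c] + b2[r][c] for c in range(3)] for r in range(3)]

# Inference-time output: one merged 3x3 convolution.
inference = conv3x3(x, merge_reconv3(k1, k2, a))
max_diff = max(abs(training[r][c] - inference[r][c]) for r in range(3) for c in range(3))
```

In the multi-channel case the same idea holds, with the scalar product `a * k2` replaced by a contraction of the $1\times1$ weights with the $3\times3$ kernel over the intermediate channels.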
Pixel Attention. Inspired by PAN [90], pixel attention
is used to more effectively generate features for the final
reconstruction block. Specifically, a $1\times1$ convolution
followed by a Sigmoid function is responsible for generating
a 3D attention coefficient map for all pixels of the feature
map (shown in Fig. 18). The PA layer can be formulated as:
$$F_{PA} = PA(F_{in}) \cdot F_{in}. \tag{9}$$
Unlike PAN, PA is not introduced into FDBs since PA is not
runtime friendly. For a similar reason, the PA is
conducted in low-resolution space to save computation, while
the PA in the UPA of PAN is conducted at higher resolution.
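Eq. (9) amounts to an element-wise gate computed from the feature itself. The following sketch operates on a $C \times H \times W$ nested list and assumes the $1\times1$ convolution keeps the channel count unchanged; the function name, argument layout, and example values are illustrative assumptions.

```python
import math


def pixel_attention(features, weights, bias):
    """features: C x H x W nested list; weights: C x C matrix of the 1x1 conv; bias: length C."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            pixel = [features[c][h][w] for c in range(channels)]
            for o in range(channels):
                logit = sum(weights[o][c] * pixel[c] for c in range(channels)) + bias[o]
                attention = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
                out[o][h][w] = attention * features[o][h][w]
    return out


features = [[[2.0, 4.0]], [[-1.0, 6.0]]]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
gated = pixel_attention(features, zero_weights, [0.0, 0.0])
```

With zero weights and biases every attention coefficient equals 0.5, so the output is the input scaled by one half, which makes a convenient sanity check.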
The channel number used in the model is 48. For
$DL_1, \cdots, DL_3$ in FDB, the number of output channels is
12. For $DL_1$ and $DL_2$ in FDB-S, the number of output
channels is 24. DIV2K and Flickr2K are used as the training
set. For the first training stage, patches of size $128\times128$
are cropped from the LR images as inputs. For the second
training stage, patches of size $160\times160$ are cropped from
the LR images as inputs.
4.9. MegSR
The MegSR team proposed PFDNet, a light-weight
network for efficient super-resolution. Previous works such
as IMDN [31] and RFDN [51] introduce novel network
blocks, which are variants of the feature distillation block, and
achieve favorable performance. Unlike these works, the
team proposed to tackle this problem with pruning
strategies. Although network pruning techniques are
widely used in high-level tasks, such as image classification
and segmentation, their applications to low-level tasks
are rare. A recent work, ASSL [89], proposes a pruning
scheme for residual networks in the SR task, showing that
network pruning is effective. Inspired by RFDN
and ASSL, the team explored how to combine pruning and
the feature distillation network.
Specifically, the method contains two stages: a training
stage and a fine-tuning stage.
Training stage: In this stage,
the original architecture of RFDN is first trained to obtain
a pretrained model. Then, the model is reparameterized to
remove as many residual addition operators as possible.
When pruning the features of a network, the indices of the
features retained in different layers may differ. Thus,
it is not reasonable to add up features with different
indices. To solve this problem, weight normalization (WN)
is applied to all convolutions except for those in the ESA [52]
layers. The learnable parameters $\gamma$ of WN indicate the
importance of features. Finally, the new model is trained
with $\ell_1$ loss to maintain the performance, while a
regularization term is used to force the unimportant weights to
converge to zero.
Fine-tuning stage: After the training stage,
the weights in the model are pruned according to the values
of the parameters $\gamma$. Note that the remaining convolutional
layers, to which WN is not applied, are pruned according to $\gamma$
of the previous layers. The parameter $\gamma$ can be fused into
the network weights during inference. Thus, using WN does not
increase the computational cost. After pruning, the pruned
model is fine-tuned with $\ell_1$ loss in the first 300,000
iterations and with $\ell_2$ loss in the last 100,000 iterations.
Figure 19. MegSR Team: (a) The basic block of RFDN. (b) Reparameterization for identity connection and applying weight normalization
on convolutional layers. (c) Pruning the unimportant weight.
Reparameterization. Denote a feature as $X$ and the
weight of a convolution as $W$. The following equation holds:
$$WX + X = (W + I)X, \tag{10}$$
where $I$ is the identity matrix. As depicted in Fig. 19(b),
all the skip-connections are removed from RFDB without
degradation.
Weight Normalization. Weight Normalization includes
learnable parameters which can tell the importance of
weights, as shown in Fig. 19(c):
$$\hat{W}_i = \frac{W_i}{\|W_i\|_2}, \quad W_i = \gamma_i \hat{W}_i, \quad \text{for } i \in \{1, 2, \cdots, N\}, \tag{11}$$
where $W \in \mathbb{R}^{N \times C \times H \times W}$ represents the 4-dimensional
convolutional kernel, and $\gamma \in \mathbb{R}^{N}$ stands for the
1-dimensional trainable scale parameters in WN.
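Eq. (11) and the pruning rule of Fig. 19(c) can be sketched as follows. Each filter is flattened to a list, rescaled to unit $\ell_2$ norm, and multiplied by its scale $\gamma_i$, after which filters whose scale falls below a threshold are dropped. The threshold value, the use of the absolute value of $\gamma_i$, and the function names are assumptions made for illustration, and filters are assumed to be non-zero.

```python
import math


def weight_normalize(filters, gammas):
    """Rescale each flattened, non-zero filter to unit L2 norm and multiply by its gamma."""
    normalized = []
    for w, g in zip(filters, gammas):
        norm = math.sqrt(sum(v * v for v in w))
        normalized.append([g * v / norm for v in w])
    return normalized


def prune_by_gamma(filters, gammas, threshold):
    """Keep only the filters whose scale magnitude reaches the (assumed) threshold."""
    keep = [i for i, g in enumerate(gammas) if abs(g) >= threshold]
    return [filters[i] for i in keep], keep


filters = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
gammas = [2.0, 0.01, 0.5]
scaled = weight_normalize(filters, gammas)
kept, kept_indices = prune_by_gamma(scaled, gammas, threshold=0.1)
first_norm = math.sqrt(sum(v * v for v in scaled[0]))
```

After normalization the norm of each filter equals the magnitude of its scale, which is why $\gamma$ alone can serve as the importance score; since the scaled filters are ordinary weights, nothing extra is computed at inference.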
Loss Function. During training, the model is trained
with $\ell_1$ loss and the following penalty term:
$$L_{SI} = \alpha \sum_{l=1}^{L} \sum_{i \in S^{(l)}} \gamma_i^2, \tag{12}$$
where $\alpha$ is the scalar loss weight, $\gamma_i$ denotes the $i$-th element
of $\gamma$, and $S^{(l)}$ represents the unimportant filter index set in
the $l$-th layer.
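The penalty in Eq. (12) is a sum of squared scales over the filters marked as unimportant in each layer. A minimal sketch follows; how the index sets $S^{(l)}$ are chosen is not specified here, so they are passed in as given, and all names and values are hypothetical.

```python
def sparsity_penalty(gammas_per_layer, unimportant_per_layer, alpha):
    """Eq. (12): alpha times the sum of gamma_i^2 over the unimportant filters of every layer."""
    total = 0.0
    for gammas, unimportant in zip(gammas_per_layer, unimportant_per_layer):
        total += sum(gammas[i] ** 2 for i in unimportant)
    return alpha * total


gammas_per_layer = [[1.0, 0.5, 0.25], [2.0, 0.125]]
unimportant_per_layer = [[1, 2], [1]]  # assumed index sets S^(l)
penalty = sparsity_penalty(gammas_per_layer, unimportant_per_layer, alpha=0.5)
```

Only the selected scales are penalized, so the important filters are left free to fit the $\ell_1$ reconstruction objective.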
Figure 20. VMCL Taobao Team: The architecture of the proposed
multi-scale information distillation block (MSDB).
4.10. VMCL Taobao
To improve the representation capacity of information
flows in IMDN, the VMCL Taobao team proposed a
Multi-Scale information Distillation Network (MSDN) for
efficient super-resolution, which stacks a group of multi-scale
information distillation blocks (MSDBs). In particular,
inspired by RFDN, a $1\times1$ convolution is used for information
distillation and a $3\times3$ convolution is used for feature
refinement to alleviate the limitation of the channel splitting
operation in IMDB. As shown in Fig. 20, in the $l$-th MSDB, a
multi-scale feature refinement module (marked with green
dotted boxes) is used to replace the $3\times3$ convolution of
RFDB. In Fig. 20, an upsampling refinement module with
a scale factor of 2 is designed. A $1\times1$ convolution is
used for channel expansion and a $3\times3$ convolution with two
groups is used for feature refinement, which has an $sh \times sh$
receptive field to capture a larger region of neighbors and
acts equivalently on an upsampled feature. Then, a single
$3\times3$ convolution is used for identical refinement as done in
RFDB. Last, a dilated $3\times3$ convolution with a dilation rate
of 2 is employed for downsampling refinement, which has an
$(h/s)\times(h/s)$ receptive field and acts equivalently on a
downsampled feature. By applying the multi-scale feature
refinement, multi-scale information of the input features can
be captured with fewer computations. Moreover, the Large
Kernel Attention (LKA) [24] is introduced to enhance the
features by capturing a larger receptive field.
Figure 21. Bilibili AI Team: The architecture of Re-parameterized Residual Feature Distillation Network (Rep-RFDN).
Figure 22. Bilibili AI Team: The architecture of the proposed
RepBlock (RB).
4.11. Bilibili AI
The Bilibili AI team used the Re-parameterized Residual
Feature Distillation Network (Rep-RFDN) shown in
Fig. 21. Different from the original RFDN [51], all $3\times3$
convolutional layers (except those in the ESA block [51])
are replaced by RepBlocks (RBs) in the training stage. During
the inference stage, the RepBlocks are converted into single
$3\times3$ convolutional layers. Inspired by ECB [85] and
ACB [12], $3\times1$ Conv and $1\times3$ Conv sub-branches are
added to the original ECB (Fig. 22). The number of feature
channels is set to 40, while in the original RFDN50
version it is set to 50.
Figure 23. NKU-ESR Team: Illustration of the re-parameterization
method. (a) Revisiting re-parameterizable topology. (b) Proposed
edge-enhanced diverse branch block (EDBB).
4.12. NKU-ESR
The NKU-ESR team proposed an edge-enhanced feature
distillation network, named EFDN, to preserve
high-frequency information through the joint design of the network
and the loss. In detail, an edge-enhanced convolution
block is built by revisiting the existing reparameterization
methods. The backbone of the EFDN is searched by neural
architecture search (NAS) to improve the baseline
performance. Meanwhile, an edge-enhanced gradient loss is
proposed to calibrate the training of the reparameterized block.
Edge-enhanced diverse branch block. Fig. 23a presents
the details of the RepVGG block, DBB, and ECB.
In total, eight different structures have been
designed to improve the feature extraction ability of the
vanilla convolution in different scenarios. Although
performance may be higher with more re-parameterizable
branches, directly integrating all of these paths incurs an
unaffordable training cost. Another problem
is that edge and structure information may be attenuated
during the merging of parallel branches.
Figure 24. NKU-ESR Team: Network architecture of the proposed EFDN.
To address the above concerns, a more delicate and
effective reparameterization block is built, namely Edge-
enhanced Diverse Branch Block (EDBB), which can extract
and preserve high-level structural information for the low-
level task. As illustrated in Fig. 23b, the EDBB consists of
seven branches of single convolutions and sequential con-
volutions.
Network architecture. Following IMDN [31] and RFDN [51], an EFDN is devised to reconstruct high-quality SR images with sharp edges and clear structure under restricted resources. As illustrated in Fig. 24, the EFDN consists of a shallow feature extraction module, multiple edge-enhanced feature distillation blocks (EFDBs), and an upscaling module. Specifically, a single vanilla convolution is leveraged to generate the initial feature maps.
This coarse feature is then sent to the stacked EFDBs for further refinement. In detail, the shallow residual block in [51] is replaced by the proposed EDBB to construct the EFDB. Unlike IMDN and RFDN, which use global distillation connections to process input features progressively, neural architecture search (NAS) [30] is adopted to decide the feature connection paths. The searched structure is shown in the orange dashed box. Finally, the SR images are generated by the upscaling module.
Edge-enhanced gradient-variance loss. In previous work [48], the $L_1$ and $L_2$ losses have been commonly used to obtain higher evaluation indicators. Networks trained with these loss functions often lose structural information. Although edge-oriented components are added into the EDBB, it is hard to ensure their effectiveness during the complex training procedure of seven parallel branches. Inspired by the gradient variance (GV) loss [1], an edge-enhanced gradient-variance (EG) loss is proposed, which utilizes the filters of the EDBB to monitor the optimization of the model. In detail, the HR image $I^{HR}$ and the SR image $I^{SR}$ are converted to gray-scale images $G^{HR}$ and $G^{SR}$. The Sobel and Laplacian filters are leveraged to compute the gradient maps, which are then unfolded into $\frac{HW}{n^2} \times n^2$ patches $G_x$, $G_y$, $G_l$. The $i$-th variance map can be formulated as:

$$v_i = \frac{\sum_{j=1}^{n^2} \left(G_{i,j} - \bar{G}_i\right)^2}{n^2 - 1} \tag{13}$$

where $\bar{G}_i$ is the mean value of the $i$-th patch. Thus, the variance metrics $v_x$, $v_y$, $v_l$ of the HR and SR images can be calculated, respectively. Referring to the GV loss, the gradient variance loss of each filter can be obtained by:

$$\begin{aligned}
L_x &= \mathbb{E}_{I^{SR}} \left\| v_x^{HR} - v_x^{SR} \right\|_2, \\
L_y &= \mathbb{E}_{I^{SR}} \left\| v_y^{HR} - v_y^{SR} \right\|_2, \\
L_l &= \mathbb{E}_{I^{SR}} \left\| v_l^{HR} - v_l^{SR} \right\|_2.
\end{aligned} \tag{14}$$
Besides, the $L_1$ loss is added to accelerate convergence and improve the restoration performance. In order to better optimize the edge-oriented branches of the EDBBs and preserve sharp edges for visual effects, the coefficients $\lambda_x$, $\lambda_y$, and $\lambda_l$ are traded off; they are related to the scaled parameters of the corresponding branches. The total loss can be expressed as:

$$L = L_1 + \lambda_x L_x + \lambda_y L_y + \lambda_l L_l \tag{15}$$
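The EG loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the filter coefficients are the common textbook Sobel and Laplacian kernels, patches are non-overlapping, the $L_1$ term is computed on the gray-scale images, and the names `filter2d`, `patch_variances`, `gv_term`, and `eg_loss` as well as the default $\lambda$ values are hypothetical.

```python
import math

# Common 3x3 edge filters (assumed; the exact coefficients are not given in the text).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

def filter2d(img, kernel):
    """Apply a 3x3 filter with zero padding; the output has the input size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out

def patch_variances(grad, n):
    """Unbiased variance of every non-overlapping n x n patch (Eq. 13)."""
    h, w = len(grad), len(grad[0])
    variances = []
    for y0 in range(0, h - n + 1, n):
        for x0 in range(0, w - n + 1, n):
            patch = [grad[y][x] for y in range(y0, y0 + n) for x in range(x0, x0 + n)]
            mean = sum(patch) / len(patch)
            variances.append(sum((g - mean) ** 2 for g in patch) / (n * n - 1))
    return variances

def gv_term(gray_hr, gray_sr, kernel, n):
    """L2 distance between the HR and SR variance maps of one filter (Eq. 14)."""
    v_hr = patch_variances(filter2d(gray_hr, kernel), n)
    v_sr = patch_variances(filter2d(gray_sr, kernel), n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_hr, v_sr)))

def eg_loss(gray_hr, gray_sr, lambdas=(1.0, 1.0, 1.0), n=2):
    """L1 term plus weighted gradient-variance terms (Eq. 15) for one image pair."""
    h, w = len(gray_hr), len(gray_hr[0])
    l1 = sum(abs(a - b) for row_h, row_s in zip(gray_hr, gray_sr)
             for a, b in zip(row_h, row_s)) / (h * w)
    terms = [gv_term(gray_hr, gray_sr, k, n) for k in (SOBEL_X, SOBEL_Y, LAPLACIAN)]
    return l1 + sum(lam * t for lam, t in zip(lambdas, terms))
```

The expectation over $I^{SR}$ in Eq. 14 would correspond to averaging `gv_term` over a batch; the sketch evaluates a single image pair.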
4.13. NJUST RESTORARION
The NJUST RESTORARION team proposed the Adaptive Feature Distillation Network (AFDN) for lightweight image SR. The proposed AFDN, shown in Fig. 25, is modified from RFDN [51] with minor improvements. AFDN uses 4 AFDBs as the building blocks, and the overall framework follows the pipeline of RFDN.
Figure 25. NJUST RESTORARION Team: The overall architecture of the AFDN.
Figure 26. NJUST RESTORARION Team: Adaptive Fusion Distillation Block. (a) AFDB (b) Adaptive Fusion Block
As illustrated in Fig. 25, AFDN uses an Adaptive Fusion Block (AFB), which fuses features more efficiently. The AFB splits the feature in half. Each branch uses “Conv 3-LeakyReLU-Conv 3” to learn an adaptive attention matrix. The AFB then multiplies the feature with the attention matrix. Finally, it concatenates the features of the two branches.
4.14. TOVBU
Method details. On the basis of the Residual Feature Distillation Network, the team proposed an efficient Faster Residual Feature Distillation Network (FasterRFDN) for single image super-resolution. The overall framework of the proposed method is shown in Fig. 27 and Fig. 28. It contains 4 faster residual feature distillation blocks (FRFDBs). First, to further reduce the parameters and computational complexity of the FRFDB module, the number of channels of the layered distillation is compressed. The number of channels in each layer from top to bottom is 64, 32, 16, 16, respectively. These distillation features are extracted by three 1×1 and one 3×3 convolutional filters. Then, these features are concatenated along the channel dimension and fed to enhanced spatial attention (ESA). Furthermore, in order to enhance the model's representation power, the number of channels of the model is increased to 64.
Figure 27. TOVBU Team: Overall framework of the faster feature distillation network (FasterRFDN).
Figure 28. TOVBU Team: Faster feature distillation block (FRFDB).
Training strategy. The training procedure can be divided into three stages.
1. Pretraining on DIV2K and Flickr2K (DF2K). HR patches of size 256×256 are randomly cropped from HR images, and the mini-batch size is set to 64. The model is trained by minimizing the L1 loss function with the Adam optimizer. The initial learning rate is set to $5 \times 10^{-4}$ and halved every 200k iterations. The total number of iterations is 1,600k.
2. Fine-tuning on DF2K. The HR patch size and the mini-batch size are set to 512×512 and 64, respectively. The model is fine-tuned by minimizing the PSNR loss function. The initial learning rate is set to $5 \times 10^{-5}$ and halved every 80k iterations. The total number of iterations is 480k.
3. Fine-tuning on DF2K again. The HR patch size and the mini-batch size are set to 640×640 and 16, respectively. The model is fine-tuned by minimizing the L2 loss function. The initial learning rate is set to $1 \times 10^{-5}$ and a cosine learning rate schedule is used.
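The learning-rate schedules of the three stages can be written down as small functions. The sketch below is illustrative only: the function names are hypothetical, and because the text does not state the length of the third stage or the minimum learning rate of the cosine schedule, `total_iters` and `min_lr` are assumed parameters.

```python
import math

def step_halving_lr(iteration, base_lr, halve_every):
    """Learning rate that is halved every `halve_every` iterations (stages 1 and 2)."""
    return base_lr * 0.5 ** (iteration // halve_every)

def cosine_lr(iteration, total_iters, base_lr, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (stage 3; total_iters and min_lr are assumed)."""
    progress = min(iteration, total_iters) / total_iters
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Stage 1: 5e-4, halved every 200k iterations, 1,600k iterations in total.
stage1_final = step_halving_lr(1_599_999, 5e-4, 200_000)  # 7 halvings so far
# Stage 2: 5e-5, halved every 80k iterations, 480k iterations in total.
stage2_final = step_halving_lr(479_999, 5e-5, 80_000)     # 5 halvings so far
```

Under these settings the first stage ends at $5 \times 10^{-4} / 2^7$ and the second at $5 \times 10^{-5} / 2^5$.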
Figure 29. Alpan Team: The building block of the SR model consists of three 3×3 convolutions and three ESA blocks with 16 channels (one ESA block after each convolution), followed by concatenation of the input and the 3 outputs of the ESA blocks. Then come a 1×1 convolution and an ESA block, exactly the same as in RFDB.
4.15. Alpan
The Alpan team proposed a method based on RFDN [51] in the following steps: 1) Rethinking of RFDB [51]. 2) Efficiency and PSNR trade-off for the ESA [51] block and convolution. 3) Fine-tuning width and depth.
The team's first observation was that the ESA [51] block is efficient and significantly improves the results. Thus, the team placed an ESA block after each 3×3 convolution in RFDB [51]. All distillation convolutions from RFDB [51] (three 1×1 convolutions and one 3×3 convolution) are removed and the number of RFDB [51] blocks is reduced from 4 to 3 to keep the same inference time. These changes have the following effects: 1) PSNR goes up from 29.04 to 29.05 on the DIV2K validation set. 2) The number of parameters is reduced from 0.433M to 0.366M. 3) FLOPS is reduced from 1.69G to 1.256G.
The team’s next observation was that in the modiﬁed
RFDB 75% parameters and more than 90% FLOPS be-
longed to convolutions outside of ESA [51] blocks. So the
team decided to re-balance the number of channels in ESA
block. Speciﬁcally, the overall number of channels in the
model is reduced from 50 to 44 but the number of channels
in ESA blocks is increased from 12 to 16. All these changes
have the following effect: 1) PSNR is almost the same on
DIV2K validation set. 2) The number of parameters is re-
duced from 0.366M to 0.356M. 3) FLOPS is reduced from
1.256G to 1.034G.
In most of the team's experiments, deeper models are better than wider models at the same efficiency. So the team decided to reduce the overall number of channels from 44 to 32 while keeping 16 channels in the ESA blocks, and to increase the number of modified RFDB blocks from 3 to 4. This leads to a significant reduction in FLOPS and a small reduction in parameters in the final model: 1) Number of parameters: 0.326M. 2) FLOPS: 0.767G.
The final SR model consists of 4 modified RFDB blocks with 32 channels. All other parts of the model not mentioned above are the same as in RFDN. The modified RFDB block is shown in Fig. 29.
Figure 30. xilinxSR Team: An overview of the basic IMDN architecture.
Figure 31. xilinxSR Team: Structural re-parameterization of a collapsible block.
4.16. xilinxSR
The overview of IMDN is illustrated in Fig. 30. It is a lightweight information multi-distillation network composed of cascaded information multi-distillation blocks (IMDBs). Specifically, it adopts a series of IMDBs (default 8) and a traditional upsampling layer (pixel shuffle) for high-resolution image restoration.
Figure 32. cipher Team: The architecture of the residual distillation network (ResDN).
Figure 33. cipher Team: The architecture of residual distillation block (ResDB). 1×1 convolution is used to expand the channels for information distillation. The color branch transmits the distilled features to later residual blocks. The number of distillation channels is 16.
Network Pruning. Based on IMDN, network pruning is first performed. Tab. 2 provides models of different sizes and the corresponding accuracy. To achieve a trade-off between accuracy and runtime, the number of IMDB blocks is reduced from 8 to 7 as the baseline.

| Method | #IMDB | Params | PSNR |
| --- | --- | --- | --- |
| target | - | - | 29.00dB |
| IMDN | 8 | 0.8939M | 29.13dB |
| IMDN | 7 | 0.7905M | 28.97dB |
| IMDN | 6 | 0.6871M | 28.93dB |
| IMDN | 5 | 0.5836M | 28.91dB |
| IMDN | 4 | 0.4802M | 28.85dB |

Table 2. Comparison of IMDN with different IMDB numbers and corresponding accuracies on DIV2K validation.
Collapsible Block. Inspired by SESR [6], the team applied a collapsible block to improve the accuracy of the pruned IMDN. Specifically, as shown in Fig. 31, a dense block is adopted to enhance the representation during training. Each 3×3 convolution in IMDB is replaced by a 3×3 convolution and a 1×1 convolution. These two convolutions are conducted in parallel and their outputs are summed. During inference, the two parallel convolutions are converted to one 3×3 convolution.
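Why the two parallel branches collapse into one 3×3 convolution follows from linearity: a 1×1 kernel is a 3×3 kernel whose only non-zero entry is the center. The sketch below illustrates this for a single input and output channel without bias, which is a simplifying assumption; in a multi-channel layer the same addition would be applied to every (output, input) kernel pair and the biases would be summed. The function names are hypothetical.

```python
def conv2d_same(img, kernel):
    """Single-channel convolution (cross-correlation) with zero padding."""
    h, w, k = len(img), len(img[0]), len(kernel)
    pad = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out

def parallel_branches(img, k3, k1):
    """Training-time form: a 3x3 branch and a 1x1 branch whose outputs are summed."""
    a = conv2d_same(img, k3)
    b = conv2d_same(img, [[k1]])
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def merge_3x3_and_1x1(k3, k1):
    """Inference-time form: add the 1x1 weight to the center of the 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1
    return merged
```

Running `conv2d_same(img, merge_3x3_and_1x1(k3, k1))` gives the same result as `parallel_branches(img, k3, k1)`, so the extra branch adds no cost at inference time.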
Training strategy. The network was trained on DIV2K with Flickr2K as the extra dataset. The training patch size is progressively increased from 64×64 to 128×128 to improve the performance. The batch size is 32 and the number of epochs is 500. The network is trained by minimizing the L1 loss with the Adam optimizer and a dynamic learning rate ranging from $2 \times 10^{-4}$ to $1 \times 10^{-5}$. Data augmentation, such as rotation and horizontal flip, is applied.
4.17. cipher
The cipher team proposed an end-to-end residual distillation network (ResDN) for lightweight image SR. As shown in Fig. 32, the proposed ResDN consists of three parts: the head, trunk, and tail.
The trunk consists of four ResDBs and one BFM. After the coarse features $F_0$ are obtained, the four ResDBs extract intermediate features in turn, namely

$$F_i = H^{i}_{ResDB}(F_{i-1}), \quad i = 1, 2, 3, 4, \tag{16}$$

where $H^{i}_{ResDB}(\cdot)$ denotes the function of the $i$-th ResDB, and $F_i$ represents the intermediate features extracted by the $i$-th ResDB.
As shown in Fig. 32, the $F_i$ ($i = 1, 2, 3, 4$) are aggregated in the BFM: the feature dimension is first halved by a 1×1 convolution followed by the ReLU activation function (omitted in Fig. 32), and then sequential concatenations are applied. This can be formulated as

$$T_i = \begin{cases} F_i, & i = 4, \\ \mathrm{Concat}\left(\mathrm{Conv1}(F_i), \mathrm{Conv1}(T_{i+1})\right), & i = 3, 2, 1, \end{cases} \tag{17}$$

where $\mathrm{Concat}(\cdot)$ and $\mathrm{Conv1}(\cdot)$ denote the concatenation operation along the channel dimension and the 1×1 convolution, respectively. Through this sequential fusion, different hierarchical features can be used more fully. Finally, the coarse features $F_0$ are transmitted by a residual connection to generate the deep features $F_d$.
As shown in Fig. 33, the body of the ResDB is a stack of several residual blocks (RBs). There is a PReLU activation function in front of each convolution layer, and a learnable parameter is set for each channel in PReLU. In each RB, a 1×1 convolution is first used to expand the channel dimension for the convenience of distillation. Suppose there are $K$ RBs in total, the input of the $k$-th RB is $F^{k-1}_{res}$ with $c$ channels, and the number of distilled feature channels is $d$. In the $k$-th RB, the intermediate feature obtained by the 1×1 convolution can be expressed as

$$F^{k}_{inter} = H^{c+(K-k)d}_{Conv1}\left(\delta\left(F^{k-1}_{res}\right)\right) \tag{18}$$

where $\delta$ denotes the PReLU activation function and $H^{c+(K-k)d}_{Conv1}$ denotes the 1×1 convolution with $c + (K-k)d$ convolutional kernels. Then, the intermediate features are split along the channel axis: each group of distilled features with $d = 16$ channels flows to the later RBs, and the retained features with $c = 48$ channels flow to the 3×3 convolution for further refinement. Moreover, at the beginning of each residual branch, the distilled features from the previous RBs are concatenated with the input feature of the current RB. Finally, ESA and a skip connection are used to generate the output features of the ResDB.
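The channel bookkeeping implied by Eq. 18 can be tabulated with a few lines of Python. This is only a sketch: $c = 48$ and $d = 16$ come from the text, while the number of RBs (`num_rbs=3`) is an assumed value because the text only says "several", and the function name is hypothetical.

```python
def resdb_channel_plan(c=48, d=16, num_rbs=3):
    """Output width of the 1x1 convolution in each RB and how it is split."""
    plan = []
    for k in range(1, num_rbs + 1):
        distilled_groups = num_rbs - k           # one d-channel group per later RB
        plan.append({
            "rb": k,
            "conv1_out": c + distilled_groups * d,   # c + (K - k) d kernels, Eq. 18
            "retained": c,                           # goes on to the 3x3 convolution
            "distilled_groups": distilled_groups,
        })
    return plan

for entry in resdb_channel_plan():
    print(entry)
```

With the assumed $K = 3$, the 1×1 convolutions produce 80, 64, and 48 channels, so the last RB distills nothing and only refines the retained features.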
4.18. NJU MCG
Inspired by WDSR [80], the residual feature expansion block (RFEB) in Fig. 34a is used as the basic unit of the feature distillation and expansion network (FDEN). FDEN adopts the architecture of RFDN [51] except that the basic unit is replaced by RFEB and the attention mechanism is replaced by the LapSA module in Fig. 34c. The LapSA module applies a scale transformation to each spatial position of the input features. To achieve this goal, it needs a global assessment so that it can assign different scaling factors to different positions according to their spatial importance. Specifically, for the task of image super-resolution, the network should concentrate more on the high-frequency regions that are usually difficult to recover because of their complex details. As a consequence, the LapSA module is implemented to contain a Laplacian pyramid (Fig. 34b), which has a large receptive field and can also extract high-frequency details. This process can be formulated as

$$\begin{aligned}
G_1 &= f_G(G_0; \theta_{G_1}), & L_1 &= G_0 - f_{UP}(G_1), \\
G_2 &= f_G(G_1; \theta_{G_2}), & L_2 &= G_1 - f_{UP}(G_2), \\
G_3 &= f_G(G_2; \theta_{G_3}), & L_3 &= G_2 - f_{UP}(G_3).
\end{aligned} \tag{19}$$

Figure 34. NJU MCG Team: The proposed solution. (a) Residual Feature Expansion Block (RFEB). (b) Laplacian Pyramid (LapPyra). (c) Laplacian Attention (LapSA).
As depicted in Fig. 34b, $f_G$ denotes the downsampling function, which consists of a pooling layer and a 3×3 convolutional layer. $f_{UP}$ is the interpolation function that upsamples the input feature. $L_1$, $L_2$, and $L_3$ are the output features of the three pyramid levels, respectively. The extracted features $L_j$, $j \in \{1, 2, 3\}$, contain high-frequency information and are used in the LapSA module (Fig. 34c) to generate the scaling factors.
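The pyramid recursion of Eq. 19 can be sketched on a single-channel map as follows. Several simplifications are assumed here: $f_G$ is reduced to 2×2 average pooling (the learned 3×3 convolution with parameters $\theta_{G}$ is omitted), $f_{UP}$ is nearest-neighbour upsampling, and the input height and width are divisible by 8. The function names are hypothetical.

```python
def avg_pool2(feat):
    """2x2 average pooling with stride 2 (stand-in for f_G without its 3x3 conv)."""
    h, w = len(feat), len(feat[0])
    return [[(feat[2 * y][2 * x] + feat[2 * y][2 * x + 1]
              + feat[2 * y + 1][2 * x] + feat[2 * y + 1][2 * x + 1]) / 4
             for x in range(w // 2)] for y in range(h // 2)]

def upsample2(feat):
    """Nearest-neighbour upsampling by a factor of 2 (stand-in for f_UP)."""
    h, w = len(feat), len(feat[0])
    return [[feat[y // 2][x // 2] for x in range(2 * w)] for y in range(2 * h)]

def subtract(a, b):
    """Element-wise difference of two maps of the same size."""
    return [[u - v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def laplacian_pyramid(g0, levels=3):
    """Return ([G_0..G_levels], [L_1..L_levels]) following Eq. 19."""
    gaussians, laplacians = [g0], []
    for _ in range(levels):
        g_prev = gaussians[-1]
        g_next = avg_pool2(g_prev)
        laplacians.append(subtract(g_prev, upsample2(g_next)))  # high-frequency residue
        gaussians.append(g_next)
    return gaussians, laplacians
```

Each $L_j$ keeps only what the coarser level cannot represent, which is why a flat input produces all-zero Laplacian levels.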
As shown in Fig. 34c, the first 1×1 convolution is used to reduce the channel dimension and the last 1×1 convolution is used to recover it. The middle 1×1 convolution aggregates the pyramid features, and then the sigmoid function is applied to generate the final scaling factors. $L_1$ is concatenated with the scaled features $F_{LapSA}$ to augment the output features $F^{*}_{LapSA}$ as

$$F^{*}_{LapSA} = f_{Conv1}([F_{LapSA}, L_1]). \tag{20}$$

The number of filters (nf) in FDEN is set to 29. The proposed FDEN is trained with the same training setting as RFDN. The DIV2K and Flickr2K datasets are adopted as the training dataset.
Figure 35. IMGWLH Team: The architecture of RLCSR network.
Figure 36. IMGWLH Team: The architecture of LAM module.
Figure 37. IMGWLH Team: The architecture of CCM module.
4.19. IMGWLH
The IMGWLH team proposed the RLCSR network. The overall network structure is shown in Fig. 35. A 3×3 convolutional layer is first used to extract shallow features from low-resolution images. Six local residual feature fusion block (RFDB+) modules are then stacked to perform deep feature extraction on the shallow features. RFDB+ is an improved version of RFDB [51]. Without increasing the number of parameters of the ESA module, more skip connections are included to better retain useful information, and dilated convolution is used to expand the receptive field for preserving more texture details.
In order to produce compact features, a CCM module is introduced to fuse the intermediate features from several RFDB+ modules. The module is developed based on the MBFF module [58] and the backward fusion model [57]. The detailed structure of the CCM is illustrated in Fig. 37. It can be observed that different levels of features from several RFDB+ modules are gradually fused into a single feature map by the proposed CCM. The channel shuffle operation and the 1×1 convolution kernel integrate the features of all basic residual blocks, which helps to extract more contextual information in a compact manner.
To further enhance the intermediate features produced
by RFDB+, LAM [59] module is used to introduce an at-
tention mechanism for adaptively selecting representative
features from multiple intermediate features. One can refer
to Fig. 36 for the structure of LAM.
However, the collected features could still be full of re-
dundancy. To address the issue, a weight is assigned to the
feature map through the Hadamard multiplication of chan-
nel attention and pixel attention. To produce features for
high-resolution reconstruction, a long-term skip connection
is included, with which the deep features are added to the
shallow features that are extracted at the beginning of the
network.
Figure 38. imgwhl team: The network structure of the proposed method.
Figure 39. imgwhl team: (a) Structure of AAWRU. (b) Structure of EFSA. (c) Structure of BFF.
4.20. imgwhl
The imgwhl team proposed a lightweight SR network
named RFESR to achieve a compact network design and
fast inference speed. To be speciﬁc, the work is based on the
structure of IMDB [31] and inspired by several advanced
techniques [52,58].
The proposed network is shown in Fig. 38. A 3×3 convolution is first used to extract shallow features from the inputs. Then, four local residual feature fusion block (LRFFB) modules are stacked to perform deep feature extraction on the shallow features. After gradual feature refinement by the LRFFBs, another 3×3 convolution is used to extract the final deep features from the output of the last LRFFB module. The final deep and shallow features are added element-wise through a long-range skip connection. Finally, the high-resolution images are reconstructed through a pixel shuffle block that consists of a 3×3 convolution and a non-parametric sub-pixel operation.
The two building blocks are presented as follows.
LRFFB. Each LRFFB module contains four basic residual units, i.e., Attention-guided Adaptive Weighted Residual Units (AAWRUs). Inspired by the MBFF module [58], the backward feature fusion (BFF) module is introduced to fuse the multi-level features acquired from the AAWRUs. The feature extracted by the $j$-th AAWRU of the $i$-th LRFFB is denoted by $F_{ij}$. For example, the feature extracted by the second AAWRU in the first LRFFB is $F_{12}$. Specifically, in the $i$-th LRFFB, the last two features ($F_{i3}$ and $F_{i4}$) are first aggregated by a BFF module. The structure of the BFF module is shown in Fig. 39(c). It first concatenates the two input features and then processes the aggregated features with a channel shuffle operation [86] and a 1×1 convolution kernel. The BFF module is repeated three times until the features of all levels are fused in an LRFFB. The input features are then added to the fused output feature in an element-wise manner. As the residual may contain redundant information, the results are multiplied by trainable parameters to select useful information.
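The backward fusion described above can be sketched on per-pixel channel vectors, since both channel shuffle and a 1×1 convolution act independently at every spatial position. The sketch rests on assumptions that the text does not state: the shuffle uses 2 groups, the 1×1 convolution maps the 2n concatenated channels back to n, and fusion proceeds from the last AAWRU output backward. All function names and weights are hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups, as in ShuffleNet-style channel shuffle."""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]

def conv1x1(channels, weight):
    """1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * c for w, c in zip(row, channels)) for row in weight]

def bff(feat_a, feat_b, weight, groups=2):
    """Backward feature fusion: concatenate, shuffle, then mix with a 1x1 convolution."""
    fused = channel_shuffle(feat_a + feat_b, groups)
    return conv1x1(fused, weight)

def lrffb_fuse(features, weights):
    """Fuse [F_i1, F_i2, F_i3, F_i4] with three BFF steps, starting from (F_i3, F_i4)."""
    fused = features[-1]
    for feat, weight in zip(reversed(features[:-1]), weights):
        fused = bff(feat, fused, weight)
    return fused
```

After the shuffle, channels of the two inputs sit next to each other, so even a sparse or grouped 1×1 kernel would mix information from both features.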
AAWRU. The detailed structure of the AAWRU module is shown in Fig. 39(a). Inspired by the residual block proposed in RFANet [52], MAFFSRN [58] introduced an enhanced fast spatial attention module (EFSA). It aims to realize spatial attention weighting that concentrates the features on desired regions, so that more representative features can be obtained. The two branches of the residual structure in AAWRU are assigned adaptive weights, which help more shallow-level features be activated without increasing parameters. The design of the EFSA module is shown in Fig. 39(b). Using the blocks above, the proposed model can better extract and integrate compact contextual information with fewer parameters, which helps produce more delicate SR images.
Figure 40. Aselsan Research Team: The proposed architecture of IMDeception. Diagram labels: LR, Conv2d 3x3, GIDB and Split stages (48/16 channels) forming the Global Progressive Refinement Module (GPRM), Block NLA, Conv2d 1x1, Concat, Conv2d 3x3, Upsampler, SR.
Figure 41. Aselsan Research Team: Gblock. Red and orange stripes stand for ReLU and LeakyReLU activation. Diagram labels: Chunk 4, four Conv2d 3x3 on 16 channels each (group convolution, groups=4, output channels = 64), Concat, Conv2d 1x1 (64 channels).
4.21. whu sigma
The team designed their method based on RFDN [51]. They simply used dilated convolution to replace the convolution part in the RFDB module and adjusted the number of channels of RFDN to 64.
4.22. Aselsan Research
The team created a network structure where the progressive refinement module (PRM) is repeated locally in the blocks and globally among the blocks to reduce the number of parameters. To this end, the intermediate information collection (IIC) modules in the global setting are replaced with the proposed Global PRM. Furthermore, a block-based non-local attention block [73] is employed in the main path of the network, while it is avoided in the individual IMDB blocks. To further reduce the number of parameters and operations of the network, every single convolution operation inside the IMDB is replaced with Gblocks (Fig. 41), as in XLSR [5], which is based on group convolution. The group-convolution-based structure is referred to as Grouped Information Distilling Blocks (GIDB). Unfortunately, grouped convolutions are not well optimized in the PyTorch framework [21]. If implemented properly within an inference-oriented framework, group convolutions can lead to speedups [5, 21], especially on mobile devices where efficient network structures are usually employed.
Figure 42. Aselsan Research Team: GIDB Block.
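The parameter saving from group convolution is easy to check by counting weights. The sketch below assumes that a Gblock is a grouped 3×3 convolution (groups=4, 64 channels) followed by a 1×1 convolution, as the labels of Fig. 41 suggest, and it ignores biases; `conv_params` is a hypothetical helper.

```python
def conv_params(c_in, c_out, kernel, groups=1, bias=False):
    """Number of weights of a 2D convolution with square kernels."""
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)

standard = conv_params(64, 64, 3)            # plain 3x3 convolution: 36864 weights
grouped = conv_params(64, 64, 3, groups=4)   # grouped 3x3 convolution: 9216 weights
pointwise = conv_params(64, 64, 1)           # 1x1 convolution after the concat: 4096 weights
gblock = grouped + pointwise                 # assumed Gblock total: 13312 weights
```

Under these assumptions a Gblock needs roughly 36% of the weights of the plain 3×3 convolution it replaces, although, as noted above, the runtime benefit depends on how well the framework optimizes grouped convolutions.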
4.23. Drinktea
Inspired by IMDN and LatticeNet [57], the Drinktea team proposed a method to obtain channel attention that utilizes both the mean and the standard deviation of the feature maps. First, as shown in Fig. 43, the mean value of each feature map is calculated by global average pooling and its standard deviation is also computed. Second, the statistic vector in each branch is passed to a 1×1 convolution layer, which performs channel downscaling with reduction ratio $r$, and is then activated by ReLU. Third, the two vectors are added to fuse the information they carry. Then the fused vector is restored to the original number of channels. Finally, the sigmoid activation is applied to the vector to generate the channel attention. FCAB utilizes the features from different hierarchical levels and augments them with channel attention [77].
Figure 43. Drinktea Team: Details of the fusion channel attention module.
Figure 44. Drinktea Team: Details of the fusion channel attention block.
Figure 45. Drinktea Team: The architecture of attention augmented lightweight network (AALN).
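The fusion channel attention steps listed above can be sketched as follows. The sketch makes several assumptions that the text leaves open: the standard deviation is the population form, the 1×1 convolutions have no bias, the attention weights rescale the input channels by multiplication, and the weight matrices `w_mean`, `w_std`, and `w_up` are hypothetical learned parameters of shapes (C/r)×C, (C/r)×C, and C×(C/r).

```python
import math

def channel_stats(feature):
    """Per-channel mean and standard deviation; each channel is a flat pixel list."""
    means = [sum(ch) / len(ch) for ch in feature]
    stds = [math.sqrt(sum((v - m) ** 2 for v in ch) / len(ch))
            for ch, m in zip(feature, means)]
    return means, stds

def linear(vec, weight):
    """1x1 convolution applied to a channel vector (matrix-vector product)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def relu(vec):
    return [max(0.0, v) for v in vec]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fusion_channel_attention(feature, w_mean, w_std, w_up):
    """Fuse mean and std statistics into one attention weight per channel."""
    means, stds = channel_stats(feature)
    reduced_mean = relu(linear(means, w_mean))   # channel downscaling + ReLU
    reduced_std = relu(linear(stds, w_std))
    fused = [a + b for a, b in zip(reduced_mean, reduced_std)]
    attention = [sigmoid(v) for v in linear(fused, w_up)]  # back to C channels
    return [[v * a for v in ch] for ch, a in zip(feature, attention)]
```

Using the standard deviation alongside the mean lets two channels with the same average response but different contrast receive different weights.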
Spatial attention can effectively improve the performance of the model, but it is difficult to balance complexity and performance in a lightweight network. Inspired by ULSAM [64], the team designed a spatial attention module for the super-resolution task. As shown in Fig. 44, the features of each channel are first extracted with a depthwise convolution and activated with the PReLU function. Then another depthwise convolution and a sigmoid function are applied to redistribute the weights. As shown in Fig. 44, the basic Attention Augmented ConvBlock (AACB) is composed of two FCABs, a 1×1 convolution, and a spatial attention module. The architecture of the attention augmented lightweight network (AALN) for efficient image SR is shown in Fig. 45.
4.24. GDUT SR
The GDUT SR team proposed the Progressive Representation Re-Calibration Network (PRRN) for lightweight SR. The proposed PRRN, shown in Fig. 46, is modified from PAN [90] but achieves better performance and runtime efficiency than PAN with a limited increase in parameters. The main contribution of PRRN is to adjust the receptive field of the CNN by using the pixel and channel information in a two-stage manner. A shallow channel attention (SCA) mechanism is proposed to build the correspondences between channels in a simpler yet more efficient way. The architecture of PRRN can be divided into three components: the shallow feature extractor, the deep feature extractor, and reconstruction. The shallow feature is extracted by a 3×3 convolution layer, while the deep feature extractor is a stack of the proposed Progressive Representation Re-calibration Blocks (PRRBs). Finally, the pixel shuffle layer is used to reconstruct the HR image.
The deep feature extractor consists of 16 PRRBs, and multiple long skip connections are applied to propagate the initial features to the intermediate layers. PRRB explores the discriminative information in a two-stage manner. In the first stage, the First Stage Attention (FSA) uses pixel attention (PA) [90] to capture important pixel information, and the proposed SCA mechanism is applied to learn useful channel information. Therefore, the first stage of PRRB can explore the spatial and channel information simultaneously. In the second stage, an SCA modified from squeeze-and-excitation (SE) [29] is used to further rescale the importance of the output feature channels. SCA uses average pooling to collect channel information and then uses a 1×1 convolution and a sigmoid activation function to process the information. Moreover, inspired by recent work [49], the SiLU activation is used at the end of both the top and bottom branches of FSA.
Figure 46. GDUT SR Team: The overall architecture of the progressive representation re-calibration network (PRRN).
Giantpandacv
The Giantpandacv team proposed a lightweight Self-Calibrated Efficient Transformer (SCET) network, which is inspired by PAN [90] and Restormer [81]. The architecture of SCET, shown in Fig. 47, mainly consists of the Self-Calibrated module and the Efficient Transformer block. The Self-Calibrated module adopts the pixel attention mechanism to extract image features effectively. To further exploit the contextual information in the features, an efficient Transformer module is employed to help the network capture similar features over long distances and thus recover sufficient texture details. The details of these modules are described as follows.
Self-Calibrated module. In this module, 16 cascaded Self-Calibrated convolutions with Pixel Attention (SCPA) blocks are utilized for a larger receptive field. The SCPA block [90] consists of two branches. One branch is equipped with a pixel attention module to perform attention in the spatial dimension, while the other branch retains the original high-frequency feature information. Furthermore, a skip connection is utilized to facilitate network training.
Efficient Transformer. The efficient transformer block is utilized to further exploit the contextual information in the features. In this block, Multi-Dconv Head Transposed Attention (MDTA) is used to avoid the vast computational complexity of the traditional self-attention mechanism. A feed-forward network with a gating mechanism is further employed to recover precise texture details.
4.25. TeamInception
The proposed solution is based on the Transformer-based
architecture Restormer that is recently introduced in [81].
Speciﬁcally, an isotropic version of Restormer is built,
which operates at the original resolution and does not con-
tain any downsampling operation.
Overall pipeline. The overall pipeline of the Restormer architecture is presented in Fig. 48. Given a low-resolution image $I \in \mathbb{R}^{H \times W \times 3}$, Restormer first applies a convolution to obtain low-level feature embeddings $F_0 \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ denotes the spatial dimension and $C$ is the number of channels. Next, these shallow features $F_0$ pass through multiple transformer blocks (six in this work) and are transformed into deep features $F_d \in \mathbb{R}^{H \times W \times C}$, to which the shallow features $F_0$ are added via a skip connection. Finally, a convolution layer followed by a pixel shuffle layer is applied to the deep features $F_d$ to generate the residual high-resolution image $R \in \mathbb{R}^{sH \times sW \times 3}$, where $s$ denotes the scaling factor. To obtain the final super-resolved image, the residual image is added to the bilinearly upsampled input image: $\hat{I} = \text{bilinear-up}(I) + R$.
In the proposed Transformer block, the core components
are: (a) multi-Dconv head transposed attention (MDTA)
and (b) gated-Dconv feed-forward network (GDFN).
Multi-Dconv Head Transposed Attention. The major computational overhead in Transformers comes from the self-attention (SA) layer, which has quadratic time and memory complexity. Therefore, it is infeasible to apply SA to most image restoration tasks, which often involve high-resolution images. To alleviate this issue, MDTA is proposed. The key ingredient is to apply SA across channels rather than the spatial dimension, i.e., to compute the cross-covariance across channels to generate an attention map that encodes the non-local context implicitly. As another essential component of MDTA, depth-wise convolutions are introduced to emphasize the local context before computing the feature covariance to produce the global attention map [45].
Figure 47. Giantpandacv Team: Self-Calibrated Efficient Transformer (SCET) Network.
Figure 48. TeamInception: Overall framework of Restormer [81].
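The core idea of attending across channels can be sketched in a few lines. This is a deliberately reduced illustration, not the Restormer implementation: it uses a single head, takes the query, key, and value maps as given (the projections and depth-wise convolutions that produce them are omitted), skips any normalization of queries and keys, and treats `temperature` as an assumed scaling parameter. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transposed_attention(q, k, v, temperature=1.0):
    """Channel-wise attention on C x N maps (C channels, N = H*W flattened pixels).

    The attention map is C x C, so its size does not depend on the image resolution.
    """
    attn = [softmax([temperature * sum(a * b for a, b in zip(q_c, k_c)) for k_c in k])
            for q_c in q]
    n = len(v[0])
    out = [[sum(w * v_c[p] for w, v_c in zip(row, v)) for p in range(n)]
           for row in attn]
    return out, attn
```

For a feature map with C channels and N pixels, this attention map has C×C entries, whereas spatial self-attention would need N×N entries, which is what makes the transposed form affordable at high resolution.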
Gated-Dconv Feed-Forward Network. A feed-forward
network (FN) is the other building block of the Transformer
model [17], which consists of two fully connected layers
with a non-linearity in between. As shown in Fig. 48(b),
the ﬁrst linear transformation layer of the regular FN [17]
is reformulated with a gating mechanism to improve the in-
formation ﬂow through the network. This gating layer is
designed as the element-wise product of two linear projec-
tion layers, one of which is activated with the GELU non-
linearity. Our GDFN is also based on local content mix-
ing similar to the MDTA module to equally emphasize on
the spatial context, which is useful for learning local im-
age structure for effective restoration. The gating mech-
anism in GDFN controls which complementary features
should ﬂow forward and allows subsequent layers in the net-
work to speciﬁcally focus on more reﬁned image attributes,
thus leading to high-quality outputs. Progressive learning
is performed where the network is trained on smaller image
patches in the early epochs and on gradually larger patches
in the later training epochs. The model trained on mixed-
27
Figure 49. cceNBgdd Team: (a) The architecture of our proposed very lightweight and efﬁcient image super-resolution network (VLESR);
(b) Residual attention block (RAB). (c) Lightweight residual concatenation block (LRCB); and (d) Sign description.
Figure 50. cceNBgdd Team: (a) The structure of our proposed lightweight convolution block (LConv); and (b) Progressive interactive
group convolution (PIGC).
size patches via progressive learning shows enhanced per-
formance at test time where images can be of different res-
olutions (a common case in image restoration).
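The gating step described for the GDFN can be illustrated with a minimal sketch for a single pixel's feature vector. It shows only the element-wise product of a GELU-activated projection and a plain projection, followed by an output projection. The depth-wise convolutions, normalization, and residual connection of the full block are omitted, and the function names, weights, and channel sizes are hypothetical choices made for illustration, not the team's implementation.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weights, x):
    """Linear projection: each row of `weights` produces one output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def gated_feed_forward(x, w_gate, w_value, w_out):
    """Gated feed-forward step for one pixel's feature vector x (a sketch)."""
    gate = [gelu(v) for v in linear(w_gate, x)]    # GELU-activated projection
    value = linear(w_value, x)                     # plain projection
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise product (the gate)
    return linear(w_out, hidden)                   # project back to the output size


# Hypothetical 2-channel input and hand-picked weights, for illustration only.
x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gated_feed_forward(x, identity, identity, [[1.0, 1.0]])
print(out)
```

When the gate projection outputs zero for a channel, GELU maps it to zero and that channel contributes nothing downstream, which is the sense in which the gate controls which features flow forward.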
4.26. cceNBgdd
The VLESR network architecture, shown in Fig. 49(a),
mainly consists of a 3×3 convolutional layer, a deep feature
extraction block (DFEB), a frequency grouping fusion
block (FGFB), and an Upsampler. The DFEB contains four
residual attention blocks (RABs), and the Upsampler uses
sub-pixel convolution.
Each RAB contains three lightweight residual concatenation
blocks (LRCBs), a multi-way attention block
(MWAB), and a skip connection, as shown in Fig. 49(b).
The LRCB consists of two parts, as shown in Fig. 49(c).
The first part contains two lightweight convolutional blocks
(LConv, described below), each immediately followed by a ReLU
activation layer, and a skip connection to learn the local
residual feature information. The learned residual features are
concatenated with the original features to enhance the utilization
and propagation of the feature information. In the second part,
a 1×1 convolutional layer compresses the concatenated features.
The MWAB is shown in Fig. 51. It contains three branches: the
first and second branches focus on global information,
and the third branch focuses on local information.
The three branches explore different kinds of feature information,
and the per-channel importance values (i.e., weights) they
compute are summed.
Figure 51. cceNBgdd Team: Multi-way attention block (MWAB).
Figure 52. cceNBgdd Team: Schematic diagram of frequency
grouping fusion block (FGFB).
Based on ShuffleNet [86], a very lightweight building
block, called the lightweight convolutional block (LConv),
is designed for the SISR task. It makes two main changes to
the ShuffleNet unit: (1) the batch normalization layers are
removed, since they have been shown to deteriorate the accuracy
of SISR; (2) the first 1×1 group convolution is replaced with
the progressive interactive group convolution (PIGC), and the second
1×1 group convolution is replaced with a 1×1 point-wise
convolution to enhance the interaction between the group
features. The structure of the LConv is shown in Fig. 50(a);
it consists of a PIGC, a channel shuffle layer, a
3×3 depth-wise convolution, and a 1×1 point-wise convolution.
The structure of the PIGC in the LConv is shown in
Fig. 50(b).
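The channel shuffle layer inside the LConv follows the ShuffleNet operation, which interleaves channels so that the next layer sees features from every group. A minimal sketch on a list of channel labels is given below. The function name and the 6-channel example are illustrative assumptions, and a real implementation would permute the channel axis of a feature tensor rather than a Python list.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups, as in the ShuffleNet shuffle.

    Equivalent to viewing the channel axis as (groups, channels_per_group),
    transposing to (channels_per_group, groups), and flattening again.
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, any contiguous block of channels contains members of both original groups, so a following grouped or depth-wise operation mixes information across groups.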
The frequency grouping fusion block (FGFB) is shown
in Fig. 52. The features with the largest difference between
low and high frequencies are assigned to the
first group, the features with the next largest difference are
assigned to the second group, and so on. Then, starting
from the feature group with the smallest frequency difference,
the features of each group are gradually fused until
the feature group with the largest frequency difference is reached. If
the number of RABs is odd, the output feature of
the middle RAB alone forms the last feature group. The feature
produced by grouping fusion is then fed into the MWAB
for further fusion. When the number of RABs is 4,
there are only two feature groups.
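The grouping rule can be sketched as follows, under one loudly stated assumption: that the "largest frequency difference" pairs the shallowest RAB output with the deepest one, the next pair moves one step inward, and so on. This reading is inferred from the two cases the text gives (an odd count leaves the middle RAB alone, and 4 RABs give two groups); the source does not spell out the pairing, and the function names are hypothetical.

```python
def frequency_groups(num_rabs):
    """Group RAB outputs (1-indexed) from largest to smallest frequency gap.

    Assumption: the outermost pair (first, last) has the largest gap.
    """
    groups = []
    first, last = 1, num_rabs
    while first < last:
        groups.append((first, last))
        first += 1
        last -= 1
    if first == last:
        # Odd count: the middle RAB forms the last group on its own.
        groups.append((first,))
    return groups


def fusion_order(num_rabs):
    """Fusion starts from the group with the smallest frequency difference."""
    return list(reversed(frequency_groups(num_rabs)))


print(frequency_groups(4))  # [(1, 4), (2, 3)]
print(frequency_groups(5))  # [(1, 5), (2, 4), (3,)]
print(fusion_order(4))      # [(2, 3), (1, 4)]
```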
4.27. ZLZ
The team proposed the Information Multi-distillation
Transformer Block (IMDTB), shown in Fig. 53, as the basic block,
in which the convolutions of IMDB [31,32] are converted to
grouped convolutions with 4 groups. A channel shuffling
operation [86] is used to increase the
information interaction between channels. The attention
mechanism is replaced with a Swin Transformer [47,54]
to better model spatial relations in images.
Figure 53. ZLZ Team: IMDTB architecture diagram.
4.28. Express
Existing lightweight SR methods such as IMDN [31]
and RFDN [51] have achieved promising performance with
a great equilibrium between performance and parameters
or inference speed. However, there is still room for im-
Figure 54. Express Team: (a) The shallow residual block (SRB) in RFDB
and (b) the mixed operations block (Mixed OP), a mixed residual block that
replaces the 3×3 convolution layer with a mixed layer.
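| Operation | Kernel Size | Params (K) | Multi-Adds (G) |
|---|---|---|---|
| Convolution | 1×1 | 2.5 | 0.576 |
| Convolution | 3×3 | 22.5 | 5.184 |
| Convolution | 5×5 | 62.5 | 14.400 |
| Convolution | 7×7 | 122.5 | 28.224 |
| Separable convolution | 3×3 | 5.9 | 1.359 |
| Separable convolution | 5×5 | 7.5 | 1.728 |
| Separable convolution | 7×7 | 9.9 | 2.281 |
| Dilated convolution | 3×3 | 2.95 | 0.680 |
| Dilated convolution | 5×5 | 3.75 | 0.864 |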
Table 3. Express Team: Operations and their complexities in the
mixed layer. Dilated convolution [79] is applied jointly with group
convolution. Multi-Adds are calculated for the ×2 SR task with 50
channels on a 1280×720 image.
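The entries of Table 3 can be reproduced with a short calculation, sketched below under several assumptions that the source does not state explicitly: biases are ignored; Multi-Adds are counted at the low-resolution input size (640×360 for a ×2 task with a 1280×720 output); a separable convolution is two stacked depth-wise plus point-wise pairs, as in DARTS; and a dilated convolution is a single depth-wise plus point-wise pair. These assumptions were chosen because they match the table values, and the function names are hypothetical.

```python
CHANNELS = 50
# Assumed: operations run at low resolution, i.e. 1280x720 divided by the x2 scale.
LR_PIXELS = (1280 // 2) * (720 // 2)


def conv_params(kernel, channels=CHANNELS):
    """Standard convolution, bias ignored."""
    return kernel * kernel * channels * channels


def depthwise_pointwise_params(kernel, channels=CHANNELS):
    """One depth-wise k x k convolution followed by a 1x1 point-wise convolution."""
    return kernel * kernel * channels + channels * channels


def separable_params(kernel, channels=CHANNELS):
    """Assumed DARTS-style separable convolution: two stacked pairs."""
    return 2 * depthwise_pointwise_params(kernel, channels)


def dilated_params(kernel, channels=CHANNELS):
    """Assumed dilated convolution: a single depth-wise + point-wise pair."""
    return depthwise_pointwise_params(kernel, channels)


def multi_adds_g(params):
    """Multiply-adds in units of 1e9: one per weight per low-resolution pixel."""
    return params * LR_PIXELS / 1e9


for name, fn, kernels in [
    ("convolution", conv_params, (1, 3, 5, 7)),
    ("separable", separable_params, (3, 5, 7)),
    ("dilated", dilated_params, (3, 5)),
]:
    for k in kernels:
        p = fn(k)
        print(f"{name} {k}x{k}: {p / 1e3:.2f} K params, {multi_adds_g(p):.3f} G Multi-Adds")
```

Under these assumptions, every separable and dilated candidate is several times cheaper than the standard 3×3 convolution it may replace.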
provement in their network architectures. For instance, the
3×3 convolution kernels have been widely adopted by
IMDN [31] and RFDN [51], although their optimality is still
questionable. Blocks based on 3×3 convolution kernels
may be suboptimal for lightweight SISR tasks, and neural
architecture search (NAS) is a natural way to look for
better alternatives. The proposed solution is based on
DARTS [50] and DLSR [30] and is a fully differentiable NAS
method for lightweight SR models. The aim is to find a
lightweight network for efficient SR by searching for the best
replacement of the 3×3 convolution kernels in the shallow
residual blocks of RFDB. Next, the search space, search
strategy, and searched network are introduced in sequence.
Search space. In the residual feature distillation
block (RFDB) [51], the smallest building block, the shallow
residual block (SRB), consists of a 3×3 convolution
layer and a residual connection, as shown in Fig. 54(a). To
search for a more lightweight structure with competitive
performance, the 3×3 convolution layer is replaced
with a mixed layer, as shown in Fig. 54(b). The mixed
layer is composed of multiple operations, including separable
convolution, dilated convolution, and normal convolution,
as shown in Tab. 3. The input is denoted as $x_k$, and
the operation set is denoted as $\mathcal{O}$, where each element
represents a candidate operation $o(\cdot)$ that is weighted by the
architecture parameter $\alpha_o^k$. Then, as in DARTS [50], the softmax
function is used to perform the continuous relaxation
of the operation space. Thus, the output of mixed layer $k$,
denoted by $f_k(x_k)$, is given as:
$$f_k(x_k) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^k)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^k)}\, o(x_k). \tag{21}$$
After searching, only the operation with the largest $\alpha_o^k$ is
retained as the best choice for this layer. All three SRBs of
each RFDB are replaced by the searched results. Since each of
the three SRBs is chosen from the 9 candidate operations in Tab. 3,
the search space contains 9×9×9 = 729 different structures. The
network structure and its corresponding cell structure during
searching are shown in Fig. 55(a) and Fig. 55(b), respectively.
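The continuous relaxation in Eq. (21) and the final selection step can be sketched with scalar stand-ins for the candidate operations. The three toy operations, their names, and the architecture parameter values below are hypothetical; in the actual search each candidate is one of the convolutions listed in Tab. 3 applied to a feature map.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_layer(x, ops, alphas):
    """Eq. (21): softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


def select_operation(op_names, alphas):
    """After the search, keep only the operation with the largest alpha."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return op_names[best]


# Toy scalar "operations" standing in for convolutions (illustration only).
op_names = ["identity_like", "double_like", "zero_like"]
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

print(mixed_layer(3.0, ops, [0.0, 0.0, 0.0]))       # equal weights: mean of 3, 6, 0
print(select_operation(op_names, [0.1, 2.0, -1.0]))  # double_like
print(9 ** 3)                                        # size of the search space
```

Because the softmax weights are differentiable in the architecture parameters, the choice among operations can be optimized by gradient descent together with the network weights.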
Search strategy. The differentiable NAS method is applied
to the lightweight SISR task. The objective function
of the model is defined as
$$\min_{\theta,\alpha} \left[ L_{tr}(\theta(\alpha)) + \lambda L_{val}(\theta(\alpha); \alpha) \right], \tag{22}$$
where $\theta$ denotes the weight parameters of the network, and
$\lambda$ is a non-negative regularization parameter that balances
the importance of the training loss and the validation loss. Since
the architecture parameter $\alpha$ is continuous, Adam [36] is
directly applied to solve problem (22). The parameters $\theta$ and $\alpha$
are updated iteratively with step sizes $\eta_\theta$ and $\eta_\alpha$:
$$\theta = \theta - \eta_\theta \nabla_\theta L_{tr}(\theta, \alpha); \tag{23}$$
$$\alpha = \alpha - \eta_\alpha \left( \nabla_\alpha L_{tr}(\theta, \alpha) + \lambda \nabla_\alpha L_{val}(\theta, \alpha) \right). \tag{24}$$
The searching and training procedure is summarized in Algorithm 1.
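The alternating updates of Eqs. (23) and (24) can be illustrated on a toy problem with scalar parameters. The sketch below reads Eq. (24) as a descent step on the combined objective of Eq. (22) and uses plain gradient steps rather than Adam, which is a simplification. The quadratic losses, learning rates, and starting values are hypothetical and serve only to show the update order.

```python
def search_step(theta, alpha, grads, lr_theta, lr_alpha, lam):
    """One alternating update: theta by Eq. (23), then alpha by Eq. (24)."""
    grad_theta_tr, grad_alpha_tr, grad_alpha_val = grads
    # Eq. (23): update the network weights on the training loss.
    theta = theta - lr_theta * grad_theta_tr(theta, alpha)
    # Eq. (24): update the architecture parameters on training + lam * validation.
    alpha = alpha - lr_alpha * (
        grad_alpha_tr(theta, alpha) + lam * grad_alpha_val(theta, alpha)
    )
    return theta, alpha


# Toy losses (illustration only):
#   L_tr(theta, alpha)  = (theta - alpha)^2
#   L_val(theta, alpha) = (alpha - 1)^2
toy_grads = (
    lambda th, al: 2.0 * (th - al),   # d L_tr / d theta
    lambda th, al: -2.0 * (th - al),  # d L_tr / d alpha
    lambda th, al: 2.0 * (al - 1.0),  # d L_val / d alpha
)

theta, alpha = 0.0, 3.0
for _ in range(500):
    theta, alpha = search_step(theta, alpha, toy_grads, 0.1, 0.1, 1.0)

print(theta, alpha)  # both approach 1.0, the minimizer of L_tr + L_val
```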
Searched results. As shown in Fig. 55(c), the searched
cell is composed of a 7×7 separable convolution layer,
a 5×5 separable convolution layer, a 3×3 separable convolution
layer, an ESA block, and residual connections with the
information distillation mechanism. Since the searched operations
all have fewer parameters and FLOPs
than the original 3×3 convolution layer, the resulting model is
much smaller (nearly half the size) than RFDN [51].
Loss function. To achieve lightweight and accurate SR
models, the loss function is the weighted sum of the following three
losses:
$$L_1 = \frac{1}{N}\sum_{i=1}^{N} \left\| F_\theta(I_i^{LR}) - I_i^{HR} \right\|_1; \tag{25}$$
Figure 55. Express Team: The searched cell structure and architecture of the network: (a) the network structure, (b) the cell structure during searching, and (c) the searched cell structure. For brevity, the connection from each cell to the last
convolution layer has been omitted.
Algorithm 1: Express Team: Searching and training algorithm
Input: Training set $D$
1. Initialize the super-network $T$ with architecture parameters $\alpha$.
2. Split training set $D$ into $D_{train}$ and $D_{valid}$.
3. Train the super-network $T$ on $D_{train}$ for several steps to warm up.
4. for $t = 1, 2, \ldots, T$ do
5. Sample train batch $B_t = \{(x_i, y_i)\}_{i=1}^{batch}$ from $D_{train}$
6. Optimize $\theta$ on $B_t$ by Eq. (23)
7. Sample valid batch $B_v = \{(x_i, y_i)\}_{i=1}^{batch}$ from $D_{valid}$
8. Optimize $\alpha$ on $B_v$ by Eq. (24)
9. Save the genotypes of the searched networks
10. Train searched networks
Output: A lightweight SR network $S$
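$$L_{HFEN} = \frac{1}{N}\sum_{i=1}^{N} \left\| \nabla F_\theta(I_i^{LR}) - \nabla I_i^{HR} \right\|_1; \tag{26}$$
$$L_P = \sum_{o \in \mathcal{O}} \frac{p_o}{\sum_{c \in \mathcal{O}} p_c}\, \mathrm{softmax}(\alpha_o); \tag{27}$$
$$L(\theta) = L_1 + \mu \times L_{HFEN} + \gamma \times L_P. \tag{28}$$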
Specifically, $L_{HFEN}$ [9] is a gradient-domain L1 loss that
can improve image details; $p_o$ denotes the number of parameters
of operation $o$, and $L_P$ uses these counts to weight the
architecture parameters $\alpha$, pushing the algorithm to search for
lightweight operations. The weighting parameters $\mu$ and $\gamma$
balance the reconstruction performance and the model
complexity, respectively. When retraining the searched network,
$\gamma$ is set to 0, which removes the last term from the total loss
function in Eq. (28).
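The parameter penalty of Eq. (27) and the total loss of Eq. (28) can be sketched as follows. The parameter counts are taken from the first two rows of Tab. 3 (1×1 and 3×3 convolution), while the architecture parameter values, the loss values, and the weights $\mu$ and $\gamma$ are hypothetical, since the source does not report them.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def parameter_penalty(alphas, param_counts):
    """Eq. (27): softmax(alpha) weighted by each operation's share of parameters."""
    total = sum(param_counts)
    probs = softmax(alphas)
    return sum(p / total * s for p, s in zip(param_counts, probs))


def total_loss(l1, l_hfen, l_p, mu, gamma):
    """Eq. (28): weighted sum of the three losses."""
    return l1 + mu * l_hfen + gamma * l_p


# Parameter counts of the 1x1 and 3x3 convolutions from Tab. 3.
param_counts = [2500, 22500]

light = parameter_penalty([5.0, 0.0], param_counts)  # alpha favors the 1x1 conv
heavy = parameter_penalty([0.0, 5.0], param_counts)  # alpha favors the 3x3 conv
print(light < heavy)  # True: favoring the lighter operation lowers the penalty

# Retraining the searched network uses gamma = 0, dropping the penalty term.
print(total_loss(l1=1.0, l_hfen=2.0, l_p=heavy, mu=0.1, gamma=0.0))
```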
4.29. Just Try
The team designed a multi-branch network structure, LWFANet.
LWFANet extracts the shallow feature with
one convolution layer. The extracted feature is sent to the
deep feature extraction module, which consists of 10 LWFA
blocks and a 3×3 convolution layer. Each LWFA block has
four branches, and every branch consists of a 1×1 convolution
layer and several 3×3 convolution layers. The 1×1 convolutional
layer selects the input features and reduces the
number of channels to one-fourth of the input channels, while
the differing numbers of 3×3 convolutional layers
extract features at different levels. The multi-level
features of all branches are then concatenated, and a channel
attention mechanism adaptively aggregates them to produce
out_ca. Spatial attention is then applied to obtain out_sa. The
input feature is also enhanced by spatial attention, yielding
x_sa. The final output of the LWFA block is obtained
by summing out_ca, out_sa, and x_sa. Long skip connections
are used to obtain the final output feature, and a 1×1 convolution
layer then reduces its dimension. The upsampling
module consists of nearest interpolation and 3×3 convolution
layers. The reconstructed image is derived after two
convolution operations.
Figure 56. ncepu explorers Team: (a) The network architecture of the proposed MDAN. (b) Area feature fusion block (AFFB). (c) Multiple
interactive residual block (MIR). (d) Lightweight convolutional units (LConvS / LConvD).
4.30. ncepu explorers
The ncepu explorers team proposed the MDAN network
architecture shown in Fig. 56, which consists of four main parts:
the shallow feature extraction block (SFEB), the nonlinear feature
mapping block (NFMB), the hierarchical feature fusion block
(HFFB), and the upsampling block (Upsampler). The SFEB
consists of only one 3×3 convolutional layer and one
leaky rectified linear unit (LReLU), and the Upsampler uses
sub-pixel convolution. The NFMB cascades N (N = 3)
area feature fusion blocks (AFFBs). The HFFB mainly
consists of multiple LConvS / 1×1 convolution pairs, where LConvS
is a lightweight convolutional unit, and multiple
multi-dimensional attention blocks (MDABs).
In the MDAN architecture, three AFFBs were cascaded
in the NFMB, and the number of input and output channels
of each AFFB was 48. Six MIRs were cascaded in
each AFFB, and the dilation rates of the dilated convolutions
in the MIRs were set to 1, 1, 2, 2, 3, and 3. In each
MIR, both LConvS and LConvD had 48 input channels and
24 output channels. The group convolutions in both LConvS
and LConvD used three groups. The initial values of the
learnable parameters µ1, µ2, and µ3 in the HFFB were set to 0.3, 0.3,
and 0.4, respectively.
4.31. mju mnu
The team proposed a lightweight SR model, the hybrid
network of CNN and Transformer (HNCT), shown in Fig. 57,
which integrates CNN and Transformer components to model local and
non-local priors simultaneously. Specifically, HNCT consists
of four parts: a shallow feature extraction (SFE) module,
Hybrid Blocks of CNN and Transformer (HBCTs),
a dense feature fusion (DFF) module, and an up-sampling module.
First, shallow features containing low-frequency information
are extracted by a single convolution layer in the
SFE module. Then, four HBCTs are
used to extract hierarchical features. Each HBCT contains a
Swin Transformer block (STB) with two Swin Transformer
layers inside, a convolutional layer, and two enhanced spatial
attention (ESA) modules. Afterwards, the hierarchical
features produced by the HBCTs are concatenated and fused
in the DFF module to obtain residual features. Finally, SR results are
generated in the up-sampling module. By integrating CNN and
Transformer, HNCT is able to extract more effective features
for SR.
Acknowledgments
We thank the NTIRE 2022 sponsors: Huawei, Reality
Labs, Bending Spoons, MediaTek, OPPO, Oddity, Voyage81,
ETH Zurich (Computer Vision Lab) and University
of Würzburg (CAIDAS).
Figure 57. mju mnu Team: The architecture of the proposed HNCT for lightweight image super-resolution. (a) The module of Hybrid
Blocks of CNN and Transformer (HBCTs). (b) Swin Transformer Layer (STL). (c) Enhanced spatial attention module (ESA) proposed in
RFANet [52].
A. Teams and afﬁliations
NTIRE 2022 team
Title: NTIRE 2022 Efﬁcient Super-Resolution Challenge
Members:
Yawei Li1(yawei.li@vision.ee.ethz.ch),
Kai Zhang1(kai.zhang@vision.ee.ethz.ch),
Luc Van Gool1(vangool@vision.ee.ethz.ch),
Radu Timofte1,2(radu.timofte@vision.ee.ethz.ch)
Afﬁliations:
1Computer Vision Lab, ETH Zurich, Switzerland
2University of Würzburg, Germany
ByteESR
Title: Residual Local Feature Network For Efﬁcient
Super-Resolution
Members:
Fangyuan Kong (kongfangyuan@bytedance.com), Mingxi
Li, Songwei Liu
Afﬁliations:
ByteDance, Shenzhen, China
NJU Jet
Title: Fast and Memory-Efﬁcient Network with Window
Attention
Members:
Zongcai Du1(151220022@smail.nju.edu.cn), Ding Liu2
Afﬁliations:
1State Key Laboratory for Novel Software Technology,
Nanjing University, China
2ByteDance Inc.
NEESR
Title: Edge-Oriented Feature Distillation Network for
Lightweight Image Super Resolution
Members: Chenhui Zhou1(daomujun@foxmail.com),
Jingyi Chen1, Qingrui Han1
Afﬁliation:
1NetEase, Inc.
XPixel
Title: Blueprint Separable Residual Network for Single
Image Super-Resolution
Members:
Zheyuan Li1(zy.li3@siat.ac.cn), Yingqi Liu1, Xiangyu
Chen1,2, Haoming Cai1, Yu Qiao3,1, Chao Dong3,1
Afﬁliations:
1Shenzhen Institutes of Advanced Technology, CAS
2University of Macau
3Shanghai AI Lab, Shanghai, China