All content in this area was uploaded by Kaixuan Wei on Apr 03, 2020

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEM 1

3D Quasi-Recurrent Neural Network for

Hyperspectral Image Denoising

Kaixuan Wei, Ying Fu, Member, IEEE, and Hua Huang, Senior Member, IEEE

Abstract—In this paper, we propose an alternating directional

3D quasi-recurrent neural network for hyperspectral image (HSI)

denoising, which can effectively embed the domain knowledge

— structural spatio-spectral correlation and global correlation

along spectrum. Speciﬁcally, 3D convolution is utilized to extract

structural spatio-spectral correlation in an HSI, while a quasi-

recurrent pooling function is employed to capture the global

correlation along spectrum. Moreover, an alternating directional structure is introduced to eliminate the causal dependency with no additional computation cost. The proposed model is capable of modeling spatio-spectral dependency while preserving the flexibility towards HSIs with an arbitrary number of bands. Extensive experiments on HSI denoising demonstrate significant improvement over state-of-the-art methods under various noise settings, in terms of both restoration accuracy and computation time. Our code is available at https://github.com/Vandermode/QRNN3D.

Index Terms—Hyperspectral image denoising, structural

spatio-spectral correlation, global correlation along spectrum,

quasi-recurrent neural networks, alternating directional struc-

ture

I. INTRODUCTION

A HYPERSPECTRAL image (HSI) consists of a large number of discrete wavebands for each spatial position of a real scene and provides much richer information about the scene than an RGB image, which has led to numerous applications in remote sensing [27], [34], classification [2], [6], [31], [38], [45], tracking [37], face recognition [36], and more. However, due to the limited light available for each band, HSIs are often degraded by various types of noise (e.g., Gaussian, stripe, deadline, and impulse noise) during the acquisition process. These degradations negatively influence the performance of all the subsequent HSI processing tasks mentioned above. Therefore, HSI denoising is an essential pre-processing step in the typical workflow of HSI analysis and processing.

Recently, more HSI denoising works pay attention to the

domain knowledge of the HSI — structural spatio-spectral

correlation and global correlation along spectrum (GCS) [42].

Top-performing classical methods [8], [9], [39], [41], [42]

typically utilize non-local low-rank tensors to model them.

Although these methods achieve higher accuracy by effectively

considering these underlying characteristics, the performance

of such methods is inherently determined by how well the

human handcrafted prior (e.g. low-rank tensors) matches with

the intrinsic characteristics of an HSI. Besides, such ap-

proaches generally formulate the HSI denoising as a complex

optimization problem to be solved iteratively, making the

denoising process time-consuming.

Alternative learning-based approaches rely on convolutional neural networks in lieu of the costly optimization and handcrafted priors [7], [46]. Promising results notwithstanding, these approaches model HSI by learned multichannel or band-wise 2D convolutions, which sacrifice either the flexibility with respect to the spectral dimension [7] (hence requiring retraining of the network to adapt to HSIs with a mismatched spectral dimension), or the model capability to extract GCS knowledge [46] (thus leading to relatively low performance as shown in Figure 1).

Fig. 1: Our QRNN3D outperforms all leading-edge methods on ICVL dataset in both Gaussian and complex noise cases. [Two scatter plots of PSNR (dB) against running time (sec, logarithmic axis). Gaussian Noise Case (Blind): BM4D [28], TDL [30], ITSReg [42], LLRT [9], HSID-CNN [46], MemNet [33], QRNN3D. Complex Noise Case (Mixture): LRMR [48], LRTV [20], NMoG [11], LRTDTV [39], HSID-CNN [46], MemNet [33], QRNN3D.]

In principle, the trade-off between the model capability

and ﬂexibility imposes a fundamental limit for real-world

applications. In this paper, we ﬁnd that combining domain

knowledge with 3D deep learning (DL) can achieve both

goals simultaneously. Unlike prior DL approaches [7], [46]

that always utilize the 2D convolution as a basic building

block of network, we introduce a novel building block namely

3D quasi-recurrent unit (QRU3D) to model HSI from a 3D

perspective. This unit contains a 3D convolutional subcom-

ponent and a quasi-recurrent pooling function [5], enabling

structural spatio-spectral correlation and GCS modeling re-

spectively. The 3D convolutional subcomponent can extract

spatio-spectral features from multiple adjacent bands, while

the quasi-recurrent pooling recurrently merges these features

over the whole spectrum, controlled by a dynamic gating

mechanism. This mechanism renders the pooling weights

to be dynamically calculated by the input features, thereby

allowing for adaptively modeling the GCS knowledge. To

eliminate the unidirectional causal dependency (Figure 4),

introduced by the vanilla recurrent structure, we furthermore

propose an alternating directional structure with no additional

computation cost.

Our network, called 3D quasi-recurrent neural network

(QRNN3D), has been designed to make full use of the

domain knowledge, especially the GCS. It significantly improves model capability and accuracy while remaining agnostic to the spectral dimension of input HSIs, and thus can be applied to any HSIs captured by unknown sensors (with different spectral resolutions). In extensive experiments, QRNN3D outperforms all leading-edge methods on several benchmark datasets under various noise settings, as shown in Figure 1.

Our main contributions are summarized as follows. We

1) present a novel building block, namely QRU3D, that can effectively exploit the domain knowledge, i.e. structural spatio-spectral correlation and global correlation along spectrum (GCS), simultaneously.

2) introduce an alternating directional structure to eliminate the unreasonable causal dependency in HSI modeling, with no additional computation cost.

3) demonstrate that our model pretrained on the ICVL dataset can be directly utilized to tackle remotely sensed imagery, which is infeasible for conventional 2D DL approaches to HSI modeling.

The remainder of this paper is organized as follows. In

Section II, we review related HSI denoising methods and DL

approaches that inspire our work. Section III introduces the

QRNN3D approach for HSI denoising. Extensive experimental results on a natural-scene HSI database and remotely sensed images are presented in Section IV, followed by further discussions that facilitate the understanding of QRNN3D in Section V. Conclusions are drawn in Section VI.

II. RELATED WORK

A. HSI Denoising

Existing methods towards HSI denoising can be roughly

classiﬁed into two categories depending on the noise model.

The most frequently used noise model is zero-mean white

and homogeneous Gaussian additive noise. Under this as-

sumption, BM4D [28], an extension of the BM3D ﬁlter

[13] to volumetric data, could be directly applied for HSI

denoising. By regarding the GCS and non-local self-similarity

in HSI simultaneously, Peng et al. proposed a tensor dictionary

learning (TDL) model [30] which achieved very promising

performance. Following this line, more sophisticated methods

have been successively proposed [8], [9], [14], [16], [19],

[41], [42], [50]. Among these methods, the low-rank tensor based models, i.e. ITSReg [42] and LLRT [9], and a new iterative projection and denoising algorithm, i.e. NG-meet [19], achieve state-of-the-art performance, owing to their elaborate modeling of the intrinsic properties of the HSI.

Besides, several works [11], [20], [39], [43], [48] aim to resolve realistic complex noise by modeling the noise with complicated non-i.i.d. statistical structures. They all frame the denoising problem as a low-rank based optimization scheme, and then utilize some constraints (e.g. total variation, $\ell_1$ norm, and nuclear norm) to remove the complex noise (e.g. non-i.i.d. Gaussian, stripe, deadline, impulse).

Recently, leveraging the power of the DL, Chang et al. [7]

extended the 2D image denoising architecture – DnCNN [49]

to remove various noise in HSIs. They argued the learned

ﬁlters can well extract the structural spatial information.

Yuan et al. [46] utilized a deep residual network to recover

the remotely sensed images under Gaussian noise, which

processed HSI with a sliding window strategy. Concurrently with our work, Dong et al. [15] proposed a 3D factorizable U-net architecture to exploit spatial-spectral correlations in HSIs from the 3D perspective. All these DL-based methods insufficiently exploit the GCS knowledge, and they cannot adjust the learned parameters to adaptively fit the input data, consequently lacking the freedom to discriminate input-dependent spatio-spectral correlations.

In this paper, we leverage the power of DL to automatically learn the mapping purely from data instead of relying on handcrafted priors and complex optimization, reaching orders-of-magnitude speedup in both Gaussian and complex noise contexts. Besides, our DL-based method can effectively exploit the underlying characteristics, namely structural spatio-spectral correlation and GCS, without sacrificing the flexibility towards HSIs with an arbitrary number of bands.

B. Deep Learning for Image Denoising

Research on gray/RGB image denoising has been dominated by discriminative learning based approaches, especially deep convolutional neural networks (CNNs), in recent years

[10], [29], [33], [49], [51], [52]. Zhang et al. [49] proposed a

modern deep architecture namely DnCNN by embedding the

batch normalization [23] and residual learning [18]. Mean-

while, Mao et al. [29] presented a very deep fully convo-

lutional encoding-decoding framework for image restoration

such as denoising and super-resolution. Both of them yielded

better Gaussian denoising results and less computation time

than the highly-engineered benchmark BM3D [13]. Along

this line, more works have been proposed to explore the

deep architecture design for image denoising. For example,

MemNet [33] introduces a memory block to exploit long-term information. The residual dense network [52] goes beyond that by building dense connections within blocks. The residual

non-local attention network [51] utilizes local and non-local

attention blocks to extract features that capture the long-range

dependencies between pixels and pay more attention to the

challenging parts.

Although all these networks can be directly extended into

the HSI case, none of them specifically considers the domain

knowledge of the HSI.

C. Deep Image Sequence Modeling

Modeling image sequence with various lengths is a fun-

damental problem in a variety of research ﬁelds such as

precipitation nowcasting, video processing, and so on.

Bidirectional recurrent convolutional networks (BRCN) [22]

and convolutional LSTM (ConvLSTM) [44] were proposed for

resolving the multi-frame super-resolution and precipitation

nowcasting problem respectively. The key insight of these

models is to replace the commonly used recurrent full connections by weight-sharing convolutional connections such that

they can greatly reduce the large number of network parame-

ters and well model the temporal dependency in a ﬁner level

(i.e. patch-based rather than frame-based). However, these patch-based operations cannot efficiently capture the spectral correlation; meanwhile, recurrently applying convolution along the spectrum would drastically increase the computational complexity. In contrast, our QRNN3D employs an elementwise recurrent mechanism, enabling good scaling to HSIs with a large number of bands. Besides, this mechanism naturally imposes a prior constraint over the spectrum, making it well-suited for extracting GCS knowledge.

TABLE I: Network configuration of our residual encoder-decoder style QRNN3D for HSI restoration.

Layer            Cout   Stride          Output size
Extractor        16     1, 1, 1         H × W × B
Encoder          16     1, 1, 1         H × W × B
Encoder          32     2, 2, 1         H/2 × W/2 × B
Encoder          32     1, 1, 1         H/2 × W/2 × B
Encoder          64     2, 2, 1         H/4 × W/4 × B
Encoder          64     1, 1, 1         H/4 × W/4 × B
Decoder          64     1, 1, 1         H/4 × W/4 × B
Decoder          32     1/2, 1/2, 1     H/2 × W/2 × B
Decoder          32     1, 1, 1         H/2 × W/2 × B
Decoder          16     1/2, 1/2, 1     H × W × B
Decoder          16     1, 1, 1         H × W × B
Reconstructor    1      1, 1, 1         H × W × B

Fig. 2: The overall architecture of our residual encoder-decoder

QRNN3D. The network contains layers of symmetric QRU3D

with convolution and deconvolution for encoder (blue) and

decoder (orange) respectively. Symmetric skip connections are

added in each layer. Besides, alternating directional structure

is equipped in all layers except the top and bottom ones with

bidirectional structure to avoid bias.

III. THE PROPOSED METHOD

An HSI degraded by additive noise can be linearly modeled as

$$Y = X + \epsilon, \tag{1}$$

where $\{Y, X, \epsilon\} \in \mathbb{R}^{H \times W \times B}$, $Y$ is the observed noisy image, $X$ is the original clean image, and $\epsilon$ denotes the additive random noise. $H$, $W$, $B$ indicate the spatial height, spatial width, and number of spectral bands respectively.

Here, we consider miscellaneous noise removal in the denoising context, where $\epsilon$ can represent different types of random noise including Gaussian noise, sparse noise (stripe, deadline and impulse) or a mixture of them. Given a noisy HSI, our goal is to obtain its noise-free counterpart.

In this section, we introduce the residual encoder-decoder QRNN3D for HSI denoising. As shown in Figure 2, our network consists of six pairs of symmetric QRU3D with convolution and deconvolution for the encoder and decoder respectively, leading to twelve layers in total. We use two layers with stride-2 convolution to downsample the input in the encoder part, and then two layers with stride 1/2 to upsample in the decoder part. The benefits of the downsampling and upsampling operations are that we can use a larger network under the same computational cost, and increase the receptive field size to make use of the context information in a larger image region. Table I lists our network configuration. Each layer contains a QRU3D with kernel size $3 \times 3 \times 3$, which is set empirically to maximize performance [35]. The stride and number of output channels ($C_{out}$) of each layer are listed, and other configuration details (e.g. padding) can be inferred implicitly.
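
The configuration in Table I can be checked with a short sketch that propagates the spatial size through the twelve layers. This is only an illustration: the names `LAYERS` and `output_shapes` are hypothetical, and it assumes padding is chosen so that stride-1 layers keep the spatial size, which is consistent with the output sizes listed in Table I.

```python
from fractions import Fraction

# (group, C_out, spatial stride) per layer, following Table I.
# The spectral stride is always 1; a stride of 1/2 denotes 2x upsampling.
LAYERS = [
    ("extractor", 16, Fraction(1)),
    ("encoder", 16, Fraction(1)),
    ("encoder", 32, Fraction(2)),
    ("encoder", 32, Fraction(1)),
    ("encoder", 64, Fraction(2)),
    ("encoder", 64, Fraction(1)),
    ("decoder", 64, Fraction(1)),
    ("decoder", 32, Fraction(1, 2)),
    ("decoder", 32, Fraction(1)),
    ("decoder", 16, Fraction(1, 2)),
    ("decoder", 16, Fraction(1)),
    ("reconstructor", 1, Fraction(1)),
]


def output_shapes(height, width, bands):
    """Return (group, C_out, H, W, B) after every layer for an H x W x B input."""
    shapes = []
    h, w = Fraction(height), Fraction(width)
    for group, c_out, stride in LAYERS:
        # Assumed "same" padding: only the stride changes the spatial size.
        h, w = h / stride, w / stride
        if h.denominator != 1 or w.denominator != 1:
            raise ValueError("H and W must be divisible by 4")
        # The number of bands is never changed by any layer.
        shapes.append((group, c_out, int(h), int(w), bands))
    return shapes
```

Calling `output_shapes(64, 64, 31)` walks a 64 × 64 × 31 training volume through the network, and the same call with a different `bands` value shows that the band count passes through untouched.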

In the following, we ﬁrst present the QRU3D, which is

the core building block in our method. Then, alternating

directional structure used to eliminate the unreasonable causal

dependency is introduced, and learning details are provided.

A. 3D Quasi-Recurrent Unit

QRU3D is the basic building block of QRNN3D. It consists of two subcomponents, i.e. a 3D convolutional subcomponent and quasi-recurrent pooling, as shown in Figure 3. Unlike the 2D convolution, neither subcomponent fixes the number of spectral bands, leaving QRNN3D free to process HSIs with an arbitrary number of bands.

3D Convolutional Subcomponent. The 3D convolutional subcomponent of QRU3D performs two sets of 3D convolutions [24], [35] with separate filter banks, producing sequences of tensors passed through different activation functions,

$$Z = \tanh(W_z * I), \qquad F = \sigma(W_f * I), \tag{2}$$

where $I \in \mathbb{R}^{C_{in} \times H \times W \times B}$ is the input feature maps coming from the previous layer (in the first layer, the input is $I = Y$ with $C_{in} = 1$); $Z \in \mathbb{R}^{C_{out} \times H \times W \times B}$ is a high dimensional candidate tensor. $F$ has the same dimension as $Z$, representing the neural forget gate that controls the behavior of dynamic memorization. Both $W_z$ and $W_f \in \mathbb{R}^{C_{out} \times C_{in} \times 3 \times 3 \times 3}$ are the 3D convolutional filter banks, $*$ denotes a 3D convolution, and $\sigma$ indicates a sigmoid non-linearity.

The 3D convolution is achieved by convolving a 3D kernel

to a whole HSI in both spatial and spectral dimensions. The

3D convolution in the spatial domain can mimic numerous

operations widely used in low-level vision (like image patch

extraction and 2D patch transform in BM3D [13], [26]) and

the 3D convolution in the spectral domain can model the

local spectrum continuity to alleviate the spectral distortion.

Consequently, the embedded 3D convolution can effectively exploit the structural spatio-spectral correlation in HSIs.
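
As an illustration of Equation (2), the following NumPy sketch computes a candidate tensor and a forget gate with a naive 3D convolution. It is not the authors' implementation: the function names are hypothetical, zero padding is assumed so that the output keeps the input size, and the "convolution" is the cross-correlation that deep learning frameworks commonly use.

```python
import numpy as np


def conv3d_same(x, weight):
    """Naive zero-padded 3D cross-correlation that preserves H, W and B.

    x:      (C_in, H, W, B) feature maps
    weight: (C_out, C_in, kh, kw, kb) filter bank with odd kernel sizes
    """
    c_out, _, kh, kw, kb = weight.shape
    _, h, w, b = x.shape
    padded = np.pad(
        x,
        ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2), (kb // 2, kb // 2)),
    )
    out = np.zeros((c_out, h, w, b))
    for i in range(kh):
        for j in range(kw):
            for k in range(kb):
                # Shifted view of the input seen by kernel tap (i, j, k).
                window = padded[:, i:i + h, j:j + w, k:k + b]
                # Contract over the input channels for this tap.
                out += np.einsum("oc,chwb->ohwb", weight[:, :, i, j, k], window)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv_subcomponent(x, w_z, w_f):
    """Return the candidate tensor Z and the forget gate F of Equation (2)."""
    z = np.tanh(conv3d_same(x, w_z))
    f = sigmoid(conv3d_same(x, w_f))
    return z, f
```

Because the filters slide along the spectral axis rather than treating bands as channels, the same `w_z` and `w_f` can be applied to inputs with 31 bands or with 103 bands without any change.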

Quasi-Recurrent Pooling. Although the 3D convolutional

subcomponent has already exploited the inter-band relation-

ship, it is computed in a local way and cannot explicitly

exploit GCS. To effectively utilize the GCS, we present quasi-

recurrent pooling, in which pooling operation and dynamic

gating mechanism are introduced.

Fig. 3: The overall structure of QRU3D. It can be described in four steps. First, the input $I$ is transformed by two sets of 3D convolutions, generating a candidate tensor $Z$ and a neural forget gate $F$. Second, $Z$ and $F$ are split along the spectrum to produce sequences of $z_b$ and $f_b$. Third, the quasi-recurrent pooling function is applied recurrently to merge the previous hidden state $h_{b-1}$ and the current candidate $z_b$, controlled by the current neural gate $f_b$, resulting in a new hidden state $h_b$. Finally, the hidden states $h_b$ are concatenated to form the whole output $H$ passed to the next layer.

In our QRU3D, the quasi-recurrent pooling is applied after the candidate tensor $Z$ and neural forget gate $F$ are obtained by the 3D convolutional subcomponent. We first split $Z$ and $F$ along the spectrum, generating sequences of $z_b$ and $f_b$ respectively, and then feed these states into a quasi-recurrent pooling function [5],

$$h_b = f_b \odot h_{b-1} + (1 - f_b) \odot z_b, \quad \forall b \in [1, B], \tag{3}$$

where $\odot$ denotes an element-wise multiplication, $h_{b-1}$ is the hidden state merged through all previous states and also represents the $(b-1)$-th band in the output of this layer, and $h_0 = 0$ with all entries equal to zero. The forget gate $f_b$ balances the weight of the current candidate $z_b$ and the previous memory, i.e. hidden state $h_{b-1}$. Its value depends on the current input $I$ instead of being fixed like a convolutional filter, so the pooling can adapt to the input image itself and does not solely rely on the parameters learned in the training stage. By this construction, the inter-band information would be accurately merged. Meanwhile, since this dynamic pooling recurrently operates across the whole spectrum, the GCS can be effectively exploited. The output feature maps $H$ are produced by concatenating all hidden states along the spectrum.
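
Unrolling Equation (3) makes the role of the forget gate explicit. The short derivation below is added as an illustration and uses only the symbols defined above; all products are elementwise and an empty product equals one.

```latex
\begin{aligned}
h_1 &= (1 - f_1) \odot z_1, \\
h_2 &= f_2 \odot (1 - f_1) \odot z_1 + (1 - f_2) \odot z_2, \\
h_b &= \sum_{k=1}^{b} \Big( \prod_{j=k+1}^{b} f_j \Big) \odot (1 - f_k) \odot z_k, \\
\sum_{k=1}^{b} \Big( \prod_{j=k+1}^{b} f_j \Big) \odot (1 - f_k) &= 1 - \prod_{j=1}^{b} f_j .
\end{aligned}
```

Each hidden state is therefore a weighted combination of all candidates up to band $b$. Since every gate entry lies in $(0, 1)$, the weights are nonnegative, sum to at most one, and are computed from the input rather than fixed after training.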

In addition, due to the independent neural gate and element-wise recurrent operations (multiplication), the QRU3D is highly parallel, enabling good scaling to HSIs with a large number of bands. More specifically, the calculation of the neural forget gate $f_b$ depends only on multiple contiguous bands of the input instead of involving the previous hidden state as in typical RNNs (e.g. LSTM [21] and GRU [12]). Meanwhile, the elementwise multiplication is far more computationally economical than the convolution used by ConvLSTM [44], and thus can easily be applied recurrently hundreds of times.
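
The point about cost can be seen in a minimal NumPy sketch of the pooling in Equation (3). The function name `quasi_recurrent_pooling` is hypothetical. The gates `f` are assumed to be already available for all bands, so the loop over bands contains only elementwise multiplications and additions.

```python
import numpy as np


def quasi_recurrent_pooling(z, f):
    """Apply h_b = f_b * h_{b-1} + (1 - f_b) * z_b along the last (band) axis.

    z, f: float arrays of shape (C, H, W, B); f holds gate values in [0, 1].
    Returns H with the same shape, starting from h_0 = 0.
    """
    bands = z.shape[-1]
    h = np.zeros_like(z)
    prev = np.zeros(z.shape[:-1], dtype=z.dtype)  # h_0 = 0
    for b in range(bands):
        # Only elementwise operations occur inside the recurrence.
        prev = f[..., b] * prev + (1.0 - f[..., b]) * z[..., b]
        h[..., b] = prev
    return h
```

With a constant gate of 0.5 and candidates equal to one, the hidden states are 0.5, 0.75, 0.875 and so on, which matches the unrolled form of Equation (3). The function never refers to a fixed band count, so it accepts any `B`.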

B. Alternating Directional Structure

A forward 3D quasi-recurrent unit, as in Equation (3), reads the candidate tensors $z_b$ in order from the first $z_1$ to the last $z_B$, so that a hidden state $h_b$ only depends on $z_b$ and the candidates before it (and their corresponding bands). This introduces a causal dependency, since the computing stream of the hidden state propagates unidirectionally as shown in Figure 4(a), which is not reasonable for the HSI.

Fig. 4: Directional structure overview. (a) Unidirectional structure: hidden states propagate unidirectionally. (b) Bidirectional structure: one layer contains two sublayers which propagate states in opposite directions, generating results by adding the sublayers' outputs. (c) Our proposed alternating directional structure: the direction of the network changes in each layer.

Fig. 5: Synthesized RGB image samples from ICVL dataset.

A typical solution is to use a bidirectional structure [4], [22], [32], in which a layer of the network contains two sublayers, i.e. a forward QRU3D and a backward QRU3D in our case, as shown in Figure 4(b). The forward QRU3D reads the candidate tensor sequence in order and calculates a sequence of forward hidden states. The backward QRU3D reads the sequence in reverse order, leading to a sequence of backward hidden states. The output of this layer is calculated by adding the forward and backward hidden states elementwise. However, this structure makes the computational burden unacceptable, because it nearly doubles the memory consumption.

To ease this issue, we present an alternating directional structure for HSIs. Specifically, a QRNN3D with alternating directional structure changes the direction of the computing stream of the hidden state in each layer, as shown in Figure 4(c). This structure is built by alternately stacking forward and backward QRU3D, in which a forward (or backward) state is merged by a backward (or forward) state in the next layer, such that the global context information can be propagated through the whole spectrum.

Compared with the typical solution by bidirectional struc-

ture, our proposed alternating directional structure almost adds

no additional computation cost, while keeping the ability

to model the dependency from whole spectrum of an HSI

regardless of the position of the output.
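
The effect of alternating the direction can be illustrated with a small NumPy experiment. It is a simplification rather than the actual network: the gates are fixed to 0.5, the 3D convolutions between layers are omitted, and the hidden states of one layer are fed directly as candidates to the next. The names `pool`, `stack` and `dependency` are hypothetical.

```python
import numpy as np


def pool(z, f, reverse=False):
    """Equation (3) along the last axis; reverse=True scans from band B to band 1."""
    bands = z.shape[-1]
    order = range(bands - 1, -1, -1) if reverse else range(bands)
    h = np.zeros_like(z)
    prev = np.zeros(z.shape[:-1], dtype=z.dtype)
    for b in order:
        prev = f[..., b] * prev + (1.0 - f[..., b]) * z[..., b]
        h[..., b] = prev
    return h


def stack(z, f, directions):
    """Apply one pooling layer per entry of `directions` (False = forward)."""
    h = z
    for reverse in directions:
        h = pool(h, f, reverse=reverse)
    return h


def dependency(directions, bands=6):
    """dep[b, k] is True when output band b reacts to a change in input band k."""
    f = np.full((1, bands), 0.5)  # fixed gates, for illustration only
    base = stack(np.zeros((1, bands)), f, directions)
    dep = np.zeros((bands, bands), dtype=bool)
    for k in range(bands):
        z = np.zeros((1, bands))
        z[0, k] = 1.0  # perturb a single input band
        dep[:, k] = np.abs(stack(z, f, directions) - base)[0] > 0
    return dep
```

`dependency([False, False])` gives a lower-triangular pattern, meaning two forward layers still see only the current and earlier bands. `dependency([False, True])` is true everywhere, so one forward layer followed by one backward layer already lets every output band depend on the whole spectrum, using the same number of pooling passes as the unidirectional stack.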

IV. EXPERIMENTAL RESULTS

A. Experimental settings

Benchmark Datasets. We conduct several experiments

using data from ICVL hyperspectral dataset [3], where 201

images were collected at 1392 ×1300 spatial resolution over

31 spectral bands. The simulated pseudo color image samples

from this dataset are illustrated in Figure 5. We use 100

images for training, 5 images for validation, while others

are for testing. To enlarge the training set, we crop multiple

overlapped volumes from training HSIs and then regard each

volume as a training sample. During cropping, each volume

has a spatial size of 64 ×64 and a spectral size of 31 for the

purpose of preserving the complete spectrum of an HSI. Data

augmentation schemes such as rotation and scaling are also

employed, resulting in roughly 50k training samples in total.

As for the testing set, we crop the main region of each image with size of 512 × 512 × 31 given the computation cost¹.
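
The cropping described above can be sketched as follows. The stride between neighbouring volumes is not stated in the text, so the value of 32 is an assumption, as is taking the "main region" to be the central crop. The function names are hypothetical.

```python
import numpy as np


def crop_volumes(hsi, size=64, stride=32):
    """Cut overlapping size x size patches that keep the complete spectrum.

    hsi: (H, W, B) array. Returns an array of shape (N, size, size, B).
    The stride (and hence the overlap) is an assumed value.
    """
    height, width, _ = hsi.shape
    patches = [
        hsi[top:top + size, left:left + size, :]
        for top in range(0, height - size + 1, stride)
        for left in range(0, width - size + 1, stride)
    ]
    return np.stack(patches)


def center_crop(hsi, size=512):
    """Take a size x size region around the image centre (assumed 'main region')."""
    height, width, _ = hsi.shape
    top, left = (height - size) // 2, (width - size) // 2
    return hsi[top:top + size, left:left + size, :]
```

Rotation and scaling augmentation would then be applied to the cropped volumes; they are left out here because their exact parameters are not given.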

Besides, we evaluate the robustness and flexibility of our model on remotely sensed hyperspectral datasets including Pavia Centre, Pavia University, Indian Pines and Urban. Pavia Centre and Pavia University were acquired by the ROSIS sensor; the number of spectral bands is 102 for Pavia Centre and 103 for Pavia University. Indian Pines and Urban were gathered by the 224-band AVIRIS sensor and the 210-band HYDICE hyperspectral system respectively. Both of them have been used for real HSI denoising experiments [9], [20], [39].

¹ It is unwieldy to evaluate large images with some competing methods, though not with ours; see Figure 1 for more detail.

Noise settings. Real-world HSIs are usually contaminated by several different types of noise, including the most common Gaussian noise, impulse noise, dead pixels or lines, and stripes [11], [17], [48]. We define five types of complex noise as follows, referred to as Cases 1-5 respectively.

Case 1: Non-i.i.d. Gaussian noise. Entries in all bands are corrupted by zero-mean Gaussian noise with different intensities, randomly selected from 10 to 70.

Case 2: Gaussian + Stripe noise. All bands are corrupted by non-i.i.d. Gaussian noise as in Case 1. One third of the bands (10 bands for the ICVL dataset) are randomly chosen to add stripe noise (affecting 5% to 15% of the columns).

Case 3: Gaussian + Deadline noise. The noise generation process is nearly the same as in Case 2 except that the stripe noise is replaced by deadline noise.

Case 4: Gaussian + Impulse noise. Each band is contaminated by Gaussian noise as in Case 1. One third of the bands are randomly selected to add impulse noise with intensity ranging from 10% to 70%.

Case 5: Mixture noise. Each band is randomly corrupted by at least one kind of noise mentioned in Cases 1-4.
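
A rough NumPy sketch of how Cases 1-4 could be simulated is given below. Several details are assumptions, since the text does not fix them: images are taken to lie in [0, 1] with noise levels given on a 0-255 scale, stripes are modelled as a constant offset per column with a hypothetical `amplitude`, deadlines set whole columns to zero, and impulse noise is salt-and-pepper. All function names are hypothetical.

```python
import numpy as np


def add_noniid_gaussian(x, rng, sigma_range=(10, 70)):
    """Case 1: zero-mean Gaussian noise with a different sigma for each band."""
    bands = x.shape[-1]
    sigmas = rng.uniform(*sigma_range, size=bands) / 255.0  # assumed 0-255 scale
    return x + rng.standard_normal(x.shape) * sigmas, sigmas


def pick_bands(bands, rng):
    """Randomly choose one third of the bands (10 of 31 for ICVL)."""
    return rng.choice(bands, size=bands // 3, replace=False)


def pick_columns(width, rng, min_ratio=0.05, max_ratio=0.15):
    """Randomly choose 5% to 15% of the columns."""
    n_cols = int(rng.uniform(min_ratio, max_ratio) * width)
    return rng.choice(width, size=n_cols, replace=False)


def add_stripes(y, rng, amplitude=0.25):
    """Case 2 extra: add a constant offset to the chosen columns (assumed model)."""
    y = y.copy()
    _, width, bands = y.shape
    for b in pick_bands(bands, rng):
        cols = pick_columns(width, rng)
        y[:, cols, b] += rng.uniform(-amplitude, amplitude, size=len(cols))
    return y


def add_deadlines(y, rng):
    """Case 3 extra: set the chosen columns to zero (assumed model)."""
    y = y.copy()
    _, width, bands = y.shape
    for b in pick_bands(bands, rng):
        y[:, pick_columns(width, rng), b] = 0.0
    return y


def add_impulse(y, rng, min_ratio=0.1, max_ratio=0.7):
    """Case 4 extra: salt-and-pepper noise on 10% to 70% of the pixels."""
    y = y.copy()
    height, width, bands = y.shape
    for b in pick_bands(bands, rng):
        mask = rng.random((height, width)) < rng.uniform(min_ratio, max_ratio)
        salt = rng.random((height, width)) < 0.5
        band = y[:, :, b]  # view into y, so the assignments modify y
        band[mask & salt] = 1.0
        band[mask & ~salt] = 0.0
    return y
```

For example, a Case 2 sample would be `add_stripes(add_noniid_gaussian(x, rng)[0], rng)` with `rng = np.random.default_rng()`, and a mixture sample for Case 5 could chain several of these functions.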

Competing Methods. We compare our method against both traditional and DL methods in both Gaussian and complex noise cases. In general, each traditional method is best suited to a specific noise setting, depending on its noise assumption, while DL methods can be applied in various noise settings by training multiple models to tackle miscellaneous noises. For the sake of fairness, we adopt different traditional baselines in these two noise contexts, given their noise assumptions.

In Gaussian noise case, we compare with several represen-

tative traditional methods including ﬁltering-based approaches

(BM4D [28]), dictionary learning approach (TDL [30]), and

tensor-based approaches (ITSReg [42], LLRT [9]). In complex

noise case, the competing traditional baselines include low-

rank matrix recovery approaches (LRMR [48], LRTV [20],

NMoG [11]), and low-rank tensor approach (TDTV [39]).

For DL approaches, we compare our model with HSID-CNN [46]. Besides, any DL method for single image denoising can be extended to the HSI denoising case (by modifying the first layer to adapt to the HSI, i.e. changing $C_{in}$ from 3 to 31). For completeness, we also compare with such a state-of-the-art 2D DL approach, i.e. MemNet [33] with $C_{in} = 31$ in the first layer, which entails a fixed number of spectral bands. Since the training setting differs between ours and other DL approaches, we finetune/retrain their pretrained models with our well-designed training strategy to achieve better performance on our dataset.

Fig. 6: Simulated Gaussian noise removal results of PSNR (dB) at 20th band of image under noise level $\sigma = 50$ on ICVL dataset. (Best view on screen with zoom) [Image panels with PSNR in parentheses: (a) Noisy (14.17), (b) BM4D (33.00), (c) TDL (35.11), (d) ITSReg (36.09), (e) LLRT (36.08), (f) HSID-CNN (35.22), (g) MemNet (36.29), (h) Ours (36.73).]

Fig. 7: Simulated complex noise removal results on ICVL dataset. Examples for non-i.i.d Gaussian noise, Gaussian + stripes, Gaussian + deadline, Gaussian + impulse and mixture noise removal (Cases 1-5) are presented respectively. (Best view on screen with zoom) [Image grid with one row per case (Case 1 to Case 5) and columns Noisy, LRMR [48], LRTV [20], NMoG [11], TDTV [39], HSID-CNN [46], MemNet [33], Ours.]

Fig. 8: PSNR values across the spectrum corresponding to Gaussian and complex noise removal results in Figures 6 and 7 respectively. [Six plots of per-band PSNR with a horizontal axis running from 400 to 700: (a) i.i.d. Gaussian ($\sigma = 50$), comparing BM4D, TDL, ITSReg, LLRT, HSID-CNN, MemNet and QRNN3D; (b) Non-i.i.d. Gaussian (Case 1), (c) Gaussian + Stripe (Case 2), (d) Gaussian + Deadline (Case 3), (e) Gaussian + Impulse (Case 4) and (f) Mixture (Case 5), each comparing LRMR, LRTV, NMoG, LRTDTV, HSID-CNN, MemNet and QRNN3D.]

Network learning. We develop an incremental training policy to stabilize and accelerate the training, which also prevents the network from converging to a poor local minimum. The philosophy of our training policy is simple: learning to solve tasks in an easy-to-difficult way [1]. Networks are learned by minimizing the mean square error (MSE) between the predicted high-quality HSI and the ground truth. The network parameters are initialized as in [17], and optimized using the ADAM optimizer [25] with the deep learning framework PyTorch (https://pytorch.org/) on a machine with an NVIDIA GTX 1080Ti GPU, an Intel(R) Core(TM) i7-7700K CPU at 4.2GHz and 16 GB RAM. Rather than training networks independently to tackle several different types of noise separately, we simply train two models, for the Gaussian and complex noise cases respectively. Our network learning goes through three stages, from the easy task of Gaussian denoising with a fixed noise level, to the difficult one of complex noise removal. The models are trained incrementally, reusing the prior state (pretrained parameters) to maximize the training efficiency (see discussions in Section V-A). We follow the previous image restoration work [29] to choose the hyper-parameters of the learning algorithm. These values were empirically set to make network learning fast yet stable. Specifically, the learning rate is initialized at $10^{-3}$ and decayed at epochs where the validation performance no longer increases. A small batch size (i.e. 16) is used to accelerate training in the first stage, while a large batch size (i.e. 64) is adopted to stabilize training when tackling harder cases (e.g. the complex noise case). The overview of our training procedure is shown in Table II, with detailed hyper-parameter settings.

TABLE II: Overview of our incremental training policy. Our network learning goes through three stages, from the easy task of Gaussian denoising with fixed noise level, to the difficult one of complex noise removal. In our implementation, the fixed noise level $\sigma$ in stage 1 is set to 50. The unknown $\sigma$ in stage 2 is uniformly sampled from 30 to 70. Unknown complex noise in stage 3 denotes the complex noise randomly chosen from Case 1 to 4 (without Case 5: mixture noise). The models trained at the end of stage 2 (epoch 50) and 3 (epoch 100) are used in Gaussian denoising and complex noise removal tasks respectively.

Stage           1                               2                                 3
Noise model     Gaussian noise with known σ     Gaussian noise with unknown σ     Unknown complex noise
Epoch           0∼20, 20∼30                     30∼35, 35∼45, 45∼50               50∼85, 85∼95, 95∼100
Learning rate   10^-3, 10^-4                    10^-3, 10^-4, 10^-5               10^-3, 10^-4, 10^-5
Batch size      16                              64                                64
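
The schedule in Table II can be written down as a small lookup, sketched below. The names are hypothetical, the epoch ranges are treated as half-open intervals, and assigning batch size 64 to stage 2 as well as stage 3 is a reading of the flattened table together with the statement that 16 is used in the first stage.

```python
# (first_epoch, end_epoch_exclusive, stage, learning_rate), following Table II.
SCHEDULE = [
    (0, 20, 1, 1e-3), (20, 30, 1, 1e-4),
    (30, 35, 2, 1e-3), (35, 45, 2, 1e-4), (45, 50, 2, 1e-5),
    (50, 85, 3, 1e-3), (85, 95, 3, 1e-4), (95, 100, 3, 1e-5),
]

NOISE_MODEL = {
    1: "Gaussian noise, sigma = 50",
    2: "Gaussian noise, sigma sampled uniformly from 30 to 70",
    3: "complex noise, randomly chosen from Case 1 to 4",
}

# Assumed reading of Table II: 16 in stage 1, 64 afterwards.
BATCH_SIZE = {1: 16, 2: 64, 3: 64}


def training_config(epoch):
    """Return the stage, learning rate, batch size and noise model for an epoch."""
    for start, stop, stage, lr in SCHEDULE:
        if start <= epoch < stop:
            return {
                "stage": stage,
                "lr": lr,
                "batch_size": BATCH_SIZE[stage],
                "noise": NOISE_MODEL[stage],
            }
    raise ValueError("epoch must lie in [0, 100)")
```

The learning rate restarts at $10^{-3}$ whenever a new stage begins (epochs 30 and 50), which is where the task becomes harder while the previously learned parameters are reused.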

Quantitative Metrics. To give an overall evaluation, three

quantitative quality indices are employed, i.e. PSNR, SSIM

[40], and SAM [47]. PSNR and SSIM are two conventional

spatial-based indexes, while SAM is spectral-based. Larger

values of PSNR and SSIM imply better performance, while

a smaller value of SAM suggests better performance.
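
Two of these indices are simple enough to sketch in NumPy. This is an illustration under stated assumptions: images lie in [0, `peak`], PSNR is averaged over bands, and SAM is reported in radians. The text does not specify these conventions, and SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np


def psnr(clean, restored, peak=1.0):
    """PSNR in dB, computed per band and averaged over bands (assumed convention).

    clean, restored: (H, W, B) arrays with values in [0, peak].
    """
    mse = np.mean((clean - restored) ** 2, axis=(0, 1))
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))


def sam(clean, restored, eps=1e-12):
    """Mean spectral angle in radians between the spectra at each pixel."""
    dot = np.sum(clean * restored, axis=-1)
    norms = np.linalg.norm(clean, axis=-1) * np.linalg.norm(restored, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))
```

A uniform error of 0.1 on a [0, 1] image gives a PSNR of 20 dB, and scaling every spectrum by a constant leaves SAM at essentially zero, which shows why SAM measures spectral shape rather than intensity.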

B. Experiments on ICVL Dataset

Denoising in Gaussian Noise Case. Zero-mean additive white Gaussian noise with different variances is added to generate the noisy observations. The model trained at the end of stage 2 (epoch 50) is used to tackle all levels of corruption³. Figure 6 shows the denoising results under noise level $\sigma = 50$. It can be observed that our method properly removes the Gaussian noise while finely preserving the structure underlying the HSI. Traditional methods like BM4D and TDL introduce evident artifacts in some areas. Other methods suppress the noise better, but still lose some fine-grained details and produce relatively low-quality results compared with ours. The quantitative assessment results are listed in Table III. Compared with all competing methods, the QRNN3D achieves better performance in most qualitative/quantitative assessments, further confirming the high fidelity of our method.

³ We do not train multiple networks to tackle different noise intensities respectively. Instead, only one single network is trained using training samples with various noise intensities.

Denoising in Complex Noise Case. Five types of complex noise are added to generate noisy samples. In brief, cases 1-5 represent non-i.i.d. Gaussian noise, Gaussian + stripe noise, Gaussian + deadline noise, Gaussian + impulse noise, and a mixture of them, respectively (see Section IV-A for more details). As in the Gaussian noise case, a single model trained at the end of stage 3 (epoch 100) is used to deal with cases 1-5 simultaneously. It is worth noting that each sample in our training set is corrupted by one of the noise types (i.e., cases 1-4), while in case 5 each testing sample suffers from multiple types of noise, a combination not contained in the training set. We show the qualitative and quantitative results in Figure 7 and Table IV, respectively. Our QRNN3D outperforms the other methods in PSNR and SSIM in all five cases, and in SAM in all cases except case 4. Furthermore, the results in the mixture noise case exhibit the strong generalization of our model, since the mixture noise is not seen by our model in the training stage.
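The corruption types above can be sketched as follows. This is a rough illustration under stated assumptions, not the protocol of Section IV-A: all function names, the fraction of affected bands and columns, the stripe amplitude, the impulse ratio, and the per-band σ range are hypothetical placeholders chosen only to make the example run.

```python
import random


def _pick(rng, count, ratio):
    """Pick a random subset (at least one index) covering `ratio` of `count` items."""
    return rng.sample(range(count), max(1, round(ratio * count)))


def non_iid_gaussian(hsi, rng, sigma_range=(10.0, 70.0), peak=255.0):
    """Case 1 style: zero-mean Gaussian noise whose level differs from band to band."""
    noisy = []
    for band in hsi:
        std = rng.uniform(*sigma_range) / peak
        noisy.append([[v + rng.gauss(0.0, std) for v in row] for row in band])
    return noisy


def add_stripes(hsi, rng, band_ratio=1 / 3, column_ratio=0.1, amplitude=0.25):
    """Stripe noise: a constant offset on random columns of random bands."""
    out = [[row[:] for row in band] for band in hsi]
    for b in _pick(rng, len(out), band_ratio):
        for c in _pick(rng, len(out[b][0]), column_ratio):
            offset = rng.uniform(-amplitude, amplitude)
            for row in out[b]:
                row[c] += offset
    return out


def add_deadlines(hsi, rng, band_ratio=1 / 3, column_ratio=0.1):
    """Deadline noise: random columns of random bands are zeroed out."""
    out = [[row[:] for row in band] for band in hsi]
    for b in _pick(rng, len(out), band_ratio):
        for c in _pick(rng, len(out[b][0]), column_ratio):
            for row in out[b]:
                row[c] = 0.0
    return out


def add_impulse(hsi, rng, band_ratio=1 / 3, pixel_ratio=0.2):
    """Impulse noise: random pixels of random bands are set to 0 or 1."""
    out = [[row[:] for row in band] for band in hsi]
    for b in _pick(rng, len(out), band_ratio):
        for row in out[b]:
            for c in range(len(row)):
                if rng.random() < pixel_ratio:
                    row[c] = rng.choice((0.0, 1.0))
    return out


def mixture(hsi, rng):
    """Case 5 style: every corruption type applied to the same sample."""
    out = non_iid_gaussian(hsi, rng)
    for corrupt in (add_stripes, add_deadlines, add_impulse):
        out = corrupt(out, rng)
    return out


rng = random.Random(0)
clean = [[[0.5] * 16 for _ in range(16)] for _ in range(6)]          # 6 bands of 16 x 16
training_sample = add_deadlines(non_iid_gaussian(clean, rng), rng)   # one type (case 3 style)
testing_sample = mixture(clean, rng)                                 # all types (case 5 style)
```

The split mirrors the setting described above: training samples carry a single corruption type on top of the Gaussian noise, while the mixture is reserved for testing.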

In Figure 7, the observed images are corrupted by miscellaneous complex noises. Low-rank matrix recovery methods, i.e., LRMR and LRTV, which assume that the clean HSI lies in a low-rank subspace from the spectral perspective, successfully remove a large amount of noise, but at the cost of losing fine details. Our QRNN3D eliminates miscellaneous noises to a great extent, while preserving the fine-grained structure of the original image (e.g., the texture of the road in the second photo of Figure 7) more faithfully than the top-performing traditional low-rank tensor approach TDTV and other DL methods. Figure 8 shows the PSNR value of each band in these HSIs. It can be seen that the PSNR values of all bands obtained by our QRNN3D are higher than those of the compared methods.

C. Experiments on Remotely Sensed Images

Synthetic Data. Here, we conduct experiments on Pavia University in the mixture noise case. Given the similarity between Pavia Centre and Pavia University, the model is first trained from scratch only on Pavia Centre. It can be seen that our train-from-scratch model (Ours-S in Table V) performs undesirably,


Fig. 9: Simulated complex noise removal results at the 10th band of an image in case 5 (mixture noise) on the Pavia University dataset, with PSNR (dB) in parentheses: (a) Noisy (13.54), (b) LRMR (26.35), (c) LRTV (25.93), (d) NMoG (28.90), (e) LRTDTV (30.06), (f) HSID-CNN (30.14), (g) Ours-S (29.64), (h) Ours-P (31.50), (i) Ours-F (34.32), (j) Clean (+∞). (Best viewed on screen with zoom.)

TABLE III: Quantitative results of different methods under several noise levels on the ICVL dataset. "Blind" indicates that each sample is corrupted by Gaussian noise with unknown σ (ranging from 30 to 70).

Sigma  Index  Noisy  BM4D [28]  TDL [30]  ITSReg [42]  LLRT [9]  HSID-CNN [46]  MemNet [33]  Ours
30     PSNR   18.59  38.45      40.58     41.48        41.99     38.70          41.45        42.28
30     SSIM   0.110  0.934      0.957     0.961        0.967     0.949          0.972        0.973
30     SAM    0.807  0.126      0.062     0.088        0.056     0.103          0.065        0.061
50     PSNR   14.15  35.60      38.01     38.88        38.99     36.17          39.76        40.23
50     SSIM   0.046  0.889      0.932     0.941        0.945     0.919          0.960        0.961
50     SAM    0.991  0.169      0.085     0.098        0.075     0.134          0.076        0.072
70     PSNR   11.23  33.70      36.36     36.71        37.36     34.31          38.37        38.57
70     SSIM   0.025  0.845      0.909     0.923        0.930     0.886          0.946        0.945
70     SAM    1.105  0.207      0.105     0.112        0.087     0.161          0.088        0.087
Blind  PSNR   17.34  37.66      39.91     40.62        40.97     37.80          40.70        41.50
Blind  SSIM   0.114  0.914      0.946     0.953        0.956     0.935          0.966        0.967
Blind  SAM    0.859  0.143      0.072     0.087        0.064     0.116          0.070        0.066

TABLE IV: Quantitative results of different methods in five complex noise cases on the ICVL dataset.

Case  Index  Noisy  LRMR [48]  LRTV [20]  NMoG [11]  TDTV [39]  HSID-CNN [46]  MemNet [33]  Ours
1     PSNR   18.25  32.80      33.62      34.51      38.14      38.40          38.94        42.79
1     SSIM   0.168  0.719      0.905      0.812      0.944      0.947          0.949        0.978
1     SAM    0.898  0.185      0.077      0.187      0.075      0.095          0.091        0.052
2     PSNR   17.80  32.62      33.49      33.87      37.67      37.77          38.57        42.35
2     SSIM   0.159  0.717      0.905      0.799      0.940      0.942          0.945        0.976
2     SAM    0.910  0.187      0.078      0.265      0.081      0.104          0.095        0.055
3     PSNR   17.61  31.83      32.37      32.87      36.15      37.65          38.15        42.23
3     SSIM   0.155  0.709      0.895      0.797      0.930      0.940          0.945        0.976
3     SAM    0.917  0.227      0.115      0.276      0.099      0.102          0.096        0.056
4     PSNR   14.80  29.70      31.56      28.60      36.67      35.00          35.93        39.23
4     SSIM   0.114  0.623      0.871      0.652      0.935      0.899          0.907        0.945
4     SAM    0.926  0.311      0.242      0.486      0.094      0.174          0.126        0.109
5     PSNR   14.08  28.68      30.47      27.31      34.77      34.05          35.16        38.25
5     SSIM   0.099  0.608      0.858      0.632      0.919      0.888          0.903        0.938
5     SAM    0.944  0.353      0.287      0.513      0.113      0.181          0.130        0.107


Fig. 10: Real-world unknown noise removal results at the 2nd band of an image on the AVIRIS Indian Pines dataset. Panels: (a) Noisy, (b) BM4D, (c) TDL, (d) ITSReg, (e) LLRT, (f) LRMR, (g) LRTV, (h) NMoG, (i) TDTV, (j) HSID-CNN, (k) Ours. (Best viewed on screen with zoom.)

Fig. 11: Real-world unknown noise removal results at the 107th band of an image on the HYDICE Urban dataset. Panels: (a) Noisy, (b) BM4D, (c) TDL, (d) ITSReg, (e) LLRT, (f) LRMR, (g) LRTV, (h) NMoG, (i) TDTV, (j) HSID-CNN, (k) Ours. (Best viewed on screen with zoom.)

TABLE V: Quantitative results of different methods in the mixture noise case on the Pavia University dataset. "Ours-S" is our trained-from-scratch model, which is trained only on the Pavia Centre dataset; "Ours-P" denotes our pretrained model, which is trained only on the ICVL dataset; "Ours-F" indicates our fine-tuned model, which is pretrained on the ICVL dataset and then fine-tuned on the Pavia Centre dataset.

Index  Noisy  LRMR [48]  LRTV [20]  NMoG [11]  TDTV [39]  HSID-CNN [46]  Ours-S  Ours-P  Ours-F
PSNR   13.54  26.35      25.93      28.90      30.06      30.14          29.64   31.50   34.32
SSIM   0.161  0.660      0.676      0.781      0.819      0.805          0.892   0.866   0.925
SAM    0.896  0.406      0.359      0.388      0.239      0.142          0.166   0.127   0.093

even compared with the traditional method TDTV (29.64 vs. 30.06 dB in PSNR).

Nevertheless, our method utilizes QRU3D, which allows it to be naturally applied to input data with an arbitrary number of bands. On the basis of this flexibility, we directly apply our model pretrained on the ICVL dataset (in the complex noise case) to Pavia University. Although Pavia University is recorded with a spectral curve totally distinct from that of the ICVL dataset, our model, called Ours-P, performs much better than all compared methods (footnote 4), which strongly verifies the robustness of our method.

Footnote 4: The result of HSID-CNN is also obtained with its model pretrained on the ICVL dataset under the complex noise case. The learned MemNet cannot be applied to data with a different number of bands, so its results are not provided in Table V.

TABLE VI: Ablations on ICVL HSI Gaussian denoising (under noise level σ = 50). We evaluate the results by PSNR (dB), running time (sec) and the number of parameters (Params) of these networks. All running times are measured on an Nvidia GTX 1080Ti by processing an HSI of size 512 × 512 × 31. The direction of the network is denoted by initials, i.e., U: unidirectional; B: bidirectional; A: alternating directional. Our benchmark network is marked with *. The results of MemNet are also provided as an additional reference.

Model    PSNR (dB)  Time (s)  Params (#)
MemNet   39.76      0.88      2.94M
QRU2D    38.63      0.60      0.29M
WQRU2D   39.82      1.16      0.88M
C3D      36.83      0.56      0.43M
WC3D     40.00      0.93      1.72M
QRU3D*   40.23      0.74      0.86M
U        40.07      0.75      0.86M
B        40.26      1.26      1.72M
A*       40.23      0.74      0.86M

Furthermore, we employ a small number of samples from Pavia Centre to fine-tune the model learned only from the ICVL dataset. This fine-tuned model (Ours-F in Table V) significantly boosts the performance. The visual comparison is provided in Figure 9. Interestingly, Gaussian-like residuals are still visible in the result of the Ours-S model, while the Ours-P model suffers from stripes. The Ours-F model combines the strengths of the two models, yielding a clear and clean result. This seems to indicate that the knowledge from the ICVL dataset is complementary to that from the Pavia Centre dataset, so that the transfer learning enabled by this flexibility brings great benefits in performance.

Real-world Noisy Data. We also verify our model on the real-world noisy HSIs Indian Pines and Urban, for which no corresponding ground truth is available. It can be observed in Figure 10 and Figure 11 that severe atmospheric and water absorption obstructs the view of the real scene, severely degrading the quality of the images. The Gaussian denoising methods, e.g., BM4D and TDL, cannot accurately estimate the underlying clean image due to the non-Gaussian noise structure. Our QRNN3D successfully tackles this unknown noise and produces sharper and clearer results than the others, consistently demonstrating the robustness and flexibility of our model.

V. DISCUSSION AND ANALYSIS

In this section, we provide a broad discussion and analysis of QRNN3D to facilitate understanding of where its performance comes from. We first demonstrate the efficacy of our incremental training policy, then analyze the functionality of each network component in QRNN3D (i.e., 3D convolution, quasi-recurrent pooling, and the alternating directional structure). The selection of network hyper-parameters follows. Finally, the visualization method (and results) of GCS knowledge in QRU3D are presented.

Fig. 12: Average training loss (left; MSE on a logarithmic axis from 10^-4 to 10^-2) and validation PSNR (right; axis from 28 to 38) of QRNN3D for complex noise removal, both plotted over epochs 0 to 100. We show the results of the model trained from scratch and of the one that reuses the parameters pretrained on Gaussian denoising (incremental training).

A. Efﬁcacy of Incremental Training Policy

The key idea of our training policy lies in the fact that knowledge can be efficiently learned in an easy-to-difficult way [1]. Our training policy enables reusing previously learned knowledge (pretrained parameters), which significantly stabilizes and accelerates the whole training process. As an example, we show the optimization curves with and without reusing the pretrained parameters when training the model in the complex noise case. As shown in Figure 12, training from scratch makes the optimization slow and unstable and leads it to converge to a poor local minimum, in contrast to training with a good initialization under our incremental training policy.

B. Component Analysis in QRNN3D

To thoroughly verify the functionality of each component in our QRNN3D, comprehensive ablation experiments are conducted on the HSI Gaussian denoising task on the ICVL dataset. We focus on the components associated with HSI modeling and domain knowledge embedding, and study the best trade-off between performance and computational burden. The evaluation measures include PSNR, running time and the total number of network parameters.

We choose our encoder-decoder QRNN3D as the benchmark. For a fair comparison, the same network architecture is used except for the modification of the investigated component. Ablation results are presented in Table VI and analyzed in the following.

Subcomponents Investigation. Table VI investigates the effect of the subcomponents (i.e., 3D convolution and the quasi-recurrent pooling function) in QRU3D, the basic building block of our QRNN3D. In the experiments, four variants of this basic block are tested, i.e., QRU2D, WQRU2D, C3D and WC3D.

QRU2D is instantiated by replacing the 3D convolution with a 2D convolution (implemented by simply setting the kernel size to 3×3×1). A drastic performance loss (i.e., -1.6 dB) can be observed in Table VI, meaning that ignoring the structural spectral correlation severely impairs the model capacity.
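The effect of the kernel size on model size can be illustrated with a short parameter count. This is a sketch under assumptions: the helper name `conv_params` and the channel count of 16 are illustrative and are not taken from the network definition.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters of a convolution with the given kernel size."""
    k_h, k_w, k_b = kernel  # spatial height, spatial width, spectral extent
    return c_in * c_out * k_h * k_w * k_b + (c_out if bias else 0)


channels = 16  # illustrative channel count, an assumption
p3d = conv_params(channels, channels, (3, 3, 3), bias=False)  # spans 3 adjacent bands
p2d = conv_params(channels, channels, (3, 3, 1), bias=False)  # sees one band at a time
print(p3d, p2d, p3d / p2d)  # 6912 2304 3.0
```

Shrinking the spectral extent from 3 to 1 divides the weight count of each convolution by three, which is in line with the parameter totals in Table VI (0.86M for QRU3D versus 0.29M for QRU2D, a ratio of about 2.97).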

WQRU2D is a wider QRU2D model whose number of parameters is comparable to that of QRU3D. Nevertheless, it can be observed that QRU3D still outperforms WQRU2D, even at a lower computation cost, which suggests the higher efficiency of 3D convolution over the 2D approach for HSI modeling.

Fig. 13: (a) The captured GCS in a bidirectional QRU3D layer (axes: band $i$ versus band $j$; annotated "Relative Region"). (b) The number of relative bands for the output of each band (axes: band index versus number of relative bands; separate curves for the forward and backward directions). Band $i$ is defined as a "relative band" for band $j$ if discarding it would produce at least a 10% perturbation to the output (i.e., $\mathrm{GCS}_{ij} \ge 0.1\,\|\mathbf{1}\|_F$, where $\mathbf{1}$ has the same size as $h_j$ with all entries equal to 1). Forward/Backward denotes the direction of dependency (i.e., $i < j$ for the forward direction). (c) The empirical distribution of the number of relative bands.

C3D is constructed by removing the quasi-recurrent pooling (and the associated neural gates), which yields a plain residual encoder-decoder 3D convolutional neural network. We find that the lack of a mechanism to model the GCS degrades the performance by a large margin (-3.4 dB).

WC3D is a wider C3D model with more parameters (four times as many as the C3D model). It can be seen that the PSNR of QRU3D is 40.23 dB, higher than WC3D's 40.00 dB. This suggests that the improvement from quasi-recurrent pooling is not simply because it adds width to the C3D model. Besides, QRU3D has only ∼50% of the parameters and ∼80% of the running time of the WC3D model, and is also narrower. This comparison shows that the improvement from quasi-recurrent pooling is complementary to going wider in standard ways.
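The ratios quoted in this comparison follow directly from Table VI. The short script below recomputes them; the dictionary simply copies the table entries, and the helper name `compare` is illustrative.

```python
# (PSNR in dB, running time in s, parameters in millions), copied from Table VI
TABLE_VI = {
    "QRU2D": (38.63, 0.60, 0.29),
    "WQRU2D": (39.82, 1.16, 0.88),
    "C3D": (36.83, 0.56, 0.43),
    "WC3D": (40.00, 0.93, 1.72),
    "QRU3D": (40.23, 0.74, 0.86),
}


def compare(model, reference):
    """Return (PSNR gain in dB, time ratio, parameter ratio) of model vs. reference."""
    psnr_m, time_m, params_m = TABLE_VI[model]
    psnr_r, time_r, params_r = TABLE_VI[reference]
    return psnr_m - psnr_r, time_m / time_r, params_m / params_r


for reference in ("QRU2D", "WQRU2D", "C3D", "WC3D"):
    gain, time_ratio, param_ratio = compare("QRU3D", reference)
    print(f"QRU3D vs {reference}: {gain:+.2f} dB, "
          f"{time_ratio:.2f}x time, {param_ratio:.2f}x params")
```

For WC3D this gives a parameter ratio of 0.50 and a time ratio of about 0.80, matching the ∼50% and ∼80% figures above, while `compare("WC3D", "C3D")` returns a parameter ratio of 4.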

Direction of Network. Table VI also shows the results of different directional structures denoted by initials (e.g., U for unidirectional, etc.). Without considering the backward spectral dependency, the unidirectional architecture performs worst (40.07 dB). After eliminating the causal dependency, both the alternating directional and bidirectional architectures exceed the unidirectional one, and achieve similar performance (40.26 vs. 40.23). Nevertheless, the bidirectional version requires a much larger memory footprint than our alternating directional structure (and, per Table VI, twice the parameters and about 1.7× the running time), indicating that the alternating directional structure can be used as a lightweight alternative to the typical bidirectional one.
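The three directional structures can be contrasted with a toy sketch. It is a strong simplification and not the QRU3D implementation: each band is a single scalar, the gates are fixed numbers rather than outputs of 3D convolutions, the pooling recurrence is the one implied by Equation (4) with a zero initial state, and summing the two scans in the bidirectional variant is an assumption. All function names are illustrative.

```python
def f_pool(z, f, reverse=False):
    """h_j = f_j * h_prev + (1 - f_j) * z_j, scanned forward or backward over bands."""
    order = range(len(z) - 1, -1, -1) if reverse else range(len(z))
    h, prev = [0.0] * len(z), 0.0
    for j in order:
        prev = f[j] * prev + (1.0 - f[j]) * z[j]
        h[j] = prev
    return h


def unidirectional(z, gates):
    """Every layer scans forward, so band j never sees the bands after it."""
    h = z
    for f in gates:
        h = f_pool(h, f)
    return h


def alternating(z, gates):
    """Layers alternate forward and backward scans; one gate set per layer."""
    h = z
    for layer, f in enumerate(gates):
        h = f_pool(h, f, reverse=(layer % 2 == 1))
    return h


def bidirectional(z, gates_fwd, gates_bwd):
    """Each layer runs both scans and sums them; needs two gate sets per layer."""
    h = z
    for f_fwd, f_bwd in zip(gates_fwd, gates_bwd):
        fwd = f_pool(h, f_fwd)
        bwd = f_pool(h, f_bwd, reverse=True)
        h = [a + b for a, b in zip(fwd, bwd)]
    return h


def influenced_bands(network, z, band, delta=1.0):
    """Indices of output bands that change when input `band` is perturbed."""
    base = network(z)
    bumped = network([v + delta if k == band else v for k, v in enumerate(z)])
    return [k for k, (a, b) in enumerate(zip(base, bumped)) if abs(a - b) > 1e-12]


z = [1.0, 2.0, 3.0, 4.0, 5.0]       # one scalar per band, for brevity
gates = [[0.5] * 5, [0.5] * 5]      # two layers with fixed gates (normally learned)
print(influenced_bands(lambda x: unidirectional(x, gates), z, band=4))  # [4]
print(influenced_bands(lambda x: alternating(x, gates), z, band=4))     # [0, 1, 2, 3, 4]
```

In this toy setting the last band reaches only itself in the unidirectional stack, whereas two alternating layers already let it influence every band while using the same number of gate sets; the bidirectional variant achieves the same reach within one layer at the price of a second gate set.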

C. Network Hyperparameter Selection

Our principle of network hyper-parameter selection is to keep the network compact yet effective. Table VII shows the results of hyper-parameter selection on the Gaussian denoising task through a small grid search, where we select the depth and width of our QRNN3D considering the best trade-off between performance and computation overhead.

Nonetheless, we note that the major goal of this work is to introduce a novel building block, specially tailored to model HSIs.

TABLE VII: Network hyper-parameter selection on ICVL HSI Gaussian denoising (under noise level σ = 50) through a small grid search. We evaluate the results by PSNR (dB), running time (sec) and the number of parameters (Params) of these networks. The selected parameters are marked with *.

Depth  Width  PSNR (dB)  Time (s)  Params (#)
10     16     39.85      0.68      0.42M
12*    16*    40.23      0.74      0.86M
14     16     39.52      0.80      1.30M
12     12     39.82      0.62      0.48M
12*    16*    40.23      0.74      0.86M
12     20     40.01      1.18      1.34M

Such a building block can be naturally inserted into any network topology, not restricted to the encoder-decoder network used in this paper. We mainly show the effectiveness of our proposed building block and do not pursue higher performance via an exhaustive search of other configurations. We have demonstrated state-of-the-art performance of our QRNN3D without heavy engineering effort on network hyper-parameter selection. Our current hyper-parameter setting might not be optimal, and the performance could potentially be boosted by further tuning, though this is not a major focus of this paper.

D. Visualizing GCS Knowledge

To visualize the captured GCS knowledge in QRNN3D, we first unfold Equation (3) and obtain

$$h_j = \sum_{i=1}^{j} \Phi_j(z_i), \quad \forall i, j \in [1, B],\ i \le j, \tag{4}$$

where $\Phi_j(z_i) = f_j f_{j-1} \cdots f_{i+1} (1 - f_i) z_i$.
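The unfolding can be checked numerically. The sketch below assumes the pooling recurrence implied by Equation (4), namely $h_j = f_j h_{j-1} + (1 - f_j) z_j$ with $h_0 = 0$, uses one scalar per band and 0-based indices, and draws random values only for the purpose of the check; the function names are illustrative.

```python
import random


def f_pool(z, f):
    """Recurrent form: h_j = f_j * h_{j-1} + (1 - f_j) * z_j, starting from zero."""
    h, prev = [], 0.0
    for z_j, f_j in zip(z, f):
        prev = f_j * prev + (1.0 - f_j) * z_j
        h.append(prev)
    return h


def phi(j, i, z, f):
    """Contribution of z_i to h_j: f_j * ... * f_{i+1} * (1 - f_i) * z_i, for i <= j."""
    term = (1.0 - f[i]) * z[i]
    for k in range(i + 1, j + 1):
        term *= f[k]
    return term


def unfolded(z, f):
    """Unfolded form of Equation (4): h_j is the sum of all contributions with i <= j."""
    return [sum(phi(j, i, z, f) for i in range(j + 1)) for j in range(len(z))]


rng = random.Random(0)
num_bands = 31
z = [rng.uniform(-1.0, 1.0) for _ in range(num_bands)]
f = [rng.random() for _ in range(num_bands)]
gap = max(abs(a - b) for a, b in zip(f_pool(z, f), unfolded(z, f)))
print(gap)  # the two forms differ only by floating-point rounding
```

The product of forget gates in `phi` makes explicit how the influence of band i on band j decays with every gate between them, which is the quantity the GCS measure below is built on.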

We define $\mathrm{GCS}_{ij}$ as the degree of $z_i$'s contribution to $h_j$ under the Frobenius norm, i.e.,

$$\mathrm{GCS}_{ij} = \| \Phi_j(z_i) / h_j \|_F, \tag{5}$$

where $/$ denotes element-wise division. It also reflects the effect of band $i$ on band $j$. The captured GCS in each QRU3D layer can be calculated through a single inference pass by using Equation (5). To completely visualize the GCS (footnote 5), we choose the first bidirectional QRU3D layer for this analysis (footnote 6). Figure 13(a) exhibits the captured GCS of a randomly selected HSI, showing that the output of each band is highly affected by the whole spectrum. Figure 13(b) illustrates the number of relative bands for the output of each band. It can be seen that the 15th to 17th bands ($h_{15}$ to $h_{17}$) are deeply correlated with almost all bands ($Z$). Figure 13(c) summarizes these statistics over all testing images on ICVL. It shows that a randomly selected band is typically related to at least 15 bands (out of 31 in total), meaning that the GCS is effectively utilized by our model and that our method can also automatically determine the most relative bands across the global spectrum.
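A minimal sketch of this computation for a forward scan is given below. It is not the analysis code of this paper: bands are represented by small flattened feature maps filled with random positive values, the gates are random numbers rather than network outputs, $h_j$ is assumed to be non-zero everywhere, indices are 0-based, and the function names are illustrative. The threshold follows the definition of a relative band in the caption of Fig. 13.

```python
import math
import random


def gcs_forward(z, f):
    """GCS[i][j] = ||Phi_j(z_i) / h_j||_F for a forward scan (zero whenever i > j).

    z and f hold one flattened feature map per band; h_j is assumed non-zero.
    """
    num_bands, size = len(z), len(z[0])
    h, prev = [], [0.0] * size
    for j in range(num_bands):
        prev = [f[j][k] * prev[k] + (1.0 - f[j][k]) * z[j][k] for k in range(size)]
        h.append(prev)
    gcs = [[0.0] * num_bands for _ in range(num_bands)]
    for i in range(num_bands):
        phi = [(1.0 - f[i][k]) * z[i][k] for k in range(size)]      # Phi_i(z_i)
        for j in range(i, num_bands):
            if j > i:
                phi = [f[j][k] * phi[k] for k in range(size)]       # Phi_j from Phi_{j-1}
            gcs[i][j] = math.sqrt(sum((phi[k] / h[j][k]) ** 2 for k in range(size)))
    return gcs


def count_relative_bands(gcs, size, ratio=0.1):
    """For each output band j, count the bands i with GCS[i][j] >= ratio * ||1||_F."""
    threshold = ratio * math.sqrt(size)  # ||1||_F is the square root of the entry count
    bands = range(len(gcs))
    return [sum(1 for i in bands if gcs[i][j] >= threshold) for j in bands]


rng = random.Random(0)
num_bands, size = 31, 64
z = [[rng.uniform(0.1, 1.0) for _ in range(size)] for _ in range(num_bands)]
f = [[rng.uniform(0.6, 0.95) for _ in range(size)] for _ in range(num_bands)]
gcs = gcs_forward(z, f)
print(count_relative_bands(gcs, size))
```

Since every contribution is obtained from the previous one by a single multiplication with the next forget gate, the whole matrix is available from the gates and candidates of one inference pass, as stated above. A backward scan would fill the other triangle in the same way.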

VI. CONCLUSIONS

In this paper, we have proposed an alternating directional 3D quasi-recurrent neural network for hyperspectral image denoising. Our main contribution is the novel use of the 3D convolution subcomponent, the quasi-recurrent pooling function, and the alternating directional scheme for efficient spatio-spectral dependency modeling. We have applied our model to HSI denoising beyond the Gaussian case, especially in the very challenging real-world complex noise case, and achieved better performance and faster speed. We also show that our model pretrained on the ICVL dataset can be directly utilized to tackle remotely sensed images, which is infeasible for most existing DL approaches to HSI modeling.

In addition, the visualized results for global correlation along spectrum (GCS) in our 3D quasi-recurrent unit (QRU3D) provide further experimental evidence that the GCS is effectively exploited by our model. It is also worth investigating the proposed QRU3D in other image sequence modeling tasks in the future.

REFERENCES

[1] M. Ahissar and S. Hochstein. The reverse hierarchy theory of visual

perceptual learning. Trends in Cognitive Sciences, 8(10):457–464, 2004.

[2] N. Akhtar and A. Mian. Nonparametric, coupled, Bayesian, dictionary, and classifier learning for hyperspectral classification. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4038–4050, 2018.

[3] B. Arad and O. Ben-Shahar. Sparse recovery of hyperspectral signal

from natural rgb images. In European Conference on Computer Vision,

pages 19–34. Springer, 2016.

[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by

jointly learning to align and translate. International Conference on

Learning Representations (ICLR), 2015.

[5] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent

neural networks. International Conference on Learning Representations

(ICLR), 2017.

[6] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson. Ad-

vances in hyperspectral image classiﬁcation: Earth monitoring with sta-

tistical learning methods. IEEE Signal Processing Magazine, 31(1):45–

54, 2014.

[7] Y. Chang, L. Yan, H. Fang, S. Zhong, and W. Liao. Hsi-denet:

Hyperspectral image restoration via convolutional neural network. IEEE

Transactions on Geoscience and Remote Sensing, pages 1–16, 2018.

Footnote 5: In a forward (backward) QRU3D, the captured GCS is an upper (lower) triangular matrix.

Footnote 6: The body of QRNN3D is equipped with the alternating directional structure, while in the head and tail the bidirectional structure is employed to avoid directional bias.

[8] Y. Chang, L. Yan, H. Fang, S. Zhong, and Z. Zhang. Weighted low-

rank tensor recovery for hyperspectral image restoration. arXiv preprint

arXiv:1709.00192, 2017.

[9] Y. Chang, L. Yan, and S. Zhong. Hyper-laplacian regularized unidi-

rectional low-rank tensor recovery for multispectral image denoising.

In The IEEE Conference on Computer Vision and Pattern Recognition

(CVPR), pages 4260–4268, 2017.

[10] C. Chen, Z. Xiong, X. Tian, and F. Wu. Deep boosting for image

denoising. In The European Conference on Computer Vision (ECCV),

September 2018.

[11] Y. Chen, X. Cao, Q. Zhao, D. Meng, and Z. Xu. Denoising hyperspectral

image with non-iid noise structure. arXiv preprint arXiv:1702.00098,

2017.

[12] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,

H. Schwenk, and Y. Bengio. Learning phrase representations using

rnn encoder–decoder for statistical machine translation. In Proceedings

of the 2014 Conference on Empirical Methods in Natural Language

Processing (EMNLP), pages 1724–1734, 2014.

[13] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by

sparse 3-d transform-domain collaborative ﬁltering. IEEE Transactions

on Image Processing, 16(8):2080–2095, 2007.

[14] W. Dong, G. Li, G. Shi, X. Li, and Y. Ma. Low-rank tensor approx-

imation with laplacian scale mixture modeling for multiframe image

denoising. In Proceedings of the IEEE International Conference on

Computer Vision (ICCV), pages 442–449, 2015.

[15] W. Dong, H. Wang, F. Wu, G. ming Shi, and X. Li. Deep spatial-

spectral representation learning for hyperspectral image denoising. IEEE

Transactions on Computational Imaging, pages 1–1, 2019.

[16] Y. Fu, A. Lam, I. Sato, and Y. Sato. Adaptive spatial-spectral dictionary

learning for hyperspectral image restoration. International Journal of

Computer Vision (IJCV), 122(2):228–245, 2017.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectiﬁers:

Surpassing human-level performance on imagenet classiﬁcation. In The

IEEE International Conference on Computer Vision (ICCV), December

2015.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image

recognition. In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), pages 770–778, 2016.

[19] W. He, Q. Yao, C. Li, N. Yokoya, and Q. Zhao. Non-local meets

global: An integrated paradigm for hyperspectral denoising. In The

IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

June 2019.

[20] W. He, H. Zhang, L. Zhang, and H. Shen. Total-variation-regularized

low-rank matrix factorization for hyperspectral image restoration. IEEE

Transactions on Geoscience and Remote Sensing, 54(1):178–188, 2016.

[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural

Computation, 9(8):1735–1780, 1997.

[22] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional

networks for multi-frame super-resolution. In Advances in Neural

Information Processing Systems (NIPS), pages 235–243, 2015.

[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network

training by reducing internal covariate shift. In International Conference

on Machine Learning (ICML), pages 448–456, 2015.

[24] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks

for human action recognition. IEEE Transactions on Pattern Analysis

and Machine Intelligence (PAMI), 35(1):221–231, 2013.

[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.

arXiv preprint arXiv:1412.6980, 2014.

[26] S. Lefkimmiatis. Non-local color image denoising with convolutional

neural networks. In The IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), July 2017.

[27] T. Lillesand, R. W. Kiefer, and J. Chipman. Remote sensing and image

interpretation. John Wiley & Sons, 2014.

[28] M. Maggioni, V. Katkovnik, K. Egiazarian, and A. Foi. Nonlocal

transform-domain ﬁlter for volumetric data denoising and reconstruction.

IEEE Transactions on Image Processing, 22(1):119–133, 2013.

[29] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep

convolutional encoder-decoder networks with symmetric skip connec-

tions. In Advances in Neural Information Processing Systems (NIPS),

pages 2802–2810, 2016.

[30] Y. Peng, D. Meng, Z. Xu, C. Gao, Y. Yang, and B. Zhang. Decomposable

nonlocal tensor dictionary learning for multispectral image denoising. In

Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), pages 2949–2956, 2014.

[31] Z. Ping and R. Wang. Jointly learning the hybrid crf and mlr

model for simultaneous denoising and classiﬁcation of hyperspectral

imagery. IEEE Transactions on Neural Networks and Learning Systems,

25(7):1319–1334, 2014.

[32] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks.


IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[33] Y. Tai, J. Yang, X. Liu, and C. Xu. Memnet: A persistent memory

network for image restoration. In The IEEE International Conference

on Computer Vision (ICCV), Oct 2017.

[34] P. S. Thenkabail and J. G. Lyon. Hyperspectral remote sensing of

vegetation. CRC Press, 2016.

[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning

spatiotemporal features with 3d convolutional networks. In Proceedings

of the IEEE International Conference on Computer Vision (ICCV), pages

4489–4497, 2015.

[36] M. Uzair, A. Mahmood, and A. Mian. Hyperspectral face recognition

with spatiospectral information fusion and pls regression. IEEE Trans-

actions on Image Processing, 24(3):1127–1137, 2015.

[37] H. Van Nguyen, A. Banerjee, and R. Chellappa. Tracking via object

reﬂectance using a hyperspectral video camera. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition

Workshops (CVPRW), pages 44–51, 2010.

[38] Q. Wang, J. Lin, and Y. Yuan. Salient band selection for hyperspectral

image classiﬁcation via manifold ranking. IEEE Transactions on Neural

Networks and Learning Systems, 27(6):1279–1289, 2017.

[39] Y. Wang, J. Peng, Q. Zhao, Y. Leung, X.-L. Zhao, and D. Meng.

Hyperspectral image restoration via total variation regularized low-rank

tensor decomposition. IEEE Journal of Selected Topics in Applied Earth

Observations and Remote Sensing, 2017.

[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image

quality assessment: from error visibility to structural similarity. IEEE

Transactions on Image Processing, 13(4):600–612, 2004.

[41] K. Wei and Y. Fu. Low-rank Bayesian tensor factorization for hyperspectral image denoising. Neurocomputing, 331:412–423, 2019.

[42] Q. Xie, Q. Zhao, D. Meng, Z. Xu, S. Gu, W. Zuo, and L. Zhang. Mul-

tispectral images denoising by intrinsic tensor sparsity regularization.

In The IEEE Conference on Computer Vision and Pattern Recognition

(CVPR), pages 1692–1700, 2016.

[43] Y. Xie, Y. Qu, D. Tao, W. Wu, Q. Yuan, and W. Zhang. Hyperspectral

image restoration via iteratively regularized weighted schatten p-norm

minimization. IEEE Transactions on Geoscience and Remote Sensing,

54(8):4642–4659, 2016.

[44] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-

c. Woo. Convolutional lstm network: A machine learning approach for

precipitation nowcasting. In Advances in Neural Information Processing

Systems (NIPS), pages 802–810, 2015.

[45] S. Yang, Z. Feng, M. Wang, and K. Zhang. Self-paced learning-

based probability subspace projection for hyperspectral image classi-

ﬁcation. IEEE Transactions on Neural Networks and Learning Systems,

PP(99):1–6, 2018.

[46] Q. Yuan, Q. Zhang, J. Li, H. Shen, and L. Zhang. Hyperspectral image denoising employing a spatial-spectral deep residual convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing, 57(2):1205–1218, 2019.

[47] R. H. Yuhas, J. W. Boardman, and A. F. Goetz. Determination of semi-

arid landscape endmembers and seasonal trends using convex geometry

spectral unmixing techniques. In Summaries of the 4-th Annual JPL

Airborne Geoscience Workshop, 1993.

[48] H. Zhang, W. He, L. Zhang, H. Shen, and Q. Yuan. Hyperspectral

image restoration using low-rank matrix recovery. IEEE Transactions

on Geoscience and Remote Sensing, 52(8):4729–4743, 2014.

[49] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian

denoiser: Residual learning of deep cnn for image denoising. IEEE

Transactions on Image Processing, 2017.

[50] L. Zhang, W. Wei, Y. Zhang, C. Shen, A. van den Hengel, and Q. Shi.

Cluster sparsity ﬁeld for hyperspectral imagery denoising. In European

Conference on Computer Vision (ECCV), pages 631–647. Springer,

2016.

[51] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention

networks for image restoration. In International Conference on Learning

Representations, 2019.

[52] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense

network for image restoration. arXiv preprint arXiv:1812.10477, 2018.