MPformer: A Transformer-Based Model for Earthen
Ruins Climate Prediction
Guodong Xu, Hai Wang*, Shuo Ji, Yuhui Ma, and Yi Feng
Abstract:Earthen ruins contain rich historical value. Affected by wind speed, temperature, and other factors,
their survival conditions are not optimistic. Time series prediction provides more information for ruins protection.
This work includes two challenges: (1) The ruin is located in an open environment, causing complex nonlinear
temporal patterns. Furthermore, the usual wind speed monitoring requires the 10 meters observation height to
reduce the influence of terrain. However, in order to monitor wind speed around the ruin, we have to set 4.5
meters observation height according to the ruin, resulting in a non-periodic and oscillating temporal pattern of
wind speed; (2) The ruin is located in the arid and uninhabited region of northwest China, which results in
accelerating aging of equipment and difficulty in maintenance. It significantly amplifies the device error rate,
leading to duplication, missing, and outliers in datasets. To address these challenges, we designed a complete
preprocessing and a Transformer-based multi-channel patch model. Experimental results on four datasets that
we collected show that our model outperforms others. Ruins climate prediction model can timely and effectively
predict the abnormal state of the environment of the ruins. This provides effective data support and decision-
making for ruins conservation, and exploring the relationship between the environmental conditions and the
living state of the earthen ruins.
Keywords:earthen ruins protection; time series prediction; Transformer
1 Introduction
Earthen ruins are ruins with soil as the main building
material. As a carrier of civilization, earthen ruins
contain rich historical information and are an important
basis for studying the origin and development of
civilization. The Suoyang City earthen ruin we focus
on is located in Jiuquan City, Gansu Province, China.
Due to wind erosion, temperature, and humidity
changes, the walls of earthen ruins are worn, the
surface peels off, and the structure deforms. By
deploying sensors around the ruins, we can collect data such as wind speed, temperature, humidity, and moisture in real time and use these data to predict the future environmental development of the ruins. When the predicted data are abnormal, conservation teams can then take targeted protective measures in advance and avoid further damage. Therefore, it is necessary to establish an earthen
ruins climate prediction model.
Time series forecasting is widely used in fields such
as electricity consumption, stock prices, traffic flow,
weather forecasting, and disease spread. Due to the
importance of time series forecasting, many time series
models have been well developed. The classical
statistical method, autoregressive integrated moving
average model (ARIMA)[1], based on a Markov
process, establishes a recursive autoregressive
prediction model. However, autoregressive models are
Guodong Xu, Hai Wang, Shuo Ji, Yuhui Ma, and Yi Feng are with the School of Information Technology, Northwest University, Xi'an 710127, China. E-mail: 202221543@stumail.nwu.edu.cn; hwang@nwu.edu.cn; mayh@nwu.edu.cn.
* To whom correspondence should be addressed.
Manuscript received: 2023-09-13; revised: 2023-12-07; accepted: 2024-02-06
TSINGHUASCIENCEANDTECHNOLOGY
ISSN1007-0214
DOI:10.26599/TST.2024.9010035
©The author(s) 2024. The articles published in this open access journal are distributed under the terms of the
Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
difficult to deal with nonlinear and non-stationary time
series data, and model parameter settings require
domain expertise. With the explosive development of
deep neural networks, recurrent neural networks (RNNs)[2] have been specialized for handling time series data. However, because of the well-known vanishing and exploding gradient problems, such models are difficult to train; although RNN variants have been proposed, including long short-term memory (LSTM)[3] and the gated recurrent unit (GRU)[4], the problem has not been completely solved. DeepAR[5]
combines autoregressive methods and RNN to model
probability distributions. The attention-based RNN model[6] uses temporal attention to capture long-term dependencies; however, these recurrent models have limited parallelism and face prohibitive time costs on extremely long input sequences. Long- and
short-term time-series network (LSTNet)[7] proposes a
skip-connection convolutional neural network to
capture long-term and short-term patterns. Many works[8–11] based on temporal convolutional networks (TCNs) model temporal correlations, but they are limited by the receptive field of the convolution kernel. In practical applications, the model should predict as far ahead as possible while keeping the prediction error small; that is, it should handle the long sequence time-series forecasting (LSTF) problem[12]. Benefiting from
attention mechanisms, Transformer[13] shows great
potential in modeling long sequence data, and many
Transformer-based time series forecasting models[12,
14–18] have achieved great results on public datasets.
The first challenge in time series forecasting of
earthen ruins comes from the complex temporal
patterns of the datasets. First, the earthen ruin is not in a closed environment but in a completely open one, which leads to complex nonlinear patterns in the wind speed, temperature, humidity, and moisture datasets. Second, to reduce the impact of terrain, the wind speed monitoring height is usually set above 10 meters. However, to monitor the wind speed around the earthen ruin, we had to set a monitoring height of 4.5 meters according to the size of the ruin. The deployment of the wind speed monitoring device is shown in Fig. 1. At this height, the temporal pattern of the wind speed data is greatly affected by terrain, showing non-periodic oscillations. The Transformer-based model PatchTST[18] outperforms all previous models, including ARIMA[1], multilayer perceptron (MLP)[19], LSTM[3], and TCN[8], on public time series datasets, demonstrating the effectiveness of Transformer in the field of time series forecasting.
Therefore, following the work of PatchTST, we
propose the Transformer-based model MPformer to
tackle this challenge. The second challenge comes
from the dataset. The remote geographical location of
the earthen ruins makes maintenance of the devices
difficult, and the extremely harsh environment
accelerates device aging. These factors significantly
increase the probability of errors occurring during data
transmission. As a result, the datasets contain duplicated, missing, and outlying values. To address this challenge, we designed a comprehensive preprocessing workflow. We also designed a multi-channel patch method tailored to the characteristics of univariate data. Unlike the patch method in PatchTST, which focuses only on local sequence information, the multi-channel patch method aggregates time series information from local to global across channels. Our experiments show that the proposed model achieves the best performance on the four datasets. To address the permutation invariance of Transformer, some works[12, 20, 21] designed complex and subtle positional encodings to embed time and position information into the input, but a recent study[19] pointed out that these methods have not worked well. Our multi-channel patch method naturally contains the relative position information of the time series in different channels. Empirical experiments demonstrate that the multi-channel patch method can replace positional encoding and preserve sequence information better. The contributions of our work are
summarized as follows:
(1) We collect time series datasets of earthen ruins in
northwest China, including wind speed, temperature,
humidity, and moisture datasets.
(2) We design a complete pre-processing pipeline to tackle the problems of duplication, missingness, outliers, and floating collection time intervals in the datasets, and a Transformer-based multi-channel patch model, MPformer, to predict the climate of the ruins. Extensive experimental results show that our model is superior to other models.
Fig. 1 Deployment of wind speed monitoring device.
(3) In a comparative experiment with randomly shuffled inputs, MPformer's excellent performance reveals that our multi-channel patch method can effectively replace positional encoding and alleviate the widely criticized permutation invariance of Transformer.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the pre-processing pipeline, the multi-channel patch method, our proposed model MPformer, and a time complexity analysis. Section 4 provides detailed information about the experiments and a thorough analysis of the results. Section 5 gives the conclusion and future work directions.
2 Related work
2.1 Transformer-based time series forecasting
The superior performance of Transformer in the fields of natural language processing[22], computer vision[23], and speech processing[24, 25] has attracted the attention of researchers, and a large number of Transformer-based models have achieved state-of-the-art (SOTA) results in many fields. Benefiting from the self-attention mechanism, Transformer can automatically learn the relationships between elements of long sequences and has a strong ability to model sequence data with long-distance interdependencies; it is therefore used for long-term sequence prediction. However, limited by the attention mechanism, the time complexity of the vanilla Transformer is $O(L^2)$, which limits the model's ability to extract features from longer-range historical data and to predict further into the future. Some works mainly focus on the efficiency of self-attention. LogTrans[15] proposed the LogSparse attention mechanism, which exponentially increases the distance between two adjacent attended elements and reduces the time complexity to $O(L(\log L)^2)$. Reformer[26] proposed locality-sensitive hashing attention, which reduces the time complexity to $O(L\log L)$. Informer[12] proposed ProbSparse attention based on KL divergence, which also reduces the time complexity to $O(L\log L)$. The encoder-decoder architecture is widely used in sequence-to-sequence tasks such as time series forecasting. The decoder of the vanilla Transformer generates prediction results autoregressively, making inference as slow as an RNN model. Informer proposes a generative decoder that produces the prediction results in one step, saving a great deal of inference time and avoiding the error propagation caused by autoregression. Subsequent Transformer variants[14, 16] all follow the generative decoder.
Some works introduce traditional time series processing techniques. Autoformer[14] first introduced the idea of seasonal-trend decomposition and proposed an auto-correlation block to replace the vanilla self-attention, obtaining sub-sequence-level attention instead of point-by-point attention and reducing the complexity to $O(L\log L)$ via the fast Fourier transform. FEDformer[16] followed Autoformer's seasonal-trend decomposition idea, proposed a Fourier enhancement block and a wavelet enhancement block to replace vanilla self-attention, and achieved $O(L)$ time complexity through random selection.
A recent work[19] defeated the above Transformer variants with a very simple linear network and questioned whether Transformer is suitable for the field of time series forecasting. The latest work, PatchTST[18], argued that a single time step in a time series does not carry semantic information the way a word does in a sentence. Previous works only used point-by-point data or manually constructed sequence information to capture local and global features, which led to the models being outperformed by the linear network. PatchTST obtains enhanced local information and achieves SOTA by dividing the historical input into a series of patches.
2.2 Patch in Transformer-based models
Patches are typically used in fields where local information is critical, such as computer vision, natural language processing, and speech recognition. By dividing images, text, or audio into a series of fixed-length patches, the model can not only efficiently process a larger range of data but also capture fine-grained local features. In the field of computer vision, Vision Transformer[23] divides the image into 16×16 patches as the input of the model. In natural language processing (NLP), BERT[22] uses subwords instead of characters for tokenization, which is consistent with the idea of patching, and some subsequent works[27, 28] also use patches as input. PatchTST first introduced patches into the field of time series forecasting, segmenting the sequence data into a series of patches as model
GuodongXuetal.: MPformer: A Transformer-Based Model for Earthen Ruins Climate Prediction
input to achieve SOTA.
3 Methodology
Since the four datasets contain only univariate data, we consider the following problem: given a univariate time series with a lookback window of length $L$, $I = (x_1, x_2, \ldots, x_L)$, where each $x_t$ is a scalar at time step $t$, we want to predict the future values over a horizon of length $T$.
3.1 Pre-processing
Limited by hardware, the sensor has a bit error rate,
and factors such as extreme environments and
equipment aging significantly magnify the probability
of errors in the data transmission. The specific
problems in the datasets include duplicated values, missing values, outliers, and fluctuations in the collection time interval. To address these issues, we designed the following pre-processing steps:
(1) Duplicate data processing. A number of factors can cause a sensor to send multiple copies of exactly the same data because it mistakenly believes that the receiver did not receive them. We sort the data by collection time and remove records with exactly the same collection timestamp.
(2) Outlier data processing. In some cases the transmitted data suffer bit flips; the receiver mistakenly accepts them as correct, but the values are far beyond the realistic range or do not conform to common sense. We use the boxplot, a fast and general graphical method for identifying outliers[29], and treat the detected outliers as missing values.
(3) Missing data processing. Whether because abnormal data are treated as missing values or because some factor prevents the sensor from collecting data, there are many missing values in the actual datasets; these missing values tend to occur independently rather than in large continuous blocks. References [30, 31] reported a comparison of missing value filling methods, such as linear interpolation, quadratic interpolation, cubic interpolation, and average substitution, on air pollution monitoring datasets. We use linear interpolation, the best-performing method, to fill missing values.
(4) Collection time interval floating processing. The model usually requires the input to be a time series with a fixed sampling interval, but a fixed interval is difficult to achieve in the actual datasets. We increase the data sampling frequency to obtain more detailed signals and estimate the original data using a finer-grained time series. We employ the simple moving average method, which is objective, reliable, and simple to implement[32], and can effectively reduce noise in the data. Specifically, hourly wind speed collection requires increasing the sampling frequency to every 10 minutes, calculating the average value within each hour through a sliding average window, and treating it as an estimate of the actual observation. A consolidated sketch of these four steps is given below.
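As a reference, here is a minimal pandas sketch of the four steps under stated assumptions: the column names ('time', 'wind_speed'), the 1.5×IQR boxplot rule, and the 10-minute upsampling grid are illustrative choices, not necessarily the exact configuration used for our datasets.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the four pre-processing steps for an hourly univariate series.

    Assumes a DataFrame with a 'time' column and a 'wind_speed' column;
    column names and thresholds are illustrative.
    """
    df = df.copy()
    df["time"] = pd.to_datetime(df["time"])

    # (1) Duplicates: sort by collection time, drop records with identical timestamps.
    df = df.sort_values("time").drop_duplicates(subset="time", keep="first")
    df = df.set_index("time")

    # (2) Outliers: boxplot (1.5 x IQR) rule; flagged values become missing.
    q1, q3 = df["wind_speed"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier = (df["wind_speed"] < q1 - 1.5 * iqr) | (df["wind_speed"] > q3 + 1.5 * iqr)
    df.loc[outlier, "wind_speed"] = None

    # (3) Missing values: linear interpolation.
    df["wind_speed"] = df["wind_speed"].interpolate(method="linear")

    # (4) Floating collection intervals: upsample to a 10-minute grid, then take a
    #     one-hour simple moving average as the estimate of the hourly observation.
    fine = df["wind_speed"].resample("10min").mean().interpolate(method="linear")
    hourly = fine.rolling(window=6).mean().resample("1h").last()
    return hourly.to_frame(name="wind_speed").reset_index()
```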
3.2 Multi-channel patch
Assume an input $x \in \mathbb{R}^{D \times L}$, where $D$ represents the data dimension ($D = 1$ here) and $L$ represents the input length. Let $P_l$ represent the patch length and $P_s$ represent the patch stride, i.e., the distance between two adjacent patches. The multi-channel patch method divides the data into patches across multiple channels. Patches within the same channel may or may not overlap. We gradually increase the interval between adjacent data points in different channels so that the patches contain sequence information from local to global. The multi-channel patch pseudocode is provided in Algorithm 1. Note that we implement the Unfold function based on torch.nn.Unfold; since its pseudocode description is rather terse, we recommend that readers consult the relevant documentation to understand this process more clearly and accurately (a runnable sketch is also given at the end of this subsection). When the number of patch channels equals 1, the multi-channel patch method degenerates into the method adopted by PatchTST. The difference between the patch methods used by PatchTST and MPformer is shown in Fig. 2.
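For reference, the following is a minimal PyTorch sketch of the multi-channel patch operation built on the functional form of torch.nn.Unfold. The replication padding used to keep the patch count identical across dilations is our assumption; the Padding step in Algorithm 1 is not spelled out in the paper.

```python
import torch
import torch.nn.functional as F

def multi_channel_patch(x: torch.Tensor, patch_len: int, stride: int,
                        n_channels: int) -> torch.Tensor:
    """Split a univariate series x of shape (B, 1, L) into multi-channel patches.

    Channel i uses dilation i, so its patches span an increasingly wide (more
    global) time range. Returns a tensor of shape (B, C, Pn, Pl). The padding
    scheme (replicating the last value so every dilation yields the same number
    of patches) is an assumption, not necessarily the authors' exact choice.
    """
    B, D, L = x.shape
    patch_num = (L - patch_len) // stride + 1          # Pn for the non-dilated case
    channels = []
    for i in range(1, n_channels + 1):
        pad = (i - 1) * (patch_len - 1)                # extra length so dilation i still gives Pn patches
        xi = F.pad(x, (0, pad), mode="replicate")      # (B, D, L + pad)
        xi = xi.unsqueeze(1)                           # (B, 1, D, L + pad), 4-D input for unfold
        patches = F.unfold(xi, kernel_size=(1, patch_len),
                           dilation=(1, i), stride=(1, stride))   # (B, D*Pl, Pn)
        channels.append(patches.permute(0, 2, 1))      # (B, Pn, Pl) for D = 1
    return torch.stack(channels, dim=1)                # (B, C, Pn, Pl)

# Example with the sizes used later in the paper: L = 48, Pl = 24, Ps = 12.
x = torch.randn(32, 1, 48)
print(multi_channel_patch(x, patch_len=24, stride=12, n_channels=2).shape)  # (32, 2, 3, 24)
```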
3.3 Model structure
The MPformer model architecture is shown in Fig. 3. Following PatchTST, we use a vanilla Transformer encoder as the core architecture of MPformer. MPformer first divides the input data $X \in \mathbb{R}^{B \times D \times L}$ into multi-channel patches $X_p \in \mathbb{R}^{B \times C \times P_n \times P_l}$:
$$X_p = \mathrm{MultiChannelPatch}(X) \tag{1}$$
where $C = \lfloor L / (P_l - 1) \rfloor$ represents the number of channels and $P_n = \lfloor (L - P_l + P_s) / P_s \rfloor$ represents the number of patches. Then the patches are mapped to a latent space of dimension $d_h$ through a learnable linear projection $W_p \in \mathbb{R}^{P_l \times d_h}$:
$$x = X_p W_p \tag{2}$$
where $x \in \mathbb{R}^{B \times C \times P_n \times d_h}$, and each channel $x_{\mathrm{en}} \in \mathbb{R}^{B \times P_n \times d_h}$ is then fed into the encoder. Each head $h = 1, 2, \ldots, H$ of the multi-head attention transforms it into query, key, and value matrices:
$$Q_h = x_{\mathrm{en}} W_h^{Q} \tag{3}$$
$$K_h = x_{\mathrm{en}} W_h^{K} \tag{4}$$
$$V_h = x_{\mathrm{en}} W_h^{V} \tag{5}$$
where $W_h^{Q}, W_h^{K} \in \mathbb{R}^{d_h \times d_k}$ and $W_h^{V} \in \mathbb{R}^{d_h \times d_v}$. After that, scaled dot-product attention is used to obtain the attention output $O_h \in \mathbb{R}^{B \times P_n \times d_v}$:
$$O_h = \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\mathsf{T}}}{\sqrt{d_k}}\right) V_h \tag{6}$$
The outputs of all heads are concatenated and passed through a linear mapping $W \in \mathbb{R}^{H d_v \times d_h}$ to obtain the output $O \in \mathbb{R}^{B \times P_n \times d_h}$:
$$O = \mathrm{Concat}(O_1, O_2, \ldots, O_H)\, W \tag{7}$$
The encoder also includes a batch normalization layer, a feed-forward network, and residual connections. Finally, the encoder output is passed through a learnable linear projection to obtain the prediction result $Y_i = (y_{L+1}, y_{L+2}, \ldots, y_{L+T})$, where $Y \in \mathbb{R}^{B \times T}$.
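A minimal PyTorch sketch of this forward pass is given below. It relies on PyTorch's built-in encoder layer, which applies LayerNorm rather than the batch normalization mentioned above, and the hyperparameters (d_model = 128, 8 heads, 2 layers) are illustrative assumptions rather than the authors' exact configuration. The input is the (B, C, Pn, Pl) tensor produced by a multi-channel patch routine such as the sketch in Section 3.2.

```python
import torch
import torch.nn as nn

class MPformerSketch(nn.Module):
    """Illustrative sketch of the MPformer forward pass described above."""

    def __init__(self, patch_len: int, patch_num: int, n_channels: int, pred_len: int,
                 d_model: int = 128, n_heads: int = 8, n_layers: int = 2,
                 d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        self.embed = nn.Linear(patch_len, d_model)          # W_p: patch -> latent space, Eq. (2)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff, dropout=dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # flatten channels x patches x d_model and project to the prediction horizon T
        self.head = nn.Linear(n_channels * patch_num * d_model, pred_len)

    def forward(self, x_patch: torch.Tensor) -> torch.Tensor:
        B, C, Pn, Pl = x_patch.shape
        z = self.embed(x_patch)                             # (B, C, Pn, d_model)
        z = z.reshape(B * C, Pn, -1)                        # each channel is a separate encoder input
        z = self.encoder(z)                                 # multi-head attention, FFN, residuals
        z = z.reshape(B, -1)                                # (B, C * Pn * d_model)
        return self.head(z)                                 # (B, pred_len)

if __name__ == "__main__":
    dummy_patches = torch.randn(32, 2, 3, 24)               # (B, C, Pn, Pl)
    model = MPformerSketch(patch_len=24, patch_num=3, n_channels=2, pred_len=48)
    print(model(dummy_patches).shape)                       # torch.Size([32, 48])
```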
3.4 Time complexity analysis
The time complexity of the vanilla Transformer is $O(N^2)$, where $N$ is the number of input tokens, namely the lookback sequence length $L$. With the multi-channel patch, the number of input tokens is reduced from $L$ to $L/P_s$. In addition, we construct $C$ channels to incorporate more sequence information, where $C = L/P_l$ and $P_l = \alpha L$ $(0 < \alpha < 1)$. Ignoring the constant $\alpha$, the time complexity of MPformer is $O(L^2/P_s^2)$. Figure 4 shows the average per-epoch time overhead of the training phase of the Transformer variants under different lookback sequence lengths on the temperature and wind speed datasets.
Fig. 2 Patch method comparison between PatchTST and MPformer. For a clear and concise illustration, we set a small patch length $P_l = 3$ and patch stride $P_s = 2$. (a) PatchTST integrates local information within a single patch. (b) The multi-channel patch in MPformer gradually increases the intervals between elements within different channels to obtain time series information from local to global.
Algorithm 1 Multi-channel patch
Input: past time series $X \in \mathbb{R}^{B \times D \times L}$; patch length $P_l > 1$; patch stride $P_s > 0$.
Output: multi-channel patches $X_p$.
1: $B, D, L = X.\mathrm{shape}$
2: $C = \lfloor L / (P_l - 1) \rfloor$                     // channel number
3: $P_n = \lfloor (L - P_l + P_s) / P_s \rfloor$           // patch number
4: $X_p = \mathrm{list}()$
5: for $i = 1$ to $C$ do
6:    $x = \mathrm{Padding}(X)$, $x \in \mathbb{R}^{B \times D \times L}$
7:    $x = \mathrm{Unsqueeze}(x, \mathrm{dim} = 1)$, $x \in \mathbb{R}^{B \times 1 \times D \times L}$
8:    $x = \mathrm{Unfold}(x, \mathrm{kernel\_size} = (1, P_l), \mathrm{dilation} = i, \mathrm{stride} = P_s)$   // patching
9:    $x = x.\mathrm{permute}(0, 2, 1)$, $x \in \mathbb{R}^{B \times P_n \times P_l}$
10:   $X_p.\mathrm{append}(x)$
11: end for
12: $X_p = \mathrm{Stack}(X_p, \mathrm{dim} = 1)$, $X_p \in \mathbb{R}^{B \times C \times P_n \times P_l}$
13: return $X_p$                                           // multi-channel patch result
Fig.3MPformer architecture.
GuodongXuetal.: MPformer: A Transformer-Based Model for Earthen Ruins Climate Prediction
The prediction length $T$ is set to 48, the patch length $P_l$ is set to 24, and the patch stride $P_s$ is set to 12, 18, and 24, respectively, corresponding to MPformer-PsXX. We do not present FEDformer because its excessive time consumption prevents a meaningful comparison with the other models, which we believe is related to its algorithm implementation and constant factors. Interestingly, the time consumption of the other variants is higher than that of the vanilla Transformer, which is consistent with the results reported for DLinear[19], and their growth rate of time overhead is comparable to that of the vanilla Transformer. The excellent performance of PatchTST is consistent with its paper. MPformer constructs multiple channels on top of PatchTST and pays extra overhead for this. Finally, the performance of the three MPformer models improves as the patch stride $P_s$ increases when facing extremely long lookback sequences. This demonstrates that the time complexity of the model decreases with the square of the patch stride, which implies that the model can see longer historical time series.
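For reference, the argument above can be written in one line, taking the per-channel token count $L/P_s$ and the channel count $C = L/P_l = 1/\alpha$ directly from the text:

```latex
\underbrace{C}_{\text{channels}} \cdot
O\!\left(\left(\frac{L}{P_s}\right)^{2}\right)
= \frac{1}{\alpha}\, O\!\left(\frac{L^{2}}{P_s^{2}}\right)
= O\!\left(\frac{L^{2}}{P_s^{2}}\right),
\qquad C = \frac{L}{P_l},\quad P_l = \alpha L,\quad 0 < \alpha < 1 .
```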
4 Experiment
4.1 Experimental setup
We evaluate MPformer on four datasets collected from
the Suoyang City earthen ruins. (1) The wind speed dataset contains the hourly average near-ground wind speed from October 20, 2021 to March 30, 2023. (2) The temperature dataset records the hourly air temperature from October 20, 2021 to March 30, 2023. (3) The moisture dataset includes the hourly soil moisture from September 6, 2021 to October 23, 2022. (4) The humidity dataset contains the hourly humidity at 15 cm above the ground from June 29, 2021 to May 23, 2023. The statistical characteristics of the datasets are summarized in Table 1. We split all datasets into training, validation, and test sets in chronological order with a ratio of 7:1:2.
We selected 6baseline models, including
Transformer-based models that have achieved SOTA,
such as Transformer[13], Informer[12], Autoformer[14],
FEDformer[16] and PatchTST[18], and the linear model
NLinear[19].
All models use the L2 loss and the Adam optimizer; the initial learning rate is set to $10^{-4}$, the batch size is set to 32, and the training phase is terminated after 3 consecutive triggers of early stopping. MPformer includes 2 encoder layers. Each experiment was repeated three times on an RTX 3090 24 GB GPU.
Differentfromthemetricofpreviousworks[12, 14, 16, 18, 19],
we used the root mean square error (RMSE) instead of
the mean square error (MSE), because the root mean
square error can better reflect the real error, and
continue to use the mean absolute error (MAE) as the
metric. All results are based on the normalized data.
RMSE =v
t1
N
n
i=1
(yiy
i)2(8)
MAE =1
N
n
i=1|yiy
i|(9)
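For reference, Eqs. (8) and (9) translate directly into code; the function names below are ours.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (8): square root of the mean squared error on normalized data
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (9): mean absolute error on normalized data
    return float(np.mean(np.abs(y_true - y_pred)))
```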
4.2 Discussion
Unlike previous works[12, 14, 16, 18, 19], which used extremely long prediction lengths, we note that ruins workers will
Fig.4Comparison of time overhead changes of Transformer variant models.
Table 1Statistical characteristics of the earthen ruins
datasets.
Dataset WindSpeed Temperature Moisture Humidity
Dimension 1 1 1 1
Timestep 12618 12618 9879 16642
take timely measures to eliminate potential risks after finding abnormal prediction results. For example, workers will immediately cool down the area around the earthen ruins when the model predicts an excessively high temperature. In short, extremely long prediction ranges may not have practical significance because of human intervention. Therefore, we choose moderate prediction lengths of 1, 12, 24, and 48, with a fixed input length of 48.
We list the results on the four datasets in Table 2. The results show that our model MPformer achieves the best results. Specifically, under the input-48-predict-48 setting, MPformer gives a 5.93% (0.810 → 0.762) MAE reduction on the wind speed dataset compared with the Transformer-based SOTA model PatchTST, a 10.88% (0.855 → 0.762) MAE reduction compared with the linear model NLinear, a 4.27% (0.796 → 0.762) MAE reduction compared with FEDformer, a 7.07% (0.820 → 0.762) MAE reduction compared with Autoformer, and a 2.56% (0.782 → 0.762) MAE reduction compared with Informer. The performance on the temperature, humidity, and moisture datasets is consistent with the wind speed dataset. MPformer performs better on all datasets, while Transformer and NLinear perform poorly on the wind speed dataset, which lacks obvious periodicity.
A previous work[19] found that the prediction
accuracy of the Transformer-based model decreased
slightly or even improved after shuffling the input,
questioning the ability of the positional encoding in the
Transformer-based model to preserve sequence
information. We believe that the multi-channel patch
method naturally includes the relative position
information of sequence data from local to global in
different channels, which can effectively retain
sequence information and alleviate the permutation
invariance of Transformer. To verify this idea, we compared the percentage drop in the prediction accuracy of Transformer-based models after shuffling the input. The experimental results are listed in Table 3.
FEDformer and Autoformer introduce a time series
decomposition architecture, and the performance drops
more on the wind speed dataset, which is more
complex and has non-obvious periodicity, but less on
other datasets. The prediction results of Informer and Transformer decrease slightly or even improve in most cases, which is consistent with previous work[19]. Our proposed model shows the largest performance drop on the temperature, humidity, and moisture datasets, and also shows a significant drop on the wind speed dataset. Therefore, the multi-channel patch method can
replace positional encoding and preserve sequence
information more effectively.
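The shuffling test is straightforward to reproduce; the sketch below shows one way to do it (applying a single random permutation to the time dimension of the lookback window is our reading of the protocol in Ref. [19]).

```python
import torch

def shuffle_lookback(x: torch.Tensor) -> torch.Tensor:
    """Randomly permute the time dimension of a lookback window batch x of shape (B, D, L)."""
    perm = torch.randperm(x.shape[-1])
    return x[..., perm]

# Evaluation idea: compare MAE(model(x), y) with MAE(model(shuffle_lookback(x)), y);
# a large increase after shuffling means the model actually exploits temporal ordering.
```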
5 Conclusion
Earthen ruins are important witnesses of human
Table 2Prediction results on the four datasets. We fix the input length to 48and the prediction length to {1, 12, 24, 48}. A
lower RMSE or MAE indicates a better prediction. The best results are in bold and the second best are underlined.
Models Metric MPformer PatchTST NLinear FEDformer Autoformer Informer Transformer
RMSE MAE RMSE MAE RMSE MAE RMSE MAE RMSE MAE RMSE MAE RMSE MAE
WindSpeed
10.730 0.571 0.754 0.595 0.800 0.626 0.748 0.588 0.756 0.594 0.734 0.577 0.739 0.587
12 0.932 0.743 0.961 0.768 1.021 0.809 0.951 0.765 0.974 0.783 0.929 0.748 0.952 0.777
24 0.943 0.754 0.985 0.791 1.029 0.817 0.962 0.771 0.971 0.781 0.960 0.781 0.992 0.810
48 0.957 0.762 1.015 0.810 1.075 0.855 1.001 0.796 1.027 0.820 0.976 0.782 0.974 0.788
Temperature
10.245 0.157 0.253 0.170 0.295 0.214 0.269 0.188 0.284 0.200 0.259 0.182 0.242 0.159
12 0.366 0.252 0.383 0.263 0.480 0.357 0.387 0.281 0.394 0.289 0.406 0.301 0.391 0.265
24 0.379 0.260 0.406 0.273 0.491 0.367 0.392 0.281 0.451 0.332 0.415 0.305 0.413 0.290
48 0.398 0.279 0.418 0.288 0.491 0.366 0.407 0.295 0.450 0.333 0.442 0.323 0.410 0.290
Moisture
10.136 0.099 0.145 0.103 0.195 0.151 0.178 0.130 0.195 0.141 0.165 0.117 0.148 0.108
12 0.325 0.212 0.329 0.223 0.468 0.346 0.373 0.263 0.400 0.284 0.357 0.248 0.341 0.230
24 0.350 0.242 0.365 0.258 0.496 0.377 0.409 0.294 0.449 0.322 0.377 0.268 0.372 0.266
48 0.400 0.280 0.414 0.290 0.524 0.388 0.445 0.321 0.497 0.373 0.448 0.324 0.453 0.333
Humidity
10.446 0.345 0.455 0.351 0.503 0.388 0.457 0.356 0.472 0.370 0.447 0.347 0.449 0.349
12 0.499 0.386 0.509 0.396 0.558 0.430 0.508 0.394 0.575 0.449 0.507 0.392 0.511 0.395
24 0.509 0.391 0.516 0.399 0.575 0.442 0.517 0.399 0.555 0.428 0.524 0.407 0.517 0.395
48 0.529 0.408 0.531 0.412 0.589 0.455 0.534 0.409 0.600 0.465 0.542 0.421 0.537 0.417
GuodongXuetal.: MPformer: A Transformer-Based Model for Earthen Ruins Climate Prediction
existence and have extremely high historical and
cultural value. Environmental prediction for earthen ruins provides more information for their protection. Therefore, it is necessary to establish a climate prediction model for earthen ruins so that abnormal conditions can be handled in advance. For timely and effective prediction of the ruins climate, we propose a complete preprocessing pipeline and design a multi-channel patch
Transformer-based model, named MPformer, which
aggregates sequence information from local to global
in different channels. Experiments show that MPformer
not only achieves the best performance on the four
datasets, but also effectively alleviates the widely
criticized permutation invariance of Transformer.
The multi-channel patch method designed for
univariate data in this paper is a simple but effective
method. It would be an important future step to explore
similar methods for multivariate data.
Acknowledgments
This work was financially supported in part by the
National Key Research and Development Program of
China (No. 2019YFC1520904).
References
G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung,
Time Series Analysis: Forecasting and Control. Hoboken,
NJ, USA: John Wiley & Sons, p. 712, 2015.
[1]
H. Hewamalage, C. Bergmeir, and K. Bandara, Recurrent
neural networks for time series forecasting, arXiv preprint
arXiv: 1901.00069, 2019.
[2]
S. Hochreiter and J. Schmidhuber, Long short-term
Table 3Transformer-based model predict result percentage drop after shuffling the input. The input length is set to 48and
the prediction length is set to {1,12, 24, 48}. Ori. represents the original MAE result. Shuf. Represents the MAE result after
shuffling the input. A larger drop indicates a higher ability to retain sequence information. The best results are in bold.
Model Metric
MPformer PatchTST NLinear FEDformer Autoformer Informer Transformer
Ori.
Shuf. Drop Ori.
Shuf. Drop Ori.
Shuf. Drop Ori.
Shuf. Drop Ori.
Shuf. Drop Ori.
Shuf. Drop Ori.
Shuf. Drop
WindSpeed
10.575
0.745
8.87%
0.595
0.747
5.80%
0.626
0.843
11.64%
0.588
0.778
9.31%
0.594
0.748
7.48%
0.577
0.578
−0.37
%
0.587
0.577
−1.66
%
12 0.743
0.776
0.768
0.773
0.809
0.843
0.765
0.781
0.783
0.796
0.748
0.774
0.777
0.772
24 0.754
0.767
0.791
0.780
0.817
0.857
0.771
0.784
0.781
0.803
0.781
0.765
0.810
0.764
48 0.776
0.774
0.810
0.797
0.855
0.879
0.796
0.805
0.820
0.816
0.782
0.758
0.788
0.799
Temperature
10.170
0.517
130.30
%
0.157
0.500
123.88
%
0.214
0.554
82.92%
0.188
0.264
18.53
%
0.200
0.330
28.05
%
0.182
0.189
23.96%
0.159
0.162
9.34%
12 0.252
0.520
0.263
0.519
0.357
0.560
0.281
0.315
0.289
0.322
0.301
0.375
0.265
0.295
24 0.260
0.533
0.273
0.519
0.367
0.590
0.281
0.316
0.332
0.394
0.305
0.402
0.290
0.320
48 0.279
0.574
0.288
0.546
0.366
0.568
0.295
0.322
0.333
0.390
0.323
0.438
0.290
0.330
Moisture
10.103
0.528
196.13
%
0.099
0.508
188.11
%
0.151
0.618
122.00
%
0.130
0.351
65.66
%
0.141
0.368
59.77
%
0.117
0.134
34.91%
0.108
0.110
0.64%
12 0.212
0.541
0.223
0.536
0.346
0.552
0.263
0.370
0.284
0.416
0.248
0.343
0.230
0.233
24 0.242
0.521
0.258
0.540
0.377
0.596
0.294
0.379
0.322
0.391
0.268
0.384
0.266
0.258
48 0.280
0.564
0.290
0.550
0.388
0.625
0.321
0.395
0.373
0.411
0.324
0.465
0.333
0.341
Humidity
10.347
0.453
24.33%
0.351
0.461
20.40%
0.388
0.504
19.12%
0.356
0.443
17.85
%
0.370
0.514
12.59
%
0.345
0.361
10.17%
0.349
0.335
−0.33
%
12 0.384
0.465
0.396
0.466
0.430
0.493
0.394
0.432
0.449
0.476
0.392
0.447
0.395
0.397
24 0.391
0.469
0.399
0.467
0.442
0.517
0.399
0.467
0.428
0.466
0.407
0.455
0.395
0.397
48 0.408
0.513
0.412
0.476
0.455
0.523
0.409
0.492
0.465
0.449
0.421
0.464
0.417
0.424
memory, Neural Comput., vol. 9, no. 8, pp. 1735–1780,
1997.
[3]
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical
evaluation of gated recurrent neural networks on sequence
modeling, arXiv preprint arXiv: 1412.3555, 2014.
[4]
D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski,
DeepAR: Probabilistic forecasting with autoregressive
recurrent networks, Int. J. Forecast., vol. 36, no. 3, pp.
1181–1191, 2020.
[5]
Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G.
Cottrell, A dual-stage attention-based recurrent neural
network for time series prediction, arXiv preprint arXiv:
1704.02971, 2017.
[6]
G. Lai, W. C. Chang, Y. Yang, and H. Liu, Modeling
long- and short-term temporal patterns with deep neural
networks, in Proc. 41st Int. ACM SIGIR Conf. Research &
Development in Information Retrieval, Ann Arbor, MI,
USA, 2018, pp. 95–104.
[7]
S. Bai, J. Z. Kolter, and V. Koltun, An empirical
evaluation of generic convolutional and recurrent
networks for sequence modeling, arXiv preprint arXiv:
1803.01271, 2018.
[8]
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O.
Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K.
Kavukcuoglu, Wavenet: A generative model for raw
audio, arXiv preprint arXiv: 1609.03499, 2016.
[9]
A. Borovykh, S. Bohte, and C. W. Oosterlee, Conditional
time series forecasting with convolutional neural
networks, arXiv preprint arXiv: 1703.04691, 2017.
[10]
R. Sen, H. F. Yu, and I. S. Dhillon, Think globally, act
locally: A deep neural network approach to high-
dimensional time series forecasting, in Proc. 33rd Int.
Conf. on Neural Information Processing Systems,
Vancouver, Canada, no. 435, pp. 4837–4846, 2019.
[11]
H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and
W. Zhang, Informer: Beyond efficient transformer for long
sequence time-series forecasting, in Proc. AAAI Conf. on
Artificial Intelligence, pp. 11106–11115, 2021.
[12]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all
you need, in Proc. 31st Int. Conf. Neural Information
Processing Systems, Long Beach, CA, USA, 2017, pp.
6000–6010.
[13]
H. Wu, J. Xu, J. Wang, and M. Long, Autoformer:
Decomposition transformers with autocorrelation for long-
term series forecasting, arXiv preprint arXiv: 2106.13008,
2021.
[14]
S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. X. Wang,
and X. Yan, Enhancing the locality and breaking the
memory bottleneck of transformer on time series
forecasting, in Proc. 33rd Int. Conf. on Neural
Information Processing Systems, Vancouver, Canada, pp.
5243–5253, 2019.
[15]
T. Zhou, Z. Q. Ma, Q. S. Wen, X. Wang, L. Sun, and R.
Jin, Fedformer: Frequency enhanced decomposed
transformer for long-term series forecasting, in Proc. Int.
Conf. on Machine Learning, pp. 27268–27286, 2022.
[16]
S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S.
Dustdar, Pyraformer: low-complexity pyramidal attention
for long-range time series modeling and forecasting,
https://openreview.net/forum?id=0EXmFzUn5I, 2022.
[17]
Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, A
time series is worth 64words: Long-term forecasting with
transformers, arXiv preprint arXiv: 2211.14730, 2022.
[18]
A. Zeng, M. Chen, L. Zhang, and Q. Xu, Are transformers
effective for time series forecasting? arXiv preprint arXiv:
2205.13504, 2022.
[19]
G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and
C. Eickhoff, A transformer-based framework for
multivariate time series representation learning, in Proc.
27th ACM SIGKDD Conf. Knowledge Discovery & Data
Mining, Virtual Event, Singapore, 2021, pp. 2114–2124.
[20]
B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister, Temporal
fusion transformers for interpretable multi-horizon time
series forecasting, Int. J. Forecast., vol. 37, no. 4, pp.
1748–1764, 2021.
[21]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, Bert:
Pre-training of deep bidirectional transformers for
language understanding, arXiv preprint arXiv:
1810.04805, 2018.
[22]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G.
Heigold, S. Gelly, et al., An image is worth 16x16 words:
Transformers for image recognition at scale, arXiv
preprint arXiv: 2010.11929, 2020.
[23]
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, wav2vec
2.0: A framework for self-supervised learning of speech
representations, in Proc. 34th Int. Conf. on Neural
Information Processing Systems, Red Hook, NY, USA,
pp. 12449–12460, 2020.
[24]
W. N. Hsu, B. Bolte, Y. H. H. Tsai, K. Lakhotia, R.
Salakhutdinov, and A. Mohamed, HuBERT: Self-
supervised speech representation learning by masked
prediction of hidden units, IEEE/ACM Trans. Audio
Speech Lang. Process., vol. 29, pp. 3451–3460, 2021.
[25]
N. Kitaev, L. Kaiser, and A. Levskaya, Reformer: The
efficient transformer, arXiv preprint arXiv: 2001.04451,
2020.
[26]
H. Bao, L. Dong, S. Piao, and F. Wei, Beit: Bert pre-
training of image transformers, arXiv preprint arXiv:
2106.08254, 2021.
[27]
K. He, X. Chen, S. Xie, Y. Li, P. Dollar, and R. Girshick,
Masked autoencoders are scalable vision learners, in Proc.
IEEE/CVF Conf. Computer Vision and Pattern
Recognition (CVPR), New Orleans, LA, USA, 2022, pp.
16000–16009.
[28]
F. N. David and J. W. Tukey, Exploratory data analysis,
Biometrics, vol. 33, no. 4, p. 768, 1977.
[29]
N. M. Noor, M. M. Al Bakri Abdullah, A. S. Yahaya, and
N. A. Ramli, Comparison of linear interpolation method
and mean method to replace the missing values in
environmental data set, Mater. Sci. Forum, vol. 803, pp.
278–281, 2014.
[30]
M. N. Noor, A. S. Yahaya, N. A. Ramli, and A. M. M. Al
Bakri, Filling missing data using interpolation methods:
Study on the effect of fitting distribution, presented at the
2013 International Conference on Advanced Materials
Engineering and Technology (ICAMET 2013), 2013,
Bandung, Indonesia.
[31]
H. Seng, A new approach of moving average method in
time series analysis, in Proc. Conf. New Media Studies
(CoNMedia), Tangerang, Indonesia, 2013, pp. 1–4.
[32]
GuodongXuetal.: MPformer: A Transformer-Based Model for Earthen Ruins Climate Prediction
Guodong Xu received the BS degree in
computer science and technology from
Northwest University, Xi’an, China, in
2022. He is currently working toward the
MS degree in software engineering at
School of Information Technology,
Northwest University, Xi’an, China. His
current research interests include time
series forecasting.
Hai Wang is an associate professor at
School of Information Technology,
Northwest University, Xi’an, China. He is
the deputy director of Shaanxi New
Network Security Guarantee and Service
Engineering Laboratory, member of China
Computer Society, and special committee
of Network and Data Communication,
China. His current research interests include mobile network
management, service computing, and semantic Web.
Shuo Ji received the PhD degree in
computer science from Xi’an Jiaotong
University, China, in 2019. She is
currently with School of Information
Science and Technology at Northwest
University, China. Her current research interests include large-scale data mining,
distributed systems, cloud computing, and
machine learning.
Yuhui Ma received the MS degree in
electromagnetic field and microwave
technology from Xidian University in
2018. She is currently a teacher with the
Center for Experimentation and
Instruction, School of Information
Technology, Northwest University, Xi’an,
China. Her current research interests
include microwave technology and antenna, information
security, and intelligent information processing.
Yi Feng received the MS degree in
software engineering from Northwest
University, Xi’an, China, in 2022. His
current research interests include big data
analysis and visualization techniques.