Deep Temporal Joint Clustering for Satellite Image
Time Series Analysis
Wenqi Guo, Zheng Zhang, Yu Meng, Weixiong Zhang, Shichen Gao, and Ping Tang
Abstract—With the advancement of remote sensing satellite technology, the acquisition of Satellite Image Time Series (SITS) data has significantly increased, providing new opportunities and challenges for land cover analysis. Traditional unsupervised clustering methods often struggle with the complexity of these data due to limitations in scalability and generalization capabilities. In response, this paper proposes a new unsupervised learning approach called Deep Temporal Joint Clustering (DTJC), designed for efficient pixel-wise clustering of SITS data. DTJC jointly optimizes the reconstruction of temporal information and a clustering objective, which not only preserves the temporal dynamics of the original data but also creates a feature space conducive to clustering. Experimental results show that DTJC achieves the best clustering performance among the compared methods on four publicly available multi-spectral SITS datasets: TimeSen2Crop, Cerrado Biome, Reunion Island, and Imperial. Building on the principles of the K-means clustering algorithm, DTJC significantly improves clustering accuracy over traditional optimized K-means and projection algorithms, especially in environments with complex geographical distributions. The DTJC method greatly enhances the efficiency of SITS data analysis without the need for labeled data, making it a powerful tool for automated land cover classification and environmental monitoring.
Index Terms—Satellite image time series, unsupervised, clustering, deep learning, joint optimization.
I. INTRODUCTION
In recent years, the evolution of earth observation satellites has led to a substantial increase in the availability of time-intensive remote sensing images [1]–[3]. These images, which capture the temporal dynamics of land cover, provide additional spectral and contextual data that are crucial for improving the distinction between various land cover types and reducing confusion among similar classes [4], [5]. Supervised machine learning methods have played a significant role in processing Satellite Image Time Series (SITS), enabling more accurate land cover classification and segmentation [6]–[9]. However, these methods are highly dependent on large amounts of accurately labeled data, which are often time-consuming and expensive to obtain, especially for large-scale satellite imagery datasets.
This research was funded by the National Key R&D Program of China (Grant No. 2021YFB3900503). (Corresponding author: Zheng Zhang.)
Wenqi Guo and Shichen Gao are with the School of Science, China University of Geosciences (Beijing), Beijing 100083, China (e-mail: 2119210056@email.cugb.edu.cn; gsc2039@cugb.edu.cn).
Zheng Zhang, Yu Meng, Weixiong Zhang and Ping Tang are with the Aerospace Information Research Institute (AIR), Chinese Academy of Sciences (CAS), Beijing 100094, China (e-mail: zhangzheng@aircas.ac.cn; mengyu@aircas.ac.cn; zhangweixiong@aircas.ac.cn; tangping@aircas.ac.cn).
To overcome these limitations, unsupervised methods, particularly clustering algorithms, have emerged as essential tools for analyzing multivariate SITS data [10]–[12]. Despite their prevalence, traditional clustering methods typically rely on static sample features, neglecting the temporal information that is crucial for accurate SITS analysis [13], [14]. This neglect often leads to suboptimal performance when dealing with the complex, multivariate nature of SITS data [15]. Recent advances in deep learning have sought to address these challenges by integrating temporal information more effectively into the clustering process [16]–[18].
Convolutional Autoencoders (CAEs) have been applied with clustering algorithms for classifying multi-temporal SAR image time series without training labels [19], while another approach leveraged cluster assignments as pseudo-labels to optimize deep feature extraction [20]. However, the two-stage process of the former and the heavy dependence on cluster information of the latter restrict the integration of clustering and data reconstruction, distorting the original SITS feature space and reducing their effectiveness in fully unsupervised learning. These limitations underscore the need for better-integrated approaches that preserve temporal information while fostering a feature space suitable for clustering.
To address these challenges, we apply joint optimization between the autoencoder and clustering module in an unsupervised training framework [21], [22]. This paper introduces Deep Temporal Joint Clustering (DTJC), an unsupervised clustering algorithm for pixel-wise clustering of SITS. DTJC addresses limitations in current methods by combining a temporal convolutional autoencoder with joint optimization, which simultaneously optimizes reconstruction and clustering losses. This approach preserves the local structure of the SITS data while guiding the model to a feature space that improves clustering. We evaluate our method on four publicly available multi-spectral SITS datasets with consistent time-series lengths. The main contributions of this paper are summarized as follows:
1) We propose a novel unsupervised clustering algorithm
for SITS land cover detection, integrating temporal
information reconstruction with a clustering objective to
enhance clustering performance.
2) We employ a centroid-based probability distribution and minimize its Kullback-Leibler divergence to a target distribution, which refines the quality of cluster assignments. Meanwhile, a reconstruction loss with temporal convolutional autoencoders is used to extract the intrinsic temporal dynamics of the original data, mitigating potential degradation during the joint optimization.
3) We compare our method with other optimized K-means and projection techniques, such as principal component analysis (PCA) and stacked temporal convolutional autoencoders (CAEs). Extensive experiments illustrate that our approach outperforms other benchmark clustering methods.
This article has been accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSTARS.2024.3502247
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
The remainder of this article is structured as follows: Section II provides an overview of related work. Section III introduces the four SITS datasets. Section IV elucidates the architecture and workflow of DTJC. Section V presents the evaluation metrics and the results of the SITS clustering experiments. Finally, Sections VI and VII offer discussions and conclusions, respectively.
II. RELATED WORK
With advancements in remote sensing technology, SITS data have become essential for understanding temporal land cover dynamics. Traditional approaches, particularly clustering algorithms, have been widely used to analyze SITS data [10], [23], [24]. This section provides an overview of recent advancements in clustering SITS data, highlighting key contributions and their limitations.
Unsupervised clustering methods have long been used for land cover classification in SITS data. Traditional clustering techniques, such as K-means [25] and hierarchical clustering [26], are static methods that do not account for temporal dependencies, making them unsuitable for time series data where temporal relationships are critical. These static approaches often fail to capture the dynamic changes inherent in SITS data, leading to limited clustering accuracy. Variations such as K-means combined with dynamic time warping (DTW) [27] have been foundational in SITS clustering, as DTW allows the identification of temporal trends by measuring similarities between time series. However, DTW is computationally intensive and becomes impractical for large-scale remote sensing time series data. Furthermore, while DTW captures temporal relationships, its effectiveness is often limited when dealing with the complexity and scale of SITS data, resulting in suboptimal clustering performance.
To address the shortcomings of traditional clustering approaches, dimensionality reduction techniques have been integrated with clustering methods. PCA [28] is commonly used to reduce the high dimensionality of SITS data before applying clustering algorithms. Although PCA helps manage the complexity of the data, it often loses temporal information that is crucial for accurate land cover analysis. In addition to PCA, other techniques such as Independent Component Analysis (ICA) [29] and Linear Discriminant Analysis (LDA) [30] have also been explored for dimensionality reduction. These methods facilitate feature extraction by transforming high-dimensional data into lower-dimensional spaces, making subsequent clustering more efficient. However, they still lack an integrated mechanism to optimize clustering concurrently with feature extraction, limiting their ability to capture complex temporal patterns effectively [16], [18].
Recent developments in deep learning have enabled the creation of more advanced clustering frameworks. For instance, Thomas et al. [19] explored the use of a Convolutional Autoencoder (CAE) combined with a clustering algorithm for classifying multi-temporal Synthetic Aperture Radar (SAR) image time series without training labels. This study utilized the CAE for dimensionality reduction to enhance the clustering precision of temporal SAR images. However, the two-stage nature of this method makes it difficult to use clustering outcomes as feedback to enhance feature extraction. Guo et al. [20] extended this by using cluster assignments as pseudo-labels to enhance deep feature extraction, but this approach risked distorting the feature space by overemphasizing cluster information. These methods underscore the need for better integration of clustering and deep learning, where temporal information is preserved and the feature space remains conducive to clustering. Achieving this balance between temporal feature reconstruction and clustering objectives continues to be a challenge, calling for approaches that more effectively unify feature extraction and clustering.
Our proposed Deep Temporal Joint Clustering (DTJC) builds on these previous works by integrating a temporal convolutional autoencoder with a clustering layer, optimized through a joint loss function. This approach addresses key challenges in unsupervised SITS clustering by combining temporal feature reconstruction with clustering optimization, thereby preserving temporal dynamics while creating a more clustering-friendly latent feature space. Unlike previous approaches that either neglected temporal dependencies or used two-stage clustering procedures, DTJC simultaneously enhances temporal feature extraction and clustering accuracy, as demonstrated by its superior performance on multiple SITS datasets.
III. STUDY AREA AND DATA
In our study, we employ four diverse multi-spectral SITS datasets, TimeSen2Crop, Cerrado Biome, Reunion Island, and Imperial, to rigorously evaluate the effectiveness of our method. These datasets are distinct not only in the volume of samples, which ranges from 50,000 to over 1.1 million, but also in their diversity of satellite sensors, including Landsat, Sentinel, and MODIS. They encompass a broad spectrum of geographic areas across the Americas, Europe, and Africa. Additionally, each dataset is characterized by its unique class distribution. For example, the Reunion Island dataset encompasses complex urban and natural landscapes, whereas the TimeSen2Crop dataset includes a variety of crop cover types, exhibiting significant variation in the number of samples per category. This diverse selection of datasets ensures a comprehensive evaluation of our proposed method. Details of the class information for these datasets are presented in Table I.
• The TimeSen2Crop Dataset [31] is constructed using Sentinel-2 satellite images that target Austrian agricultural parcels. In our experiments, the 2019 33UVP tile test dataset is used. Spanning from September 3, 2017, to September 1, 2018, this collection comprises 31 time-series satellite images, creating a pixel-based structure with 139,593 samples. Each of these samples captures 9 bands: blue (B2–490 nm), green (B3–560 nm), red (B4–665 nm), four vegetation red edge bands, and two short-wave infrared (SWIR) bands. The dataset is oriented towards 16 crop types. To evaluate noise resistance, instances of cloud cover and shadow from the original dataset are maintained. The geographical location of this dataset is shown in Fig. 1.
Fig. 1. The geographical information for TimeSen2Crop dataset.
• The Cerrado Biome Dataset [32] is derived from the MODIS sensor on NASA's Terra satellite, covering the years 2000 to 2017. It features 23 time-series images at 16-day intervals with a 250-meter resolution. This dataset includes four MODIS bands: NDVI, EVI, NIR, and MIR, and comprises 50,160 land use and cover samples across 12 categories, as classified by the Brazilian National Institute for Space Research and their collaborators. The classification adheres to Ribeiro's guidelines for the Cerrado biome [33]. The location of this dataset is displayed in Fig. 2.
Fig. 2. The geographical information for Cerrado dataset.
• The Reunion Island Dataset [34], originating from the 2017 TiSeLaC Time Series Land Cover Classification Competition, uses 23 time-series images from Landsat 8, capturing Reunion Island in 2014. It includes 81,714 samples, each with a 30 m resolution, and features seven surface reflectance bands and three indices (NDVI, NDWI, and BI) per pixel at each timestamp. It encompasses nine classes within its training subset. The geographical location is depicted in Fig. 3.
Fig. 3. The geographical information for Reunion dataset.
• The Imperial Dataset [20] comprises 23 time-series cloud-free Landsat 8 images of Imperial County, California, USA, from 2016 to 2018. With over one million samples, each offering a 30-meter spatial resolution, this dataset presents seven surface reflectance bands for each pixel at every timestamp. The labels correspond with California's crop mapping from 2018, focusing on seven primary categories. The dataset's geographical location is shown in Fig. 4.
Fig. 4. The geographical information for Imperial dataset.
IV. METHODOLOGY
In this section, we first give a general overview of our approach and then detail the structure and workflow of the proposed framework.
A. Context
Given a series of SITS acquired at regular time intervals, we define a single-temporal image as a flattened list of time series $x = \{x_i, \forall i \in [1, N]\}$, where $N$ is the number of pixels in a single image. A time series of length $T$ can be written as $x_i = (x_i^1, x_i^2, x_i^3, \cdots, x_i^T)$, where $x_i^t$ is the vector of a pixel sample at timestamp $t$. A deep clustering algorithm usually consists of learning a new representation of the raw data and performing clustering on that representation. This representation is derived by encoding the data using a deep neural network (DNN), which involves a non-linear mapping denoted as $f_\Theta : X \to H$, where $\Theta$ represents the learnable parameters of the encoder and $H$ signifies the representation of $X$ learned by the DNN. The new feature space created by the DNN is referred to as the latent space, which preserves the key temporal information in SITS data. Our objective is to acquire a feature space $H$ that effectively segregates $X$ into a partition $C = \{c_1, c_2, \ldots, c_k\}$ of $k$ clusters through a comprehensive joint learning strategy, which maximizes both the similarity within each cluster and the dissimilarity between distinct clusters [35].
TABLE I
CLASS INFORMATION FOR FOUR DATASETS
No.   | Cerrado class    | Samples | Reunion class        | Samples | TimeSen2Crop class   | Samples | Imperial class | Samples
1     | Dunes            | 550     | Urban Areas          | 16,000  | Legumes              | 853     | Alfalfa        | 558,386
2     | Fallow-Cotton    | 630     | Other built surfaces | 3,236   | Grassland            | 20,061  | Grasses        | 300,896
3     | Millet-Cotton    | 316     | Forests              | 16,000  | Maize                | 20,001  | Carrots        | 42,334
4     | Soy-Corn         | 4,971   | Sparse Vegetation    | 16,000  | Potato               | 2,910   | Onions/garlic  | 64,796
5     | Soy-Cotton       | 4,124   | Rocks and bare soil  | 12,942  | Sunflower            | 151     | Lettuce        | 88,882
6     | Soy-Fallow       | 2,098   | Grassland            | 5,681   | Soy                  | 9,750   | Citrus         | 31,607
7     | Pasture          | 7,206   | Sugarcane crops      | 7,656   | Winter Barley        | 20,001  | Corn           | 60,703
8     | Rocky Savanna    | 8,005   | Other crops          | 1,600   | Winter Caraway       | 202     |                |
9     | Savanna          | 9,172   | Water                | 2,599   | Rye                  | 7,503   |                |
10    | Dense Woodland   | 9,996   |                      |         | Rapeseed             | 3,923   |                |
11    | Savanna Parkland | 2,699   |                      |         | Beet                 | 3,573   |                |
12    | Planted Forest   | 423     |                      |         | Spring Cereals       | 8,111   |                |
13    |                  |         |                      |         | Winter Wheat         | 20,001  |                |
14    |                  |         |                      |         | Winter Triticale     | 12,342  |                |
15    |                  |         |                      |         | Permanent Plantation | 210     |                |
16    |                  |         |                      |         | Other Crops          | 10,001  |                |
Total |                  | 50,160  |                      | 81,714  |                      | 139,593 |                | 1,147,604
B. The structure of the joint optimization process
The joint optimization process, as depicted in Fig. 5, consists of two main components: a temporal convolutional autoencoder and a clustering layer. The autoencoder captures the temporal features of the data, while the clustering layer refines the feature space for clustering. These two components are optimized simultaneously through a joint loss function, combining the reconstruction and clustering losses, as described in this section.
1) Temporal Convolutional Autoencoder: In recent years, several deep learning solutions have provided the representational tools to extract discriminative information in an unsupervised framework. A classic example is the autoencoder [36], which consists of an encoder and a decoder built from multi-layer neural networks. It first projects the input data onto a latent space and subsequently reconstructs the original input from the latent features. This process can be formulated as follows:
$$h_i = f_{enc}(x_i) \tag{1}$$
$$\hat{x}_i = f_{dec}(h_i) \tag{2}$$
where $f_{enc}$ is the encoder function that maps the input data $x_i$ onto a latent representation $h_i$, and $f_{dec}$ reconstructs the original input as $\hat{x}_i$ from $h_i$. Finally, gradient backpropagation updates the network to learn a better representational space.
In the field of SITS land cover classification, the temporal convolutional network [37] is a supervised learning architecture inherently designed for time series data, which has proved its ability to classify time series. As shown in Fig. 6, we adopt stacked temporal convolutional networks as our encoder network. Specifically, several 1D convolutional layers are stacked on the input SITS data to extract hierarchical temporal features. Each convolutional layer is followed by a BatchNorm1d and a ReLU layer. The units in the final layer are flattened to a vector, followed by a fully connected layer referred to as the embedded layer, which transforms the SITS data into a deep feature space. For unsupervised training, we utilize a series of transposed convolutional layers to map the embedded features back to the original SITS data. The decoder's network structure mirrors that of the encoder.
Feature extraction along the time dimension is the central role of temporal convolution. The extent of the receptive field is determined by the size of the convolution kernel, which senses a fixed temporal neighborhood when encoding features. In the proposed autoencoder framework, the parameters of the encoder $f_{enc}$ and decoder $f_{dec}$ are iteratively updated by minimizing the reconstruction error:
$$L_r = \frac{1}{n} \sum_{i=1}^{n} \| f_{dec}(f_{enc}(x_i)) - x_i \|_2^2 \tag{3}$$
where $n$ is the number of pixels in the dataset and $x_i$ is the $i$-th sample.
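As a minimal illustration of Eq. (3), the sketch below evaluates the reconstruction error with NumPy for a batch of reconstructed pixel time series. The function name `reconstruction_loss` and the array layout `(n, T, C)` (samples, timestamps, features per timestamp) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Eq. (3): squared L2 norm of each sample's residual, averaged over samples.

    x, x_hat: arrays of shape (n, T, C). This layout is an assumption
    made for the sketch, not a detail given in the paper.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    # Flatten each sample so the norm runs over all timestamps and features.
    residual = (x_hat - x).reshape(x.shape[0], -1)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Toy check: 2 samples, 3 timestamps, 2 features, every entry off by 1.
x = np.zeros((2, 3, 2))
x_hat = np.ones((2, 3, 2))
print(reconstruction_loss(x, x_hat))  # 6 unit errors per sample -> 6.0
```

Note that an element-wise mean squared error (for example, PyTorch's `nn.MSELoss` with its default `reduction='mean'`) averages over every element rather than over samples, so it differs from Eq. (3) only by the constant factor T x C and leads to the same minimizer.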
Fig. 5. The structure of the joint optimization process.
Fig. 6. The network of temporal convolutional network.
2) Cluster Module and Cluster Layer: In our deep clustering framework, although the features learned through reconstruction loss capture informative characteristics, they are not directly suitable for clustering tasks. Following the unsupervised learning approach in image clustering by Xie et al. [38], we employ a clustering layer and clustering loss to create a space more conducive to clustering.
The primary function of the clustering layer is to soft-assign each embedded point $h_i$ to a cluster center. We utilize the Student's t-distribution as a kernel function to assess the similarity between the embedded point $h_i$ and the cluster centroid $u_j$, calculated as follows [39]:
$$q_{ij} = \frac{(1 + \|h_i - u_j\|^2)^{-1}}{\sum_{j'} (1 + \|h_i - u_{j'}\|^2)^{-1}} \tag{4}$$
where $q_{ij}$ represents the probability of embedded point $h_i$ belonging to cluster $j$.
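The soft assignment of Eq. (4) can be sketched in a few lines of NumPy. The function name `soft_assign` and the array shapes are illustrative assumptions; the paper's own implementation is in PyTorch and is not reproduced here.

```python
import numpy as np

def soft_assign(h, u):
    """Eq. (4): Student's t kernel between embeddings and centroids.

    h: (n, d) embedded points; u: (k, d) cluster centroids.
    Returns q of shape (n, k) whose rows sum to 1.
    """
    h = np.asarray(h, dtype=float)
    u = np.asarray(u, dtype=float)
    # Squared Euclidean distance between every point and every centroid.
    sq_dist = np.sum((h[:, None, :] - u[None, :, :]) ** 2, axis=2)
    kernel = 1.0 / (1.0 + sq_dist)
    # Normalize over clusters so each row is a probability distribution.
    return kernel / kernel.sum(axis=1, keepdims=True)

# Two points that coincide with the two centroids, one unit apart.
h = np.array([[0.0, 0.0], [1.0, 0.0]])
u = np.array([[0.0, 0.0], [1.0, 0.0]])
q = soft_assign(h, u)
print(q)  # rows are [2/3, 1/3] and [1/3, 2/3]
```

Even a point lying exactly on a centroid keeps a non-zero probability for the other clusters, which is what makes the assignment "soft".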
The clustering loss is defined using the Kullback-Leibler (KL) divergence, which measures the difference between the target distribution $P$ and the model-predicted distribution $Q$. $Q$ is calculated as described above. The target distribution $P$ is computed through the following steps:
1) Squaring and Normalization: Each element $q_{ij}$ is squared to emphasize more probable cluster assignments, followed by normalization to ensure $P$ remains a probability distribution.
2) Normalization by Cluster Frequency: The squared values $q_{ij}^2$ are further normalized by the soft frequency of data points in each cluster, $\sum_i q_{ij}$, which keeps large clusters from dominating $P$.
The formula for $P$ is:
$$p_{ij} = \frac{q_{ij}^2 / \sum_i q_{ij}}{\sum_{j'} \left( q_{ij'}^2 / \sum_i q_{ij'} \right)} \tag{5}$$
Finally, the clustering loss $L_c$ is defined as:
$$L_c = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{6}$$
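Eqs. (5) and (6) can be illustrated with a small NumPy sketch. The names `target_distribution` and `clustering_loss`, and the small `eps` added inside the logarithm for numerical safety, are assumptions of this example rather than details from the paper.

```python
import numpy as np

def target_distribution(q):
    """Eq. (5): square q, divide by soft cluster frequency, renormalize rows."""
    q = np.asarray(q, dtype=float)
    weight = q ** 2 / q.sum(axis=0, keepdims=True)  # q_ij^2 / sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(p, q, eps=1e-12):
    """Eq. (6): KL(P || Q) summed over samples and clusters.

    eps is a numerical guard added for this sketch only.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

q = np.array([[0.8, 0.2],
              [0.4, 0.6]])
p = target_distribution(q)
print(p)                      # sharper than q: approx. [[0.914, 0.086], [0.229, 0.771]]
print(clustering_loss(p, q))  # strictly positive because p differs from q
```

In this toy case the dominant assignment of each sample is strengthened (0.8 becomes about 0.91, 0.6 becomes about 0.77), so minimizing the divergence pulls $Q$ towards more confident assignments.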
To facilitate comparisons, we employ the widely used K-means algorithm to initialize cluster centers and assignments. These are then continuously adjusted through the clustering layer during the joint optimization process. This approach allows the model to gradually adapt to the true structure of the data throughout the training process, enhancing both the accuracy and efficiency of clustering.
3) Optimization: The embedded feature space could be distorted when relying solely on a clustering-oriented loss. To mitigate this, the reconstruction loss of the autoencoder is added to the objective and optimized simultaneously with the clustering loss. This joint optimization is crucial in DTJC, as it allows the model to retain the temporal dynamics of the original data while guiding the feature space towards a clustering-friendly configuration. By integrating the temporal reconstruction loss with the clustering loss, the model preserves critical temporal information during feature extraction, which enhances the overall clustering accuracy and stability. The proposed training objective is:
$$L = L_r + \gamma L_c \tag{7}$$
where $L_r$ and $L_c$ are the reconstruction loss and clustering loss, respectively, and $\gamma > 0$ is a coefficient that controls the extent of distortion in the embedded space. We initialize the network parameters by setting $\gamma = 0$, and then set $\gamma$ to a specific value for joint optimization.
C. The workflow of Deep Temporal Joint Clustering
Fig. 7. The workflow of DTJC.
Algorithm 1 The workflow of the proposed method.
Require:
  The SITS dataset, X; regularization parameter, γ;
  number of principal components, P;
  epoch1 and epoch2;
Ensure:
  Cluster assignments and clustering metric, Y;
1: Feature engineering process:
2:   Normalize the original dataset X;
3:   Reduce the dimensionality of the normalized data using PCA;
4: Initialization process:
5:   For iter = 1 to epoch1 do:
6:     Pretrain the autoencoder weights by minimizing (3);
7:   End for
8:   Initialize cluster centers by clustering the latent features;
9: Joint optimization process:
10:  For iter = 1 to epoch2 do:
11:    Update the autoencoder and clustering layer by minimizing (7);
12:    Update cluster assignments and target distribution;
13:  End for
14: Evaluate the quality of clustering
The workflow of our proposed method is illustrated in Fig. 7, encompassing three main stages: preprocessing, initialization, and joint optimization.
We conduct experiments on diverse SITS datasets, which exhibit variations in land categories, quantity distribution, and satellite sensors. To develop a universal clustering method for detecting SITS land cover, no filtering or imputation operations are applied. Therefore, the clouds, shadows, and other noise in the original data are retained, which can significantly impact the training results [40]. To limit this impact, normalization followed by PCA for dimensionality reduction is applied prior to training.
SITS data typically encompass observations from multiple bands or sensors, with potentially varying value ranges across these bands. Maximum-minimum normalization [41] is employed to map the values of SITS data to a specific range:
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{8}$$
where $x$ is the original SITS data, $x_{min}$ and $x_{max}$ are the smallest and largest values in the data, and $x_{norm}$ is the normalized data.
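Eq. (8) can be sketched as follows. The text does not state whether the minimum and maximum are taken over the whole dataset or separately per band, so the hypothetical `axis` argument below covers both readings; the guard for constant inputs is also an addition of this sketch.

```python
import numpy as np

def min_max_normalize(x, axis=None):
    """Eq. (8): map values to [0, 1].

    axis=None uses a single global min/max; passing e.g. axis=(0, 1) for
    data shaped (n, T, C) normalizes each band separately. Which variant
    the paper uses is not stated, so both are offered as assumptions.
    """
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    # Avoid division by zero when all values along the axis are identical.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

data = np.array([[0.0, 10.0],
                 [5.0, 20.0]])
print(min_max_normalize(data))          # global: [[0, 0.5], [0.25, 1]]
print(min_max_normalize(data, axis=0))  # per column: [[0, 0], [1, 1]]
```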
In addition, given our focus on temporal information, we utilize PCA to minimize the influence of spectral variations on clustering results. By choosing the appropriate number of principal components, we effectively reduce redundant information within the spectral data. This preprocessing methodology is consistently applied across all datasets.
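One possible reading of this step is a PCA applied along the band axis, so that every timestamp keeps its position while the spectral features are compressed. The sketch below implements that reading with an SVD in NumPy; the function name `spectral_pca`, the `(n, T, C)` layout, and the choice of axis are all assumptions, since the text does not specify how the PCA is arranged.

```python
import numpy as np

def spectral_pca(x, n_components):
    """Project the band axis of (n, T, C) data onto its leading principal components.

    Assumption of this sketch: every (pixel, timestamp) pair is treated as
    one observation with C spectral features, so the time axis is untouched.
    """
    x = np.asarray(x, dtype=float)
    n, t, c = x.shape
    flat = x.reshape(-1, c)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected.reshape(n, t, n_components)

# Synthetic stand-in for normalized SITS: 50 pixels, 23 timestamps, 7 bands.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 23, 7))
reduced = spectral_pca(data, n_components=3)
print(reduced.shape)  # (50, 23, 3)
```

Keeping the time axis intact means the temporal convolutions downstream still see a sequence of length T, only with fewer channels per timestamp.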
After feature engineering, we initialize the network weights by training the autoencoder with Eq. (3). This allows the encoder to capture the intrinsic temporal information of the original SITS data. Subsequently, the cluster centers and assignments are initialized by performing K-means on the embedded features. Following the initialization phase, a joint optimization process is conducted using Eq. (7). It updates both the autoencoder and clustering layers simultaneously, optimizing both the reconstruction loss and the clustering loss to ensure that the learned feature space is suitable for clustering while preserving the temporal structure of the original data. Algorithm 1 summarizes the entire process. By inputting the raw SITS, the number of clusters, and the maximum iterations for the initialization and optimization processes (epoch1 and epoch2), we ultimately obtain the cluster assignments and accuracy metrics. This algorithm represents a fully unsupervised training process, where actual labels are only used for accuracy assessment and do not participate in any training process.
D. Competitive methods and Implementation Details
In our experiments, the performance of DTJC is compared with six other optimized K-means and projection methods, which are summarized as follows:
• Kmeans [25]. We apply the classic K-means with Euclidean distance directly on the multivariate pixel time series, using the implementation in the scikit-learn library. It is run repeatedly with 20 random initial centers and the highest performance is reported.
• Kmeans-DTW [27]. This version of K-means applies a dynamic time warping metric instead of the Euclidean distance, which is specifically designed to measure the similarity between time series data. We implement this algorithm based on the tslearn package. The algorithm is run with ten sets of random initial centers and the optimal result is reported.
• PCA+Kmeans [28]. Principal Component Analysis (PCA) is a dimensionality reduction (DR) technique widely used in statistics and machine learning. We apply the PCA algorithm before clustering, implemented with the Faiss package. For each dataset, we start by flattening the features and then apply PCA to obtain new features. We report the optimal clustering results across different numbers of components.
• CAEs+Kmeans [19]. This method projects time series data onto an embedding space using a temporal convolutional autoencoder and applies the K-means clustering algorithm to the embedding features. This comparative experiment can be viewed as a two-stage deep clustering algorithm without the joint optimization process, which makes it possible to evaluate the impact of joint optimization on clustering accuracy. We implement the comparative method using the same network structure and parameters as the proposed method.
• RNN+Kmeans [42]. This method is similar to CAEs+Kmeans, but it replaces the TempCNN network with an RNN. The RNN processes the time series data and then projects it onto an embedding space, followed by the application of K-means to the embedding features. This method serves as a comparative experiment to evaluate the effect of using RNN-based temporal feature extraction.
• LSTM+Kmeans [43]. Similarly, this method replaces the TempCNN network in CAEs+Kmeans with an LSTM. The LSTM is used to capture temporal dependencies in the time series data, and the resulting embeddings are clustered using K-means. This allows us to compare the performance of LSTM-based temporal feature extraction against other methods.
In our proposed approach, the encoder comprises five stacked
1D convolutional layers. The first four Conv1d layers use a
kernel size of 5, a stride of 1, and a padding of 2; the fifth
uses a kernel size of 3, a stride of 1, and a padding of 1.
The decoder mirrors the encoder architecture and is implemented
with ConvTranspose1d layers. For the K-means clustering
component, we rely on the scikit-learn library [44] and
initialize the cluster centers randomly ten times. The
clustering layer and clustering loss in our model are
constructed following [38]. During training, we employ the Adam
optimizer [45], using the Mean Squared Error loss [46] for
reconstruction and the Kullback-Leibler divergence loss [47]
for clustering simultaneously. The learning rates for
initialization and joint optimization are set to 0.002 and
0.001, respectively. Detailed information on the network
architecture and the other dataset-specific hyper-parameters is
presented in Table II. In the table, 'Bs' is the batch size,
'H' is the size of the latent features, 'γ' is the
regularization parameter, 'Ep1' and 'Ep2' are the numbers of
iterations for initialization and joint optimization, and 'Rs'
is the random seed. Our implementation is built with the
PyTorch framework [48] and trained on an NVIDIA GeForce RTX
2080 Ti.
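The architecture described above can be sketched in PyTorch as
follows. This is a minimal illustration rather than the released
implementation: the kernel sizes, strides, and paddings follow the
text, and the channel widths follow the Cerrado row of Table II,
but the ReLU activations, the linear layers that map to and from
the H-dimensional latent vector, and the class name `TemporalCAE`
are assumptions.

```python
import torch
from torch import nn


class TemporalCAE(nn.Module):
    """Sketch of a 1D convolutional autoencoder over the time axis."""

    def __init__(self, in_bands, seq_len, channels=(16, 32, 32, 64, 64), latent_dim=200):
        super().__init__()
        kernels = (5, 5, 5, 5, 3)   # from the text
        pads = (2, 2, 2, 2, 1)      # stride 1 + these paddings keep the length
        enc, prev = [], in_bands
        for ch, k, p in zip(channels, kernels, pads):
            enc += [nn.Conv1d(prev, ch, kernel_size=k, stride=1, padding=p), nn.ReLU()]
            prev = ch
        self.encoder = nn.Sequential(*enc)
        # Assumption: a linear layer produces the latent feature of size H.
        self.to_latent = nn.Linear(channels[-1] * seq_len, latent_dim)
        self.from_latent = nn.Linear(latent_dim, channels[-1] * seq_len)
        rev = list(channels[::-1])
        outs = rev[1:] + [in_bands]
        dec = []
        for i, (cin, cout, k, p) in enumerate(zip(rev, outs, kernels[::-1], pads[::-1])):
            dec.append(nn.ConvTranspose1d(cin, cout, kernel_size=k, stride=1, padding=p))
            if i < len(rev) - 1:    # no activation on the reconstruction
                dec.append(nn.ReLU())
        self.decoder = nn.Sequential(*dec)
        self.seq_len = seq_len
        self.last_ch = channels[-1]

    def forward(self, x):
        # x has shape (batch, bands, time)
        h = self.encoder(x)
        z = self.to_latent(h.flatten(1))
        h = self.from_latent(z).view(-1, self.last_ch, self.seq_len)
        return z, self.decoder(h)


model = TemporalCAE(in_bands=4, seq_len=12)
x = torch.randn(8, 4, 12)
z, x_rec = model(x)
```

Because every layer uses a stride of 1 with matching padding, the
temporal length is preserved through the encoder and decoder, so
the reconstruction has the same shape as the input.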
E. Evaluation Metrics
To evaluate the effectiveness of the Deep Temporal Joint
Clustering (DTJC) approach comprehensively, we employ
several established metrics from unsupervised learning:
• Normalized Mutual Information (NMI) [49]: NMI measures
the mutual dependency between the clustering outcomes
and the true classes, normalized by the entropy of each
set:
$$\mathrm{NMI}(U, V) = \frac{2 \times I(U; V)}{H(U) + H(V)} \tag{9}$$
where $I(U; V)$ is the mutual information, and $H(U)$ and
$H(V)$ are the entropies of the clusters and classes,
respectively.
• Adjusted Rand Index (ARI) [50]: ARI corrects the Rand
Index for chance groupings of elements, offering a
normalized measure:
$$\mathrm{ARI} = \frac{\mathrm{Index} - \mathrm{Expected}}{\mathrm{Max} - \mathrm{Expected}} \tag{10}$$
In this formula, 'Index' is the sum, over all cluster-class
pairs, of the binomial coefficients counting the pairs of
elements that fall in the same cluster and the same class,
representing the actual agreement. 'Expected' is the
agreement expected by chance, calculated as the product of
the summed binomial coefficients of the cluster sizes and
of the class sizes, divided by the binomial coefficient of
the total number of elements. 'Max' is the average of those
two sums of binomial coefficients (over clusters and over
classes), representing the maximum attainable index.
This article has been accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSTARS.2024.3502247
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
TABLE II
THE NETWORK ARCHITECTURE AND HYPER-PARAMETERS FOR DIFFERENT DATASETS
Dataset       Network structure                                                                            Bs   H    γ     Ep1  Ep2  Rs
Cerrado       conv1d^{5}_{16} → conv1d^{5}_{32} → conv1d^{5}_{32} → conv1d^{5}_{64} → conv1d^{3}_{64}      128  200  0.01  100  50   4000
Reunion       conv1d^{5}_{16} → conv1d^{5}_{32} → conv1d^{5}_{64} → conv1d^{5}_{128} → conv1d^{3}_{256}    512  200  0.1   100  200  3000
Timesen2crop  conv1d^{5}_{16} → conv1d^{5}_{32} → conv1d^{5}_{32} → conv1d^{5}_{64} → conv1d^{3}_{64}      512  150  1     200  60   2000
Imperial      conv1d^{5}_{16} → conv1d^{5}_{32} → conv1d^{5}_{32} → conv1d^{5}_{64} → conv1d^{3}_{64}      512  80   0.1   150  250  4000
• Clustering Accuracy (ACC) [51]: ACC quantifies the
proportion of correctly classified points:
$$\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big(y_i = c(x_i)\big) \tag{11}$$
where $n$ is the total number of points, $y_i$ is the true
label for point $i$, and $c(x_i)$ is the cluster assignment
for $x_i$.
These metrics are implemented using the scikit-learn
library, ensuring accurate and reliable evaluations. The use
of NMI, ARI, and ACC provides a robust framework for
evaluating our clustering algorithm’s performance, allowing
for a detailed assessment of its ability to capture the true
structure of the data.
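The three metrics above can be computed as in the following
sketch. NMI and ARI use the scikit-learn functions directly; the
default arithmetic normalization of
`normalized_mutual_info_score` corresponds to Eq. (9). For ACC,
cluster indices are arbitrary, so this sketch assumes that
clusters are first matched one-to-one to classes with the
Hungarian algorithm (via SciPy) before Eq. (11) is applied. That
matching step and the helper name `clustering_accuracy` are
assumptions, not details stated in the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """ACC after the best one-to-one matching of clusters to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes, true_idx = np.unique(y_true, return_inverse=True)
    clusters, pred_idx = np.unique(y_pred, return_inverse=True)
    # counts[i, j] = number of points in cluster i whose true class is j
    counts = np.zeros((clusters.size, classes.size), dtype=np.int64)
    np.add.at(counts, (pred_idx, true_idx), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size


y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # cluster ids are permuted; one point is wrong

acc = clustering_accuracy(y_true, y_pred)          # 8 of 9 points match
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

All three scores are invariant to a relabeling of the clusters,
which is why they are suitable for comparing unsupervised
assignments against reference classes.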
V. EXPERIMENT AND RESULT
In this section, we provide a detailed introduction to the
comparative experiments, implementation details, evaluation
metrics, and the visual and quantitative results.
1) Analysis of Initialization Process for DTJC: Fig. 8
shows the loss function trends during the initialization
phase on the four multi-spectral Satellite Image Time Series
(SITS) datasets. The figure traces the loss over successive
epochs and reflects how the number of initialization
iterations is adapted to datasets of different scales. To
make these dynamics easier to read, the curves start from
the first epoch and intentionally exclude the initial epoch
(epoch 0), whose loss values are typically high and would
obscure the later stabilization and incremental improvements.
Each dataset (TimeSen2Crop, Cerrado Biome, Reunion Island,
and Imperial) is drawn in its own color, and the curves
indicate consistent behavior of the model across datasets
of differing size and complexity.
Specifically, the loss functions of all datasets exhibit a
rapid decline during the initial iterations, indicating that the
autoencoder can quickly capture the main features of the data,
facilitating effective feature extraction and data representation
optimization. As the number of iterations increases, the rate
of decline in the loss function gradually slows and eventually
stabilizes. This phenomenon suggests that the autoencoder
progressively refines data representation, eventually reaching
a stable state where data can be effectively reconstructed.
For the Cerrado dataset, the loss value rapidly decreases and
stabilizes after relatively few iterations, indicating that this
dataset quickly achieves optimal data representation during
initialization. In contrast, the Reunion dataset exhibits a slower
rate of loss decline, requiring more iterations to optimize data
representation. The loss value trends for the TimeSen2Crop
and Imperial datasets fall between those of the Cerrado and
Reunion datasets, ultimately stabilizing as well.
Fig. 8. Evolution of the loss function during the initialization phase
across four SITS datasets, demonstrating the effectiveness of the autoencoder
network in refining data representations.
These observations demonstrate that the autoencoder ex-
hibits a consistent convergence trend across different datasets,
validating its robustness and efficacy in processing various
SITS data. Through the use of multiple layers of 1D convolu-
tion, the autoencoder effectively extracts temporal hierarchical
features, capturing key information within the SITS data while
reducing noise, thereby enhancing the quality of data represen-
tation. This feature extraction method not only preserves the
temporal dynamics of the data but also improves clustering
accuracy and stability. The DTJC method exhibits significant
advantages in effective feature extraction and representation
optimization. By simultaneously optimizing the reconstruction
loss of the autoencoder and the clustering loss, the DTJC
method can create a feature space conducive to clustering
while preserving the local structure of the original data. This
analysis of the initialization process confirms that the
algorithm converges and supports the broad applicability of
the Deep Temporal Joint Clustering method across various
SITS datasets.
2) Analysis of Joint Optimization for DTJC: We performed
a detailed visual analysis of the DTJC approach on the
TimeSen2Crop dataset. Using the t-distributed stochastic
neighbor embedding (t-SNE) technique [52], we projected the
learned deep features onto a two-dimensional plane and
compared them with the true labels. This process
Fig. 9. T-SNE visualizations for the evolution of feature representations of the TimeSen2Crop dataset through various stages of the DTJC process: (a) Raw
feature scatter before training, (b) Post-initialization deep features, (c) Deep features after 30 epochs of joint optimization, and (d) Deep features following
60 epochs of joint optimization.
offers an enhanced understanding of the ability of DTJC to
discern and cluster intricate representations during the joint
optimization phase.
Fig. 9 captures the progressive refinement of features
throughout the training procedure on the TimeSen2Crop
dataset, with various crop types delineated by distinct colors.
In the initial stages, the feature distribution is somewhat
chaotic, with a noticeable overlap among different classes,
which poses a challenge for direct clustering using the original
features. After the initialization phase, there is a notable
emergence of cluster prototypes, yet the boundaries between
classes are not clearly defined. As the training progresses
to the 30th epoch, the model begins to distinguish between
classes with greater clarity. By the 60th epoch, the model
has successfully learned deep representations characterized
by minimized intra-class distances and maximized inter-class
separations. These optimized representations are eminently
suited for clustering, as evidenced by the distinct and cohesive
groupings of data points pertaining to the same class.
These visualizations underscore the ability of DTJC to
overcome the limitations of the K-means algorithm when
dealing with raw, unstructured features, and they demonstrate
the robustness of the model. The evolving feature landscape
shows that the model refines the data representations and
produces increasingly accurate cluster assignments as the
joint optimization unfolds. This view of the training
trajectory of DTJC on the TimeSen2Crop dataset reveals the
dynamic nature of feature separation and the effectiveness
of the model in clustering complex agricultural data.
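A projection of this kind can be produced with scikit-learn's
t-SNE, as in the minimal sketch below. The feature matrix here is
synthetic and stands in for the learned embeddings; the helper
name `embed_2d`, the perplexity value, and the PCA initialization
are assumptions and not settings reported for Fig. 9.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_2d(features, seed=0, perplexity=30.0):
    """Project (N, H) feature vectors to 2D with t-SNE for visual inspection."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(features)


# Synthetic stand-in for deep features: three groups of 60 points in 16 dimensions.
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(60, 16)) for center in (0.0, 3.0, 6.0)]
)
points = embed_2d(features)
```

Plotting `points` colored by the reference labels at successive
training stages gives a view comparable to the four panels
described above.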
3) Analysis of Evaluation Metrics for Different Clustering
Methods: Table III reports the clustering performance
metrics (ACC, NMI, and ARI) for all methods on the four
datasets. The K-means clustering algorithm is run 20 times,
and the highest performance metrics are reported. For our
method, we present the best accuracy attained at the end of
the iterations.
An analysis of the Cerrado dataset shows that the best
performance is achieved when the cluster count is set to nine,
which may be related to the categories with sparse instances.
The optimal number of clusters is discussed further in
Section VI-A. A consistent observation across all datasets is
the superior performance of our proposed method over the
comparative methods, including the two-stage CAEs+Kmeans as
well as the deep clustering methods based on RNN and LSTM.
This underscores the efficacy of our joint optimization
approach.
TABLE III
CLUSTERING METRICS FOR DIFFERENT METHODS ACROSS FOUR DATASETS
Method Cerrado Reunion TimeSen2crop Imperial
ACC NMI ARI ACC NMI ARI ACC NMI ARI ACC NMI ARI
Kmeans 0.728 0.664 0.593 0.553 0.423 0.374 0.181 0.055 0.152 0.322 0.168 0.096
Kmeans-DTW 0.668 0.637 0.528 0.562 0.430 0.370 0.224 0.169 0.085 0.344 0.227 0.138
PCA+Kmeans 0.692 0.592 0.481 0.654 0.472 0.402 0.560 0.609 0.486 0.661 0.553 0.526
RNN+Kmeans 0.732 0.671 0.594 0.634 0.465 0.402 0.550 0.551 0.461 0.651 0.551 0.503
LSTM+Kmeans 0.801 0.703 0.662 0.631 0.445 0.398 0.531 0.529 0.446 0.663 0.562 0.522
CAEs+Kmeans 0.822 0.724 0.681 0.635 0.454 0.431 0.545 0.543 0.452 0.670 0.572 0.525
Our Method 0.832 0.735 0.701 0.672 0.499 0.495 0.662 0.608 0.590 0.807 0.642 0.744
The K-means performance on the TimeSen2Crop and Imperial
datasets is notably suboptimal, which can be attributed to the
extensive category range, the imbalance among categories, the
voluminous data set sizes, and the potential noise introduced
by collection errors and atmospheric conditions. However, the
PCA-enhanced methodology significantly augments clustering
results for both datasets. Similarly, while the RNN+Kmeans
and LSTM+Kmeans methods offer improvements over tradi-
tional K-means, their performance is still outpaced by our
approach, highlighting the benefits of our joint optimization
strategy. Our joint optimization method, by simultaneously
optimizing the reconstruction loss of the autoencoder and
the clustering loss, effectively extracts temporal features from
SITS data, thereby achieving noise reduction and considerable
clustering enhancements.
Of particular interest is the outcome of the Cerrado dataset,
where clustering the native features yields commendable re-
sults, suggesting that these features intrinsically encapsulate
substantial categorical information. Given that the dataset
comprises only four primary spectral bands, it is inherently
suitable for clustering. However, applying PCA in this context
inadvertently reduces valuable data, which negatively impacts
clustering metrics. Nevertheless, our method still manages to
enhance clustering outcomes through its joint optimization
process, meticulously refining the clustering results while
retaining the critical information inherent in the original data.
The significant advantage of this joint optimization strategy
lies in its ability to simultaneously optimize reconstruction
and clustering losses within a single framework, enabling
the model to create a feature space conducive to clustering
while preserving the local structure of the original data. This
not only improves the quality of feature representation but
also significantly enhances clustering accuracy and stability.
The experimental results clearly demonstrate that our method
consistently exhibits the best clustering performance across
all four datasets, outperforming both traditional and deep
clustering methods, further validating its broad applicability
and effectiveness in various SITS datasets.
4) Changes in Clustering Evaluation Metrics During the
Training Process: Fig. 10 illustrates the clustering perfor-
mance of the DTJC method across four distinct SITS datasets:
Cerrado, Imperial, TimeSen2Crop, and Reunion. As the num-
ber of iterations increases, the clustering performance, mea-
sured by key metrics such as NMI, ARI, and ACC, shows
progressive improvement. For instance, the Cerrado dataset,
which includes natural vegetation and farmland, demonstrates
significant enhancements in clustering accuracy and stability
through the joint optimization process. The Imperial dataset,
comprising extensive farmland imagery, also indicates that the
DTJC method effectively adapts to and processes datasets with
complex geographical features.
In the TimeSen2Crop dataset, which focuses on crop type
time series classification, the model exhibits considerable
fluctuations, likely due to the significant spectral characteristic
changes during different crop growth stages. Despite these
fluctuations, the DTJC method optimizes the clustering process
by learning these temporal dynamics, gradually enhancing
clustering precision. The Reunion dataset, which encompasses
diverse geographical and ecological environments, including
urban areas, natural landscapes, and agricultural land, shows
variability in clustering performance. This reflects the chal-
lenges in adapting clustering strategies to such diversity, but
the overall trend indicates a steady improvement in perfor-
mance.
Visualizing clustering performance over iterations provides
insights into the algorithm’s performance dynamics. The suc-
cess of the DTJC method across these varied datasets can be
attributed to its unique structure and optimization strategy. By
integrating temporal information reconstruction with clustering
objectives, and combining a temporal convolutional autoen-
coder with the K-means clustering algorithm, the method
adjusts cluster centers and classification boundaries through a
joint optimization process. This strategy allows the algorithm
to create a feature space conducive to clustering while pre-
serving the intrinsic properties of the original data, leading to
more accurate clustering in diverse geographical information
datasets.
5) Analysis of the Clustering Results Map: Fig. 11 provides
a vivid portrayal of the cluster assignments for the Reunion,
Imperial, and Cerrado datasets, each distinguished by distinct
color codings to represent various land cover classes. The vi-
sualization encapsulates the outcomes from different clustering
methods, including Kmeans, PCA+Kmeans, and DTJC.
For the Reunion and Cerrado datasets, the maps depict that
most clustering methods yield commendable results, reflecting
a level of regularity in the geographical distribution of similar
land cover classes. Despite occasional misclassifications, the
DTJC method shows a remarkable ability to correct these
inaccuracies, achieving a more accurate representation of the
actual land cover. This is particularly evident in the nuanced
capture of various environments, ranging from urban sprawl
(a) Cerrado (b) Imperial
(c) TimeSen2crop (d) Reunion
Fig. 10. The evolution of clustering metrics over epochs for four SITS datasets, demonstrating the algorithm’s ability to refine clustering performance
iteratively.
to dense woodland and agricultural domains. The Imperial
dataset poses a greater challenge, with traditional K-means
struggling due to the irregular distribution of categories and
the sheer scale of sample sizes. In stark contrast, the DTJC
method flourishes, demonstrating its prowess by generating
a cluster distribution that resonates more authentically with
the real-world scenario. The learning-based approach of DTJC
evidently transcends the conventional methods, especially in
handling large datasets with complex spatial distributions.
Conclusively, the DTJC not only approximates the reality
of land cover more closely but also illustrates an enhanced
capacity for resolving finer details. This robust performance
underscores the potential of DTJC in providing a nuanced un-
derstanding of varied ecosystems through SITS data analysis.
VI. DISCUSSION
The discussion in this section delves into the ramifications
of varying the number of clustering centers (k), the dimensions
of the latent deep feature space, and the regularization param-
eter (γ) on the efficacy of clustering outcomes. This analysis is
crucial for understanding how these parameters influence the
performance of the DTJC method across different datasets.
A. The Number k of Cluster Centers
A pivotal aspect of our experimental investigation is de-
termining the optimal number of clustering centers for the
Cerrado dataset. We discern that utilizing nine clustering cen-
ters yields results that eclipse the performance achieved with
the actual category count of twelve. To elucidate the influence
of k, we conduct a comparative analysis with conventional
K-means, PCA+Kmeans, and our DTJC, each subjected to
Fig. 11. Visualizations of cluster assignments for the Reunion, Imperial, and Cerrado datasets, as discerned by Kmeans, PCA+Kmeans, and DTJC methods,
showcasing the detailed accuracy in land cover classification.
varied cluster center counts. Each algorithm is run twenty
times, and we record the best outcome. As shown in Table IV,
selecting k in the range of nine to eleven clearly outperforms
the actual category count, a phenomenon potentially
attributable to the infrequent occurrence of certain
categories or their resemblance to other classes.
Within the DTJC paradigm, choosing nine clustering centers
yields an improvement of approximately 10% over the other
cluster counts. This increment is particularly noteworthy,
given that no further adjustments are made to the architecture
or the hyper-parameters, emphasizing the inherent robustness
of the deep joint optimization strategy. Under these fixed
conditions, our method consistently surpasses the other
clustering algorithms, even with twelve centers. This
underscores the importance of choosing an appropriate k for
the unsupervised deep clustering of SITS data, a task that
remains as challenging as it is essential in real-world
applications.
The findings presented here not only validate the effective-
ness of DTJC but also open avenues for future exploration
into optimizing cluster center determination, an endeavor that
could further refine the precision of SITS data clustering.
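The comparison over k can be sketched as a simple sweep, shown
here for plain K-means on synthetic data. This is an
illustration under stated assumptions rather than the exact
protocol: when k differs from the number of classes, this sketch
matches each class to at most one cluster with the Hungarian
algorithm, so points in unmatched clusters count as errors. The
helper names `matched_accuracy` and `sweep_k`, the toy data, and
the run count are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def matched_accuracy(y_true, y_pred):
    """Accuracy after a one-to-one cluster/class matching (rectangular allowed)."""
    classes, true_idx = np.unique(y_true, return_inverse=True)
    clusters, pred_idx = np.unique(y_pred, return_inverse=True)
    counts = np.zeros((clusters.size, classes.size), dtype=np.int64)
    np.add.at(counts, (pred_idx, true_idx), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)


def sweep_k(features, y_true, k_values, n_runs=20):
    """For each k, run K-means n_runs times and keep the best accuracy."""
    best = {}
    for k in k_values:
        scores = []
        for seed in range(n_runs):
            pred = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(features)
            scores.append(matched_accuracy(y_true, pred))
        best[k] = max(scores)
    return best


# Toy data with six reference classes (not one of the SITS datasets).
features, y_true = make_blobs(n_samples=600, centers=6, n_features=8, random_state=0)
best_acc = sweep_k(features, y_true, k_values=[4, 5, 6, 7, 8], n_runs=5)
```

Reporting the maximum over runs follows the "best of twenty"
convention described above; reporting the mean and spread as well
would expose the sensitivity to initialization.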
TABLE IV
CLUSTERING ACCURACY FOR VARIOUS CHOICES OF k ON THE CERRADO DATASET
Method        k=9    k=10   k=11   k=12   k=13
K-means       0.728  0.717  0.720  0.670  0.649
PCA+Kmeans    0.692  0.683  0.633  0.625  0.594
Our method    0.832  0.762  0.751  0.730  0.703
B. The Size of Latent Features
In the domain of computer science, dimensionality reduction
is traditionally regarded as a strategy to enhance clustering
outcomes. However, this assumption does not invariably hold,
especially with publicly available SITS datasets. For instance,
observations with the Cerrado dataset indicate that employing
PCA for dimensionality reduction actually diminishes cluster-
ing accuracy. This reduction process often results in the loss
of crucial clustering information in the original feature space,
which might otherwise be preserved without reduction. To
identify the optimal feature representation, we conduct
a series of experiments to determine the most suitable latent
feature dimensions for the dataset. As illustrated in Figure
12, these experiments highlight how clustering performance
varies across different embedding sizes (H), encompassing
ACC, NMI, and the ARI.
The data depict a notable enhancement in all three metrics
as the embedding size increases from 10 to 200. Specifically,
ACC escalates from approximately 0.55 to about 0.85, NMI
rises from around 0.4 to roughly 0.75, and the ARI advances
from about 0.35 to nearly 0.6. These figures indicate that an
embedding size of 200 provides optimal clustering outcomes
for this dataset. Beyond this dimension, although accuracy
remains relatively stable, both NMI and ARI exhibit slight
declines, suggesting that an excessively large embedding size
might introduce redundant data, thereby disrupting the clus-
tering process. These findings underscore the importance of
judiciously selecting an embedding size in clustering analysis.
Too small a size can lead to the loss of critical informa-
tion during the decoding phase, severely affecting clustering
outcomes; conversely, an overly large size might introduce
irrelevant data, increasing the complexity of the clustering
process. Thus, choosing an embedding size that balances these
factors is crucial for effective clustering.
Moreover, Fig. 12 not only demonstrates the impact of
embedding size on the efficacy of clustering but also provides
critical insights for future optimizations of SITS data cluster-
ing algorithms. These insights guide ongoing efforts to fine-
tune unsupervised learning algorithms, aiming to minimize
irrelevant data interference while preserving essential informa-
tion, thereby enhancing the overall performance of clustering
algorithms.
Fig. 12. Evaluation of clustering metrics against different latent feature
dimensions on the Cerrado dataset, delineating the optimal embedding size
for maximal clustering accuracy.
C. The value of γ
In the DTJC algorithm, the regularization parameter γ plays
a pivotal role as it fine-tunes the trade-off between the data's
original feature space and the clustering-amenable latent space.
An excessively high γ value might lead to a distortion of
the original features, whereas an exceedingly low γ fails to
foster a latent space conducive to K-means clustering. This
delicate balance is critical, as evidenced by experimentation
with varying values of γ: 0.01, 0.1, 1, and 10.
The results depicted in Fig. 13 illuminate that a suitably
chosen γ can bolster clustering accuracy by an estimated 10% in
the Timesen2crop dataset compared to other evaluated values.
Notably, the 'optimal' γ differs across datasets, implying that
there is no one-size-fits-all value, and each dataset may require
a tailored approach to parameter tuning. For instance, the
Cerrado and Imperial datasets exhibit peak accuracies at a γ of
1, while the Reunion and Timesen2crop datasets demonstrate
optimal performances at γ values of 0.1 and 10, respectively.
Fig. 13. The impact of varying the regularization parameter γ on clustering
accuracy across four SITS datasets, highlighting the necessity of fine-tuning
γ to match dataset-specific characteristics.
The selection of γ is a process of meticulous experimentation,
one that is indispensable for tailoring the DTJC algorithm
to the unique demands of various datasets. As Fig. 13 conveys,
the right choice of γ can lead to significant improvements in
clustering accuracy, underscoring the importance of this
hyper-parameter in the realm of unsupervised learning for SITS
data.
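To make the role of γ concrete, the sketch below evaluates a joint
objective of the assumed form L = L_rec + γ L_clu, with the Mean
Squared Error as L_rec and a Kullback-Leibler divergence as L_clu.
The Student's t soft assignment and the sharpened target
distribution used here follow the common deep embedded clustering
formulation; the text only states that the clustering layer
follows [38], so these forms, the placement of γ, and all function
names are assumptions.

```python
import numpy as np


def soft_assign(z, centers, alpha=1.0):
    """Student's t similarity between embeddings z (N, H) and centers (K, H)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets that emphasize confident assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)


def joint_loss(x, x_rec, q, p, gamma):
    """Assumed objective: reconstruction MSE plus gamma times KL(p || q)."""
    rec = np.mean((x - x_rec) ** 2)
    kl = np.sum(p * np.log(p / q)) / q.shape[0]
    return rec + gamma * kl, rec, kl


# Toy values standing in for inputs, reconstructions, embeddings, and centers.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 48))
x_rec = x + 0.1 * rng.normal(size=x.shape)
z = rng.normal(size=(32, 10))
centers = rng.normal(size=(4, 10))

q = soft_assign(z, centers)
p = target_distribution(q)
total, rec, kl = joint_loss(x, x_rec, q, p, gamma=0.1)
```

Under this assumed form, a larger γ lets the clustering term
dominate the gradient and can distort the reconstruction, while a
smaller γ leaves the latent space close to that of a plain
autoencoder, which matches the trade-off described above.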
D. Efficiency Analysis of the DTJC Method
The DTJC method integrates feature extraction with a joint
optimization process, which inherently affects time efficiency.
The integration of feature extraction and clustering optimiza-
tion introduces additional computational complexity due to the
iterative nature of optimizing both objectives simultaneously.
The efficiency of DTJC is also influenced by the structure
and parameters of the feature extraction network, such as the
number of layers, filter sizes, and activation functions.
Compared to traditional clustering methods, DTJC presents
unique efficiency characteristics. Unlike the direct application
of Kmeans clustering, DTJC uses deep learning to refine
clustering results through a more complex iterative process.
This added complexity, involving multiple iterations, results in
higher computational requirements than Kmeans alone. How-
ever, the use of CAE for dimensionality reduction alleviates
some of the computational burden, as it reduces the data to a
more compact form that is conducive to efficient clustering.
Despite the increased computational effort compared to K-
means, DTJC achieves superior clustering performance, par-
ticularly for large datasets, due to its ability to learn more
meaningful representations of the data.
The efficiency comparison provided in Figure 14 illustrates
that while DTJC takes slightly more time than PCA+Kmeans,
it is still significantly faster than Kmeans-DTW, particularly
on the Imperial and Reunion datasets. In contrast to meth-
ods such as DTW, DTJC offers substantial improvements in
Fig. 14. Execution time comparison between different methods across two datasets: Imperial and Reunion.
both computational efficiency and clustering accuracy. DTW,
known for its high computational burden, involves extensive
pairwise distance calculations, resulting in quadratic time
complexity that becomes impractical for large datasets. By
leveraging CAE to reduce dimensionality and deep learning for
iterative optimization, DTJC strikes a better balance between
complexity and scalability, making it well-suited for clustering
large-scale SITS data.
In summary, while DTJC introduces additional complexity
compared to traditional K-means clustering, the dimensionality
reduction via CAE and the iterative optimization process
contribute to its superior clustering performance, particularly
for larger datasets. The advantages of DTJC are further em-
phasized when compared to DTW, as DTJC is able to provide
both better scalability and reduced computational demands in
large-scale scenarios.
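The quadratic cost attributed to DTW can be seen in the textbook
dynamic-programming formulation sketched below, which fills an
(n+1) x (m+1) table for every pair of series. It is a minimal
univariate version with an absolute-difference local cost, given
only as an illustration; it is not the Kmeans-DTW implementation
timed in the experiments, and the function name `dtw_distance` is
hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic DTW between two 1D sequences; O(len(a) * len(b)) time and memory."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # warping absorbs the repeat
offset = dtw_distance([0, 0], [1, 1])
```

In K-means with DTW, this table is recomputed for every
sample-to-centroid comparison in every iteration, which explains
why the running time grows quickly with the series length and the
number of pixels.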
E. Prospects of DTJC
The DTJC framework demonstrates significant improve-
ments in certain SITS data analysis tasks. However, sev-
eral challenges remain that require further exploration and
refinement. One key issue is the need to predefine hyper-
parameters such as the number of clusters (k), regularization
parameters (γ), and the dimensions of the encoding space. The
performance of DTJC is highly sensitive to these parameters,
with considerable variation across different datasets. This
underscores the importance of developing methods to automate
the selection of optimal cluster numbers and streamline hyper-
parameter tuning.
Another challenge lies in how the clustering variables are
integrated during training. Unlike supervised learning,
where outcomes are more predictable, the results of DTJC can
be influenced by random initialization, leading to
inconsistent performance. There is therefore a clear need for
optimization techniques that reduce the impact of these
stochastic factors.
Additionally, integrating spatial information into the clus-
tering process presents unique challenges [17]. Incorporating
spatial features can increase computational complexity, partic-
ularly when processing high-resolution satellite images. This
not only lengthens training time but also demands greater
computational resources. Moreover, spatial features often cor-
relate with geographic locations, while temporal features cap-
ture changes over time. Striking the right balance between
preserving spatial locality and capturing temporal dynamics is
essential to avoid one set of features overshadowing the other.
Furthermore, spatial information may introduce additional
noise, especially in areas with complex topography, which
requires more sophisticated preprocessing to mitigate [53].
To effectively guide future research, several key questions
and hypotheses should be addressed:
• How can spatial and temporal features be effectively
merged to optimize clustering performance? This requires
a careful balance to ensure that neither feature set
dominates the other.
• What is the contribution of spatial features in different
geographic regions and datasets to clustering accuracy? We
hypothesize that spatial features play a more significant
role in regions with higher topographical complexity.
Potential methodologies to address these questions include
exploring multi-scale convolutional neural networks to ex-
tract features at various spatial scales and combining them
with temporal features for joint clustering [54], [55]. Graph
convolutional networks could also be leveraged to model
spatial relationships, while temporal convolutional networks
would capture time-series dynamics. Additionally, attention
mechanisms may offer a promising solution by adaptively
balancing the importance of spatial and temporal features
during clustering. Future experiments should validate these
approaches across diverse datasets with varying land cover
types to assess the effectiveness of spatial-temporal integration
[56].
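To make the attention idea concrete, the sketch below shows one possible form of adaptive weighting: each sample receives a softmax weight for its temporal embedding and its spatial embedding, and the fused feature is their weighted sum. This is an illustrative forward pass only, not a component of DTJC. The function `attention_fuse`, the scoring vector `w` (which would be learned in a trained model), and the random placeholder embeddings are all assumptions introduced here.

```python
import numpy as np


def attention_fuse(temporal, spatial, w):
    """Fuse two (n, d) feature sets with per-sample softmax weights.

    w is a (d,) scoring vector; in a trained model it would be learned.
    Returns the fused (n, d) features and the (n, 2) attention weights.
    """
    branches = np.stack([temporal, spatial], axis=1)      # (n, 2, d)
    scores = branches @ w                                  # (n, 2)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)              # each row sums to 1
    fused = (alpha[:, :, None] * branches).sum(axis=1)     # (n, d)
    return fused, alpha


rng = np.random.default_rng(0)
temporal = rng.standard_normal((5, 8))  # placeholder temporal embeddings
spatial = rng.standard_normal((5, 8))   # placeholder spatial embeddings
w = rng.standard_normal(8)              # placeholder for a learned vector

fused, alpha = attention_fuse(temporal, spatial, w)
print(fused.shape, alpha.sum(axis=1))
```

Because the weights are normalized per sample, neither branch can dominate globally; with a zero scoring vector the fusion reduces to a plain average of the two embeddings.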
Collectively, by addressing these challenges and expanding
its capabilities, DTJC has the potential to evolve into a
more comprehensive framework. Integrating spatial-temporal
features could unlock its full potential, transforming the way
unsupervised clustering is applied to remote sensing data.
VII. CONCLUSION
This study successfully develops and validates a novel un-
supervised clustering algorithm, DTJC, specifically designed
for land cover analysis in SITS data. DTJC integrates tem-
poral information reconstruction with clustering objectives,
significantly enhancing clustering accuracy while preserving
the intrinsic properties of the data.
Extensive experiments across four distinct SITS datasets
demonstrate that DTJC consistently outperforms traditional
K-means and other deep clustering techniques. The method
proves adaptable and stable across diverse geographic and
environmental conditions. DTJC not only advances unsuper-
vised learning in remote sensing image analysis but also opens
new directions for integrating spatial information into future
clustering algorithms. With the continuous advancements in
remote sensing technology and the growing availability of
data, DTJC and its enhanced versions are well-positioned to
play a crucial role in automated land cover detection and
environmental monitoring.
REFERENCES
[1] M. Drusch, U. Del Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon,
B. Hoersch, C. Isola, P. Laberinti, P. Martimort, A. Meygret, F. Spoto,
O. Sy, F. Marchese, and P. Bargellini, “Sentinel-2: ESA’s optical
high-resolution mission for GMES operational services,” Remote Sensing
of Environment, vol. 120, no. SI, pp. 25–36, May 15, 2012.
[2] P. Wang, H. Li, B. Chen, and S. Zhang, “Enhancing earth observation
throughput using inter-satellite communication,” IEEE Transactions
on Wireless Communications, vol. 21, no. 10, pp. 7990–8006,
Oct. 2022.
[3] B. Tapley, S. Bettadpur, J. Ries, P. Thompson, and M. Watkins, “GRACE
measurements of mass variability in the Earth system,” Science, vol.
305, no. 5683, pp. 503–505, Jul. 23, 2004.
[4] J. Inglada, A. Vincent, M. Arias, B. Tardy, D. Morin, and I. Rodes,
“Operational high resolution land cover map production at the country
scale using satellite image time series,” Remote Sensing, vol. 9, no. 1,
p. 95, 2017.
[5] J. Li, C. Li, and F. Wang, “Automatic scheduling for earth observation
satellite with temporal specifications,” IEEE Transactions on
Aerospace and Electronic Systems, vol. 56, no. 4, pp. 3162–3169,
Aug. 2020.
[6] Y. Xiao, X. Su, Q. Yuan, D. Liu, H. Shen, and L. Zhang, “Satellite
video super-resolution via multiscale deformable convolution alignment
and temporal grouping projection,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 60, 2022.
[7] T. M. Lenton, J. F. Abrams, A. Bartsch, S. Bathiany, C. A. Boulton,
J. E. Buxton, A. Conversi, A. M. Cunliffe, S. Hebden, T. Lavergne,
B. Poulter, A. Shepherd, T. Smith, D. Swingedouw, R. Winkelmann,
and N. Boers, “Remotely sensing potential climate change tipping points
across scales,” Nature Communications, vol. 15, no. 1, Jan. 6,
2024.
[8] Q. Liu, J. Yue, Y. Kuang, W. Xie, and L. Fang, “SemiRS-COC:
Semi-supervised classification for complex remote sensing scenes with
cross-object consistency,” IEEE Transactions on Image Processing,
vol. 33, pp. 3855–3870, 2024.
[9] Z. Li, H. Gurgel, N. Dessay, L. Hu, L. Xu, and P. Gong,
“Semi-supervised text classification framework: An overview of dengue
landscape factors and satellite earth observation,” International
Journal of Environmental Research and Public Health,
vol. 17, no. 12, Jun. 2020.
[10] Y. Yuan and L. Lin, “Self-supervised pretraining of transformers for
satellite image time series classification,” IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp.
474–487, 2020.
[11] A. Julea, N. Méger, P. Bolon, C. Rigotti, M.-P. Doin, C. Lasserre,
E. Trouvé, and V. N. Lăzărescu, “Unsupervised spatiotemporal mining
of satellite image time series using grouped frequent sequential patterns,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 4,
pp. 1417–1430, 2010.
[12] L. A. Santos, K. R. Ferreira, G. Camara, M. C. A. Picoli, and R. E.
Simoes, “Quality control and class noise reduction of satellite image
time series,” ISPRS Journal of Photogrammetry and Remote Sensing,
vol. 177, pp. 75–88, Aug. 2021.
[13] R. Gonçalves, J. Zullo, B. Amaral, P. P. Coltri, E. P. M. d. Sousa,
and L. A. S. Romani, “Land use temporal analysis through clustering
techniques on satellite image time series,” in 2014 IEEE Geoscience and
Remote Sensing Symposium. IEEE, 2014, pp. 2173–2176.
[14] S. Wang, G. Azzari, and D. B. Lobell, “Crop type mapping without
field-level labels: Random forest transfer and unsupervised clustering
techniques,” Remote Sensing of Environment, vol. 222, pp.
303–317, Mar. 1, 2019.
[15] Z. Zhang, P. Tang, W. Zhang, and L. Tang, “Satellite image time series
clustering via time adaptive optimal transport,” Remote Sensing,
vol. 13, no. 19, Oct. 2021.
[16] D. Ienco and R. Interdonato, “Deep semi-supervised clustering for multi-
variate time-series,” Neurocomputing, vol. 516, pp. 36–47, 2023.
[17] L. Khiali, M. Ndiath, S. Alleaume, D. Ienco, K. Ose, and M. Teisseire,
“Detection of spatio-temporal evolutions on multi-annual satellite image
time series: A clustering based approach,” International Journal
of Applied Earth Observation and Geoinformation,
vol. 74, pp. 103–119, Feb. 2019.
[18] Y. Zhou, S. Wang, T. Wu, L. Feng, W. Wu, J. Luo, X. Zhang, and N. Yan,
“For-backward LSTM-based missing data reconstruction for time-series
Landsat images,” GIScience & Remote Sensing, vol. 59, no. 1,
pp. 410–430, Dec. 31, 2022.
[19] T. Di Martino, R. Guinvarc’h, L. Thirion-Lefevre, and E. C. Koeniguer,
“Beets or cotton? Blind extraction of fine agricultural classes using a
convolutional autoencoder applied to temporal SAR signatures,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18,
2021.
[20] W. Guo, W. Zhang, Z. Zhang, P. Tang, and S. Gao, “Deep temporal
iterative clustering for satellite image time series land cover analysis,”
Remote Sensing, vol. 14, no. 15, p. 3635, 2022.
[21] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards
k-means-friendly spaces: Simultaneous deep learning and clustering,” in
International Conference on Machine Learning. PMLR, 2017,
pp. 3861–3870.
[22] X. Guo, X. Liu, E. Zhu, and J. Yin, “Deep clustering with convolutional
autoencoders,” in Neural Information Processing: 24th International
Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017,
Proceedings, Part II 24. Springer, 2017, pp. 373–382.
[23] R. d. S. da Silva Adeu, K. R. Ferreira, P. R. Andrade, and L. Santos,
“Assessing satellite image time series clustering using growing SOM,”
in Computational Science and Its Applications - ICCSA 2020, Part V,
ser. Lecture Notes in Computer Science, O. Gervasi,
B. Murgante, S. Misra, C. Garau, I. Blecic, D. Taniar, B. Apduhan,
A. Rocha, E. Tarantino, C. Torre, and Y. Karaca, Eds., vol. 12253.
Springer International Publishing, 2020, pp. 270–282.
[24] Z. Zhang, P. Tang, and Z. Zhou, “Satellite image time series clustering
under collaborative principal component analysis,” in Land Surface
Remote Sensing II, ser. Proceedings of SPIE, T. Jackson, J. Chen,
P. Gong, and S. Liang, Eds., vol. 9260. SPIE, 2014.
[25] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on
Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[26] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an
overview, II,” Wiley Interdisciplinary Reviews: Data Mining
and Knowledge Discovery, vol. 7, no. 6, Nov.–Dec. 2017.
[27] H. A. Dau, D. F. Silva, F. Petitjean, G. Forestier, A. Bagnall, A. Mueen,
and E. Keogh, “Optimizing dynamic time warping’s window width
for time series data mining applications,” Data Mining and Knowledge
Discovery, vol. 32, pp. 1074–1120, 2018.
[28] C. Labrín and F. Urdinez, “Principal component analysis,” in R for
Political Data Science. Chapman and Hall/CRC, 2020, pp. 375–393.
[29] J. Cohen-Waeber, R. Burgmann, E. Chaussard, C. Giannico, and
A. Ferretti, “Spatiotemporal patterns of precipitation-modulated landslide
deformation from independent component analysis of InSAR time series,”
Geophysical Research Letters, vol. 45, no. 4, pp. 1878–1887,
Feb. 28, 2018.
[30] M. A. Pena, R. Liao, and A. Brenning, “Using spectrotemporal indices
to improve the fruit-tree crop classification accuracy,” ISPRS Journal
of Photogrammetry and Remote Sensing, vol. 128, pp. 158–169,
Jun. 2017.
[31] G. Weikmann, C. Paris, and L. Bruzzone, “TimeSen2Crop: A million
labeled samples dataset of Sentinel 2 image time series for crop-type
classification,” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, vol. 14, pp. 4699–4708, 2021.
[32] L. A. Santos, K. R. Ferreira, G. Camara, M. C. Picoli, and R. E. Simoes,
“Quality control and class noise reduction of satellite image time series,”
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 177, pp.
75–88, 2021.
[33] J. F. Ribeiro and B. M. T. Walter, “As principais fitofisionomias do
bioma cerrado,” Cerrado: ecologia e flora, vol. 1, pp. 151–212, 2008.
[34] N. N. Navnath, K. Chandrasekaran, A. Stateczny, V. M. Sundaram, and
P. Panneer, “Spatiotemporal assessment of satellite image time series for
land cover classification using deep learning techniques: A case study of
Reunion Island, France,” Remote Sensing, vol. 14, no. 20, p. 5232, 2022.
[35] Q. Ma, J. Zheng, S. Li, and G. W. Cottrell, “Learning representations
for time series clustering,” Advances in Neural Information Processing
Systems, vol. 32, 2019.
[36] C.-Y. Liou, W.-C. Cheng, J.-W. Liou, and D.-R. Liou, “Autoencoder for
words,” Neurocomputing, vol. 139, pp. 84–96, 2014.
[37] C. Pelletier, G. I. Webb, and F. Petitjean, “Temporal convolutional neural
network for the classification of satellite image time series,” Remote
Sensing, vol. 11, no. 5, p. 523, 2019.
[38] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for
clustering analysis,” in International Conference on Machine Learning.
PMLR, 2016, pp. 478–487.
[39] Z. Yang, K.-T. Fang, and S. Kotz, “On the Student’s t-distribution and the
t-statistic,” Journal of Multivariate Analysis, vol. 98, no. 6,
pp. 1293–1304, 2007.
[40] B. Rasti, Y. Chang, E. Dalsasso, L. Denis, and P. Ghamisi, “Image
restoration for remote sensing: Overview and toolbox,” IEEE Geoscience
and Remote Sensing Magazine, vol. 10, no. 2, pp. 201–230, 2021.
[41] V. R. Patel and R. G. Mehta, “Impact of outlier removal and normaliza-
tion approach in modified k-means clustering algorithm,” International
Journal of Computer Science Issues (IJCSI), vol. 8, no. 5, p. 331, 2011.
[42] J. Kohne, L. Henning, and C. Guhmann, “Autoencoder-based iterative
modeling and multivariate time-series subsequence clustering
algorithm,” IEEE Access, vol. 11, pp. 18868–18886, 2023.
[43] W. Yang, X. Li, C. Chen, and J. Hong, “Characterizing residential load
patterns on multi-time scales utilizing LSTM autoencoder and electricity
consumption data,” Sustainable Cities and Society, vol. 84, Sep.
2022.
[44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al.,
“Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[46] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A
new look at signal fidelity measures,” IEEE Signal Processing Magazine,
vol. 26, no. 1, pp. 98–117, 2009.
[47] T. Van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler
divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7,
pp. 3797–3820, 2014.
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An
imperative style, high-performance deep learning library,” Advances in
Neural Information Processing Systems, vol. 32, 2019.
[49] P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, “Normalized
mutual information feature selection,” IEEE Transactions on Neural
Networks, vol. 20, no. 2, pp. 189–201, 2009.
[50] D. Steinley, “Properties of the Hubert-Arabie adjusted Rand index,”
Psychological Methods, vol. 9, no. 3, p. 386, 2004.
[51] A. Alqahtani, M. Ali, X. Xie, and M. W. Jones, “Deep time-series
clustering: A review,” Electronics, vol. 10, no. 23, p. 3001, 2021.
[52] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal
of Machine Learning Research, vol. 9, no. 11, 2008.
[53] C. Echegoyen, A. Perez, G. Santafe, U. Perez-Goya, and M. D. Ugarte,
“Large-scale unsupervised spatio-temporal semantic analysis of vast
regions from satellite images sequences,” Statistics and Computing,
vol. 34, no. 2, Apr. 2024.
[54] S. Marino, “Understanding the spatio-temporal behavior of crop yield,
yield components and weed pressure using time series Sentinel-2-data in
an organic farming system,” European Journal of Agronomy,
vol. 145, Apr. 2023.
[55] J. C. White, T. Hermosilla, M. A. Wulder, and N. C. Coops, “Mapping,
validating, and interpreting spatio-temporal trends in post-disturbance
forest recovery,” Remote Sensing of Environment, vol. 271,
Mar. 15, 2022.
[56] M. Das, S. K. Ghosh, and S. Bandyopadhyay, “A multilayered adaptive
recurrent incremental network model for heterogeneity-aware prediction
of derived remote sensing image time series,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 60, 2022.
This article has been accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSTARS.2024.3502247
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/