Neural Processing Letters
https://doi.org/10.1007/s11063-022-10799-5
Deep Temporal Conv-LSTM for Activity Recognition
Mohd Halim Mohd Noor1 · Sen Yan Tan1 · Mohd Nadhir Ab Wahab1
1 School of Computer Sciences, Universiti Sains Malaysia, 11800 Pulau Pinang, Malaysia
Corresponding author: Mohd Halim Mohd Noor (halimnoor@usm.my)
Accepted: 7 March 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Human activity recognition has gained interest from the research community due to advancements in sensor technology and improved machine learning algorithms. Wearable sensors have become more ubiquitous, and most wearable sensor data contain rich temporal structural information that describes the distinct underlying patterns and relationships of various activity types. The nature of those activities is typically sequential, with each activity window following from the preceding one. However, state-of-the-art methods usually model only the temporal characteristics of the sensor data and ignore the relationship between sliding windows. This research proposes a novel deep temporal Conv-LSTM architecture that enhances activity recognition performance by exploiting both the temporal characteristics of the sensor data and the relationship between sliding windows. The proposed architecture is evaluated on a dataset containing transition activities, the Smartphone-Based Recognition of Human Activities and Postural Transitions dataset. The proposed hybrid architecture with parallel feature learning pipelines demonstrates the ability to model the temporal relationship of the activity windows, capturing the transitions between activities accurately. In addition, the effect of the sliding window size is studied, and the results show that the choice of window size affects recognition accuracy. The proposed deep temporal Conv-LSTM architecture achieves an accuracy of 0.916, outperforming the state-of-the-art accuracy.
Keywords Activity recognition · Deep learning · LSTM · Temporal model
1 Introduction
The rapid development of machine learning techniques and ubiquitous computing has spurred interest from academia over the past few decades in analyzing and interpreting sensor data to extract knowledge from omnipresent sensors. A growing research community is interested in human activity recognition (hereafter referred to as activity recognition) because of its usefulness in health monitoring, medical assistance, entertainment, and personal health tracking services. For instance, real-time feedback from an activity detection system allows healthcare professionals to quickly monitor patients who require close
monitoring, especially those with body motion-associated diseases. Most research seeks to improve the accuracy, efficiency, and execution time of the recognition algorithms by applying pattern recognition to the raw sensor data to extract information relevant to the user's current activity.
Activity recognition is one of several disciplines that utilize machine learning approaches to detect hidden patterns in sensor data in order to classify human activities. Thanks to advancements in computing hardware and the processing power accessible today, these approaches have steadily improved. At the same time, improved learning algorithms have paved the way for further advances in activity recognition research. Traditional machine learning approaches, such as the Support Vector Machine (SVM) and the Hidden Markov Model (HMM), have been widely utilized in activity recognition studies. The underlying characteristics of the dataset must be manually extracted using various feature extraction methods such as principal component and linear discriminant analysis [1], wavelet transform [2], homomorphic analysis [3] and local binary patterns [4], and then fed into the machine learning algorithms so that they can learn the patterns in the data. The disadvantage of the traditional machine learning approach is that researchers must possess extensive domain knowledge, which implies a thorough grasp of the behavior and characteristics of the time-series data for effective feature extraction. Moreover, the feature extraction process remains susceptible to human error.
Today, learning algorithms have progressed from manual feature extraction to fully automatic feature learning through deep learning methods. Deep learning can extract features directly from data without human intervention, and in recent years many researchers have demonstrated that deep learning methods perform well for activity recognition [5]. One of the most important aspects of a successful deep learning model is its network architecture. As deep learning methods such as the convolutional neural network (CNN) and the recurrent neural network (RNN) have become more sophisticated and refined for activity recognition, several researchers have advocated leveraging both methods. CNNs are better at recognizing long-term repetitive activities, while RNN-based networks such as the long short-term memory (LSTM) network are better at recognizing short, naturally ordered activities [6]. Combining these two mature deep learning methods allows the strengths of both to be leveraged to improve activity recognition performance.
In activity recognition, the activity signal is typically divided into segments, known as windows, of equal size for subsequent feature extraction and classification. Typically, the window size is set based on hardware limitations and experience. A small window size slices the activity signal into many separate segments, so each segment lacks sufficient information for activity recognition. On the other hand, a large segment could contain multiple activity signals, confusing the classification model. In both cases, the segments do not carry the optimal information about the activity signal, which leads to misclassification. Another important property of the window segmentation is that the windows are inherently sequential due to the nature of human activities, whereby an activity window can only be followed by a particular set of activity windows. For example, a window classified as standing is followed by either another standing window or a walking window. However, the sequence of activity windows is often ignored in the development of classification models. The developed classification models consider only the current window, the segment to be classified, and do not leverage the fact that the sequence of activity windows is inherently sequential. Therefore, this work aims to develop a hybrid deep learning model that combines the strengths of CNN and RNN to extract salient feature representations and capture the temporal information in the activity data.
Unlike previous hybrid models where the input is a single-window segmentation, the pro-
posed model accepts a sequence of activity windows to model the dependencies between the
windows. Each activity window is processed by a separate stream that extracts the local win-
dow features. The window features are then concatenated to become a sequence of window
features. Then, the dependencies of the features are modeled to capture a better representation
of the data to improve the model generalization.
The remainder of this paper is organized as follows. Section 2 reviews the related works. In Sect. 3, we present the proposed methodology that consists of data collection and pre-processing, the proposed hybrid model and the implementation details. Section 4 presents the experimental results and their discussion. Finally, the conclusions are presented in Sect. 5.
2 Related Works
Deep learning methods have been widely implemented to overcome the limitations of traditional machine learning. Deep learning can extract features automatically, which reduces human effort. Numerous deep learning models have been proposed for activity recognition, including CNN models, RNN models and hybrid CNN-LSTM models.
2.1 CNN Models
In [7], a CNN model is designed to take in raw three-dimensional (3D) accelerometer data directly without any complex pre-treatment. Before being fed to the first convolution layer, the input is segmented with the sliding window method and the accelerometer data is normalized. The normalized data is then fed into 1D convolution and max-pooling layers. The authors validated the model on the benchmark WISDM dataset, and the experimental results indicate that the proposed model can achieve high accuracy while maintaining low computation costs. A multiple-channel CNN was presented as a solution to the problem of activity recognition in the context of exercise programmes [8]. A self-collected dataset comprising 16 activities from the Otago exercise programme is used in this experiment. Multiple sensors are placed across body parts to capture the raw inertial data for the various activities, and each sensor's data is fed into a separate CNN channel. The results from the individual sensors are compared to determine the best sensor placement for lower-limb activity detection. The authors also conclude that combining multiple sensors produces better results than a single sensor source.
A Deep Human Activity Recognition model is proposed, which converts the motion sensor
data into a spectral image sequence before feeding these images into two independently
trained CNN models [9]. Each CNN model takes in the image sequences that are generated
from the accelerometer and gyroscope. The outputs of the trained CNNs are then fused
to predict the final class of human activity. In this experiment, the public dataset Real-
world Human Activity Recognition (RWHAR) is used. This dataset contains eight activities
which are climbing stairs down and up, lying, standing, sitting, running/jogging, jumping
and walking. The proposed model achieves an overall F-score of 0.78 across both static and dynamic activities and 0.87 for dynamic activities alone. The authors also claim that this model is capable of handling image input directly. The model's generalization is encouraging; however, its recognition accuracy is not comparable with other benchmark deep learning models. In [10], three strategies are proposed to exploit the temporal information of a sequence
of windows. The first strategy is to compute the average of the windows which will be used
as input to the CNN model. In the second strategy, the sequence of windows is fed to concurrent CNNs, and the activity class is determined based on the average scores. The final strategy is similar to the second; however, the learned features are combined using a global average pooling layer to produce the final prediction.
Instead of a single classifier of CNN, an ensemble of CNN has been proposed to improve
the accuracy of activity recognition. Zhu et al. [11] proposed a human activity recogni-
tion framework based on CNN using a fusion of various smartphone-based sensors such as
accelerometer, gyroscope, and magnetometer. The proposed framework is an ensemble of
two different CNN models whereby the first CNN is trained to predict the activity classes
while the second CNN is trained to focus on the activity classes that have a high number of
misclassifications. The outputs of the individual CNN models are then combined using weighted voting to predict unknown activities. The experimental results indicate that this proposed model
can achieve up to 0.962 in terms of accuracy. Zehra et al. [12] also proposed an ensemble
model consisting of three different CNN models. The ensemble model averages the three
CNN models’ outputs to produce the final prediction. The authors evaluated the performance
of each CNN model before ensembling each CNN model for overall performance validation.
The experimental result indicates that the performance of the ensemble model is better than
the three CNN models. The ensemble model achieved an accuracy of 0.940. This experiment
shows that ensemble learning can boost the performance of weak learners and improve the overall model. In [13], a two-channel CNN model
is proposed for activity recognition. The proposed model leverages the frequency and power
characteristics extracted from sensor signals to improve recognition accuracy. The model was
validated on a public UCI-HAR dataset and demonstrated an accuracy of 0.953. The down-
side of this method is that it requires the extraction of specific features to improve activity
recognition from sensor data. The performance of the CNN model is enhanced by integrating an attention mechanism module to determine the relevance of the features [14]. To extract
the local features, the three acceleration channels are fed to three concurrent convolutional
layers with different filter sizes. Then the attention mechanism computes the contribution
of the features to select the relevant features. The model was validated on a public WISDM
dataset and demonstrated an accuracy of 0.964.
Based on the aforementioned studies, it can be observed that most implementations did
very well at classifying the activities. They could automatically extract the salient features,
which leads to good classification performance. However, the temporal information of the
sensor data is not leveraged for activity classification.
2.2 RNN-Based Models
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two popular RNN
variations. Several studies employ the RNN architectures to tackle the activity recognition
problem. Chen et al. [15] presented a feature extraction approach based on LSTMs for
activity recognition. In the study, the accelerometer data is segmented into a sequence of
windows of size N, and the three acceleration channels are individually processed. Thus,
three LSTMs are used to perform feature extraction on the windows. Following the LSTMs,
a concatenation operation is performed to produce a feature vector which will be fed to a
softmax classifier. WISDM dataset is used to validate the proposed model. The experimental
results show that the proposed model achieved an accuracy of 0.921. A similar work is
reported in [16], whereby two layers of LSTMs are proposed to perform feature extraction
on accelerometer and gyroscope data. The results show that the proposed model achieved
an average accuracy of 0.920. Furthermore, it has been shown that batch normalization can
attain the same accuracy with nearly four times fewer training epochs.
Other than employing the LSTM to process the sensor data directly, several models of
ensemble LSTM networks have been proposed to improve the accuracy of activity recognition. The performance of ensembles of deep LSTMs is verified and reported in [17]. The
authors built diverse base learners using LSTM and the predictions of the base learners
are combined via averaging to obtain a more robust and improved classification
performance. The authors also proposed a modified training procedure such as random sam-
pling with varying lengths of the sensor data, and sample-wise model evaluation is performed
during inference. The authors validated the proposed model on three different datasets: Oppor-
tunity, PAMAP2, and Skoda. The results show that the ensemble model achieved an accuracy
of 0.726, 0.854 and 0.924 for Opportunity, PAMAP2, and Skoda respectively. Also, the exper-
imental results indicate that the ensemble model performs better than a single classifier. Li
et al. [18] proposed an ensemble model using LSTMs to accept input segmentation with
different sizes to model the underlying temporal patterns at various degrees of granularity.
The predictions of the LSTMs are combined via element-wise multiplication to produce the
final prediction. The experimental results show that the proposed model achieved an average
accuracy of 0.961.
Mahmud et al. [19] proposed a multi-stage LSTM-based model to process multimodal
sensor data for activity recognition. This proposed model consists of three key components,
which are temporal feature extractor, temporal feature aggregator and global feature opti-
mizer. The temporal feature extractor comprises two layers of LSTM to extract temporal
features from each sensor data. The temporal feature aggregator aggregates the temporal
features, taking into account both the time-axis and the feature-axis to preserve the tempo-
ral relationship. The global feature optimizer consists of three layers of LSTM to extract
global features from the aggregated temporal features. The experimental results show that
incorporating multiple sensors into the proposed model outperformed the single sensor-based
model.
Although RNN networks have been shown to be capable of modeling the temporal characteristics of sensor data, they generally do not perform well in extracting local features from the data. Therefore, there is a need to combine CNN with LSTM to exploit the strengths of both deep learning methods.
2.3 Hybrid Models
In recent years, hybridization of CNN and RNN networks has been experimented with to
improve the performance of activity recognition. Various hybrid deep learning models have
been proposed in previous studies, but the focus of this study is hybrid models of CNN and RNN. Ordóñez and Roggen first proposed a novel DNN framework for activity recognition, consisting of four convolutional layers followed by two recurrent layers and a softmax layer as a classifier [20]. The convolutional layers are used as a feature extractor to produce
the feature representation of the sensor data. In contrast, the recurrent layers are used for
modeling the temporal dynamics of the feature maps. This proposed framework employs the
sliding window approach to segment the time series data. The proposed model is validated
on two popular public datasets, which are OPPORTUNITY and SKODA. The accuracy for
OPPORTUNITY and SKODA are 0.930 (for modes of locomotion with no null class) and
0.958 respectively.
Mekruksavanich and Jitpattanakul [21] proposed a similar hybrid CNN-LSTM model.
In the study, the authors added Bayesian Optimization to fine-tune each LSTM and CNN
network parameter. The author evaluated the proposed model using WISDM public data. The
result of the experiments indicates that the proposed model outperformed the other baseline,
achieving an average accuracy and F-score of 0.962 and 0.963 respectively. A similar CNN-
LSTM model for activity recognition is reported in [22]. However, the authors proposed
to wrap the convolutional and pooling layers with a Time Distributed wrapper to maintain
its temporal integrity for the LSTM layers. The input data is reshaped to 3D as required
by the Time Distributed wrapper. The proposed model achieved an accuracy of 0.921 and
0.991 for iSPL and UCI-HAR datasets respectively. Thus, it is concluded that the proposed
model outperformed other deep learning models that simply use the raw sensor data as input.
Another similar model is reported in [23]. The input dimension is first expanded to obtain
heterogeneous data and the data is then fed to the proposed model. The proposed model
achieved an accuracy of 0.9765 on the UCI-HAR dataset.
Wang et al. [24] proposed a similar CNN-LSTM architecture in which the focus is to
model the transition of activities across the window sequence. To achieve this, the authors proposed to treat the sensor data as an image-like 2D array. The image-like array is fed into a three-layer CNN network for automatic feature extraction to obtain the feature vector. The feature vector is then fed into LSTM layers to model the relationship between time and the action sequence. The proposed model is validated using the SBHAPT dataset, and the experimental results show that it achieved an accuracy of 0.959. The limitation of this proposed model is the prerequisite of treating the signal data
as image-like, which incurs an additional pre-processing step to convert the raw real-time
signal into image-like form before feeding into the proposed model.
Singh et al. [25] proposed a deep neural network architecture that consists of CNN, LSTM
and a self-attention mechanism. The CNN and LSTM layers extract spatio-temporal features
from multiple time-series data, and the self-attention layer is utilized for training on the
most significant time point. The proposed model is validated with different data sampling
strategies on six public datasets, which are mobile health (MHEALTH), USC human activity
dataset (USC-HAD), Wireless Sensor Data Mining (WISDM), UTD Multimodal Human
Action Dataset (UTD-MHAD2), Wearable Human Activity Recognition Folder (WHARF),
and UTD Multimodal Human Action Dataset (UTD-MHAD1). The proposed model achieved
an accuracy of 0.949, 0.909, 0.904, 0.898, 0.824 and 0.580 for the six datasets above respec-
tively. The experiments also indicate that the self-attention mechanism significantly improves the performance of the model. The results show that the proposed architecture
has significantly outperformed the state-of-the-art methods. However, the experiments did
not involve transitional activities such as stand-to-sit, sit-to-stand and sit-to-lie.
Abdel-Basset et al. [26] presented a supervised dual-channel model comprised of LSTM
and an attention mechanism. The long-term temporal representations of the sensor data
are modeled by the LSTM. An advanced residual network, on the other hand, effectively
extracts hidden characteristics from high-dimensional sensory input. The attention mechanism is applied to the LSTM to further improve the temporal fusion performance. The proposed
model for multichannel spatial fusion also includes a novel adaptive-squeezing CNN. The
proposed model is evaluated on two benchmark datasets: UCI-HAR and WISDM. The results
show that the proposed model outperforms existing state-of-the-art models by achieving an
accuracy of 0.977 and 0.989 for UCI-HAR and WISDM respectively.
Xia et al. [27] proposed a hybrid deep learning architecture that is made up of two layers
of LSTM followed by convolutional layers. The global average pooling layer (GAP) is
applied instead of the fully connected layer after the convolutional layers, followed by
a batch normalization layer (BN). The authors found that GAP could help to reduce the
model parameters while the batch normalization layer helps to speed up the convergence.
The proposed model is validated on three different datasets: OPPORTUNITY, WISDM and
UCI-HAR, achieving an accuracy of 0.927, 0.958 and 0.958.
Nafea et al. [28] proposed a novel hybrid model to extract the temporal cues from the sensor
data. The proposed model consists of a two-stream of CNN and BiLSTM. The features from
both streams are then concatenated and fed to a fully-connected layer for classification. The
proposed model is evaluated on WISDM and UCI-HAR datasets and the results show that the
proposed model outperformed the state-of-the-art models, achieving an accuracy of 0.985 and
0.971 respectively. The authors claim that CNN-BiLSTM is an efficient solution to extract
spatial and temporal features. Similar work is reported in [29], whereby a novel hybrid model
is proposed to extract local features and global temporal relationship of the features. The
proposed model consists of a two-stream of convolutional layers and LSTM-based attention
mechanism modules. The proposed model is evaluated on WISDM, UCI-HAR, Opportunity
and PAMAP2 datasets. The results show that the proposed model outperformed the state-of-
the-art models, achieving an average accuracy of 0.975. Shi et al. proposed a similar model for
WiFi-based activity recognition. The proposed model consists of a series of convolutional and
max-pooling layers followed by a bidirectional LSTM and an attention mechanism module.
The activities considered in the experiments are standing, sitting, walking, running, stand up
and sit down. The results show the proposed model significantly improves the recognition
accuracy.
Gao et al. [30] proposed a novel hybrid model to capture both the channel-wise and
temporal dependencies of the sensor data. The segmented sensor data is fed to convolutional
layers to extract the feature representation. The features are then fed to a squeeze-and-excitation
module which consists of channel attention submodule and temporal attention submodule
to model the dependencies. The channel attention submodule consists of a two-stream of
max-pooling layer and average pooling layer, and each pooling layer is followed by a fully-
connected layer with ReLU activation function. The outputs of the fully-connected layers are
then concatenated along the temporal axis and converted to probabilities using the sigmoid function. The temporal attention submodule has a similar two-stream network, but the concatenation is performed along the channel axis. The proposed model is evaluated on four different
datasets: WISDM, UniMiB SHAR, PAMAP2 and Opportunity. The experimental results
show the proposed model achieved better performance than the existing models.
Based on the past literature, hybrid models can achieve a satisfactory result. However,
several limitations are posed by the aforementioned studies. First, the studies do not exploit
the temporal information of the sequence of activity windows. Also, most of the studies
except [24] consider only basic activities such as walking, standing and sitting, and ignore transitional activities such as stand-to-sit and sit-to-stand, which have a much shorter duration and occur less frequently. Therefore, this paper presents a hybrid deep learning model consisting of three parts: feature learning pipelines, a sequential learning module and an activity classifier. The feature learning pipelines form a concurrent feature extraction module that accepts a sequence of activity windows and learns the feature representation of the windows, while the sequential learning module models the temporal dependencies between the windows. The temporal features produced by the sequential learning module are fed to the classifier for activity recognition.
3 Proposed Methodology
3.1 Data Collection and Pre-processing
The dataset that is used in this study is the Smartphone-Based Recognition of Human Activ-
ities and Postural Transitions (SBHAPT) dataset [31]. The dataset is publicly available from
UCI and is widely used by numerous researchers to validate their architectures. The rationale for selecting this dataset is its data collection method, whereby the subjects performed the activities continuously. Thus, the dataset contains not only the basic activities, but also
the transitions between two activities. As the aim of this study is to exploit the sequence of
activity windows, this characteristic becomes a critical component to evaluate our proposed
model. To the best of our knowledge, this is the only public dataset that contains basic activ-
ities as well as their transitions. The dataset was collected from 30 subjects and each subject
performed the protocol twice. During the data collection, a smartphone integrated with a
tri-axial accelerometer and tri-axial gyroscope is attached to the waist of the subjects. The
sensor data is generated at a constant rate of 50 Hz.
A total of 12 activities are captured in this dataset, including the basic activities and
the postural transitional activities. Among the six basic activities, three are static activities, namely standing, sitting and lying, and the other three are dynamic activities, namely walking downstairs, walking upstairs and walking. The transitional activities are sit-to-lie,
lie-to-sit, stand-to-sit, stand-to-lie, lie-to-stand, and sit-to-stand. Note that stand-to-lie and
lie-to-stand consist of two transitional activities. For example, the stand-to-lie is composed
of stand-to-sit followed by sit-to-lie. However, the original authors of the dataset annotate
the activities as a single transitional activity. Table 1 lists the activities and their number of
samples. The sensor data is normalized to zero mean and unit variance. Then the sensor data
is segmented using the fixed-size sliding window method. An activity window may contain
samples from two activity classes. This is due to the nature of time-series data as the subject
transition from one activity to another. Therefore, the activity windows are labeled according
to the majority samples within the window. For example, if the size of the window is 100
Table 1 Distribution of sensor data (number of activity samples)

ID     Activity        Number of instances    Percentage (%)
A1     Walk            122,091                14.97
A2     Upstairs        116,707                14.31
A3     Downstairs      107,961                13.24
A4     Sit             126,677                15.53
A5     Stand           138,105                16.93
A6     Lie             136,865                16.78
A7     Stand to sit    10,316                 1.26
A8     Sit to stand    8,029                  0.98
A9     Sit to lie      12,428                 1.52
A10    Lie to sit      11,150                 1.37
A11    Stand to lie    14,418                 1.77
A12    Lie to stand    10,867                 1.33
Total                  815,614                100.00
samples, of which 55 samples belong to class A and the remaining samples belong to class B, then the window is labeled as class A.
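To make the segmentation and labeling step concrete, the following is a minimal Python sketch of fixed-size sliding-window segmentation with majority-vote labeling. The function name, the overlap parameter and the synthetic data are illustrative, not part of the original implementation.

```python
import numpy as np

def segment_windows(signal, sample_labels, window_size=128, overlap=0.5):
    """Segment a multichannel signal into fixed-size windows and label each
    window by the majority class of the samples it contains."""
    step = int(window_size * (1 - overlap))
    windows, labels = [], []
    for start in range(0, len(signal) - window_size + 1, step):
        end = start + window_size
        windows.append(signal[start:end])
        # Majority vote over the per-sample activity labels in this window
        values, counts = np.unique(sample_labels[start:end], return_counts=True)
        labels.append(values[np.argmax(counts)])
    return np.stack(windows), np.array(labels)

# Illustrative 6-channel (accelerometer + gyroscope) stream sampled at 50 Hz
signal = np.random.randn(5000, 6)
sample_labels = np.random.randint(0, 12, size=5000)
X, y = segment_windows(signal, sample_labels)
print(X.shape, y.shape)  # (77, 128, 6) (77,)
```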
3.2 Proposed Deep Temporal Model
The block diagram of the proposed model is illustrated in Fig. 1. The time-series data generated by the sensor is segmented with the sliding window method. Each activity window contains sensor data spanning a finite amount of time. The proposed model accepts a sequence of activity windows as input. The window sequence contains K previous activity windows in addition to the current window to be predicted. The K previous windows provide additional information to the model for predicting the current window. It is worth noting that in the figure, the sensor data is segmented with the sliding window with no overlapping. However, overlapping segmentation is typically used and has been shown to achieve better recognition accuracy in previous studies [32]. Furthermore, the overlapping sliding window increases the number of segmentations, improving the generalization of deep learning models.
Fig. 1 The block diagram of the proposed hybrid model
Table 2 The parameters of the feature learning pipeline (segmentation size equals 128)

Layer                 Kernel or pool size    Stride    Activation    Output shape
Input                                                                128 × 6
1D Conv               2                      1         tanh          128 × 8
Max pooling           2                      1                       127 × 8
Dropout (prob 0.5)
1D Conv               2                      1         tanh          127 × 18
Max pooling           2                      1                       126 × 18
Dropout (prob 0.5)
1D Conv               2                      1         tanh          126 × 36
Max pooling           2                      1                       125 × 36
Dropout (prob 0.5)
The first part of the proposed model is the concurrent feature learning module that acquires
the sequence of the activity windows. Each feature learning pipeline is composed of convo-
lution and pooling operations with dropout regularization. The convolution layers are used to
extract low- and high-level features in a hierarchical manner. The hyperbolic tangent (tanh)
activation function is selected based on the experiments that have been conducted. The pool-
ing layer reduces the size of the feature maps after each convolution layer which yields a
reduction in the computational complexity. Maximum pooling with a pool size of 2 and a stride of 1 is used because it has been shown to be effective for sensor-based activity recognition [24]. Dropout regularization is applied after the maximum pooling layer to reduce overfitting and improve the model generalization. Table 2 lists the parameters of the feature learning pipeline.
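As an illustration, a single feature learning pipeline following Table 2 could be sketched in Keras as below. The filter counts (8, 18 and 36) and the use of "same" padding for the convolutions are assumptions inferred from the output shapes in Table 2; the original implementation may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_pipeline(window_size=128, channels=6):
    """One feature learning pipeline: three Conv1D/MaxPooling1D/Dropout
    blocks, following the layer parameters listed in Table 2."""
    inputs = tf.keras.Input(shape=(window_size, channels))
    x = inputs
    for filters in (8, 18, 36):
        x = layers.Conv1D(filters, kernel_size=2, strides=1,
                          padding="same", activation="tanh")(x)
        x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
        x = layers.Dropout(0.5)(x)
    return tf.keras.Model(inputs, x, name="feature_pipeline")

# The printed output shapes (128x8, 127x8, 127x18, 126x18, 126x36, 125x36)
# reproduce the progression shown in Table 2.
feature_pipeline().summary()
```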
As shown in Fig. 1, each feature learning pipeline concurrently processes a different activity window and produces the local window features. The window features are the feature representations of the activity windows segmented at different times. These window features are concatenated to form a single window sequence vector before being used as input to the sequential learning module. Thus, the sequential learning module models the dependencies of the activity windows. Let the feature maps produced by the feature extractor be denoted by x_n, where n denotes the nth window, i.e., the window to be predicted. The concatenation of the window features, or window sequence, is given as follows:

z = [x_{n−K}, ..., x_{n−1}, x_n]   (1)

where K is the number of previous windows used for predicting window n. The size of the window sequence vector is given as follows:

T = L × (K + 1)   (2)

where L is the size of the single-window feature.
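For illustration, a sequence of K previous windows plus the current window can be assembled from the segmented windows as in the following sketch; the helper name and the simple Python loop are illustrative, not the authors' implementation.

```python
import numpy as np

def make_window_sequences(X, y, K=2):
    """Group consecutive windows into sequences [x_{n-K}, ..., x_{n-1}, x_n];
    each sequence takes the label of its last (current) window."""
    sequences, labels = [], []
    for n in range(K, len(X)):
        sequences.append(X[n - K:n + 1])
        labels.append(y[n])            # label of the window being predicted
    return np.stack(sequences), np.array(labels)

# With K = 2, each input sample is a sequence of three consecutive windows
X = np.random.randn(100, 128, 6)       # 100 windows of 128 samples, 6 channels
y = np.random.randint(0, 12, size=100)
X_seq, y_seq = make_window_sequences(X, y, K=2)
print(X_seq.shape, y_seq.shape)        # (98, 3, 128, 6) (98,)
```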
The window sequence is then fed to the sequential learning module. The sequential learn-
ing module aims to model the dependencies between the window features. Previous works
have shown that LSTM is effective in modeling time series data for activity recognition.
Therefore, the LSTM network is adopted as the sequential learning module. LSTM networks
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Deep Temporal Conv-LSTM for Activity Recognition
have the form of a chain of repeating neural network-based modules known as LSTM cells.
Each cell accepts a single feature as input; thus, the number of LSTM cells is determined
by the size of the window sequence vector. This input is processed, and an output is pro-
duced which will be fed to the next cell. In this way, the temporal information in the window
sequence is captured for classification.
An LSTM cell consists of several gating units that control the flow of information from
one LSTM cell to another LSTM cell. The first gate is called the ‘forgetting gate’. This gate
examines the input feature and the previous cell’s output, and determines which information
needs to be filtered out from the cell. This operation is performed by a layer with sigmoid
activation function, which outputs the value in the range of 0 (filter out) and 1 (keep).
f_t = σ(w_f · [h_{t−1}, z_t] + b_f)   (3)
The second gate is called the ‘input gate’. This gate is responsible for storing information
in the cell based on the input feature and the output of the previous cell. This operation is
performed by two layers, one with sigmoid activation function and the other layer with tanh
activation function. The sigmoid layer selects the relevant information to be used for the update, while the tanh layer creates a candidate vector that will be used to update the cell state.

u_t = σ(w_u · [h_{t−1}, z_t] + b_u)   (4)

C̃_t = tanh(w_c · [h_{t−1}, z_t] + b_c)   (5)
The computation to update the cell state is given as follows:
C_t = u_t ⊙ C̃_t + f_t ⊙ C_{t−1}   (6)
As can be seen from the above formula, the update considers the state of the previous cell.
This allows the cell to add some relevant information from the previous cell state to the cell
state.
The final gate is the ‘output gate’. This gate examines the input feature and the previous
cell’s output and produces the output that is based on the cell state. The computation of the
output is given as follows.
v_t = σ(w_v · [h_{t−1}, z_t] + b_v)   (7)

h_t = v_t ⊙ tanh(C_t)   (8)
Each LSTM cell is set to have 48 hidden units, and the LSTM network returns only the output of the last cell, h_T. Dropout regularization with a dropout rate of 0.5 is
applied to the LSTM network to improve the model generalization. Finally, the output of the
LSTM network is fed to a softmax classifier with 12 units whereby each unit represents an
activity class.
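Putting the pieces together, the overall architecture can be sketched in Keras as below. The paper does not fully specify how the concatenated window features are arranged into LSTM time steps; this sketch concatenates the per-window feature maps along the time axis, which is one plausible reading and should be treated as an assumption, as should the filter counts taken from Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_pipeline(x):
    """One feature learning pipeline (Conv1D/MaxPooling1D/Dropout, Table 2)."""
    for filters in (8, 18, 36):
        x = layers.Conv1D(filters, 2, padding="same", activation="tanh")(x)
        x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
        x = layers.Dropout(0.5)(x)
    return x

def build_model(K=2, window_size=128, channels=6, num_classes=12):
    # One input and one feature learning pipeline per activity window
    inputs = [tf.keras.Input(shape=(window_size, channels)) for _ in range(K + 1)]
    features = [conv_pipeline(inp) for inp in inputs]
    # Window sequence: per-window feature maps joined along the time axis
    z = layers.Concatenate(axis=1)(features)
    h = layers.LSTM(48)(z)              # only the last output h_T is returned
    h = layers.Dropout(0.5)(h)
    outputs = layers.Dense(num_classes, activation="softmax")(h)
    return tf.keras.Model(inputs, outputs)

model = build_model(K=2)
model.summary()
```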
Given a sequence of windows, x_{n−K}, ..., x_{n−1}, x_n, the model is trained by minimizing the loss function L between the prediction and the window's label. This can be expressed as follows:

(w*, b*) = argmin_{w,b} L(y_n, ŷ_n)   (9)
where y_n is the label of the current window (the window being predicted) and ŷ_n is the prediction for the current window. The loss function is the cross entropy, which is defined as follows:

L(y_n, ŷ_n) = −Σ_{m=1}^{M} y_{n,m} log ŷ_{n,m}   (10)

where M is the number of activity classes.
3.3 Implementation Details
The dataset is split using the subject-based hold-out method, with a split ratio of 22 subjects to 8 subjects. Note that each subject performed the protocol twice; therefore, there are 44 activity recordings in the training set and 16 activity recordings in the test set. In the experiments, a validation set is not used due to the limited size of the dataset. Therefore, the whole training set is used to train the model, and the test set is used to evaluate it. The number of training epochs is set to 500 and the batch size is set to 128. The training loss is monitored during training, and model checkpointing is used to save the best weights. The proposed model is trained to minimize the cross-entropy loss using the adaptive moment estimation (Adam) optimizer. L2 regularization is used to prevent the model from overfitting the training data. The proposed model was implemented using the TensorFlow framework on a workstation equipped with an Intel i5 CPU, 16 GB of memory and an Nvidia GTX 1070 GPU.
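A minimal training sketch reflecting these settings is shown below. The build_model helper from the architecture sketch above is assumed, the L2 regularization is indicated only as a comment, and the data are synthetic placeholders standing in for the subject-based training split.

```python
import numpy as np
import tensorflow as tf

# Synthetic placeholders: 500 training sequences of 3 windows (K = 2),
# each window 128 samples x 6 channels, 12 activity classes.
X_train = np.random.randn(500, 3, 128, 6).astype("float32")
y_train = np.random.randint(0, 12, size=500)

model = build_model(K=2)  # assumed from the architecture sketch above
# (In the paper, L2 regularization is also applied; in Keras this would be
# done by passing kernel_regularizer=tf.keras.regularizers.l2(...) to layers.)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",  # cross-entropy over the 12 classes
    metrics=["accuracy"],
)

# Save the best weights observed during training; the training loss is
# monitored since no validation split is used.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_weights.h5", monitor="loss",
    save_best_only=True, save_weights_only=True,
)

model.fit(
    [X_train[:, k] for k in range(3)],  # one input per feature learning pipeline
    y_train,
    epochs=500,
    batch_size=128,
    callbacks=[checkpoint],
)
```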
Several performance metrics are used to evaluate the performance of the proposed model.
The performance metrics are precision, recall, F-score and accuracy. The precision indicates
the ability of the model to distinguish an activity class from all the other classes. The recall
indicates the ability of the model to correctly recognize an activity class. The F-score is the harmonic mean of recall and precision. The accuracy indicates the fraction of correctly classified
activity windows.
4 Experimental Results
4.1 Experimental Setup
This section describes the experimental results of this study. Two experiments have been
conducted to evaluate the performance of the proposed model. First, we experimented with
the relation between the number of feature learning pipelines and the recognition accuracy.
In this experiment, we set the size of the window segmentation to 120 samples with an
overlapping of 60 samples. This results in 9237 and 4354 windows for the training set and
test set respectively. The number of feature learning pipelines is increased from 1 to 4.
The second experiment involved the effect of window size on recognition accuracy. Based
on a study reported in [33], the optimal window size for recognizing energetic and non-
energetic activities is in the range of 1–5.75 s, while the recommended window size to
prioritize recognition speed is in the range of 0.25–3.25 s. Therefore, in this experiment, we
experimented with five window sizes in the range of 80 samples (1.6 s) and 140 samples
(2.8 s) as shown in Table 3.
The recall, precision and accuracy are used to determine the optimal parameters and
evaluate the proposed model. Recall is defined as the ability of the model to identify the
activity class of a window segmentation. Precision is the ability of the model to distinguish
an activity class from all the other classes. Accuracy is the fraction of correctly classified
Table 3 The number of window segmentations for each experimental setup

Number of windows and labels    80 samples    100 samples    120 samples    140 samples    128 samples
Training set                    13,856        11,084         9,237          8,659          7,917
Test set                        6,532         5,225          4,354          4,082          3,732
window segmentation. The performance metrics are given as follows.
Recall = TP / (TP + FN)   (11)

Precision = TP / (TP + FP)   (12)

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (13)
where TP is true positive, TN is true negative, FP is false positive and FN is false negative.
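These metrics can be computed directly from the confusion counts or, as in the illustrative snippet below, with scikit-learn; the tiny label vectors are placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder ground-truth and predicted window labels (class ids 0-11)
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

# Macro-averaged precision, recall and F-score over the activity classes
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f-score={fscore:.3f} accuracy={accuracy:.3f}")
```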
4.2 Number of Feature Learning Pipelines
In this experiment, the optimal number of feature learning pipelines is determined. Five
models are built as listed in Table 4. The models are trained using the training set until
the number of maximum epochs is reached. During the training, the training accuracy is
monitored, and the best weights are saved based on the training accuracy. The trained models
are then evaluated using the test set. First, the proposed model is evaluated with 1 feature
learning pipeline without the sequential learning module. This is to evaluate the ability
of the feature extractor in extracting the relevant features for activity classification. The
results show that the proposed model is able to achieve high accuracy of 0.900. Then, we
evaluate the performance of the model when the sequential learning module is integrated
to learn the temporal information of the sensor data as well as the sequence of the activity
windows. An improvement of 0.003 is observed. This shows that capturing the temporal
information of the sensor data and activity windows is significant in the classification of
the activities. Following this, the number of feature learning pipelines is increased to two
to model two activity windows in sequence. In other words, the proposed model utilizes
the previous activity window in predicting the current window. Note that, for models with
multiple feature learning pipelines, the number of windows is equal to the number of feature
Table 4 The accuracy of the proposed model with the different number of feature learning pipelines

Model                                                                Accuracy
1 feature learning pipeline                                          0.900
1 feature learning pipeline with the sequential learning module      0.903
2 feature learning pipelines with the sequential learning module     0.905
3 feature learning pipelines with the sequential learning module     0.912
4 feature learning pipelines with the sequential learning module     0.904
learning pipelines. However, the number of outputs (predictions) remains the same, which is
the prediction of the current window (window being predicted). This is indicated by formula
(9). The results show that the performance of the proposed model is increased by 0.002. The
experiment proceeds with three and four feature learning pipelines. The best performance is
observed when the proposed model is integrated with three feature learning pipelines with
an accuracy of 0.912. For the next experiment, this model configuration is used to investigate
the optimal window size for activity recognition. Table 4 lists the recognition accuracy for
each model configuration.
4.3 Window Size
In this experiment, the proposed model with 3-feature learning pipeline is used to determine
the optimal window size. The models are trained using the training set until the number
of maximum epochs is reached. During the training, the training accuracy is monitored,
and the best weights are saved based on the training accuracy. The trained models are then
evaluated using the test set. The window size is varied to 80, 100, 120, 140 and 128 samples.
For each window size, a 50% overlap is used. This experiment is critical in determining the best
window size due to the characteristic of the sensor data, which directly affects the activity
classification. Some activities take a longer time to complete, while others are completed in a
short time. A too small or large window size may cause the window to be wrongly classified.
This problem would be compounded when a sequence of activity windows is considered
during the activity classification. Figure 2 shows the recall, precision and F-score measures of the activity recognition with the different window sizes. Table 5 lists the accuracy of the
experimental setups.
Overall, it is observed that the proposed model performed well in classifying the
non-transitional activities (A1–A6) compared to transitional activities (A7–A12). All the
non-transitional activities were classified with F-score measures above 0.800, whereas the
transitional activities were classified with F-score measures in the range of 0.403 and 0.753.
This is due to the limited number of transitional activity windows, as shown in Table 1, whereby
the number of samples of non-transitional activities is significantly higher than the number
of samples of transitional activities. It is also observed that a significant number of A4 win-
dows are misclassified as A5 and vice versa in each of the experiments. Please refer to the "Appendix" for the confusion matrix. This is due to the fact that both activities have similar
signal patterns. Hence, the features that have been learned might have similar representations.
Experimental setup 1 (window size equals 80 samples) shows that the proposed model
performed well in classifying the non-transitional activities compared to transitional activi-
ties, with activity A3 achieving the highest precision of 1.00. The recognition accuracy of the
experiment is 0.892. Experimental setup 2 (window size equals 100 samples) shows sim-
ilar performance in classifying the non-transitional activities compared to the transitional
activities. However, it is observed that there is a slight improvement in classifying activity
A1 to A6. It is also observed that the number of misclassifications of activity A4 and A5 is
slightly lower. Overall, the recognition accuracy of the proposed model is 0.897. Experimen-
tal setup 3 (window size equals 120 samples) shows better performance in terms of F-score
measures in all activities except A3, A8, A9 and A12. The classification of activity A4 and
A5 is also improved. The accuracy of the proposed model is 0.907. Experimental setup 4
(window size equals 140 samples) shows a significant reduction in the recognition accuracy
whereby the accuracy is 0.891, which is 0.016 lower than that of experimental setup 3. A similar
Fig. 2 The performance measures of activity recognition with different window sizes: a 80 samples, b 100 samples, c 120 samples, d 140 samples and e 128 samples
Table 5 Accuracy of activity recognition with different window sizes

            80 samples    100 samples    120 samples    140 samples    128 samples
Accuracy    0.892         0.897          0.907          0.891          0.916
classification pattern is observed whereby the non-transitional activities achieved better F-
scores compared to the transitional activities. Based on experimental setup 4, it is concluded
that the optimal window size is between 120 and 140. Therefore, another experiment was
carried out by setting the window size equal to 128 samples. Experimental setup 5 shows a
significant improvement in accuracy which is recorded at 0.916, which is 0.008 higher than
experiment setup 3. It is observed that the number of misclassifications of activity A4 and
A5 is decreased. Also, the transitional activities (activity A7–A12) achieved better F-score
measures except for activity A10. Therefore, we can conclude that a window size of 128
is the optimal value for activity recognition. Next, we compare the proposed model with a
benchmark hybrid model.
4.4 Comparison with the State-of-the-art Models
We compare the proposed model with other state-of-the-art models. Table 6 summarizes the state-of-the-art models. For a fair comparison, the table shows the model
performance in classifying datasets of basic activities only such as jogging, walking, walking
upstairs and downstairs, standing, sitting, lying down. The performance of the state-of-the-art
models on various datasets is shown in the table. The performance of our proposed model is
comparable to, if not better than, that of the state-of-the-art models.
The state-of-the-art models were evaluated on datasets collected from subjects performing the activities separately (not in a continuous manner). Thus, the datasets contain only basic activities and no transitional activities. Also, the studies split the datasets by instance (sample), except for studies [24, 26]. Hence, intra-subject dependencies are present in the training set, which would inflate the recognition accuracy. Unlike the state-of-the-art models, our proposed model is evaluated on a dataset with basic activities and the transitions between the activities. Classifying this dataset is challenging because a window segmentation might contain data belonging to different activities. The dataset is split by subject.
As shown in the table, all the state-of-the-art models employed convolutional layers and
LSTM to extract local and temporal features for more accurate recognition results. Various
improvements to the classification model have been proposed to improve recognition accu-
racy. The attention mechanism module is integrated into the model to learn the relevance
of the features for prediction [25,26]. In [30], the squeeze-and-excitation-based module is
integrated to model the dependencies of the feature maps. Although the modules are shown
to improve the recognition accuracy slightly, the integration increases the complexity of the
model. Furthermore, the state-of-the-art models do not exploit the sequence of the activity
windows when performing the recognition since the models accept a single window segmen-
tation as the input. Unlike the state-of-the-art models, our proposed model accepts a sequence
of activity windows which allows the relationship of the window features to be modeled and
consequently improves the recognition accuracy. In terms of the number of parameters, our
proposed model has the fewest parameters among the state-of-the-art models.
In the experiment, a comparison of the proposed model with a benchmark model is per-
formed. The benchmark model is the hybrid convolution LSTM model proposed by [24]. The
rationale behind the comparison is that the authors used the same public SBHAPT dataset
in their study. However, the method reported in [24] converted the sensor data into image
form before feeding it into the proposed model. Therefore, to perform a fair comparison, we
built and trained the benchmark hybrid model on the SBHAPT dataset. The parameters of
the benchmark model such as kernel size, LSTM unit, training epoch, optimizer were set and
defined according to the study. The benchmark model was trained and evaluated with the
same training and test ratio. The performance measures of the benchmark model are given
in Fig. 3a. The comparison of the performance metrics is given in Fig. 3b and Table 7.
We observed that our proposed model with three feature learning pipelines outperforms
the benchmark architecture in terms of accuracy. The accuracy of the proposed model is
0.916 which is 0.013 higher than the benchmark model. In terms of recall, precision and
F-score, the proposed model performed better in classifying the non-transitional activities
(A1–A6), achieving an average F-score of 0.939 which is 0.014 higher than the benchmark
model. However, the proposed model achieved a slightly lower average F-score measure in
Table 6 Summary of the state-of-the-art models

Singh et al. [25]
  Dataset preparation: Leave-one-out (subject-independent)
  Model performance (accuracy): MHEALTH: 0.9486; USC-HAD: 0.9088; UTD-MHAD2: 0.8994; WISDM: 0.9041
  Description: The proposed model accepts a single window segmentation for activity recognition. The architecture consists of convolutional layers to extract local features, followed by an LSTM layer to capture the temporal dependencies of the features, and finally an attention mechanism that assigns different weights to the features to indicate their relevance for classifying the activity.
  Number of parameters: N.A.

Abdel-Basset et al. [26]
  Dataset preparation: N.A.
  Model performance (accuracy): UCI-HAR: 0.9770; WISDM: 0.9890
  Description: The proposed model accepts a single window segmentation for activity recognition. The architecture consists of a two-stream spatial feature extractor and temporal feature extractor, whereby a series of residual blocks is used to extract the spatial features while LSTMs with an attention mechanism are used to extract and assign weights to the temporal features.
  Number of parameters: 312,934

Xia et al. [27]
  Dataset preparation: Hold-out with 7:3 ratio (subject-independent)
  Model performance (F-score): UCI-HAR: 0.9578; WISDM: 0.9585
  Description: The proposed model accepts a single window segmentation for activity recognition. The architecture consists of two layers of LSTMs to extract the temporal features of the data, followed by convolutional and max-pooling layers to extract the local features.
  Number of parameters: 49,606

Nafea et al. [28]
  Dataset preparation: N.A.
  Model performance (accuracy): WISDM: 0.9853; UCI-HAR: 0.9705
  Description: The proposed model accepts a single window segmentation for activity recognition. The architecture consists of two streams of convolutional layers and a bi-directional LSTM to extract local features and temporal features, respectively. Finally, the features are concatenated for classification.
  Number of parameters: N.A.

Gao et al. [30]
  Dataset preparation: Hold-out with 7:3 ratio
  Model performance (accuracy): WISDM: 0.9885; UniMiB: 0.7903
  Description: The proposed model accepts a single window segmentation for activity recognition. The model has two squeeze-and-excitation-based modules, temporal attention and channel-wise attention, to capture the temporal and channel-wise dependencies of the features extracted by the convolutional layers.
  Number of parameters: 950,000–3,510,000

Proposed model
  Dataset preparation: Hold-out with 7:3 ratio (subject-independent)
  Model performance (accuracy): SBHAPT: 0.9160
  Description: The proposed model accepts multiple window segmentations for activity recognition. The model has concurrent feature learning pipelines consisting of convolutional and max-pooling layers to extract local window features. The window features are concatenated and modeled with LSTM layers for activity recognition.
  Number of parameters: 21,990
Fig. 3 a The performance measures of the benchmark model. b Comparison of the performance measures between the proposed model and the benchmark model in terms of classifying non-transitional (NT) and transitional activities (TR)
Table 7 Accuracy of the benchmark model and the proposed model

                                                               Benchmark model     Proposed model
Highest accuracy                                               0.9054              0.9160
Average accuracy over thirty (30) experiments                  0.8893              0.8950
Standard deviation of accuracy over thirty (30) experiments    0.0101              0.0096
95% confidence interval                                        0.8893 ± 0.00363    0.8950 ± 0.00345
For the transitional activities (A7–A12), the proposed model's average F-score is 0.612, compared
with 0.623 for the benchmark model. Although the average F-score is lower, the average precision
is higher, which indicates that the proposed model is more precise in classifying the transitional
activities. Although accepting multiple window segmentations allows our proposed model to achieve
better performance, it introduces several challenges. First, the proposed model consumes more
memory because multiple window segmentations need to be stored. This is also reflected in the
dataset preparation for model training: since the model needs to capture the dependencies between
the multiple window segmentations, a large number of samples (window segmentations) is required
to ensure model generalization, and the challenge becomes more pronounced as the number of window
segmentations grows.
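To make the memory and data-preparation point concrete, the sketch below groups overlapping sliding windows into sequences of consecutive windows, which is one plausible way to prepare such data; the window length, step and labeling rule are assumptions for illustration, not the exact procedure used in this work.

import numpy as np

def make_window_sequences(signal, labels, window_len=128, step=64, num_windows=3):
    """Group overlapping sliding windows into sequences of consecutive windows.

    signal: array of shape (num_samples, num_channels)
    labels: per-sample activity labels of shape (num_samples,)
    Returns X of shape (num_sequences, num_windows, window_len, num_channels) and y,
    where each sequence is labeled by its most recent window (an assumed convention).
    """
    windows, window_labels = [], []
    for start in range(0, len(signal) - window_len + 1, step):
        windows.append(signal[start:start + window_len])
        # Label the window by the majority label of its samples.
        values, counts = np.unique(labels[start:start + window_len], return_counts=True)
        window_labels.append(values[np.argmax(counts)])

    X, y = [], []
    for i in range(num_windows - 1, len(windows)):
        X.append(windows[i - num_windows + 1:i + 1])   # num_windows consecutive windows
        y.append(window_labels[i])
    return np.asarray(X), np.asarray(y)

Each training sample now carries num_windows windows instead of one, so memory use grows roughly linearly with the sequence length, and more raw data is needed to produce enough multi-window samples for the model to generalize.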
We performed an independent two-sample t-test to determine whether there is a statistically significant
difference between the two models' accuracy. The proposed and benchmark models are each
trained thirty (30) times, which is the minimum number of samples for hypothesis tests [34].
Each model's accuracy is recorded, and the average and standard deviation of the models'
accuracy are calculated as shown in Table 7. The 95% confidence interval of the model
accuracy is also given in the table. The average accuracy of the proposed model is 0.895,
which is 0.006 higher than that of the benchmark model. The standard deviation of the proposed
model's accuracy is lower than that of the benchmark model by 0.0005. The margins of error at
the 95% confidence level for the proposed model and the benchmark model are 0.00345 and 0.00363,
respectively.
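The summary statistics and margins of error in Table 7 can be reproduced with a few lines of NumPy/SciPy, as sketched below; the accuracy list is a placeholder for the thirty recorded scores, and the use of the normal critical value (rather than Student's t) is an assumption that closely matches the reported margins.

import numpy as np
from scipy import stats

def summarize_accuracy(accuracies, confidence=0.95):
    """Mean, sample standard deviation and margin of error of repeated accuracy scores."""
    acc = np.asarray(accuracies, dtype=float)
    n = acc.size
    mean = acc.mean()
    std = acc.std(ddof=1)                     # sample standard deviation
    z = stats.norm.ppf((1 + confidence) / 2)  # about 1.96 for a 95% interval
    margin = z * std / np.sqrt(n)             # half-width of the confidence interval
    return mean, std, margin

With a mean of 0.8950, a standard deviation of 0.0096 and n = 30, the margin comes out at roughly 0.0034, consistent with the 0.8950 ± 0.00345 interval reported for the proposed model.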
We performed two types of hypothesis tests. The first test determines whether the accuracies
of the proposed and benchmark models are equal, and the second estimates the mean difference of
the average accuracy. The significance level of both tests is set to 0.05.
Table 8 The null hypothesis and the p-value of the hypothesis tests

Null hypothesis                                          p-value
H0: μ_proposed_model = μ_benchmark_model                 0.0161
H0: μ_proposed_model − μ_benchmark_model > 0.005         0.5939
H0: μ_proposed_model − μ_benchmark_model > 0.007         0.2943
H0: μ_proposed_model − μ_benchmark_model > 0.009         0.0950
H0: μ_proposed_model − μ_benchmark_model > 0.0095        0.0667
The results of the tests are given in Table 8. As can be seen in Table 8, the p-value of the first test
is 0.0161, which is lower than the significance level. Therefore, it can be concluded that the
average accuracies of the two models are not equal. For the second hypothesis test, we tested
hypothesized mean differences of 0.005, 0.007, 0.009 and 0.0095. As can be seen in Table 8,
the p-values of the four tests are above the significance level. Therefore, it is concluded that
the mean difference of the average accuracy is about 0.01.
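Tests of this form can be run with SciPy as sketched below: a two-sided two-sample t-test for equal means, and one-sided tests against each hypothesized mean difference obtained by shifting one sample by that difference. The arrays are placeholders for the thirty recorded accuracies, this is our reading of the procedure rather than code from the paper, and the alternative argument requires SciPy 1.6 or later.

import numpy as np
from scipy import stats

def run_hypothesis_tests(proposed_acc, benchmark_acc,
                         diffs=(0.005, 0.007, 0.009, 0.0095), alpha=0.05):
    """proposed_acc and benchmark_acc are the per-run accuracy scores of the two models."""
    proposed = np.asarray(proposed_acc, dtype=float)
    benchmark = np.asarray(benchmark_acc, dtype=float)

    # Test 1: H0: the mean accuracies are equal (two-sided two-sample t-test).
    _, p_equal = stats.ttest_ind(proposed, benchmark)
    print(f"H0: equal means, p = {p_equal:.4f}, reject = {p_equal < alpha}")

    # Test 2: for each hypothesized difference d, test H0: mean difference > d.
    # Shifting the proposed scores by d and using alternative='less' yields the
    # one-sided p-value against that null.
    for d in diffs:
        _, p = stats.ttest_ind(proposed - d, benchmark, alternative="less")
        print(f"H0: mean difference > {d}, p = {p:.4f}, reject = {p < alpha}")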
5 Conclusion
In this paper, we propose a deep temporal Conv-LSTM model that models the temporal information
of both the sensor data and the activity windows for activity recognition. The proposed model
consists of concurrent feature learning pipelines that accept a sequence of activity windows
for feature extraction. In addition, the proposed model is integrated with a sequence learning
module to learn the temporal features from the concatenated window features. As a result, the
proposed model is able to learn a better feature representation of the sensor data for activity
recognition. The proposed model is evaluated on a public dataset consisting of dynamic,
static and transitional activities, and compared with a benchmark model. The results show
that the proposed model performs better than the benchmark model, achieving an accuracy
of 0.916, which is 0.013 higher than the accuracy of the benchmark model. We plan to
enhance the network architecture by integrating an attention mechanism that can learn the
importance of the features to the prediction. The feature learning pipeline can also be enhanced
by integrating a squeeze-and-excitation module to capture more salient features.
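As a pointer to the squeeze-and-excitation idea mentioned as future work, a minimal channel-wise SE block for 1D feature maps is sketched below; the reduction ratio and the placement of the block are assumptions, not part of the proposed architecture.

from tensorflow.keras import layers

def squeeze_and_excitation(feature_map, reduction_ratio=8):
    """Channel-wise squeeze-and-excitation for a (batch, time_steps, channels) feature map."""
    channels = feature_map.shape[-1]
    s = layers.GlobalAveragePooling1D()(feature_map)                     # squeeze: one value per channel
    s = layers.Dense(channels // reduction_ratio, activation="relu")(s)  # excitation bottleneck
    s = layers.Dense(channels, activation="sigmoid")(s)                  # per-channel weights
    s = layers.Reshape((1, channels))(s)
    return layers.Multiply()([feature_map, s])                           # recalibrate the channels

Such a block could, for example, be dropped in after a convolutional layer of each feature learning pipeline to re-weight the extracted channels before the LSTM stage.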
Funding This work has been supported in part by the Ministry of Higher Education Malaysia for Fundamental
Research Grant Scheme with Project Code: FRGS/1/2019/ICT02/USM/02/1.
Data Availability Not applicable.
Code Availability Not applicable.
Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Ethics Approval Not applicable.
Consent to Participate Not applicable.
Appendix
Confusion matrices of the proposed model for window sizes of 80, 100, 120, 140 and 128 samples (rows: actual activity classes A1–A12; columns: predicted activity classes A1–A12).
References
1. Abidine BM, Fergani L, Fergani B, Oussalah M (2018) The joint use of sequence features combination
and modified weighted SVM for improving daily activity recognition. Pattern Anal Appl 21:119–138.
https://doi.org/10.1007/s10044-016-0570-y
2. Tian Y, Zhang J, Wang J et al (2020) Robust human activity recognition using single accelerometer via
wavelet energy spectrum features and ensemble feature selection. Syst Sci Control Eng 8:83–96. https://
doi.org/10.1080/21642583.2020.1723142
3. Vanrell SR, Milone DH, Rufiner HL et al (2018) Assessment of homomorphic analysis for human activity
recognition from acceleration signals. IEEE J Biomed Health Inform 22:1001–1010. https://doi.org/10.
1109/JBHI.2017.2722870
4. Ertuğrul ÖF, Kaya Y (2017) Determining the optimal number of body-worn sensors for human activity
recognition. Soft Comput 21:5053–5060. https://doi.org/10.1007/s00500-016-2100-7
5. Kanjilal R, Uysal I (2021) The future of human activity recognition: deep learning or feature engineering?
Neural Process Lett 53:561–579. https://doi.org/10.1007/s11063-020-10400-x
6. Wang J, Chen Y, Hao S et al (2019) Deep learning for sensor-based activity recognition: a survey. Pattern
Recognit Lett 119:3–11. https://doi.org/10.1016/j.patrec.2018.02.010
7. Xu W, Pang Y, Yang Y, Liu Y (2018) Human activity recognition based on convolutional neural network.
In: 2018 24th International Conference on Pattern Recognition (ICPR), pp 165–170
8. Bevilacqua A, MacDonald K, Rangarej A et al (2019) Human activity recognition with convolutional
neural networks. In: Brefeld U, Curry E, Daly E et al (eds) Machine learning and knowledge discovery
in databases. Springer, Cham, pp 541–552
9. Lawal IA, Bano S (2019) Deep human activity recognition using wearable sensors. In: Proceedings of
the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments.
Association for Computing Machinery, New York, pp 45–48
10. Gil-Martín M, San-Segundo R, Fernández-Martínez F, Ferreiros-López J (2021) Time analysis in human
activity recognition. Neural Process Lett 53:4507–4525. https://doi.org/10.1007/s11063-021-10611-w
11. Zhu R, Xiao Z, Cheng M et al (2018) Deep ensemble learning for human activity recognition using
smartphone. In: 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), pp 1–5
12. Zehra N, Azeem SH, Farhan M (2021) Human activity recognition through ensemble learning of multiple
convolutional neural networks. In: 2021 55th annual Conference on Information Sciences and Systems
(CISS), pp 1–5
13. Sikder N, Chowdhury MdS, Arif ASM, Nahid A-A (2019) Human activity recognition using multi-
channel convolutional neural network. In: 2019 5th International Conference on Advances in Electrical
Engineering (ICAEE), pp 560–565
14. Zhang H, Xiao Z, Wang J et al (2020) A novel IoT-perceptive Human Activity Recognition (HAR)
approach using multihead convolutional attention. IEEE Internet Things J 7:1072–1080. https://doi.org/
10.1109/JIOT.2019.2949715
15. Chen Y, Zhong K, Zhang J et al (2016) LSTM networks for mobile human activity recognition. Atlantis
Press, pp 50–53
16. Zebin T, Sperrin M, Peek N, Casson AJ (2018) Human activity recognition from inertial sensor time-series
using batch normalized deep LSTM recurrent networks. In: 2018 40th annual international conference of
the IEEE Engineering in Medicine and Biology Society (EMBC), pp 1–4
17. Guan Y, Plötz T (2017) Ensembles of deep LSTM learners for activity recognition using wearables. Proc
ACM Interact Mob Wearable Ubiquitous Technol 1:1–28. https://doi.org/10.1145/3090076
18. Li S, Li C, Li W et al (2018) Smartphone-sensors based activity recognition using IndRNN. In: Proceedings
of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and
Ubiquitous Computing and Wearable Computers. Association for Computing Machinery, New York, pp
1541–1547
19. Mahmud T, Akash SS, Fattah SA et al (2020) Human activity recognition from multi-modal wearable
sensor data using deep multi-stage LSTM architecture based on temporal feature aggregation. In: 2020
IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), pp 249–252
20. Ordóñez FJ, Roggen D (2016) Deep convolutional and LSTM recurrent neural networks for multimodal
wearable activity recognition. Sensors. https://doi.org/10.3390/s16010115
21. Mekruksavanich S, Jitpattanakul A (2020) Smartwatch-based human activity recognition using hybrid
LSTM network. In: 2020 IEEE SENSORS, pp 1–4
22. Mutegeki R, Han DS (2020) A CNN-LSTM approach to human activity recognition. In: 2020 International
Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp 362–366
23. Li Z, Liu Y, Guo X, Zhang J (2020) Multi-convLSTM neural network for sensor-based human activity
recognition. J Phys Conf Ser 1682:012062. https://doi.org/10.1088/1742-6596/1682/1/012062
24. Wang H, Zhao J, Li J et al (2020) Wearable sensor-based human activity recognition using hybrid deep
learning techniques. Secur Commun Netw 2020:2132138. https://doi.org/10.1155/2020/2132138
25. Singh SP, Sharma MK, Lay-Ekuakille A et al (2021) Deep ConvLSTM with self-attention for human
activity decoding using wearable sensors. IEEE Sens J 21:8575–8582. https://doi.org/10.1109/JSEN.
2020.3045135
26. Abdel-Basset M, Hawash H, Chakrabortty RK et al (2021) ST-DeepHAR: deep learning model for human
activity recognition in IoHT applications. IEEE Internet Things J 8:4969–4979. https://doi.org/10.1109/
JIOT.2020.3033430
27. Xia K, Huang J, Wang H (2020) LSTM-CNN architecture for human activity recognition. IEEE Access
8:56855–56866. https://doi.org/10.1109/ACCESS.2020.2982225
28. Nafea O, Abdul W, Muhammad G, Alsulaiman M (2021) Sensor-based human activity recognition with
spatio-temporal deep learning. Sensors. https://doi.org/10.3390/s21062141
29. Xiao Z, Xu X, Xing H et al (2021) A federated learning system with enhanced feature extraction for human
activity recognition. Knowl-Based Syst 229:107338. https://doi.org/10.1016/j.knosys.2021.107338
30. Gao W, Zhang L, Teng Q et al (2021) DanHAR: dual attention network for multimodal human activity
recognition using wearable sensors. Appl Soft Comput 111:107728. https://doi.org/10.1016/j.asoc.2021.
107728
31. Reyes-Ortiz J-L, Oneto L, Sama A et al (2016) Transition-aware human activity recognition using smart-
phones. Neurocomputing 171:754–767
32. Janidarmian M, Roshan Fekr A, Radecka K, Zilic Z (2017) A Comprehensive analysis on wearable
acceleration sensors in human activity recognition. Sensors. https://doi.org/10.3390/s17030529
33. Banos O, Galvez J-M, Damas M et al (2014) Window size impact in human activity recognition. Sensors
14:6474–6499. https://doi.org/10.3390/s140406474
34. Hogg RV, Tanis EA, Zimmerman DL (2010) Probability and statistical inference. Prentice Hall, Upper
Saddle River
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.