Fullie and Wiselie: A Dual-Stream Recurrent Convolutional
Aention Model for Activity Recognition
KAIXUAN CHEN,University of New South Wales
LINA YAO, University of New South Wales
TAO GU, RMIT University
ZHIWEN YU, Northwestern Polytechnical University
XIANZHI WANG, University of New South Wales
DALIN ZHANG, University of New South Wales
Multimodal features play a key role in wearable sensor based Human Activity Recognition (HAR). Selecting the most salient features adaptively is a promising way to maximize the effectiveness of multimodal sensor data. In this regard, we propose a "collect fully and select wisely (Fullie and Wiselie)" principle as well as a dual-stream recurrent convolutional attention model, Recurrent Attention and Activity Frame (RAAF), to improve the recognition performance. We first collect modality features and the relations between each pair of features to generate activity frames, and then introduce an attention mechanism to select the most prominent regions from activity frames precisely. The selected frames not only maximize the utilization of valid features but also reduce the number of features to be computed effectively. We further analyze the hyper-parameters, accuracy, interpretability, and annotation dependency of the proposed model based on extensive experiments. The results show that RAAF achieves competitive performance on two benchmarked datasets and works well in real life scenarios.
Additional Key Words and Phrases: Human Activity Recognition, wearable sensors, attention mechanism, recurrent neural
networks, reinforcement learning
ACM Reference format:
Kaixuan Chen, Lina Yao, Tao Gu, Zhiwen Yu, Xianzhi Wang, and Dalin Zhang. 2017. Fullie and Wiselie: A Dual-Stream
Recurrent Convolutional Attention Model for Activity Recognition. 1, 1, Article 1 (November 2017), 22 pages.
https://doi.org/0000001.0000001
1 INTRODUCTION
Human Activity Recognition (HAR) plays a key role in several research fields. It has gained broad attention due to the increasing popularity of ubiquitous environments, especially in the health care and surveillance domains [3, 49]. Generally, HAR diverges into two categories of approaches: vision-based activity recognition [42] and sensor-based activity recognition [9]. The sensor-based approach has several advantages over the vision-based approach and has seen diverse applications including health monitoring and motion-sensing games.
• Compared with cameras, wearable sensors are usually not confined by environmental constraints such as illumination, viewpoint, and setup cost [7].
• Sensor data obtained from wearable devices are typically of higher quality, and complicated feature extraction is not necessary compared to image data.
• Wearable sensors only detect data that are strongly related to the dynamics of human motions. Therefore, the collected sensor data do not violate human privacy, while image data do.
Despite a large number of sensor-based recognition solutions being proposed over the decade, we discover several limitations. First, there is still a lack of a comprehensive model representation of sensor signals in a way that different activities can be distinguished in a more expressive and effective manner. With the recent advances in deep neural networks and the notable performance achieved by these methods in the HAR community [15, 17], the Convolutional Neural Network (CNN) appears to be a promising candidate for building such models. However, while CNN does well in capturing spatial relationships of features, it focuses merely on the features covered by the convolutional kernels but overlooks the correlation among non-adjacent features [22]. Considering that most of the data collected by wearable sensors such as accelerometers and gyroscopes are tri-axis, in this paper we transform sensor signals into a new activity frame which not only captures the relationships between each pair of tri-axis signals but also contains the relationship between each pair of single signals. The experiments show that our new representation is far more discriminative than traditional representations.
Second, the demerits of interperson variability and interclass similarity can greatly reduce system performance [7]. Interperson variability comes from the fact that the same activity can be performed differently by different people, and interclass similarity results from the similarity in the behavior patterns of different activities like walking and running. Both of the above issues require the classifier to be task dependent, i.e., it should automatically extract the salient information indicative of the true activity and ignore the interclass similarity. To this end, we propose an attention-based model, which is directly related to the HAR task, to address the problems of interperson variability and interclass similarity.
Attention is originally a concept in biology and psychology that implies focusing the power of noticing or thinking on something special to achieve better cognitive processes. Attention mechanisms have several advantages, the first being task dependence. Intuitively, the motion of different body parts contributes differently to different activities [42, 45]. For example, jumping mostly involves the legs while running is related to both arms and legs. More specifically, recognizing the patterns of walking depends more on the acceleration of the legs, while distinguishing sitting from lying relies more on orientation. In this paper, we separate the data related to each body part into different modals, namely accelerometer data, gyroscope data and magnetometer data, respectively. With the help of activity frames, we can thoroughly analyze not only the independent modals but also their correlations. Here, the attention mechanisms ensure that the system only focuses on the most contributing data and ignores the irrelevant sensors or modals.
The second advantage of attention mechanisms is that they open the black box of deep neural networks to a certain degree. While the inner mechanisms of neural networks remain implicit, interpretable neural networks are becoming another trend in the machine learning and data mining fields. Taking convolutional neural networks as an example, when using them to recognize a dog in an image, we tend to explicitly know that one filter distinguishes the dog's head and another filter identifies the dog's paw. Back to activity recognition, the attention model not only provides the specific body parts it focuses on but also highlights the most contributing sensors and modals for distinguishing diverse activities. The salient sensor data can be inferred from the glimpse patch (to be detailed in Section 3.2.1).
The third advantage is that attention reduces the computational cost significantly. Usually, the dimension of the features expands as we extract the full spatial relationships among sensors, and the cost increases with the input data dimension. Most existing models process the entire data every time, resulting in high computational cost. Some works [24, 32, 44] aim to limit the input dimension using techniques such as dimensionality reduction and feature selection. However, feature processing comes with information loss, leading to a new trade-off problem between accuracy and cost. Inspired by human attention, our proposed method focuses on only one small patch of the data each time and moves to the next patch when necessary. This method considerably reduces both computational cost and information loss.
In this paper, we tackle the HAR problem by transforming wearable sensor data into activity frames and deploying a dual-stream recurrent convolutional attention model, including one attention stream and one activity frame stream, to recognize activities. The main contributions of this work are summarized as follows:
• We transform the tri-axis sensor data into activity frames to extract the full relationships between data pairs. This enables the convolutional neural network to cover all features without overlooking any relationships between data pairs. Furthermore, the activity frames are encoded into convolutional activity frames in order to extract high-level features. Our model uses a single convolutional layer to encode low-level data. This layer is simple yet generates an effective representation to characterize the local salience of the sensor data.
• We propose a dual-stream recurrent model including one attention stream and one activity frame stream to recognize activities. Firstly, the system focuses on only a small patch of the activity frame that contains the most salient information to avoid unnecessary cost on less important areas, by leveraging the recurrent attention model combined with reinforcement learning. Secondly, we deploy a long short-term memory network to exploit spatial and temporal information in time-series signals and capture the dynamics of the sensor data.
• We examine our model on two public benchmark datasets, PAMAP2 [37, 38] and MHEALTH [4, 5], and perform extensive comparisons with other methods, as well as re-examine our approach on a new dataset collected in the real world named MARS. The experimental results show that our proposed model consistently outperforms a series of baselines and state-of-the-art methods over the three datasets.
The remainder of this paper is organized as follows. Section 2 briefly introduces the existing wearable sensor based HAR methods and attention-based models. Section 3 details the proposed model. Section 4 evaluates the proposed approach and compares it with state-of-the-art methods on two public datasets and one new dataset collected in the real world; in this section, we also analyze the experimental results in light of accuracy, interpretability, latency and annotation dependency. Section 5 summarizes this paper.
2 RELATED WORK
In this section, owing to the prevalence and outstanding performance of deep learning for HAR in recent years, we aim to give a comprehensive review of the existing work on deep learning for human activity recognition. We also briefly introduce the attention mechanisms used in previous works to study salient features.
2.1 Deep Learning for Human Activity Recognition
Wearable sensor based human activity recognition is essentially a problem of projecting low-level sensor data to high-level activity knowledge. In our work, one basic challenge behind the "collect and select" principle is how to deeply extract features adaptive to the classification tasks and obtain the most discriminative representations. Some works employ traditional machine learning methods working on heuristic hand-crafted features [6, 46], which not only require domain knowledge about activity recognition but also may lead to critical limitations like error-prone bias that hinders performance. Recently, since deep learning has achieved massive success in many fields [29], a flurry of research has emerged providing deep learning based solutions to various heterogeneous human activity recognition problems. The state-of-the-art deep learning based methods have made tremendous progress in improving recognition performance and are widely used in either the feature extraction or the classification process of HAR. The rationale behind this evolution is that deep learning is able to automatically extract adaptive features and spare the effort of manually extracting features and designing classifiers in detail.
Enlightened by the work done in [10], we group the deep learning algorithms for human activity recognition into two categories: generative deep architectures, including the deep belief network, restricted Boltzmann machine and autoencoder, and discriminative deep architectures, containing the convolutional neural network and recurrent neural network. We overview recent representative works as follows.
2.1.1 Generative Deep Architectures. Some existing deep learning based activity recognition solutions utilize generative deep architectures for feature extraction and for deriving more discriminative representations. One of the most widely used architectures is the autoencoder. Briefly, an autoencoder is usually a simple 3-layer neural network where the output units are directly related to the input units and it feeds back a latent representation of the input. The motivation of the autoencoder is to learn higher-level representations that omit noise and enhance effective information. In [30], Li et al. propose to learn features by using a sparse autoencoder that adds sparsity constraints, that is, the KL divergence, to achieve better performance in activity recognition. Wang et al. [41] adopt greedy pretraining for a stacked autoencoder and integrate the feature extraction process and the classifier into one architecture to jointly train them by fine-tuning the parameters.
Another widely used generative deep architecture is the Restricted Boltzmann Machine (RBM) [19]. The RBM shares a similar architecture with the autoencoder. The difference lies in that it uses a stochastic approach: it uses stochastic units with specific distributions such as Gaussian or binary distributions instead of deterministic activation functions. The authors in [35] first propose to deploy RBM to learn feature representations for activity recognition. Inspired by this, a sequence of works take RBM as a means to extract features for HAR. For example, [13] exploits an improved training process for RBM: they utilize the contrastive gradient to fine-tune the parameters and accelerate training. [25] employs a Gaussian layer for the first layer of their RBM model and binary layers for the rest. Furthermore, [36] considers multimodal sensor data and designs a multimodal RBM so that each modality has an individual RBM.
Generative deep architectures enjoy the merits of unsupervised learning and high-quality representations. However, they require unwanted pretraining, while our target is to construct an end-to-end model. In comparison, discriminative deep architectures are more applicable and popular in previous works.
2.1.2 Discriminative Deep Architectures. Discriminative deep architectures distinguish patterns by calculating the posterior distributions of classes based on annotated data [10]. Existing research can be categorized into two main directions: the convolutional neural network and the recurrent neural network.
According to [29], the theories behind the convolutional neural network include sparse interactions, parameter sharing and equivariant representations. Usually, a convolutional neural network contains (a) convolutional layers that create convolution kernels which are convolved with the layer input over a single spatial dimension to produce a tensor of outputs; (b) rectified linear unit (ReLU) layers that apply a non-saturating activation function to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layers; and (c) max pooling layers that down-sample the input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. After these, there are usually (d) fully connected layers which perform classification or regression tasks, so that CNNs can learn hierarchical representations or high-performance classifiers.
For HAR, stemming from the time-series characteristics, CNN can be used with 1D convolution or 2D convolution to combine temporal information. 1D convolution treats each axis of the sensor data as a channel, then flattens and unifies the outputs of each channel into one. One example is [48], where the authors propose to treat each axis of the accelerometer as one channel and conduct the convolutional process individually. On the contrary, 2D convolution transforms the input into 2D matrices and considers them as images. In [17], Ha et al. simply generate data images by combining all axis data. After that, the authors in [21] additionally consider temporal information and yield 2D time-series images. Furthermore, [39] harnesses multimodal sensor data that integrates pressure sensor data and applies a 2D convolutional neural network.
However, these works require massive domain knowledge when conducting the transformation, which is not feasible in more general situations. In comparison, the activity frame proposed in this work not only considers temporal information and fully extracts spatial relations but is also applicable to most multimodal sensor data, with better generalization and adaptivity.
The recurrent neural network (RNN) has proved to be effective in fields that contain significant temporal information, such as speech recognition and natural language processing, which is also the reason why RNN is applicable to HAR. Different from CNN, which only takes a single vector or matrix as input, RNN takes a sequence of vectors or matrices as input, where each sequence has one corresponding class label. With each recurrent layer considering both the output of the previous layer and the input vector or matrix at the current step, RNN thoroughly analyzes the sequences step by step. To achieve better performance, LSTM (long short-term memory) cells are introduced and usually combined with RNN. Some previous works utilize RNN in the HAR field [14]. In spite of the competitive performance, the time consumption and computational cost have caused concern. To adapt RNN to the HAR field, where instantaneity is an important issue for developing real applications, [20] proposed a new model which can perform RNN for HAR with high efficiency. [12] proposed a binarized-BLSTM RNN model that simplifies all the parameters, inputs, and outputs to be binary to reduce the cost.
In this paper, we innovatively propose a dual-stream recurrent neural network which not only considers temporal information as conventional works do but also leverages attention mechanisms, which are introduced next.
2.2 Aention Mechanisms
In our work, except for conventional deep learning approaches including convolutional neural networks and
recurrent neural networks, we also resort to attention mechanisms to facilitate to select the most salient features.
Tracing back the history of selecting eective regions using attention mechanisms or similar theories, some
works in the eld of computer vision [
1
,
11
,
27
] formulate the process of selecting as a sequential decision task.
In these works, the systems decide where to focus on step by step based on the previous decisions and the whole
environment. [
8
] constructs a policy gradient formulation to simulate eye movement. The authors formulate
eye-move control as a problem in stochastic optimal control based on a model of visual perception. However, the
too strict constraints on RNN limit the performance. [
11
,
27
] further combine attention mechanisms with deep
learning algorithms. [
11
] selects forveated images by controlling the location, orientation, scale and speed of the
attended object. To minimize the selecting uncertainty, they proposed a decision-theoretic probabilistic graphical
model based on RBM. Taking policy gradient formulations and deep learning into consideration , [
31
] proposed
the recurrent attention model (RAM) for image classication with a formulation similar to [
8
] but less restrictive
and leverages RNN as well. Inspired by [
31
], we propose a dual-stream recurrent convolutional attention model.
So far, to the best of our knowledge, our work is the rst one to introduce attention mechanisms to the HAR eld.
As feature relations are fully extracted and represented in activity frames, attention based model wisely selects
salient regions to perform activity recognition.
3 OUR MODEL
To fully collect effective information and wisely select salient features, our model contains two parts: (a) feature extraction, which first transforms wearable sensor data into 2-D matrices and uses a convolutional layer to derive higher-level features; and (b) a dual-stream recurrent model including one attention stream and one activity frame stream for activity recognition. Our attention stream recurrent model simulates the procedure of human brains processing visual information within several glimpses. In addition, we introduce reinforcement learning to decide which part of the activity frames it should glimpse next. The other stream is the activity frame stream: since activity recognition largely depends on temporal information and activity frames naturally capture serial relations, an activity frame based model is more suitable for our scenario.
The above process is presented as a three-dimensional model in Figure 1, where the time step t and the frame f represent the attention stream and the activity frame stream in our dual-stream method, respectively.
Fig. 1. Work-flow of the Proposed Approach. Dashed arrows indicate the time step t for the attention stream and the frame f for the activity frame stream, respectively. For each time step t, the input frame goes through a convolutional layer to obtain a higher-level representation C_f. We extract a retina region ρ(C_f, l_t^f) at location l_t^f, which is decided by the last time step t-1. ρ(C_f, l_t^f) next goes through a glimpse layer to get the glimpse g_t^f as input of the attention stream recurrent network, LSTM-a, which decides the action a_t^f and the next location l_{t+1}^f. For the activity frame stream recurrent network, the LSTM-f takes the last action of each frame a_T^f as input and outputs the final prediction.
Fig. 2. Transformation from sequences to frames
3.1 Input Representation
As we transform the wearable sensor data into activity frames, the data are represented as three-dimensional vectors. Each sample (x, y) of the model consists of a 3-D vector x and the activity label y. Suppose X, Y, F denote the activity frames' width, height, and number of frames, and C represents the number of activity classes; then we have:

x ∈ ℝ^(X×Y×F)    (1)

and

y ∈ [1, ..., C]    (2)

Fig. 3. Flattened Model. (a) Extracting the glimpse g_t^f from the input activity frame, including a CNN, flattening and reshaping, and a glimpse layer. (b) The detailed description of the glimpse layer, which combines the location l_t^f and the retina region ρ(C_f, l_t^f). (c) Dual-stream recurrent procedure containing the attention stream LSTM-a and the activity frame stream LSTM-f.
3.1.1 Activity Frame. There already exist some previous works that combine multimodal wearable sensor data for HAR at the feature level [3, 46]. For example, Kunze et al. [23] concatenate acceleration and angular velocity into one vector, and [26, 33, 40] combine acceleration with other modalities including microphone and GPS data. However, these works overlook the relations among sensors, which are important to activity recognition. A popular way of extracting spatial relations is deep learning methods like CNN. Although CNN is proven to perform well in HAR [21, 46], the accuracy is still not satisfactory. In fact, CNN was originally proposed for images, where each pixel is only related to its adjacent pixels and this small area can easily be covered by a kernel patch of a convolutional layer. However, it is still challenging to transform features so as to extract the relations between each signal and its related signals for HAR. In many cases of HAR [42], the sensor data are arranged according to the physical connection of human body parts. For example, the sensor data of the hands should be adjacent to the data of the shoulders, and the data of the shoulders should be adjacent to the data of the waist, which should be followed by the data of the hips, legs, and feet. Nevertheless, in the real world, activities always depend on more than one body part; for instance, running relies on the cooperation of arms and legs. In addition, the common Inertial Measurement Unit in wearable devices usually includes a tri-axis accelerometer, a tri-axis gyroscope, and a tri-axis magnetometer, and the degrees to which these sensors contribute to different activities vary. This makes it even more important to find a representative transformation that extracts the relationships between each pair of tri-axis sensor signals (e.g., accelerometer data and gyroscope data) and each pair of single signals (e.g., the first dimension of the accelerometer data and the second dimension of the gyroscope data).

ALGORITHM 1: Transformation from Sequences to Images
Input: Stacked raw signals. Each row is the tri-axis data of an accelerometer, a gyroscope or a magnetometer, denoted as x, y, z. As shown in Figure 2 (a), each row has a sequence number. Here the number of rows Nr = 9 as an example.
Output: The activity frame IA, which is a 2-D array
1: i = 1;
2: j = i + 1;
3: permutation sequence Sp = [1];
4: adjacent pair set Sap = ∅;
5: activity frame IA = the first row of the stacked signals
6: while i ≠ j do
7:   if j > Nr then
8:     j = 1;
9:   else if (i, j) ∉ Sap and (j, i) ∉ Sap then
10:    add (i, j) to Sap;
11:    add j to Sp;
12:    add the j-th row of the input data to IA;
13:    i = j;
14:    j = i + 1;
15:  else
16:    j = j + 1
17:  end if
18: end while
19: for each row of IA do
20:  if the sequence number of this row is odd then
21:    this row is extended as x, y, z, x, y, z, x, y, z
22:  else
23:    this row is extended as x, y, z, y, z, x, z, x, y
24:  end if
25: end for
26: return IA
Figure 2 shows the transformation process into activity frames. Each figure comprises four parts: the sequence number, the sensor location (hand, chest, leg) and modality (acceleration, angular velocity, ...), the notations (x, y, z), and real data examples. Algorithm 1 further illustrates the transformation from sequences to images. First, the raw signals are stacked row by row as shown in Figure 2 (a). After being permuted in the first loop (lines 6-18 in Algorithm 1), each tri-axis sensor signal has a chance to be adjacent to each of the other sensor signals, as shown in Figure 2 (b). For example, supposing Nr = 9, the final Sp is [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 5, 7, 9, 2, 4, 6, 8, 1, 4, 7, 1, 5, 8, 2, 5, 9, 3, 6, 9, 4, 8, 3, 7, 2, 6, 1]. Since we still need to extract the relationships between each pair of single sensor signals, the second loop (lines 19-25 in Algorithm 1) ensures that each single signal has a chance to be adjacent to each of the other signals, as Figure 2 (c) shows. So far we have extracted the relationships between each pair of single sensor signals.
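For concreteness, the following minimal NumPy sketch implements Algorithm 1 for the Nr = 9 example above; the function and variable names (e.g. sequences_to_activity_frame) are illustrative choices rather than part of our released implementation. Running it on nine dummy tri-axis rows reproduces the 37-element permutation sequence Sp listed above and yields a 37x9 activity frame.

import numpy as np

def sequences_to_activity_frame(rows):
    # Sketch of Algorithm 1: permute the stacked tri-axis rows so that every pair
    # of rows becomes adjacent at least once, then extend each row so that every
    # pair of single axes becomes adjacent as well.
    n_rows = len(rows)                      # Nr, e.g. 9 stacked tri-axis signals
    i, j = 1, 2                             # 1-based row indices, as in the paper
    order = [1]                             # permutation sequence Sp
    visited_pairs = set()                   # adjacent pair set Sap
    frame_rows = [rows[0]]                  # the frame starts with the first row

    while i != j:
        if j > n_rows:
            j = 1
        elif (i, j) not in visited_pairs and (j, i) not in visited_pairs:
            visited_pairs.add((i, j))
            order.append(j)
            frame_rows.append(rows[j - 1])
            i, j = j, j + 1
        else:
            j += 1

    # Second loop: extend each row so that every pair of single axes is adjacent.
    extended = []
    for seq_no, (x, y, z) in zip(order, frame_rows):
        if seq_no % 2 == 1:                 # odd sequence number
            extended.append([x, y, z, x, y, z, x, y, z])
        else:                               # even sequence number
            extended.append([x, y, z, y, z, x, z, x, y])
    return np.array(extended)

# Nine dummy tri-axis rows standing in for stacked accelerometer/gyroscope/magnetometer data
rows = [(r, r + 0.1, r + 0.2) for r in range(1, 10)]
frame = sequences_to_activity_frame(rows)
print(frame.shape)  # (37, 9)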
3.1.2 Convolutional Activity Frames. To derive an effective representation of features, we further transform activity frames into convolutional activity frames. Compared with a convolutional auto-encoder [18], we prefer to train the model end-to-end and omit the pretraining process, as shown in Figure 3. Each activity frame I_f (f denotes the f-th frame) is transformed into a three-dimensional cube, the height of which depends on the number of channels of the convolutional network. The convolutional network has two convolutional layers that learn filters which activate when they detect specific types of features at some spatial position in the input. The output is further processed by a ReLU layer and a max pooling layer. The former applies the non-saturating activation function relu(ν) = max(ν, 0) to increase the nonlinear properties of both the decision function and the overall network without affecting the receptive fields of the convolution layer. The latter partitions the input image into a set of non-overlapping rectangles and outputs the maximum of each such sub-region to omit the less important features.
To obtain the new convolutional activity frames, the cubes are flattened and reshaped to the same size as the original activity frames by a fully connected layer. After the convolutional layer, the input frame I_f is encoded into C_f.
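As a rough illustration of this encoding step, the PyTorch sketch below maps a single activity frame I_f to a convolutional activity frame C_f of the same size; the channel count, padding, and the 37x9 frame size (the Nr = 9 example) are illustrative assumptions rather than the exact configuration reported in Section 4.1.

import torch
import torch.nn as nn

class ConvActivityFrameEncoder(nn.Module):
    # Sketch: encode an activity frame I_f (H x W) into a convolutional activity
    # frame C_f of the same size, as described in Section 3.1.2.
    def __init__(self, height=37, width=9, channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),   # first convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3), stride=(1, 3)),    # pool along the width only
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3), stride=(1, 3)),
        )
        pooled_width = width // 3 // 3                          # after two 1x3 poolings
        # A fully connected layer flattens the cube and reshapes it back to the
        # original frame size, so C_f has the same shape as I_f.
        self.fc = nn.Linear(channels * height * pooled_width, height * width)
        self.height, self.width = height, width

    def forward(self, frame):                                   # frame: (B, 1, H, W)
        z = self.features(frame)
        c = self.fc(z.flatten(start_dim=1))
        return c.view(-1, self.height, self.width)              # C_f: (B, H, W)

# Usage on one 37x9 activity frame
encoder = ConvActivityFrameEncoder()
c_f = encoder(torch.randn(1, 1, 37, 9))
print(c_f.shape)  # torch.Size([1, 37, 9])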
3.2 Aention and Activity Frame Based Recurrent Model
We propose a dual-stream recurrent model that incorporates both attention and frame to analyze the convolutional
activity frames. Figure 3 shows the structure of this model, where the activity frame stream recurrent modal
leverages the temporal information of sensor data and the attention stream recurrent model solves the human
activity recognition problem.
Since dierent human body parts contribute dierently in recognizing dierent activities, we need to guarantee
that the system only focuses on the most relevant and contributing parts and data. Our dual-stream model
is inpired by Mnih et al. [
31
], who rst adopt the recurrent attention model (RAM) for image classication.
Specically, they address the image classication problem using the basic RAM. As the problem is relatively
simple with only brush strokes in images being salient and the contrast between the strokes and the black
backgrounds being clear. In contrast, analyzing activity frames in this work can be much more complex because
activity frames lack such characteristics compared with image data. Moreover, almost all sensors can detect
motions during activities and sometimes even standstill is still meaningful. Since the convolutional attention
frames fully extract the relationships among all feature pairs, only a part of them is salient to each certain activity.
Therefore, it is natural to introduce attention mechanisms facilitating to mine eective information and minimize
the negative impacts of undesirable information. To the best of our knowledge, our method is the rst one to
leverage the attention model to tackle the activity recognition problems.
Figure 3 shows a attened model, which better interprets the model. Our model is comprised of a glimpse
network, a recurrent attention unit, and a recurrent activity frame unit that we will introduce in the followings.
3.2.1 Glimpse Network. The first part after the convolutional layer is a glimpse network. The glimpse network not only avoids the system processing the whole data in its entirety at a time but also maximally eliminates information loss. In our model, each frame is "understood" within T glimpses. For the transformed frame C_f, at each time step t, we simulate the process of how the human eyes work. Our model first extracts a retina region denoted by ρ(C_f, l_t^f) from the input data at the location l_t^f with a retina. The retina image encodes the region around l_t^f with high resolution but uses a progressively lower resolution for points further from l_t^f. This has been proved an effective method to remove noise and avoid information loss [50].
In the human visual system, the retina image is converted into electric signals that are relayed to the brain via the optic nerves. Likewise, in our model, the retina image is converted into a glimpse g_t^f as Figure 3 shows. The retina image ρ(C_f, l_t^f) and the location l_t^f are linearly transformed independently with two linear layers parameterized by θ_g^ρ and θ_g^l, respectively. Next, the summation of these two parts is further transformed with another linear layer parameterized by θ_g^s and a rectified linear unit. The whole process can be summarized by the following equation:

g_t^f = f_g(ρ(C_f, l_t^f), l_t^f; θ_g^ρ, θ_g^l, θ_g^s) = relu(Linear(Linear(ρ(C_f, l_t^f)) + Linear(l_t^f)))    (3)

where Linear(•) denotes a linear transformation. Therefore, g_t^f contains information from both "what" (ρ(C_f, l_t^f)) and "where" (l_t^f).
3.2.2 Recurrent Attention Unit. We use recurrent neural networks as the core to process data step by step within several glimpses and introduce an attention mechanism to ensure the system only focuses on the most relevant sensors/modals and the most contributing data. The glimpses at the time steps of the recurrent attention model help visualize the contribution of sensors deployed at different body parts, thus achieving better interpretability of our model.
As Figure 3 shows, the basic structure of the recurrent attention unit is an LSTM-a (attention stream LSTM). At each time step t, the LSTM-a receives the glimpse g_t^f and the previous hidden state h_{t-1}^f as inputs, parameterized by θ_h. Meanwhile, it outputs the current hidden state h_t^f according to the equation:

h_t^f = f_h(h_{t-1}^f, g_t^f; θ_h)    (4)

The recurrent attention model also contains two sub-networks: the location network and the action network. These two sub-networks receive the hidden state h_t^f as input to decide the next glimpse location l_{t+1}^f and the current action a_t^f. The current action not only determines the activity label ŷ but also affects the environment in some cases, while the location network outputs the location at time t+1 stochastically according to the location policy defined by a Gaussian distribution, parameterized by the location network f_l(h_t^f; θ_l). As it decides the next region to "look at", the location network is the principal component of the recurrent attention unit.

l_{t+1}^f ∼ P(· | f_l(h_t^f; θ_l))    (5)

Similarly, the action network outputs the corresponding action at time t and predicts the activity label given the hidden state h_t^f. The action a_t^f obeys the distribution parameterized by f_a(h_t^f; θ_a). Owing to its prediction function, the network uses a softmax formulation:

a_t^f = f_a(h_t^f; θ_a) = softmax(Linear(h_t^f))    (6)
3.2.3 Recurrent Activity Frame Unit. Activity recognition heavily relies on temporal information. Therefore, besides the single activity frames used by the aforementioned process, we additionally leverage the sequence of activity frames via a recurrent activity frame unit. The hidden layer h_t^f of the core LSTM-a contributes to predicting the action a_t^f and deciding the next glimpse location l_{t+1}^f. For this reason, we believe the hidden state is discriminative enough to make the final prediction for the whole system. In particular, we design an LSTM-f (activity frame stream LSTM) to combine the hidden states of all the frames at the last time step T to predict the activity label and to preserve efficiency. Given the hidden state of the previous frame, the hidden state of each frame is r_f = f_r(h_T^f, r_{f-1}; θ_r), parameterized by θ_r.
3.3 Training and Optimization
Our proposed model depends on the parameters of all its components, including the glimpse network, the recurrent attention network, the two sub-networks, and the activity frame stream recurrent network: Θ = {θ_g, θ_h, θ_a, θ_l, θ_r}. Both the action network and the frame-based recurrent network are based on classification; therefore, their parameters, θ_a and θ_r, can be trained by optimizing the cross-entropy loss with backpropagation. However, the location network should be able to select a sequence of salient regions from activity frames adaptively. Since this network is non-differentiable owing to its stochasticity, and since the problem can also be regarded as a control problem of settling the attention region at the next step, it can be trained with reinforcement learning methods to learn the optimal policies.
We briefly introduce some definitions of reinforcement learning in our case.
• Agent: the brain that makes decisions, which is the location network in our case.
• Environment: the unknown world that may affect the agent's decisions or be influenced by the agent.
• Reward: the feedback from the environment to evaluate the action. In our case, for each frame, the model gives a prediction ŷ = a_t and receives a reward r_t as feedback for the future correction of the prediction after each time step t. Suppose T denotes the number of steps in our attention stream recurrent model; then r_t = 1 if ŷ = y after T steps and 0 otherwise. The target of the optimization is to maximize R = Σ_{t=1}^T r_t.
• Policy: the projection from states to actions, denoted by π(a|s) = P[A_t = a | S_t = s]. To maximize the reward R, we learn an optimal policy π(l_t, a_t | s_{1:t}; Θ) that maps the attention sequence s_{1:t} to a distribution over actions for the current time step, where the policy π is decided by the parameters Θ of the recurrent attention model.
Based on the above discussion, we deploy a Partially Observable Markov Decision Process (POMDP) to solve the training and optimization problem, for which the true state of the environment is unobserved. Let s_{1:t} = x_1, l_1, a_1; ...; x_t, l_t, a_t be the sequence of the input, location and action pairs. This sequence, called an attention sequence, shows the order of the regions our attention focuses on.
To sum up, in our case the location network is formulated as a stochastic process (a Gaussian distribution) parameterized by Θ. Each time after the location selection, the prediction a is evaluated to feed back a reward for conducting the backpropagation training process. This procedure is also known as policy gradient. Our goal is to maximize the simulated rewards using the gradient.
Generally, for a sample x with reward f(x) and probability p(x), we have:

E_x[f(x)] = Σ_x p(x) f(x)    (7)

so that the gradient can be calculated according to the REINFORCE rule [43]:

∇_θ E_x[f(x)] = ∇_θ Σ_x p(x) f(x)
             = Σ_x ∇_θ p(x) f(x)
             = Σ_x p(x) (∇_θ p(x) / p(x)) f(x)
             = Σ_x p(x) ∇_θ log p(x) f(x)
             = E_x[f(x) ∇_θ log p(x)]    (8)
In our case, given the reward R and the attention sequence s_{1:T}, the reward function to be maximized is as follows:

J(Θ) = E_{p(s_{1:T}; Θ)}[Σ_{t=1}^T r_t] = E_{p(s_{1:T}; Θ)}[R]    (9)

By considering the training problem as a POMDP, a sample approximation to the gradient is calculated as follows:

∇_Θ J = Σ_{t=1}^T E_{p(s_{1:T}; Θ)}[∇_Θ log π(y | s_{1:t}; Θ) R]    (10)

where i denotes the i-th training sample, y^(i) is the correct label of the i-th sample, and ∇_Θ log π(y^(i) | s_{1:t}^i; Θ) is the gradient of LSTM-a calculated by backpropagation.

ALGORITHM 2: Overall Process of RAAF
Input: Activity frames from Algorithm 1; T: the number of time steps; F: the number of activity frames.
Output: The prediction results
1: r_last = RandomInitialize()
2: for f from 1 to F do
3:   I = the f-th activity frame
4:   I_CNN = CNN(I)
5:   C = Reshape(Flatten(I_CNN))
6:   h_last = RandomInitialize()
7:   l = RandomInitialize()
8:   for t from 0 to T do
9:     ρ = ExtractRetina(C, l)
10:    glimpse = relu(Linear(Linear(ρ) + Linear(l)))
11:    h = LSTM_attention(glimpse, h_last)
12:    a = softmax(Linear(h))   ▷ trained by cross-entropy and backpropagation
13:    l = tanh(Linear(h))      ▷ trained by Equation 11
14:    h_last = h
15:  end for
16:  r = LSTM_frame(h, r_last)
17: end for
18: activity_label = r
19: return activity_label
We use Monte Carlo sampling, which utilizes randomness to approximate results that are deterministic in theory. Supposing M is the number of Monte Carlo sampling copies, we duplicate the same convolutional activity frames M times and average the outcomes as the prediction results to overcome the randomness in the network, where the M duplications generate M subtly different results owing to the stochasticity. So we have:

∇_Θ J = Σ_{t=1}^T E_{p(s_{1:t}; Θ)}[∇_Θ log π(y | s_{1:t}; Θ) R] ≈ (1/M) Σ_{i=1}^M Σ_{t=1}^T ∇_Θ log π(y^(i) | s_{1:t}^i; Θ) R^(i)    (11)
Therefore, although the best attention sequences are unknown, our proposed model can learn the optimal
policy in the light of the reward.
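A minimal sketch of the resulting hybrid training step is given below: cross-entropy losses for the action and frame-stream networks, and a REINFORCE term for the location network averaged over M Monte Carlo copies as in Equation (11). It assumes the RAAFSketch interface from Section 3.2 and omits reward baselines and other variance-reduction tricks, so it illustrates the update rule rather than the exact training procedure.

import torch
import torch.nn.functional as F

def raaf_training_step(model, optimizer, frames, labels, num_mc=20):
    # One hybrid gradient step: cross-entropy for the differentiable parts and
    # REINFORCE (Eq. 11) for the stochastic location network.
    # Duplicate each sample M times to Monte Carlo sample the glimpse paths.
    frames_mc = frames.repeat_interleave(num_mc, dim=0)
    labels_mc = labels.repeat_interleave(num_mc, dim=0)

    logits, log_pi, step_log_probs = model(frames_mc)

    # Supervised losses: final frame-stream prediction and last per-step action.
    loss_frame = F.cross_entropy(logits, labels_mc)
    loss_action = F.nll_loss(step_log_probs[-1], labels_mc)

    # Reward R = 1 if the final prediction is correct, 0 otherwise.
    reward = (logits.argmax(dim=1) == labels_mc).float()

    # REINFORCE: maximise E[R * sum_t log pi(l_t)], i.e. minimise its negative,
    # averaged over the Monte Carlo copies (Eq. 11).
    loss_reinforce = -(log_pi.sum(dim=0) * reward).mean()

    loss = loss_frame + loss_action + loss_reinforce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()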
To summarize, we propose a dual-stream recurrent convolutional attention model which includes transforming features into activity frames and a dual-stream recurrent model. Firstly, to fully extract the relations between each pair of sensors and modality features, the inputs are innovatively transformed into convolutional activity frames. After that, the model effectively combines attention based recurrent spatial relations and recurrent temporal information to wisely select salient features and perform classification. To further illustrate the process in detail, the overall procedure is shown in Algorithm 2. The experimental results presented next show that the proposed approach outperforms the state-of-the-art HAR methods.
4 EXPERIMENTS
In this section, we present the validation of our proposed method via experiments on two public datasets and another real-world dataset collected by ourselves. First, we describe the datasets used and the experimental setup. Second, we present our investigation of the effect of the hyper-parameters on the classification performance. Third, we compare the accuracy of our proposed method with several state-of-the-art HAR methods, present the confusion matrices on the datasets, and analyze the experimental results. Lastly, we show the interpretability of RAAF and its low dependency on labeled data.
4.1 Datasets and Experimental Settings
We evaluate the proposed method on two public benchmark activity recognition datasets, the PAMAP2 dataset and the MHEALTH dataset, and on the real-world dataset MARS collected by ourselves. These public datasets are the latest available wearable sensor-based datasets with complete annotation and have been widely used in the activity recognition research community.
PAMAP2. The dataset was collected in a constrained setting where 9 participants (1 female and 8 males) performed 12 daily living activities including basic actions (standing, walking) and sportive exercises (running, playing soccer). Six activities were carried out by the subjects optionally. The sensor data were collected at a frequency of 100 Hz from a hardware setup that contains 3 Colibri Inertial Measurement Units (IMUs) attached to the dominant wrist, the chest, and the dominant side's ankle, respectively. In addition, heart rate (bpm) was collected by an HR-monitor at a sampling frequency of 9 Hz. The collected data include two sets of 3-axis accelerometer data (m/s²), 3-axis gyroscope data (rad/s), 3-axis magnetometer data (μT), 3-axis orientation data, and temperature (°C). Specially, temperature is collected from the 3 IMUs, so it is also processed to be 3-axis. Our experiments only consider the high-quality part of the data, including the temperature, accelerometer, gyroscope, and magnetometer data, to ensure effective validation of the experimental results.
MHEALTH. The Mobile Health (MHEALTH) dataset is also devised to benchmark methods of human activity recognition based on multimodal wearable sensor data. Three IMUs were placed on 10 participants' chest, right wrist, and left ankle, respectively, to record the accelerometer (m/s²), gyroscope (deg/s) and magnetometer (local) data while they were performing 12 activities. The IMU on the chest also collected 2-lead ECG data (mV) to monitor the electrical activity of the heart. All sensing modals were recorded at a frequency of 50 Hz.
MARS. Our new dataset, the Multimodal Activity Recognition with Sensing (MARS) dataset, was collected while 8 participants (6 males, 2 females) were doing 5 basic activities (sitting, standing, walking, ascending stairs and descending stairs). Three IMU sensors, Phidget Spatial 3/3/3 [34], were attached to the dominant wrist, the waist, and the dominant side's ankle, respectively, to collect 3-axis accelerometer data (gravitational acceleration g), 3-axis gyroscope data (°/s), and 3-axis magnetometer data (nT). Since participants went up and down through the same flight of stairs during data collection, the magnetometer data contain signals of two opposite directions. To avoid the errors resulting from the opposite data, we excluded the magnetometer data from activity recognition. All IMUs collected data at a frequency of 70 Hz.
Similar to [15], the experiments conducted on the two public datasets perform the background activity recognition task [38]. The activities are categorized into 6 classes: lying, sitting/standing, walking, running, cycling and other activities. To tackle the task and ensure rigor, all experiments are performed with Leave-One-Subject-Out (LOSO) cross-validation, which also tests person independence during the evaluation. The evaluation results are measured by accuracy (%), one of the most commonly used performance measures for classification tasks.
Here, we describe the common design for all the experiments but leave the hyper-parameter study to the next section.
Convolutional Network: The convolutional network has three sections. Each of the first two sections is composed of one convolutional layer with a kernel size of 3x3, one rectified linear unit (ReLU) layer that applies the non-saturating activation function relu(ν) = max(ν, 0), and one max pooling layer with a kernel size of 1x3 and a stride of 1x3. The third section has a fully connected layer built on the flattened results of the second section. The size of the fully connected layer depends on the size of the input activity frame because the output is reshaped to another 2-D matrix C_f with the same size as I_f.
Glimpse Network: The glimpse network has three fully connected layers defined as g_t^{l,f} = Linear(l_t^f), g_t^{ρ,f} = Linear(ρ(I_f, l_t^f)), and g_t^f = relu(Linear(g_t^{l,f} + g_t^{ρ,f})), respectively. The dimensionalities of g_t^{ρ,f}, g_t^{l,f} and g_t^f are 128, 128 and 220 in our experiments.
Action and Location Networks: The action network only has one fully connected layer, while the policy for the location network is defined by a dual-component Gaussian with a variance fixed to 0.22. The location network outputs the location at time t+1 stochastically according to the location distribution, which is defined as l_{t+1}^f = tanh(Linear(h_t^f)).
Two Recurrent Networks: The proposed method has two recurrent networks. One is the attention based LSTM with a cell size of 100; its number of time steps is 40, which defines the number of glimpses. The other is the frame-based recurrent network, which has an LSTM with a cell size of 1000; its number of time steps is set to 5, which decides the number of frames utilized to perform the recognition task.
4.2 Hyper-Parameter Study
In this section, we mainly analyze the four most influential hyper-parameters, to which the model is more sensitive in our experiments, namely the size of the glimpse window (width and height), the size of the glimpse output g_t^f, the number of copies for Monte Carlo sampling, and the number of glimpses. For the other hyper-parameters, we use fixed empirical values as suggested in the previous subsection. The variation trends are shown in Figure 4, Figure 5 and Figure 6.
Taking Figure 4 as an example, we first tune the width and the height of the glimpse window to figure out their relationship, as shown in Figure 4 (a). Specifically, there are 13 3-axis vectors to present the temperature, accelerometer, gyroscope and magnetometer data in our experiments. After Algorithm 1, 78x9 activity frames are generated. Figure 4 (a) shows that the accuracy is best when the glimpse window size is 64x16, and there is an obvious "ridge" along which the whole figure is almost symmetric. All the points on the symmetric line are in a ratio of 4:1. This suggests that the approach favors a fixed ratio of the two dimensions of the glimpse window, even though the activity frame size has a ratio of 78:9. We can also see that Figure 5 (a) and Figure 6 (a) both show the "ridge", while their optimal glimpse window sizes differ because of the different sizes of their activity frames.
Figure 4 (b), (c) and (d) show the experimental results of our studies on the effect of the other three hyper-parameters: the size of the glimpse network, the number of copies for Monte Carlo sampling and the number of glimpses. In particular, Figure 4 (b) and (d) present similar trends: the accuracy increases remarkably at first and keeps rising slowly (Figure 4 (b)) or remains stable (Figure 4 (d)) after a turning point. However, for the Monte Carlo sampling, too low or too high values lead to worse performance, as Figure 4 (c) shows. Considering that the computational complexity increases with larger values of these hyper-parameters, a trade-off between accuracy and computational complexity is necessary, especially for Monte Carlo sampling. Therefore, we simply select the points slightly after the turning points (220, 20, 30) as the optimal parameters for our following experiments. We also note that the variation trends in Figure 5 and Figure 6 follow the same patterns.
Fig. 4. Experimental Results for Hyper-Parameter Tuning on PAMAP2
4.3 Accuracy Comparison and Performance Analysis
To evaluate the performance of the proposed approach, RAAF, we conduct extensive experiments to compare its performance with state-of-the-art methods on PAMAP2 and MHEALTH. We select four other state-of-the-art, multimodal feature-based approaches (MARCEL [15], FEM [26], CEM [16] and MKL [2]) and five baseline methods (Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Decision Tree (DT) and a single Neural Network) to show the competitive power of the proposed method. To ensure a fair comparison, the best tested parameters of RAAF are used on both datasets; the best trade-off parameter (λ = 0.7) is deployed for MARCEL; time-domain features including mean, variance, standard deviation and median, and frequency-domain features including entropy and spectral entropy, are utilized for FEM; an independent kernel is defined for each modality feature group for MKL; and for the other baseline methods, all modality features are deployed. All adopted parameters follow those suggested in the literature. The results in Table 1 show that the proposed RAAF outperforms all the state-of-the-art methods and the baseline methods.
Fig. 5. Experimental Results for Hyper-Parameter Tuning on MHEALTH

Table 1. Comparison among RAAF, four state-of-the-art methods and five baseline methods (accuracy, %). For the PAMAP2 dataset, accelerometer, gyroscope and magnetometer data are utilized. For the MHEALTH dataset, ECG data are considered additionally.

Dataset    RAAF   MARCEL [15]   FEM+SVM [26]   CEM [16]   FEM+MKL [2, 26]   SVM    RF     KNN    DT     Single NN
PAMAP2     83.4   82.8          76.4           81         81.6              59.3   64.7   70.3   57.8   72.0
MHEALTH    94.0   92.3          70.7           74.8       90.6              68.7   82.5   86.1   78.7   89.1

Fig. 6. Experimental Results for Hyper-Parameter Tuning on MARS

To further explain the accuracy of RAAF on each specific activity, Figure 7 (a) and (b) show the confusion matrices on both public datasets performing the background activity recognition task. The results show the proposed approach performs well for most activities such as lying, sitting and standing, and cycling. However, more misclassifications occur on activities that have patterns similar to the background activities, such as walking, ascending stairs and descending stairs, due to the constraint of the background activity recognition task ("others"). This pattern can also be seen in Figure 7 (c), where sitting and standing are clearly classified while walking and ascending or descending stairs appear to be slightly confused. To present the effectiveness of our method on other activities, Figure 8 shows the confusion matrices on both public datasets performing the all-activity recognition task, which defines separate classes for each of the 12 activities [38] on PAMAP2 and MHEALTH. From Figure 8 we observe that on the PAMAP2 dataset the model works well for most activities but confuses running, ascending & descending stairs and rope jumping because of their similar patterns. On the MHEALTH dataset, the performance is remarkable except for some misclassifications among knee bending, cycling and jogging.
Fig. 7. The confusion matrices of RAAF for background activity recognition on three datasets
Fig. 8. The confusion matrices of RAAF for all activity recognition on two public datasets
Table 2. Feature Extraction Capability of Activity Frames (accuracy, %)

                   PAMAP2 Dataset   MHEALTH Dataset   MARS Dataset
  Original Frames  81.35            92.20             77.25
  Activity Frames  83.42            94.04             85.28

Table 3. Latency Analysis on Three Datasets (per test sample)

  PAMAP2 Dataset   MHEALTH Dataset   MARS Dataset
  0.68s            0.72s             0.59s
We prove the eectiveness of our activity frames by deploying the dual-stream recurrent convolutional
attention model on original features. To adapt features to the proposed model, multimodal features are stacked to
form original frames, as Figure 2(a) shows. Table 2 presents feature extraction capability of activity frames, which
shows that the proposed model based on the original frames outperforms most of the state-of-the-art methods
(listed in Table 1) even without activity frames. But utilizing the activity frames can signicantly improve the
performance of original model due to the availability of the full relationship among features provided by activity
frames.
As latency is a critical indicator of the applicability of HAR systems in practice, Table 3 reports the latency for
testing one sample on each of the three datasets. All latencies are below 1 second, which we believe is acceptable
in realistic application scenarios.
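A per-sample latency of this kind can be estimated by averaging repeated timed forward passes; the sketch below illustrates one common way to do so and is not the benchmarking code used for Table 3 (the prediction function is a stand-in).

import time
import numpy as np

def measure_latency(predict_fn, sample, repeats=100):
    # Warm up once so one-off initialization cost is excluded from the timing.
    predict_fn(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(sample)
    return (time.perf_counter() - start) / repeats

# `dummy_predict` stands in for the trained model's inference call.
dummy_predict = lambda x: np.tanh(x).sum()
sample = np.random.randn(64, 23)
print(f"{measure_latency(dummy_predict, sample):.4f} s per sample")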
4.4 Model Interpretability
One of the merits of our method is its interpretability. In wearable sensor-based activity recognition, subjects
usually wear more than one sensor on dominant body parts such as the arms, chest, and ankles, and each sensor
produces multimodal data. An advantage of attention mechanisms is that they feed the glimpse location back at
each time step. Owing to the structure of the activity frames, the attention model in our scenario not only reveals
the specific body parts it focuses on but also highlights the sensors and modalities that contribute most to different
activities. In this section, we only present the experimental results for running, walking, and lying down on the
MHEALTH dataset for simplicity. The available sensors on MHEALTH include ECG, a chest accelerometer, an ankle
accelerometer, gyroscope, and magnetometer, and an arm accelerometer, gyroscope, and magnetometer. Figure 9 shows
Table 4. Modality Involvement on the MHEALTH Dataset (%). (Acc, Gyro, Magn denote accelerometer, gyroscope, and magnetometer, respectively.)

Activity   | ECG   | Acc_chest | Acc_ankle | Gyro_ankle | Magn_ankle | Acc_arm | Gyro_arm | Magn_arm
running    | 21.98 | 10.22     | 30.55     | 15.86      | 4.30       | 6.49    | 5.03     | 5.57
walking    | 7.23  | 11.05     | 18.78     | 19.26      | 8.72       | 19.66   | 9.46     | 5.83
lying down | 6.58  | 13.45     | 16.34     | 10.23      | 17.92      | 10.29   | 10.72    | 14.47
the glimpse heatmap for all sensors. Taking running as an example, we can observe that the ankle is the most active
body part. The chest also contributes substantially, while the arm is involved the least. To further quantify the
involvement of each sensor modality, Table 4 summarizes the percentage of glimpses our model directs at each modality
during the last 120 of 200 glimpses. It shows that for running, the most salient modality is ankle acceleration, which
accounts for 30.55%; ECG and ankle gyroscope data are also significant. These results agree with the intuition that
while running, the most active body parts are the legs and ankles, and that ECG readily distinguishes strenuous
exercise such as running from other activities. Moreover, since the model still glimpses at the other modalities
several times, this supports the claim that our model minimizes information loss.
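A sketch of how such involvement percentages can be tallied is given below; it assumes, purely for illustration, that each glimpse can be reduced to a column index of the activity frame and that contiguous column ranges correspond to individual sensor modalities. The mapping, names, and ranges are hypothetical, not the paper's actual layout.

import numpy as np
from collections import Counter

# Assumed mapping from frame column ranges to modalities (illustrative only).
MODALITY_COLUMNS = {
    "ECG": range(0, 2),
    "Acc_chest": range(2, 5),
    "Acc_ankle": range(5, 8),
    "Gyro_ankle": range(8, 11),
    "Magn_ankle": range(11, 14),
    "Acc_arm": range(14, 17),
    "Gyro_arm": range(17, 20),
    "Magn_arm": range(20, 23),
}

def modality_involvement(glimpse_cols, last_n=120):
    # glimpse_cols: column index of each glimpse over one run (e.g., 200 steps).
    counts = Counter()
    for col in glimpse_cols[-last_n:]:
        for name, cols in MODALITY_COLUMNS.items():
            if col in cols:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}

glimpse_cols = list(np.random.randint(0, 23, size=200))   # dummy glimpse trace
print(modality_involvement(glimpse_cols))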
Fig. 9. Glimpse Heatmap
4.5 Labeled Data Dependency
Obtaining sufficient labeled data is generally regarded as one of the most serious challenges in human activity
recognition, owing to the considerable annotation expense and the risk of violating user privacy. Semi-supervised
[47] or weakly-supervised methods [28] may take advantage of unlabeled data but meanwhile incur extra cost [47].
In contrast, we propose to maximize the utilization of features and achieve the best performance at the least cost.
With activity frames fully extracting the information among features, the attention model focusing on the most
salient data, and the frame-based recurrent network studying the temporal pattern in detail, RAAF is able to
reduce the dependency on labeled data significantly. As Figure 10 shows, although accuracy decreases with
less labeled data, the downtrend is slow until the amount of labeled data is reduced to 1000 on both datasets,
and even 5000 labeled samples deliver a relatively satisfactory accuracy. Since the experiments adopt
Leave-One-Subject-Out (LOSO) cross-validation, with 7 subjects' data for 6 activities on the PAMAP2 dataset,
8 subjects on the MHEALTH dataset, and 6 subjects for 5 activities on MARS used for training, only 119, 104, and
166 labeled samples are needed per subject and per activity on PAMAP2, MHEALTH, and MARS, respectively. This
fully validates the low dependency of our method on labeled data.
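These per-subject, per-activity counts follow from dividing the 5000-sample budget by the number of training subjects and activity classes; a small sketch of that arithmetic is given below (the MHEALTH class count is inferred from the reported figure of 104 and is an assumption).

# Per-subject, per-activity labeled samples under LOSO with a 5000-sample budget.
budget = 5000
settings = {
    "PAMAP2":  (7, 6),   # (training subjects, activity classes)
    "MHEALTH": (8, 6),   # class count inferred, see note above
    "MARS":    (6, 5),
}
for name, (subjects, classes) in settings.items():
    print(name, budget // (subjects * classes))
# -> PAMAP2 119, MHEALTH 104, MARS 166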
Fig. 10. Labeled Data Dependency
5 CONCLUSION
This paper proposes an innovative human activity recognition approach, RAAF, which includes (a) a novel
form of multimodal sensor features, activity frames, which fully extract the relations between each pair of sensor
and modality features, and (b) a dual-stream recurrent convolutional attention model that combines recurrent
attention and recurrent activity frames. The experiments show that our method outperforms state-of-the-art
methods and has low annotation dependency, substantially reducing the requirement for labeled data. The method
also offers good interpretability despite the usually opaque nature of neural networks. Furthermore, we evaluate
the method on a real-world dataset collected by ourselves to validate its applicability in practical situations. Given
the encouraging results, achieving higher performance with RAAF in more complex HAR scenarios is a promising
direction for future work.
REFERENCES
[1] Bogdan Alexe, Nicolas Heess, Yee W Teh, and Vittorio Ferrari. 2012. Searching for objects driven by context. In Advances in Neural Information Processing Systems. 881–889.
[2] Salah Althloothi, Mohammad H Mahoor, Xiao Zhang, and Richard M Voyles. 2014. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition 47, 5 (2014), 1800–1812.
[3] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A Public Domain Dataset for Human Activity Recognition using Smartphones. In ESANN.
[4] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. 2014. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living. Springer, 91–98.
[5] Oresti Banos, Claudia Villalonga, Rafael Garcia, Alejandro Saez, Miguel Damas, Juan A Holgado-Terriza, Sungyong Lee, Hector Pomares, and Ignacio Rojas. 2015. Design, implementation and validation of a novel open framework for agile development of mobile health applications. Biomedical Engineering Online 14, 2 (2015), S6.
[6] Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing. Springer, 1–37.
[7] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46, 3 (2014), 33.
[8] Nicholas J Butko and Javier R Movellan. 2008. I-POMDP: An infomax model of eye movement. In Development and Learning, 2008. ICDL 2008. 7th IEEE International Conference on. IEEE, 139–144.
[9] Liming Chen, Jesse Hoey, Chris D Nugent, Diane J Cook, and Zhiwen Yu. 2012. Sensor-based activity recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 790–808.
[10] Li Deng. 2014. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing 3 (2014).
[11] Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. 2012. Learning where to attend with deep architectures for image tracking. Neural Computation 24, 8 (2012), 2151–2184.
[12] Marcus Edel and Enrico Köppe. 2016. Binarized-BLSTM-RNN based Human Activity Recognition. In Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on. IEEE, 1–7.
[13] Hongqing Fang and Chen Hu. 2014. Recognizing human activity in smart home using deep learning algorithm. In Control Conference (CCC), 2014 33rd Chinese. IEEE, 4716–4720.
[14] Yu Guan and Thomas Plötz. 2017. Ensembles of Deep LSTM Learners for Activity Recognition Using Wearables. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 2, Article 11 (June 2017), 28 pages. https://doi.org/10.1145/3090076
[15] Haodong Guo, Ling Chen, Liangying Peng, and Gencai Chen. 2016. Wearable sensor based multimodal human activity recognition exploiting the diversity of classifier ensemble. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 1112–1123.
[16] Haodong Guo, Ling Chen, Yanbin Shen, and Gencai Chen. 2014. Activity recognition exploiting classifier level fusion of acceleration and physiological signals. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 63–66.
[17] Sojeong Ha, Jeong-Min Yun, and Seungjin Choi. 2015. Multi-modal convolutional neural networks for activity recognition. In Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on. IEEE, 3017–3022.
[18] Albert Haque, Alexandre Alahi, and Li Fei-Fei. 2016. Recurrent attention models for depth-based person identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1229–1238.
[19] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 7 (2006), 1527–1554.
[20] Masaya Inoue, Sozo Inoue, and Takeshi Nishida. 2016. Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput. arXiv preprint arXiv:1611.03607 (2016).
[21] Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 1307–1310.
[22] Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu, and Yann LeCun. 2010. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems. 1090–1098.
[23] Kai Kunze and Paul Lukowicz. 2008. Dealing with sensor displacement in motion-based onbody activity recognition systems. In Proceedings of the 10th International Conference on Ubiquitous Computing. ACM, 20–29.
[24] Zhihui Lai, Yong Xu, Qingcai Chen, Jian Yang, and David Zhang. 2014. Multilinear sparse principal component analysis. IEEE Transactions on Neural Networks and Learning Systems 25, 10 (2014), 1942–1950.
[25] Nicholas D Lane, Petko Georgiev, and Lorena Qendro. 2015. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 283–294.
[26] Oscar D Lara, Alfredo J Pérez, Miguel A Labrador, and José D Posada. 2012. Centinela: A human activity recognition system based on acceleration and vital sign data. Pervasive and Mobile Computing 8, 5 (2012), 717–729.
[27] Hugo Larochelle and Geoffrey E Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems. 1243–1251.
[28] Chandrashekhar Lavania, Sunil Thulasidasan, Anthony LaMarca, Jeffrey Scofield, and Jeff Bilmes. 2016. A weakly supervised activity recognition framework for real-time synthetic biology laboratory assistance. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 37–48.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[30] Yongmou Li, Dianxi Shi, Bo Ding, and Dongbo Liu. 2014. Unsupervised feature learning for human activity recognition using smartphone sensors. In Mining Intelligence and Knowledge Exploration. Springer, 99–107.
[31] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems. 2204–2212.
[32] Shih Yin Ooi, Andrew Beng Jin Teoh, Ying Han Pang, and Bee Yan Hiew. 2016. Image-based handwritten signature verification using hybrid methods of discrete radon transform, principal component analysis and probabilistic neural network. Applied Soft Computing 40 (2016), 274–282.
[33] Juha Parkka, Miikka Ermes, Panu Korpipaa, Jani Mantyjarvi, Johannes Peltola, and Ilkka Korhonen. 2006. Activity classification using realistic data from wearable sensors. IEEE Transactions on Information Technology in Biomedicine 10, 1 (2006), 119–128.
[34] I Phidgets. 2010. 1056-PhidgetSpatial 3/3/3. Code Samples For This Product (2010).
[35] Thomas Plötz, Nils Y Hammerla, and Patrick Olivier. 2011. Feature learning for activity recognition in ubiquitous computing. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22. 1729.
[36] Valentin Radu, Nicholas D Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar. 2016. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, 185–188.
[37] Attila Reiss and Didier Stricker. 2012. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 40.
[38] Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 108–109.
[39] Monit Shah Singh, Vinaychandran Pondenkandath, Bo Zhou, Paul Lukowicz, and Marcus Liwicki. 2017. Transforming Sensor Data to the Image Domain for Deep Learning - an Application to Footstep Detection. In Neural Networks (IJCNN), 2017 International Joint Conference on. 3017–3022.
[40] Emmanuel Munguia Tapia, Stephen S Intille, William Haskell, Kent Larson, Julie Wright, Abby King, and Robert Friedman. 2007. Real-time recognition of physical activities and their intensities using wireless accelerometers and a heart rate monitor. In Wearable Computers, 2007 11th IEEE International Symposium on. IEEE, 37–40.
[41] Aiguo Wang, Guilin Chen, Cuijuan Shang, Miaofei Zhang, and Li Liu. 2016. Human Activity Recognition in a Smart Home Environment with Stacked Denoising Autoencoders. In International Conference on Web-Age Information Management. Springer, 29–40.
[42] Hongsong Wang and Liang Wang. 2017. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[43] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[44] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 1-3 (1987), 37–52.
[45] Yaser Yacoob and Michael J Black. 1998. Parameterized modeling and recognition of activities. In Computer Vision, 1998. Sixth International Conference on. IEEE, 120–127.
[46] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI. 3995–4001.
[47] Lina Yao, Feiping Nie, Quan Z Sheng, Tao Gu, Xue Li, and Sen Wang. 2016. Learning from less for better: semi-supervised activity recognition via shared structure discovery. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 13–24.
[48] Ming Zeng, Le T Nguyen, Bo Yu, Ole J Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neural networks for human activity recognition using mobile sensors. In Mobile Computing, Applications and Services (MobiCASE), 2014 6th International Conference on. IEEE, 197–205.
[49] Mi Zhang and Alexander A Sawchuk. 2012. USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 1036–1043.
[50] Maria Zontak, Inbar Mosseri, and Michal Irani. 2013. Separating signal from noise using patch recurrence across scales. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1195–1202.