Fullie and Wiselie: A Dual-Stream Recurrent Convolutional
Attention Model for Activity Recognition
KAIXUAN CHEN∗, University of New South Wales
LINA YAO, University of New South Wales
TAO GU, RMIT University
ZHIWEN YU, Northwestern Polytechnical University
XIANZHI WANG, University of New South Wales
DALIN ZHANG, University of New South Wales
Multimodal features play a key role in wearable sensor based Human Activity Recognition (HAR). Selecting the most salient features adaptively is a promising way to maximize the effectiveness of multimodal sensor data. In this regard, we propose a "collect fully and select wisely (Fullie and Wiselie)" principle as well as a dual-stream recurrent convolutional attention model, Recurrent Attention and Activity Frame (RAAF), to improve the recognition performance. We first collect modality features and the relations between each pair of features to generate activity frames, and then introduce an attention mechanism to select the most prominent regions from activity frames precisely. The selected frames not only maximize the utilization of valid features but also reduce the number of features to be computed effectively. We further analyze the hyper-parameters, accuracy, interpretability, and annotation dependency of the proposed model based on extensive experiments. The results show that RAAF achieves competitive performance on two benchmarked datasets and works well in real life scenarios.
Additional Key Words and Phrases: Human Activity Recognition, wearable sensors, attention mechanism, recurrent neural networks, reinforcement learning
ACM Reference format:
Kaixuan Chen, Lina Yao, Tao Gu, Zhiwen Yu, Xianzhi Wang, and Dalin Zhang. 2017. Fullie and Wiselie: A Dual-Stream
Recurrent Convolutional Attention Model for Activity Recognition. 1, 1, Article 1 (November 2017), 22 pages.
https://doi.org/0000001.0000001
1 INTRODUCTION
Human Activity Recognition (HAR) plays a key role in several research fields. It has gained broad attention due to the increasing popularity of ubiquitous environments, especially in health care and surveillance domains [3, 49]. Generally, HAR diverges into two categories of approaches: vision-based activity recognition [42] and sensor-based activity recognition [9]. The sensor-based approach has several advantages over the vision-based approach and has seen diverse applications including health monitoring and motion sensing games.
• Compared with cameras, wearable sensors are not usually confined by environmental constraints such as illumination, viewpoint, and setup cost [7].
∗Corresponding author.
• Sensor data obtained from wearable devices are typically of higher quality, and complicated feature extraction is not necessary compared with image data.
• Wearable sensors only detect the data that are strongly related to the dynamics of human motions. Therefore, the collected sensor data do not violate human privacy, whereas image data do.
Despite a large number of sensor-based recognition solutions being proposed over the decade, we discover several limitations. First, there is still a lack of comprehensive model representation for sensor signals in a way that different activities can be distinguished in a more expressive and effective way. With the recent advances in deep neural networks and the notable performance achieved by these methods in the HAR community [15, 17], the Convolutional Neural Network (CNN) appears to be a promising candidate for building such models. However, while CNN does well in capturing spatial relationships of features, it focuses merely on the features covered by the convolutional kernels but overlooks the correlation among non-adjacent features [22]. Considering that most of the data collected by wearable sensors such as accelerometers and gyroscopes are tri-axis, in this paper, we transform sensor signals into a new activity frame which not only captures the relationships between each pair of tri-axis signals but also contains the relationship between each pair of single signals. The experiments show that our new representation is far more discriminative than traditional representations.
Second, the demerits of interperson variability and interclass similarity can greatly reduce system performance [7]. Interperson variability comes from the fact that the same activity can be performed differently by different people, and interclass similarity results from the similarity in the behavior patterns of different activities like walking and running. Both of the above issues require the classifier to be task dependent, i.e., it should automatically extract the salient information indicative of the true activity and ignore the interclass similarity. To this end, we propose an attention based model, which is directly related to the HAR task, to address the problems of interperson variability and interclass similarity.
Attention is originally a concept in biology and psychology that implies focusing the power of noticing or thinking on something special to achieve better cognitive processes. The attention mechanism has several advantages, the first being task dependence. Intuitively, the motion of different body parts contributes differently to different activities [42, 45]. For example, jumping mostly involves legs while running is related to both arms and legs. More specifically, recognizing the patterns of walking depends more on the acceleration of legs while distinguishing sitting from lying relies more on the orientation. In this paper, we separate the data related to each body part into different modals, namely accelerometer data, gyroscope data and magnetometer data, respectively. With the help of activity frames, we can analyze not only the independent modals but also their correlations thoroughly. Here, the attention mechanism ensures that the system only focuses on the most contributing data and ignores the irrelevant sensors or modals.
The second advantage of the attention mechanism is that it opens the black box of deep neural networks to a certain degree. While the inner mechanisms of neural networks remain implicit, interpretable neural networks are becoming another trend in the machine learning and data mining fields. Taking convolutional neural networks as an example, when using them to recognize a dog in an image, we tend to explicitly know that one filter distinguishes the dog's head and another filter identifies the dog's paw. Back to activity recognition, the attention model not only provides the specific body parts it focuses on but also highlights the sensors and modals that contribute most to distinguishing diverse activities. The salient sensor data can be inferred from the glimpse patch (to be detailed in Section 3.2.1).
The third advantage is that it reduces the computational cost significantly. Usually, the dimension of the features expands as we extract the full spatial relationships among sensors, and the cost increases with the dimension of the input data. Most existing models process the entire data every time, resulting in high computational cost. Some works [24, 32, 44] aim to limit the input dimension using techniques such as dimensionality reduction and feature selection. However, feature processing comes with information loss, leading to a new trade-off
problem between accuracy and cost. Inspired by human attention, our proposed method focuses on only one
small patch of the data each time and goes to the next patch when necessary. This method considerably reduces
computational cost as well as information loss.
In this paper, we tackle the HAR problem by transforming wearable sensor data into activity frames and deploying a dual-stream recurrent convolutional attention model, including one attention stream and one activity frame stream, to recognize activities. The main contributions of this work are summarized as follows:
• We transform the tri-axis sensor data into activity frames to extract the full relationships between data pairs. This enables the convolutional neural network to cover all features without overlooking any relationships between data pairs. Furthermore, the activity frames are encoded into convolutional activity frames in order to extract high-level features. Our model uses a single convolutional layer to encode low level data. This layer is simple yet generates an effective representation to characterize the local salience of the sensor data.
• We propose a dual-stream recurrent model including one attention stream and one activity frame stream to recognize activities. Firstly, the system focuses on only a small patch of the activity frame that contains the most salient information, to avoid unnecessary cost on less important areas, by leveraging the recurrent attention model combined with reinforcement learning. Secondly, we deploy a long short-term memory network to exploit spatial and temporal information in time-series signals and capture the dynamics of the sensor data.
• We examine our model on two public benchmarked datasets, PAMAP2 [37, 38] and MHEALTH [4, 5], and perform extensive comparison with other methods, as well as re-examine our approach on a new dataset collected in the real world named MARS. The experimental results show that our proposed model consistently outperforms a series of baselines and state-of-the-art methods over the three datasets.
The remainder of this paper is organized as follows. Section 2 briefly introduces the existing wearable sensor based HAR methods and attention based models. Section 3 details the proposed model. Section 4 evaluates the proposed approach and compares it with state-of-the-art methods on two public datasets and one new dataset collected in the real world; in that section, we also analyze the experimental results in light of accuracy, interpretability, latency and annotation dependency. Section 5 concludes this paper.
2 RELATED WORK
In this section, owing to the prevalence and outstanding performance of deep learning for HAR in recent years, we aim at giving a comprehensive review of the existing work related to deep learning for human activity recognition. We also briefly introduce the attention mechanisms used in previous works to study salient features.
2.1 Deep Learning for Human Activity Recognition
Wearable sensor based human activity recognition is essentially a problem of projecting low-level sensor data to high-level activity knowledge. In our work, one basic challenge behind the "collect and select" principle is how to deeply extract features adaptive to the classification tasks and obtain the most discriminative representations. Some works employ traditional machine learning methods working on heuristic hand-crafted features [6, 46], which not only requires domain knowledge about activity recognition but also may potentially lead to critical limitations like error-prone bias that hinders the performance. Recently, since deep learning has embraced massive success in many fields [29], a flurry of research has emerged providing deep learning based solutions to various heterogeneous human activity recognition problems. The state-of-the-art deep learning based methods have made tremendous progress in improving recognition performance and are widely used in either the feature extraction or the classification process of HAR. The rationale of this evolution is that deep learning is able to automatically extract adaptive features and spare the effort of manually extracting features and designing classifiers in detail.
Enlightened by the work in [10], we group the deep learning algorithms for human activity recognition into two categories: generative deep architectures, including the deep belief network, the restricted Boltzmann machine and the autoencoder, and discriminative deep architectures, containing the convolutional neural network and the recurrent neural network. We overview the recent representative works as follows.
2.1.1 Generative Deep Architectures. Some existing deep learning based activity recognition solutions utilize generative deep architectures for feature extraction and for deriving more discriminative representations. One of the most widely used architectures is the autoencoder. Briefly, an autoencoder is usually a simple 3-layer neural network where the output units are directly related to the input units and feed back a latent representation of the input. The motivation of the autoencoder is to learn a higher-level representation that omits noise and enhances effective information. In [30], Li et al. propose to learn features by using a sparse autoencoder that adds sparsity constraints, i.e., KL divergence, to achieve better performance in activity recognition. Wang et al. [41] adopt greedy pretraining for a stacked autoencoder and integrate the feature extraction process and the classifier into one architecture to jointly train them by fine-tuning the parameters.
Another widely used generative deep architecture is the Restricted Boltzmann Machine (RBM) [19]. RBM shares a similar architecture with the autoencoder. The difference lies in that it uses a stochastic approach: it uses stochastic units with specific distributions, such as Gaussian or binary distributions, instead of deterministic activation functions. The authors in [35] first propose to deploy RBM to study feature representations for activity recognition. Inspired by this, a sequence of works take RBM as a means to extract features for HAR. For example, [13] exploits an improved training process for RBM, utilizing contrastive gradients to fine-tune the parameters and accelerate training. [25] employs a Gaussian layer for the first layer of their RBM model and binary layers for the rest. Furthermore, [36] considers multimodal sensor data and designs a multimodal RBM so that each modality has an individual RBM.
Generative deep architectures enjoy the merits of unsupervised learning and high-quality representations. However, they require unwanted pretraining while our target is to construct an end-to-end model. In comparison, discriminative deep architectures are more applicable and popular in previous works.
2.1.2 Discriminative Deep Architectures. Discriminative deep architectures distinguish patterns by calculating the posterior distributions of classes based on annotated data [10]. Existing research can be categorized into two main directions: convolutional neural networks and recurrent neural networks.
According to [29], the theories behind the convolutional neural network include sparse interactions, parameter sharing and equivariant representations. Usually, a convolutional neural network contains (a) convolutional layers that create convolution kernels which are convolved with the layer input over a single spatial dimension to produce a tensor of outputs; (b) rectified linear unit (ReLU) layers that apply a non-saturating activation function to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layers; and (c) max pooling layers that down-sample the input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. After these, there are usually (d) fully-connected layers which perform classification or regression tasks, so that CNNs can learn hierarchical representations or high-performance classifiers.
For HAR, stemming from the time-series characteristics, CNN can be used with 1D convolution and 2D convolution to combine temporal information. 1D convolution treats each axis of sensor data as a channel, and flattens and unifies the outputs of the channels into one. One example is [48], where the authors propose to treat each axis of the accelerometer as one channel and conduct the convolutional process individually. On the contrary, 2D convolution transforms the input into 2D matrices and considers them as images. In [17], Ha et al. simply generate data images by combining all axis data. After that, the authors in [21] additionally consider temporal information and yield 2D time-series images. Furthermore, [39] harnesses multimodal sensor data that integrates pressure sensor data and applies a 2D convolutional neural network.
However, these works require massive domain knowledge when conducting the transformation, which is not feasible in more general situations. In comparison, the activity frames proposed in this work not only consider temporal information and fully extract spatial relations but also are applicable to most multimodal sensor data with better generalization and adaptivity.
The recurrent neural network (RNN) has been proved to be effective in fields that contain significant temporal information, such as speech recognition and natural language processing, which is also the reason why RNN is applicable to HAR. Different from CNN, which only takes a single vector or matrix as input, RNN requires a sequence of vectors or matrices as input while each sequence has one corresponding class label. With each recurrent layer considering both the output of the previous layer and the input vector or matrix at the current layer, RNN thoroughly analyzes the sequences step by step. To achieve better performance, LSTM (long short-term memory) cells are introduced and usually combined with RNN. Some previous works utilize RNN in the HAR field [14]. In spite of the competitive performance, the time consumption and computational cost have caused concern. To adapt RNN to the HAR field, where instantaneity is an important issue for developing real applications, [20] proposed a new model which can perform RNN for HAR with high efficiency. [12] proposed a binarized-BLSTM RNN model that simplifies all the parameters, input, and output to be binary to save the consumption.
In this paper, we innovatively propose a dual-stream recurrent neural network which not only considers temporal information as conventional works do but also leverages attention mechanisms, which are introduced next.
2.2 Attention Mechanisms
In our work, besides conventional deep learning approaches including convolutional neural networks and recurrent neural networks, we also resort to attention mechanisms to facilitate selecting the most salient features. Tracing back the history of selecting effective regions using attention mechanisms or similar theories, some works in the field of computer vision [1, 11, 27] formulate the selection process as a sequential decision task. In these works, the systems decide where to focus step by step based on the previous decisions and the whole environment. [8] constructs a policy gradient formulation to simulate eye movement, formulating eye-movement control as a problem in stochastic optimal control based on a model of visual perception. However, the overly strict constraints on the RNN limit the performance. [11, 27] further combine attention mechanisms with deep learning algorithms. [11] selects foveated images by controlling the location, orientation, scale and speed of the attended object; to minimize the selection uncertainty, they propose a decision-theoretic probabilistic graphical model based on RBM. Taking policy gradient formulations and deep learning into consideration, [31] proposed the recurrent attention model (RAM) for image classification with a formulation similar to [8] but less restrictive, and leverages RNN as well. Inspired by [31], we propose a dual-stream recurrent convolutional attention model. So far, to the best of our knowledge, our work is the first one to introduce attention mechanisms to the HAR field. As feature relations are fully extracted and represented in activity frames, the attention based model wisely selects salient regions to perform activity recognition.
3 OUR MODEL
To fully collect effective information and wisely select salient features, our model contains two parts: (a) feature extraction, which first transforms wearable sensor data into 2-D matrices and uses a convolutional layer to derive higher-level features; (b) a dual-stream recurrent model including one attention stream and one activity frame stream for activity recognition. Our attention stream recurrent model simulates the procedure of human brains processing visual information within several glimpses. In addition, we introduce reinforcement learning to decide which part of the activity frames it should glimpse next. The other stream is the activity frame stream. Since activity recognition largely depends on temporal information and activity frames naturally capture serial relations, an activity frame based model is more suitable for our scenarios.
The above process is presented as a three-dimensional model in Figure 1, where the time step t and the frame f represent the attention stream and the activity frame stream in our dual-stream method, respectively.
Fig. 1. Work-flow of the Proposed Approach. Dashed arrows indicate the time step t for the attention stream and the frame f for the activity frame stream, respectively. For each time step t, the input frame goes through a convolutional layer to obtain a higher-level representation C_f. We extract a retina region ρ(C_f, l^f_t) at location l^f_t, which is decided by the last time step t−1. ρ(C_f, l^f_t) next goes through a glimpse layer to get the glimpse g^f_t as input of the attention stream recurrent network LSTM-a, which decides the action a^f_t and the next location l^f_{t+1}. For the activity frame stream recurrent network, the LSTM-f takes the last action of each frame a^f_T as input and outputs the final prediction.
Fig. 2. Transformation from sequences to frames
3.1 Input Representation
As we transform the wearable sensor data into activity frames, the data are represented as three-dimensional vectors. Each sample (x, y) of the model consists of a 3-D vector x and the activity label y. Suppose X, Y, F denote the activity frames' width, height, and number of frames, and C represents the number of activity classes; we have:

x ∈ R^{X×Y×F}    (1)

and

y ∈ [1, ..., C]    (2)

Fig. 3. Flattened Model. (a) Extracting the glimpse g^f_t from the input activity frame, including a CNN, flattening and reshaping, and a glimpse layer. (b) The detailed description of the glimpse layer which combines the location l^f_t and the retina region ρ(C_f, l^f_t). (c) Dual-stream recurrent procedure containing the attention stream LSTM-a and the activity frame stream LSTM-f.
3.1.1 Activity Frame. There already exist some previous works that combine multimodal wearable sensor data for HAR at the feature level [3, 46]. For example, Kunze et al. [23] concatenate acceleration and angular velocity into one vector, and [26, 33, 40] combine acceleration and other modalities including microphone and GPS data. However, these works overlook the relations among sensors, which are important to activity recognition. A popular method for extracting spatial relations is deep learning methods like CNN. Although CNN is proven to perform well in HAR [21, 46], the accuracy is still not that satisfactory. In fact, CNN was originally proposed for images where each pixel is only related to its adjacent pixels, and this small area can be easily covered by a kernel patch of a convolutional layer. However, it is still challenging to transform features to extract the relations between each signal and the related signals for HAR. In many cases of HAR [42], the sensor data are arranged according to the physical connection of human body parts. For example, the sensor data of hands should be adjacent to the data of shoulders, and the data of shoulders should be adjacent to the data of the waist, which should be followed by the data of hips, legs, and feet. Nevertheless, in the real world, activities always depend on more than one body part. For instance, running relies on the cooperation of arms and legs. In addition, the common Inertial Measurement Unit in wearable devices usually includes a tri-axis accelerometer, a tri-axis gyroscope, and a tri-axis magnetometer, and the degrees to which these sensors contribute to different activities vary. This makes it even more important to find a representative transformation to extract the relationships between each
pair of tri-axis sensor signals (e.g., accelerometer data and gyroscope data) and each pair of single signals (e.g., the first dimension of accelerometer data and the second dimension of gyroscope data).

ALGORITHM 1: Transformation from Sequences to Images
Input: Stacked raw signals. Each row is the tri-axis data of an accelerometer, a gyroscope or a magnetometer, denoted as x, y, z. As shown in Figure 2 (a), each row has a sequence number. Here the number of rows Nr = 9 as an example.
Output: The activity frame IA, which is a 2-D array
1: i = 1;
2: j = i + 1;
3: permutation sequence Sp = [1];
4: adjacent pair set Sap = ∅;
5: activity frame IA = the first row of stacked signals;
6: while i ≠ j do
7:   if j > Nr then
8:     j = 1;
9:   else if (i, j) ∉ Sap and (j, i) ∉ Sap then
10:    add (i, j) to Sap;
11:    add j to Sp;
12:    add the j-th row of input data to IA;
13:    i = j;
14:    j = i + 1;
15:  else
16:    j = j + 1;
17:  end if
18: end while
19: for each row of IA do
20:   if the sequence number of this row is odd then
21:     this row is extended as 'x, y, z, x, y, z, x, y, z'
22:   else
23:     this row is extended as 'x, y, z, y, z, x, z, x, y'
24:   end if
25: end for
26: return IA
Figure 2 shows the transformation process into activity frames. Each figure is comprised of four parts: sequence number, sensor location (hand, chest, leg) and modality (acceleration, angular velocity, ...), notations (x, y, z), and real data examples. Algorithm 1 further illustrates the transformation from sequences to images. First, raw signals are stacked row-by-row as shown in Figure 2 (a). After being permuted in the first loop (lines 6-18 in Algorithm 1), each tri-axis sensor data has a chance to be adjacent to each of the other sensor data, as shown in Figure 2 (b). For example, supposing Nr = 9, the final Sp is [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 5, 7, 9, 2, 4, 6, 8, 1, 4, 7, 1, 5, 8, 2, 5, 9, 3, 6, 9, 4, 8, 3, 7, 2, 6, 1]. Since we still need to extract the relationships between each pair of single sensor signals, the second loop (lines 19-25 in Algorithm 1) ensures that each single signal has a chance to be adjacent to each of the other signals, as Figure 2 (c) shows. So far we have extracted the relationships between each pair of single sensor signals.
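For concreteness, the sketch below implements Algorithm 1 in Python/NumPy under our reading of it: the first loop walks over every unordered pair of rows exactly once, and the second loop interprets "sequence number" as the original row index. The function and variable names are illustrative and not taken from the paper.

import numpy as np

def build_permutation(n_rows):
    # First loop of Algorithm 1: a walk that places every unordered pair of
    # rows next to each other exactly once.
    i, j = 1, 2
    seq = [1]                      # start from the first row
    used = set()                   # pairs already made adjacent
    while i != j:
        if j > n_rows:
            j = 1
        elif (i, j) not in used and (j, i) not in used:
            used.add((i, j))
            seq.append(j)
            i, j = j, j + 1
        else:
            j += 1
    return seq                     # for n_rows = 9 this reproduces the Sp above

def sequences_to_activity_frame(stacked):
    # stacked: (n_rows, 3) array, one tri-axis reading (x, y, z) per row.
    seq = build_permutation(stacked.shape[0])
    frame = []
    for row_idx in seq:
        x, y, z = stacked[row_idx - 1]
        # Second loop of Algorithm 1: extend each row so that every pair of
        # single axes also becomes adjacent somewhere in the frame.
        if row_idx % 2 == 1:
            frame.append([x, y, z, x, y, z, x, y, z])
        else:
            frame.append([x, y, z, y, z, x, z, x, y])
    return np.array(frame)

frame = sequences_to_activity_frame(np.random.randn(9, 3))
print(frame.shape)                 # (37, 9) for 9 stacked tri-axis rows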
3.1.2 Convolutional Activity Frames. To derive an effective representation of features, we further transform activity frames into convolutional activity frames. Compared with a convolutional auto-encoder [18], we prefer to train the model end-to-end and omit the pretraining process, as shown in Figure 3. Each activity frame
I_f (where f denotes the f-th frame) is transformed into a three-dimensional cube, the height of which depends on the number of channels of the convolutional network. The convolutional network has two convolutional layers that learn filters which activate when the network detects specific types of features at some spatial position in the input. The output is further processed by a ReLU layer and a max pooling layer. The former applies the non-saturating activation function relu(ν) = max(ν, 0) to increase the nonlinear properties of both the decision function and the overall network without affecting the receptive fields of the convolution layer. The latter partitions the input image into a set of non-overlapping rectangles and outputs the maximum for each such sub-region to omit the less important features.
To obtain the new convolutional activity frames, the cubes are flattened and reshaped to the same size as the original activity frames by a fully connected layer. After the convolutional layer, the input frame I_f is encoded to be C_f.
3.2 Attention and Activity Frame Based Recurrent Model
We propose a dual-stream recurrent model that incorporates both attention and activity frames to analyze the convolutional activity frames. Figure 3 shows the structure of this model, where the activity frame stream recurrent model leverages the temporal information of sensor data and the attention stream recurrent model solves the human activity recognition problem.
Since different human body parts contribute differently to recognizing different activities, we need to guarantee that the system only focuses on the most relevant and contributing parts and data. Our dual-stream model is inspired by Mnih et al. [31], who first adopt the recurrent attention model (RAM) for image classification. Specifically, they address the image classification problem using the basic RAM; that problem is relatively simple, with only brush strokes in images being salient and a clear contrast between the strokes and the black backgrounds. In contrast, analyzing activity frames in this work is much more complex because activity frames lack such characteristics. Moreover, almost all sensors detect motion during activities, and sometimes even standstill is meaningful. Since the convolutional activity frames fully extract the relationships among all feature pairs, only a part of them is salient to a certain activity. Therefore, it is natural to introduce attention mechanisms to facilitate mining effective information and to minimize the negative impacts of undesirable information. To the best of our knowledge, our method is the first one to leverage the attention model to tackle the activity recognition problem.
Figure 3 shows a flattened view, which better interprets the model. Our model is comprised of a glimpse network, a recurrent attention unit, and a recurrent activity frame unit, which we introduce in the following.
3.2.1 Glimpse Network. The first part after the convolutional layer is a glimpse network. The glimpse network not only prevents the system from processing the whole data in its entirety at a time but also maximally reduces the information loss. In our model, each frame is "understood" within T glimpses. For the transformed frame C_f, at each time step t, we simulate the process of how human eyes work. Our model first extracts a retina region, denoted by ρ(C_f, l^f_t), from the input data at the location l^f_t. The retina image encodes the region around l^f_t with high resolution but uses a progressively lower resolution for points further from l^f_t. This has been proved an effective method to remove noise and avoid information loss in [50].
In the human visual system, the retina image is converted into electric signals that are relayed to the brain via the optic nerves. Likewise, in our model, the retina image is converted into a glimpse g^f_t as Figure 3 shows. The retina image ρ(C_f, l^f_t) and the location l^f_t are linearly transformed independently by two linear layers parameterized by θ^ρ_g and θ^l_g, respectively. Next, the summation of these two parts is further transformed with another linear layer parameterized by θ^s_g and a rectified linear unit. The whole process can be summarized as the following equation:
g^f_t = f_g(ρ(C_f, l^f_t), l^f_t; θ^ρ_g, θ^l_g, θ^s_g) = relu(Linear(Linear(ρ(C_f, l^f_t)) + Linear(l^f_t)))    (3)

where Linear(·) denotes a linear transformation. Therefore, g^f_t contains information from both "what" (ρ(C_f, l^f_t)) and "where" (l^f_t).
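A sketch of Eq. (3) as a PyTorch module follows, assuming the dimensionalities reported in Section 4.1 (128 for the two pathways, 220 for the glimpse) and a 2-D location vector; the retina extraction ρ itself is omitted.

import torch.nn as nn
import torch.nn.functional as F

class GlimpseNetwork(nn.Module):
    # g^f_t = relu(Linear(Linear(rho) + Linear(l))), cf. Eq. (3).
    def __init__(self, retina_dim, loc_dim=2, hidden=128, out_dim=220):
        super().__init__()
        self.fc_rho = nn.Linear(retina_dim, hidden)   # "what" pathway, theta_g^rho
        self.fc_loc = nn.Linear(loc_dim, hidden)      # "where" pathway, theta_g^l
        self.fc_out = nn.Linear(hidden, out_dim)      # combination layer, theta_g^s

    def forward(self, rho, loc):
        # rho: flattened retina region; loc: glimpse location
        return F.relu(self.fc_out(self.fc_rho(rho) + self.fc_loc(loc)))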
3.2.2 Recurrent Attention Unit. We use recurrent neural networks as the core to process data step by step within several glimpses and introduce an attention mechanism to ensure the system only focuses on the most relevant sensors/modals and the most contributing data. The glimpses at the time steps of the recurrent attention model help visualize the contribution of sensors deployed at different body parts, thus achieving better interpretability of our model.
As Figure 3 shows, the basic structure of the recurrent attention unit is an LSTM-a (attention stream LSTM). At each time step t, the LSTM-a receives the glimpse g^f_t and the previous hidden state h^f_{t−1} as the inputs, parameterized by θ_h. Meanwhile, it outputs the current hidden state h^f_t according to the equation:

h^f_t = f_h(h^f_{t−1}, g^f_t; θ_h)    (4)
The recurrent attention model also contains two sub-networks: the location network and the action network. These two sub-networks receive the hidden state h^f_t as the input to decide the next glimpse location l^f_{t+1} and the current action a^f_t. The current action not only determines the activity label ŷ but also affects the environment in some cases, while the location network outputs the location at time t+1 stochastically according to the location policy defined by a Gaussian stochastic process, parameterized by the location network f_l(h^f_t; θ_l). As it decides the next region to "look at", the location network is the principal component of the recurrent attention unit.

l^f_{t+1} ∼ P(· | f_l(h^f_t; θ_l))    (5)
Similarly, the action network outputs the corresponding action at time t and predicts the activity label given the hidden state h^f_t. The action a^f_t obeys the distribution parameterized by f_a(h^f_t; θ_a). Owing to its prediction function, the network uses a softmax formulation:

a^f_t = f_a(h^f_t; θ_a) = softmax(Linear(h^f_t))    (6)
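The two sub-networks can be sketched as below, assuming a 2-D location, the tanh parameterization of the mean given in Section 4.1, and a Gaussian spread derived from the fixed policy variance of 0.22 reported there; returning the log-probability of the sampled location anticipates the REINFORCE update of Section 3.3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationNetwork(nn.Module):
    # Stochastic location head: mean = tanh(Linear(h)); l_{t+1} is drawn from a
    # Gaussian around that mean, cf. Eq. (5).
    def __init__(self, hidden_dim, std=0.22 ** 0.5):   # std from the reported variance
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)
        self.std = std

    def forward(self, h):
        mean = torch.tanh(self.fc(h))
        dist = torch.distributions.Normal(mean, self.std)
        loc = dist.sample()
        return loc, dist.log_prob(loc).sum(-1)         # log-prob used by REINFORCE

class ActionNetwork(nn.Module):
    # Action head: class scores through a softmax over Linear(h), cf. Eq. (6).
    def __init__(self, hidden_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, h):
        return F.log_softmax(self.fc(h), dim=-1)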
3.2.3 Recurrent Activity Frame Unit. Activity recognition heavily relies on temporal information. Therefore, besides the single activity frames used by the aforementioned process, we additionally leverage sequences of activity frames via a recurrent activity frame unit. The hidden layer h^f_t of the core LSTM-a contributes to predicting the action a^f_t and deciding the next glimpse location l^f_{t+1}. For this reason, we believe the hidden state is discriminative enough to make the final prediction for the whole system. In particular, we design an LSTM-f (activity frame stream LSTM) to combine the hidden states of all the frames at the last time step T to predict the activity label while preserving efficiency. Given the hidden state of the previous frame, the hidden state of each frame is r_f = f_r(h^f_T, r_{f−1}; θ_r), parameterized by θ_r.
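A simplified sketch of the two recurrent streams, assuming the cell sizes reported in Section 4.1 (100 for LSTM-a, 1000 for LSTM-f) and omitting the glimpse extraction and location sampling for brevity; only the final frame state feeds the classifier, as described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamCore(nn.Module):
    # Attention-stream LSTM-a runs over T glimpses of each frame; the last hidden
    # state h^f_T of every frame feeds the frame-stream LSTM-f (Section 3.2.3).
    def __init__(self, glimpse_dim=220, a_dim=100, f_dim=1000, n_classes=6):
        super().__init__()
        self.lstm_a = nn.LSTMCell(glimpse_dim, a_dim)
        self.lstm_f = nn.LSTMCell(a_dim, f_dim)
        self.classifier = nn.Linear(f_dim, n_classes)

    def forward(self, glimpses_per_frame):
        # glimpses_per_frame: list over F frames, each a (B, T, glimpse_dim) tensor
        B = glimpses_per_frame[0].size(0)
        device = glimpses_per_frame[0].device
        r = c_r = torch.zeros(B, self.lstm_f.hidden_size, device=device)
        for g_seq in glimpses_per_frame:
            h = c = torch.zeros(B, self.lstm_a.hidden_size, device=device)
            for t in range(g_seq.size(1)):           # T glimpses for this frame
                h, c = self.lstm_a(g_seq[:, t], (h, c))
            r, c_r = self.lstm_f(h, (r, c_r))        # h^f_T updates the frame stream
        return F.log_softmax(self.classifier(r), dim=-1)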
3.3 Training and Optimization
Our proposed model depends on the parameters of every component, including the glimpse network, the recurrent attention network, the two sub-networks, and the activity frame stream recurrent network: Θ = {θ_g, θ_h, θ_a, θ_l, θ_r}. Both the action network and the frame-based recurrent network are based on classification methods. Therefore, their parameters, θ_a and θ_r, can be trained by optimizing the cross-entropy loss with backpropagation. However,
the location network should be able to select a sequence of salient regions from activity frames adaptively. Since this network is non-differentiable owing to its stochasticity, and since the problem can also be regarded as a control problem of settling the attention region at the next step, it can be trained by reinforcement learning methods to learn the optimal policies.
We briefly introduce some definitions of reinforcement learning in the context of our case.
• Agent: the brain that makes decisions, which is the location network in our case.
• Environment: the unknown world that may affect the agent's decision or may be influenced by the agent.
• Reward: the feedback from the environment to evaluate the action. In our case, for each frame, the model gives a prediction ŷ = a_t and receives a reward r_t as feedback for the future correction of the prediction after each time step t. Suppose T denotes the number of steps in our attention stream recurrent model; r_t = 1 if ŷ = y after T steps and 0 otherwise. The target of the optimization is to maximize R = Σ_{t=1}^{T} r_t.
• Policy: the projection from states to actions, denoted by π(a|s) = P[A_t = a | S_t = s]. To maximize the reward R, we learn an optimal policy π(l_t, a_t | s_{1:t}; Θ) that maps the attention sequence s_{1:t} to a distribution over actions for the current time step, where the policy π is decided by the parameters Θ of the recurrent attention model.
Based on the above discussion, we deploy a Partially Observable Markov Decision Process (POMDP) to solve the training and optimization problems, in which the true state of the environment is unobserved. Let s_{1:t} = x_1, l_1, a_1; ...; x_t, l_t, a_t be the sequence of the input, location and action pairs. This sequence, called an attention sequence, shows the order of the regions our attention focuses on.
To sum up, in our case, the location network is formulated as a stochastic process (a Gaussian distribution) parameterized by Θ. Each time after the location selection, the prediction a is evaluated to feed back a reward for conducting the backpropagation training process. This procedure is also defined as policy gradient. Our goal is to maximize the simulated rewards using the gradient.
Generally, for a sample x with its reward f(x) and probability p(x), we have:

E_x[f(x)] = Σ_x p(x) f(x)    (7)

so that the gradient can be calculated according to the REINFORCE rule [43]:

∇_θ E_x[f(x)] = ∇_θ Σ_x p(x) f(x)
             = Σ_x ∇_θ p(x) f(x)
             = Σ_x p(x) (∇_θ p(x) / p(x)) f(x)
             = Σ_x p(x) ∇_θ log p(x) f(x)
             = E_x[f(x) ∇_θ log p(x)]    (8)
In our case, given the reward R and the attention sequence s_{1:T}, the reward function to be maximized is as follows:

J(Θ) = E_{p(s_{1:T}; Θ)}[ Σ_{t=1}^{T} r_t ] = E_{p(s_{1:T}; Θ)}[R]    (9)
By considering the training problem as a POMDP, a sample approximation to the gradient is calculated as
follows:
∇_Θ J = Σ_{t=1}^{T} E_{p(s_{1:T}; Θ)}[ ∇_Θ log π(y | s_{1:t}; Θ) R ]    (10)

where i denotes the i-th training sample, y^{(i)} is the correct label for the i-th sample, and ∇_Θ log π(y^{(i)} | s^i_{1:t}; Θ) is the gradient of LSTM-a calculated by backpropagation.

ALGORITHM 2: Overall Process of RAAF
Input: Activity frames from Algorithm 1; T: the number of time steps; F: the number of activity frames.
Output: The prediction results
1: r_last = RandomInitialize()
2: for f from 1 to F do
3:   I = the f-th activity frame
4:   I_CNN = CNN(I)
5:   C = Reshape(Flatten(I_CNN))
6:   h_last = RandomInitialize()
7:   l = RandomInitialize()
8:   for t from 0 to T do
9:     ρ = ExtractRetina(C, l)
10:    glimpse = relu(Linear(Linear(ρ) + Linear(l)))
11:    h = LSTM_attention(glimpse, h_last)
12:    a = softmax(Linear(h))   (trained by cross-entropy and backpropagation)
13:    l = tanh(Linear(h))      (trained by Equation (11))
14:    h_last = h
15:  end for
16:  r = LSTM_frame(h, r_last)
17:  r_last = r
18: end for
19: activity_label = r
20: return activity_label
We use Monte Carlo sampling, which utilizes randomness to approximate results that might be deterministic in theory. Supposing M is the number of Monte Carlo sampling copies, we duplicate the same convolutional activity frames M times and average them as the prediction results to overcome the randomness in the network, where the M duplications generate M subtly different results owing to the stochasticity; so we have:

∇_Θ J = Σ_{t=1}^{T} E_{p(s_{1:t}; Θ)}[ ∇_Θ log π(y | s_{1:t}; Θ) R ] ≈ (1/M) Σ_{i=1}^{M} Σ_{t=1}^{T} ∇_Θ log π(y^{(i)} | s^i_{1:t}; Θ) R^{(i)}    (11)
Therefore, although the best attention sequences are unknown, our proposed model can learn the optimal
policy in the light of the reward.
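The sketch below shows one plausible way to combine the supervised cross-entropy term with the REINFORCE estimate of Eq. (11) over M Monte Carlo duplicates; the function name and tensor layout are our assumptions, not the authors' exact training code.

import torch.nn.functional as F

def raaf_loss(log_pi, log_probs, labels, M):
    # log_pi:    (M*B, T) log-probabilities of the sampled glimpse locations
    # log_probs: (M*B, C) log class probabilities from the action/frame stream
    # labels:    (B,)     ground-truth activity labels
    labels = labels.repeat(M)                              # duplicate targets M times
    ce = F.nll_loss(log_probs, labels)                     # cross-entropy stream
    reward = (log_probs.argmax(-1) == labels).float()      # r = 1 iff prediction correct
    reward = reward.unsqueeze(1).expand_as(log_pi)         # same reward for every step
    reinforce = -(log_pi * reward.detach()).sum(dim=1).mean()
    return ce + reinforce                                  # minimized by the optimizer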
To summarize, we propose a dual-stream recurrent convolutional attention model which includes transforming features into activity frames and a dual-stream recurrent model. First, to fully extract relations between each pair of sensors and modality features, the inputs are transformed into convolutional activity frames. After that, the model effectively combines attention based recurrent spatial relations and recurrent temporal information to wisely select salient features and perform classification. To further illustrate the process in detail, the overall procedure is shown in Algorithm 2. The experimental results presented next show that the proposed approach outperforms the state-of-the-art HAR methods.
4 EXPERIMENTS
In this section, we present the validation of our proposed method via experiments on two public datasets and another real-world dataset collected by ourselves. Firstly, we describe the datasets used and the experimental setup. Secondly, we present our investigation of the effect of hyper-parameters on the classification performance. Thirdly, we compare the accuracy of our proposed method with several state-of-the-art HAR methods, present the confusion matrices on the datasets, and analyze the experimental results. Lastly, we show the interpretability of RAAF and its low dependency on labeled data.
4.1 Datasets and Experimental Settings
We evaluate the proposed method on two public benchmarked activity recognition datasets, the PAMAP2 dataset and the MHEALTH dataset, and the real-world dataset MARS, which is collected by ourselves. These public datasets are the latest available wearable sensor-based datasets with complete annotation and have been widely used in the activity recognition research community.
PAMAP2. The dataset was collected in a constrained setting where 9 participants (1 female and 8 males) performed 12 daily living activities including basic actions (standing, walking) and sportive exercises (running, playing soccer). Six activities were carried out by the subjects optionally. The sensor data were collected at the frequency of 100 Hz from a hardware setup that contains 3 Colibri Inertial Measurement Units (IMUs) attached to the dominant wrist, the chest and the dominant side's ankle, respectively. Besides, heart rate (bpm) was collected by an HR-monitor at the sampling frequency of 9 Hz. All the above collected data include two 3-axis accelerometer data (m/s²), 3-axis gyroscope data (rad/s), 3-axis magnetometer data (µT), 3-axis orientation data, and temperature (°C). Specially, the temperature is collected from the 3 IMUs, so it is also processed to be 3-axis. Our experiments only consider the high-quality part of the data, including temperature, accelerometer, gyroscope, and magnetometer data, to ensure effective validation of the experimental results.
MHEALTH. The Mobile Health (MHEALTH) dataset is also devised to benchmark methods of human activity recognition based on multimodal wearable sensor data. Three IMUs were respectively placed on 10 participants' chest, right wrist, and left ankle to record the accelerometer (m/s²), gyroscope (deg/s) and magnetometer (local) data while they were performing 12 activities. The IMU on the chest also collected 2-lead ECG data (mV) to monitor the electrical activity of the heart. All sensing modals are recorded at the frequency of 50 Hz.
MARS. Our new dataset, the Multimodal Activity Recognition with Sensing (MARS) dataset, was collected while 8 participants (6 males, 2 females) were doing 5 basic activities (sitting, standing, walking, ascending stairs and descending stairs). Three IMU sensors, Phidget Spatial 3/3/3 [34], were attached to the dominant wrist, the waist, and the dominant side's ankle, respectively, to collect 3-axis accelerometer data (gravitational acceleration g), 3-axis gyroscope data (°/s), and 3-axis magnetometer data (nT). Since participants went up and down through the same flight of stairs during our data collection, the magnetometer data contain signals of two opposite directions. To avoid the confusion resulting from the opposite data, we excluded the magnetometer data from activity recognition. All IMUs collected the data at the frequency of 70 Hz.
Similar to [15], the experiments conducted on the two public datasets perform the background activity recognition task [38]. The activities are categorized into 6 classes: lying, sitting/standing, walking, running, cycling and other activities. To tackle the task and ensure rigorousness, all experiments are performed with Leave-One-Subject-Out (LOSO) cross-validation, which also tests person independence during the evaluation. The evaluation results are measured by accuracy (%), one of the most commonly used performance measures for classification tasks.
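LOSO cross-validation can be set up with scikit-learn's LeaveOneGroupOut, as in the sketch below; the feature matrix, labels, subject identifiers and the random-forest classifier are toy stand-ins for illustration, not objects defined in the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins for the real feature matrix, labels and per-sample subject ids.
X = np.random.randn(900, 117)            # e.g. flattened activity frames
y = np.random.randint(0, 6, size=900)    # 6 background activity classes
subject_ids = np.repeat(np.arange(9), 100)

logo = LeaveOneGroupOut()                # each fold holds out one subject entirely
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    clf = RandomForestClassifier().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(sum(scores) / len(scores))         # mean LOSO accuracy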
Here, we describe the common design for all the experiments but leave hyper-parameter study to the next
section.
Convolutional Network: The convolutional network has three sections. Each of the first two sections consists of one convolutional layer with a kernel size of 3x3, one rectified linear unit (ReLU) layer that applies the non-saturating activation function relu(ν) = max(ν, 0), and one max pooling layer with a kernel size of 1x3 and a stride of 1x3. The third section is a fully connected layer applied to the flattened results of the second section. The size of the fully connected layer depends on the size of the input activity frame, since the output should be reshaped to another 2-D matrix C_f with the same size as I_f.
Glimpse Network: The glimpse network has three fully connected layers defined as gl^f_t = Linear(l^f_t), gρ^f_t = Linear(ρ(I_f, l^f_t)), and g^f_t = relu(Linear(gl^f_t + gρ^f_t)), respectively. The dimensionalities of gρ^f_t, gl^f_t and g^f_t are 128, 128 and 220 in our experiments.
Action and Location Networks: The action network only has one fully connected layer, while the policy for the location network is defined by a dual-component Gaussian with a variance fixed to 0.22. The location network outputs the location at time t+1 stochastically according to the location distribution, which is defined as l^f_{t+1} = tanh(Linear(h^f_t)).
Two Recurrent Networks: The proposed method has two recurrent networks. One is the attention based LSTM with a cell size of 100; its number of time steps is 40, which defines the number of glimpses. The other one is the frame-based recurrent network, an LSTM with a cell size of 1000 whose number of time steps is set to 5, which decides the number of frames that are utilized to perform the recognition task.
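For reference, the settings above can be collected in a single configuration object, as in the sketch below (values copied from this section; the key names are ours).

# Consolidated experimental settings from Section 4.1.
RAAF_CONFIG = {
    "conv_kernel": (3, 3),
    "pool_kernel": (1, 3),
    "pool_stride": (1, 3),
    "glimpse_dims": {"g_rho": 128, "g_l": 128, "g": 220},
    "location_policy_variance": 0.22,
    "lstm_a_cell_size": 100,
    "num_glimpses_T": 40,
    "lstm_f_cell_size": 1000,
    "num_frames_F": 5,
}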
4.2 Hyper-Parameter Study
In this section, we mainly analyze the four most influential hyper-parameters, to which the model is most sensitive in our experiments, namely the size of the glimpse window (width and height), the size of the glimpse output g^f_t, the number of copies for Monte Carlo sampling, and the number of glimpses. For the other hyper-parameters, we use fixed empirical values as suggested in the previous subsection. The variation trends are shown in Figure 4, Figure 5 and Figure 6.
Taking Figure 4 as an example, we first tune the width and the height of the glimpse window to figure out their relationship, as shown in Figure 4 (a). Specifically, there are 13 3-axis vectors representing the temperature, accelerometer, gyroscope and magnetometer data in our experiments. After Algorithm 1, several 78x9 activity frames are generated. Figure 4 (a) shows that the accuracy is best when the glimpse window size is 64x16, and there is an obvious "ridge" along which the whole figure is almost symmetric. All the points on the symmetric line are in a ratio of 4:1. This suggests that the approach favors a fixed ratio of the two dimensions of the glimpse window, despite the activity frame size ratio of 78:9. Also, we can see that Figure 5 (a) and Figure 6 (a) both show the "ridge" while their optimal glimpse window sizes differ because of the different sizes of their activity frames.
Figure 4 (b), (c) and (d) show the experimental results of our studies on the effect of the other three hyper-parameters: the size of the glimpse network, the number of copies for Monte Carlo sampling and the number of glimpses. In particular, Figure 4 (b) and (d) present similar trends in which the accuracy increases remarkably at first and keeps rising slowly (Figure 4 (b)) or remains stable (Figure 4 (d)) after a turning point. However, for the Monte Carlo sampling, too low or too high values lead to worse performance, as Figure 4 (c) shows. Considering that the computational complexity increases with larger values of these hyper-parameters, a trade-off between the accuracy and the computational complexity is necessary, especially for the Monte Carlo sampling. Therefore, we simply select the points slightly after the turning points (220, 20, 30) as the optimal parameters to conduct our following experiments. We can notice that the variation trends in Figure 5 and Figure 6 follow the same patterns.
Fig. 4. Experimental Results for Hyper-Parameter Tuning on PAMAP2
4.3 Accuracy Comparison and Performance Analysis
To evaluate the performance of the proposed approach, RAAF, we conduct extensive experiments to compare its performance with state-of-the-art methods on PAMAP2 and MHEALTH. We carefully select four state-of-the-art, multimodal feature-based approaches (MARCEL [15], FEM [26], CEM [16] and MKL [2]) and five baseline methods (Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Decision Tree (DT) and a single neural network) to show the competitive power of the proposed method. To ensure a fair comparison, the best parameter setting of RAAF is used on both datasets; the best trade-off parameter (λ = 0.7) is deployed for MARCEL; time-domain features including mean, variance, standard deviation and median, and frequency-domain features including entropy and spectral entropy are utilized for FEM; each modality feature group is given an independent kernel for MKL; and for the other baseline methods, all modality features are used. All parameters adopted follow the values suggested in the literature. The results in Table 1 show the proposed RAAF outperforms all the state-of-the-art methods and the baseline methods.
To further explain the accuracy of RAAF on each specific activity, Figure 7 (a) and (b) show the confusion matrices on both public datasets performing the background activity recognition task. The results show the proposed approach performs well for most activities such as lying, sitting and standing, and cycling. However, more misclassifications occur on activities that have patterns similar to the background activities, such as walking,
Fig. 5. Experimental Results for Hyper-Parameter Tuning on MHEALTH
Table 1. Comparison of RAAF with four state-of-the-art methods and five baseline methods (accuracy, %). For the PAMAP2 dataset, accelerometer, gyroscope and magnetometer data are utilized. For the MHEALTH dataset, ECG data are considered additionally.

Dataset  | RAAF | MARCEL [15] | FEM+SVM [26] | CEM [16] | FEM+MKL [2, 26]
PAMAP2   | 83.4 | 82.8        | 76.4         | 81       | 81.6
MHEALTH  | 94.0 | 92.3        | 70.7         | 74.8     | 90.6

Dataset  | SVM  | RF   | KNN  | DT   | Single NN
PAMAP2   | 59.3 | 64.7 | 70.3 | 57.8 | 72.0
MHEALTH  | 68.7 | 82.5 | 86.1 | 78.7 | 89.1
ascending stairs and descending stairs, due to the "others" constraint of the background activity recognition task. This pattern can also be seen in Figure 7 (c), where sitting and standing can be clearly classified while walking and ascending or descending stairs appear slightly confusing. To present the effectiveness of our method
Fullie and Wiselie: A Dual-Stream Recurrent Convolutional Aention Model for Activity Recognition
•
1:17
Fig. 6. Experimental Results for Hyper-Parameter Tuning on MARS
on other activities, Figure 8 shows the confusion matrices on both public datasets performing the all activity recognition task, which defines separate classes for each of the 12 activities [38] on PAMAP2 and MHEALTH. From Figure 8 we observe that on the PAMAP2 dataset the model works well for most activities but is confused by running, ascending & descending stairs and rope jumping because of their similar patterns. On the MHEALTH dataset, the performance is remarkable except for some misclassifications of knee bending, cycling and jogging.
Fig. 7. The confusion matrices of RAAF for background activity recognition on three datasets
Fig. 8. The confusion matrices of RAAF for all activity recognition on two public datasets
Table 2. Feature Extraction Capability of Activity Frames (accuracy, %)

                | PAMAP2 Dataset | MHEALTH Dataset | MARS Dataset
Original Frames | 81.35          | 92.20           | 77.25
Activity Frames | 83.42          | 94.04           | 85.28
Table 3. Latency Analysis on Three Datasets

PAMAP2 Dataset | MHEALTH Dataset | MARS Dataset
0.68 s         | 0.72 s          | 0.59 s
We prove the effectiveness of our activity frames by deploying the dual-stream recurrent convolutional attention model on the original features. To adapt the features to the proposed model, the multimodal features are stacked to form original frames, as Figure 2 (a) shows. Table 2 presents the feature extraction capability of activity frames, and shows that the proposed model based on the original frames already outperforms most of the state-of-the-art methods (listed in Table 1) even without activity frames. However, utilizing the activity frames significantly improves the performance of the original model, due to the availability of the full relationships among features provided by activity frames.
As latency is a critical indicator of the applicability of HAR systems in practical scenarios, Table 3 shows the latency for testing one sample on the three datasets (all less than 1 second), which we believe is fairly acceptable in realistic application scenarios.
4.4 Model Interpretability
One of the merits of our method is its interpretability. For wearable sensor-based activity recognition, subjects usually wear more than one sensor on their dominant body parts like arms, chest, and ankles, and each sensor is multimodal. Attention mechanisms provide the advantage of feeding the glimpse location back at each time step. Owing to the particularity of the activity frames, the attention model in our scenario not only provides the specific body parts it focuses on but also highlights the sensors and modals contributing most to diverse activities. In this section, we only present the experimental results of running, walking and lying down on the MHEALTH dataset for simplicity. The available sensors on MHEALTH include ECG, chest accelerometer, ankle accelerometer, ankle gyroscope, ankle magnetometer, arm accelerometer, arm gyroscope and arm magnetometer. Figure 9 shows
Table 4. Modality Involvement on MHEALTH Dataset (%). (Acc, Gyro, Magn denote Accelerometer, Gyroscope and Magnetometer, respectively.)

activity      ECG     Acc_chest   Acc_ankle   Gyro_ankle   Magn_ankle   Acc_arm   Gyro_arm   Magn_arm
running       21.98   10.22       30.55       15.86        4.30         6.49      5.03       5.57
walking       7.23    11.05       18.78       19.26        8.72         19.66     9.46       5.83
lying down    6.58    13.45       16.34       10.23        17.92        10.29     10.72      14.47
Figure 9 shows the glimpse heatmap for all sensors. Taking running as an example, we can observe that the ankle is the most active body part during running. The chest also contributes considerably, while the arm is involved the least. To further quantify the involvement of each sensor modality, Table 4 summarizes the percentage of glimpses our model devotes to the different modalities over the latest 120 of 200 time steps. It shows that for running, the most salient modality is ankle acceleration, which accounts for 30.55%. ECG and ankle gyroscope data are also significant. These results agree with the intuition that while running, the most active body parts are the legs and ankles. Another observation from our experiments is that ECG is a modality that easily distinguishes strenuous exercise such as running from other activities. Also, since the model still "looks at" the other modalities from time to time, this further corroborates the claim that our model minimizes information loss.
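For concreteness, involvement percentages of this kind can be obtained by counting which modality each recorded glimpse falls into, as in the following sketch; the `glimpse_cols` and `modality_slices` layouts are assumptions made for illustration, not the layout of our released code.

```python
# Minimal sketch of deriving modality-involvement percentages (Table 4)
# from recorded glimpse locations. `glimpse_cols` lists the frame column
# attended at each time step; `modality_slices` maps a modality name to
# the half-open column range it occupies in the activity frame.
from collections import Counter

def modality_involvement(glimpse_cols, modality_slices, last_k=120):
    counts = Counter()
    for col in glimpse_cols[-last_k:]:            # only the latest glimpses
        for name, (lo, hi) in modality_slices.items():
            if lo <= col < hi:
                counts[name] += 1
                break
    total = sum(counts.values()) or 1
    # Percentage of glimpses that landed on each modality.
    return {name: 100.0 * c / total for name, c in counts.items()}
```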
Fig. 9. Glimpse Heatmap
4.5 Labeled Data Dependency
Obtaining enough labeled data is generally regarded as one of the most serious challenges in human activity recognition, owing to the considerable annotation expense and the risk of violating user privacy. Semi-supervised [47] or weakly-supervised [28] methods can take advantage of unlabeled data but meanwhile incur extra cost [47]. In contrast, we propose to maximize the utilization of features and achieve the best performance at the least cost. With activity frames fully extracting the information among features, the attention model focusing on the most salient data, and the frame-based recurrent network studying the temporal pattern in detail, RAAF is able to significantly reduce the dependency on labeled data. As Figure 10 shows, although the accuracy decreases with fewer labeled samples, the downtrend is slow until the number of labeled samples is reduced to 1,000 on both datasets. Even 5,000 labeled samples deliver a relatively satisfactory accuracy. Since the experiments adopt Leave-One-Subject-Out (LOSO) cross-validation, which means 7 subjects' data for 6 activities on the PAMAP2 dataset,
8 subjects on the MHEALTH dataset, and 6 subjects for 5 activities on MARS are used for training, only 119, 104 and 166 labeled samples are needed per subject and per activity on PAMAP2, MHEALTH and MARS, respectively. This fully validates the low dependency of our method on labeled data.
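For reference, the sketch below outlines the evaluation protocol used in this section: LOSO cross-validation combined with limiting the training set to a fixed label budget. The uniform subsampling shown is an assumption for illustration; the exact sampling used in our experiments may differ.

```python
# Minimal sketch: Leave-One-Subject-Out splits with a fixed budget of
# labeled training windows. `X`, `y`, `subjects` are per-window features,
# labels, and subject ids (all placeholders).
import numpy as np

def loso_with_label_budget(X, y, subjects, budget, seed=0):
    rng = np.random.default_rng(seed)
    for test_subject in np.unique(subjects):
        train_idx = np.where(subjects != test_subject)[0]
        test_idx = np.where(subjects == test_subject)[0]
        # Keep only `budget` labeled training windows, drawn uniformly.
        if budget < len(train_idx):
            train_idx = rng.choice(train_idx, size=budget, replace=False)
        yield (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```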
Fig. 10. Labeled Data Dependency
5 CONCLUSION
This paper proposes an innovative human activity recognition approach, RAAF, which includes (a) a novel form of multimodal sensor features, convolutional activity frames, to fully extract the relations between each pair of sensors and modality data, and (b) a dual-stream convolutional attention model that combines recurrent attention and recurrent activity frames. The experiments show that our method outperforms the state-of-the-art methods and has low annotation dependency, which indicates that it substantially reduces the requirement for labeled data. In addition, the method offers good interpretability despite the usual opacity of neural networks. Furthermore, we evaluate the method on a real-world dataset collected by ourselves to validate its applicability in practical situations. Given the encouraging results, achieving higher performance with RAAF in more complex HAR scenarios is a promising direction for our future work.
REFERENCES
[1] Bogdan Alexe, Nicolas Heess, Yee W Teh, and Vittorio Ferrari. 2012. Searching for objects driven by context. In Advances in Neural Information Processing Systems. 881–889.
[2] Salah Althloothi, Mohammad H Mahoor, Xiao Zhang, and Richard M Voyles. 2014. Human activity recognition using multi-features and multiple kernel learning. Pattern recognition 47, 5 (2014), 1800–1812.
[3] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A Public Domain Dataset for Human Activity Recognition using Smartphones. In ESANN.
[4] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. 2014. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living. Springer, 91–98.
[5] Oresti Banos, Claudia Villalonga, Rafael Garcia, Alejandro Saez, Miguel Damas, Juan A Holgado-Terriza, Sungyong Lee, Hector Pomares, and Ignacio Rojas. 2015. Design, implementation and validation of a novel open framework for agile development of mobile health applications. Biomedical engineering online 14, 2 (2015), S6.
[6] Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing. Springer, 1–37.
[7] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46, 3 (2014), 33.
[8] Nicholas J Butko and Javier R Movellan. 2008. I-POMDP: An infomax model of eye movement. In Development and Learning, 2008. ICDL 2008. 7th IEEE International Conference on. IEEE, 139–144.
[9] Liming Chen, Jesse Hoey, Chris D Nugent, Diane J Cook, and Zhiwen Yu. 2012. Sensor-based activity recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 790–808.
[10] Li Deng. 2014. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing 3 (2014).
[11] Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. 2012. Learning where to attend with deep architectures for image tracking. Neural computation 24, 8 (2012), 2151–2184.
[12] Marcus Edel and Enrico Köppe. 2016. Binarized-BLSTM-RNN based Human Activity Recognition. In Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on. IEEE, 1–7.
[13] Hongqing Fang and Chen Hu. 2014. Recognizing human activity in smart home using deep learning algorithm. In Control Conference (CCC), 2014 33rd Chinese. IEEE, 4716–4720.
[14] Yu Guan and Thomas Plötz. 2017. Ensembles of Deep LSTM Learners for Activity Recognition Using Wearables. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 2, Article 11 (June 2017), 28 pages. https://doi.org/10.1145/3090076
[15] Haodong Guo, Ling Chen, Liangying Peng, and Gencai Chen. 2016. Wearable sensor based multimodal human activity recognition exploiting the diversity of classifier ensemble. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 1112–1123.
[16] Haodong Guo, Ling Chen, Yanbin Shen, and Gencai Chen. 2014. Activity recognition exploiting classifier level fusion of acceleration and physiological signals. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. ACM, 63–66.
[17] Sojeong Ha, Jeong-Min Yun, and Seungjin Choi. 2015. Multi-modal convolutional neural networks for activity recognition. In Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on. IEEE, 3017–3022.
[18] Albert Haque, Alexandre Alahi, and Li Fei-Fei. 2016. Recurrent attention models for depth-based person identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1229–1238.
[19] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
[20] Masaya Inoue, Sozo Inoue, and Takeshi Nishida. 2016. Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput. arXiv preprint arXiv:1611.03607 (2016).
[21] Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 1307–1310.
[22] Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu, and Yann L Cun. 2010. Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems. 1090–1098.
[23] Kai Kunze and Paul Lukowicz. 2008. Dealing with sensor displacement in motion-based onbody activity recognition systems. In Proceedings of the 10th international conference on Ubiquitous computing. ACM, 20–29.
[24] Zhihui Lai, Yong Xu, Qingcai Chen, Jian Yang, and David Zhang. 2014. Multilinear sparse principal component analysis. IEEE transactions on neural networks and learning systems 25, 10 (2014), 1942–1950.
[25] Nicholas D Lane, Petko Georgiev, and Lorena Qendro. 2015. Deepear: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 283–294.
[26] Oscar D Lara, Alfredo J Pérez, Miguel A Labrador, and José D Posada. 2012. Centinela: A human activity recognition system based on acceleration and vital sign data. Pervasive and mobile computing 8, 5 (2012), 717–729.
[27] Hugo Larochelle and Geoffrey E Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in neural information processing systems. 1243–1251.
[28] Chandrashekhar Lavania, Sunil Thulasidasan, Anthony LaMarca, Jeffrey Scofield, and Jeff Bilmes. 2016. A weakly supervised activity recognition framework for real-time synthetic biology laboratory assistance. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 37–48.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[30] Yongmou Li, Dianxi Shi, Bo Ding, and Dongbo Liu. 2014. Unsupervised feature learning for human activity recognition using smartphone sensors. In Mining Intelligence and Knowledge Exploration. Springer, 99–107.
[31] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in neural information processing systems. 2204–2212.
[32] Shih Yin Ooi, Andrew Beng Jin Teoh, Ying Han Pang, and Bee Yan Hiew. 2016. Image-based handwritten signature verification using hybrid methods of discrete radon transform, principal component analysis and probabilistic neural network. Applied Soft Computing 40 (2016), 274–282.
[33] Juha Parkka, Miikka Ermes, Panu Korpipaa, Jani Mantyjarvi, Johannes Peltola, and Ilkka Korhonen. 2006. Activity classification using realistic data from wearable sensors. IEEE Transactions on information technology in biomedicine 10, 1 (2006), 119–128.
[34] I Phidgets. 2010. 1056-PhidgetSpatial 3/3/3. Code Samples For This Product (2010).
[35] Thomas Plötz, Nils Y Hammerla, and Patrick Olivier. 2011. Feature learning for activity recognition in ubiquitous computing. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22. 1729.
[36] Valentin Radu, Nicholas D Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar. 2016. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, 185–188.
[37] Attila Reiss and Didier Stricker. 2012. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 40.
[38] Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 108–109.
[39] Monit Shah Singh, Vinaychandran Pondenkandath, Bo Zhou, Paul Lukowicz, and Marcus Liwicki. 2017. Transforming Sensor Data to the Image Domain for Deep Learning - an Application to Footstep Detection. Neural Networks (IJCNN), 2017 International Joint Conference on (2017), 3017–3022.
[40] Emmanuel Munguia Tapia, Stephen S Intille, William Haskell, Kent Larson, Julie Wright, Abby King, and Robert Friedman. 2007. Real-time recognition of physical activities and their intensities using wireless accelerometers and a heart rate monitor. In Wearable Computers, 2007 11th IEEE International Symposium on. IEEE, 37–40.
[41] Aiguo Wang, Guilin Chen, Cuijuan Shang, Miaofei Zhang, and Li Liu. 2016. Human Activity Recognition in a Smart Home Environment with Stacked Denoising Autoencoders. In International Conference on Web-Age Information Management. Springer, 29–40.
[42] Hongsong Wang and Liang Wang. 2017. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. The Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
[43] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
[44] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.
[45] Yaser Yacoob and Michael J Black. 1998. Parameterized modeling and recognition of activities. In Computer Vision, 1998. Sixth International Conference on. IEEE, 120–127.
[46] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI. 3995–4001.
[47] Lina Yao, Feiping Nie, Quan Z Sheng, Tao Gu, Xue Li, and Sen Wang. 2016. Learning from less for better: semi-supervised activity recognition via shared structure discovery. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 13–24.
[48] Ming Zeng, Le T Nguyen, Bo Yu, Ole J Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neural networks for human activity recognition using mobile sensors. In Mobile Computing, Applications and Services (MobiCASE), 2014 6th International Conference on. IEEE, 197–205.
[49] Mi Zhang and Alexander A Sawchuk. 2012. USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 1036–1043.
[50] Maria Zontak, Inbar Mosseri, and Michal Irani. 2013. Separating signal from noise using patch recurrence across scales. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1195–1202.