Synthesizing and Reconstructing Missing Sensory
Modalities in Behavioral Context Recognition
Aaqib Saeed * , Tanir Ozcelebi and Johan Lukkien
Department of Mathematics and Computer Science, Eindhoven University of Technology,
Eindhoven, The Netherlands; (T.O.); (J.L.)
Received: 12 July 2018; Accepted: 3 September 2018; Published: 6 September 2018
Abstract: Detection of human activities along with the associated context is of key importance for various application areas, including assisted living and well-being. To predict a user's context in daily-life situations, a system needs to learn from multimodal data that are often imbalanced and noisy, with missing values. The model is likely to encounter missing sensors in real-life conditions as well (such as a user not wearing a smartwatch), and it fails to infer the context if any of the modalities used for training are missing. In this paper, we propose a method based on an adversarial autoencoder for handling missing sensory features and synthesizing realistic samples. We empirically demonstrate the capability of our method in comparison with classical approaches for filling in missing values on a large-scale activity recognition dataset collected in-the-wild. We develop a fully-connected classification network by extending an encoder and systematically evaluate its multi-label classification performance when several modalities are missing. Furthermore, we show class-conditional artificial data generation and its visual and quantitative analysis on a context classification task, demonstrating the strong generative power of adversarial autoencoders.
Keywords: sensor analytics; human activity recognition; context detection; autoencoders; adversarial learning; imputation
1. Introduction
The automatic recognition of human activities along with inferring the associated context is
of great importance in several areas such as intelligent assistive technologies. A minute-to-minute understanding of a person's context can enable immediate support, e.g., for elderly monitoring [ ], timely interventions to overcome addictions [ ], voluntary behavior adjustment for living a healthy lifestyle [ ], coping with physical inactivity [ ] and in industrial environments to improve workforce productivity [ ]. The ubiquity of sophisticated sensors integrated into smartphones, smartwatches and fitness trackers provides an excellent opportunity to perform human activity and behavior analysis, as such devices have become an integral part of our daily lives [ ]. However, context recognition in a real-life setting is very challenging due to the heterogeneity of sensors, variation in device usage, different sets of routines, and complex behavioral activities [ ]. Concretely, to predict people's
behavior in their natural surroundings, a system must be able to learn from multimodal data sources
(such as an accelerometer, audio, and location signals) that are often noisy with missing data. In reality,
a system is likely to encounter missing modalities due to various reasons such as a user not wearing a
smartwatch, a sensor malfunction or a user not granting permission to access specific data because
of privacy concerns. Moreover, due to large individual differences, the training data could be highly
imbalanced, with very few (sparse) labels for certain classes. Hence, for a context recognizer to perform
well in unconstrained naturalistic conditions, it must handle missing data and class imbalance in a
robust manner while learning from multimodal signals.
Sensors 2018, 18, 2967; doi:10.3390/s18092967
There are a variety of techniques available for dealing with missing data [ ]. Some naive approaches are mean substitution or simply discarding instances with missing values. In the former, replacing by the average may lead to bias (inconsistency would arise, e.g., if the numbers of missing values for different features are excessively unequal and vary over time) [ ]. In the latter, removal leads to a substantial decrease in the number of samples (mostly labeled) that are otherwise available for learning. It can also introduce bias in the model's output if data are not missing completely at random [ ]. Similarly, a principal component analysis (PCA) approach could be to apply the inverse transformation on the reduced dimensions of the original data to restore lost features, but the downside is that PCA can only capture linear relationships. Another approach might be training a separate model for each modality, where the decision can be made on the basis of majority voting from the available signals. However, in this scheme, the distinct classifiers will fail to learn correlations that may exist between different sensory modalities. Besides, this approach is inefficient, as we have to train and manage a separate classifier for every modality available in the dataset.
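For concreteness, the two naive baselines can be sketched in a few lines of NumPy; the toy matrix and the NaN missing-value marker are illustrative, not taken from the dataset:

```python
import numpy as np

# Toy feature matrix with NaN marking missing values (hypothetical data).
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Mean substitution: replace each missing entry with its column mean,
# computed over the observed values only.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Listwise deletion: discard every instance with at least one missing value.
X_drop = X[~np.isnan(X).any(axis=1)]

print(X_mean)
print(X_drop.shape)  # (1, 3) -- only the fully observed row survives
```

Both baselines discard information: mean substitution ignores correlations between features, and listwise deletion here throws away two of the three (labeled) instances.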
An autoencoder is an unsupervised representation learning algorithm that reconstructs its own input, usually from a noisy version, which can be seen as a form of regularization to avoid over-fitting [ ]. Generally, the input is corrupted by adding Gaussian noise, applying dropout [ ], or randomly masking features as zeros [ ]. The model is then trained to learn a latent representation that is robust to corruption and can reproduce clean samples from partially destroyed features. Therefore, denoising autoencoders can be utilized to tackle reconstruction, while learning discriminative representations for an end task, e.g., context classification. Furthermore, the adversarial autoencoder (AAE) extends a typical autoencoder to make it a generative model that is able to produce synthetic data points by sampling from an arbitrarily chosen prior distribution. Here, a model is trained with dual losses: a reconstruction objective and an adversarial criterion to match the hidden code produced by the encoder to some prior distribution [ ]. The decoder then acts as a deep generative model that maps the enforced prior distribution to the data distribution. We address the issues of missing data and augmenting synthetic samples with an AAE [15].
In this paper, we present a framework based on AAE to reconstruct features that are likely to go
missing all at once (as they are extracted from the same modality) and augment samples to enable
synthetic data generation (see Figure 1). We demonstrate the representation learning capability of
AAE through accurate reconstruction of missing values and supervised multi-label classification of
behavioral context. In particular, we show that the AAE is able to provide a more faithful imputation compared to techniques such as PCA, and that it achieves strong predictive performance even when several modalities are missing. We analyze the performance of the decoder trained with supervision, which enables the model to generate class-conditional artificial training data. Further, we show that the AAE can be extended
with additional layers to perform classification; hence leveraging the complete dataset including
labeled and unlabeled instances. The primary contributions of this work are the following:
- Demonstration of a method to restore missing sensory modalities using an adversarial autoencoder.
- Systematic comparison with other techniques to impute lost data.
- Leveraging learned embedding and extending the autoencoder for multi-label context recognition.
- Generating synthetic multimodal data and its empirical evaluation through visual fidelity of samples and classification performance on a real test set.
We first address previous work on using autoencoders for representation learning and handling
missing data in Section 2. Section 3 briefly reviews the large-scale real-world dataset utilized in this
work for activity and context recognition. We then describe a practical methodology for restoring
lost sensor data, generating synthetic samples, and learning a context classifier based on adversarial
autoencoder in Section 4. In Section 5, we systematically examine the effect of missing modalities on
context recognition model, show a comparison of different techniques, and evaluate the quality of
synthetic data. We then provide a discussion of the results, highlight potential limitations and future
improvements, followed by conclusions in Section 6.
Figure 1. Overview of the proposed framework for robust context classification with missing sensory modalities: noise is added to the sensory features to emulate missing sensors before multi-label classification of the user context.
2. Related Work
Previous work on behavior context recognition has evaluated fusing single-sensor classifiers [ ] to handle missing input data, in addition to utilizing different combinations of sensors to develop models for each group [ ]. However, these methods do not scale well to many sensors and may fail to learn correlations that exist between different modalities. Furthermore, restoration of missing features with imputation methods remains a non-trivial task, as most procedures fail to account for uncertainty in the process. In the past, autoencoders have been successfully used for unsupervised feature learning in several domains thanks to their ability to learn complex, sparse and non-linear features [ ]. To put this work into context, we review contemporary approaches to leveraging autoencoders for representation learning and handling missing input data.
Recent methods [ ] on ubiquitous activity detection have effectively used the restricted Boltzmann machine, denoising and stacked autoencoders to get compressed feature representations that are useful for activity classification. These methods performed significantly better at learning discriminative latent representations from (partial) noisy input, which is not readily possible with traditional approaches. To the best of our knowledge, no earlier work in the activity recognition domain explicitly addresses the missing sensors problem except [ ], which utilizes dropout [ ] for this purpose. Nevertheless, several works in different areas have used autoencoders to interpolate missing data [ ]. Thompson et al. [ ] used a contractive autoencoder for the restoration of missing sensor values and showed that it is generally an inexpensive procedure for most data types. Similarly, Nelwamondo et al. [ ] studied the combination of an autoencoder and a genetic algorithm for the approximation of missing data that have inherent non-linear relationships.
In the bioinformatics and healthcare communities, denoising autoencoders (DAE) have been used to learn from imperfect data sources. Li et al. [ ] used a DAE for pre-training and decoding incomplete electroencephalography to predict motor imagery classes. Likewise, Miotto et al. [ ] applied a DAE to produce compressed embeddings of patients' characteristics from a very large and noisy set of electronic health records. Their results showed major improvements over alternative feature learning strategies (such as PCA) for clinical prediction tasks. Furthermore, Beaulieu-Jones [ ] systematically compared multiple imputation strategies with deep autoencoders on a clinical trial database and showed strong performance gains in disease progression predictions.

Autoencoders are also extensively used in affective computing to advance emotion recognition systems. Martinez et al. [ ] applied a convolutional autoencoder on raw physiological signals to extract salient features for affect modeling of game players. In [ ], autoencoders are utilized with transfer learning and domain adaptation for disentangling emotions in speech. Similarly, Jaques et al. [ ] developed a multimodal autoencoder for filling in missing sensor data for mood prediction in a real-world setting. Furthermore, DAE has been effectively demonstrated for rating prediction tasks in recommendation systems [34].
Generative adversarial networks (GANs) [ ] as a framework have shown tremendous power to produce realistic-looking data samples, particularly images. They have also been successfully applied in the natural language processing domain to generate sequential data with a focus on discrete tokens [ ]. Recently, they have also been used in medical domains to produce electronic health records [ ] and time-series data from an intensive care unit [ ]. Makhzani et al. [ ] combined classical autoencoders with GANs through the incorporation of an adversarial loss to make them generative models. This makes the AAE a suitable candidate for learning to reconstruct and synthesize with a unified model. However, to the best of our knowledge, no previous work has utilized them for synthesizing features extracted from multimodal time-series, specifically for context and activity recognition. Hence, models capable of successful reconstruction and generation of synthetic samples can help overcome the issues of noisy and imbalanced data, and of restricted access to data (due to its sensitive nature), which ultimately helps downstream models become more robust.
Our work is broadly inspired by efforts to jointly learn from multimodal data sources, and it is similar to [ ] in the applied training strategy, though it utilizes an AAE for reconstruction, augmentation, and multi-label behavior context recognition. Besides, as opposed to [ ], where a feed-forward classification model is directly trained with dropout [ ] to handle missing modalities, here the model first learns to reconstruct the missing features by employing both dropout and structured noise (see Section 4.4). Then, we extend this model with additional layers for multi-label classification, either by directly exploiting the encoder or by training a network from scratch on the learned embedding. In this manner, the AAE-based network will not just be able to reconstruct and classify, but it can also be used for class-conditional data augmentation.
3. ExtraSensory Dataset
We seek to learn a representation of context and activities by leveraging massive amounts of
multimodal signals collected using smartphones and wearables. While there are a variety of open
datasets available on the web, we choose to use the ExtraSensory Dataset [ ] because it was collected in a
real-world environment when participants were busy with their daily routines. It provides a more
realistic picture of a person’s life as compared to a scripted lab data collection which constrains users
to a few basic activities. A system developed with data collected in lab settings fails to capture
intrinsic behaviors in everyday in-the-wild conditions. The data collection protocol is described in detail in [ ], and we provide a brief summary in this section. The data is collected from sixty
users with their personal devices using specifically designed applications for Android, iPhone,
and Pebble-watch unit. Every minute an app collected 20 s of measurements from multiple sensors and
asked the user to provide multiple labels that define their environment, behavior, and activities
from a selection of 100 contextual labels. In total, the dataset consists of 300,000+ labeled and
unlabeled measurements of various heterogeneous sensors. We utilize pre-computed features from six
modalities: phone-accelerometer (Acc), phone-gyroscope (Gyro), phone-audio (Aud), phone-location
(Loc), phone-state (PS), and watch-accelerometer (WAcc). Among Loc features, we only use quick
location features (such as user movement) and discard absolute location as it is place specific. By adding
features from each sensing modality, we end up with 166 features, where we utilize 51 processed labels
provided in the original dataset.
This dataset also naturally highlights the inevitable problem of missing data in real-world studies.
For instance, the participants turned off the location service to avoid battery drain, did not wear
the smartwatch continuously and sensor malfunction or other factors resulted in missing samples.
In this case, even though labels and signals from other modalities are available, instances with missing features cannot be directly used to train a classifier or to make a prediction in a production setting. This either requires imputation or leads to the discarding of expensive-to-obtain labeled data. Of the 300 k+ instances in the dataset, approximately half have all the features available; the rest, even though labeled, cannot be utilized due to missing values. Therefore, an efficient
technique is required to approximate missing data and prevent valuable information from going to
waste when learning a context classifier. Similarly, data collected in-the-wild often have imperfect and imbalanced classes, as some of the labels occur only a few times. This can also be attributed to differences between participants' routines or their privacy concerns, as some classes are entirely missing from their datasets. Hence, learning from imbalanced classes in a principled way becomes crucial to correctly identify true positives. In summary, the ExtraSensory Dataset highlights several challenges for context recognition in real-life conditions, including complex behavioral activities, unrestrained personal device usage, and natural environments with habitual routines.
4. Methodology
4.1. Autoencoder
An autoencoder is an unsupervised representation learning technique in which a deep neural network is trained to reconstruct its own input x such that the difference between x and the network's output x′ is minimized. Briefly, it performs two transformations through deterministic mapping functions, namely, an encoder and a decoder. The encoder transforms the input vector x to a latent code z, whereas the decoder maps the latent representation z back to produce an approximation x′ of x. For a single-layer neural network these functions can be written as:

$f_{\theta}(x) : z = \sigma(W_e x + b_e)$, (1)

$g_{\theta'}(z) : x' = \sigma(W_d z + b_d)$, (2)

parameterized by $\theta = \{W_e, b_e\}$ and $\theta' = \{W_d, b_d\}$, where $\sigma$ is a non-linear activation function (e.g., rectified linear unit), W represents a weight coefficient matrix and b is a bias vector. The model weights are sometimes tied for regularization such that $W_d = W_e^{T}$. Figure 2 provides a graphical illustration of an autoencoder.
Figure 2. Illustration of an autoencoder network.
Learning an autoencoder is an effective approach to perform dimensionality reduction and can be thought of as a strict generalization of PCA. Specifically, a 1-layer encoder with linear activation and mean squared error (MSE) loss (see Equation (3)) should be able to learn the PCA transformation [ ]. Nonetheless, deep models with several hidden layers and non-linear activation functions can learn better high-level and disentangled features from the original input data.

$\mathcal{L}_{MSE}(X, X') = \|X - X'\|^{2}$. (3)
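As a sketch of Equations (1)-(3), a single forward pass of a tied-weight autoencoder can be written in NumPy; the sizes mirror the paper's 166 input features and 128 hidden units, while the random weights are placeholders for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a):
    # Rectified linear unit as the non-linear activation.
    return np.maximum(a, 0.0)

D, H = 166, 128                      # input and latent dimensions
We = rng.normal(scale=0.01, size=(H, D))
be = np.zeros(H)
Wd, bd = We.T, np.zeros(D)           # tied weights: Wd = We^T

x = rng.random(D)                    # one normalized feature vector in [0, 1)
z = sigma(We @ x + be)               # encoder f_theta(x), Eq. (1)
x_hat = sigma(Wd @ z + bd)           # decoder g_theta'(z), Eq. (2)
mse = np.sum((x - x_hat) ** 2)       # reconstruction loss, Eq. (3)
print(z.shape, x_hat.shape)
```

With a linear activation in place of the ReLU and the MSE loss, minimizing this objective recovers a PCA-like projection, as noted above.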
The classical autoencoder can be extended in several ways (see [ ] for a review). For handling missing input data, a compelling strategy is to train an autoencoder with artificially corrupted input, which acts as an implicit regularization. Usually, the considered corruption includes isotropic Gaussian noise, salt-and-pepper noise and masking (setting randomly chosen features to zero) [ ]. In this case, the network learns to reconstruct a noise-free version x from its corrupted counterpart x̃. Formally, the DAE is trained with stochastic gradient descent to optimize the following objective function:

$J_{DAE} = \min_{\theta, \theta'} \mathbb{E}_{X}[\mathcal{L}(x, g_{\theta'}(f_{\theta}(\tilde{x})))]$, (4)

where $\mathcal{L}$ represents a loss function like squared error or binary cross-entropy.
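A minimal sketch of the masking corruption used to produce x̃ for a DAE; the 25% corruption fraction here is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_corrupt(x, frac=0.25, rng=rng):
    """Masking noise: set a random fraction of the features to zero,
    producing the corrupted input x-tilde for a denoising autoencoder."""
    x_tilde = x.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

x = rng.random(166)          # clean feature vector (the DAE target)
x_tilde = mask_corrupt(x)    # corrupted network input
print(np.count_nonzero(x_tilde == 0.0))
```

During training, x̃ is fed to the encoder while the loss in Equation (4) is computed against the clean x.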
4.2. Adversarial Autoencoder
The adversarial autoencoder (AAE) [ ] combines adversarial learning [ ] with classical autoencoders so it can be used both for learning data embeddings and for generating synthetic samples. The generative adversarial network (GAN) introduced a novel framework for developing generative models by simultaneously training two networks: (a) the generator G, which learns the training instances' distribution to produce new samples emulating the original samples; and (b) the discriminator network D, which differentiates between original and generated samples. Hence, this formulation can be seen as a minimax game between G and D, as shown in Equation (5), where z represents a randomly sampled vector from a certain distribution p(z) (e.g., Gaussian), and x is a sample from the empirical data distribution $p_{data}(x)$, i.e., the training data.

$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$. (5)
In the AAE, an additional discriminator network is added to an existing autoencoder (see Figure 2) architecture to force the encoder output z to match a specific target distribution p(z), as depicted in Figure 3, hence enabling the decoder to act as a generative model. Its training procedure consists of three sequential steps:

1. The encoder and decoder networks are trained simultaneously to minimize the reconstruction objective (see Equation (6)). Additionally, the class label information can be provided to the decoder together with the latent code z as supervision. The decoder then uses both z and the label information to reconstruct the input. In addition, conditioning over y enables the decoder to produce class-conditional samples.

$J_{AE} = \min_{\theta, \theta'} \mathbb{E}_{X}[\mathcal{L}(x, g_{\theta'}(f_{\theta}(x)))]$. (6)

2. The discriminator network is then trained to distinguish between true samples from a prior distribution and fake data points z generated by the encoder.

3. Subsequently, the encoder, whose goal is to deceive the discriminator by minimizing a separate loss function, is updated.
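The three losses behind these steps can be sketched with stub linear networks in NumPy; the weights are random placeholders, so this only illustrates how each loss is computed, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, H = 166, 10                       # feature and (assumed) latent sizes
We = rng.normal(scale=0.1, size=(H, D))   # stub encoder
Wd = rng.normal(scale=0.1, size=(D, H))   # stub decoder
wD = rng.normal(scale=0.1, size=H)        # stub linear discriminator

x = rng.random(D)
z_fake = relu(We @ x)                # encoder code q(z): negative sample
z_real = rng.normal(size=H)          # draw from the Gaussian prior p(z)

# Step 1: reconstruction loss for encoder + decoder (Eq. 6).
x_hat = sigmoid(Wd @ z_fake)
recon_loss = np.mean((x - x_hat) ** 2)

# Step 2: discriminator loss -- prior samples vs. encoder codes.
d_real, d_fake = sigmoid(wD @ z_real), sigmoid(wD @ z_fake)
disc_loss = -np.log(d_real) - np.log(1.0 - d_fake)

# Step 3: generator (encoder) loss -- fool the discriminator.
gen_loss = -np.log(d_fake)
print(recon_loss, disc_loss, gen_loss)
```

In an actual implementation each step would backpropagate its loss into the corresponding sub-network before moving to the next step.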
4.3. Context Classification
The context recognition task under consideration is a multi-label classification problem, where a user's context at any particular time can be described by a combination of various labels. For instance, a person might be in a meeting, indoors, and with a phone on a table. Formally, it can be defined as follows: X (i.e., a design matrix) is a set of N instances, each being a D-dimensional feature vector with a set of labels L. Every instance vector $x_n$ has a corresponding subset of L, also called its relevant labels; other labels might be missing or can be considered irrelevant for the particular example [ ]. The goal of the learner is to find a mapping function $f_c : x_n \rightarrow \{0, 1\}^{|L|}$ that assigns labels to an instance. Alternatively, the model predicts a one-hot encoded vector $y \in \{0, 1\}^{|L|}$, where 1 (for each element in y) indicates the label is suitable and 0 represents inapplicability.
Figure 3. An adversarial autoencoder network [15]: samples drawn from the Gaussian prior p(z) are given to the discriminator as positive examples and the encoder codes q(z) as negative examples, yielding the adversarial loss.
A feed-forward neural network can be directly used for multi-label classification with a sigmoid activation function in the last layer and a binary cross-entropy loss (see Equation (7)), as it is assumed that each label has an equal probability of being selected independently of the others. Thus, the binary predictions are acquired by thresholding the continuous output at 0.5.

$\mathcal{L}_{CE}(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$. (7)
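A small NumPy sketch of this prediction rule; the logits and labels are made up for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y_hat, y):
    # Eq. (7), averaged over the label dimension.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical logits for 5 of the 51 context labels of one instance.
logits = np.array([2.0, -1.5, 0.3, -3.0, 1.2])
y_true = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

probs = sigmoid(logits)              # independent per-label probabilities
preds = (probs >= 0.5).astype(int)   # threshold continuous outputs at 0.5
loss = binary_cross_entropy(probs, y_true)
print(preds)  # [1 0 1 0 1]
```

Each label is scored independently, so several labels can be active at once, which is exactly what the multi-label setting requires.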
As mentioned earlier, in real-world datasets the available contextual labels for each instance can be very sparse (i.e., few 1s). This may happen because, during the data collection phase, a user might quickly select a few relevant labels and overlook or intentionally not provide other related labels about the context. In such a setting, simply treating the absence of a label as irrelevance may introduce bias in the model, and discarding instances without complete label information limits the opportunity to learn from the available states. Moreover, the positive labels could be very few, with a large number of negatives, resulting in an imbalanced dataset. To tackle these issues, we employ an instance weighting strategy similar to [ ] while learning a multi-label classifier. In this situation the objective function becomes:

$J = \min \sum_{i=1}^{N} \sum_{c=1}^{C} \Psi_{i,c} \cdot \mathcal{L}_{CE}(\hat{y}_{i,c}, y_{i,c})$, (8)

where $\mathcal{L}_{CE}$ is the binary cross-entropy loss, and $\Psi$ is an instance-weighting matrix of size N × C (i.e., the number of training examples and the total number of labels, respectively). The instance weights in $\Psi$ are assigned by inverse class frequency. The entries for the missing labels are set to zero, to impose no contribution to the overall cost from such examples.
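The weighting scheme can be sketched as follows; marking missing labels with −1 and the exact inverse-frequency formula are assumptions of this sketch, not necessarily the paper's implementation:

```python
import numpy as np

eps = 1e-7

# Toy label matrix for 4 instances x 3 labels; -1 marks a missing label.
Y = np.array([[1, 0, -1],
              [1, 1, 0],
              [0, -1, 1],
              [1, 0, 0]])

# Inverse-class-frequency weights per label, over observed entries only:
# positives get n_obs / n_pos, negatives n_obs / n_neg, missing get 0.
W = np.zeros(Y.shape, dtype=float)
for c in range(Y.shape[1]):
    obs = Y[:, c] != -1
    pos = (Y[:, c] == 1).sum()
    neg = (Y[:, c] == 0).sum()
    W[obs & (Y[:, c] == 1), c] = obs.sum() / max(pos, 1)
    W[obs & (Y[:, c] == 0), c] = obs.sum() / max(neg, 1)

# Weighted multi-label cross-entropy (Eq. 8) for some dummy predictions.
Y_hat = np.full(Y.shape, 0.5)
Y01 = np.clip(Y, 0, 1)  # missing entries contribute 0 through W anyway
ce = -(Y01 * np.log(Y_hat + eps) + (1 - Y01) * np.log(1 - Y_hat + eps))
loss = np.sum(W * ce)
print(W[0, 2], loss)  # the missing label carries zero weight
```

Because missing entries receive zero weight, an instance with partial labels still contributes to the loss through its observed labels only.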
4.4. Model Architecture and Training
The multimodal AAE is developed to alleviate two problems: (a) the likely issue of losing features of the same modality all at once; and (b) synthesizing new labeled samples to increase the training dataset size. Such data augmentation might help to resolve imbalance (in addition to instance weighting), facilitate a better understanding of the modeling process, and enable data sharing when the original dataset cannot be distributed directly, e.g., due to privacy concerns.
We start the model training process by normalizing continuous features to the range [0, 1] using summary statistics calculated from the training set. Next, all the missing features are filled in with a particular value, i.e., −1. It is essential to represent missing data with a distinct value that could not occur in the original data. After this minimal pre-processing, a model is trained to reconstruct and synthesize from the clean samples (with all the features available) to provide noise-free ground truth. During reconstruction training, each feature vector x is corrupted with structured noise [ ] to get a corrupted version x̃ as follows: (1) masking noise is added to a randomly selected 5% of the features; (2) all the features from three or more randomly chosen modalities are set to −1, hence emulating missing data; and (3) dropout is applied. The goal of the autoencoder is then to reproduce the clean feature vector x from the noisy version x̃, or in other words, to predict reasonably close values of the missing features from the available ones. For example, the model may use the accelerometer signal from the phone to interpolate the smartwatch's accelerometer features, or phone state and accelerometer features to approximate location features. Furthermore, for synthesizing novel (class-conditional) samples, an independent supervised AAE model is trained without introducing any noise in the input and with a slightly different architecture.
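A sketch of the structured-noise corruption; the modality-to-feature-index layout is hypothetical (the real split of the 166 features differs), exactly three modalities are dropped, and dropout is applied before the modality drop so the −1 markers survive:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical feature-index layout for the six modalities.
modalities = {"Acc": range(0, 26), "Gyro": range(26, 52),
              "WAcc": range(52, 98), "Loc": range(98, 115),
              "Aud": range(115, 141), "PS": range(141, 166)}

def structured_noise(x, rng=rng):
    x_t = x.copy()
    # (1) masking noise on a randomly selected 5% of the features
    idx = rng.choice(x.size, size=int(0.05 * x.size), replace=False)
    x_t[idx] = 0.0
    # (3) input dropout with rate 0.2 (moved before the modality drop
    # in this sketch so the -1 markers are not zeroed out afterwards)
    x_t *= rng.random(x.size) >= 0.2
    # (2) set all features of three randomly chosen modalities to -1,
    # emulating sensors that go missing together
    for name in rng.choice(list(modalities), size=3, replace=False):
        x_t[list(modalities[name])] = -1.0
    return x_t

x = rng.random(166)            # clean, normalized feature vector
x_tilde = structured_noise(x)  # corrupted input for reconstruction training
print(x_tilde.shape)
```

The autoencoder is then trained to map x̃ back to the clean x, so at test time a genuinely missing modality (marked with −1) can be filled in from the remaining sensors.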
After training the AAE model with clean examples for which all sensory modalities are available, it can be extended for multi-label classification. In this situation, either a separate network is learned or additional layers are connected to the encoder network to classify a user's behavioral context (see Figure 4). For the latter, the error is backpropagated through the full network, including the encoder and classification layers. Moreover, during the classifier training phase, we keep adding noise to the input as mentioned earlier. To leverage the entire dataset for classification, the noisy features are first reconstructed with the learned autoencoder model and combined with the clean data. The class weights are calculated from the combined training set (see Section 4.3), where zero weight is assigned to missing labels. Thus, this formulation allows us to learn from any combination of noisy, clean, labeled and unlabeled data.
Figure 4. Illustration of an adversarial autoencoder (AAE) based classification network: a three-layer encoder with 128 hidden units per layer is extended with additional classification layers of 64 and 51 hidden units.
We employ binary cross-entropy (see Equation (7)) for the reconstruction loss rather than MSE, as it led to consistently better results in earlier exploration. Since cross-entropy deals with binary values, all the features are first normalized to lie between zero and one, as mentioned earlier. We train the reconstruction network in an unsupervised manner, while the synthesizing model is provided with supervision through the decoder network as a one-hot encoded vector y of class labels. The missing labels are simply represented with zeros instead of −1, as we wanted to utilize both labeled and unlabeled instances. The supervision of the decoder network also allows the model to better shape the distribution of the hidden code by disentangling label information from the compressed representation [ ]. Likewise, samples from a Gaussian distribution are provided to the discriminator network as positive examples and the hidden code z as negative examples, to align the aggregated posterior with the prior distribution.
To assess the robustness of our approach for filling in lost sensor features, we compared it with PCA reconstruction by applying the inverse transformation to the reduced 75-dimensional principal components vector. In addition, we evaluated multi-label classification performance by utilizing the learned embedding, training an extended network on top of the encoder, and comparing them with four different ways of dealing with the missing data: mean substitution, filling in with the median, replacing missing values with −1, and using a dimensionality reduction method, i.e., PCA. To facilitate a fair comparison, we limit the reduction of the original 166 features to a 75-dimensional feature vector, which allows PCA to capture 98% of the variance. We also experimented with a standard DAE model but found it to perform similarly to the AAE for feature reconstruction.
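The PCA baseline can be sketched with an SVD in NumPy; the random matrix stands in for the clean training features, and the lost modality is pre-filled with the training mean before the inverse transformation:

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, k = 200, 166, 75          # instances, features, kept components
X = rng.random((n, d))          # stand-in for the clean training features

# Fit PCA via SVD of the centered training matrix.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
components = Vt[:k]             # top-75 principal directions

def pca_reconstruct(x):
    """Project onto the 75 components, then inverse-transform back to
    the original feature space -- the PCA imputation baseline."""
    scores = (x - mu) @ components.T
    return scores @ components + mu

x = X[0].copy()
x[:26] = mu[:26]                # pretend one modality is lost; mean-fill it
x_rec = pca_reconstruct(x)      # linear estimate of the full feature vector
print(x_rec.shape)
```

Because the projection is linear, the reconstructed modality is a linear function of the remaining features, which is exactly the limitation the non-linear AAE is meant to overcome.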
The visual fidelity of the samples and a supervised classification task are used to examine the quality of the synthetic samples produced by the (decoder) generative model. We train a context classification model on synthetic data and evaluate its performance on the held-out real test set, and vice versa. Because the decoder is trained with supervision, it enables us to generate class-conditional samples. For generating labeled data, we use labels from the (real) training set and feed them together with Gaussian noise into the decoder. Another strategy for data augmentation could be to first sample class labels and then use those for producing synthetic features. However, as we are dealing with multi-label classification, where labels jointly explain the user's context, arbitrarily sampling them is not feasible, as it may lead to inconsistent behaviors and activities (such as sleeping during running). Therefore, we straightforwardly utilize the clean training set labels to sample synthetic data points.
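A sketch of this class-conditional sampling with a stub decoder; the random weights stand in for the trained supervised decoder, and the label vector is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

latent, n_labels, n_features = 10, 51, 166
# Stub decoder weights -- a trained AAE decoder would be used here.
W = rng.normal(scale=0.1, size=(n_features, latent + n_labels))

def sample_conditional(y, rng=rng):
    """Draw z from the Gaussian prior and decode it together with the
    label vector y to obtain one synthetic feature vector."""
    z = rng.normal(size=latent)
    return sigmoid(W @ np.concatenate([z, y]))

# Reuse a plausible multi-label vector from the real training set.
y_real = np.zeros(n_labels)
y_real[[3, 17, 40]] = 1.0
x_synth = sample_conditional(y_real)
print(x_synth.shape)
```

Reusing real label vectors keeps the label combinations internally consistent, while the Gaussian z supplies the variation between synthetic samples.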
4.5. Implementation
Our approach is implemented in Tensorflow [
]. We initialized the weights with Xavier [
technique and biases with zeros. We use Adam [
] optimizer with fixed but different learning
rates for reconstruction and synthesizing models. For the former, the learning rates of 3
, 5
and 5
are used for adversarial and reconstruction and classification losses, respectively. While
in the latter, 1
, 1
and 5
are used for reconstruction, adversarial and classification losses,
respectively. We employ
2-regularization on encoder’s and classifier ’s weights with a rate of 1
The rest of the hyper-parameters are minimally tuned on the (internal) validation set, created by dividing the training folds' data in a ratio of 80–20, to discover an architecture that gives optimal performance across users. The suitable configuration of the reconstruction network is found to be a 3-layer encoder and decoder with 128 hidden units in each layer and dropout [12] with a rate of 0.2 on the input layer.
The classification network contains a single hidden layer with 64 units. Similarly, the synthesizing model contains 2 hidden layers with 128 and 10 units, and a dropout of 0.2 is applied on the encoding layer. However, during sampling from the decoder network, we apply dropout with a rate of 0.75. The LeakyReLU activation is used in all the layers, except for the classifier trained on synthetic data, where ReLU performed better. Moreover, we also experimented with several batch sizes and found 64 to produce optimal results. We train the models for a maximum of 30 epochs and utilize early stopping to save the model based on internal validation set performance.
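The reconstruction network's shape can be sketched with a plain NumPy forward pass. This stands in for the TensorFlow implementation; the Xavier-style initialization scale, the LeakyReLU slope of 0.2, and applying the activation on the output layer are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
leaky_relu = lambda x, a=0.2: np.where(x > 0, x, a * x)

def make_mlp(sizes):
    # Xavier-style initialization: scale by sqrt(2 / (fan_in + fan_out)).
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / (i + o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    for W, b in layers:
        x = leaky_relu(x @ W + b)
    return x

# Reconstruction network: 3-layer encoder and decoder, 128 hidden units each.
encoder = make_mlp([166, 128, 128, 128])
decoder = make_mlp([128, 128, 128, 166])

x = rng.standard_normal((4, 166))
x = x * (rng.random(x.shape) > 0.2)   # dropout (rate 0.2) on the input layer
recon = forward(decoder, forward(encoder, x))
print(recon.shape)  # (4, 166)
```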
4.6. Performance Evaluation
We evaluate reconstruction and classification performance through five-fold cross-validation, where each fold has 48 users for training and 12 users for testing, with the same folds as in [8]. The cross-validation technique is used to show the robustness of our approach when the entire data of held-out users is used as the test set. For hyper-parameter optimization in this setting, we randomly divide a training set into an 80% training and 20% internal validation set. The same approach is employed to evaluate the quality of the synthetic data points via a supervised classification task. Figure 5 depicts the data division for the imputation and classification experiments.
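The user-held-out splitting can be sketched as follows; performing the 80–20 internal split at the user level (rather than over instances) is a simplifying assumption here:

```python
import numpy as np

rng = np.random.default_rng(7)
users = np.arange(60)                              # 60 participants in total
folds = np.array_split(rng.permutation(users), 5)  # 5 disjoint groups of 12

splits = []
for test_users in folds:
    train_users = np.setdiff1d(users, test_users)  # 48 users for training
    # 80/20 internal split of the training users for hyper-parameter tuning.
    val_users = rng.choice(train_users, size=len(train_users) // 5, replace=False)
    splits.append((np.setdiff1d(train_users, val_users), val_users, test_users))

print([(len(tr), len(va), len(te)) for tr, va, te in splits])  # five (39, 9, 12) triples
```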
The entire dataset is first split into clean and noisy parts, where the clean data is used for training and for measuring the performance of restoring missing features, as described in Section 4.4. The noisy data is then imputed using a learned model and combined with the clean version for the context classification task. However, we use only clean data to train and evaluate the synthesizing model; the artificial data generated by the AAE is used to train a classifier and its performance is evaluated on real test (folds) data.
Sensors 2018, 18, 2967 10 of 20
Figure 5. Data split for reconstruction, synthesizing, and classification experiments.
The performance of approximating missing data is measured with the root mean square error (RMSE) as:

RMSE = √(E[(X − X̃)²]).    (9)
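Equation (9) translates directly into code; the optional mask argument is an added convenience (not part of the equation) that restricts the error to the entries that were actually dropped and reconstructed:

```python
import numpy as np

def rmse(x_true, x_recon, mask=None):
    """Root mean square error of Eq. (9); if a boolean mask of the
    dropped entries is given, score only the reconstructed features."""
    err = (x_true - x_recon) ** 2
    if mask is not None:
        err = err[mask]
    return float(np.sqrt(err.mean()))

x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 1.0], [3.0, 5.0]])
print(rmse(x, x_hat))  # 0.7071... (the square root of 0.5)
```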
The multi-label classification is evaluated through balanced accuracy (BA), derived from sensitivity (or recall) and specificity (or true negative rate) as shown in Equation (10). BA is a more robust and fair measure of performance for imbalanced data, as it is not sensitive to class skew, unlike average accuracy, precision, and F-score, which can over- or under-emphasize the rare labels [ ]. It is important to note that the evaluation metrics are calculated independently for each of the 51 labels and averaged afterwards.
Sensitivity = tp / (tp + fn),
Specificity = tn / (tn + fp),
Balanced Accuracy = (Sensitivity + Specificity) / 2.    (10)
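A direct implementation of the per-label averaging described above; the zero-division guards are an added convention for labels that never occur (or always occur) in a fold:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Per-label sensitivity and specificity of Eq. (10), computed
    independently for each label column and then averaged."""
    bas = []
    for k in range(y_true.shape[1]):
        t, p = y_true[:, k], y_pred[:, k]
        tp = np.sum((t == 1) & (p == 1)); fn = np.sum((t == 1) & (p == 0))
        tn = np.sum((t == 0) & (p == 0)); fp = np.sum((t == 0) & (p == 1))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        bas.append((sens + spec) / 2)
    return float(np.mean(bas))

y  = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
yp = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])
print(balanced_accuracy(y, yp))  # 0.875
```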
5. Experimental Results
5.1. Modality Reconstruction
We first seek to validate the capability of the AAE network to restore the missing modalities. It is evaluated in comparison with PCA reconstruction, which is achieved by projecting the original 166 features onto a lower-dimensional space (a feature vector of length 75) and then applying an inverse transformation to return to the original data space. PCA captures 98% of the variance in the clean training data and thus sets a reasonably strong baseline. However, the AAE network trained with structured noise significantly outperformed the PCA reconstruction, achieving an average RMSE of 0.227 compared with 0.937 on the clean subset of the test folds. To assess the reconstruction of all the features of each data source, the entire modality is dropped and restored with both procedures. Table 1 provides the RMSE averaged across folds and the number of features for each modality used from the original dataset. Apart from the location features, the AAE network outperforms PCA on the reconstruction of every modality. For the gyroscope, we noticed a performance drop on the test set of fold 4, which can be due to a relatively small number of instances from the participants in that testing fold. The comparatively lower performance on the phone state can be attributed to these features being binary, which cannot be perfectly approximated with continuous functions.
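The PCA baseline can be sketched as follows, using a NumPy SVD rather than a library PCA; the random data and the assumption that the first 26 columns form the dropped accelerometer block are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def pca_fit(X, k=75):
    """Fit PCA on the clean training data: mean plus top-k directions."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def pca_reconstruct(X, mu, Vk):
    # Project onto the 75-d subspace, then invert back to the 166-d space.
    return (X - mu) @ Vk.T @ Vk + mu

X_clean = rng.standard_normal((200, 166))   # stand-in for the clean training set
mu, Vk = pca_fit(X_clean)

X_test = X_clean[:5].copy()
X_test[:, :26] = -1.0                       # drop a 26-feature block, masked with -1
X_hat = pca_reconstruct(X_test, mu, Vk)
print(X_hat.shape)  # (5, 166)
```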
The AAE is able to learn compressed non-linear representations that are sufficient to capture the correlations between different features. Hence, it provides a close approximation of the features of the lost modality by leveraging the available signals. Figure 6 illustrates this point: an accelerometer signal (from the phone) is dropped (mimicking a missing signal) and all of its 26 features are reconstructed by leveraging the rest of the modalities. The AAE network predicts very realistic values for the missing features, which are masked with the special value −1. On the contrary, the PCA restoration is stuck around values near zero, failing to capture the feature variance. We think this could be because PCA performs a linear transformation, while the features may have an inherent non-linear relationship that autoencoders can extract well. The difference between the considered methods is also apparent in Figure 7, for filling in the values of features extracted from an audio signal. Here, PCA fluctuates between zero and one, failing to recover the values, whereas the AAE largely recovers values close to the ground truth.
Table 1. Root mean square error (RMSE) for each modality averaged over 5-fold cross-validation.

Modality                    # of Features   PCA             AAE
Accelerometer (Acc)         26              1.104 ± 0.075   0.104 ± 0.016
Gyroscope (Gyro)            26              1.423 ± 0.967   0.686 ± 1.291
Watch Accelerometer (WAcc)  46              1.257 ± 0.007   0.147 ± 0.003
Location (Loc)              6               0.009 ± 0.003   0.009 ± 0.003
Audio (Aud)                 28              1.255 ± 0.015   0.080 ± 0.006
Phone State (PS)            34              0.578 ± 0.000   0.337 ± 0.011

PCA: principal component analysis.
Figure 6. Restoration of a (phone) accelerometer feature's values with the AAE and PCA. The entire modality is dropped and reconstructed using features from the remaining signals.
Figure 7. Restoration of an audio (MFCC) feature's values with the AAE and PCA. The entire modality is dropped and reconstructed using features from the remaining signals.
5.2. Classification with Adversarial Autoencoder Representations
In order to test the ability of the AAE to learn a latent code irrespective of missing modalities, we also performed classification experiments with the combined noisy and clean datasets. The feature vector is passed into the learned autoencoder to get a compressed representation of 128 dimensions. This embedding is used to train a 1-layer neural network and compared with other methods of missing-data imputation, such as filling with the mean, median, or −1, and a dimensionality reduction technique, i.e., PCA. Figure 8 provides results on various metrics for cross-validation using the considered procedures. We did not find a significant difference between the classifier trained on the embedding and the other methods. However, the recall (sensitivity) of the AAE is found to be better, though somewhat close to that of mean imputation. The results obtained here are in line with [29], which used an encoded representation for mood prediction and found no improvement. Similarly, in our case, the reason for the unchanged performance could be that a large part of the data is clean and the extracted features are based on extensive domain knowledge, making them highly discriminative. Nevertheless, the latent encoding acquired via the AAE can be seen as a privacy-preserving representation of otherwise sensitive personal data. Moreover, if an autoencoder is trained with recent advancements made in combining deep models with differential privacy [43], even stronger privacy guarantees can be provided.
Figure 8. Classification results of 5-fold cross-validation with combined clean and reconstructed noisy data. This resembles the situation when all the modalities are available during the learning and inference phases. We notice the AAE network performs better than the other techniques, with a high recall rate of 0.703. AC, BA, SN, and SP stand for accuracy, balanced accuracy, sensitivity, and specificity, respectively.
5.3. Context Recognition with Several Missing Modalities
For a better assessment of the AAE's capability to handle missing data, we simulated multiple scenarios in which several modalities are lost at once. These experiments reasonably mimic real-world situations for the classifier: a user may turn off the location service, forget to wear a smartwatch, or be taking a call (such that the audio modality is missing). Thus, as baselines, we employ the techniques for handling missing data through dimensionality reduction and imputation described earlier and train a classification model with the same configuration (see Section 4.4). The AAE model is extended by adding a classifier network on top of the encoder to directly predict the user context, as explained in Section 4.4.
We begin by investigating the effect of losing each of the six modalities, one by one, on the classification performance. Figure 9 summarizes the classification results obtained by utilizing the different techniques for handling missing features. The classifier learned by extending the AAE network consistently achieved superior performance compared to the others, as can be seen from the high BA and true positive rate.
Next, we experimented with dropping three important signals, i.e., Acc, Gyro, and Aud, at once. Figure 10 shows the results averaged across labels and testing folds when the entire feature vectors of the considered modalities are restored with each method. The simplest technique of filling in missing data with −1 performed poorly, with the lowest recall rate, and the same goes for PCA, which fails to restore the values. However, mean and median imputation performed moderately better than these two. The AAE achieved a better BA and recall rate of 0.710 and 0.700, respectively. It is important to note that the data is highly imbalanced, with few positive samples. Therefore, considering only naïve accuracy or the true negative rate provides an incomplete picture of the models' performance.
Moreover, to see the finer differences between the true positive rates of each technique, Figure 11 presents the recall rate for all 51 contextual labels. Overall, the AAE network showed superior results across the labels, highlighting its ability to handle noisy inputs very well.
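The classical imputation baselines used for comparison can be sketched as follows; using NaN as the missing marker is a convenience for this sketch, whereas the paper's pipeline marks missing features with −1:

```python
import numpy as np

def impute(X_train, X_miss, mask, strategy="mean"):
    """Fill the masked (missing) entries column-wise with statistics of
    the clean training data, or with the constant -1."""
    X = X_miss.copy()
    if strategy == "mean":
        fill = X_train.mean(axis=0)
    elif strategy == "median":
        fill = np.median(X_train, axis=0)
    else:  # the "fill -1" baseline
        fill = np.full(X_train.shape[1], -1.0)
    X[mask] = np.broadcast_to(fill, X.shape)[mask]
    return X

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_miss = np.array([[np.nan, 20.0]])
mask = np.isnan(X_miss)
print(impute(X_train, X_miss, mask))  # the NaN is replaced by the column mean, 2.0
```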
Figure 9. Average evaluation metrics for 51 contextual labels with 5-fold cross-validation. All the features from the corresponding modality are dropped and imputed with each of the considered techniques.
Figure 10. Average evaluation metrics for 51 contextual labels with 5-fold cross-validation. All the features from the Acc, Gyro, and Aud modalities are dropped and restored with a specific technique.
Next, we evaluated a scenario in which four modalities, namely Gyro, WAcc, Loc, and Aud, are missing together. These sensors have a high chance of being unavailable in real-world conditions. Table 2a provides the results of this experiment; as before, the traditional imputation procedures failed to correctly identify true positives. The AAE gracefully handles the missing values with a BA of 0.713 by learning important characteristics of the data distribution on the training set. Likewise, we tested another scenario with only WAcc, Loc, and Aud missing. Table 2b shows that the AAE maintained a BA of 0.723 even when nearly half of the features, from three important modalities, are missing. We further assessed the classifier's behavior in the case where a user does not give access to the location service and does not wear a smartwatch, i.e., WAcc and Loc are not available. Table 2c provides these results and indicates that mean/median imputation and the AAE show similar performance on the BA metric, but the AAE has the highest recall rate (0.704). This highlights the consistent predictive power of the AAE-based classification network for real-world context recognition applications.
Moreover, regardless of the number of missing modalities, the AAE performed better than the other classical ways of handling the lost data.
Figure 11. Recall for the 51 contextual labels with 5-fold cross-validation. All the features from the Acc, Gyro, and Aud modalities are dropped to emulate missing features and imputed with the different techniques to train a classifier.
5.4. Generating Realistic Multimodal Data
One of the key goals of this paper is to build a model capable of producing realistic data points, in particular features extracted from sensory data. To demonstrate the ability of the AAE to generate synthetic data, we evaluate its performance through visual fidelity and classification. The data generated by the AAE is used to train a classifier, which is then tested on real data instances. Similarly, a model is also trained on real data and evaluated on synthetic test data generated by the AAE. This requires the artificial data to have labels; we can provide these labels to the decoder (generator) as supervision, either by sampling them independently or by having an additional network (added to the AAE) predict them. Here, we utilize the former method, using training or test set labels to generate the data, as applicable. This evaluation is also more suitable than visual analysis, as it determines the ability of the synthetic data to be used in real applications. The results of the classification experiments are presented in Table 3, which compares the performance achieved for multi-label context recognition with real and artificial data. The model trained on synthetically generated data achieves results close to those of a model learned on the original data (BA of 0.715 vs. 0.752). Likewise, the performance remains good (BA of 0.700) when synthetic test data, generated using the test set labels and random noise, are assessed with a classifier learned on real samples.
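The three evaluation settings compared in Table 3 can be expressed as a small protocol. The nearest-centroid classifier and the tiny two-cluster data below are toy stand-ins for the 1-layer network and the real/synthetic feature sets:

```python
import numpy as np

def evaluate_protocols(train_real, test_real, train_syn, test_syn, fit, score):
    """The three settings of Table 3: Real (train and test on real data),
    TSTR (train on synthetic, test on real), TRTS (train on real, test
    on synthetic)."""
    return {
        "Real": score(fit(*train_real), *test_real),
        "TSTR": score(fit(*train_syn), *test_real),
        "TRTS": score(fit(*train_real), *test_syn),
    }

# Toy stand-in for the 1-layer network: a nearest-centroid classifier.
def fit(X, y):
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def score(model, X, y):
    c0, c1 = model
    pred = np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)
    return float((pred.astype(int) == y).mean())

X_r = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 4.0]])
y_r = np.array([0, 0, 1, 1])
X_s = X_r + 0.1  # "synthetic" data: a slightly perturbed copy of the real data
results = evaluate_protocols((X_r, y_r), (X_r, y_r), (X_s, y_r), (X_s, y_r), fit, score)
print(results)  # all three settings score 1.0 on this toy, well-separated data
```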
Table 2. Classification results for 5-fold cross-validation with different missing modalities that are restored with a specific method. The reported metrics are averaged over 51 labels; BA stands for balanced accuracy.

(a) Missing: Gyro, WAcc, Loc and Aud

          BA              Sensitivity     Specificity     Accuracy
AAE       0.713 ± 0.008   0.711 ± 0.021   0.716 ± 0.021   0.716 ± 0.024
PCA       0.526 ± 0.007   0.249 ± 0.040   0.802 ± 0.041   0.825 ± 0.034
Mean      0.669 ± 0.023   0.548 ± 0.056   0.791 ± 0.025   0.785 ± 0.022
Median    0.657 ± 0.015   0.502 ± 0.045   0.812 ± 0.022   0.808 ± 0.017
Fill −1   0.519 ± 0.004   0.175 ± 0.012   0.862 ± 0.004   0.857 ± 0.013

(b) Missing: WAcc, Loc and Aud

          BA              Sensitivity     Specificity     Accuracy
AAE       0.723 ± 0.007   0.729 ± 0.017   0.718 ± 0.013   0.721 ± 0.014
PCA       0.549 ± 0.02    0.255 ± 0.052   0.842 ± 0.013   0.847 ± 0.019
Mean      0.682 ± 0.017   0.567 ± 0.04    0.797 ± 0.014   0.79 ± 0.014
Median    0.678 ± 0.014   0.543 ± 0.028   0.814 ± 0.005   0.806 ± 0.004
Fill −1   0.547 ± 0.016   0.209 ± 0.087   0.885 ± 0.055   0.836 ± 0.047

(c) Missing: WAcc and Loc

          BA              Sensitivity     Specificity     Accuracy
AAE       0.722 ± 0.010   0.704 ± 0.029   0.74 ± 0.018    0.742 ± 0.020
PCA       0.568 ± 0.012   0.300 ± 0.038   0.835 ± 0.016   0.856 ± 0.010
Mean      0.735 ± 0.011   0.678 ± 0.028   0.793 ± 0.009   0.789 ± 0.008
Median    0.727 ± 0.012   0.653 ± 0.035   0.801 ± 0.020   0.796 ± 0.020
Fill −1   0.564 ± 0.026   0.270 ± 0.064   0.859 ± 0.012   0.840 ± 0.008
Table 3. Performance of a 1-layer neural network for context recognition when: (a) both the training and the test sets are real (Real, first row); (b) the model is trained with synthetic data and the test set is real (TSTR, second row); and (c) the training set is real and the test set is synthetic (TRTS, bottom row).

        BA              Sensitivity     Specificity     Accuracy
Real    0.753 ± 0.011   0.762 ± 0.014   0.745 ± 0.016   0.749 ± 0.015
TSTR    0.715 ± 0.011   0.731 ± 0.035   0.700 ± 0.036   0.705 ± 0.034
TRTS    0.700 ± 0.020   0.656 ± 0.035   0.744 ± 0.033   0.744 ± 0.030

TSTR: training on synthetic and testing on real; TRTS: training on real and testing on synthetic.
To get a better appreciation of these results, Figure 12 provides the BA of each class label for models trained on real and synthetic instances, evaluated on a real test set. We notice that, for some class labels, the BA score is equal to or larger than that of the model learned with real data, such as for the classes Phone in bag, Singing, On beach, and At a restaurant. This indicates that the AAE generates samples realistic enough to train a classifier that achieves high performance on real test data. Furthermore, we also validate the quality of the generated samples by visual inspection, which is helpful because it shows whether the generated samples have characteristics and dynamics similar to those we wish to model. Figure 13 illustrates both real and generated examples; the essential thing to notice is that the real and synthetic values exhibit similar shifts, peaks, and local correlations, which are captured well by the AAE. However, binary (discrete) features belonging to phone states, such as whether the phone is connected to Wi-Fi, are hard to reconstruct perfectly, but they can easily be binarized by thresholding at a particular value.
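Such thresholding can be sketched in one small helper; the cut-off of 0.5 is an assumed value for the "particular value" mentioned above:

```python
import numpy as np

def binarize(x, threshold=0.5):
    """Map continuous reconstructions of binary phone-state features
    back to {0, 1} by thresholding."""
    return (np.asarray(x) >= threshold).astype(int)

print(binarize([0.12, 0.48, 0.93]))  # [0 0 1]
```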
Figure 12. Balanced accuracy for the 51 contextual labels for two classifiers trained with real and synthetic samples; evaluation is done on real test data with 5-fold cross-validation.
Figure 13. Examples of real (blue) and generated (red) samples of a randomly selected feature with the AAE: (a) accelerometer; (b) gyroscope; (c) watch accelerometer; (d) location; (e) audio (MFCC); (f) phone state.
6. Discussion and Conclusions
We proposed a method utilizing an AAE for synthesizing and restoring missing sensory data to facilitate user context detection. Signal loss commonly happens during real-world data collection and in realistic situations after model deployment in-the-wild. For example, a user may prefer not to wear a smartwatch; hence, none of the smartwatch signals (or features) used during development will be available for inference. Our empirical results demonstrate that the AAE network trained with structured noise can provide a more realistic reconstruction of features from the lost modalities than other methods, such as PCA. Similarly, we show that the AAE model trained with supervision to the decoder network produces realistic synthetic data, which can further be used in real applications. We have shown the data generation capability of our network through visual fidelity analysis and by comparing classification performance with real data. In the latter, we train on artificial data and evaluate on real instances, and train on real data and evaluate on synthetic samples. This methodology allows researchers to develop robust models that learn noise-invariant representations and inherently handle several missing modalities. It also enables leveraging artificial data to increase the training set size, and data sharing, which is sometimes not possible due to the sensitive nature of personal data.
The presented network has several other advantages. It allows utilizing an entire dataset for learning, i.e., any combination of labeled, unlabeled, noisy, and clean instances. We see consistent performance of our classifier, trained by extending the encoder network, even when several modalities (i.e., more than half of the features) are dropped to emulate missing sensors. Broadly, unlike prior methods for handling missing input data, where models fail to detect true positives correctly, the AAE maintains its ability to recognize user context with high performance. This highlights an important characteristic of the described technique: even if some signals are not available, e.g., when users opt out of the location service or do not wear a smartwatch, their partial data can still be used to get accurate predictions. Besides, a model developed with the proposed technique could be very attractive for users with privacy concerns regarding location data. Likewise, a classifier trained on the embedding provides performance similar to the original feature set, which means the raw features do not have to be stored and the embedding can be shared with other researchers while preserving users' privacy [ ]. The privacy guarantees can be further enhanced by taking advantage of recent advances made in combining deep learning with differential privacy [43].
We notice that the labels reported by the users are sparse, resulting in an imbalanced dataset. To deal with this, an instance weighting strategy, the same as in [24], is applied. Although we experimented with resolving the imbalance through synthetic data alone, the results were not satisfactory (unless combined with instance weighting); we believe this requires further exploration. Likewise, the AAE can be extended to perform semi-supervised learning, taking advantage of unlabeled examples. This can further help in the collection of large datasets with a low mental load for the user, as it reduces the need to label every example. Another area of potential improvement could be an ensemble of multi-layer neural networks, efficiently compressed to perform real-time detection on an edge device with minimal resource utilization.
Author Contributions:
A.S. conceptualized the study, performed the experiments, analyzed the results, made the charts, and drafted and revised the manuscript. T.O. and J.L. supervised the work, analyzed the results, provided feedback, and revised the manuscript.
Funding: SCOTT has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No. 737422. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and Austria, Spain, Finland, Ireland, Sweden, Germany, Poland, Portugal, the Netherlands, Belgium, and Norway.
Acknowledgments: Various icons used in the figures were created by Anuar Zhumaev, Tim Madle, Shmidt Sergey, Alina Oleynik, Artdabana@Design, and lipi from the Noun Project.
Conflicts of Interest: The authors declare no conflict of interest.
References

1. Rashidi, P.; Mihailidis, A. A survey on ambient-assisted living tools for older adults. IEEE J. Biomed. Health Inform. 2013, 17, 579–590.
2. Nahum-Shani, I.; Smith, S.N.; Tewari, A.; Witkiewitz, K.; Collins, L.M.; Spring, B.; Murphy, S. Just in Time Adaptive Interventions (JITAIs): An Organizing Framework for Ongoing Health Behavior Support; Methodology Center Technical Report; The Methodology Center: University Park, PA, USA, 2014; pp. 14–126.
3. Avci, A.; Bosch, S.; Marin-Perianu, M.; Marin-Perianu, R.; Havinga, P. Activity recognition using inertial sensing for healthcare, wellbeing and sports applications: A survey. In Proceedings of the 23rd International Conference on Architecture of Computing Systems, Hannover, Germany, 22–23 February 2010; pp. 1–10.
4. Rabbi, M.; Aung, M.H.; Zhang, M.; Choudhury, T. MyBehavior: Automatic personalized health feedback from user behaviors and preferences using smartphones. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 7–11 September 2015; pp. 707–718.
5. Althoff, T.; Hicks, J.L.; King, A.C.; Delp, S.L.; Leskovec, J. Large-scale physical activity data reveal worldwide activity inequality. Nature 2017, 547, 336–339.
6. Joshua, L.; Varghese, K. Accelerometer-based activity recognition in construction. J. Comput. Civ. Eng. 2011, 25, 370–379.
7. Dey, A.K.; Wac, K.; Ferreira, D.; Tassini, K.; Hong, J.H.; Ramos, J. Getting closer: An empirical investigation of the proximity of user to their smart phones. In Proceedings of the 13th International Conference on Ubiquitous Computing, Beijing, China, 17–21 September 2011; pp. 163–172.
8. Vaizman, Y.; Ellis, K.; Lanckriet, G. Recognizing Detailed Human Context in the Wild from Smartphones and Smartwatches. IEEE Pervas. Comput. 2017, 16, 62–74.
9. Kang, H. The prevention and handling of the missing data. Korean J. Anesthesiol. 2013, 64, 402–406.
10. Gelman, A.; Hill, J. Missing-data imputation. In Data Analysis Using Regression and Multilevel/Hierarchical Models; Analytical Methods for Social Research; Cambridge University Press: Cambridge, UK, 2006; pp. 529–544.
11. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
12. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
13. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
14. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; NIPS: La Jolla, CA, USA, 2014; pp. 2672–2680.
15. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv 2015, arXiv:1511.05644.
16. Guiry, J.J.; Van de Ven, P.; Nelson, J. Multi-sensor fusion for enhanced contextual awareness of everyday activities with ubiquitous devices. Sensors 2014, 14, 5687–5701.
17. Wang, A.; Chen, G.; Shang, C.; Zhang, M.; Liu, L. Human activity recognition in a smart home environment with stacked denoising autoencoders. In Proceedings of the International Conference on Web-Age Information Management, Nanchang, China, 3–5 June 2016; pp. 29–40.
18. Li, Y.; Shi, D.; Ding, B.; Liu, D. Unsupervised feature learning for human activity recognition using smartphone sensors. In Mining Intelligence and Knowledge Exploration; Springer: Berlin, Germany, 2014; pp. 99–107.
19. Plötz, T.; Hammerla, N.Y.; Olivier, P. Feature learning for activity recognition in ubiquitous computing. In Proceedings of the International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; Volume 22, p. 1729.
20. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. arXiv 2017, arXiv:1707.03502.
21. Ding, M.; Fan, G. Multilayer Joint Gait-Pose Manifolds for Human Gait Motion Modeling. IEEE Trans. Cybern. 2015, 45, 2413–2424.
22. Zhang, X.; Ding, M.; Fan, G. Video-based human walking estimation using joint gait and pose manifolds. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1540–1554.
23. Chen, C.; Jafari, R.; Kehtarnavaz, N. A survey of depth and inertial sensor fusion for human action recognition. Multimedia Tools Appl. 2017, 76, 4405–4425.
24. Vaizman, Y.; Weibel, N.; Lanckriet, G. Context Recognition In-the-Wild: Unified Model for Multi-Modal Sensors and Multi-Label Classification. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2017, 1, 168.
25. Thompson, B.B.; Marks, R.; El-Sharkawi, M.A. On the contractive nature of autoencoders: Application to missing sensor restoration. In Proceedings of the International Joint Conference on Neural Networks, Portland, OR, USA, 20–24 July 2003; Volume 4, pp. 3011–3016.
26. Nelwamondo, F.V.; Mohamed, S.; Marwala, T. Missing data: A comparison of neural network and expectation maximization techniques. arXiv 2007, arXiv:0704.3474.
27. Duan, Y.; Lv, Y.; Kang, W.; Zhao, Y. A deep learning based approach for traffic data imputation. In Proceedings of the 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China, 8–11 October 2014; pp. 912–917.
28. Beaulieu-Jones, B.K.; Moore, J.H. Missing data imputation in the electronic health record using deeply learned autoencoders. In Proceedings of the Pacific Symposium on Biocomputing 2017, Big Island of Hawaii, HI, USA, 3–7 January 2017; World Scientific: Singapore, 2017; pp. 207–218.
29. Jaques, N.; Taylor, S.; Sano, A.; Picard, R. Multimodal Autoencoder: A Deep Learning Approach to Filling in Missing Sensor Data and Enabling Better Mood Prediction. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017.
30. Li, J.; Struzik, Z.; Zhang, L.; Cichocki, A. Feature learning from incomplete EEG with denoising autoencoder. Neurocomputing 2015, 165, 23–31.
31. Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 2016, 6, 26094.
32. Martinez, H.P.; Bengio, Y.; Yannakakis, G.N. Learning deep physiological models of affect. IEEE Comput. Intell. Mag. 2013, 8, 20–33.
33. Deng, J.; Xu, X.; Zhang, Z.; Frühholz, S.; Schuller, B. Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Signal Process. Lett. 2017, 24, 500–504.
34. Kuchaiev, O.; Ginsburg, B. Training Deep AutoEncoders for Collaborative Filtering. arXiv 2017, arXiv:1708.01715.
35. Yu, L.; Zhang, W.; Wang, J.; Yu, Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2852–2858.
36. Choi, E.; Biswal, S.; Malin, B.; Duke, J.; Stewart, W.F.; Sun, J. Generating multi-label discrete electronic health records using generative adversarial networks. arXiv 2017, arXiv:1703.06490.
37. Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv 2017, arXiv:1706.02633.
38. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
39. Nam, J.; Kim, J.; Mencía, E.L.; Gurevych, I.; Fürnkranz, J. Large-scale multi-label text classification–revisiting neural networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014; Springer: Berlin, Germany, 2014; pp. 437–452.
40. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016, 16, 265–283.
41. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
43. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
... Instead, we try to recover the features of missing sensors by exploiting their spatial correlations with available sensors, to enhance behavior of the classification model. Some recent work [24,29,47] used a denoising autoencoder [54] to recover missing features. During training, they manually add random noise and turn off parts of the features of input data to make the autoencoder learn to remove noise and recover missing features. ...
... In our design, we reconstruct the feature vector of missing sensors as a basic unit. Meanwhile, unlike previous autoencoder based feature reconstruction methods [24,29,47] that exploit sensor correlations in a fully-connected manner, we only explicitly exploit sensor correlations in the latent space based on their spatial topology. On one hand, we can avoid information flooding at missing sensors where too much noise might overwhelm useful information; on the other hand, we can improve computational efficiency to a large extent, because the number of edges in most real-world networks tends to grow linearly (not quadratically) with the number of nodes [56]. ...
... Regarding the dynamic sensor configuration in sensing models, Rey et al. [44] proposed a similarity-based approach to integrate new sensors into an existing recognition system in a semisupervised manner. Meanwhile, several neural network based general missing sensor handling algorithms (i.e., algorithms where one model handles all missing sensor situations) have been proposed [24,29,37,47,51]. Most papers [24,29,47] borrow the idea from denoising autoencoders [54]. ...
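The masking-and-reconstruction idea these snippets describe can be sketched with a toy linear denoising autoencoder in NumPy. The layer sizes, corruption rate, and learning rate below are illustrative assumptions, not settings from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_rate=0.3):
    """Randomly zero out features, simulating missing sensor readings."""
    mask = rng.random(x.shape) > drop_rate
    return x * mask

# Toy linear autoencoder: 8 input features, 4-unit bottleneck.
n_features, n_hidden = 8, 4
W_enc = rng.normal(0.0, 0.3, (n_features, n_hidden))
W_dec = rng.normal(0.0, 0.3, (n_hidden, n_features))

def train_step(x, lr=0.1):
    """One gradient step: reconstruct the *clean* x from a corrupted copy."""
    global W_enc, W_dec
    x_tilde = corrupt(x)                  # input with 'missing' features
    h = x_tilde @ W_enc                   # encode
    x_hat = h @ W_dec                     # decode
    err = x_hat - x                       # error against the clean target
    g_dec = h.T @ err / len(x)            # MSE gradients (constants folded into lr)
    g_enc = x_tilde.T @ (err @ W_dec.T) / len(x)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    return float((err ** 2).mean())

x = rng.normal(size=(64, n_features))
losses = [train_step(x) for _ in range(500)]  # reconstruction loss should fall
```

Because the loss is always measured against the uncorrupted input, the network is pushed to fill in the zeroed-out features from the surviving ones, which is the behavior the cited approaches rely on at inference time.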
Conference Paper
Full-text available
Reliable data collection, transmission, and delivery on Internet of Things (IoT) systems is crucial in order to provide high-quality intelligent services. However, sensor data delivery can be interrupted for various reasons, such as sensor malfunction, network failures, and external attacks. Thus, only data from a partial set of sensors may be available. We call it the missing sensor problem. This problem can lead to severe performance degradation at inference time by neural-network-based recognition models trained on the complete sensor set. This paper enhances the robustness of neural network models to the missing sensor problem by introducing a novel feature reconstruction module, named the graph recovery module, that handles missing sensors directly inside the network. Specifically, we consider topology-aware IoT applications, where sensors are placed on a physically interconnected network. We design a novel neural message passing mechanism that logically mimics physical network topology, based on recent advances in graph neural networks (GNNs). We rely on a spatial locality assumption, where only correlations between physically connected sensors are explicitly explored. When encountering missing sensors, information is passed from available sensors to missing sensors to be used to reconstruct their features. Moreover, at each message passing step, we utilize a gating mechanism inspired by Gated Recurrent Units (GRUs) to automatically control information flow between available sensors and missing sensors. We empirically evaluate the reconstruction performance of the graph recovery module with two representative IoT applications; human activity recognition (HAR) and electroencephalogram (EEG)-based motor-imagery classification, on three public datasets. Two different backbone networks are utilized for the tasks. Our design is shown to effectively maintain model performance, suffering only 7% to 18% accuracy loss when as much as 90% of sensors are removed, compared to a drop of 15% to 47% in the accuracy of competing state-of-the-art algorithms under the same conditions. The accuracy gap is largest when more sensors are missing.
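The spatial-locality recovery the abstract describes can be illustrated with a single, heavily simplified message-passing step: each missing sensor's feature vector is imputed from its available physical neighbours. Mean aggregation here stands in for the paper's learned, GRU-gated messages, and the adjacency matrix and features are made-up toy values:

```python
import numpy as np

def recover_missing(features, adjacency, missing):
    """Impute each missing sensor's feature vector as the mean of its
    available physical neighbours (one message-passing step)."""
    out = features.copy()
    for i in missing:
        # Only pass messages from sensors that are actually available.
        neighbours = [j for j in np.flatnonzero(adjacency[i]) if j not in missing]
        if neighbours:
            out[i] = features[neighbours].mean(axis=0)
    return out

# Four sensors on a physical line 0-1-2-3; sensor 2 has gone missing.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
X = np.array([[1.0], [2.0], [0.0], [4.0]])
X_rec = recover_missing(X, A, missing={2})   # sensor 2 gets (2.0 + 4.0) / 2
```

Restricting the aggregation to physical neighbours is what keeps the edge count, and hence the cost, roughly linear in the number of sensors, as the snippet above argues.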
... For the standard AEs the Sigmoid function is always used (Hau et al., 2016; Sakurai et al., 2017; Jia et al., 2017), but in the work of Sakurai et al. (2017) a custom linear function is used on the output layer. For multi-layer networks the Sigmoid is also used in some works (Duan et al., 2014, 2016; Xie et al., 2019; Sánchez-Morales et al., 2020), but ReLU is the one more often applied (Gondara and Wang, 2017; Ryu et al., 2020; McCoy et al., 2018; Boquet et al., 2019, 2020; Xie et al., 2019; Saeed et al., 2018; Fortuin et al., 2020), sometimes through the Leaky ReLU variant (Ryu et al., 2020; Saeed et al., 2018). The Hyperbolic Tangent is used less often (Gondara and Wang, 2018; Chen et al., 2015), but these works concluded it presents better results than ReLU when the datasets are small. ...
... The training phase of an AE depends on the same aspects of any other ANN: an optimization algorithm, a loss function and the maximum number of epochs. From the 17 works that describe the used optimization algorithm, 5 use the well-known Stochastic Gradient Descent (Sánchez-Morales et al., 2017; Hau et al., 2016; Sánchez-Morales et al., 2019; Xie et al., 2019; Lai et al., 2019) and 6 use one of its variants called Adam (El Esawey et al., 2015; Ryu et al., 2020; Boquet et al., 2019, 2020; Saeed et al., 2018; Fortuin et al., 2020). Other algorithms are used less often, namely the Nesterov's Accelerated Gradient (Gondara and Wang, 2018), the Scaled Conjugate Gradient Algorithm (Sakurai et al., 2017) and the RMSProp (McCoy et al., 2018). ...
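Since Adam is the optimizer most of these snippets settle on, here is the standard Adam update (Kingma and Ba, 2014) applied to a toy one-dimensional quadratic; the hyperparameters are the usual defaults:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, scaled step."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(x) = x^2, gradient 2x, starting from x = 1.0.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

The per-coordinate scaling by the second moment is why Adam is largely insensitive to the raw gradient magnitude, which is part of its appeal for autoencoder imputation models with heterogeneous feature scales.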
Full-text available
Missing data is a problem often found in real-world datasets and it can degrade the performance of most machine learning models. Several deep learning techniques have been used to address this issue, and one of them is the Autoencoder and its Denoising and Variational variants. These models are able to learn a representation of the data with missing values and generate plausible new ones to replace them. This study surveys the use of Autoencoders for the imputation of tabular data and considers 26 works published between 2014 and 2020. The analysis is mainly focused on discussing patterns and recommendations for the architecture, hyperparameters and training settings of the network, while providing a detailed discussion of the results obtained by Autoencoders when compared to other state-of-the-art methods, and of the data contexts where they have been applied. The conclusions include a set of recommendations for the technical settings of the network, and show that Denoising Autoencoders outperform their competitors, particularly the often used statistical methods.
... Luo et al. utilized a GAN to infer missing time series data [273]. Saeed et al. proposed an adversarial autoencoder (AAE) framework to perform data imputation [132]. Opportunity: To address this challenge, more research into automated methods for evaluating and quantifying the quality is needed to identify better, remove, and/or correct for poor quality data. ...
... Illustration of an autoencoder network [132]. ...
Full-text available
Mobile and wearable devices have enabled numerous applications, including activity tracking, wellness monitoring, and human–computer interaction, that measure and improve our daily lives. Many of these applications are made possible by leveraging the rich collection of low-power sensors found in many mobile and wearable devices to perform human activity recognition (HAR). Recently, deep learning has greatly pushed the boundaries of HAR on mobile and wearable devices. This paper systematically categorizes and summarizes existing work that introduces deep learning methods for wearables-based HAR and provides a comprehensive analysis of the current advancements, developing trends, and major challenges. We also present cutting-edge frontiers and future directions for deep learning-based HAR.
... deal with potential missing data during inference, improving robustness and generalisation of the model. Diethe et al. investigate heterogeneous sensor modality fusion in an AAL setting and propose a Bayesian model to explore the weight of each modality to enhance the interpretability of their fusion model. A different approach is followed by Saeed et al. [2018], who take advantage of adversarial autoencoders for handling missing modalities by replacing the missing sensors with synthetic data generated from the available modalities. Lee et al. use multimodal latent variables and a variational product-of-experts approach to automatically detect corrupted sensors and also replace them ...
Full-text available
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI). We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time. We evaluate the proposed model on inertial data from a wearable accelerometer device, using RGB videos and skeletons as privileged modalities, and show an improvement of accuracy of an average 6.6% on the UTD-MHAD dataset and an average 5.5% on the Berkeley MHAD dataset, reaching a new state-of-the-art for inertial-only classification accuracy on these datasets. We validate our framework through several ablation studies.
... Sensor data obtained from real-life settings is typically of poor quality (noisy) and frequently has missing data [7]. These issues arise due to factors such as bad or faulty placement of sensors, or sensor malfunctioning [8]. Similarly, sensor data may often be highly imbalanced due to significant individual variations, with limited labels for certain activities [9]. ...
Full-text available
Human activity recognition (HAR) using wearable sensors is an increasingly active research topic in machine learning, aided in part by the ready availability of detailed motion capture data from smartphones, fitness trackers, and smartwatches. The goal of HAR is to use such devices to assist users in their daily lives in application areas such as healthcare, physical therapy, and fitness. One of the main challenges for HAR, particularly when using supervised learning methods, is obtaining balanced data for algorithm optimisation and testing. As people perform some activities more than others (e.g., walk more than run), HAR datasets are typically imbalanced. The lack of dataset representation from minority classes hinders the ability of HAR classifiers to sufficiently capture new instances of those activities. We introduce three novel hybrid sampling strategies to generate more diverse synthetic samples to overcome the class imbalance problem. The first strategy, which we call the distance-based method (DBM), combines Synthetic Minority Oversampling Techniques (SMOTE) with Random_SMOTE, both of which are built around the k-nearest neighbors (KNN). The second technique, referred to as the noise detection-based method (NDBM), combines SMOTE Tomek links (SMOTE_Tomeklinks) and the modified synthetic minority oversampling technique (MSMOTE). The third approach, which we call the cluster-based method (CBM), combines Cluster-Based Synthetic Oversampling (CBSO) and Proximity Weighted Synthetic Oversampling Technique (ProWSyn). We compare the performance of the proposed hybrid methods to the individual constituent methods and baseline using accelerometer data from three commonly used benchmark datasets. We show that DBM, NDBM, and CBM reduce the impact of class imbalance and enhance F1 scores by a range of 9–20 percentage points compared to their constituent sampling methods. CBM performs significantly better than the others under a Friedman test; however, DBM has lower computational requirements.
... For handling missing input data, a compelling strategy is to train an autoencoder with an artificially corrupted input x̃, which acts as an implicit regularization. Usually, the considered corruption includes isotropic Gaussian noise. [Figure 4: Illustration of an autoencoder network [237].] ... and 10 modalities attached to the body or the environment. ...
... In different areas of research, data interpolation techniques for missing data were proposed in [47,48]. Saeed et al. [49] proposed an adversarial autoencoder approach for handling missing sensory features and synthesizing realistic samples. They designed a fully-connected classification network for missing modalities. ...
Full-text available
Sensor-based human activity recognition has various applications in the arena of healthcare, elderly smart-home, sports, etc. There are numerous works in this field—to recognize various human activities from sensor data. However, those works are based on data patterns that are clean data and have almost no missing data, which is a genuine concern for real-life healthcare centers. Therefore, to address this problem, we explored the sensor-based activity recognition when some partial data were lost in a random pattern. In this paper, we propose a novel method to improve activity recognition while having missing data without any data recovery. For the missing data pattern, we considered data to be missing in a random pattern, which is a realistic missing pattern for sensor data collection. Initially, we created different percentages of random missing data only in the test data, while the training was performed on good quality data. In our proposed approach, we explicitly induce different percentages of missing data randomly in the raw sensor data to train the model with missing data. Learning with missing data reinforces the model to regulate missing data during the classification of various activities that have missing data in the test module. This approach demonstrates the plausibility of the machine learning model, as it can learn and predict from an identical domain. We exploited several time-series statistical features to extricate better features in order to comprehend various human activities. We explored both support vector machine and random forest as machine learning models for activity classification. We developed a synthetic dataset to empirically evaluate the performance and show that the method can effectively improve the recognition accuracy from 80.8% to 97.5%. Afterward, we tested our approach with activities from two challenging benchmark datasets: the human activity sensing consortium (HASC) dataset and single chest-mounted accelerometer dataset. 
We examined the method for different missing percentages, varied window sizes, and diverse window sliding widths. Our explorations demonstrated improved recognition performances even in the presence of missing data. The achieved results provide persuasive findings on sensor-based activity recognition in the presence of missing data.
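The core trick in the abstract above, injecting a controlled percentage of random missingness into raw sensor windows before training, can be sketched as follows; the window shape and the NaN encoding of missing samples are illustrative assumptions:

```python
import numpy as np

def inject_missing(window, percent, rng):
    """Set a given percentage of samples in a sensor window to NaN,
    chosen uniformly at random (the 'random missing pattern')."""
    out = window.astype(float).copy()
    n = out.size
    k = int(round(n * percent / 100))
    idx = rng.choice(n, size=k, replace=False)  # distinct positions to drop
    out.flat[idx] = np.nan
    return out

rng = np.random.default_rng(42)
w = np.ones((50, 3))                    # 50 samples x 3 accelerometer axes
w_missing = inject_missing(w, 20, rng)  # exactly 20% of entries become NaN
```

Training on windows produced this way exposes the classifier to the same kind of gaps it will see at test time, which is what the authors credit for the improved recognition under missing data.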
Fitness tracking devices have risen in popularity in recent years, but limitations in terms of their accuracy and failure to track many common exercises presents a need for improved fitness tracking solutions. This work proposes a multimodal deep learning approach to leverage multiple data sources for robust and accurate activity segmentation, exercise recognition and repetition counting. For this, we introduce the MM-Fit dataset; a substantial collection of inertial sensor data from smartphones, smartwatches and earbuds worn by participants while performing full-body workouts, and time-synchronised multi-viewpoint RGB-D video, with 2D and 3D pose estimates. We establish a strong baseline for activity segmentation and exercise recognition on the MM-Fit dataset, and demonstrate the effectiveness of our CNN-based architecture at extracting modality-specific spatial temporal features from inertial sensor and skeleton sequence data. We compare the performance of unimodal and multimodal models for activity recognition across a number of sensing devices and modalities. Furthermore, we demonstrate the effectiveness of multimodal deep learning at learning cross-modal representations for activity recognition, which achieves 96% accuracy across all sensing modalities on unseen subjects in the MM-Fit dataset; 94% using data from the smartwatch only; 85% from the smartphone only; and 82% on data from the earbud device. We strengthen single-device performance by using the zeroing-out training strategy, which phases out the other sensing modalities. Finally, we implement and evaluate a strong repetition counting baseline on our MM-Fit dataset. Collectively, these tasks contribute to recognising, segmenting and timing exercise and non-exercise activities for automatic exercise logging.
Full-text available
This paper proposes a novel model for the rating prediction task in recommender systems which significantly outperforms previous state-of-the-art models on a time-split Netflix data set. Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training. We empirically demonstrate that: a) deep autoencoder models generalize much better than the shallow ones, b) non-linear activation functions with negative parts are crucial for training deep models, and c) heavy use of regularization techniques such as dropout is necessary to prevent over-fitting. We also propose a new training algorithm based on iterative output re-feeding to overcome natural sparseness of collaborative filtering. The new algorithm significantly speeds up training and improves model performance. Our code is available at
Full-text available
Sensor-based activity recognition seeks the profound high-level knowledge about human activity from multitudes of low-level sensor readings. Conventional pattern recognition approaches have made tremendous progress in the past years. However, most of those approaches heavily rely on heuristic hand-crafted feature extraction methods, which dramatically hinder their generalization performance. Additionally, those methods often produce unsatisfactory results for unsupervised and incremental learning tasks. Meanwhile, the recent advancement of deep learning makes it possible to perform automatic high-level feature extraction, thus achieving promising performance in many areas. Since then, deep learning based methods have been widely adopted for the sensor-based activity recognition tasks. In this paper, we survey and highlight the recent advancement of deep learning approaches for sensor-based activity recognition. Specifically, we summarize the existing literature from three aspects: sensor modality, deep model and application. We also present a detailed discussion and propose grand challenges for future directions.
Full-text available
Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from 'serialised' MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.
Full-text available
Access to electronic health record (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records. Based on input real patient records, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) via a combination of an autoencoder and generative adversarial networks. We also propose minibatch averaging to efficiently avoid mode collapse, and increase the learning efficiency with batch normalization and shortcut connections. To demonstrate feasibility, we showed that medGAN generates synthetic patient records that achieve comparable performance to real data on many experiments including distribution statistics, predictive modeling tasks and a medical expert review. We also empirically observe a limited privacy risk in both identity and attribute disclosure using medGAN.
Automatic recognition of behavioral context (location, activities, body-posture etc.) can serve health monitoring, aging care, and many other domains. Recognizing context in-the-wild is challenging because of great variability in behavioral patterns, and it requires a complex mapping from sensor features to predicted labels. Data collected in-the-wild may be unbalanced and incomplete, with cases of missing labels or missing sensors. We propose using the multiple layer perceptron (MLP) as a multi-task model for context recognition. Based on features from multi-modal sensors, the model simultaneously predicts many diverse context labels. We analyze the advantages of the model's hidden layers, which are shared among all sensors and all labels, and provide insight to the behavioral patterns that these hidden layers may capture. We demonstrate how recognition of new labels can be improved when utilizing a model that was trained for an initial set of labels, and show how to train the model to withstand missing sensors. We evaluate context recognition on the previously published ExtraSensory Dataset, which was collected in-the-wild. Compared to previously suggested models, the MLP improves recognition, even with fewer parameters than a linear model. The ability to train a good model using data that has incomplete, unbalanced labeling and missing sensors encourages further research with uncontrolled, in-the-wild behavior.
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
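The two-player minimax game summarized in this abstract is usually written as the value function

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

where the discriminator D maximizes its ability to tell real samples from generated ones, while the generator G minimizes the same objective; at the unique equilibrium described above, G matches the data distribution and D outputs 1/2 everywhere.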
Conference Paper
As a new way of training generative models, Generative Adversarial Net (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is to generate sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is nontrivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.
To be able to curb the global pandemic of physical inactivity and the associated 5.3 million deaths per year, we need to understand the basic principles that govern physical activity. However, there is a lack of large-scale measurements of physical activity patterns across free-living populations worldwide. Here we leverage the wide usage of smartphones with built-in accelerometry to measure physical activity at the global scale. We study a dataset consisting of 68 million days of physical activity for 717,527 people, giving us a window into activity in 111 countries across the globe. We find inequality in how activity is distributed within countries and that this inequality is a better predictor of obesity prevalence in the population than average activity volume. Reduced activity in females contributes to a large portion of the observed activity inequality. Aspects of the built environment, such as the walkability of a city, are associated with a smaller gender gap in activity and lower activity inequality. In more walkable cities, activity is greater throughout the day and throughout the week, across age, gender, and body mass index (BMI) groups, with the greatest increases in activity found for females. Our findings have implications for global public health policy and urban planning and highlight the role of activity inequality and the built environment in improving physical activity and health.
One of the serious obstacles to the applications of speech emotion recognition systems in real-life settings is the lack of generalisation of the emotion classifiers. Many recognition systems often present a dramatic drop in performance when tested on speech data obtained from different speakers, acoustic environments, linguistic content, and domain conditions. In this letter, we propose a novel unsupervised domain adaptation model, called Universum Autoencoders, to improve the performance of the systems evaluated in mismatched training and test conditions. To address the mismatch, our proposed model not only learns discriminative information from labelled data, but also learns to incorporate the prior knowledge from unlabelled data into the learning. Experimental results on the labelled Geneva Whispered Emotion Corpus (GeWEC) database plus other three unlabelled databases demonstrate the effectiveness of the proposed method when compared to other domain adaptation methods.