Learning Human Activity from Visual
Data Using Deep Learning
TAHA ALHERSH¹, HEINER STUCKENSCHMIDT¹, ATIQ UR REHMAN², (Member, IEEE),
AND SAMIR BRAHIM BELHAOUARI², (Senior Member, IEEE)
¹Data and Web Science Group, University of Mannheim, Mannheim, Germany
²ICT Division, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
Corresponding author: Samir Brahim Belhaouari (e-mail: sbelhaouari@hbku.edu.qa).
ABSTRACT Advances in wearable technologies have the ability to revolutionize and improve people’s
lives. The gains go beyond the personal sphere, encompassing business and, by extension, the global
economy. The technologies are incorporated in electronic devices that collect data from consumers’ bodies
and their immediate environment. Human activity recognition, which involves the use of various body
sensors and modalities either separately or simultaneously, is one of the most important areas of wearable
technology development. In real-life scenarios, the number of sensors deployed is dictated by practical
and financial considerations. In the research for this article, we reviewed our earlier efforts and have
accordingly reduced the number of required sensors, limiting ourselves to first-person vision data for
activity recognition. Nonetheless, our results beat the state of the art by more than 4% in F1 score.
INDEX TERMS Human activity recognition; Deep learning; First-person vision
I. INTRODUCTION
The demand for human activity recognition (HAR) has
increased particularly with the advent of ubiquitous mo-
bile and sensor-rich devices. HAR has a wide range of appli-
cations, including human-computer interaction, health sup-
port, industrial settings, video surveillance, mobile/ambient-
assisted living, smart homes, and rehabilitation. Low-cost
wearable devices allow users to customize their size, weight, and power usage. They are becoming more accessible, which is driving demand for, and thus manufacturing of, more mobile wearable gadgets, as well as embedded sensing for smart environments.
In the field of HAR research, a variety of methodologies have been developed; the most common among them are sensor-based methods [1] and vision-based methods [2]. Based on the design or technology used, these methodologies can be divided into two categories. The first category is based on machine-learning approaches, which include k-nearest neighbor (K-NN), support vector machine (SVM), hidden Markov models (HMM), decision trees (DT), and others. The second category of methods is based on neural networks, among them methods built on artificial neural networks (ANN), recurrent neural networks (RNN), and convolutional neural networks (CNN) [3].
Visual-sensing technologies, such as video cameras, are used in vision-based human-activity detection techniques to monitor both an actor's behavior and changes in the
surroundings. The generated sensor data takes the form of
digitized visual data or video sequences. To analyze visual
data for pattern recognition, approaches in this domain use
computer-vision techniques such as structural modeling, fea-
ture extraction, movement segmentation, movement tracking,
and activity extraction [4].
The task of human behavior analysis can be semantically
categorized into the following classes: (i) motion, (ii) activity,
(iii) behavior, and (iv) action [5], as shown in Figure 1. From
one point of view, motion is the lowest semantic level, whereas behavior is the highest. To put it another way, motion takes the shortest amount of time to perform, while developing a behavior requires a far longer duration of motion capture. An action is created by combining motion information with diverse interactions, and more complicated activities culminate in the moulding of a behavior.
It is also important and beneficial to differentiate human behaviour at various levels of mobility. The terms "activity" and "action" are commonly employed in activity-recognition settings to describe physical behaviors. In certain contexts, these terms are interchangeable; in others, they are utilized to represent activities of varying
duration and complexity. In the latter scenario, the term "action" usually refers to a single person's straightforward behavior that lasts for a brief length of time; closing a cabinet, opening a container, and a variety of other routine tasks are examples of actions. The term "activity", in contrast, refers to more complex behaviors that are formed by a series of actions that are interlaced or overlapped.
In order to improve human activity recognition, previous
research recommended merging several sensor modalities
[6]–[10]. The current article is based on new research that
builds on our previous work [9], [10]. Since the original
publication, we have come to the conclusion that the practical
constraints of real-life scenarios dictate the use of fewer sensors, a less complicated feature extraction method, and faster execution time. Thus, in the current investigation, we
have reduced the number of sensors from three to one and
have only utilized the first-person visual data to recognize
the activity. Unlike a third-person camera, e.g., the cameras used in surveillance systems, a first-person camera provides visual data from the perspective of the subject wearing it. It continuously captures the interactions between the subject and the surrounding environment, such as people and objects. These recordings directly reflect the subject's personal and relational perspective, and such interactions make first-person visual data well suited for human activity recognition.
FIGURE 1. Components of human conduct, from mobility to moulding a
behavior.
The remainder of this article is organized as follows:
Section II offers an overview of related work and this work’s
contributions; Section III provides details on the dataset;
Section IV presents the methods; Section V scrutinizes the
experiments and their results, as well as discussions on the
findings. Finally, Section VI concludes the article.
II. RELATED WORK AND CONTRIBUTIONS
Human activity can be detected using a variety of sensor
modalities. Visual and inertial are two of the most commonly
used modalities, and they may be employed together or
separately.
Inertial Measurement Units (IMUs) can be utilized in an
individual setting for human-activity recognition [11]–[18].
They can measure force, angular rate, and orientation, among
other body signals. IMUs, on the other hand, suffer from a large degree of uncertainty when monitoring slow motion; at high speeds, the relative uncertainty is reduced. Convolutional
long short-term memory (LSTM) was employed in an ex-
periment to tackle consecutive human-activity recognition
challenges [11]. It involved the development of a multilayer
neural network structural model using an inception neural
network and gated recurrent units (GRU). On the Opportunity
dataset, the highest F-measure score was 94.6%. Bevilacqua
et al. [12] employed CNN to characterize human behaviors
in another study. They utilized Otago exercise program data, which includes 16 human activities, based on the
positioning of five sensors on the participants: one centered
on each foot, one on each (left and right) shank’s distal third,
and one fixed on the lumbar area.
Based on the similarity among different wearable sensing
modules, Jalloul et al. [13] built a structural connectivity
network to investigate the relationships between sensing
modules while performing tasks. The sensors were placed on various regions of the body and acted as a monitoring system for activities such as walking, sitting, standing, and lying.
Ashry et al. [14] employed LSTM to identify seven pri-
mary activities captured with distinct motion primitives using
an Apple watch. To solve the challenge of sequential activity
recognition, Sun et al. [15] suggested a hybrid deep model
built on extreme learning machine (ELM) and LSTM. LSTM
recurrent layers, Convolutional layers, and ELM classifier
were used in their system, which is capable of automati-
cally learning feature representations and modeling temporal
relationships between features. With all classes inclusive,
along with a null class, their model received a 91.8% F1
score, and 90.6% excluding the null class when tested on an
Opportunity dataset with 17 distinct gestures.
Rueda et al. [16] used multi-channel time series to train a CNN for recognizing activities, and then tested it using the Pamap2, Opportunity, and Order Picking datasets. Using
multi-class SVM classifiers, Davila et al. [18] developed a
data-driven architecture built on an iterative learning frame-
work to classify human locomotor activities such as standing,
walking, sitting, and lying. The accuracy of their framework
was 74.08% on average. When compared to the conclusions
generated by the supervised technique utilizing 80% data for
training and 20% for testing, an average accuracy of 81.07%
is achieved utilizing only 6.94% data for training.
It is worth noting that visual data is only one type of information that can be utilized to recognize activities. The use of RGB-D data with deep learning to distinguish human actions is an example of how other types can be used [19], [20]. In another project, Sudhakaran et al. [21] published a lightweight hierarchical feature aggregation approach that can be integrated into any deep architecture with a CNN backbone. The features of each CNN block are gated at every layer, and the residual is passed to the next branch. They tested
their method on the EPIC-KITCHENS, Something-v1, and
HMDB51 datasets. When compared to a Temporal Segment
Network (TSN), the results reveal a 24% improvement.
Wang et al. [22] developed the Baidu-UTS model, which consists of two parts: the first is a 3D CNN branch that takes sampled video clips as input and creates a clip feature, and the second is a branch that retrieves more accurate object-related features from the context frames for training the 3D CNN. They used the EPIC-KITCHENS dataset to predict verbs, nouns, and actions for each video segment based on its vocabulary.
In another effort, Sudhakaran et al. [23] presented Long Short-Term Attention (LSTA), a recurrent unit able to tackle the drawbacks of the LSTM model, especially when the discriminating information in the input data is spatially localized. They tested their model on four datasets: EGTEA Gaze+, GTEA71, GTEA61, and EPIC-KITCHENS, using LSTA in a two-stream network topology with cross-modal fusion. The network was trained in a multi-task fashion using verb, noun, and activity supervision, and the bias of the verb and noun classifiers was controlled through the activity classifier's activations. This demonstrates that other details extracted from visual data may also be utilized for recognizing an activity.
Visual sensing can also be used to calculate optical flow, which depicts the apparent motion of objects between two consecutive frames. Forward optical flow is the displacement vector for every pixel from the first frame to the second, and backward optical flow is the displacement vector from the second frame to the first, forming a vector field in the u and v directions. The connection between optical flow and the activity recognition task is discussed in [24]. The performance of optical flow in a variety of activity recognition applications [25]–[28] is not attributable to its temporal structure; rather, optical flow [25]–[28], dense trajectories based on motion boundary histograms [29], and deep-learned spatial descriptors [30] are regarded as representations that are invariant to appearance [24].
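For illustration, the following Python sketch computes a dense flow field between two frames with OpenCV's Farnebäck method; the frame file names and parameter values are illustrative assumptions and not taken from the works cited above.

```python
# Hedged sketch: dense optical flow between two consecutive frames using
# OpenCV's Farneback algorithm. File names and parameters are assumptions.
import cv2

prev = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_0002.png"), cv2.COLOR_BGR2GRAY)

# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n,
# poly_sigma, flags. The result has shape (H, W, 2): per-pixel displacements
# in the u (horizontal) and v (vertical) directions, i.e. the vector field
# described in the text; swapping the frame order gives the backward flow.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape, float(magnitude.mean()))
```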
Combining several sensor modalities to improve human-activity detection has been suggested by thorough research [6]–[8]. In real-world circumstances, however, it is preferable to have a reasonable and balanced number of modalities.
Lu et al. [31] used LSTM to categorize activities utilizing
four IMU sensors and egocentric video from the CMU Mul-
timodal Activity (CMU-MMAC) database [32]. For activity
recognition, [33] used visual and audio sensors. [8] described
a methodology for detecting proprioceptive activities using
egocentric data from IMUs. The study used CNN-LSTM
to utilize discriminative properties of multimodal feature
groups supplied by stacked spectrograms from inertial data
via cross-domain knowledge transfer. For human activity
recognition, [34] combined visual features with object information and inertial data.
The research presented in the current article extends our
work in [9], in which we used data from two sensor types (visual and inertial). Local visual descriptors were used as features for the egocentric vision data, while statistical features were extracted from tracking the left- and right-hand positions in space.
To overcome the drawbacks we analyzed in our previous
work [9], i.e. heavy calculations for feature extraction and
the time-consuming nature of the overall effort associated
with multi-sensor data, we modified the experimental setup.
In this research, only one sensor, a first-person camera, was deployed, which accelerated the recognition
process due to less data processing associated with only
a single sensor as compared to the multi-sensor approach.
We also used a deep-learning strategy for feature extraction
and recognition, which took advantage of deep learning’s
speed and accuracy. For the Brownie recipe, the findings outperformed the state-of-the-art work of [34]. The model was also
evaluated on the Sandwich and Eggs recipes, with excellent
results.
III. DATASET
Our activity recognition techniques were trained and tested
using the CMU-MMAC primary database [32]. This database
contains human activity measurements derived from mul-
timodal sensors worn by individuals in Carnegie Mellon’s
Motion Capture Lab while doing tasks related to cooking
and food preparation. More than forty
people took part in the creation of recipes for five different
foods:
Salad.
Sandwich.
Pizza.
Scrambled eggs.
Brownies.
The following data categories were recorded using a vari-
ety of modalities:
1) Audio:
Five microphones.
2) Video:
Three high-quality video cameras (1024 x 768)
having 30 Hertz temporal resolution.
Two video cameras having a temporal resolution
of 60 Hertz and a spatial resolution of 640 x 480 pixels.
A wearable camera having a high spatial resolu-
tion (800×600/1024×768) and a temporal resolution
of 30 Hertz.
3) Inertial Measurement Units (IMUs):
Bluetooth IMUs (6DOF).
Wired IMUs (3DMGX).
4) Motion Capture:
Each of the 12 infrared MX-40 cameras in the
motion-capturing system recorded images with a resolution of four megapixels at 120 Hertz.
5) Wearable devices:
eWatch.
BodyMedia.
This dataset was acquired from 55 volunteers, each of whom took part in several sub-experiments.
A. ACTION EXTRACTION
Based on the annotations of [35], a Matlab application¹ was built to extract the associated actions for both the visual data and the IMUs from the CMU-MMAC dataset [32]. This action-extraction tool is depicted in Figure 2. To extract all the activities based on the annotation file given for each individual, the user must submit the IMU IDs, the video, and the subject. The extracted data for each modality were then synchronized using the specified start and finish times. Each extracted file, or the associated image name, has a serial number, a subject-ID prefix, and the action name to make processing easier.
FIGURE 2. Action-extractor tool. Based on an annotation file for all actions, the user must provide the following information: IMU sensor IDs, subject IDs, IMU data, and a video.
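The authors' extractor is a Matlab application; the following Python sketch only illustrates the slicing step described above. The annotation column names, file names, and the 30 Hz frame rate are assumptions made for illustration.

```python
# Hedged sketch of slicing an activity recording by annotated start/end times.
# Column names ("action", "start_sec", "end_sec"), file names, and the frame
# rate are illustrative assumptions, not the dataset's actual format.
import csv

FPS = 30  # assumed temporal resolution of the wearable camera

def extract_actions(annotation_csv):
    """Yield (action_label, start_frame, end_frame) tuples for one subject."""
    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            start = int(float(row["start_sec"]) * FPS)
            end = int(float(row["end_sec"]) * FPS)
            yield row["action"], start, end

# Name each extracted frame with a subject-ID prefix, the action name, and
# serial numbers, mirroring the convention described in the text.
for i, (action, start, end) in enumerate(extract_actions("S07_Brownie.csv")):
    for frame_idx in range(start, end + 1):
        frame_name = f"S07_{action}_{i:04d}_{frame_idx:06d}.png"
        # ... the corresponding video frame would be saved under frame_name
```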
B. CMU-MMAC ANNOTATIONS
[35] recently published a collection of annotations for the CMU-MMAC database, which includes a wider range of labels for different kinds of scenarios, and outlined the method used for annotating the dataset in their paper. Furthermore, they also provided semantic annotations usable in reasoning experiments.
The annotations primarily center on three recipes, Brownie, Eggs, and Sandwich, for which all relevant subjects were utilized. These new annotations are based on the first-person camera's egocentric perspective.
¹ https://github.com/alhersh/ActionExtractor
There are eleven activity classes for Eggs, eleven for Brownie, and eight for Sandwich. These activity classes are derived from activities labeled in the form verb-object1-object2-...-object_n, where the number of objects varies from one activity to the next, for example shake-butter_spray_can and close_drawer.
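As a small illustration of how the activity names in Figure 6 can be derived from such labels, the sketch below keeps only the verb part; the exact label grammar is an assumption based on the two examples above, and the helper name is hypothetical.

```python
# Hypothetical helper: extract the verb from labels of the form
# verb-object1-object2-...-object_n (e.g. "shake-butter_spray_can") or
# single-token labels such as "close_drawer". The grammar is assumed.
def verb_of(label: str) -> str:
    head = label.split("-", 1)[0]          # drop the object part, if any
    return head if "-" in label else head.split("_", 1)[0]

print(verb_of("shake-butter_spray_can"))   # -> "shake"
print(verb_of("close_drawer"))             # -> "close"
```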
IV. METHODS
Starting with feature extraction and ending with the classifi-
cation strategy, this section describes the methods employed
in this study.
A. CONVOLUTIONAL NEURAL NETWORKS (CNN)
CNNs were first proposed in 1989 by [36]. They are a form of neural network used to process input with a known, grid-like topology. Time-series data, for example, can be represented by a 1-D grid when collected at a fixed time interval, and image data can alternatively be thought of as a two-dimensional grid of pixels. The name
CNN refers to the mathematical operation of convolution,
which is used by this form of neural network. The following subsections cover convolution, which is a form of linear operation. A CNN is thus a neural network that, in at least one of its layers, employs convolution rather than general matrix multiplication.
1) Convolution Operation
Generally, convolution is a mathematical operator on func-
tions of a real-valued input, such as when we use a laser
sensor to track the location of an object and the sensor
produces an output x(t) indicating the object's position at time t.
Both t and x are real numbers in this situation. Assuming that the laser sensor is noisy, several readings can be averaged to obtain an improved estimate of the object's position. More recent readings carry more relevant information, so a weighted average that favors the most recent measurements is used. This is achieved with a weighting function w(a), where a represents the age of the reading. Applying this weighted-average operation at every moment yields a new function s, which provides a smoothed estimate of the object's position:
s(t) = \int x(a)\, w(t - a)\, da.    (1)
That last operation is known as convolution. The convolu-
tion operator can be represented by an asterisk [37]:
s(t) = (x \ast w)(t).    (2)
In the above example, the first argument x represents the network's input, according to convolutional-network terminology. The second argument, w, is the kernel. A feature map is a term used to describe the network
output. When dealing with computer data, time is normally discretized, and the sensor in the preceding example would provide data at regular intervals, such as once per millisecond. The time index t can then only take integer values. Assuming that w and x are defined only at integer t, the discrete convolution is:
s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\, w(t - a).    (3)
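As a quick numerical illustration of Eq. (3), the sketch below uses NumPy's np.convolve, which evaluates the same sum over the finite support of x and w; the signal and kernel values are arbitrary examples.

```python
# Discrete convolution of Eq. (3) over finite-support arrays; the values
# of x (readings) and w (weights) are arbitrary illustrative examples.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # sensor readings x(a)
w = np.array([0.5, 0.3, 0.2])             # weighting function w(a)

s = np.convolve(x, w, mode="full")        # s(t) = sum_a x(a) * w(t - a)
print(s)
```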
The input data for applications in machine learning is fre-
quently a multidimensional array, as illustrated in Figure 3,
and the kernel is often formulated as a multi-dimensional
array of parameters customized by the learning process. Each
input and kernel element must be expressly saved individu-
ally. As a result, except for the finite group of locations for
which the values are stored, these functions are normally
presumed to be zero. In other words, the infinite summation reduces to a sum over a finite number of array elements.
Convolutions can also be applied to many axes at the same
time. When working with a two-dimensional image I, for
example, it is often best to utilize a 2-D kernel K:
S(i, j) = (K \ast I)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n).    (4)
FIGURE 3. In this example, the input data (brown outline) was used to do a
2-D convolution with the 2×2 kernel (represented as blue outline); and the
output is illustrated by the green box.
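The loop-based sketch below evaluates a valid-mode version of Eq. (4) for a small input and a 2×2 kernel, mirroring Figure 3; note the kernel flip, which most deep-learning layers omit (they compute cross-correlation instead). The input and kernel values are illustrative.

```python
# Valid-mode 2-D convolution following Eq. (4); the explicit kernel flip
# distinguishes true convolution from the cross-correlation used by most
# deep-learning frameworks. Input and kernel values are illustrative.
import numpy as np

def conv2d_valid(I: np.ndarray, K: np.ndarray) -> np.ndarray:
    kh, kw = K.shape
    H, W = I.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    K_flipped = K[::-1, ::-1]
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K_flipped)
    return out

I = np.arange(16, dtype=float).reshape(4, 4)   # input (brown outline in Figure 3)
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # 2x2 kernel (blue outline)
print(conv2d_valid(I, K))                      # output (green box)
```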
2) Layers in a CNN
A CNN is a series of layers, each of which performs a
differentiable function to translate one volume of activations
to another. The convolutional layer (CL), fully connected
layer (FCL), and pooling layer (PL) are the three primary
types of layers utilized in CNN designs. Stacking these layers together yields a complete CNN model.
a: The Convolutional layer (CL)
This layer is the most important component of any con-
volutional network. This layer is where the majority of the
hard computational processing takes place. It computes the
output of neurons connected to local regions in the input, with
each neuron computing the dot product of their weights and a
small region in the input volume to which they are connected.
A set of learnable filters make up the convolutional layer
parameters. Each filter is modest in size, yet it covers the
entire depth of the input volume.
b: The Pooling layer (PL)
In network architectures, periodically inserting a pooling layer between successive convolutional layers is regarded as standard procedure. The major purpose of this layer
is to control overfitting by reducing the amount of parameters
and overhead computations in a network. This layer, in other
words, conducts spatial down sampling.
c: The Fully connected layer (FCL)
All neurons in an FCL are fully connected to the previous layer's activations, similar to how ordinary neural networks
work. As a result, a matrix multiplication followed by a bias
offset can be used to determine their activations.
3) CNN Architecture Overview
Unlike in ordinary neural networks, the layers of a CNN are made up of neurons arranged in three dimensions (height, width, and depth), with depth referring to the depth of the activation volume rather than the depth of the CNN itself. This is the case, for example, if the input images form a volume of activations with dimensions 32 × 32 × 3 (height, width, and depth, respectively). Instead of being fully connected to all of the neurons in the preceding layer, the neurons in a given layer are connected only to a small region of it. Furthermore, because the entire image is transformed into a vector of class scores arranged along the depth dimension, as illustrated in Figure 4, the output layer has dimensions 1 × 1 × c (where c is the number of classes).
FIGURE 4. As seen in one of the layers, a CNN arranges its neurons in three
dimensions (width, height, and depth). Every layer of a CNN converts the
three-dimensional input volume into a three-dimensional output volume of
neuron activation. The image is held by the brown input layer in this example,
thus its width and height are the image’s dimensions, and the depth is three
channels (red, green, and blue), assuming the image is an RGB image.
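A minimal PyTorch sketch of the three layer types described above is given below for a 32×32×3 input and c output classes; it is only an illustration of the CL/PL/FCL stacking, not the network used in this paper, and the channel sizes are assumptions.

```python
# Minimal CNN sketch (not the paper's model): convolutional, pooling, and
# fully connected layers stacked for a 32x32 RGB input and c classes.
import torch
import torch.nn as nn

c = 10  # number of classes (illustrative)

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer (CL)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer (PL): 32 -> 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16 -> 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, c),                    # fully connected layer (FCL)
)

x = torch.randn(1, 3, 32, 32)     # one 32x32x3 input image
print(model(x).shape)             # torch.Size([1, 10]): the 1x1xc score vector
```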
B. FEATURE EXTRACTION
The primary purpose of feature extraction is to reduce the number of dimensions in a large dataset, resulting in data that are easier to handle and can be processed with less computational effort. The feature extraction approach can select and/or combine distinct variables to form features, thereby reducing the amount of data while still accurately and completely characterizing the original dataset. In the following, the feature extraction method employed in this study is described.
To extract features from videos, GoogLeNet [38] was employed. This network counts as a 22-layer CNN when only the layers containing parameters are considered; in total, it is made up of approximately 100 layers. For each frame, the extracted feature vector is 1024-dimensional.
The GoogLeNet convolutional network [38] was employed as a feature-extraction tool by extracting activations from the input video frames. Videos are thus transformed into sequences of feature vectors, with each feature vector being the activation of the network's last pooling layer ("pool5-7x7_s1"), as shown in Figure 5.
FIGURE 5. Obtaining activations depicted as a data flow diagram.
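A hedged PyTorch/torchvision sketch of this extraction step is shown below; it obtains a 1024-dimensional activation per frame from GoogLeNet's final global pooling stage, analogous to the "pool5-7x7_s1" activations described above. The preprocessing values follow the usual ImageNet recipe and the frame file name is a placeholder; the paper's own pipeline may differ.

```python
# Hedged sketch: per-frame 1024-dimensional GoogLeNet features. Replacing the
# classifier with Identity makes the forward pass return the pooled features.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = nn.Identity()   # stop after global average pooling (1024-d)
googlenet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    frame = preprocess(Image.open("frame_0001.png").convert("RGB")).unsqueeze(0)
    feature = googlenet(frame)   # shape: (1, 1024)
print(feature.shape)
```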
We only utilized the recipes for Brownies, Scrambled Eggs
(Eggs), and Sandwiches. Figure 6 shows the distribution and
name of classes in the Brownie dataset.
FIGURE 6. From the CMU-MMAC Brownie dataset, the distribution of frames for the considered activities. The verb part of the annotated activity label was used to create the activity name.
C. RECURRENT NEURAL NETWORKS (RNN)
In recurrent neural networks (RNN), information from the past can be used to anticipate the present or the future, for example, by utilizing multiple sequential video frames to recognize, classify, or predict the entire video.
Bidirectional LSTM (BiLSTM) is a type of RNN [39] that learns in both directions and captures the interdependence between time steps in a data, image, or time-series sequence. These dependencies are exploited when the network needs to learn from the full time series at each time step, as illustrated in Figure 7. We therefore employed BiLSTM in our experiment to identify activities, which are treated as a series of images in time. In a BiLSTM, the error can be preserved as it propagates backwards through time and layers, and by maintaining a near-constant error, the recurrent network can learn over many time steps.
FIGURE 7. Architecture of BiLSTM based on [39].
The illustration in Figure 8 depicts an overview of the
strategy for identifying human activities using only visual
input. Activity videos are used as input data, and GoogLeNet
is used to extract features. The extracted features are subsequently used to train a BiLSTM network for human activity recognition.
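The following PyTorch sketch illustrates this pipeline stage: a BiLSTM that classifies a sequence of 15 GoogLeNet feature vectors into activity classes. The hidden size, number of classes, and the use of the last time step for classification are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch of a BiLSTM classifier over clips of 15 frame features.
# Hidden size, class count, and readout choice are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=1024, hidden=128, num_classes=11):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):               # x: (batch, 15, 1024)
        out, _ = self.bilstm(x)         # out: (batch, 15, 2 * hidden)
        return self.fc(out[:, -1, :])   # class scores from the final time step

model = BiLSTMClassifier()
clips = torch.randn(4, 15, 1024)        # a batch of 4 clips of frame features
print(model(clips).shape)               # torch.Size([4, 11])
```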
V. EXPERIMENTS AND DISCUSSION
To compare with state-of-the-art findings, we used the annotations of [35], which consist of semantic annotations of the CMU-MMAC dataset for three different recipes: Brownie, Sandwich, and Eggs. Activity classes from these three recipes are used to recognize human actions. The distribution and names of the classes for the Brownie recipe, for example, are illustrated in Figure 6. We used
solely visual data in this experiment, with a fixed 80%-20%
split for training and testing, respectively. The assumption is made that any 15 frames of an activity may be utilized to recognize that activity. We therefore divided every video of a human activity into clips of 15 frames, which enlarged the training data and allowed us to test our hypothesis that an activity can be recognized from only a portion of it.
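A hedged sketch of this data preparation is given below: each activity video, represented as a sequence of per-frame feature vectors, is cut into non-overlapping 15-frame clips that inherit the activity label, and the clips are split 80%/20% into training and test sets. The toy data, the non-overlapping windowing, and the stratified split are assumptions, not the authors' exact protocol.

```python
# Hedged sketch: cut feature sequences into 15-frame clips and split 80/20.
# The synthetic videos/labels and splitting details are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

CLIP_LEN = 15

def make_clips(videos, labels):
    """videos: list of (n_frames, 1024) arrays; labels: one label per video."""
    clips, clip_labels = [], []
    for frames, label in zip(videos, labels):
        for start in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
            clips.append(frames[start:start + CLIP_LEN])
            clip_labels.append(label)
    return np.stack(clips), np.array(clip_labels)

videos = [np.random.randn(45, 1024),   # toy stand-ins for GoogLeNet features
          np.random.randn(60, 1024),
          np.random.randn(30, 1024)]
labels = ["stir", "open", "stir"]

X, y = make_clips(videos, labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)     # e.g. (7, 15, 1024) (2, 15, 1024)
```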
To extract features from videos, GoogLeNet [38] was utilized. It is a convolutional neural network with 22 layers (counting only the layers with parameters). The feature vector generated for every frame was 1024-dimensional.
The features generated in this experiment were fed into a recurrent neural network (RNN). Specifically, bidirectional LSTM (BiLSTM), an RNN extension [39], was employed. The dependencies between time steps are exploited when the network is required to learn from the entire time series at every time step, as illustrated in Figure 7.
FIGURE 8. An overview of the broad classification approach for human
activities. Activity videos are utilized as input data, and GoogLeNet is used to
extract features, resulting in a feature set that is used to train a BiLSTM model.
This is why, in our experiment, we employed BiLSTM to identify activities, which are treated as a succession of images in time. In a BiLSTM, the error can be retained while it propagates over time and layers, allowing the recurrent network to learn across many time steps by maintaining a near-constant error.
The results for recognizing human activities on the three
datasets are summarized in Table 1, where we also compare
our results for the Brownie dataset to those of [34], which
is regarded as the state-of-the-art method. Table 1 shows the averages of precision, recall, and F1 score, and shows that our method surpassed the state-of-the-art work by more than 4% in F1 score on the Brownie dataset.
TABLE 1. Comparison of our classification method to the state-of-the-art [34].
The precision, recall, and F1 score averages are provided in this table.
Dataset Method Precision Recall F1 Score
Brownie Ours 0.701 0.717 0.707
Brownie [34] 0.831 0.604 0.664
Sandwich Ours 0.732 0.697 0.702
Eggs Ours 0.737 0.729 0.730
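For reference, averaged scores such as those in Table 1 can be computed from per-clip predictions by macro-averaging over the activity classes, as in the scikit-learn sketch below; the dummy labels are placeholders, and the exact averaging scheme used in the paper is our assumption.

```python
# Hedged sketch: macro-averaged precision, recall, and F1 over activity
# classes with scikit-learn. The label lists are dummy placeholders.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["stir", "open", "close", "stir", "walk", "open"]
y_pred = ["stir", "open", "close", "open", "walk", "open"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```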
Each human activity's F1 score is displayed separately in Table 2. For the following classes, [34] produced better results than the proposed method: close (3%), put (3%), other (1%), take (3%), and Turn_on (1%).
TABLE 2. Comparison based on F1 score of our classification approach with
state-of-the-art [34].
Class Ours [34]
Close 0.555 0.589
Clean 0.634 0.571
Open 0.731 0.707
Fill 0.946 0.844
Put 0.567 0.595
Other 0.702 0.719
Stir 0.980 0.939
Shake 0.571 0.426
Take 0.571 0.607
Walk 0.627 0.402
Turn_on 0.892 0.903
Our strategy, on the other hand, surpassed their findings by an average of 10% in the remaining six classes: the difference is 6% in the clean class, 10% in fill, 3% in open, 15% in shake, 5% in stir, and 22% in walk.
Even though [34] had marginally superior results in the close, put, other, Turn_on, and take classes, our results exceeded theirs by a large margin in the shake, fill, and walk classes, despite the fact that they employed multi-modal (visual and IMU) classification whereas we used only visual data.
TABLE 3. Comparison between the proposed method and the state-of-the-art
[34] based on Precision results.
Class Ours [34]
Close 0.616 0.764
Clean 0.591 0.947
Open 0.719 0.720
Fill 0.944 0.748
Put 0.530 0.665
Other 0.700 0.866
Stir 0.973 0.904
Shake 0.526 1.000
Take 0.633 0.620
Walk 0.642 0.935
Turn_on 0.839 0.976
Table 3 and Table 4 show precision and recall results, respectively, compared to [34] for each activity separately. Despite the fact that we employed only visual data in this experiment, human activity recognition exceeded the state-of-the-art work [34] by more than 4% in F1 score. Furthermore, in order to test the generalization of our technique, we evaluated it on two additional CMU-MMAC recipes: Scrambled eggs (Eggs) and Sandwich.
Table 5 shows the activity recognition results for the Eggs
dataset. Precision, F1 score, and recall are all reported in
this table. In addition, the averaged F1 score of the proposed model surpassed the state-of-the-art findings of [34] by almost 6%.
Despite the Sandwich dataset having three fewer activity classes, the averaged F1 score surpassed the state-of-the-art
TABLE 4. Comparison between the proposed method and the state-of-the-art
[34] based on Recall.
Class Ours [34]
Close 0.505 0.479
Clean 0.684 0.409
Open 0.744 0.696
Fill 0.949 0.967
Put 0.610 0.537
Other 0.704 0.615
Stir 0.987 0.977
Shake 0.625 0.270
Take 0.520 0.595
Walk 0.612 0.256
Turn_on 0.952 0.840
TABLE 5. Activity recognition results based on F1 score, Recall, and
Precision using Eggs recipe.
Class Precision Recall F1 score
Close 0.470 0.572 0.516
Clean 0.667 0.545 0.600
Open 0.654 0.671 0.662
Fill 0.922 0.926 0.924
Put 0.627 0.485 0.547
Other 0.879 0.781 0.827
Stir 0.959 0.980 0.970
Shake 0.776 0.843 0.808
Take 0.551 0.685 0.611
Walk 0.746 0.716 0.731
Turn_on 0.860 0.811 0.835
Average 0.737 0.729 0.730
results of [34] by 4%. Table 6 shows the detailed findings for
F1 score, precision, and recall.
TABLE 6. Activity recognition results based on F1 score, Recall, and
Precision using Sandwich recipe.
Class Precision Recall F1 score
Close 0.833 0.613 0.707
Clean 0.857 1.000 0.923
Open 0.869 0.628 0.729
Fill 0.935 0.980 0.957
Put 0.470 0.362 0.409
Other 0.662 0.461 0.543
Stir 0.722 0.722 0.722
Shake 0.511 0.805 0.625
Average 0.732 0.697 0.702
VI. CONCLUSION
The recognition process was accelerated when a single sensor, a first-person camera, was employed on all the annotated activity classes based on [35]. In order to benefit from deep
learning’s speed and accuracy, a deep learning strategy was
utilized for the extraction of features and for better classifi-
cation. For the case of Brownie recipe, the results surpassed
the state-of-the-art work performed by [34]. Additionally, the
model was also successfully evaluated on the Sandwich and
Eggs recipes, producing strong results. Furthermore, because the classified videos are only about 15 frames long, equivalent to 0.5 seconds, the approach allows for near real-time activity categorization.
ACKNOWLEDGMENT
The authors of this study would like to express their grati-
tude to Qatar National Library, QNL, for their support and
assistance in publishing this work.
REFERENCES
[1] A. Zahin, R. Q. Hu et al., “Sensor-based human activity recognition for
smart healthcare: A semi-supervised machine learning,” in International
Conference on Artificial Intelligence for Communications and Networks.
Springer, 2019, pp. 450–472.
[2] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li, “A review
on human activity recognition using vision-based method,” Journal of
healthcare engineering, vol. 2017, 2017.
[3] C. Jobanputra, J. Bavishi, and N. Doshi, “Human activity recognition: A
survey,” Procedia Computer Science, vol. 155, pp. 698–703, 2019.
[4] L. Chen, J. Hoey, C. D. Nugent, D. J. Cook, and Z. Yu, “Sensor-based ac-
tivity recognition,” IEEE Transactions on Systems, Man, and Cybernetics,
Part C (Applications and Reviews), vol. 42, no. 6, pp. 790–808, 2012.
[5] T.-H.-C. Nguyen, J.-C. Nebel, F. Florez-Revuelta et al., “Recognition of
activities of daily living with egocentric vision: A review,” Sensors, vol. 16,
no. 1, p. 72, 2016.
[6] C. Chen, R. Jafari, and N. Kehtarnavaz, “A survey of depth and inertial
sensor fusion for human action recognition,” Multimedia Tools and Appli-
cations, vol. 76, no. 3, pp. 4405–4425, 2017.
[7] T. Alhersh and H. Stuckenschmidt, “On the combination of imu and optical
flow for action recognition,” in 2019 IEEE International Conference on
Pervasive Computing and Communications Workshops (PerCom Work-
shops). IEEE, 2019.
[8] G. Abebe and A. Cavallaro, “Inertial-vision: cross-domain knowledge
transfer for wearable sensors,” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, pp. 1392–1400.
[9] T. Alhersh, S. B. Belhaouari, and H. Stuckenschmidt, “Action recognition
using local visual descriptors and inertial data,” in European Conference
on Ambient Intelligence. Springer, 2019, pp. 123–138.
[10] T. Alhersh, “From motion to human activity recognition,” 2021.
[11] C. Xu, D. Chai, J. He, X. Zhang, and S. Duan, “Innohar: A deep neural
network for complex human activity recognition,” IEEE Access, vol. 7,
pp. 9893–9902, 2019.
[12] A. Bevilacqua, K. MacDonald, A. Rangarej, V. Widjaya, B. Caulfield,
and T. Kechadi, “Human activity recognition with convolutional neu-
ral networks,” in Joint European Conference on Machine Learning and
Knowledge Discovery in Databases. Springer, 2018, pp. 541–552.
[13] N. Jalloul, F. Porée, G. Viardot, P. L’Hostis, and G. Carrault, “Activity
recognition using complex network analysis,” IEEE journal of biomedical
and health informatics, vol. 22, no. 4, pp. 989–1000, 2018.
[14] S. Ashry, R. Elbasiony, and W. Gomaa, “An lstm-based descriptor for
human activities recognition using imu sensors,” in Proceedings of the
15th International Conference on Informatics in Control, Automation and
Robotics, ICINCO, vol. 1, 2018, pp. 494–501.
[15] J. Sun, Y. Fu, S. Li, J. He, C. Xu, and L. Tan, “Sequential human activity
recognition based on deep convolutional network and extreme learning
machine using wearable sensors,” Journal of Sensors, vol. 2018, 2018.
[16] F. Moya Rueda, R. Grzeszick, G. Fink, S. Feldhorst, and M. ten Hompel,
“Convolutional neural networks for human activity recognition using
body-worn sensors,” in Informatics, vol. 5, no. 2. Multidisciplinary
Digital Publishing Institute, 2018, p. 26.
[17] F. Attal, S. Mohammed, M. Dedabrishvili, F. Chamroukhi, L. Oukhellou,
and Y. Amirat, “Physical human activity recognition using wearable sen-
sors,” Sensors, vol. 15, no. 12, pp. 31314–31338, 2015.
[18] J. C. Davila, A.-M. Cretu, and M. Zaremba, “Wearable sensor data
classification for human activity recognition based on an iterative learning
framework,” Sensors, vol. 17, no. 6, p. 1287, 2017.
[19] E. P. Ijjina and K. M. Chalavadi, “Human action recognition in rgb-d
videos using motion sequence information and deep learning,” Pattern
Recognition, vol. 72, pp. 504–516, 2017.
[20] H. Coskun, D. J. Tan, S. Conjeti, N. Navab, and F. Tombari, “Human mo-
tion analysis with deep metric learning,” arXiv preprint arXiv:1807.11176,
2018.
[21] S. Sudhakaran, S. Escalera, and O. Lanz, “Hierarchical feature aggregation
networks for video action recognition,” arXiv preprint arXiv:1905.12462,
2019.
[22] X. Wang, Y. Wu, L. Zhu, and Y. Yang, “Baidu-uts submission to
the epic-kitchens action recognition challenge 2019,” arXiv preprint
arXiv:1906.09383, 2019.
[23] S. Sudhakaran, S. Escalera, and O. Lanz, “Lsta: Long short-term attention
for egocentric action recognition,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 9954–9963.
[24] L. Sevilla-Lara, Y. Liao, F. Guney, V. Jampani, A. Geiger, and M. J. Black,
“On the integration of optical flow and action recognition,” arXiv preprint
arXiv:1712.08416, 2017.
[25] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, “Optical flow
guided feature: A fast and robust motion representation for video action
recognition,” in CVPR, 2018.
[26] S. Akpinar and F. N. Alpaslan, “Video action recognition using an optical
flow based representation,” in IPCV. The Steering Committee of The
World Congress in Computer Science, Computer Engineering and Applied
Computing (WorldComp), 2014, p. 1.
[27] S. S. Kumar and M. John, “Human activity recognition using optical
flow based feature set,” in Security Technology (ICCST), 2016 IEEE
International Carnahan Conference on. IEEE, 2016, pp. 1–5.
[28] M. Wrzalik and D. Krechel, “Human action recognition using optical flow
and convolutional neural networks,” in Machine Learning and Applica-
tions (ICMLA), 2017 16th IEEE International Conference on. IEEE,
2017, pp. 801–805.
[29] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin, “Action recognition by
dense trajectories,” in CVPR 2011-IEEE Conference on Computer Vision
& Pattern Recognition. IEEE, 2011, pp. 3169–3176.
[30] S. Singh, C. Arora, and C. Jawahar, “First person action recognition using
deep learned descriptors,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 2620–2628.
[31] Y. Lu and S. Velipasalar, “Human activity classification incorporating
egocentric video and inertial measurement unit data,” in 2018 IEEE Global
Conference on Signal and Information Processing (GlobalSIP). IEEE,
2018, pp. 429–433.
[32] F. De la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado,
and P. Beltran, “Guide to the carnegie mellon university multimodal
activity (cmu-mmac) database,” Robotics Institute, p. 135, 2008.
[33] M. A. Arabacı, F. Özkan, E. Surer, P. Jančovič, and A. Temizel, “Multi-
modal egocentric activity recognition using audio-visual features,” arXiv
preprint arXiv:1807.00612, 2018.
[34] A. Diete and H. Stuckenschmidt, “Fusing object information and inertial
data for activity recognition,” Sensors, vol. 19, no. 19, p. 4119, 2019.
[35] K. Yordanova and F. Krüger, “Creating and exploring semantic annotation
for behaviour analysis,” Sensors, vol. 18, no. 9, p. 2778, 2018.
[36] Y. LeCun et al., “Generalization and network design strategies,” Connec-
tionism in perspective, vol. 19, pp. 143–155, 1989.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016, http://www.deeplearningbook.org.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1–9.
[39] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”
IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681,
1997.
TAHA ALHERSH received the PhD degree from the Chair of Artificial Intelligence at the University of Mannheim, the MSc degree in Intelligent Systems from Universiti Utara Malaysia in 2010, and the BSc degree in Computer Science from the University of Jordan. His research interests include machine learning and vision.
HEINER STUCKENSCHMIDT received the
Ph.D. degree from the Department of Artificial In-
telligence, Vrije Universiteit Amsterdam, in 2003.
He is a Full Professor of artificial intelligence
with the University of Mannheim. His group is
performing fundamental and applied research in
knowledge representation formalisms with a focus
on reasoning techniques for information extrac-
tion and integration as well as machine learning
for advanced decision making. Before moving to
Mannheim, he was a PostDoctoral Researcher at the Department of Artificial
Intelligence, Vrije Universiteit Amsterdam. He has published more than 200
articles including about 50 articles in international journals and more than
100 papers in computer science at peer-reviewed international conferences.
ATIQ UR REHMAN received the master’s de-
gree in computer engineering from the National
University of Sciences and Technology (NUST),
Pakistan, in 2013, and the Ph.D. degree in com-
puter science and engineering from Hamad Bin
Khalifa University, Qatar, in 2019. He is currently
working as a Postdoc Researcher with the College
of Science and Engineering, Hamad Bin Khal-
ifa University. His research interests include the
development of evolutionary computation, pattern
recognition, and machine learning algorithms.
SAMIR BRAHIM BELHAOUARI (SM’19)
received the master’s degree in telecommunications and networks from the Institut National Polytechnique de Toulouse, France, in 2000, and the Ph.D. degree in mathematics from the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, in
2006. He is currently an Associate Professor with the Division of Information and Communication Technologies, College of Science and Engineering, Hamad Bin Khalifa University (HBKU), Qatar Foundation. In recent years, he has also held several research and teaching positions at Innopolis University, Russia; Alfaisal University, KSA; the University of Sharjah, UAE; Universiti Teknologi PETRONAS, Malaysia; and EPFL, Switzerland. His research interests range from applied mathematics, statistics, and data analysis to artificial intelligence and image and signal processing, owing to his background in both mathematics and computer science.
... Data-driven automatic visualization techniques need to extract visual features from visualization graphics. This task lacks a clear operational standard, and the outcomes directly impact the generalization performance of machine learning models [36]. ...
Article
Full-text available
Automatic visual encoding is frequently employed in automatic visualization tools to automatically map data to visual elements. This paper proposed an automatic visual encoding approach based on deep learning. This approach constructs visual encoding dataset in a more comprehensive and reliable manner to extract and label widely available visualization graphics on the Internet in accordance with three essentials of visualization. The deep learning model is then trained to create a visual encoding model with powerful generalization performance, enabling automated effective visual encoding recommendations for visual designers. The results demonstrated that our approach extends the automatic visual encoding techniques used by existing visualization tools, enhances the functionality and performance of visualization tools, uncovers previously undiscovered data, and increases the coverage of data variables.
... [64,65] Hybrid Approaches These are models based on integration of CNN and RNN architectures. [66][67][68] Among the initially developed hand-crafted representations, improved Dense Trajectories (iDT) [27] is widely considered the state-of-the-art. Whereas, many recent competitive studies demonstrated that hand-crafted features [35][36][37][38], high-level [39,40], and mid-level [41,42] video representations have contributed towards the task of video classification with deep neural networks. ...
Article
Full-text available
The video classification task has gained significant success in the recent years. Specifically, the topic has gained more attention after the emergence of deep learning models as a successful tool for automatically classifying videos. In recognition of the importance of the video classification task and to summarize the success of deep learning models for this task, this paper presents a very comprehensive and concise review on the topic. There are several existing reviews and survey papers related to video classification in the scientific literature. However, the existing review papers do not include the recent state-of-art works, and they also have some limitations. To provide an updated and concise review, this paper highlights the key findings based on the existing deep learning models. The key findings are also discussed in a way to provide future research directions. This review mainly focuses on the type of network architecture used, the evaluation criteria to measure the success, and the datasets used. To make the review self-contained, the emergence of deep learning methods towards automatic video classification and the state-of-art deep learning methods are well explained and summarized. Moreover, a clear insight of the newly developed deep learning architectures and the traditional approaches is provided. The critical challenges based on the benchmarks are highlighted for evaluating the technical progress of these methods. The paper also summarizes the benchmark datasets and the performance evaluation matrices for video classification. Based on the compact, complete, and concise review, the paper proposes new research directions to solve the challenging video classification problem.
... There have been several reviews published in the area of human activity recognition in vision [25]- [31], sensor [32]- [37], machine learning [38]- [43], and deep learningbased methodologies [44]- [51]. Nevertheless, there needs to be a survey that focuses specifically on yogic posture recognition. ...
Article
Full-text available
Yoga has been a great form of physical activity and one of the promising applications in personal health care. Several studies prove that yoga is used as one of the physical treatments for cancer, musculoskeletal disorder, depression, Parkinson’s disease, and respiratory heart diseases. In yoga, the body should be mechanically aligned with some effort on the muscles, ligaments, and joints for optimal posture. Postural-based yoga increases flexibility, energy, overall brain activity and reduces stress, blood pressure, and back pain. Body Postural Alignment is a very important aspect while performing yogic asanas. Many yogic asanas including uttanasana, kurmasana, ustrasana, and dhanurasana, require bending forward or backward, and if the asanas are performed incorrectly, strain in the joints, ligaments, and backbone can result, which can cause problems with the hip joints. Hence it is vital to monitor the correct yoga poses while performing different asanas. Yoga posture prediction and automatic movement analysis are now possible because of advancements in computer vision algorithms and sensors. This research investigates a thorough analysis of yoga posture identification systems using computer vision, machine learning, and deep learning techniques.
... The recent studies on human activity recognition that use deep learning technology are given in Table 1. A variety of DL methods have been successfully used for HAR, such as recurrent neural networks (RNN) [24,26,28,29], long short-term memory (LSTM) [22,23,30], autoencoder (AE) [4,20], deep neural network (DNN) [1,9,13], and convolutional neural network (CNN) [31,32]. [12,35,[37][38][39][40]42,44,45], naive Bayes [37,45,46], logistic regression [33,34,39,48,49], k-nearest neighbors [35][36][37]42,45], AdaBoost [47], and random forest [12,[35][36][37][38][39]43,50]. ...
Article
Full-text available
Traditional indoor human activity recognition (HAR) has been defined as a time-series data classification problem and requires feature extraction. The current indoor HAR systems still lack transparent, interpretable, and explainable approaches that can generate human-understandable information. This paper proposes a new approach, called Human Activity Recognition on Signal Images (HARSI), which defines the HAR problem as an image classification problem to improve both explainability and recognition accuracy. The proposed HARSI method collects sensor data from the Internet of Things (IoT) environment and transforms the raw signal data into some visual understandable images to take advantage of the strengths of convolutional neural networks (CNNs) in handling image data. This study focuses on the recognition of symmetric human activities, including walking, jogging, moving downstairs, moving upstairs, standing, and sitting. The experimental results carried out on a real-world dataset showed that a significant improvement (13.72%) was achieved by the proposed HARSI model compared to the traditional machine learning models. The results also showed that our method (98%) outperformed the state-of-the-art methods (90.94%) in terms of classification accuracy.
... [56], [57] Hybrid Approaches These are models based on integration of CNN and RNN architectures. [58], [59], [60] Different deep learning architectures described above employ different fusion strategies. These fusion strategies are either for the fusion of different features extracted from the video or for the fusion of different models used in the architecture. ...
Preprint
Full-text available
div> Video classification task has gained a significant success in the recent years. Specifically, the topic has gained more attention after the emergence of deep learning models as a successful tool for automatically classifying videos. In recognition to the importance of video classification task and to summarize the success of deep learning models for this task, this paper presents a very comprehensive and concise review on the topic. There are a number of existing reviews and survey papers related to video classification in the scientific literature. However, the existing review papers are either outdated, and therefore, do not include the recent state-of-art works or they have some limitations. In order to provide an updated and concise review, this paper highlights the key findings based on the existing deep learning models. The key findings are also discussed in a way to provide future research directions. This review mainly focuses on the type of network architecture used, the evaluation criteria to measure the success, and the data sets used. To make the review self- contained, the emergence of deep learning methods towards automatic video classification and the state-of-art deep learning methods are well explained and summarized. Moreover, a clear insight of the newly developed deep learning architectures and the traditional approaches is provided, and the critical challenges based on the benchmarks are highlighted for evaluating the technical progress of these methods. The paper also summarizes the benchmark datasets and the performance evaluation matrices for video classification. Based on the compact, complete, and concise review, the paper proposes new research directions to solve the challenging video classification problem. </div
Article
Blast furnace (BF) ironmaking requires flexible and precise burden distribution. As a critical distributing unit, the angle diagnosis of the BF chute is essential for operation under severe conditions. However, existing methods must be performed during BF overhauls, which cannot satisfy the requirements for timely and convenient diagnosis. Inspired by cutting-edge computer vision technologies, a novel angle diagnosis method for the BF chute based on deep learning of temporal images is proposed, which uses grayscale consistency transformation and data augmentation for pre-processing, and a Res-LSTM-based deep neural network model for angle diagnosis. Experiments were conducted using chute motion videos collected from the BF top imaging system, and the results showed that the method can extract spatial-temporal features and accurately identify chute angles. The BF chute angle diagnosis system developed on the basis of this method has been running successfully on the No. 7 BF (1,750 m³ in volume) of Tranvic Steel Co., Ltd in China for half a year. The system can accurately diagnose the chute angle in real time, with a diagnosis rate of 100%.
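A minimal sketch of the Res-LSTM pattern described above: a small residual CNN encodes each grayscale frame and an LSTM aggregates the sequence before a head predicts an angle class. Treating the angle as a set of discrete bins, as well as all layer sizes, is an assumption made here for illustration, not the published model.

```python
# Minimal Res-LSTM-style sketch for temporal grayscale images (all sizes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Skip connection around two convolutions.
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class ResLSTMAngleClassifier(nn.Module):
    def __init__(self, num_angle_bins: int = 12, hidden: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(1, 16, 5, stride=2, padding=2)
        self.res = ResidualBlock(16)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_angle_bins)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, H, W) grayscale chute images
        b, t = frames.shape[:2]
        x = self.pool(self.res(F.relu(self.stem(frames.flatten(0, 1)))))
        _, (h_n, _) = self.lstm(x.flatten(1).view(b, t, -1))
        return self.head(h_n[-1])

logits = ResLSTMAngleClassifier()(torch.randn(2, 10, 1, 96, 96))  # (2, 12)
```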
Chapter
Different body sensors and modalities can be used for human action recognition, either separately or simultaneously, and multi-modal data can be exploited for this task. In this work we use inertial measurement units (IMUs) positioned on the left and right hands together with first-person vision for human action recognition. A novel statistical feature extraction method is proposed, based on the curvature of the graph of a function and on tracking the left and right hand positions in space. Local visual descriptors are used as features for egocentric vision, and an intermediate fusion between the IMU and visual sensors is performed. Despite using only two IMU sensors with egocentric vision, the classification accuracy achieved is 99.61% for recognizing nine different actions. The feature extraction step can play a vital role in human action recognition with a limited number of sensors; hence, our method might indeed be promising.
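One way to realize curvature-based statistics over a tracked hand trajectory, in the spirit of the feature extraction described above, is sketched below with NumPy. The finite-difference curvature formula and the chosen summary statistics are illustrative assumptions, not the exact feature set of that work.

```python
# Curvature-based window features for a tracked hand trajectory (assumed formulation).
import numpy as np

def trajectory_curvature(points: np.ndarray) -> np.ndarray:
    """points: (T, 3) hand positions.
    Returns per-sample curvature kappa = |r' x r''| / |r'|^3 via finite differences."""
    d1 = np.gradient(points, axis=0)          # first derivative r'
    d2 = np.gradient(d1, axis=0)              # second derivative r''
    num = np.linalg.norm(np.cross(d1, d2), axis=1)
    den = np.linalg.norm(d1, axis=1) ** 3 + 1e-8
    return num / den

def curvature_features(points: np.ndarray) -> np.ndarray:
    """Summary statistics of the curvature signal for one window."""
    k = trajectory_curvature(points)
    return np.array([k.mean(), k.std(), k.max(), np.median(k)])

# Usage: features for left and right hands, concatenated for a downstream classifier.
left = np.cumsum(np.random.randn(128, 3) * 0.01, axis=0)
right = np.cumsum(np.random.randn(128, 3) * 0.01, axis=0)
window_features = np.concatenate([curvature_features(left),
                                  curvature_features(right)])  # shape (8,)
```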
Article
In the field of pervasive computing, wearable devices have been widely used for recognizing human activities. One important area in this research is the recognition of activities of daily living, where inertial sensors and interaction sensors (such as RFID tags with scanners) are especially popular choices as data sources. Using interaction sensors, however, has one drawback: they may not differentiate between proper interaction and the simple touching of an object. A positive signal from an interaction sensor is not necessarily caused by a performed activity, e.g., when an object is only touched and no interaction occurs afterwards. There are, however, many scenarios, such as medicine intake, that rely heavily on correctly recognized activities. In our work, we aim to address this limitation and present a multimodal egocentric-based activity recognition approach. Our solution relies on object detection that recognizes activity-critical objects in a frame. As it is infeasible to always expect a high-quality camera view, we enrich the vision features with inertial sensor data that monitors the users' arm movement. In this way we try to overcome the drawbacks of each respective sensor. We present results of combining inertial and video features to recognize human activities in different types of scenarios, achieving an F1-measure of up to 79.6%.
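The fusion idea of enriching object-detection evidence with inertial data can be sketched as window-level feature pooling followed by a standard classifier. The object classes, the statistics, and the random-forest classifier below are illustrative assumptions, not the published pipeline.

```python
# Assumed sketch: pool per-frame object-detection scores over a window, concatenate
# with simple inertial statistics, and train a conventional classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuse_window(det_scores: np.ndarray, imu: np.ndarray) -> np.ndarray:
    """det_scores: (frames, num_object_classes) detector confidences for the window.
    imu: (samples, axes) accelerometer/gyroscope readings of the same window."""
    vision_part = det_scores.mean(axis=0)                          # pooled object evidence
    inertial_part = np.concatenate([imu.mean(axis=0), imu.std(axis=0)])
    return np.concatenate([vision_part, inertial_part])

# Synthetic data: 200 windows, 5 object classes, 6 IMU axes, 4 activity labels.
rng = np.random.default_rng(0)
X = np.stack([fuse_window(rng.random((30, 5)), rng.standard_normal((100, 6)))
              for _ in range(200)])
y = rng.integers(0, 4, size=200)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))
```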
Article
Human Activity Recognition (HAR) remains a challenging problem that has yet to be fully solved. It is mainly used in eldercare and healthcare as an assistive technology, especially when combined with other technologies such as the Internet of Things (IoT). HAR can be performed with the help of sensors, smartphones, or images. In this paper, we present various state-of-the-art methods and describe each of them through a literature survey. Different datasets are used for each of the methods, with data collected by different means such as sensors, images, accelerometers, and gyroscopes, and with these devices placed at various locations. The results obtained by each technique and the type of dataset are then compared. Machine learning techniques such as decision trees, k-nearest neighbours, support vector machines, and hidden Markov models are reviewed for HAR, followed by a survey of deep neural network techniques such as artificial neural networks, convolutional neural networks, and recurrent neural networks.
Chapter
Human action recognition is an integral part of smart health monitoring, where the intelligence behind the services is obtained and improved through sensor information. It poses tremendous challenges due to the huge diversity of human actions and the large variation in how a particular action can be performed. This problem has intensified with the emergence of the Internet of Things (IoT), which has resulted in larger datasets acquired by a massive number of sensors. Big-data-based machine learning is the best candidate to deal with this grand challenge. However, one of the biggest challenges in using large datasets in machine learning is labeling sufficient data to train a model accurately. Instead of using expensive supervised learning, we propose a semi-supervised classifier for time-series data. The proposed framework is a joint design of a variational auto-encoder (VAE) and a convolutional neural network (CNN). In particular, the VAE extracts the salient characteristics of human activity data and provides useful criteria for the compressed-sensing reconstruction, while the CNN extracts discriminative features and produces low-dimensional latent codes. Given a combination of labeled and raw time-series data, our architecture utilizes compressed samples from the latent vector in a deconvolutional decoder to reconstruct the input time-series. We intend to train the classifier to detect human actions for smart health systems.
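A minimal sketch of the joint VAE-plus-CNN idea: a 1-D convolutional encoder yields a latent code that is used both to reconstruct the time-series (on all windows) and to classify the activity (only where labels exist). All layer sizes, the loss weighting, and the class count are illustrative assumptions rather than the published architecture.

```python
# Assumed sketch of a semi-supervised VAE classifier for sensor time-series.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSupervisedVAE(nn.Module):
    def __init__(self, channels=3, length=64, latent=16, num_classes=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(channels, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, 5, stride=2, padding=2), nn.ReLU(), nn.Flatten())
        enc_dim = 64 * (length // 4)
        self.mu = nn.Linear(enc_dim, latent)
        self.logvar = nn.Linear(enc_dim, latent)
        self.dec_fc = nn.Linear(latent, enc_dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, channels, 4, stride=2, padding=1))
        self.classifier = nn.Linear(latent, num_classes)
        self.length = length

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.dec(self.dec_fc(z).view(-1, 64, self.length // 4))
        return recon, self.classifier(mu), mu, logvar

def semi_supervised_loss(model, x, labels=None):
    recon, logits, mu, logvar = model(x)
    # Unsupervised terms: reconstruction + KL divergence on every window.
    loss = F.mse_loss(recon, x) \
        - 0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    if labels is not None:                       # supervised term on labeled windows only
        loss = loss + F.cross_entropy(logits, labels)
    return loss

model = SemiSupervisedVAE()
loss = semi_supervised_loss(model, torch.randn(8, 3, 64), torch.randint(0, 6, (8,)))
```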
Article
Human Activity Recognition (HAR) based on sensor networks is an important research direction in the fields of pervasive computing and body area networks. Existing research often uses statistical machine learning methods to manually extract and construct features for different motions. However, in the face of extremely fast-growing waveform data with no obvious regularities, traditional feature engineering methods are increasingly inadequate. With the development of deep learning, features no longer need to be extracted manually, and performance on complex human activity recognition problems can be improved. By transferring deep neural network experience from image recognition, we propose a deep learning model (InnoHAR) based on the combination of an Inception neural network and a recurrent neural network. The model takes the waveform data of multi-channel sensors as end-to-end input. Multi-dimensional features are extracted by Inception-like modules using convolution layers with various kernel sizes. Combined with a GRU, time-series features are modeled, making full use of the data characteristics to complete the classification task. Through experimental verification on three of the most widely used public HAR datasets, the proposed method shows consistently superior performance and good generalization when compared with the state of the art.
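The Inception-plus-GRU pattern described for InnoHAR can be sketched as parallel 1-D convolutions with different kernel sizes whose outputs are concatenated and then modeled over time by a GRU. Branch widths, kernel sizes, and the class count below are illustrative assumptions, not the published configuration.

```python
# Assumed sketch of an Inception-style 1-D front end followed by a GRU for HAR.
import torch
import torch.nn as nn

class Inception1D(nn.Module):
    def __init__(self, in_ch, branch_ch=16):
        super().__init__()
        self.b1 = nn.Conv1d(in_ch, branch_ch, 1)
        self.b3 = nn.Conv1d(in_ch, branch_ch, 3, padding=1)
        self.b5 = nn.Conv1d(in_ch, branch_ch, 5, padding=2)

    def forward(self, x):                      # x: (batch, in_ch, T)
        # Parallel branches with different receptive fields, concatenated on channels.
        return torch.relu(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))

class InceptionGRUHAR(nn.Module):
    def __init__(self, sensor_channels=9, num_classes=6, hidden=64):
        super().__init__()
        self.inc1 = Inception1D(sensor_channels)
        self.inc2 = Inception1D(48)            # 3 branches x 16 channels = 48
        self.gru = nn.GRU(48, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: (batch, sensor_channels, T)
        feats = self.inc2(self.inc1(x)).transpose(1, 2)   # (batch, T, 48)
        _, h_n = self.gru(feats)
        return self.head(h_n[-1])

logits = InceptionGRUHAR()(torch.randn(4, 9, 128))        # (4, 6)
```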
Chapter
Effectively measuring the similarity between two human motions is necessary for several computer vision tasks, such as gait analysis, person identification, and action retrieval. Nevertheless, we believe that traditional approaches such as the L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable to be employed as metrics within these tasks. This work addresses this limitation by means of triplet-based deep metric learning specifically tailored to human motion data, in particular to the problems of varying input size and of computationally expensive hard-negative mining due to motion pair alignment. Specifically, we propose (1) a novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy, as well as (2) a novel deep architecture based on attentive recurrent neural networks. One benefit of our objective function is that it enforces a better separation of the different motion categories within the learned embedding space by means of the associated distribution moments. At the same time, the attentive recurrent neural network maps inputs of varying size to a fixed-size embedding while learning to focus on those motion parts that are semantically distinctive. Our experiments on two different datasets demonstrate significant improvements over conventional human motion metrics.
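An attention-pooled recurrent encoder that maps variable-length motion sequences to a fixed-size embedding can be sketched as follows. Note that the work above uses a Maximum Mean Discrepancy based objective; the plain margin-based triplet loss here is a simplified stand-in, and the pose dimensionality and layer sizes are assumptions.

```python
# Assumed sketch: attentive GRU motion encoder trained with a (simplified) triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveMotionEncoder(nn.Module):
    def __init__(self, pose_dim=51, hidden=128, embed=64):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # scores each time step
        self.proj = nn.Linear(hidden, embed)

    def forward(self, motion):                 # motion: (batch, T, pose_dim), T may vary per batch
        h, _ = self.gru(motion)                # (batch, T, hidden)
        w = torch.softmax(self.attn(h), dim=1) # attention weights over time
        pooled = (w * h).sum(dim=1)            # fixed-size summary regardless of T
        return F.normalize(self.proj(pooled), dim=1)

encoder = AttentiveMotionEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
anchor = encoder(torch.randn(8, 90, 51))       # 90-frame clips
positive = encoder(torch.randn(8, 90, 51))
negative = encoder(torch.randn(8, 120, 51))    # a different sequence length is fine
loss = triplet(anchor, positive, negative)
```

The attention pooling is what allows sequences of different lengths to share one embedding space, which is the property the work above exploits for motion similarity.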